[persist] Shift all heartbeat / frontier management for the read handle to a background task #34590

bkirwi · 2025-12-19T22:21:58Z

Originally, Persist was built with the assumption that callers would reliably call the ReadHandle methods (like maybe_downgrade_since) with some reasonable frequency - say, at least every few minutes. This assumption turns out not to be true for a variety of reasons... and hasn't even been true for Persist's own internal use of ReadHandle! (For example, calls to snapshot can block indefinitely, so you'd need to add timeouts to those calls and call them in a loop with a downgrade operation, like we do in this recent PR. Not a nice pattern!) We shifted heartbeating to a background thread, which helped quite a bit with lease timeouts, but didn't entirely solve the issue - failing to downgrade regularly could cause unnecessary seqno holds, for example.

This PR shifts all interaction with the reader lease to the heartbeat task, which is now responsible for downgrading both the since and seqno holds, and manages the final expiry of the state. Communication between the handle and the task is via some shared state, which is managed with an abstraction similar to Persist's Applier for shard-level state. This ensures that the state is updated regularly, and with the most recent possible information, even when whomever holds the handle is busy with other stuff.

Motivation

https://github.com/MaterializeInc/incidents-and-escalations/issues/224

A core issue with this stuff is that we can't downgrade our seqno hold when nobody is interacting with the client (downgrading frontiers etc.) and in practice it's difficult to ensure that every handle is interacted with promptly. The background heartbeat task was a partial solution for this, which handled only extending leases but not releasing any unnecessarily held resources. Automatically downgrading the seqno hold without requiring the caller to do anything proactively should help avoid most cases of leaked seqnos.

This is a more general solution than the recent #34458, which times out certain long-running operations so we can run downgrades etc. With this PR, the downgrades will run in the background heartbeat task no matter what long-running operations prevents progress in the frontend.

Tips for reviewer

After this PR, the only way to have a long-running seqno hold should be if a caller has leased a part and not returned it. I'm optimistic that that case is rare - it involves more obviously-wrong client behaviour, where the case this PR fixes is pretty subtle - and anyways there's not much we could do about that in Persist internals, since holding back the seqno is needed for correctness.

This is somewhere between Tokio's `watch` and a normal mutex with a condvar. Kind of annoying to have to write one's own library for this, but everything else I tried was either awkward or needed an unbounded channel...

These two are much more tightly related than the task is to the read handle, and this allows for some additional privacy for shared state.

Observe new seqnos in both the handle and the heartbeat thread Always downgrade to the shared seqno Should be pretty recent! Use the main downgrade-since method in the background thread Do all downgrades in the background task Shift expiry to the background task Move the heartbeat task to read.rs This is a bit more read-handle business than machine business, in my opinion! Comment fixup

bkirwi · 2026-01-05T21:12:24Z

Deployed to staging. Overall metrics look good - the rate of downgrades is less spiky and they see fewer conflicts.

One negative metric change is that the heartbeat task wakes up substantially more often. I believe this is because frequent downgrading of the since will trigger the background task to wake, even if there's nothing to do yet. I think I could rework the code to avoid this, though it's a bit tedious, and it's only a mild drawback, so I will wait to be asked first. :)

DAlperin · 2026-01-06T20:52:29Z

src/persist-client/src/read.rs

+        self.since = new_since.clone();
+        self.hold_state.modify(|s| {
+            s.downgrade_since(new_since);
+            s.request_sync = true;


You could probably peek the outstanding seqno here to decide if you wanted to request a wakeup... but I'm not too worried

DAlperin

Very exciting, thank you!

bkirwi force-pushed the task-lease branch from 6d8ae18 to a96044d Compare December 23, 2025 01:58

bkirwi force-pushed the task-lease branch 6 times, most recently from 5e66761 to 080ec06 Compare January 3, 2026 00:26

bkirwi changed the title ~~[persist] [wip] Task lease~~ [persist] Shift all heartbeat / frontier management for the read handle to a background task Jan 5, 2026

bkirwi added 4 commits January 5, 2026 13:54

Remove now-redundant background heartbeat task

843cabc

Add an AwaitableState abstraction in persist internals

c91f64a

This is somewhere between Tokio's `watch` and a normal mutex with a condvar. Kind of annoying to have to write one's own library for this, but everything else I tried was either awkward or needed an unbounded channel...

Move the reader heartbeat task to live with ReadHandle

82ff49f

These two are much more tightly related than the task is to the read handle, and this allows for some additional privacy for shared state.

bkirwi force-pushed the task-lease branch from 080ec06 to eadaaea Compare January 5, 2026 18:54

bkirwi marked this pull request as ready for review January 5, 2026 20:16

bkirwi requested review from a team as code owners January 5, 2026 20:16

DAlperin reviewed Jan 6, 2026

View reviewed changes

DAlperin approved these changes Jan 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[persist] Shift all heartbeat / frontier management for the read handle to a background task #34590

[persist] Shift all heartbeat / frontier management for the read handle to a background task #34590

Uh oh!

bkirwi commented Dec 19, 2025 •

edited

Loading

Uh oh!

bkirwi commented Jan 5, 2026

Uh oh!

DAlperin Jan 6, 2026

Uh oh!

DAlperin left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[persist] Shift all heartbeat / frontier management for the read handle to a background task #34590

Are you sure you want to change the base?

[persist] Shift all heartbeat / frontier management for the read handle to a background task #34590

Uh oh!

Conversation

bkirwi commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Tips for reviewer

Uh oh!

bkirwi commented Jan 5, 2026

Uh oh!

DAlperin Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

DAlperin left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bkirwi commented Dec 19, 2025 •

edited

Loading