Fix scheduler bugs: heap invariant corruption and lost stop event at dispatch boundary#246
Conversation
9743e85 to
f9a160e
Compare
llucax
left a comment
There was a problem hiding this comment.
I have a few comments about structure mainly. Also a few extra minor comments:
-
No empty line between the last line and signed-off-by in the commit message:
Fix: call heapify() after pop(idx) to restore the heap invariant. Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com> -
The release notes update is mixed in the last commit, I guess it makes more sense to put it in its own commit.
-
The PR description also IMHO focus on the wrong stuff, it didn't help a lot understanding what was not working. Asking AI to create a better description that explains what was exactly the bug, and how it was fixed by the PR, it came up with:
This PR fixes two scheduler bugs when updating an existing dispatch.
- Removing a queued event could corrupt the scheduler heap because the queue was not re-heapified afterward, which could lead to incorrect ordering of future events.
- An update exactly at the stop boundary could remove the pending stop event without notifying the actor, causing the stop transition to be missed.
This PR restores the heap after removals, tracks which queued event was removed, and sends the missing stop notification when a boundary update unschedules the pending stop event.
Regression tests and release notes were added.
PS: TIL: re-sifted == re-heapified
|
One more question. Could the boundary issue happen for updates happen at the start boundary instead of the stop boundary? |
f9a160e to
30182fa
Compare
list.pop(idx) removes an arbitrary element from the heap without restoring the heap property. Add heapify() after the removal to fix the invariant. Change the return type from bool to QueueItem | None so callers can inspect the removed item (needed for the follow-up fix). Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
When a dispatch update arrives at exactly the stop boundary, _remove_scheduled consumes the pending stop event from the queue. _update_changed_running_state does not fire because old_dispatch.started and dispatch.started are both False at this point. The actor would then remain running indefinitely. Fix: if _remove_scheduled returned a stop event (priority == 1) and the dispatch is no longer started, send the running-state change explicitly. Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
becf4de to
aafc196
Compare
Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
aafc196 to
f1bb780
Compare
|
Okay, I think I addressed all feedback now |
The lost-event bug only occurs at the stop boundary, not the start boundary. At the start boundary, the update path already has a fallback notification path through That means:
So even if At the stop boundary, the problematic case is different:
|
|
Wow, no merge queue for this repo? |
|
On a second thought, I'm doubtful the merge queue really adds any value to us, I guess the try-merge and remove, go to the next in the queue is good for monorepos, that might have many many concurrent merges, but for us, that we hardly have any (maybe only when dependabot kicks in), it is starting to feel it just makes things slower and more expensive (testing twice), without much extra gain. But I guess this is something to discuss elsewhere... |
|
Shit. I targeted the wrong upstream branch. We are on |
|
And that's also why there is no merge queue. |
|
Maybe remove the v0.x.x branch if we are not making more releases from it? |
Rebased follow-up of #246 for `v1.x.x`. - restores the scheduler heap after removing queued events - emits the missing stop notification when an update lands on the stop boundary - adds regression coverage for both scheduler issues - adapts the new scheduler tests to the `time-machine` 3 API used on `v1.x.x`
already done :P |
This PR fixes two scheduler bugs when updating an existing dispatch.
This PR restores the heap after removals, tracks which queued event was removed, and sends the missing stop notification when a boundary update unschedules the pending stop event.
Regression tests and release notes were added.