Fix guest induced reboot/shutdown migration failure #87
Open
Coffeeri wants to merge 4 commits into cyberus-technology:gardenlinux from
tpressure reviewed Feb 17, 2026
Coffeeri commented Feb 18, 2026
Author
I will add the libvirt-tests shortly. They currently have timing issues, but when they run, the test cases pass successfully.
Collaborator
Very nice! Very pragmatic and robust workaround for now. Otherwise I am good with the design. I'll let our Rust experts do the fine-grained code tweaking. I approve once #96 is merged.
hertrste approved these changes Feb 26, 2026
phip1611 reviewed Feb 27, 2026
Member phip1611 left a comment
Thanks for this wonderful solution and workaround for the current reboot design!
I left a few remarks. One further nit: I personally think that "latched" doesn't give me much context, though this may also come from my limited English skills. I'd prefer "postponed_lifecycle" event. Feel free to ignore.
During live migration, VM ownership is moved away from the VMM thread. To preserve guest-triggered reboot and shutdown lifecycle intent across that ownership handover, we need a small lifecycle marker to travel with the migrated VM state.

This change introduces `PostMigrationLifecycleEvent` and stores it in `VmSnapshot` with `#[serde(default)]` for backward compatibility. `Vm::snapshot()` now serializes the marker, and VM construction from a snapshot restores it.

No control-loop behavior is changed in this commit. This is only the data model/plumbing needed by follow-up commits.

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
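A minimal sketch of the data model this commit describes, not the actual code from the PR. The variant names `VmReboot` and `VmmShutdown` come from the later commit messages; the `Option` wrapper, the `vcpu_count` field, the field name `post_migration_lifecycle_event`, and the `from_fields` helper are assumptions made for illustration. The real change relies on `#[serde(default)]`; to stay dependency-free, the sketch imitates that behavior by falling back to the default when the key is absent, as it would be in a snapshot written by an older version.

```rust
use std::collections::HashMap;

/// Lifecycle action the guest requested while the VM was being migrated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Stand-in for the snapshot struct; only the new field matters here.
#[derive(Debug, Default, PartialEq)]
pub struct VmSnapshot {
    /// Hypothetical pre-existing field.
    pub vcpu_count: u32,
    /// New field (name assumed). `None` means "nothing to replay".
    pub post_migration_lifecycle_event: Option<PostMigrationLifecycleEvent>,
}

impl VmSnapshot {
    /// Hypothetical decoder over already-parsed key/value pairs.
    /// A missing or unknown value yields `None`, which mirrors what a
    /// defaulted field does for snapshots that predate the field.
    pub fn from_fields(fields: &HashMap<&str, &str>) -> Self {
        let event = match fields.get("post_migration_lifecycle_event").copied() {
            Some("VmReboot") => Some(PostMigrationLifecycleEvent::VmReboot),
            Some("VmmShutdown") => Some(PostMigrationLifecycleEvent::VmmShutdown),
            _ => None,
        };
        VmSnapshot {
            vcpu_count: fields
                .get("vcpu_count")
                .and_then(|v| v.parse().ok())
                .unwrap_or_default(),
            post_migration_lifecycle_event: event,
        }
    }
}
```

Decoding a map that lacks the new key still succeeds and leaves the marker empty, so an older sender and a newer receiver remain compatible in this model.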
While a live migration is running, the migration worker owns the VM and the VMM control loop cannot execute vm_reboot()/vmm_shutdown() directly. Guest-triggered reset/exit events in that window currently hit VmMigrating and fail.

This change makes the control loop consume reset/exit as before, but when ownership is `MaybeVmOwnership::Migration` it postpones a post-migration lifecycle intent instead of calling lifecycle handlers directly.

The postponed state is first-event-wins and is cleared when a new send migration starts, preventing stale lifecycle intent from leaking between migrations.

This commit only introduces source-side postponing behavior and does not yet apply or replay the postponed event.

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
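The postponing rule can be sketched as a small state machine. This is an illustration under assumptions, not the code from the PR: `Ownership` is a simplified stand-in for `MaybeVmOwnership`, and `Action`, `PostponedLifecycle`, and `on_guest_event` are hypothetical names. The real state is shared with the migration worker and would need synchronization; a plain `&mut` is used here to keep the sketch short.

```rust
/// Simplified stand-in for the ownership states named in the commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ownership {
    Vmm,
    Migration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// What the control loop does with a guest reset/exit event (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// The VMM owns the VM: run the normal lifecycle handler.
    HandleNow(PostMigrationLifecycleEvent),
    /// A migration owns the VM: the intent was recorded for later.
    Postponed,
}

/// Holds at most one postponed lifecycle intent.
#[derive(Debug, Default)]
pub struct PostponedLifecycle {
    event: Option<PostMigrationLifecycleEvent>,
}

impl PostponedLifecycle {
    /// First event wins: later events do not overwrite an earlier one.
    pub fn postpone(&mut self, ev: PostMigrationLifecycleEvent) {
        if self.event.is_none() {
            self.event = Some(ev);
        }
    }

    /// Called when a new send migration starts, so stale intent cannot leak.
    pub fn clear(&mut self) {
        self.event = None;
    }

    pub fn get(&self) -> Option<PostMigrationLifecycleEvent> {
        self.event
    }
}

/// Control-loop decision for a guest-triggered reset or exit.
pub fn on_guest_event(
    ownership: Ownership,
    ev: PostMigrationLifecycleEvent,
    postponed: &mut PostponedLifecycle,
) -> Action {
    match ownership {
        Ownership::Vmm => Action::HandleNow(ev),
        Ownership::Migration => {
            postponed.postpone(ev);
            Action::Postponed
        }
    }
}
```

With this shape, a shutdown followed by a reboot during the same migration leaves `VmmShutdown` recorded, and `clear()` at the start of the next migration resets the slot.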
phip1611 reviewed Mar 2, 2026
Member phip1611 left a comment
LGTM! Just a few more remarks
Add migration plumbing to carry the postponed lifecycle intent from source to destination and replay it through the existing control-loop paths.

The migration worker now passes the shared postponed lifecycle state into the send path, and the sender writes the selected `PostMigrationLifecycleEvent` into the VM snapshot before transmitting state.

On the receiving side, migration state restore extracts that snapshot field and stores it in VMM state. After `Command::Complete`, the target resumes the VM and replays the lifecycle action by writing to the existing eventfds:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
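The replay step on the destination amounts to a mapping from the restored marker to one of two eventfds. The sketch below assumes a counter-based `FakeEventFd` in place of a real eventfd so that it runs without extra crates; `replay_postponed` is a hypothetical name, and the `Option` wrapper around the event is an assumption.

```rust
use std::io;
use std::sync::atomic::{AtomicU64, Ordering};

/// Counter-based stand-in for an eventfd: `write` adds to the counter.
#[derive(Debug, Default)]
pub struct FakeEventFd(AtomicU64);

impl FakeEventFd {
    pub fn write(&self, value: u64) -> io::Result<()> {
        self.0.fetch_add(value, Ordering::SeqCst);
        Ok(())
    }

    pub fn count(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Replays the postponed lifecycle action after the VM has been resumed.
/// The control loop then handles the event through its usual path,
/// because it already listens on these two eventfds.
pub fn replay_postponed(
    event: Option<PostMigrationLifecycleEvent>,
    reset_evt: &FakeEventFd,
    exit_evt: &FakeEventFd,
) -> io::Result<()> {
    match event {
        Some(PostMigrationLifecycleEvent::VmReboot) => reset_evt.write(1),
        Some(PostMigrationLifecycleEvent::VmmShutdown) => exit_evt.write(1),
        // Nothing was postponed on the source: nothing to replay.
        None => Ok(()),
    }
}
```

Routing the replay through the eventfds, rather than calling the lifecycle handlers directly, means the destination needs no second code path for reboot and shutdown.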
When a lifecycle event like reset or shutdown is postponed during pre-copy, switch to downtime at the next iteration boundary. This keeps the current iteration send intact and then transitions into the existing graceful downtime path (`stop_vcpu_throttling()`, `pause()`, final transfer, snapshot).

To keep behavior deterministic on source migration failure, replay the postponed lifecycle event locally after ownership is returned:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

Postponed state is cleared on both success and failure paths to avoid stale state across migrations.

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
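A rough sketch of the two rules in this commit: check for a postponed event only at iteration boundaries, and clear the postponed state on every exit path while handing the event back for local replay on failure. The names `SharedPostponed`, `run_precopy`, and `finish_migration` are hypothetical, and the `Arc<Mutex<...>>` representation of the shared state is an assumption.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// State shared between the control loop and the migration worker (assumed shape).
pub type SharedPostponed = Arc<Mutex<Option<PostMigrationLifecycleEvent>>>;

/// Runs pre-copy iterations and returns how many were sent before
/// the worker should switch to downtime.
pub fn run_precopy(
    postponed: &SharedPostponed,
    max_iterations: u32,
    mut send_iteration: impl FnMut(u32),
) -> u32 {
    let mut sent = 0;
    while sent < max_iterations {
        // The current iteration is always sent in full.
        send_iteration(sent);
        sent += 1;
        // Iteration boundary: a postponed event ends pre-copy early.
        if postponed.lock().unwrap().is_some() {
            break;
        }
    }
    sent
}

/// Clears the postponed state on both success and failure.
/// On failure, returns the event so the source can replay it locally;
/// on success, the destination is responsible for the replay.
pub fn finish_migration(
    postponed: &SharedPostponed,
    success: bool,
) -> Option<PostMigrationLifecycleEvent> {
    let event = postponed.lock().unwrap().take();
    if success {
        None
    } else {
        event
    }
}
```

Because `take()` runs unconditionally, no marker survives into the next migration regardless of the outcome.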
This PR fixes https://github.com/cobaltcore-dev/cobaltcore/issues/374, where the VMM fails with `VmError::VmMigrating` if the guest triggers a reboot or shutdown during an active live migration.

When this happens, the reset or shutdown request is handled by the VMM control loop while the VM ownership is `MaybeVmOwnership::Migration`. Calling `vm_reboot()` or `vmm_shutdown()` in that state returns `VmMigrating`, which causes the normal lifecycle handling to abort.

The long-term idea is to refactor the `vm_reboot` function so that it can outlive a migration, by introducing a reset capability for the VM's components.

This PR introduces a first workaround: intercept the reboot/shutdown event, pause the VM, migrate, resume the VM, and finally re-emit the reboot/shutdown event.
What this change does
Introduces a small `PostMigrationLifecycleEvent` enum field in `VmSnapshot`.
Migration Source
`reset_evt`/`exit_evt`.
Migration Destination
After `Command::Complete`, the VM is resumed and the lifecycle action is replayed through the existing eventfds:
- `VmReboot` → `reset_evt`
- `VmmShutdown` → `exit_evt`
Behavior and edge cases