Fix guest induced reboot/ shutdown migration failure #87

Open
Coffeeri wants to merge 4 commits into cyberus-technology:gardenlinux from Coffeeri:fix-guest-induced-shutdown-migration-failure

@Coffeeri Coffeeri commented Feb 17, 2026

This PR fixes https://github.com/cobaltcore-dev/cobaltcore/issues/374, where the VMM fails with VmError::VmMigrating if the guest triggers a reboot or shutdown during an active live migration.
When this happens, the VMM control loop handles the reset or shutdown request while the VM ownership is MaybeVmOwnership::Migration. Calling vm_reboot() or vmm_shutdown() in that state returns VmMigrating, which aborts the normal lifecycle handling.

The long-term idea is to refactor vm_reboot() so that it can survive a migration, by introducing a reset capability for the VM's components.
This PR introduces a first workaround: intercept the reboot/shutdown event, pause the VM, migrate, resume the VM, and finally re-emit the reboot/shutdown event.

What this change does

  • Introduces a small PostMigrationLifecycleEvent enum field in VmSnapshot.

  • Migration Source

    • The control loop still consumes reset_evt / exit_evt.
    • If a migration is in progress, it latches the first lifecycle event (reset/ shutdown) instead of executing it.
    • The latched event is written into the VM snapshot before sending state.
    • If a lifecycle event is latched during pre-copy, the migration switches to the downtime phase at the next iteration boundary.
    • On migration failure, once ownership returns to the VMM, the latched event is replayed locally via the existing eventfds.
  • Migration Destination

    • The lifecycle enum field is read from the received snapshot.
    • After Command::Complete, the VM is resumed and the lifecycle action is replayed through the existing eventfds:
      • VmReboot → reset_evt
      • VmmShutdown → exit_evt
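
The replay mapping above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the variant names `VmReboot` / `VmmShutdown` and the names `reset_evt` / `exit_evt` come from the description. `FakeEventFd`, `LifecycleEventFds`, and `replay` are hypothetical stand-ins so the example builds with the standard library alone.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Lifecycle action to replay once the migration has finished.
/// Variant names follow the PR description; the derives are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Hypothetical stand-in for an eventfd: a counter that `write` adds to.
#[derive(Default)]
pub struct FakeEventFd(AtomicU64);

impl FakeEventFd {
    pub fn write(&self, value: u64) {
        self.0.fetch_add(value, Ordering::SeqCst);
    }

    pub fn count(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical holder for the two eventfds the control loop listens on.
#[derive(Default)]
pub struct LifecycleEventFds {
    pub reset_evt: FakeEventFd,
    pub exit_evt: FakeEventFd,
}

/// Replays a carried lifecycle event by signalling the matching eventfd,
/// so the control loop handles it through its usual reset/exit path.
pub fn replay(event: Option<PostMigrationLifecycleEvent>, fds: &LifecycleEventFds) {
    match event {
        Some(PostMigrationLifecycleEvent::VmReboot) => fds.reset_evt.write(1),
        Some(PostMigrationLifecycleEvent::VmmShutdown) => fds.exit_evt.write(1),
        // Nothing was carried over: migration completes exactly as before.
        None => {}
    }
}
```

Calling `replay(Some(PostMigrationLifecycleEvent::VmReboot), &fds)` signals only `reset_evt`. Routing the replay through the eventfds means neither side needs a second lifecycle code path.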

Behavior and edge cases

  • No latched event → migration behaves exactly as before.
  • Multiple lifecycle signals during one migration → deterministic first-event-wins.
  • Downtime flow remains unchanged (throttling stop, pause, final transfer, snapshot, complete).
  • Latch state is cleared on migration start and after success or failure to avoid stale state.
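
The first-event-wins and clearing rules listed above can be illustrated with a small latch type. This is a sketch under assumptions: `LifecycleLatch` and its methods are hypothetical names, and the use of a `Mutex<Option<_>>` for the shared state is a guess, not a description of the actual implementation.

```rust
use std::sync::Mutex;

/// Variant names follow the PR description; the derives are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Hypothetical shared latch between the control loop and the migration worker.
#[derive(Default)]
pub struct LifecycleLatch {
    slot: Mutex<Option<PostMigrationLifecycleEvent>>,
}

impl LifecycleLatch {
    /// Stores `event` only if nothing is latched yet (first event wins).
    /// Returns `true` if the event was stored.
    pub fn latch(&self, event: PostMigrationLifecycleEvent) -> bool {
        let mut slot = self.slot.lock().unwrap();
        if slot.is_none() {
            *slot = Some(event);
            true
        } else {
            false
        }
    }

    /// Returns the latched event without removing it.
    pub fn peek(&self) -> Option<PostMigrationLifecycleEvent> {
        *self.slot.lock().unwrap()
    }

    /// Removes and returns the latched event, e.g. for replay after a failure.
    pub fn take(&self) -> Option<PostMigrationLifecycleEvent> {
        self.slot.lock().unwrap().take()
    }

    /// Drops any latched event, e.g. when a new migration starts.
    pub fn clear(&self) {
        *self.slot.lock().unwrap() = None;
    }
}
```

With this shape, a reboot followed by a shutdown during the same migration leaves `VmReboot` latched, and `clear()` or `take()` guarantees that the next migration starts from an empty latch.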

@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch 5 times, most recently from 9dfc444 to 5141abb Compare February 17, 2026 15:02
@Coffeeri Coffeeri self-assigned this Feb 17, 2026
@Coffeeri Coffeeri changed the title Fix guest induced shutdown migration failure Fix guest induced reboot/ shutdown migration failure Feb 17, 2026
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from 5141abb to 24498cf Compare February 18, 2026 07:26
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch 3 times, most recently from cf161ea to ac45988 Compare February 20, 2026 07:50
@Coffeeri Coffeeri marked this pull request as ready for review February 25, 2026 09:46
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from ac45988 to 77b08fe Compare February 25, 2026 09:46
@Coffeeri
Author

I will add the libvirt-tests shortly. They currently have timing issues, but when they run, the test cases pass successfully:

  • Reboot/Shutdown during migration → verify that the event is re-emitted on the target after migration.
  • Reboot/Shutdown during migration with enforced migration failure → verify that the event is re-emitted locally.

@hertrste
Collaborator

Very nice!

Very pragmatic and robust workaround for now. Otherwise I am good with the design. I'll let our Rust experts do the fine-grained code tweaking. I'll approve once #96 is merged.

@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from 77b08fe to e75560d Compare February 26, 2026 15:59
Member

@phip1611 phip1611 left a comment

Thanks for this wonderful solution and work-around for the current reboot design!

I left a few remarks. One more (very much a nit): I personally find that "latched" doesn't give me much context, though this may be down to my limited English skills. I'd prefer "postponed_lifecycle" event. Feel free to ignore.

@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch 5 times, most recently from 4338e29 to 0b0f82d Compare March 2, 2026 09:01
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch 2 times, most recently from ab4fb3d to 51380d6 Compare March 2, 2026 09:15
@Coffeeri Coffeeri requested a review from phip1611 March 2, 2026 10:02
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from 51380d6 to 89cf2dc Compare March 2, 2026 10:53
Coffeeri added 2 commits March 2, 2026 15:43
During live migration, VM ownership is moved away from the VMM thread.
To preserve guest-triggered reboot and shutdown lifecycle intent
across that ownership handover, we need a small lifecycle marker to
travel with the migrated VM state.

This change introduces `PostMigrationLifecycleEvent` and stores it in
`VmSnapshot` with `#[serde(default)]` for backward compatibility.
`Vm::snapshot()` now serializes the marker, and VM construction from a
snapshot restores it.

No control-loop behavior is changed in this commit. This is only the
data model/plumbing needed by follow-up commits.

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
While a live migration is running, the migration worker owns the VM and
the VMM control loop cannot execute vm_reboot()/vmm_shutdown() directly.
Guest-triggered reset/exit events in that window currently hit
VmMigrating and fail.

This change makes the control loop consume reset/exit as before, but
when ownership is `MaybeVmOwnership::Migration` it postpones a
post-migration lifecycle intent instead of calling lifecycle handlers
directly.

The postponed state is first-event-wins and is cleared when a new send
migration starts, preventing stale lifecycle intent from leaking between
migrations.

This commit only introduces source-side postponing behavior and does not
yet apply or replay the postponed event.

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
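
The source-side decision described in this commit message might look roughly like the sketch below. It is a simplified model, not the actual control loop: `Ownership` is a hypothetical two-variant stand-in for `MaybeVmOwnership` (whose real variants are not shown in this conversation), and `Outcome` and `handle_lifecycle_event` are illustrative names.

```rust
/// Variant names follow the PR description; the derives are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Hypothetical, simplified stand-in for `MaybeVmOwnership`.
pub enum Ownership {
    /// The VMM thread owns the VM and may run lifecycle handlers directly.
    Vmm,
    /// The migration worker owns the VM; handlers would return `VmMigrating`.
    Migration,
}

/// What the control loop decided to do with a consumed reset/exit event.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Safe to call the normal handler (vm_reboot / vmm_shutdown) now.
    ExecuteNow(PostMigrationLifecycleEvent),
    /// Recorded as the post-migration lifecycle intent.
    Postponed,
    /// An earlier event is already postponed, so this one is dropped.
    AlreadyPostponed,
}

/// Consumes a lifecycle event and decides between executing and postponing it.
pub fn handle_lifecycle_event(
    ownership: &Ownership,
    postponed: &mut Option<PostMigrationLifecycleEvent>,
    event: PostMigrationLifecycleEvent,
) -> Outcome {
    match ownership {
        Ownership::Vmm => Outcome::ExecuteNow(event),
        Ownership::Migration => {
            if postponed.is_none() {
                *postponed = Some(event);
                Outcome::Postponed
            } else {
                Outcome::AlreadyPostponed
            }
        }
    }
}
```

The point of the sketch is that the event is always consumed, so the eventfd does not stay signalled, while the handler call is skipped only in the `Migration` state.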
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from 89cf2dc to e0f513c Compare March 2, 2026 14:45
Member

@phip1611 phip1611 left a comment

LGTM! Just a few more remarks

Coffeeri added 2 commits March 3, 2026 16:25
Add migration plumbing to carry the postponed lifecycle intent from
source to destination and replay it through the existing control-loop
paths.

The migration worker now passes the shared postponed lifecycle state
into the send path, and the sender writes the selected
`PostMigrationLifecycleEvent` into the VM snapshot before transmitting
state.

On the receiving side, migration state restore extracts that snapshot
field and stores it in VMM state. After `Command::Complete`, the target
resumes the VM and replays the lifecycle action by writing to the
existing eventfds:
  - VmReboot -> reset_evt
  - VmmShutdown -> exit_evt

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
When a lifecycle event like reset or shutdown is postponed during
pre-copy, switch to downtime at the next iteration boundary.
This keeps the current iteration send intact and then transitions
into the existing graceful downtime path (`stop_vcpu_throttling()`,
`pause()`, final transfer, snapshot).

To keep behavior deterministic on source migration failure,
replay the postponed lifecycle event locally after ownership is
returned:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

Postponed state is cleared on both success and failure paths to avoid
stale state across migrations.

On-behalf-of: SAP [email protected]
Signed-off-by: Leander Kohler <[email protected]>
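
To illustrate what "switch to downtime at the next iteration boundary" means, here is a toy model of a pre-copy loop. Everything in it is an assumption made for illustration: `run_precopy`, `DowntimeReason`, and the fixed iteration limit are hypothetical, and a real pre-copy loop would stop on its own convergence criteria rather than a simple counter.

```rust
/// Why the (simulated) pre-copy phase ended and downtime began.
#[derive(Debug, PartialEq, Eq)]
pub enum DowntimeReason {
    /// A guest reset/shutdown was postponed during pre-copy.
    LifecycleEventPostponed,
    /// Stand-in for the normal end of pre-copy.
    IterationLimit,
}

/// Simulates pre-copy. `event_arrives_in` is the zero-based iteration during
/// which the guest raises a lifecycle event, if any.
/// Returns the number of fully completed iterations and the reason for
/// entering downtime.
pub fn run_precopy(
    max_iterations: usize,
    event_arrives_in: Option<usize>,
) -> (usize, DowntimeReason) {
    let mut event_postponed = false;
    let mut completed = 0;

    while completed < max_iterations {
        // The guest may raise reset/shutdown while this iteration is sending.
        if event_arrives_in == Some(completed) {
            event_postponed = true;
        }

        // The current iteration's send is never interrupted (elided here).
        completed += 1;

        // Iteration boundary: only now is the postponed state checked.
        if event_postponed {
            return (completed, DowntimeReason::LifecycleEventPostponed);
        }
    }

    (completed, DowntimeReason::IterationLimit)
}
```

For example, `run_precopy(5, Some(1))` finishes the second iteration and then enters downtime, instead of running all five iterations while the guest waits for its reboot or shutdown.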
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from bba74b6 to 9872a8a Compare March 3, 2026 15:27

4 participants