Skip to content

write down the reason a VMM moved to Failed#10285

Open
hawkw wants to merge 6 commits intomainfrom
eliza/fail-reason
Open

write down the reason a VMM moved to Failed#10285
hawkw wants to merge 6 commits intomainfrom
eliza/fail-reason

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented Apr 17, 2026

In preparation for working on #10275, I thought it would be nice if there was a way for Nexus to write down why a VMM record was moved to the Failed state, for debugging purposes. This seemed worthwhile since #10275 proposes that we add more cases under which Nexus will mark a VMM as failed, and it would be useful to have a record for debugging any potential issues. This is intended to be displayed in OMDB, rather than presented directly to the user --- for user-facing diagnostics, I think we will want something beyond an unstructured string that gets jammed into the vmm database record when it's moved to Failed, especially because that is a circumstance under which that record will shortly be soft-deleted. Thus, this change is not intended to solve #9185; I've started (and still must finish) a proposal for a user-facing instance event log in RFD 631. Instead, this is a stop-gap temporary solution to record something for Oxide support and engineering to use in possible debugging scenarios, which is why it's so stringly-typed.

@hawkw hawkw requested a review from smklein April 17, 2026 22:56
Copy link
Copy Markdown
Member

@iximeow iximeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀 👀 👀

Comment thread schema/crdb/dbinit.sql
Comment on lines +5944 to +5946
CONSTRAINT failure_reason_iff_failed CHECK (
(failure_reason IS NOT NULL AND state = 'failed')
OR (failure_reason IS NULL)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skimmed through other Omicron bits alongside this change and I think with this change there really is no way an instance should be failed without a reason?

I guess this is not NOT (failure_reason IS NULL AND state == 'failed')` because you don't want a bug in Nexus to trip the constraint and prevent us from failing an instance that should be failed. Since failing an instance means something is Really Broken, refusing to be broken because of a bug would be bad?

that seems worth a note on the schema maybe, even though this is just for debugging?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's correct, i should probably write that down here

Comment on lines +5 to +6
WHERE state = 'failed'
AND failure_reason IS NULL;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, I suppose any failures we've migrated into the future, we don't have a "legacy" reason.. though maybe could be "failure_reason = "predates precise information"?

Comment thread nexus/db-model/src/vmm.rs
Comment on lines +115 to +122
// /// Returns the runtime state of this VMM.
// pub fn runtime(&self) -> VmmRuntimeState {
// VmmRuntimeState {
// time_state_updated: self.time_state_updated,
// generation: self.generation,
// state: self.state,
// }
// }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this isn't used? but also what led you to noticing this, and do you want to just delete it lol

especially since it seems like Instance::runtime is the .runtime() that's actually used?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it should be deleted. i was surprised to discover this also

state: s.state.into(),
generation: s.gen_,
time_updated: s.time_updated,
failure_reason: None, // TODO(eliza)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw I'm ... not sure what this means. taking a swing: "vmm_register() or vmm_get_state() returned a SledVmmState that was immediately Failed"? would we expect Propolis to fill in a reason here, or would we expect sled-agent to invent a reason on Propolis' behalf when it sees the failure?

Comment on lines +7706 to +7713
const UPDATED: &'static str = " updated at";
const INSTANCE_ID: &'static str = "instance ID";
const SLED_ID: &'static str = "sled ID";
const SLED_SERIAL: &'static str = "sled serial";
const CPU_PLATFORM: &'static str = "CPU platform";
const ADDRESS: &'static str = "propolis address";
const STATE: &'static str = "state";
const FAILURE_REASON: &'static str = " failed because:";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb question, but aren't we already juggling the ident below? Why add spaces as prefixes here?

Comment thread nexus/src/app/instance.rs

/// Attempts to transition an instance to to the `Failed` state in
/// response to an error returned from a call to a sled
/// agent instance API, supplied in `reason`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// agent instance API, supplied in `error`.

Comment thread nexus/src/app/instance.rs
time_state_updated: chrono::Utc::now(),
generation: db::model::Generation(vmm.generation.next()),
failure_reason: Some(
"VMM no longer known to sled-agent".to_string(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using error to populate this string?


self.fail_vmm_and_terminate().await;
self.fail_vmm_and_terminate(
"propolis denies knowledge of VM",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; I might change the wording here. By itself this string is a bit mysterious?

Maybe: "Proposlis refused to change state; VM is missing (might have restarted?)"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants