[external-api] Add contact support field to update status by karencfv · Pull Request #10271 · oxidecomputer/omicron

karencfv · 2026-04-15T07:41:16Z

This PR is the last piece of a minimal system health check for update status. It adds a new boolean field, contact_support, to the system/update/status API. Its value is derived from the latest inventory collection and a few additional health checks.

Disclaimer: I used the claude code skill to make the endpoint edit, and also for part of the code (trying to learn how to use it here). I checked the code several times and tested manually, but just thought I'd mention it here.

Manual tests:

There are unhealthy services

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:59809/omicron?sslmode=disable"
<...>
    zpools
      0be6eab2-9e27-4c3e-bbaf-11435e393ed2: total size: 16 GiB health: online
      4ac3f3b4-a423-46cb-93d1-bc393545b9e1: total size: 16 GiB health: online
      77468dca-740c-49f3-b10e-a21a3d9e6462: total size: 16 GiB health: online
<...>
SMF SERVICES STATUS
    4 SMF services enabled but not online at 2026-04-16T06:27:35.387Z
        FMRI                                ZONE       STATE       
        svc:/site/fake-service2:default     global     maintenance 
        svc:/site/fake-service3:default     global     offline     
        svc:/site/fake-service4:default     global     degraded    
        svc:/site/fake-service:default      global     maintenance
<...>
$ curl -b cookies.txt -H "api-version: 2026041500.0.0"   http://127.0.0.1:12220/v1/system/update/status | jq
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100    189 100    189   0      0    959      0                              0
{
  "target_release": null,
  "components_by_release_version": {
    "install dataset": 8,
    "unknown": 11
  },
  "time_last_step_planned": "2026-04-16T07:08:43.121286Z",
  "suspended": false,
  "contact_support": true
}

Everything is happy!

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:59809/omicron?sslmode=disable"
<...>
    zpools
      337ab774-358d-4cb4-bdf4-5672caa90d5f: total size: 16 GiB health: online
      c8118f52-a5f4-451a-87ce-cf331b80988c: total size: 16 GiB health: online
      e2b28628-9c8e-4be3-9086-5c52082c3f85: total size: 16 GiB health: online
<...>
SMF SERVICES STATUS
    0 SMF services enabled but not online at 2026-04-16T07:11:45.570Z
<...>
$ curl -b cookies.txt -H "api-version: 2026041500.0.0"   http://127.0.0.1:12220/v1/system/update/status | jq
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100    188 100    188   0      0   1197      0                              0
{
  "target_release": null,
  "components_by_release_version": {
    "install dataset": 8,
    "unknown": 11
  },
  "time_last_step_planned": "2026-04-16T07:11:46.268131Z",
  "suspended": false,
  "contact_support": false
}

Closes: #9418

david-crespo · 2026-04-16T16:57:43Z

It worries me slightly to tell the user the system is unhealthy at times when that's expected.

karencfv · 2026-04-16T21:02:45Z

> It worries me slightly to tell the user the system is unhealthy at times when that's expected.

I totally get it. My first instinct was to call this "is_system_updateable" or something like that. We discussed this at some point, I think during a meeting; I looked for the discussion but couldn't find it. I don't remember the specifics, but I think the reasoning behind this naming was to make sure users don't ignore an "unhealthy" system when they encounter one, and that they do call support.

Maybe @davepacheco can expand

An idea was floated that the console could hide the status while there is an ongoing update. @david-crespo, what is your take on this?

david-crespo · 2026-04-17T02:11:00Z

That’s interesting, so it would be like healthy/unhealthy, unless less than 100% of components are on the target version, in which case we’re “updating” or something. I guess I wonder what “unhealthy” is supposed to tell the user. I’d much rather have it in the form of an active problem.

karencfv · 2026-04-17T02:52:46Z

The idea of this work is to take the place of the health check script the support team currently runs before and after each update, until we have a proper FM implementation. We want it specifically tied to the update process (see https://rfd.shared.oxide.computer/rfd/0612). There is more detail in #9876.

Perhaps we can chat further on the topic at the next update sync to make sure we're all on the same page?

david-crespo · 2026-04-17T12:45:06Z

That's helpful, I'll read that issue. Off the top of my head I think it would feel better to me (and possibly be more useful to support) to have all the sub-checks as separate booleans rather than synthesizing them all into one big AND. And it doesn't really feel that update-specific, even though it's used during update. So maybe it belongs in its own endpoint?

karencfv · 2026-04-20T02:26:12Z

> Off the top of my head I think it would feel better to me (and possibly be more useful to support) to have all the sub-checks as separate booleans rather than synthesizing them all into one big AND.

There are a few things at play here.

From the user's perspective, none of the failed checks are actionable, so we don't want to give them any more information than they need. In this case the only information they need is "something isn't right after the update, go call support". There is more detail on this here -> https://rfd.shared.oxide.computer/rfd/0612#_user_facing.

The support team does need more information about what went wrong. For them, we are adding all of the health data to inventory, which is included in the support bundles. This endpoint isn't really for them. Initially we were going to have dedicated health checks running in the background and they were going to be part of a "health monitor" object in inventory. Ultimately, we backtracked on this as it was overlapping too much with what will eventually be FM. The reasoning behind that restructure is in #9876.
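To make that split concrete, here is a minimal Rust sketch of keeping per-check detail internally while exposing only one boolean. It assumes the two checks this PR starts with (zpools online, enabled SMF services online). `SmfState`, `CheckDetail`, `from_observations`, and `any_check_failed` are hypothetical names for illustration, not code from this PR.

```rust
/// SMF service states seen in the omdb output above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SmfState {
    Online,
    Offline,
    Degraded,
    Maintenance,
}

/// Per-check detail, the kind of data that would stay in inventory and
/// support bundles rather than in the external API (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CheckDetail {
    pub zpools_not_online: usize,
    pub smf_not_online: usize,
}

impl CheckDetail {
    /// Counts failing items for each check from raw observations.
    pub fn from_observations(zpool_online: &[bool], smf: &[SmfState]) -> Self {
        CheckDetail {
            zpools_not_online: zpool_online.iter().filter(|online| !**online).count(),
            smf_not_online: smf.iter().filter(|s| **s != SmfState::Online).count(),
        }
    }

    /// Collapses the detail into the single bit a user would see.
    pub fn any_check_failed(&self) -> bool {
        self.zpools_not_online > 0 || self.smf_not_online > 0
    }
}
```

With three online zpools and the four fake services from the first manual test, `any_check_failed()` returns true; with no failing services it returns false.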

> So maybe it belongs in its own endpoint?

Maybe? The thing is, this will most likely all go away when FM is implemented. We don't want to give these checks too much importance, or have customers rely on them too much. For now we just want them to be part of update status, so customers can have some confidence about whether an update went well, or whether something is wrong and they should not begin an update at all.

davepacheco · 2026-04-20T23:32:38Z

@david-crespo thanks for taking a look. Definitely the intended long-term solution here is that this information feeds into an "active problems" API driven by the FM subsystem. We explicitly decided not to try to do this here. From RFD 612:

> This proposal should be viewed as a first useful customer-visible deliverable along a path towards integration with the fault management system. It is not a replacement for that subsystem, nor is it seeking to take on more technical debt to make up for the absence of that system.
> To that end, our goals are to do as little throwaway work as possible, and where we need to do new work, do it in a way that’s aligned with what the fault management project will eventually need.

@david-crespo wrote:

> I guess I wonder what “unhealthy” is supposed to tell the user.

These are the two goals:

1. When the system is obviously broken after an update (e.g., cockroachdb in maintenance), we want the customer to be able to know that. In all of the cases we intend to look for, the only action for them is to call support.
2. When the system is similarly broken before an update, we want the customer to be warned that they should call support and resolve that before starting the update.

To that end, I would rename this field call_support: bool.

@david-crespo wrote:

> It worries me slightly to tell the user the system is unhealthy at times when that's expected.

and @karencfv wrote:

> An idea was floated around that the console could hide the status while there was an ongoing update

Yeah, we definitely don't want false alarms during an upgrade. We did discuss that and wrote it into RFD 612:

> As health checks often fail during an update, we only want them visible via the external API when the system is idle or when an ongoing update has stalled for a set period of time.

As I read that, the API should not report call_support: true unless the health checks fail and either (1) there's no update in progress or (2) there's been no new blueprint planned for N minutes. This is the same guidance we give to support: the Reconfigurator Ops Guide suggests waiting 10-15 minutes before deciding the update is stuck.
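The rule as read here can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `UpdateSnapshot`, its fields, and `contact_support` are hypothetical names, and the stall threshold (the "N minutes" above) is passed in by the caller rather than fixed.

```rust
use std::time::Duration;

/// Hypothetical inputs to the decision; none of these names come from the PR.
#[derive(Debug, Clone, Copy)]
pub struct UpdateSnapshot {
    /// True when every health check passed.
    pub checks_passed: bool,
    /// True while an update is in progress.
    pub update_in_progress: bool,
    /// Time elapsed since the last blueprint step was planned.
    pub since_last_step_planned: Duration,
}

/// Reports true only when the checks fail AND the system is either idle
/// or the ongoing update looks stalled (no planning for `stall_threshold`).
pub fn contact_support(s: UpdateSnapshot, stall_threshold: Duration) -> bool {
    if s.checks_passed {
        return false;
    }
    let stalled = s.since_last_step_planned >= stall_threshold;
    !s.update_in_progress || stalled
}
```

Failing checks during an update that planned a step one minute ago would not raise the flag; the same failing checks with no update in progress, or with no planning activity past the threshold, would.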

davepacheco · 2026-04-20T23:33:52Z

Forgot to add: @david-crespo hopefully that's clarifying and if it makes sense, great. If not, we could discuss on tomorrow's update sync (or another time, if that's a bad time)?

david-crespo · 2026-04-20T23:50:51Z

Yes, it does help. I like call_support (maybe contact_support). And having the API do the hiding logic during an update also makes sense.

davepacheco · 2026-04-20T23:36:01Z

+            // There should always be an inventory collection before or after an
+            // update
+            None => false,
+            Some(collection) => collection.is_system_healthy(),


I think we also want checks here:

- that the inventory collection is not too old
- that if there's an update in progress, then time_last_step_planned is within the last N minutes
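The first of these checks could look roughly like the following Rust sketch. `InventoryFreshness` and `classify_inventory` are hypothetical names, and the maximum age is a caller-supplied assumption. What the endpoint should report for a missing or stale collection is deliberately left to the caller, since that is not settled in this thread.

```rust
use std::time::Duration;

/// Outcome of the freshness check (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InventoryFreshness {
    /// No inventory collection exists yet.
    Missing,
    /// The latest collection is older than the allowed maximum.
    Stale,
    /// The latest collection is recent enough to trust.
    Fresh,
}

/// Classifies the latest collection by its age. `age` is `None` when no
/// collection has been made; otherwise it is the time since collection.
pub fn classify_inventory(age: Option<Duration>, max_age: Duration) -> InventoryFreshness {
    match age {
        None => InventoryFreshness::Missing,
        Some(a) if a > max_age => InventoryFreshness::Stale,
        Some(_) => InventoryFreshness::Fresh,
    }
}
```

Returning a three-way result instead of a bool keeps "no data" and "old data" distinct from "healthy", so neither is silently treated as a pass.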

davepacheco · 2026-04-20T23:38:23Z

+    /// - All zpools are online.
+    /// - All enabled SMF services are in an online state.
+    pub fn is_system_healthy(&self) -> bool {
+        self.are_all_zpools_healthy()


What about:

- missing zpools (would probably have to compare against blueprint?)
- missing sled agents, SPs, or other components (presumably would compare against blueprint)
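A comparison like that might be sketched as a set difference between what the blueprint expects and what inventory observed. This assumes components can be identified by a comparable ID (plain strings here); `missing_from_inventory` is a hypothetical helper, not an existing omicron function.

```rust
use std::collections::BTreeSet;

/// Returns the IDs that the blueprint expects but inventory did not report,
/// in sorted order. Both sets hold opaque component IDs (an assumption).
pub fn missing_from_inventory<'a>(
    expected: &'a BTreeSet<String>,
    observed: &BTreeSet<String>,
) -> Vec<&'a str> {
    expected
        .iter()
        .filter(|id| !observed.contains(*id))
        .map(String::as_str)
        .collect()
}
```

An empty result means every expected component showed up. Components that inventory reports but the blueprint does not list are ignored by this sketch.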

I'm torn about whether any of this belongs in impl Inventory vs. out in the health check function. I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.

karencfv

Thanks for the input, both! I think contact_support is a much better name as well

karencfv · 2026-04-21T07:33:57Z

+    /// - All zpools are online.
+    /// - All enabled SMF services are in an online state.
+    pub fn is_system_healthy(&self) -> bool {
+        self.are_all_zpools_healthy()


My intention here was to get an initial endpoint going with just a couple of checks (the ones that are indicated as priority in the health check table in #9876) to keep the PR small. In follow up PRs we would have additional checks trickling in.

> I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.

Yeah, that's a good point. I should at least move is_system_healthy closer to the API and rename it contact_support or something 😄

karencfv added 5 commits April 15, 2026 19:38

[external-api] Add health field to update status

b80e3aa

clean up

e3b0bd1

Add logic to determine whether the system is healthy

1357192

move some code around

18fc75b

fix API description

9a8b5de

karencfv marked this pull request as ready for review April 16, 2026 07:37

karencfv requested review from davepacheco and jgallagher April 16, 2026 07:37

davepacheco reviewed Apr 20, 2026

karencfv commented Apr 21, 2026

karencfv added 9 commits April 22, 2026 13:49

retrieve sagas within a time limit

a44c47f

better name?

e26fec1

fmt

16f42e5

merge main

054238d

fix versioning

0357a09

rename field

60a9384

move function and check stale sagas

3de3c1d

adapt tests

4e4157d

fmt

68636ca

karencfv changed the title ~~[external-api] Add health field to update status~~ [external-api] Add contact support field to update status Apr 22, 2026

rename files

34674f6

undo rust analyser's unhelpful edits

386fad1
