[external-api] Add contact support field to update status#10271

Open
karencfv wants to merge 16 commits into oxidecomputer:main from karencfv:include-health-in-update-status-api

@karencfv
Contributor

@karencfv karencfv commented Apr 15, 2026

This PR is the last piece for a minimal system health check for update status. It is a new field in the system/update/status API called contact_support which is either true or false based on the information in the latest inventory collection and a few additional health checks.

Disclaimer: I used the claude code skill to make the endpoint edit, and also for part of the code (trying to learn how to use it here). I checked the code several times and tested manually, but just thought I'd mention it here.

Manual tests:

There are unhealthy services

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:59809/omicron?sslmode=disable"
<...>
    zpools
      0be6eab2-9e27-4c3e-bbaf-11435e393ed2: total size: 16 GiB health: online
      4ac3f3b4-a423-46cb-93d1-bc393545b9e1: total size: 16 GiB health: online
      77468dca-740c-49f3-b10e-a21a3d9e6462: total size: 16 GiB health: online
<...>
SMF SERVICES STATUS
    4 SMF services enabled but not online at 2026-04-16T06:27:35.387Z
        FMRI                                ZONE       STATE       
        svc:/site/fake-service2:default     global     maintenance 
        svc:/site/fake-service3:default     global     offline     
        svc:/site/fake-service4:default     global     degraded    
        svc:/site/fake-service:default      global     maintenance
<...>
$ curl -b cookies.txt -H "api-version: 2026041500.0.0"   http://127.0.0.1:12220/v1/system/update/status | jq
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100    189 100    189   0      0    959      0                              0
{
  "target_release": null,
  "components_by_release_version": {
    "install dataset": 8,
    "unknown": 11
  },
  "time_last_step_planned": "2026-04-16T07:08:43.121286Z",
  "suspended": false,
  "contact_support": true
}

Everything is happy!

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:59809/omicron?sslmode=disable"
<...>
    zpools
      337ab774-358d-4cb4-bdf4-5672caa90d5f: total size: 16 GiB health: online
      c8118f52-a5f4-451a-87ce-cf331b80988c: total size: 16 GiB health: online
      e2b28628-9c8e-4be3-9086-5c52082c3f85: total size: 16 GiB health: online
<...>
SMF SERVICES STATUS
    0 SMF services enabled but not online at 2026-04-16T07:11:45.570Z
<...>
$ curl -b cookies.txt -H "api-version: 2026041500.0.0"   http://127.0.0.1:12220/v1/system/update/status | jq
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100    188 100    188   0      0   1197      0                              0
{
  "target_release": null,
  "components_by_release_version": {
    "install dataset": 8,
    "unknown": 11
  },
  "time_last_step_planned": "2026-04-16T07:11:46.268131Z",
  "suspended": false,
  "contact_support": false
}

Closes: #9418

@karencfv karencfv marked this pull request as ready for review April 16, 2026 07:37
@david-crespo
Contributor

It worries me slightly to tell the user the system is unhealthy at times when that's expected.

@karencfv
Contributor Author

> It worries me slightly to tell the user the system is unhealthy at times when that's expected.

I totally get it. My first instinct was to call this "is_system_updateable" or something like that. We discussed this somewhere, I think during a meeting; I looked for the discussion but couldn't find it. I don't remember the specifics, but I think the reasoning behind this naming was to make sure users don't ignore the issue when they encounter an "unhealthy" system, and that they do call support.

Maybe @davepacheco can expand

An idea was floated around that the console could hide the status while there was an ongoing update, @david-crespo what is your take on this?

@david-crespo
Contributor

That’s interesting, so it would be like healthy/unhealthy, unless less than 100% of components are on the target version, in which case we’re “updating” or something. I guess I wonder what “unhealthy” is supposed to tell the user. I’d much rather have it in the form of an active problem.

@karencfv
Contributor Author

karencfv commented Apr 17, 2026

The idea of this work is to take the place of the health check script the support team currently runs before and after each update until we have a proper FM implementation. We want it specifically tied to the update process https://rfd.shared.oxide.computer/rfd/0612. More detail here #9876.

Perhaps we can chat further on the topic at the next update sync to make sure we're all on the same page?

@david-crespo
Contributor

david-crespo commented Apr 17, 2026

That's helpful, I'll read that issue. Off the top of my head I think it would feel better to me (and possibly be more useful to support) to have all the sub-checks as separate booleans rather than synthesizing them all into one big AND. And it doesn't really feel that update-specific, even though it's used during update. So maybe it belongs in its own endpoint?

@karencfv
Contributor Author

> Off the top of my head I think it would feel better to me (and possibly be more useful to support) to have all the sub-checks as separate booleans rather than synthesizing them all into one big AND.

There are a few things at play here.

From the user's perspective, none of the failed checks are actionable, so we don't want to give them any more information than they need. In this case the only information they need is "something isn't right after the update, go call support". There is more detail on this here -> https://rfd.shared.oxide.computer/rfd/0612#_user_facing.

The support team does need more information about what went wrong. For them, we are adding all of the health data to inventory, which is included in the support bundles. This endpoint isn't really for them. Initially we were going to have dedicated health checks running in the background, and they were going to be part of a "health monitor" object in inventory. Ultimately, we backtracked on this as it was overlapping too much with what will eventually be FM; the reasoning behind that restructure is in #9876.

> So maybe it belongs in its own endpoint?

Maybe? The thing is, this will all go away when FM is implemented most likely. We don't want to give these checks too much importance. Or have customers rely on them too much. For now we just want them to be part of update status, so customers can have some sort of confidence that an update went well or not. Or if something is wrong and they should not begin an update process at all.

@davepacheco
Collaborator

@david-crespo thanks for taking a look. Definitely the intended long-term solution here is that this information feeds into an "active problems" API driven by the FM subsystem. We explicitly decided not to try to do this here. From RFD 612:

> This proposal should be viewed as a first useful customer-visible deliverable along a path towards integration with the fault management system. It is not a replacement for that subsystem, nor is it seeking to take on more technical debt to make up for the absence of that system.
> To that end, our goals are to do as little throwaway work as possible, and where we need to do new work, do it in a way that’s aligned with what the fault management project will eventually need.

@david-crespo wrote:

> I guess I wonder what “unhealthy” is supposed to tell the user.

These are the two goals:

  • When the system is obviously broken after an update (e.g., cockroachdb in maintenance), we want the customer to be able to know that. In all of the cases we intend to look for, the only action for them is to call support.
  • When the system is similarly broken before an update, we want the customer to be warned that they should call support and resolve that before starting the update.

To that end, I would rename this field call_support: bool.

@david-crespo wrote:

> It worries me slightly to tell the user the system is unhealthy at times when that's expected.

and @karencfv wrote:

> An idea was floated around that the console could hide the status while there was an ongoing update

Yeah, we definitely don't want false alarms during an upgrade. We did discuss that and wrote it into RFD 612:

> As health checks often fail during an update, we only want them visible via the external API when the system is idle or when an ongoing update has stalled for a set period of time.

As I read that, the API should not report call_support: true unless the health checks fail and either (1) there's no update in progress or (2) there's been no new blueprint planned for N minutes. This is the same guidance we give to support: the Reconfigurator Ops Guide suggests waiting 10-15 minutes before deciding the update is stuck.
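
One way to read that rule, as a minimal Rust sketch. The struct, its field names, and the `stall_threshold` parameter are hypothetical and not taken from the PR; how "update in progress" is determined is also an assumption left to the caller.

```rust
use std::time::Duration;

/// Hypothetical inputs to the decision; none of these names come from the PR.
pub struct UpdateSnapshot {
    /// True if any health check failed in the latest inventory collection.
    pub checks_failed: bool,
    /// True if an update is currently in progress (assumed to be known).
    pub update_in_progress: bool,
    /// Time elapsed since `time_last_step_planned`.
    pub since_last_step_planned: Duration,
}

/// Returns true only when failed checks should be surfaced to the user:
/// either the system is idle, or an in-progress update looks stalled
/// because nothing new has been planned for at least `stall_threshold`.
pub fn contact_support(s: &UpdateSnapshot, stall_threshold: Duration) -> bool {
    if !s.checks_failed {
        return false;
    }
    let stalled = s.since_last_step_planned >= stall_threshold;
    !s.update_in_progress || stalled
}
```

With a 15-minute threshold, failed checks one minute after the last planning step are suppressed while an update is running, and reported once the update has been quiet for longer than the threshold.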

@davepacheco
Collaborator

Forgot to add: @david-crespo hopefully that's clarifying and if it makes sense, great. If not, we could discuss on tomorrow's update sync (or another time, if that's a bad time)?

@david-crespo
Contributor

david-crespo commented Apr 20, 2026

Yes, it does help. I like call_support (maybe contact_support). And having the API do the hiding logic during an update also makes sense.

Comment thread nexus/src/app/update.rs Outdated
// There should always be an inventory collection before or after an
// update
None => false,
Some(collection) => collection.is_system_healthy(),
Collaborator

I think we also want checks here:

  • that the inventory collection is not too old
  • that if there's an update in progress, then time_last_step_planned is within the last N minutes
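
A rough sketch of the first check, assuming the caller can compute the age of the latest collection. The `InventoryEvidence` type, `classify_inventory`, and `max_age` are hypothetical names for illustration. The sketch deliberately does not decide what the API should report for a missing or stale collection, since the thread does not settle that.

```rust
use std::time::Duration;

/// What the latest inventory collection can tell us (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum InventoryEvidence {
    /// No collection exists yet.
    Missing,
    /// A collection exists but is older than the allowed maximum age.
    Stale,
    /// A recent collection exists; carries whether its checks passed.
    Fresh { checks_passed: bool },
}

/// `latest` is `(age of the collection, whether its checks passed)`,
/// or `None` if there is no collection at all.
pub fn classify_inventory(
    latest: Option<(Duration, bool)>,
    max_age: Duration,
) -> InventoryEvidence {
    match latest {
        None => InventoryEvidence::Missing,
        Some((age, _)) if age > max_age => InventoryEvidence::Stale,
        Some((_, checks_passed)) => InventoryEvidence::Fresh { checks_passed },
    }
}
```

Keeping `Missing` and `Stale` distinct from a failed check leaves the policy choice (report `true`, `false`, or something else) visible at the call site instead of hiding it in a default.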

Comment thread nexus/types/src/inventory.rs Outdated
/// - All zpools are online.
/// - All enabled SMF services are in an online state.
pub fn is_system_healthy(&self) -> bool {
self.are_all_zpools_healthy()
Collaborator

What about:

  • missing zpools (would probably have to compare against blueprint?)
  • missing sled agents, SPs, or other components (presumably would compare against blueprint)

I'm torn about whether any of this belongs in impl Inventory vs. out in the health check function. I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.
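
To illustrate the alternative, here is a sketch of the two documented criteria as a free function with a deliberately narrow name, rather than a general-purpose method on the collection. All types and names below are simplified, hypothetical stand-ins and not the real inventory types.

```rust
/// Hypothetical, simplified stand-ins for inventory data.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum ZpoolHealth {
    Online,
    Degraded,
}

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum SmfState {
    Online,
    Offline,
    Degraded,
    Maintenance,
}

pub struct InventorySummary {
    pub zpools: Vec<ZpoolHealth>,
    /// States of enabled SMF services only.
    pub enabled_services: Vec<SmfState>,
}

/// Scoped to the update-status use case: passes only if every reported
/// zpool is online and every enabled SMF service is online.
pub fn update_status_checks_pass(inv: &InventorySummary) -> bool {
    let zpools_ok = inv.zpools.iter().all(|h| *h == ZpoolHealth::Online);
    let services_ok =
        inv.enabled_services.iter().all(|s| *s == SmfState::Online);
    zpools_ok && services_ok
}
```

Note that `all` returns true for an empty list, so a collection that reports no zpools passes this check. That is the gap the "missing zpools" question points at, and closing it would need a comparison against some expected set.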

Contributor Author

My intention here was to get an initial endpoint going with just a couple of checks (the ones that are indicated as priority in the health check table in #9876) to keep the PR small. Additional checks would trickle in through follow-up PRs.

> I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.

Yeah, that's a good point. I should at least move is_system_healthy closer to the API and rename it contact_support or something 😄

Comment thread nexus/types/versions/src/add_healthy_system_to_update_status/update.rs Outdated
Contributor Author

@karencfv karencfv left a comment

Thanks for the input, both! I think contact_support is a much better name as well

@karencfv karencfv changed the title from "[external-api] Add health field to update status" to "[external-api] Add contact support field to update status" Apr 22, 2026