[external-api] Add contact support field to update status#10271
[external-api] Add contact support field to update status#10271karencfv wants to merge 16 commits intooxidecomputer:mainfrom
Conversation
|
It worries me slightly to tell the user the system is unhealthy at times when that's expected. |
I totally get it. My first instinct was to call this "is_system_updateable" or something like that. We discussed somewhere, but I think it was during a meeting or something. I was looking for the discussion but couldn't find it. I don't remember the specifics, but I think the reasoning behind this naming was to make sure users don't ignore this issue if they encounter an "unhealthy" system and they do call support. Maybe @davepacheco can expand An idea was floated around that the console could hide the status while there was an ongoing update, @david-crespo what is your take on this? |
|
That’s interesting, so it would be like health/unhealthy, unless less than 100% of components are on the target version, in which case we’re “updating” or something. I guess I wonder what “unhealthy” is supposed to tell the user. I’d much rather have it in the form of an active problem. |
|
The idea of this work is to take the place of the health check script the support team currently runs before and after each update until we have a proper FM implementation. We want it specifically tied to the update process https://rfd.shared.oxide.computer/rfd/0612. More detail here #9876. Perhaps we can chat further on the topic at the next update sync to make sure we're all on the same page? |
|
That's helpful, I'll read that issue. Off the top of my head I think it would feel better to me (and possibly be more useful to support) to have all the sub-checks as separate booleans rather than synthesizing them all into one big AND. And it doesn't really feel like that update-specific, even though it's used during update. So maybe it belongs in its own endpoint? |
There are a few things at play here. From the user's perspective none of the failed checks are actionable to them, so we don't want to give them any more information than they need. In this case the only information they need is "something isn't right after the update go call support". There is more detail on this here -> https://rfd.shared.oxide.computer/rfd/0612#_user_facing. The support team does need more information about what went wrong. For them, we are adding all of the health data to inventory, which is included in the support bundles. This endpoint isn't really for them. Initially we were going to have dedicated health checks running in the background and they were going to be part of a "health monitor" object in inventory. Ultimately, we backtracked on this as it was overlapping too much with what will eventually be FM, here is the reasoning behind that restructure #9876.
Maybe? The thing is, this will all go away when FM is implemented most likely. We don't want to give these checks too much importance. Or have customers rely on them too much. For now we just want them to be part of update status, so customers can have some sort of confidence that an update went well or not. Or if something is wrong and they should not begin an update process at all. |
|
@david-crespo thanks for taking a look. Definitely the intended long-term solution here is that this information feeds into an "active problems" API driven by the FM subsystem. We explicitly decided not to try to do this here. From RFD 612:
@david-crespo wrote:
These are the two goals:
To that end, I would rename this field @david-crespo wrote:
and @karencfv wrote:
Yeah, we definitely don't want false alarms during an upgrade. We did discuss that and wrote it into RFD 612:
As I read that, the API should not report |
|
Forgot to add: @david-crespo hopefully that's clarifying and if it makes sense, great. If not, we could discuss on tomorrow's update sync (or another time, if that's a bad time)? |
|
Yes, it does help. I like |
| // There should always be an inventory collection before or after an | ||
| // update | ||
| None => false, | ||
| Some(collection) => collection.is_system_healthy(), |
There was a problem hiding this comment.
I think we also want checks here:
- that the inventory collection is not too old
- that if there's an update in progress, then
time_last_step_plannedis within the last N minutes
| /// - All zpools are online. | ||
| /// - All enabled SMF services are in an online state. | ||
| pub fn is_system_healthy(&self) -> bool { | ||
| self.are_all_zpools_healthy() |
There was a problem hiding this comment.
What about:
- missing zpools (would probably have to compare against blueprint?)
- missing sled agents, SPs, or other components (presumably would compare against blueprint)
I'm torn about whether any of this belongs in impl Inventory vs. out in the health check function. I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.
There was a problem hiding this comment.
My intention here was to get an initial endpoint going with just a couple of checks (the ones that are indicated as priority in the health check table in #9876) to keep the PR small. In follow up PRs we would have additional checks trickling in.
I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.
Yeah, that's a good point. I should at least move is_system_healthy closer to the API and rename it contact_support or something 😄
karencfv
left a comment
There was a problem hiding this comment.
Thanks for the input, both! I think contact_support is a much better name as well
| /// - All zpools are online. | ||
| /// - All enabled SMF services are in an online state. | ||
| pub fn is_system_healthy(&self) -> bool { | ||
| self.are_all_zpools_healthy() |
There was a problem hiding this comment.
My intention here was to get an initial endpoint going with just a couple of checks (the ones that are indicated as priority in the health check table in #9876) to keep the PR small. In follow up PRs we would have additional checks trickling in.
I think is_system_healthy(&self) -> bool is too much of a footgun. We don't want people reaching for this randomly because they want to know if things currently seem okay. Its criteria are very specific to the use case we have in mind.
Yeah, that's a good point. I should at least move is_system_healthy closer to the API and rename it contact_support or something 😄
This PR is the last piece for a minimal system health check for update status. It is a new field in the
system/update/statusAPI calledcontact_supportwhich is either true or false based on the information in the latest inventory collection and a few additional health checks.Disclaimer: I used the claude code skill to make the endpoint edit, and also for part of the code (trying to learn how to use it here). I checked the code several times and tested manually, but just thought I'd mention it here.
Manual tests:
There are unhealthy services
Everything is happy!
Closes: #9418