Conversation
df22bbb to
aaa6509
Compare
| /// Currently, we only do this for libpq ("pq-sys" package), but this pattern | ||
| /// could be generalized for other native libraries. | ||
| pub static RPATH_ENV_VARS: &'static [&'static str] = &["DEP_PQ_LIBDIRS"]; | ||
| /// We scan all of these on every build.rs call. Only env vars that are |
There was a problem hiding this comment.
This file is worth a careful review.
We're adding a dependency on fmd-adm - which uses fmd-adm-sys and links against a shared library which we expect to exist on our helios deployments. However, it's not in a stable location, so we need to provide some path info about how to find it.
Within fmd-adm-sys, we expose LIBDIRS metadata that gets picked up here. This is similar to the mechanism used by pq-sys, to configure path information for anyone linking against it.
However, this pre-existing version of configuring rpaths really only expected pq-sys to be the only native library we'd link against. With the addition of fmd-adm-sys, it's more complicated. We might want one. We might want the other. We might want both.
BEFORE this PR, this behavior was:
"Iterate over RPATH_ENV_VARS. If anything is missing an environment variable, panic. Otherwise, configure rpath based on the value of that environment variable"
IN THIS PR, I went with the option:
"Iterate over RPATH_ENV_VARS. If anything is missing an environment variable, return. Otherwise, configure rpath based on the value of that environment variable"
If we want to, I could change this to:
"explicitly take a list of dependencies, panic if any of them are missing"
That would basically be like "bring-your-own RPATH_ENV_VARS".
I didn't do this, because it seemed a little redundant to "specify the direct dependency, depend on omicron_rpaths, then also specify the transitive dependencies you collected", when those transitive dependencies can be inferred. However, if this seems too "fast and loose", we can push this work back onto the caller. But I'll probably need to change all the callsites to configure_default_omicron_rpaths to make this option work.
dad43a7 to
84a6975
Compare
Exposes illumos Fault Management Daemon (FMD) data through the sled-agent inventory endpoint. This lets the control plane see diagnosed hardware/software faults on each sled. New API version 35 adds an `fmd: Option<FmdInventory>` field to `Inventory`. When present, it contains: - Cases: diagnosed faults with UUID, diagnostic code, URL, and the full event nvlist serialized as JSON - Resources: affected components with FMRI, fault status flags On illumos, sled-agent queries FMD on each inventory request. On non-illumos (sim, tests), the field is None. Database storage is not included — that's a follow-up.
84a6975 to
322d5b3
Compare
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
|
|
||
| [target.'cfg(target_os = "illumos")'.dependencies] | ||
| # See omicron-rpaths for more about the "fmd-adm-sys" dependency. | ||
| fmd-adm-sys.workspace = true |
There was a problem hiding this comment.
Depends on nexus-test-utils, which depends on sled-agent, which depends on fmd-adm
The inventory endpoint is async, but FMD queries go through door calls to fmd(1M) that can stall the calling thread. Move the work onto spawn_blocking so it doesn't occupy a Tokio worker; surface any JoinError as FmdInventory::Error.
3715de3 to
1fba06e
Compare
The FmdCase.url docstring now uses backticks instead of quotes around the example URL, which changes the schema description and thus the spec hash.
The verify-libraries xtask checks that binaries don't link against unexpected libraries. Add libfmd_adm.so.1 to the allowlist for the binaries that legitimately need it (sled-agent, sled-agent-sim, and omicron-dev which spins up sled-agent for tests).
1fba06e to
887f61a
Compare
| /// Unique identifier for this case. | ||
| pub uuid: Uuid, |
There was a problem hiding this comment.
I would really like this to be a typed UUID (not least because it seems worthwhile to ensure that they are not confuseable with Nexus FM Cases...)
| /// Unique identifier for this resource entry. | ||
| pub uuid: Uuid, |
There was a problem hiding this comment.
similarly, 'twould be nice if this would a typed UUID.
| /// UUID of the case that diagnosed this fault. | ||
| pub case_id: Uuid, |
There was a problem hiding this comment.
as would this, since it's named case_id, and again, we ought to ensure this is clearly disticnt from Nexus cases.
| #[serde(tag = "type", content = "value", rename_all = "snake_case")] | ||
| pub enum FmdInventory { | ||
| /// FMD data was successfully collected. | ||
| Available { cases: Vec<FmdCase>, resources: Vec<FmdResource> }, |
There was a problem hiding this comment.
out of curiosity: might we want cases andresources to be IdOrdMaps keyed by the UUIDs of those types, rather than Vecs?
| pub smf_services_enabled_not_online: | ||
| v34::inventory::SvcsEnabledNotOnlineResult, | ||
| pub reference_measurements: IdOrdMap<SingleMeasurementInventory>, | ||
| pub fmd: Option<FmdInventory>, |
There was a problem hiding this comment.
what does it mean for this to be None? how is that different from FmdInventory::Error, and how is it different from FmdInventory::Available where cases and resources are empty? it might be good to have a comment explaining this or, ideally, remove the Option and represent this with a third FmdInventory variant with a name explaining what it means?
| .map(|c| FmdCase { | ||
| uuid: c.uuid, | ||
| code: c.code, | ||
| url: c.url, | ||
| event: c.event.as_ref().map(nvlist_to_json), | ||
| }) |
There was a problem hiding this comment.
similarly, as I mentioned here, it would be nice if this was matched exhaustively so that we get a compiler error if the fmd_adm type changes:
| .map(|c| FmdCase { | |
| uuid: c.uuid, | |
| code: c.code, | |
| url: c.url, | |
| event: c.event.as_ref().map(nvlist_to_json), | |
| }) | |
| .map(|c| { | |
| let fmd_adm::Case { uuid, code, url, event } = c; | |
| FmdCase { | |
| uuid, | |
| code, | |
| url, | |
| event: event.as_ref().map(nvlist_to_json), | |
| } | |
| }) |
| Err(e) => { | ||
| return FmdInventory::Error { | ||
| error: format!("failed to open fmd: {e}"), | ||
| }; | ||
| } |
There was a problem hiding this comment.
should this error get logged someplace?
| return FmdInventory::Error { | ||
| error: format!("failed to list fmd cases: {e}"), | ||
| }; |
| return FmdInventory::Error { | ||
| error: format!("failed to list fmd resources: {e}"), | ||
| }; |
| Err(e) => Some(FmdInventory::Error { | ||
| error: format!("fmd collection task failed: {e}"), | ||
| }), |
There was a problem hiding this comment.
nit, take it or leave it: i think it would be okay if this was an unwrap --- the only reason spawn_blocking's JoinHandle will return an error is if the blocking task panicked (since blocking tasks cannot be aborted), and since we compile with panic="abort", that means we are also panicking. so this could be a panic. it doesn't really make this all that much simpler, but 🤷♀️
Exposes data from fmd-adm through the sled-agent inventory endpoint.
We're extracting:
I'm only exposing this through the API right now - Nexus isn't yet shoving it into the DB. Soon! But wanted feedback
on this data first.
To give you a sense of "what does case/resource data look like", here's what I pulled out of Atrium, using
fmd-adm: