Add support for re-adopting physical disks#10221
Conversation
67bdc40 to
ce4b577
Compare
This change implements the determinations in RFD 693. It allows re-adopting physical disks in the control plane after the control plane level disk in the `physical_disk` table is expunged. It does this by forcing manual adoption of disks by an operator, where requests are placed in the `physical_disk_adoption_request` table. A disk will now only be adopted or re-adopted by the disk adoption background task if its physical vendor/model/serial information is present in a `physical_disk_adoption_request` row. The typical flow for an operator is to list unitialized disks and then issue an adoption request via the external API.
ce4b577 to
edc851b
Compare
Nit: 663 |
Whoops. I even have it open in a tab. Thanks! |
ahl
left a comment
There was a problem hiding this comment.
To the best of my understanding, we have a couple of metaphors in here. Disks that are unknown to the control plane are "uninitialized" and appear in that list. The verb we're using is "adopt", as in "this uninitialized disk is adopted by the control plane". Is the term "adopt" intended to be the opposite of "expunge"?
I'm not clear on the how the operator would use this. Presumably there's supposed to be some step (e.g. of exploration or cognition) between "list uninit disks" and "approve disk(s)", but I'm not sure what it is.
We don’t want to allow automatic disk adoption due to the risk of the insertion of malicious hardware during casual physical access. This is especially problematic before we have disk attestation support, and in the case of existing sleds with empty disk bays.
Presumably I want to make sure the hardware I just put into the U.2 bay is the same as what I'm about to adopt. How do I do that?
The API changes look fine; I'd ask you think about nomenclature ("uninitialized" "adopt").
| query: Query<PaginationParams<EmptyScanParams, String>>, | ||
| ) -> Result< | ||
| HttpResponseOk< | ||
| ResultsPage<latest::physical_disk::UninitializedPhysicalDisk>, |
There was a problem hiding this comment.
if I expunge a disk, I think that changes both the policy and state properties (I'm not sure why there are both of these--is one intent and the other is status?), does the disk immediately show up in this list?
Do disks show up in only one place or the other? Do some show up in both?
There was a problem hiding this comment.
Expunge happens immediately, and is a terminating enum variant. It's the intention of the desired state. Once it occurs we can assume it will never change and most things just look at expunge. Decommissioned state occurs after other steps, when the intended state is realized. So there is some delay there.
Is the idea that one would do this by comparing the manufacturer/model number/serial listed by the API endpoint with those physically printed on the actual disk, and also based on foreknowledge that disks have or have not been inserted in specific locations at the current point in time? |
| pub async fn physical_disk_adoptable_list( | ||
| &self, | ||
| opctx: &OpContext, | ||
| inventory_collection_id: CollectionUuid, | ||
| ) -> ListResultVec<InvPhysicalDisk> { |
There was a problem hiding this comment.
i thought about suggesting that this be paginated, but...do we expect the maximum number of rows to be limited by the physical fact that 32 sleds * 10 U.2s - 320 disks maximum?
There was a problem hiding this comment.
There's not really a good way to paginate this right now AFAIK.
| /// A physical disk that has not yet been adopted by the control plane | ||
| #[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize, JsonSchema)] | ||
| pub struct UninitializedPhysicalDisk { | ||
| pub sled_id: SledUuid, |
There was a problem hiding this comment.
do we expect the client to hydrate this sled UUID into a sled's physical location? it seems like it would be desirable for a UI listing physical disks that need to be adopted to be able to say which sled they are in as well as the slot within that sled...
There was a problem hiding this comment.
Good question. I suppose we could also provide the sled cubby. That would help operators out a bit probably.
There was a problem hiding this comment.
It's not guaranteed we have this, right? Uninitalized physical disks come from sled-agent inventory, and sled-agent doesn't know its own cubby number. We could try to match up the sled serial against the SP inventory contents to identify a cubby, and that will usually work, but we'd still need to be able to represent "physical disk for sled X for cubby I Dunno Ask Again Later".
Yes, basically, if they actually cared. I think the larger security issue mitigated is that any disk inserted can only be activated by an operator and not just adopted automatically for usage. Checking the serial is how they would know which disk they were validating against. |
| ), | ||
| ), | ||
| ) | ||
| // Ensure that each inventory disk has a valid adoption request |
There was a problem hiding this comment.
Is there a precondition that "you cannot have an adoption request for an already in-service disk"?
(Having already-in-service disks show up here seems wrong, just confirming what prevents that. I think the answer is "yes, a non-deleted adoption request means you have no live disks here")
There was a problem hiding this comment.
Yes, that's a precondition.
It's not necessarily the opposite of
As Eliza noted, an operator would have to look at the vendor/model/serial on the drive and compare it to what shows up in the API.
I'm open to changing this. |
| /// A physical disk that has not yet been adopted by the control plane | ||
| #[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize, JsonSchema)] | ||
| pub struct UninitializedPhysicalDisk { | ||
| pub sled_id: SledUuid, |
There was a problem hiding this comment.
It's not guaranteed we have this, right? Uninitalized physical disks come from sled-agent inventory, and sled-agent doesn't know its own cubby number. We could try to match up the sled serial against the SP inventory contents to identify a cubby, and that will usually work, but we'd still need to be able to represent "physical disk for sled X for cubby I Dunno Ask Again Later".
@smklein and I have been talking through disk replacement scenarios a bit from the fault management context, and I think the adoption requests will eventually be part of the service flow for disk replacements. I think in particular, we would really like it if the adoption requests could easily include the physical location of the disk as part of a "you just replaced the disk in sled 19 slot 3, okay yeah it's that one" kinda spot check. |
I just talked about this with John a bit and if you don't like the term |
|
I'm good with whatever of those you choose. I appreciate you discussing. |
As @jgallagher pointed out earlier today, we don't actually have this information in sled-agent inventory. What happens if the sp inventory doesn't exist for this collection? Then we can't list the sled as uninitialized or we always need to carry around an option here. Would it make sense to tell a customer: "Hey, actually I don't know what cubby this disk is in right now." or "This sled is not uninitialized" event though the customer just inserted it into the rack and expects to see it? |
|
One thing I just realized was that with the current code, we will no longer automatically adopt disks when sleds are added to a rack. I confirmed with @rmustacc that this was not what he intended. Unfortunately, I'm not immediately sure how to fit this behavior in with the current one. Disks are detected asynchronously after a sled is added and currently adopted by the background task automatically. The new behavior ensures that a disk adoption request is made by a user to allow adoption by the background task, but the adoption itself is still done asynchronously in the background task and is separate from the sled add request. What we would want is a state attached to the sled that says automatically adopt disks that were present when the sled was added to the rack. But we don't really have any mechanism to discover that information. I think the best we can do with the current code base is say something like "automatically adopt disks for this sled for 5 minutes" after it has been created. Given inventory delays this is also problematic as a disk the customer expected to be adopted that was in the sled when it was added may not actually get adopted. That's a terrible user experience. We could lessen the likelihood of non-adoption by increasing this window, but that now lengthens the time when casual physical attack is possible by inserting arbitrary disks. The only other thing I can think of doing is adding disk information to the |
NOT TO BE ANNOYING BUT: I really don't like "import" in this context, it feels like it is too easily misconstrued as "import the data that was on this disk", which is not what one would expect to be offered but which is a somewhat conceivable thing that might occur. |
NOT ANNOYING AT ALL!!! I appreciate the feedback. I'm not a huge fan of import either. I still like I think to make the metaphors less mixed I'd switch to listing |
|
Thanks for the reviews @smklein @hawkw @jgallagher I believe I've addressed everything. Unfortunately, I discovered an issue that should probably be resolved before merge or explicitly decided not to implement: #10221 (comment) |
There are further problems that make the implementation next to impossible to do in an ideal manner.
We really seem to be restricted to a time based setup, or forcing manual disk adoption at all times. |
|
@andrewjstone and I chatted about this a bit offline. Recording some of our thoughts here:
|
Based on discussion in update huddle last week, we decided that to move forward we would auto-adopt disks that haven't been part of the control plane before. c9618fc makes this change. Importantly it makes this change by inserting new disks into the |
|
This is ready for a re-review @jgallagher @smklein. I still need to test it on hardware which I'll do after feedback. I'd like to test either Thursday afternoon or Friday. Thanks! |
This change implements the determinations in RFD 663. It allows
re-adopting physical disks in the control plane after the control plane
level disk in the
physical_disktable is expunged.It does this by forcing manual adoption of disks by an operator, where
requests are placed in the
physical_disk_adoption_requesttable.A disk will now only be adopted or re-adopted by the disk adoption
background task if its physical vendor/model/serial information is
present in a
physical_disk_adoption_requestrow.The typical flow for an operator is to list uninitialized disks and then
issue an adoption request via the external API.