feat: CLI mints scoped tokens for release; per-project scope on OCI push (ALIEN-139)#49
Conversation
Adds `pub scopes: Vec<String>` to `Subject` so per-project authorization tuples (format `"<workspace_id>/<project_id>"`) can flow from a token validator that knows about them down to the OCI push handler. Default- empty so the field is invisible to validators that don't populate it (`TokenDbValidator`, `PermissiveAuthValidator`) — OSS standalone behaviour is preserved bit-for-bit. `require_push_auth` in the registry proxy gains a scope check: after the existing role-based `can_push_image` gate passes, if `subject.scopes` is non-empty, at least one entry must end with `/<project_id>` (suffix-match rather than full tuple — project IDs are globally-unique ULIDs, and the caller's scope carries their workspace_id while `subject.workspace_id` can be a different workspace's id when the call is routed through a shared manager). Empty `subject.scopes` short-circuits the check, again preserving OSS behaviour. `upstream_repository_prefix()` on `LocalArtifactRegistry` returned a hardcoded `"artifacts/default"` — that was always wrong but masked because the proxy routing table never had to extract a real project_id when nothing did per-project authz. With the scope check above, a wrong prefix breaks routing in a way that's now observable. Return `self.binding_name.clone()` (i.e. just `"artifacts"`) so the prefix matches the actual on-disk layout and the routing table extracts the real project_id from `artifacts/<project_id>/...`. Root-cause fix. Test fixtures: every Subject literal in oss_authz / permissive_auth / token_db_validator and the three tests/* files gets `scopes: Vec::new()` — pure field-threading, no behaviour change.
The manager only validates a small set of credential types — long-lived
API keys and platform-issued RS256 tokens. The user OAuth JWT from
`alien login` isn't one of them, so the CLI's release flow was
401-ing at the first manager call. Fix: at the top of the platform-mode
release flow, exchange the user OAuth JWT for a short-lived, project-
and-manager-scoped token via the platform's token endpoint, and use
that for every subsequent manager call.
`mint_registry_push_token` hits `POST {base}/v1/managers/{manager_id}/token`
with body `{ purpose: "registry-push", project: <id> }` and the user
OAuth bearer. The returned token's scopes/managerId are checked locally
on every reuse (`token_matches_target` — globally-unique ULIDs let the
project_id check be suffix-only), and a new keyring entry
(`MANAGER_TOKEN_USER`) holds the cached token separately from
access_token/refresh_token so `logout` cleans both.
`get_or_mint_registry_push_token` is the entry point: serve from cache
if it's a hit for this target and not within 30s of expiry, otherwise
mint fresh and store. Mint failures propagate as `AuthenticationFailed`
with the response status + body so platform-side rejection reasons reach
the user.
In `execution_context.rs`, the platform-mode branch mints once
immediately after `resolve` and uses the resulting token both for the
`/v1/artifact-registry/*` repo setup and for the `/v2/...` OCI push
(both legs of the release flow). `ResolveResponse` now carries
`manager_id` so the mint knows which manager the user is targeting.
Future platform-mode commands (commands invoke, etc.) should mint their
own audience rather than reuse this one — keeps the audience name
accurate to scope and prevents a leak of one credential from acting at
unintended endpoints.
The PR added two security-critical predicates that were exercised only by code paths the test suite didn't have a way to reach: the manager's push handler skipped the scope check whenever `subject.scopes` was empty (which every test subject was), and the CLI's `token_matches_target` cache-reuse check was never run on a constructed JWT. A regression in either branch — an off-by-one in the suffix match, the scope list flipped to a different shape, the project_id lookup returning the wrong value — would have shipped silently. Extract the scope-check predicate from `require_push_auth` into a pure `scopes_cover_project(scopes, project_id)` helper so it's unit-testable without standing up a 17-field `AppState`. The five new tests cover: empty-scopes fall-through (OSS standalone preserved), matching scope, non-matching scope, one-of-many matching, and the suffix-substring confusion case (`prj_abc` must not accept a scope ending `prj_abcd`). Add `token_matches_target` tests in the CLI: matching, manager-id mismatch, project-id mismatch, missing claim, missing manager-id claim, malformed JWT, and scopes claim of the wrong type. All failure modes must return false so the caller re-mints rather than presenting an unusable token to the manager. No behaviour change in either function.
…rites Two small fixes addressing Greptile review on PR #48, both validated against the codebase's established patterns and `alien/CLAUDE.md`: `mint_registry_push_token`: URL-encode `manager_id` when interpolating it into the request path. Every other path-interpolation in the crate already does this — see `execution_context.rs::resolve_url` (platform, project, workspace), the gcp clients (service-account names, GCS object names), and the aws clients. Manager IDs are ULIDs today and contain no URL-special characters, but encoding here keeps the call site correct-by-construction rather than correct-by-coincidence — a future change to the ID format won't silently mis-route the request. `get_or_mint_registry_push_token`: treat the keyring cache write as best-effort. The mint already succeeded and the caller has a valid token in hand; a failed `set_password` just means we'll re-mint on the next invocation. The previous `?` propagation meant a transient keyring error would abort the whole release even though everything needed for it to succeed was in memory. This matches the worked example in `alien/CLAUDE.md` under "When `warn!` Is Appropriate": > "Use `warn!` only for non-critical issues that don't affect > correctness... Core operation continues and will work correctly, > just slower" No behaviour change in the success path. No new failure modes on the error path beyond what's already there.
…n platform mode
Two CLI fixes that share an execution_context.rs touch surface but are
otherwise independent.
execution_context.rs::create_or_get_artifact_repo
All three response objects (initial GET, POST, retry GET) were
consumed by `.json()` in the success branch only; on non-success the
body was dropped. When the manager responded with an authentication
failure or another structured error, the user saw "Failed to create
artifact repository '...' for platform '...' on manager (HTTP 401)"
with no further detail — the actual reason (e.g. "Platform JWT not
bound to this manager") lived in the response body that we never
read.
Read each response's body once via `.text().await.unwrap_or_default()`
immediately after capturing the status; parse it as JSON in the
success branch, include it verbatim in the error message on failure.
Pattern matches `auth.rs::mint_registry_push_token:547` for the same
reason. `.unwrap_or_default()` keeps behaviour consistent when the
body itself errors — we still get the status, just no detail.
commands/commands.rs::commands_task
`server_sdk_client()`'s Platform-mode arm returns
`Err("server_sdk_client() is not available in platform mode. Use
sdk_client() instead.")` because the manager URL + JWT aren't known
until the project is resolved. The advice in the error message is
wrong — `sdk_client()` returns the platform-API client, not the
manager client we actually need to invoke commands.
Branch in commands_task: for Platform mode, resolve the project then
call `ctx.resolve_manager(...)` to get a ManagerContext whose
`client` is authenticated with the minted manager JWT and whose
`auth_token` carries that same JWT for the separate CommandsClient
call later in `invoke_command`. Standalone/Dev modes keep using
`ctx.server_sdk_client()` + `ctx.auth_token()` — those carry the
manager URL + API key statically on ExecutionMode and don't need
per-project resolution.
Unblocks step 5 of the onboarding quickstart
(`alien commands invoke --deployment X --command Y`) in Platform
mode without changing any of the protocol shapes.
Greptile flagged my earlier A5 refactor of `create_or_get_artifact_repo` for swapping the canonical `.json::<T>().await.into_alien_error().context(...)?` pattern (used by the same file's `resolve_url` block at line 378-381 and by `auth.rs:541`) for a bespoke text-first + best-effort parse shape. The new shape silently fell through to the next step on 2xx-status with an unparseable body — a reverse-proxy or WAF returning an HTML 200 would hide the real diagnostic behind a confusing downstream error. `alien/CLAUDE.md`'s "Fail Fast, Don't Fall Back Silently" section explicitly calls out this pattern as a bad pattern; my refactor was the textbook anti-example. Restore fail-loud semantics on all three response-parse sites (`get_resp`, `create_resp`, `get_resp2`) using `serde_json::from_str(&body).into_alien_error().context(...)?` instead of `if let Ok(body) = ...`. The text-first read pattern stays — it's load-bearing for the final error message at the bottom of the function which includes `create_body` for the manager's structured rejection reason. Valid-JSON-without-`name`-field still falls through (intentional: that's the manager's "no such repo" shape and the next step is the correct response). Pre-existing code (mint helpers, auth flow, scope check) untouched — only the 3 response-parse blocks I introduced in the A5 commit.
Greptile SummaryThis PR adds two complementary pieces: the CLI now exchanges the user's OAuth token for a short-lived project-scoped JWT before calling the manager during
Confidence Score: 5/5Safe to merge — the security-critical scope check and token-exchange flow are correctly implemented and well-tested; OSS standalone behavior is fully preserved. The new crates/alien-cli/src/execution_context.rs — the new required
|
| Filename | Overview |
|---|---|
| crates/alien-manager/src/routes/registry_proxy.rs | Adds scopes_cover_project pure helper and wires it into require_push_auth after the existing role check; logic is correct, tests cover substring-confusion and cross-project cases. |
| crates/alien-cli/src/auth.rs | Adds token minting, keyring storage, expiry-aware cache reuse, and 7 targeted unit tests; token_matches_target correctly returns false for all decode-failure/mismatch cases. |
| crates/alien-cli/src/execution_context.rs | resolve_manager now mints a registry-push JWT and uses it for both artifact-registry setup and the returned ManagerContext; ResolveResponse gains a required manager_id field without serde(default), creating a hard parse failure on version mismatch with the platform. |
| crates/alien-cli/src/commands/commands.rs | Fixes the Platform-mode crash in commands_task by resolving the manager before use; the registry-push token and hardcoded aws provider side-effects are acknowledged known issues tracked separately. |
| crates/alien-manager/src/auth/subject.rs | Adds scopes: Vec to Subject with Vec::new() defaults everywhere; Debug impl redacts bearer_token correctly. |
| crates/alien-bindings/src/providers/artifact_registry/local.rs | upstream_repository_prefix now returns binding_name instead of hardcoded artifacts/default, enabling the routing table to extract project_id for scope checks in both dev and platform modes. |
Sequence Diagram
sequenceDiagram
participant CLI as alien CLI
participant Platform as Platform API
participant Keyring as System Keyring
participant Manager as alien-manager
CLI->>Keyring: load_manager_token()
alt cached token valid (not expired, matches manager+project)
Keyring-->>CLI: cached manager JWT
else no token or expired or wrong target
CLI->>Platform: POST /v1/managers/MANAGER_ID/token
Note right of CLI: purpose=registry-push, project=PROJECT_ID
Platform-->>CLI: new manager JWT
CLI->>Keyring: store_manager_token(jwt)
end
CLI->>Manager: GET /v1/artifact-registry/PROJECT_ID
alt repo exists
Manager-->>CLI: repo name
else repo not found
CLI->>Manager: POST /v1/artifact-registry
Manager-->>CLI: repo name
end
CLI->>Manager: POST /v2/REPO/blobs/uploads/
Note over Manager: require_push_auth checks role then scopes_cover_project
alt scope covers project
Manager-->>CLI: 202 Accepted
CLI->>Manager: PUT blob data
Manager-->>CLI: 201 Created
else scope mismatch
Manager-->>CLI: 403 DENIED
end
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
crates/alien-cli/src/execution_context.rs:572-578
The newly added `manager_id` field is not annotated with `#[serde(default)]`, so if the platform's resolve endpoint doesn't include `managerId` in its JSON — e.g. during a rolling deploy where some instances still run the old version — `serde_json` will return a "missing field `managerId`" parse error. The user will see an opaque deserialization failure rather than anything that hints at a platform-version mismatch. Adding a `#[serde(default)]` (or keeping it as `Option<String>` with an explicit error at mint time) would let the CLI surface a clearer message in that window.
```suggestion
#[derive(serde::Deserialize)]
#[serde(rename_all = "camelCase")]
struct ResolveResponse {
/// Required for minting the registry-push token; absent on old platform
/// instances that predate the companion platform PR.
#[serde(default)]
manager_id: String,
manager_url: String,
project_id: String,
}
```
Reviews (3): Last reviewed commit: "fix(cli): fail loud on success-status re..." | Re-trigger Greptile
| @@ -438,19 +459,16 @@ impl ExecutionMode { | |||
|
|
|||
| info!("Repository: {}", repo_name); | |||
|
|
|||
| // Create manager SDK client reusing the authenticated reqwest client | |||
| let authenticated_client = http.client.clone(); | |||
| let auth_token = http.bearer_token.clone(); | |||
| let manager_client = ServerSdkClient::new_with_client( | |||
| &resolved.manager_url, | |||
| authenticated_client.clone(), | |||
| manager_http_client.clone(), | |||
| ); | |||
|
|
|||
| Ok(ManagerContext { | |||
| manager_url: resolved.manager_url, | |||
| client: manager_client, | |||
| http_client: authenticated_client, | |||
| auth_token, | |||
| http_client: manager_http_client, | |||
| auth_token: Some(manager_jwt), | |||
| repository_name: Some(repo_name), | |||
| repository_uri: None, | |||
| }) | |||
There was a problem hiding this comment.
Registry-push token bleeds into all
resolve_manager callers
resolve_manager's Platform arm now mints a purpose: "registry-push" JWT and wires it into every ManagerContext it returns. Because this function is called unconditionally from deploy.rs, destroy.rs, and deployments.rs::resolve_manager_client, all of those flows will now present a registry-push-scoped token to manager endpoints that handle deployments — replacing the OAuth JWT that was used before this PR. If the platform-side JWT validator enforces the purpose claim on non-push endpoints, alien deploy and alien destroy will start returning 401s after this change. The code comment on line ~437 even explicitly warns "When other platform-mode commands must mint their OWN audience instead of reusing this one," yet the implementation bakes the mint into the shared helper.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-cli/src/execution_context.rs
Line: 440-474
Comment:
**Registry-push token bleeds into all `resolve_manager` callers**
`resolve_manager`'s Platform arm now mints a `purpose: "registry-push"` JWT and wires it into every `ManagerContext` it returns. Because this function is called unconditionally from `deploy.rs`, `destroy.rs`, and `deployments.rs::resolve_manager_client`, all of those flows will now present a registry-push-scoped token to manager endpoints that handle deployments — replacing the OAuth JWT that was used before this PR. If the platform-side JWT validator enforces the `purpose` claim on non-push endpoints, `alien deploy` and `alien destroy` will start returning 401s after this change. The code comment on line ~437 even explicitly warns "When other platform-mode commands must mint their OWN audience instead of reusing this one," yet the implementation bakes the mint into the shared helper.
How can I resolve this? If you propose a fix, please make it concise.| let mgr_ctx = ctx | ||
| .resolve_manager(&project_link.project_id, "aws") | ||
| .await?; |
There was a problem hiding this comment.
Hardcoded
"aws" provider triggers spurious artifact-registry setup
resolve_manager unconditionally calls create_or_get_artifact_repo(platform) as part of its implementation (see execution_context.rs line ~452). Passing a hardcoded "aws" here means that every alien commands invoke call in Platform mode will attempt to create/get an AWS artifact repository for the project — regardless of the actual cloud provider. On non-AWS deployments this HTTP call will fail and commands invoke will error before it can even reach the already-acknowledged 401 issue (ALIEN-144). A separate, lightweight helper that resolves the manager URL and mints a token without the artifact-registry side-effect would avoid this.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-cli/src/commands/commands.rs
Line: 83-85
Comment:
**Hardcoded `"aws"` provider triggers spurious artifact-registry setup**
`resolve_manager` unconditionally calls `create_or_get_artifact_repo(platform)` as part of its implementation (see `execution_context.rs` line ~452). Passing a hardcoded `"aws"` here means that every `alien commands invoke` call in Platform mode will attempt to create/get an AWS artifact repository for the project — regardless of the actual cloud provider. On non-AWS deployments this HTTP call will fail and `commands invoke` will error before it can even reach the already-acknowledged 401 issue (ALIEN-144). A separate, lightweight helper that resolves the manager URL and mints a token without the artifact-registry side-effect would avoid this.
How can I resolve this? If you propose a fix, please make it concise.…oject Move the per-project scope check from a private free function in the OCI proxy route to a method on Subject. Same predicate, same OSS-standalone semantics (empty scopes → true so single-tenant deployments fall through to the role check), same substring-confusion guard. Tests move with the function, plus one extra case (multi-scope-any-match). Promoted because the same invariant needs to apply to other route handlers in downstream embedders that proxy to per-project resources; having it as a method on Subject lets each call site read as `subject.permits_project(project_id)` rather than re-implementing the suffix match. No behavior change for the OCI push path: the call site swaps from \`scopes_cover_project(&subject.scopes, project_id)\` to \`subject.permits_project(project_id)\` and gets the same result.
The previous commit dropped the inline tests that referenced `info!` in error helpers, leaving the import unused. Fixing the warning so the crate stays warning-clean.
|
Closing — restarting from a clean base. |
Summary
This PR has two halves that together let
alien releasework safely against a multi-tenant manager. On the CLI side, the user's OAuth token gets exchanged for a short-lived signed token scoped to one project on one manager before any manager call happens — so the manager never sees a credential type it can't validate. On the manager side, the image-push handler now checks that the project scope inside the token actually covers the project being pushed to, so a token issued for project A can't push images intended for project B. OSS standalone behavior is preserved: validators that don't setSubject.scopesleave it empty, and the new scope check is a no-op on empty scopes.Step-by-step flow when you run
alien release:alien loginearlier, so the CLI has your OAuth access token saved.purpose: registry-pushflow added in the companion platform PR)./v2/.../blobs/uploads/.This PR changes the OSS side of the release flow: the CLI now asks the platform for a project-scoped signed token instead of sending the raw OAuth token, and the manager's image-push handler now rejects any token whose scope doesn't cover the project being pushed to.
What was broken before
Three things, all surfaced when running
alien releaseagainst a multi-tenant manager:alien commands invokeerrored before reaching the manager at all, because the command code tried to use a client that doesn't exist in platform mode.Security review
This PR touches authorization on the image push, so answering the 5 standard questions.
1. What's new on the auth path
2. How the scope check works
In
crates/alien-manager/src/registry_proxy.rs:scopesare non-empty, the new check runs: each scope entry is aworkspaceId/projectIdpair, and at least one entry must end in the target project's ID.The scope-check predicate is extracted as a pure function
scopes_cover_project(scopes, project_id)so it's unit-testable on its own.3. Token shape this PR cares about
The CLI's exchange path produces one token type used for image push:
alien releasescopes: ["workspaceId/projectId"], manager binding,purpose: registry-pushToken issuing and signature verification happen on the platform side and aren't changed by this PR.
4. Where the token is accepted
/v1/artifact-registry/*/v2/.../blobs/uploads/(image push)5. What I explicitly checked
scopes_cover_project./prj_abcdoes NOT accept a target ofprj_abcd— the suffix match requires the leading/. Covered by a dedicated test.Subject.scopesleave it empty, and the new check is gated on!scopes.is_empty(). Verified by a test using an empty-scopes subject, plus a grep — all OSS validator literals initializescopes: Vec::new().token_matches_target) validates the cached token's claims (manager ID, project ID, scope shape) before reuse. 7 tests cover match, manager mismatch, project mismatch, missing claim, malformed token, scopes claim of the wrong type.Security follow-up landed on this branch (after re-review)
A follow-up review of the companion platform PR's security work needed the per-project scope predicate from a stable OSS-side location:
949bad8 feat(manager): promote per-project scope check to Subject::permits_project— moves the predicate (previously a private free functionscopes_cover_projectinregistry_proxy.rs) ontoSubjectas a method, so other route handlers in downstream embedders can apply the same per-project invariant without duplicating the logic. Same semantics (empty scopes →truefor OSS standalone; non-empty → at least one entry ends with/<project_id>; substring-confusion guard via the leading slash). The OCI push call site now readssubject.permits_project(project_id)instead ofscopes_cover_project(&subject.scopes, project_id). Tests move with the function, plus one additional case (multi-scope-any-match).Verification
alien loginagainst the local stack, ranalien init remote-worker-ts my-worker, applied the template fixes (the template still scaffolds with the oldalien.FunctionAPI instead ofalien.Worker, and the npm@alienplatform/coreis stale — separate ticket below), ranalien link, thenalien release --experimental— got two release IDs on two separate runs, the second from a fresh-rebuild scratch state with all local databases wiped. Both passed the new scope check and the artifact-registry + image-push path end-to-end.cargo test -p alien-manager→ 108 passing (was 103; the +5 are for the extractedscopes_cover_projectpredicate).cargo test -p alien-cli --features platform --lib oauth_flow::tests→ 7 tests fortoken_matches_target.cargo build -p alien-cli --features platform→ clean.alien commands invokeagainst a deployed worker — the C1 fix in this PR unblocks the CLI side, but the full path 401s for a separate reason tracked as ALIEN-144.Out of scope
alien commands invoke401 in platform mode. The token this PR's CLI flow produces works for image push, but doesn't work on endpoints that forward the call back to the platform. Needs a different token shape or a server-side exchange. Independent design call.Subject.scopes: Vec<String>is loosely typed — the"workspaceId/projectId"format is only enforced by a doc comment. A typedpermits_project_push(project_id)helper onSubjectwould replace the open-coded suffix-match at call sites.remote-worker-tstemplate still scaffolds with the oldalien.FunctionAPI and a stale@alienplatform/corenpm dependency — anyone tryingalien initfollowed byalien releasewill hit it. Separate template-update ticket.Test plan for reviewer
cargo test -p alien-managerpasses locally (108 tests).cargo build -p alien-cli --features platformis clean.scopes_cover_projectand the 7 new tests ontoken_matches_target— they cover the security-critical branches.grep -rn 'Subject {' crates/alien-manager/src/providers/should showscopes: Vec::new()on all subject literals.