Conversation
Pull request overview
Updates the ToolHive documentation to reflect user-facing features shipped in v0.12.3–v0.13.0, aligning guides and concept pages with current CRD/CLI capabilities (scaling, auth, composite workflows, and skills install).
Changes:
- Document tool annotation overrides for aggregated tools (vMCP).
- Rewrite scaling guidance to cover the new `replicas`/`backendReplicas` fields and session storage requirements.
- Add docs for `forEach` composite-tool steps and for installing skills via `thv skill install` (registry + git sources), plus cross-references between guides.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/toolhive/guides-vmcp/tool-aggregation.mdx | Adds annotation override documentation and example configuration. |
| docs/toolhive/guides-vmcp/scaling-and-performance.mdx | Updates horizontal scaling guidance and adds session storage + MCPServer scaling sections. |
| docs/toolhive/guides-vmcp/composite-tools.mdx | Documents forEach iteration steps and expands template context table. |
| docs/toolhive/guides-registry/skills.mdx | Adds CLI-based skill installation instructions and flags table. |
| docs/toolhive/guides-k8s/run-mcp-k8s.mdx | Adds cross-link to scaling guidance for MCPServer replicas. |
| docs/toolhive/guides-k8s/redis-session-storage.mdx | Adds scaling-related context and links session storage to horizontal scaling. |
| docs/toolhive/guides-k8s/auth-k8s.mdx | Updates upstream provider note to reflect multi-upstream support in VirtualMCPServer. |
| docs/toolhive/concepts/skills.mdx | Updates current status to reflect CLI install support and links to guide. |
| docs/toolhive/concepts/backend-auth.mdx | Replaces “single upstream” language with multi-upstream (vMCP) + clarifies MCPServer limitations. |
> This guide explains how to scale MCPServer and Virtual MCP Server (vMCP)
> deployments.
The guide now claims it covers both MCPServer and Virtual MCP Server, but the vertical scaling section/examples still only describe VirtualMCPServer (podTemplateSpec with vmcp container). Either add the MCPServer equivalent (its podTemplateSpec uses the mcp container) or adjust the intro/description to avoid implying full MCPServer coverage for vertical scaling.
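For reference, a minimal sketch of what the MCPServer-side equivalent could look like, assuming the container is named `mcp` as noted above (the API group/version and resource values here are illustrative, not confirmed against the CRD reference):

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1 # assumed API group/version
kind: MCPServer
metadata:
  name: my-mcp-server # hypothetical name
spec:
  podTemplateSpec:
    spec:
      containers:
        # MCPServer's podTemplateSpec targets the "mcp" container,
        # not "vmcp" as in the VirtualMCPServer example
        - name: mcp
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
```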
+1, adding MCPServer scaling to this guide seems out of place since this is specifically inside the vMCP section. If horizontal scaling for MCPServer is now supported, it seems like a scaling guide is called for in the main operator section, or a new section in the guides-k8s/run-mcp-k8s guide?
> | Field | Description | Default |
> | --------------- | --------------------------------------------------- | ------- |
> | `collection` | Template expression that produces an array | — |
> | `itemVar` | Variable name for the current item | — |
> | `maxParallel` | Maximum concurrent iterations (max 50) | 10 |
> | `maxIterations` | Maximum total iterations (max 1000) | 100 |
> | `step` | Inner step definition (tool call to execute per item) | — |
> | `onError` | Error handling: `abort` (stop) or `continue` (skip) | abort |
The `forEach` fields table is inconsistent with the generated CRD reference (docs/toolhive/reference/crd-spec.md): `collection` should be described as resolving to a JSON array or slice; `itemVar` defaults to `"item"` (not required); and `maxParallel`/`maxIterations` caps/defaults should match the reference (only `maxIterations` is documented as hard-capped at 1000). Updating this table will prevent users from configuring fields incorrectly.
Suggested change:

| Field | Description | Default |
| --------------- | ----------------------------------------------------------- | ------- |
| `collection` | Template expression that resolves to a JSON array or slice | — |
| `itemVar` | Variable name for the current item | `item` |
| `maxParallel` | Maximum concurrent iterations | 10 |
| `maxIterations` | Maximum total iterations (hard cap 1000) | 100 |
| `step` | Inner step definition (tool call to execute per item) | — |
| `onError` | Error handling: `abort` (stop) or `continue` (skip) | `abort` |
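To make the corrected fields concrete, here is a hedged sketch of a `forEach` step. The tool name, input parameter, and template syntax are hypothetical illustrations, not taken from the CRD reference; check the generated crd-spec.md for the exact schema.

```yaml
steps:
  - forEach:
      # Must resolve to a JSON array or slice (per the CRD reference)
      collection: "{{ .params.repositories }}" # hypothetical input parameter
      itemVar: repo # optional; defaults to "item"
      maxParallel: 10 # concurrent iterations (default 10)
      maxIterations: 100 # total iterations (default 100, hard cap 1000)
      onError: continue # skip failed items instead of aborting
      step:
        tool: github.get_repo # hypothetical aggregated tool name
        arguments:
          name: "{{ .repo }}"
```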
danbarr left a comment:
As noted on Copilot's review comment, the new content about scaling MCPServer resources is out of place in the vMCP scaling/performance guide.
It should live with the other MCPServer docs in the main operator section - either as a new guide if the amount of content warrants it, or in the existing "Run MCP servers in Kubernetes" guide.
@danbarr - I think I addressed the feedback so it should be ready for another round 👍

The updated locations of the scaling info make sense to me now. I'll clear my "changes requested" review and leave it to the platform team to do their technical review.
> there is no per-user attribution in audit logs.
>
> When using static credentials, consider integrating with a secret management
> system like [HashiCorp Vault](../integrations/vault.mdx) for secure storage and
should we mention 1password support around here?
Two things:
- In the context of this guide, which is mostly going to be used in a Kubernetes deployment, the Vault callout makes sense since we've validated that integration; but we don't have any similar integration guide for 1Password (I'm also not sure how popular it is in a K8s setting anyway?)
- The CLI support for 1Password that the new link goes to is really only halfway implemented. It's read-only, was never carried forward into the UI, and honestly I'm surprised it's still there. I'd wager no one uses it.
alright, let me revert this commit so we can unblock this one for now
Catch up documentation with features shipped in v0.12.3 through v0.13.0. Auto-generated CLI/CRD reference docs were already current; these changes cover manual doc updates verified against source code at each release tag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Radoslav Dimitrov <radoslav@stacklok.com>
This reverts commit ef1ea56.
yrobla left a comment:
Scalability review comment — session storage warning behavior
> :::warning
>
> If you configure multiple replicas without session storage, the operator sets a
> `SessionStorageMissingForReplicas` status condition on the resource. Ensure
Advisory-only: the operator does not block the scale-out
I checked the operator source (`virtualmcpserver_controller.go:355` and the equivalent in `mcpserver_controller.go:1962`). Both `validateSessionStorageForReplicas` functions are explicitly annotated as `// Advisory:` — they only set the `SessionStorageWarning` status condition and reconciliation continues normally. The deployment is created/updated with the requested replica count regardless.
The current wording is accurate but incomplete — readers may assume the condition is a rejection. It should be explicit that the operator still applies the replica count:
Suggested change:

```diff
-`SessionStorageMissingForReplicas` status condition on the resource. Ensure
+`SessionStorageMissingForReplicas` status condition on the resource but **still
+applies the replica count**. Pods will start, but requests routed to a replica
+that did not establish the session will fail. Ensure
```
One additional nit: the condition type shown in `kubectl describe` is `SessionStorageWarning`; `SessionStorageMissingForReplicas` is the reason. Users trying to watch or filter on this condition in scripts or alerts will need the type, not the reason — worth clarifying or using the type as the primary reference.
yrobla left a comment:
Remaining scalability review comments
```yaml
backendReplicas: 3
sessionStorage:
  provider: redis
  address: redis-master.toolhive-system.svc.cluster.local:6379
```
Hardcoded Redis address implies a specific namespace
The example uses redis-master.toolhive-system.svc.cluster.local as the address. Users deploying Redis in a different namespace (or using an external Redis) won't get any hint that this needs to change — they'll just have a silently misconfigured deployment. Worth adding a comment or callout that this is an example address and should match the actual Redis Service location.
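One low-cost fix, sketched here, is an inline comment spelling out the generic Service DNS pattern (the address shown is the example one and should match the actual Redis deployment):

```yaml
sessionStorage:
  provider: redis
  # Example only — replace with your Redis Service's DNS name:
  #   <service-name>.<namespace>.svc.cluster.local:<port>
  # or the host:port of an external/managed Redis instance.
  address: redis-master.toolhive-system.svc.cluster.local:6379
```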
> When running multiple replicas, configure
> [Redis session storage](./redis-session-storage.mdx) so that sessions are shared
> across pods. If you omit `replicas` or `backendReplicas`, the operator defers
Missing guidance on mixing `replicas` and `backendReplicas` independently

The doc states each can be scaled independently, but doesn't say what the meaningful combinations are. For example: is scaling only the proxy useful if there's one backend? What happens if `backendReplicas > replicas`? Operators will naturally ask this when deciding how to size their deployments. A brief note on the intended use case for each (proxy handles MCP protocol / auth; backend handles tool execution) would help readers make informed decisions.
After digging into the operator source, here's what each layer does and the meaningful combinations:

Architecture (from the code):

- Proxy runner (`spec.replicas`) → a Deployment; handles MCP protocol, authentication, and session management. Stateless w.r.t. tool execution.
- Backend (`spec.backendReplicas`) → a StatefulSet backed by a ClusterIP Service with `ClientIP` affinity (hardcoded 1800s timeout). A comment in `client.go` explicitly notes: "this provides proxy-runner-level stickiness (L4), not per-MCP-session stickiness (L7). True per-session routing would require `Mcp-Session-Id`-based routing at the proxy layer."
- The `SessionStorageWarning` condition only checks `spec.replicas`, not `spec.backendReplicas` — so scaling only the backend beyond 1 replica produces no warning, even though backend `ClientIP` affinity has the same NAT limitations.
Suggested addition after the bullet list at line 448:
- `spec.replicas` controls the proxy runner pod count
- `spec.backendReplicas` controls the backend MCP server pod count

The proxy runner handles authentication, MCP protocol framing, and session
management; it is stateless with respect to tool execution. The backend runs
the actual MCP server and executes tools.

Common configurations:

- **Scale only the proxy** (`replicas: N`, omit `backendReplicas`): useful when auth and connection overhead is the bottleneck with a single backend.
- **Scale only the backend** (omit `replicas`, `backendReplicas: M`): useful when tool execution is CPU/memory-bound and the proxy is not a bottleneck. The backend StatefulSet uses client-IP session affinity (30-minute timeout) to route repeated connections to the same pod — subject to the same NAT limitations as proxy-level affinity.
- **Scale both** (`replicas: N`, `backendReplicas: M`): full horizontal scale. Redis session storage is required when `replicas > 1`.

:::note
The `SessionStorageWarning` condition fires only when `spec.replicas > 1`. Scaling only the backend (`backendReplicas > 1`) does not trigger a warning, but backend client-IP affinity is still unreliable behind NAT or shared egress IPs.
:::
> When running multiple replicas, configure
> [Redis session storage](./redis-session-storage.mdx) so that sessions are shared
> across pods. If you omit `replicas` or `backendReplicas`, the operator defers
> replica management to an HPA or other external controller.
No mention of connection draining during scale-down
The doc covers scale-out but is silent on scale-in. For MCP servers with long-lived SSE or streaming connections, abrupt pod termination can drop active sessions. Is there a `terminationGracePeriodSeconds` recommendation, or does the operator inject any `preStop` hook? If not, it's worth noting so users can configure it themselves.
Dug into the operator source. Here's the full picture:

What the operator does:

- Both the proxy Deployment and vMCP Deployment hardcode `terminationGracePeriodSeconds: 30` (mcpserver_controller.go:132, virtualmcpserver_deployment.go:61).
- The proxy runner catches SIGTERM via `signal.NotifyContext` (main.go:39) and calls `server.Shutdown(ctx)` with a matching 30s shutdown timeout (transparent_proxy.go:142). If drain completes within 30s, connections close cleanly; if not, they are force-closed.
- No `preStop` hook is injected by either controller.
- The 30s grace period and 30s drain timeout are identical, leaving zero headroom. In practice, process startup overhead means SIGKILL can arrive before drain finishes for long-lived SSE streams.
- `terminationGracePeriodSeconds` is not a CRD field — users can only change it via `podTemplateSpec` (which is supported but undocumented for this use case).
- The backend StatefulSet does not have `terminationGracePeriodSeconds` set by the operator — it inherits the Kubernetes default of 30s.
Suggested addition at the end of the horizontal scaling section:
:::note[Connection draining on scale-down]
When a proxy runner pod is terminated (scale-in, rolling update, or node
eviction), Kubernetes sends SIGTERM and the proxy drains in-flight requests
for up to 30 seconds before force-closing connections. The grace period and
drain timeout are both 30 seconds with no headroom, so long-lived SSE or
streaming connections may be dropped if they exceed the drain window.

No `preStop` hook is injected by the operator. If your workload requires
additional time — for example, to let kube-proxy propagate endpoint removal
before the pod stops accepting traffic — override `terminationGracePeriodSeconds`
via `podTemplateSpec`:

```yaml
spec:
  podTemplateSpec:
    spec:
      terminationGracePeriodSeconds: 60
```

The same 30-second default applies to the backend StatefulSet and to
VirtualMCPServer pods.
:::
> The `VirtualMCPServer` CRD includes a `sessionAffinity` field that controls how
> the Kubernetes Service routes repeated client connections. By default, it uses
> `ClientIP` affinity, which routes connections from the same client IP to the
ClientIP affinity is unreliable behind NAT or load balancers
The doc presents ClientIP as the default sticky-session mechanism but doesn't mention its well-known limitation: when clients sit behind a NAT gateway, corporate proxy, or cloud load balancer, all traffic appears to originate from the same IP, routing everyone to the same pod and defeating horizontal scaling. For users deploying in typical cloud environments (EKS, GKE, AKS), this will silently underperform. Worth adding a caveat, e.g.:
> **Note:** `ClientIP` affinity may be ineffective when clients share a NAT or egress IP. For more reliable session routing, consider stateless backends or a dedicated vMCP instance per team.
The CRD confirms `sessionAffinity` accepts `ClientIP` or `None`, and the field comment (virtualmcpserver_types.go:41) already notes: "Set to `None` for stateless servers or when using an external load balancer with its own affinity." — worth surfacing in the doc.
Suggested replacement for the sessionAffinity paragraph:
The `VirtualMCPServer` CRD includes a `sessionAffinity` field that controls how
the Kubernetes Service routes repeated client connections. By default, it uses
`ClientIP` affinity, which routes connections from the same client IP to the
same pod:

```yaml
spec:
  sessionAffinity: ClientIP # default
```

:::warning[ClientIP affinity is unreliable behind NAT or shared egress IPs]
ClientIP affinity relies on the source IP reaching kube-proxy. When clients
sit behind a NAT gateway, corporate proxy, or cloud load balancer (common in
EKS, GKE, and AKS), all traffic appears to originate from the same IP —
routing every client to the same pod and eliminating the benefit of horizontal
scaling. This fails silently: the deployment appears healthy but only one pod
handles all load.

For stateless backends, disable affinity so the Service load-balances freely:

```yaml
spec:
  sessionAffinity: None
```

For stateful backends where true per-session routing is required, ClientIP
affinity is a best-effort mechanism only. Prefer vertical scaling or a
dedicated vMCP instance per team instead.
:::
> | `collection` | Template expression that produces an array | — |
> | `itemVar` | Variable name for the current item | item |
> | `maxParallel` | Maximum concurrent iterations (max 50) | 10 |
`maxParallel` fan-out is per-pod, not distributed across replicas

The `maxParallel: 50` cap may give the impression that concurrency is spread across the vMCP deployment. In practice, all iterations of a `forEach` step are dispatched by the single pod handling that composite tool request — so 50 parallel backend calls will originate from one pod regardless of `spec.replicas`. This is worth noting so users size their backend MCP servers and pod resources against the per-pod fan-out, not the aggregate deployment capacity.
Confirmed by `workflow_engine.go:701-741`: all iterations run in an `errgroup` on the pod that received the composite tool request — no distribution across replicas.
The note fits right after the template access paragraph (line 402), before ### Error handling:
Suggested note:

:::note
`maxParallel` controls how many iterations run concurrently **on the pod that
received the composite tool request**. Iterations are not distributed across
vMCP replicas — all parallel backend calls originate from a single pod
regardless of `spec.replicas`.

When sizing your deployment, account for the per-pod fan-out: a `maxParallel: 50`
forEach step can open up to 50 simultaneous connections to backend MCP servers
from one pod. Ensure both the vMCP pod resources and the backend MCP servers
can handle that per-pod concurrency.
:::
> | `collection` | Template expression that produces an array | — |
> | `itemVar` | Variable name for the current item | item |
> | `maxParallel` | Maximum concurrent iterations (max 50) | 10 |
> | `maxIterations` | Maximum total iterations (max 1000) | 100 |
No guidance on timeout interaction with `maxIterations` and `maxParallel`

At `maxIterations: 1000` and `maxParallel: 10` (default), a `forEach` loop could run for 100 serial batches — if each backend call takes a few seconds, the total can easily exceed a workflow-level timeout. The doc covers per-step timeouts elsewhere but doesn't connect the two here. A brief callout would help users avoid silent truncation: e.g., workflow timeout should be at least `ceil(maxIterations / maxParallel) × expected step duration`.
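To make the arithmetic concrete, a back-of-the-envelope sketch (the per-call duration is an illustrative assumption, not a measured value):

```yaml
forEach:
  maxParallel: 10 # default
  maxIterations: 1000 # hard cap
  # Worst case runs ceil(1000 / 10) = 100 serial batches.
  # At an assumed ~3 s per backend call: 100 × 3 s ≈ 300 s total.
  # A workflow-level timeout below ~300 s would silently truncate the loop.
```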
Summary

Catch up documentation with user-facing features shipped in ToolHive v0.12.3 through v0.13.0. Changes were verified against source code at each release tag.

- `replicas` and `backendReplicas` CRD fields, session storage config, and stdio transport limitations
- `thv skill install` documentation (registry and git sources)

Releases covered

Excluded (verified as not needing docs)

- `--publish` flag (not present in CLI at v0.13.0)

Test plan

- `npm run build` passes with no broken links or MDX errors

🤖 Generated with Claude Code