Skip to content

Restructure MCPTelemetryConfig CRD to nested spec per RFC THV-0023#4487

Merged
ChrisJBurns merged 7 commits intomainfrom
restructure-mcptelemetryconfig-nested-spec
Apr 2, 2026
Merged

Restructure MCPTelemetryConfig CRD to nested spec per RFC THV-0023#4487
ChrisJBurns merged 7 commits intomainfrom
restructure-mcptelemetryconfig-nested-spec

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns commented Apr 1, 2026

Summary

  • The MCPTelemetryConfig CRD spec was using a flat inline embedding of telemetry.Config which exposed CLI-only fields (environmentVariables) and didn't match the nested openTelemetry/prometheus structure specified in RFC THV-0023
  • Restructure to use spec.openTelemetry and spec.prometheus sub-objects, add resourceAttributes (replacing customAttributes), move sensitiveHeaders under openTelemetry, and replace referencingServers []string with structured referencingWorkloads []WorkloadReference
  • Update the conversion layer to map nested CRD types to flat telemetry.Config at runtime
  • Add operator example YAML files for MCPTelemetryConfig and MCPServer with telemetryConfigRef

Closes #4262

Type of change

  • Refactoring (no behavior change)

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

File Change
api/v1alpha1/mcptelemetryconfig_types.go Replace flat telemetry.Config inline embed with nested MCPTelemetryOTelConfig/PrometheusConfig. Add WorkloadReference type. Update validation, CEL rules, printcolumns
api/v1alpha1/zz_generated.deepcopy.go Regenerated for new types
controllers/mcptelemetryconfig_controller.go findReferencingServersfindReferencingWorkloads returning []WorkloadReference. Extract workloadRefsEqual helper to fix cyclomatic complexity
pkg/spectoconfig/telemetry.go Rewrite NormalizeMCPTelemetryConfig from one-liner config := spec.Config to field-by-field mapping from nested CRD to flat telemetry.Config
pkg/controllerutil/telemetry.go Access sensitive headers via Spec.OpenTelemetry.SensitiveHeaders
deploy/charts/operator-crds/ CRD manifests regenerated
examples/operator/telemetry-configs/ New example YAMLs: basic config, production config with sensitiveHeaders, MCPServer with telemetryConfigRef

Does this introduce a user-facing change?

Yes — this is a breaking change to the MCPTelemetryConfig CRD spec structure. Since the API is v1alpha1 (explicitly unstable), breaking changes are expected. Users must update their MCPTelemetryConfig manifests from the flat structure to the nested structure. See the examples in examples/operator/telemetry-configs/ for the new format.

Large PR Justification

  • is just a api type change with a lot of auto generation

Generated with Claude Code

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@ChrisJBurns
Copy link
Copy Markdown
Collaborator Author

@claude review this PR focusing only on critical and high severity issues. Ignore medium/low severity, style preferences, and controversial opinions. Focus on:

  • Correctness bugs (wrong logic, missing nil checks, data loss)
  • Security issues (credential exposure, injection)
  • Breaking changes that aren't documented
  • Test coverage gaps for critical paths

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 1, 2026

Claude finished @ChrisJBurns's task in 4m 10s —— View job


PR Review: Restructure MCPTelemetryConfig CRD

  • Gather context
  • Review type definitions and conversion layer
  • Review controller changes
  • Review test coverage
  • Review CRD manifests
  • Post findings

Found 3 high-severity issues. No critical security issues.


HIGH — MCPTelemetryOTelConfig.Enabled is never read in NormalizeMCPTelemetryConfig

File: cmd/thv-operator/pkg/spectoconfig/telemetry.go:120

The parallel function ConvertTelemetryConfig (for inline MCPServer telemetry) gates all OTel config on the Enabled flag:

// ConvertTelemetryConfig — inline MCPServer path
if telemetryConfig.OpenTelemetry != nil && telemetryConfig.OpenTelemetry.Enabled {
    // only configures OTel if Enabled == true
}

But NormalizeMCPTelemetryConfig (for the shared MCPTelemetryConfig CRD) never reads otel.Enabled:

if spec.OpenTelemetry != nil {
    otel := spec.OpenTelemetry
    // otel.Enabled is silently ignored
    config.Endpoint = otel.Endpoint
    config.Insecure = otel.Insecure
    // ...
}

The field carries // +kubebuilder:default=false, so every new MCPTelemetryConfig without openTelemetry.enabled: true will have Enabled == false — yet OTel is still fully configured from the sub-fields. This has two concrete failure modes:

  1. A user who sets enabled: false intending to "disable" OTel cannot — the endpoint and tracing config are still applied.
  2. Any config-as-code tooling that relies on the enabled toggle to control rollout will behave incorrectly.

The examples file (mcpserver_fetch_otel.yaml) correctly sets enabled: true, but the inconsistency between the two paths is a correctness bug. The fix is either to gate the mapping on otel.Enabled, or to remove the field from MCPTelemetryOTelConfig if it's intentionally not a gate.

Fix this →


HIGH — Status schema breaking change not documented

Files: mcptelemetryconfig_types.go:126, PR description

The PR's "Does this introduce a user-facing change?" section only mentions the spec breaking change. However, the status subresource also changed:

# Before
.status.referencingServers: ["server-a", "server-b"]   # []string

# After
.status.referencingWorkloads: [                         # []WorkloadReference
  {kind: MCPServer, namespace: default, name: server-a},
  ...
]

Any tooling, scripts, or Chainsaw/e2e tests that probe .status.referencingServers will silently receive empty/null instead of an error, making this failure invisible. The PR should explicitly call this out. The condition type description in the PR ("The condition type remains Valid") implies status fields were considered carefully, but this particular field isn't mentioned.


HIGH — Endpoint + no tracing/metrics passes validation but fails at runtime

File: cmd/thv-operator/pkg/spectoconfig/telemetry.go, cmd/thv-operator/api/v1alpha1/mcptelemetryconfig_types.go

If a user creates an MCPTelemetryConfig with an endpoint but omits the tracing and metrics sub-objects:

spec:
  openTelemetry:
    endpoint: otel-collector:4317   # TracingEnabled=false, MetricsEnabled=false (nil sub-objects)

NormalizeMCPTelemetryConfig will produce config.Endpoint != "" with both TracingEnabled=false and MetricsEnabled=false. The MCPTelemetryConfig reconciler's Validate() method only checks sensitive headers — it won't catch this. CEL rules don't validate it either. The config will be marked Valid=True in the CRD status.

At runtime, when the MCPServer proxy runner calls NewProvider, validateOtelConfig fires:

// pkg/telemetry/config.go:330
if config.Endpoint != "" && !config.TracingEnabled && !config.MetricsEnabled {
    return fmt.Errorf("OTLP endpoint is configured but both tracing and metrics are disabled...")
}

The proxy runner fails to start with an error that won't show up in the MCPTelemetryConfig status. This path is more likely to be hit with the new nested structure than the old flat embedding (where tracing/metrics had flat boolean fields).

Suggested fix: Add a CEL validation rule to MCPTelemetryConfigSpec (or MCPTelemetryOTelConfig) that rejects endpoint-without-tracing-or-metrics:

// +kubebuilder:validation:XValidation:rule="!has(self.endpoint) || self.endpoint == '' || (has(self.tracing) && self.tracing.enabled) || (has(self.metrics) && self.metrics.enabled)",message="endpoint requires tracing or metrics to be enabled"

Fix this →


Test coverage gap (supporting the issues above)

TestNormalizeMCPTelemetryConfig in pkg/spectoconfig/telemetry_test.go has no test case for:

The existing test "service name override takes precedence" actually creates this broken config (endpoint + no tracing/metrics) but only validates the normalized struct, not runtime validity.


No issues found in

  • Controller logic (findReferencingWorkloads, sort, workloadRefsEqual)
  • Sensitive header nil-guard in controllerutil/telemetry.go
  • Deep copy generation
  • CRD manifest regeneration

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot dismissed their stale review April 1, 2026 21:29

Large PR justification has been provided. Thank you!

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@ChrisJBurns
Copy link
Copy Markdown
Collaborator Author

@claude re-review this PR. The two high-severity issues from your previous review have been addressed:

  1. Enabled flag ignoredNormalizeMCPTelemetryConfig now gates on otel.Enabled (commit fa4e4fa)
  2. Endpoint without tracing/metrics — Added CEL validation rule rejecting endpoint when neither is enabled (commit fa4e4fa)

Both have new test cases. Please verify the fixes and flag any remaining critical/high issues only.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 1, 2026

Claude finished @ChrisJBurns's task in 1m 55s —— View job


PR Re-review: Restructure MCPTelemetryConfig CRD


Both previously-flagged high-severity fixes are confirmed. No new critical or high issues found.

Fix #1 VERIFIED — otel.Enabled gate ✅

NormalizeMCPTelemetryConfig at pkg/spectoconfig/telemetry.go:121 now correctly gates the entire OTel mapping block on spec.OpenTelemetry.Enabled, matching the ConvertTelemetryConfig behavior:

if spec.OpenTelemetry != nil && spec.OpenTelemetry.Enabled {

Test case "enabled false skips OTel config entirely" (line 280) covers the case of Enabled: false with sub-fields set, verifying the endpoint and tracing config are not mapped through.

Fix #2 VERIFIED — CEL endpoint validation ✅

CEL rule added to MCPTelemetryOTelConfig at mcptelemetryconfig_types.go:34:

// +kubebuilder:validation:XValidation:rule="!has(self.endpoint) || size(self.endpoint) == 0 || (has(self.tracing) && self.tracing.enabled) || (has(self.metrics) && self.metrics.enabled)",message="endpoint requires at least one of tracing or metrics to be enabled"

This correctly rejects endpoint-with-no-signals at admission time, preventing the runtime validateOtelConfig failure. Test case "endpoint with nil tracing and metrics produces no tracing or metrics" (line 296) documents the normalized output for a config that would be rejected by CEL, providing defense-in-depth coverage.

Remaining: Status field rename documentation gap (LOW)

The previous HIGH finding about .status.referencingServers.status.referencingWorkloads not being documented in "Does this introduce a user-facing change?" is still technically open — the PR body only calls out the spec breaking change. Given this is v1alpha1, the practical risk is low. If you want to be thorough, add a note like:

Status field: .status.referencingServers ([]string) is replaced by .status.referencingWorkloads ([]WorkloadReference). Any tooling probing the old field will silently receive null.

Not blocking.


LGTM on the two fixes. The implementation is correct and the new test cases are well-scoped.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 77.02703% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.35%. Comparing base (610075f) to head (92f1e78).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...rator/controllers/mcptelemetryconfig_controller.go 70.37% 6 Missing and 2 partials ⚠️
...-operator/api/v1alpha1/mcptelemetryconfig_types.go 63.15% 4 Missing and 3 partials ⚠️
cmd/thv-operator/pkg/spectoconfig/telemetry.go 87.50% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4487   +/-   ##
=======================================
  Coverage   69.34%   69.35%           
=======================================
  Files         502      502           
  Lines       51643    51683   +40     
=======================================
+ Hits        35814    35843   +29     
- Misses      13063    13069    +6     
- Partials     2766     2771    +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@ChrisJBurns ChrisJBurns force-pushed the restructure-mcptelemetryconfig-nested-spec branch from 41bd22c to e863bd1 Compare April 2, 2026 14:15
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 2, 2026
@ChrisJBurns
Copy link
Copy Markdown
Collaborator Author

Summary of offline discussion

Jeremy and I discussed this concern in depth offline. Here's the conclusion we arrived at:

Why we're keeping separate types (CRD types vs. runtime config types):

  1. Override safety — Go's json:",inline" embedding exposes every exported field from the embedded struct in the CRD schema. There's no way to hide or re-tag individual fields from the embedding site. For example, if telemetry.Config has serviceName (which the CLI needs), the CRD would expose it too — even though it must not be user-configurable per-server. The only options would be CEL admission rules (field is still visible in kubectl explain and the schema) or not embedding. Separate types give us explicit control over the CRD surface area.

  2. CRD stability — The cost of a missed conversion layer is a feature that doesn't work, which gets caught before or shortly after a PR is submitted. The cost of an unintended field leaking into the CRD schema is potentially a breaking API change. We'd rather fail loudly on a missing mapping than silently expand the CRD contract.

What we agreed to improve going forward:

We acknowledged that the duplication and manual conversion layers are a real maintenance burden and error-prone. We agreed on this direction:

  • Keep types split — there are genuine reasons for the structural differences.
  • Automate drift detection — use reflection-based tests that enumerate fields on both the CRD type and the runtime config type, and fail if a field isn't either mapped in the conversion function or explicitly listed in an exclusion set. This catches missed conversions at test time without needing expensive e2e tests.
  • Standardize conversion patterns — make all type conversions follow a consistent structure to reduce cognitive load.
  • RFC in progress — I've opened a draft RFC at stacklok/toolhive-rfcs#65 to formalize the approach. Jeremy will review later this week.

TL;DR: Merging this PR as-is. The separate types are intentional for safety, and we'll address the maintainability concerns through automated drift detection and a broader RFC.

ChrisJBurns and others added 7 commits April 2, 2026 18:16
The MCPTelemetryConfig spec was using a flat inline embedding of
telemetry.Config which exposed CLI-only fields and didn't match the
nested structure specified in RFC THV-0023.

Restructure to use openTelemetry/prometheus sub-objects:
- spec.openTelemetry: endpoint, headers, sensitiveHeaders,
  resourceAttributes, tracing, metrics, useLegacyAttributes
- spec.prometheus: enabled
- status.referencingWorkloads: structured WorkloadReference
  (kind/namespace/name) replacing bare string list

Update the conversion layer (spectoconfig, controllerutil) to map
the nested CRD types to the flat telemetry.Config at runtime. Add
operator example YAML files for MCPTelemetryConfig and MCPServer
with telemetryConfigRef.

Closes #4262

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combine MCPTelemetryConfig and MCPServer-with-ref examples into
mcpserver_fetch_otel.yaml as a single apply-able file. Remove the
separate telemetry-configs directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix two issues from PR review:

- NormalizeMCPTelemetryConfig now gates OTel config on Enabled flag,
  matching ConvertTelemetryConfig behavior. Setting enabled: false
  correctly disables OTLP even if sub-fields are populated.

- Add CEL validation rejecting endpoint when neither tracing nor
  metrics is enabled, catching the misconfiguration at admission
  time instead of failing at proxy runner startup.

Add test cases covering both scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use single-quoted YAML for the endpoint validation CEL rule to match
the quoting style of other CEL rules in the chart. Double-quoted YAML
caused the rule value to include literal quote characters, failing
CRD installation with a CEL syntax error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gci linter was replacing ASCII quotes with Unicode smart quotes
(U+201C/U+201D) in kubebuilder annotations, breaking controller-gen
CRD generation. Additionally, the CEL rule for endpoint validation
generated double-quoted YAML that the Helm wrapper mishandled.

Move the endpoint-requires-signals validation from a CEL rule to the
programmatic Validate() method, which avoids both issues and provides
clearer error messages. Add test case for this validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove duplicate WorkloadReference definition from
mcptelemetryconfig_types.go and use the shared type already
defined in mcpoidcconfig_types.go. Drop the Namespace field
(cross-namespace refs are not supported) to match the shared
type's schema.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ChrisJBurns ChrisJBurns force-pushed the restructure-mcptelemetryconfig-nested-spec branch from e863bd1 to 92f1e78 Compare April 2, 2026 17:20
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 2, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — the nested types approach is well-implemented and the conversion layer is clean and well-tested.

For context: the broader concern around diverging config/CRD types and the cognitive load of maintaining separate type hierarchies will be addressed systemically in toolhive-rfcs#65.

@ChrisJBurns ChrisJBurns merged commit a78d9bf into main Apr 2, 2026
42 of 46 checks passed
@ChrisJBurns ChrisJBurns deleted the restructure-mcptelemetryconfig-nested-spec branch April 2, 2026 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define MCPTelemetryConfig API Types

2 participants