[FLINK-39127][metrics] Truncate OTel attribute values via config #27989

Open
Izeren wants to merge 1 commit into apache:master from Izeren:FLINK-39127/otel-attribute-truncation
Izeren commented Apr 21, 2026

What is the purpose of the change

Part 3 of FLIP-553. Large OpenTelemetry metric attribute values can be rejected by OTel collectors as oversized payloads. This PR adds two configuration options to the OpenTelemetry metric reporter to keep attribute values within safe bounds and to track post-truncation metric name collisions without unbounded heap growth.

Brief change log

  • New option transform.attribute-value-length-limits.<attribute-name> truncates metric attribute values to the configured length. A limit of 0 drops the attribute, and a negative limit disables truncation for that attribute. The global key * applies to any attribute without an explicit entry.
  • New option transform.collision-tracking-max-slots (default 50000) bounds the memory footprint of the post-truncation collision tracker via an access-ordered LRU. 0 disables collision tracking. Malformed or negative values log a WARN and fall back to the default, so that a typo does not tear down metric reporting at JobManager startup.
  • New package-private MetricAttributeTransformer applies the limits and tracks collisions. The class is @NotThreadSafe; access is serialized via the existing reporter lock.
  • OpenTelemetryMetricReporter.open() builds the transformer from reporter config; notifyOfAddedMetric applies it before registration.
  • Options live in OpenTelemetryReporterOptions, which is shared with the trace and event reporters. Their descriptions explicitly state that these two options apply only to the metric reporter and are ignored by the trace and event reporters.
  • Empty-string attribute values are preserved under any non-zero configured limit.
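
The limit semantics listed above can be illustrated with a minimal sketch. This is not the PR's MetricAttributeTransformer: the class name AttributeLimitSketch and its methods are hypothetical, and the sketch assumes that length is counted in Java chars (the description does not say whether the real implementation counts chars, code points, or bytes).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of the described limit semantics; not the PR's code. */
final class AttributeLimitSketch {
    /** Global key that applies to attributes without an explicit entry. */
    static final String GLOBAL_KEY = "*";

    private final Map<String, Integer> limits;

    AttributeLimitSketch(Map<String, Integer> limits) {
        this.limits = Map.copyOf(limits);
    }

    /** Returns the (possibly truncated) value, or empty if the attribute is dropped. */
    Optional<String> apply(String name, String value) {
        Integer limit = limits.get(name);
        if (limit == null) {
            limit = limits.get(GLOBAL_KEY); // no explicit entry: use the global limit
        }
        if (limit == null || limit < 0) {
            return Optional.of(value); // unconfigured or negative: no truncation
        }
        if (limit == 0) {
            return Optional.empty(); // zero: drop the attribute entirely
        }
        // Assumption: length is measured in chars. Empty values pass through unchanged.
        return Optional.of(value.length() <= limit ? value : value.substring(0, limit));
    }

    /** Applies the limits to every attribute, omitting dropped ones. */
    Map<String, String> applyAll(Map<String, String> attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        attributes.forEach((k, v) -> apply(k, v).ifPresent(t -> result.put(k, t)));
        return result;
    }
}
```

With limits {* = 5, task_name = 3, host = 0, job_name = -1}, this sketch truncates task_name to 3 chars, drops host, leaves job_name untouched, and truncates every other attribute to 5 chars.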

Verifying this change

This change added tests and can be verified as follows:

  • Unit tests in MetricAttributeTransformerTest covering: empty attribute values preserved under non-zero limits, a hot slot in the access-ordered LRU surviving eviction, bounded slot tracking under a load of maxSlots * 4 distinct slots, malformed max-slots values (the NumberFormatException path) falling back to the default, negative max-slots falling back to the default, global vs per-attribute limit interaction, 0 drop semantics, and negative-limit disable semantics.
  • Unit tests in OpenTelemetryMetricReporterTest covering the reporter-level wiring.
  • Integration test OpenTelemetryMetricReporterITCase#testAttributeValueTruncation verifying truncated attribute values reach the exporter.
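
The LRU and fallback behaviour exercised by the unit tests above could look roughly like the following sketch. It is an assumption about the general technique (a java.util.LinkedHashMap in access order with removeEldestEntry), not the PR's actual tracker; CollisionTrackerSketch, parseMaxSlots, and record are hypothetical names, and System.err stands in for the reporter's logger.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded collision tracker; not the PR's implementation. */
final class CollisionTrackerSketch {
    static final int DEFAULT_MAX_SLOTS = 50_000;

    /** Parses the option; malformed or negative input warns and returns the default. */
    static int parseMaxSlots(String raw) {
        if (raw == null) {
            return DEFAULT_MAX_SLOTS;
        }
        try {
            int parsed = Integer.parseInt(raw.trim());
            if (parsed >= 0) {
                return parsed;
            }
        } catch (NumberFormatException e) {
            // fall through to the warning below
        }
        System.err.println("WARN: invalid max-slots '" + raw + "', using default");
        return DEFAULT_MAX_SLOTS;
    }

    private final int maxSlots;
    private final Map<String, Integer> slots;

    CollisionTrackerSketch(int maxSlots) {
        this.maxSlots = maxSlots;
        // accessOrder = true: get/put move an entry to the most-recently-used end.
        this.slots = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > CollisionTrackerSketch.this.maxSlots;
            }
        };
    }

    /** Records a post-truncation key; returns true if the key was already tracked. */
    boolean record(String key) {
        if (maxSlots == 0) {
            return false; // tracking disabled
        }
        Integer previous = slots.get(key); // also refreshes the entry's recency
        slots.put(key, previous == null ? 1 : previous + 1);
        return previous != null;
    }

    int size() {
        return slots.size();
    }
}
```

Because the map is access-ordered, a key that keeps colliding stays at the recent end and survives eviction, while one-off keys (for example per-attempt UUIDs) age out once maxSlots is exceeded. Collision detection is therefore best-effort: a key that was evicted and later reappears is treated as new.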

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no (the new ConfigOptions are @PublicEvolving; the transformer class is package-private)
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no (metric reporter registration path only)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs (option descriptions regenerated into docs/layouts/shortcodes/generated/open_telemetry_reporter_configuration.html) and JavaDocs on MetricAttributeTransformer and the new ConfigOptions in OpenTelemetryReporterOptions.

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude Code (Opus 4.7)

flinkbot commented Apr 21, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

Izeren force-pushed the FLINK-39127/otel-attribute-truncation branch from cffebfc to fcc26f3 on April 22, 2026 14:40

Adds configurable per-attribute and global length limits for metric
attribute values exported by the OpenTelemetry reporter, addressing
oversized payload rejections from OTel collectors. Configuration is
prefix-based (transform.attribute-value-length-limits.<attr>), with
a `*` global key, 0 to drop, and negative values to disable the
limit per attribute. Colliding metrics after truncation are logged
and counted best-effort without blocking registration.

A second option, transform.collision-tracking-max-slots, bounds the
memory footprint of the collision tracker via an access-ordered LRU
(default 50000, 0 disables, invalid values WARN and fall back to the
default). This prevents unbounded heap growth under failover loops
where every task attempt produces a distinct per-attempt UUID that
maps to a new tracking slot.

Part of FLIP-553.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Izeren force-pushed the FLINK-39127/otel-attribute-truncation branch from fcc26f3 to 74ad2be on April 23, 2026 10:07