Skip to content

[sdk-platform-java] Race condition in TracedUnaryCallable causes flaky integration tests #12657

@blakeli0

Description

@blakeli0

Description

In gax-java, observability tracers (such as TracedUnaryCallable and TracedBatchingCallable) use TraceFinisher attached via ApiFutures.addCallback to record metrics and spans upon completion.

ApiFutures.addCallback(future, finisher, MoreExecutors.directExecutor());

This pattern creates a race condition in synchronous environments, such as integration tests (ITOtelGoldenMetrics), where the test thread calls future.get() and immediately asserts state (e.g., checks OpenTelemetry metrics).

Root Cause

When the underlying RPC future completes, the Guava/ApiFuture infrastructure notifies listeners. addCallback attaches a listener that runs independently of the future's resolution state for the blocking caller.

In ITOtelGoldenMetrics:

  1. client.echo() blocks the test thread.
  2. The gRPC network thread completes the RPC.
  3. The test thread unblocks and calls metricReader.collectAllMetrics().
  4. Concurrently (or slightly after), the TraceFinisher runs to record the metric in the OTel meter.

If the test thread executes before the TraceFinisher completes its work, the test fails with "empty metrics".

Proposed Solution

Convert the detached addCallback into a chained future transformation. This guarantees that the returned future only resolves after the tracer has finished recording metrics/spans.

Example for TracedUnaryCallable:

ApiFuture<ResponseT> onSuccessFuture =
    ApiFutures.transform(
        future,
        response -> {
            finisher.onSuccess(response);
            return response;
        },
        MoreExecutors.directExecutor());

return ApiFutures.catchingAsync(
    onSuccessFuture,
    Throwable.class,
    t -> {
        finisher.onFailure(t);
        return ApiFutures.immediateFailedFuture(t);
    },
    MoreExecutors.directExecutor());

This approach ensures structural LIFO ordering: the metric is recorded before any downstream consumer can yield from a blocking .get() on the returned future.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions