-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[sdk-platform-java] Race condition in TracedUnaryCallable causes flaky integration tests #12657
Description
Description
In gax-java, observability tracers (such as TracedUnaryCallable and TracedBatchingCallable) use TraceFinisher attached via ApiFutures.addCallback to record metrics and spans upon completion.
ApiFutures.addCallback(future, finisher, MoreExecutors.directExecutor());This pattern creates a race condition in synchronous environments, such as integration tests (ITOtelGoldenMetrics), where the test thread calls future.get() and immediately asserts state (e.g., checks OpenTelemetry metrics).
Root Cause
When the underlying RPC future completes, the Guava/ApiFuture infrastructure notifies listeners. addCallback attaches a listener that runs independently of the future's resolution state for the blocking caller.
In ITOtelGoldenMetrics:
client.echo()blocks the test thread.- The gRPC network thread completes the RPC.
- The test thread unblocks and calls
metricReader.collectAllMetrics(). - Concurrently (or slightly after), the
TraceFinisherruns to record the metric in the OTel meter.
If the test thread executes before the TraceFinisher completes its work, the test fails with "empty metrics".
Proposed Solution
Convert the detached addCallback into a chained future transformation. This guarantees that the returned future only resolves after the tracer has finished recording metrics/spans.
Example for TracedUnaryCallable:
ApiFuture<ResponseT> onSuccessFuture =
ApiFutures.transform(
future,
response -> {
finisher.onSuccess(response);
return response;
},
MoreExecutors.directExecutor());
return ApiFutures.catchingAsync(
onSuccessFuture,
Throwable.class,
t -> {
finisher.onFailure(t);
return ApiFutures.immediateFailedFuture(t);
},
MoreExecutors.directExecutor());This approach ensures structural LIFO ordering: the metric is recorded before any downstream consumer can yield from a blocking .get() on the returned future.