Skip to content

fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037

Open
Ma77Ball wants to merge 4 commits into
apache:mainfrom
Ma77Ball:fix/flaky-workerspec-input-port
Open

fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037
Ma77Ball wants to merge 4 commits into
apache:mainfrom
Ma77Ball:fix/flaky-workerspec-input-port

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • WorkerSpec's input-port AssignPortRequest now passes an empty storageUris (List()) matching its empty partitionings, satisfying InputManager.addPort's size invariant, so the input port registers and the three worker tests pass deterministically.
  • AsyncRPCServer.invokeMethod now re-throws Errors (e.g. failed assertions) after replying to the sender, instead of swallowing them; normal Exceptions are still returned to the sender unchanged, so genuine invariant violations surface loudly rather than degrading into flaky timeouts.

Any related issues, documentation, discussions?

Closes: #6036

How was this PR tested?

  • Run sbt "WorkflowExecutionService/testOnly *WorkerSpec"; expect Tests: succeeded 3, failed 0.
  • Diagnostic: grep the run log for assertion failed. Before this fix the count is 3 (the assertion fired and was swallowed every run); after it is 0. Verified locally under Java 17.
  • The full amber suite is heavy and flaky to run locally, so the engine-wide AsyncRPCServer change is left to the amber CI job to exercise across all RPC handlers.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @Yicong-Huang
    You can notify them by mentioning @Yicong-Huang in a comment.

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 4 worse · ⚪ 11 noise (<±5%) · 0 without baseline

Compared against main 24b587f benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 392 0.239 24,592/33,541/33,541 us 🔴 -8.1% / 🔴 +119.4%
🔴 bs=100 sw=10 sl=64 830 0.507 116,487/153,454/153,454 us 🔴 +7.8% / 🔴 +41.6%
bs=1000 sw=10 sl=64 952 0.581 1,051,134/1,084,853/1,084,853 us ⚪ within ±5% / 🔴 -5.7%
Baseline details

Latest main 24b587f from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 392 tuples/sec 426 tuples/sec 770.95 tuples/sec -8.0% -49.2%
bs=10 sw=10 sl=64 MB/s 0.239 MB/s 0.26 MB/s 0.471 MB/s -8.1% -49.2%
bs=10 sw=10 sl=64 p50 24,592 us 24,489 us 12,775 us +0.4% +92.5%
bs=10 sw=10 sl=64 p95 33,541 us 31,963 us 15,286 us +4.9% +119.4%
bs=10 sw=10 sl=64 p99 33,541 us 31,963 us 18,795 us +4.9% +78.5%
bs=100 sw=10 sl=64 throughput 830 tuples/sec 843 tuples/sec 976.93 tuples/sec -1.5% -15.0%
bs=100 sw=10 sl=64 MB/s 0.507 MB/s 0.515 MB/s 0.596 MB/s -1.6% -15.0%
bs=100 sw=10 sl=64 p50 116,487 us 118,153 us 102,557 us -1.4% +13.6%
bs=100 sw=10 sl=64 p95 153,454 us 142,316 us 108,383 us +7.8% +41.6%
bs=100 sw=10 sl=64 p99 153,454 us 142,316 us 115,249 us +7.8% +33.1%
bs=1000 sw=10 sl=64 throughput 952 tuples/sec 953 tuples/sec 1,009 tuples/sec -0.1% -5.7%
bs=1000 sw=10 sl=64 MB/s 0.581 MB/s 0.582 MB/s 0.616 MB/s -0.2% -5.7%
bs=1000 sw=10 sl=64 p50 1,051,134 us 1,043,762 us 997,695 us +0.7% +5.4%
bs=1000 sw=10 sl=64 p95 1,084,853 us 1,101,275 us 1,036,731 us -1.5% +4.6%
bs=1000 sw=10 sl=64 p99 1,084,853 us 1,101,275 us 1,069,334 us -1.5% +1.5%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,510.10,200,128000,392,0.239,24592.35,33540.60,33540.60
1,100,10,64,20,2408.79,2000,1280000,830,0.507,116486.79,153453.63,153453.63
2,1000,10,64,20,21010.36,20000,12800000,952,0.581,1051134.32,1084852.93,1084852.93

@codecov-commenter

codecov-commenter commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.38%. Comparing base (24b587f) to head (cdccb02).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...exera/amber/engine/common/rpc/AsyncRPCServer.scala 0.00% 3 Missing and 2 partials ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6037      +/-   ##
============================================
- Coverage     56.89%   56.38%   -0.52%     
- Complexity     3058     3062       +4     
============================================
  Files          1129     1142      +13     
  Lines         43802    44154     +352     
  Branches       4743     4758      +15     
============================================
- Hits          24922    24896      -26     
- Misses        17449    17828     +379     
+ Partials       1431     1430       -1     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø) Carriedforward from a1a7eb0
agent-service 44.59% <ø> (ø) Carriedforward from a1a7eb0
amber 57.36% <0.00%> (-1.29%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø) Carriedforward from a1a7eb0
config-service 52.30% <ø> (ø) Carriedforward from a1a7eb0
file-service 62.81% <ø> (ø) Carriedforward from a1a7eb0
frontend 50.12% <ø> (ø) Carriedforward from a1a7eb0
notebook-migration-service 78.57% <ø> (ø) Carriedforward from a1a7eb0
pyamber 90.51% <ø> (-0.64%) ⬇️ Carriedforward from a1a7eb0
python 90.69% <ø> (ø) Carriedforward from a1a7eb0
workflow-compiling-service 55.14% <ø> (ø) Carriedforward from a1a7eb0

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Updated error handling comment for clarity.

Signed-off-by: Matthew B. <mgball@uci.edu>
@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 30, 2026
Comment on lines +106 to +108
// Re-throw Errors (e.g. failed assertions) after replying; only
// Exceptions are returned to the sender and recovered from.
if (err.isInstanceOf[Error]) throw err

@Yicong-Huang Yicong-Huang Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come non errors can reach here? that's a bigger problem...

with your change, this "non-error" is omitted silently...

can we at least log warning when it happens?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped the throw since it can take down the actor loop. The Error was an AssertionError from the test's mismatched addPort args; it now just logs a warning so it's not silent.

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The src code seems to have a bigger issue. we should not make source code less robust to satisfy test cases...

@Ma77Ball

Ma77Ball commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Reverted the re-throw. Fixed the real root cause: the test passed mismatched storageUris/partitionings, now corrected in one line. AsyncRPCServer only keeps a warning log so a swallowed Error isn't silent.

@Ma77Ball Ma77Ball requested a review from Yicong-Huang July 1, 2026 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine fix release/v1.2 back porting to release/v1.2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky WorkerSpec: input-port AssignPortRequest violates addPort size invariant

4 participants