fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037
fix(amber): correct WorkerSpec input port to stop flaky timeouts#6037Ma77Ball wants to merge 4 commits into
Conversation
Automated Reviewer SuggestionsBased on the
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 392 | 0.239 | 24,592/33,541/33,541 us | 🔴 -8.1% / 🔴 +119.4% |
| 🔴 | bs=100 sw=10 sl=64 | 830 | 0.507 | 116,487/153,454/153,454 us | 🔴 +7.8% / 🔴 +41.6% |
| ⚪ | bs=1000 sw=10 sl=64 | 952 | 0.581 | 1,051,134/1,084,853/1,084,853 us | ⚪ within ±5% / 🔴 -5.7% |
Baseline details
Latest main 24b587f from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 392 tuples/sec | 426 tuples/sec | 770.95 tuples/sec | -8.0% | -49.2% |
| bs=10 sw=10 sl=64 | MB/s | 0.239 MB/s | 0.26 MB/s | 0.471 MB/s | -8.1% | -49.2% |
| bs=10 sw=10 sl=64 | p50 | 24,592 us | 24,489 us | 12,775 us | +0.4% | +92.5% |
| bs=10 sw=10 sl=64 | p95 | 33,541 us | 31,963 us | 15,286 us | +4.9% | +119.4% |
| bs=10 sw=10 sl=64 | p99 | 33,541 us | 31,963 us | 18,795 us | +4.9% | +78.5% |
| bs=100 sw=10 sl=64 | throughput | 830 tuples/sec | 843 tuples/sec | 976.93 tuples/sec | -1.5% | -15.0% |
| bs=100 sw=10 sl=64 | MB/s | 0.507 MB/s | 0.515 MB/s | 0.596 MB/s | -1.6% | -15.0% |
| bs=100 sw=10 sl=64 | p50 | 116,487 us | 118,153 us | 102,557 us | -1.4% | +13.6% |
| bs=100 sw=10 sl=64 | p95 | 153,454 us | 142,316 us | 108,383 us | +7.8% | +41.6% |
| bs=100 sw=10 sl=64 | p99 | 153,454 us | 142,316 us | 115,249 us | +7.8% | +33.1% |
| bs=1000 sw=10 sl=64 | throughput | 952 tuples/sec | 953 tuples/sec | 1,009 tuples/sec | -0.1% | -5.7% |
| bs=1000 sw=10 sl=64 | MB/s | 0.581 MB/s | 0.582 MB/s | 0.616 MB/s | -0.2% | -5.7% |
| bs=1000 sw=10 sl=64 | p50 | 1,051,134 us | 1,043,762 us | 997,695 us | +0.7% | +5.4% |
| bs=1000 sw=10 sl=64 | p95 | 1,084,853 us | 1,101,275 us | 1,036,731 us | -1.5% | +4.6% |
| bs=1000 sw=10 sl=64 | p99 | 1,084,853 us | 1,101,275 us | 1,069,334 us | -1.5% | +1.5% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,510.10,200,128000,392,0.239,24592.35,33540.60,33540.60
1,100,10,64,20,2408.79,2000,1280000,830,0.507,116486.79,153453.63,153453.63
2,1000,10,64,20,21010.36,20000,12800000,952,0.581,1051134.32,1084852.93,1084852.93
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #6037 +/- ##
============================================
- Coverage 56.89% 56.38% -0.52%
- Complexity 3058 3062 +4
============================================
Files 1129 1142 +13
Lines 43802 44154 +352
Branches 4743 4758 +15
============================================
- Hits 24922 24896 -26
- Misses 17449 17828 +379
+ Partials 1431 1430 -1
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Updated error handling comment for clarity. Signed-off-by: Matthew B. <mgball@uci.edu>
| // Re-throw Errors (e.g. failed assertions) after replying; only | ||
| // Exceptions are returned to the sender and recovered from. | ||
| if (err.isInstanceOf[Error]) throw err |
There was a problem hiding this comment.
how come non errors can reach here? that's a bigger problem...
with your change, this "non-error" is omitted silently...
can we at least log warning when it happens?
There was a problem hiding this comment.
Dropped the throw since it can take down the actor loop. The Error was an AssertionError from the test's mismatched addPort args; it now just logs a warning so it's not silent.
Yicong-Huang
left a comment
There was a problem hiding this comment.
The src code seems to have a bigger issue. we should not make source code less robust to satisfy test cases...
|
Reverted the re-throw. Fixed the real root cause: the test passed mismatched |
What changes were proposed in this PR?
WorkerSpec's input-portAssignPortRequestnow passes an emptystorageUris(List()) matching its emptypartitionings, satisfyingInputManager.addPort'ssizeinvariant, so the input port registers and the three worker tests pass deterministically.AsyncRPCServer.invokeMethodnow re-throwsErrors (e.g. failed assertions) after replying to the sender, instead of swallowing them; normalExceptions are still returned to the sender unchanged, so genuine invariant violations surface loudly rather than degrading into flaky timeouts.Any related issues, documentation, discussions?
Closes: #6036
How was this PR tested?
sbt "WorkflowExecutionService/testOnly *WorkerSpec"; expectTests: succeeded 3, failed 0.assertion failed. Before this fix the count is 3 (the assertion fired and was swallowed every run); after it is 0. Verified locally under Java 17.AsyncRPCServerchange is left to the amber CI job to exercise across all RPC handlers.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF