fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915
fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915Ma77Ball wants to merge 10 commits into
Conversation
… a pause is generated leading to a failed ci
Automated Reviewer SuggestionsBased on the
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🟢 | bs=10 sw=10 sl=64 | 394 | 0.241 | 24,318/34,151/34,151 us | 🟢 -9.1% / 🔴 +123.4% |
| 🔴 | bs=100 sw=10 sl=64 | 812 | 0.495 | 122,011/147,213/147,213 us | 🔴 +7.6% / 🔴 +35.8% |
| ⚪ | bs=1000 sw=10 sl=64 | 929 | 0.567 | 1,073,409/1,119,384/1,119,384 us | ⚪ within ±5% / 🔴 +8.0% |
Baseline details
Latest main 24b587f from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 394 tuples/sec | 391 tuples/sec | 770.95 tuples/sec | +0.8% | -48.9% |
| bs=10 sw=10 sl=64 | MB/s | 0.241 MB/s | 0.239 MB/s | 0.471 MB/s | +0.8% | -48.8% |
| bs=10 sw=10 sl=64 | p50 | 24,318 us | 24,152 us | 12,775 us | +0.7% | +90.3% |
| bs=10 sw=10 sl=64 | p95 | 34,151 us | 37,589 us | 15,286 us | -9.1% | +123.4% |
| bs=10 sw=10 sl=64 | p99 | 34,151 us | 37,589 us | 18,795 us | -9.1% | +81.7% |
| bs=100 sw=10 sl=64 | throughput | 812 tuples/sec | 820 tuples/sec | 976.93 tuples/sec | -1.0% | -16.9% |
| bs=100 sw=10 sl=64 | MB/s | 0.495 MB/s | 0.5 MB/s | 0.596 MB/s | -1.0% | -17.0% |
| bs=100 sw=10 sl=64 | p50 | 122,011 us | 120,692 us | 102,557 us | +1.1% | +19.0% |
| bs=100 sw=10 sl=64 | p95 | 147,213 us | 136,873 us | 108,383 us | +7.6% | +35.8% |
| bs=100 sw=10 sl=64 | p99 | 147,213 us | 136,873 us | 115,249 us | +7.6% | +27.7% |
| bs=1000 sw=10 sl=64 | throughput | 929 tuples/sec | 904 tuples/sec | 1,009 tuples/sec | +2.8% | -7.9% |
| bs=1000 sw=10 sl=64 | MB/s | 0.567 MB/s | 0.552 MB/s | 0.616 MB/s | +2.7% | -7.9% |
| bs=1000 sw=10 sl=64 | p50 | 1,073,409 us | 1,102,636 us | 997,695 us | -2.7% | +7.6% |
| bs=1000 sw=10 sl=64 | p95 | 1,119,384 us | 1,172,947 us | 1,036,731 us | -4.6% | +8.0% |
| bs=1000 sw=10 sl=64 | p99 | 1,119,384 us | 1,172,947 us | 1,069,334 us | -4.6% | +4.7% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,507.15,200,128000,394,0.241,24317.57,34150.58,34150.58
1,100,10,64,20,2464.03,2000,1280000,812,0.495,122010.87,147212.93,147212.93
2,1000,10,64,20,21532.14,20000,12800000,929,0.567,1073408.63,1119384.23,1119384.23
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #5915 +/- ##
============================================
- Coverage 56.89% 56.61% -0.29%
Complexity 3058 3058
============================================
Files 1129 1126 -3
Lines 43802 43342 -460
Branches 4743 4669 -74
============================================
- Hits 24922 24538 -384
+ Misses 17449 17403 -46
+ Partials 1431 1401 -30
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
/request-review @aglinxinyuan |
|
that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen? |
There was a problem hiding this comment.
Pull request overview
This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.
Changes:
- Introduces a helper (
boundedCsvSource) to create a CSV scan operator descriptor based on the medium CSV source. - Switches the two CSV-sourced reconfiguration tests to use the new helper instead of
smallCsvScanOpDesc.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Yes, it is a quick, temporary fix. I could do a deeper dive into this problem and find a permanent solution. This pr should help mitigate the issue to avoid it blocking PRs in the meantime. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Matthew B. <mgball@uci.edu>
|
maybe we need manually backport? I removed the v1.2 label |
The test that keeps failing: it starts a workflow, pauses it, swaps some code, then resumes it. To pause, it asks each worker "please pause" and waits 5 seconds for a reply. The problem: some workers run Python. Starting Python is slow (it has to boot the interpreter and load libraries). If a worker is still busy starting up Python, it can't reply "paused" within 5 seconds → the test gives up → CI fails. The test that uses two Python workers is slowest to start, so it fails most.
d5da5a6 to
12e0a6e
Compare
|
Is this test even worth keeping? ReconfigurationSpec already covers the mechanism (stable); this only adds the Python-executor path, and it's been our recurring flake.
|
|
It is definitely worth to keep. Reconfiguration and all interactive features are mainly used in Python UDFs. Let's improve the test for long term stability. |
What changes were proposed in this PR?
Makes
ReconfigurationIntegrationSpecdeterministic instead of flaky.The problem: the test starts a workflow, pauses it, swaps operator code, then resumes, and it relied on the amount of data to keep the workflow alive long enough to pause. That is a race with no safe setting:
pauselands, so PAUSED is never reached and the test times out;mediumCSV is 100k rows through 1-2 Python UDFs): the workflow can't reach COMPLETED within the 1-minute wait, so it times out.Python throughput varies run to run, so either way it was flaky.
The fix: replace the data-volume race with a slow source.
slowRegionSourceOpDescemits only 30 rows, 0.25s apart (~7.5s of steady production):pauselands reliably (an explicit running window instead of racing throughput);Also kept from earlier: a
TextInput -> Python UDFwarmup inbeforeAll(speeds worker cold-start) and widened control-command awaits (now just safety bounds, no longer the fix). Test 2 is unified onto theRegioncolumn.Any related issues, documentation, discussions?
The reconfiguration mechanism itself is also covered by the pure-Scala
ReconfigurationSpec(Java operators, stableamberjob); this spec adds the Python-executor path.How was this PR tested?
DAO/compile+WorkflowExecutionService/Test/compilesucceed; scalafmt clean.amber-integrationCI job runsReconfigurationIntegrationSpecend to end (Python workers + postgres + lakekeeper + Iceberg), which is the real signal for this flake.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF