Skip to content

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915

Open
Ma77Ball wants to merge 10 commits into
apache:mainfrom
Ma77Ball:fix/reconfiguration-spec-flaky
Open

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915
Ma77Ball wants to merge 10 commits into
apache:mainfrom
Ma77Ball:fix/reconfiguration-spec-flaky

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Makes ReconfigurationIntegrationSpec deterministic instead of flaky.

The problem: the test starts a workflow, pauses it, swaps operator code, then resumes, and it relied on the amount of data to keep the workflow alive long enough to pause. That is a race with no safe setting:

  • too little data: the workflow finishes before pause lands, so PAUSED is never reached and the test times out;
  • too much data (the medium CSV is 100k rows through 1-2 Python UDFs): the workflow can't reach COMPLETED within the 1-minute wait, so it times out.

Python throughput varies run to run, so either way it was flaky.

The fix: replace the data-volume race with a slow source. slowRegionSourceOpDesc emits only 30 rows, 0.25s apart (~7.5s of steady production):

  • the workflow is provably still running for ~7.5s, so pause lands reliably (an explicit running window instead of racing throughput);
  • only 30 tiny tuples flow, so completion is near-instant and the timeout is never stressed;
  • the Python source generator suspends on pause and resumes the same frame, so the remaining rows flow through the reconfigured UDF(s) after resume and the assertions still hold.

Also kept from earlier: a TextInput -> Python UDF warmup in beforeAll (speeds worker cold-start) and widened control-command awaits (now just safety bounds, no longer the fix). Test 2 is unified onto the Region column.

Any related issues, documentation, discussions?

The reconfiguration mechanism itself is also covered by the pure-Scala ReconfigurationSpec (Java operators, stable amber job); this spec adds the Python-executor path.

How was this PR tested?

  • DAO/compile + WorkflowExecutionService/Test/compile succeed; scalafmt clean.
  • The amber-integration CI job runs ReconfigurationIntegrationSpec end to end (Python workers + postgres + lakekeeper + Iceberg), which is the real signal for this flake.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

… a pause is generated leading to a failed ci
@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @shengquan-ni
    You can notify them by mentioning @shengquan-ni in a comment.

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 2 worse · ⚪ 11 noise (<±5%) · 0 without baseline

Compared against main 24b587f benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🟢 bs=10 sw=10 sl=64 394 0.241 24,318/34,151/34,151 us 🟢 -9.1% / 🔴 +123.4%
🔴 bs=100 sw=10 sl=64 812 0.495 122,011/147,213/147,213 us 🔴 +7.6% / 🔴 +35.8%
bs=1000 sw=10 sl=64 929 0.567 1,073,409/1,119,384/1,119,384 us ⚪ within ±5% / 🔴 +8.0%
Baseline details

Latest main 24b587f from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 394 tuples/sec 391 tuples/sec 770.95 tuples/sec +0.8% -48.9%
bs=10 sw=10 sl=64 MB/s 0.241 MB/s 0.239 MB/s 0.471 MB/s +0.8% -48.8%
bs=10 sw=10 sl=64 p50 24,318 us 24,152 us 12,775 us +0.7% +90.3%
bs=10 sw=10 sl=64 p95 34,151 us 37,589 us 15,286 us -9.1% +123.4%
bs=10 sw=10 sl=64 p99 34,151 us 37,589 us 18,795 us -9.1% +81.7%
bs=100 sw=10 sl=64 throughput 812 tuples/sec 820 tuples/sec 976.93 tuples/sec -1.0% -16.9%
bs=100 sw=10 sl=64 MB/s 0.495 MB/s 0.5 MB/s 0.596 MB/s -1.0% -17.0%
bs=100 sw=10 sl=64 p50 122,011 us 120,692 us 102,557 us +1.1% +19.0%
bs=100 sw=10 sl=64 p95 147,213 us 136,873 us 108,383 us +7.6% +35.8%
bs=100 sw=10 sl=64 p99 147,213 us 136,873 us 115,249 us +7.6% +27.7%
bs=1000 sw=10 sl=64 throughput 929 tuples/sec 904 tuples/sec 1,009 tuples/sec +2.8% -7.9%
bs=1000 sw=10 sl=64 MB/s 0.567 MB/s 0.552 MB/s 0.616 MB/s +2.7% -7.9%
bs=1000 sw=10 sl=64 p50 1,073,409 us 1,102,636 us 997,695 us -2.7% +7.6%
bs=1000 sw=10 sl=64 p95 1,119,384 us 1,172,947 us 1,036,731 us -4.6% +8.0%
bs=1000 sw=10 sl=64 p99 1,119,384 us 1,172,947 us 1,069,334 us -4.6% +4.7%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,507.15,200,128000,394,0.241,24317.57,34150.58,34150.58
1,100,10,64,20,2464.03,2000,1280000,812,0.495,122010.87,147212.93,147212.93
2,1000,10,64,20,21532.14,20000,12800000,929,0.567,1073408.63,1119384.23,1119384.23

@codecov-commenter

codecov-commenter commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.61%. Comparing base (24b587f) to head (3f02ad2).

Files with missing lines Patch % Lines
...g/apache/texera/amber/operator/TestOperators.scala 0.00% 7 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5915      +/-   ##
============================================
- Coverage     56.89%   56.61%   -0.29%     
  Complexity     3058     3058              
============================================
  Files          1129     1126       -3     
  Lines         43802    43342     -460     
  Branches       4743     4669      -74     
============================================
- Hits          24922    24538     -384     
+ Misses        17449    17403      -46     
+ Partials       1431     1401      -30     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 44.59% <ø> (ø) Carriedforward from c16058b
amber 58.62% <0.00%> (-0.03%) ⬇️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 52.30% <ø> (ø)
file-service 62.81% <ø> (ø)
frontend 49.38% <ø> (-0.75%) ⬇️ Carriedforward from c16058b
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.51% <ø> (-0.64%) ⬇️ Carriedforward from c16058b
python 90.69% <ø> (ø) Carriedforward from c16058b
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Ma77Ball Ma77Ball marked this pull request as ready for review June 26, 2026 23:39
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

/request-review @aglinxinyuan

@github-actions github-actions Bot requested a review from aglinxinyuan June 27, 2026 00:20
@Yicong-Huang

Copy link
Copy Markdown
Contributor

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

@aglinxinyuan aglinxinyuan requested a review from Copilot June 28, 2026 23:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.

Changes:

  • Introduces a helper (boundedCsvSource) to create a CSV scan operator descriptor based on the medium CSV source.
  • Switches the two CSV-sourced reconfiguration tests to use the new helper instead of smallCsvScanOpDesc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 29, 2026
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

Yes, it is a quick, temporary fix. I could do a deeper dive into this problem and find a permanent solution. This pr should help mitigate the issue to avoid it blocking PRs in the meantime.

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Approved for a short term fix. Please keep the issue open or create a new issue to track for long term solution.

Ma77Ball and others added 2 commits June 30, 2026 15:35
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Matthew B. <mgball@uci.edu>
@Yicong-Huang

Copy link
Copy Markdown
Contributor

maybe we need manually backport? I removed the v1.2 label

@Yicong-Huang Yicong-Huang removed the release/v1.2 back porting to release/v1.2 label Jun 30, 2026
The test that keeps failing: it starts a workflow,
  pauses it, swaps some code, then resumes it. To
  pause, it asks each worker "please pause" and waits
  5 seconds for a reply.

  The problem: some workers run Python. Starting
  Python is slow (it has to boot the interpreter and
  load libraries). If a worker is still busy starting
  up Python, it can't reply "paused" within 5 seconds
  → the test gives up → CI fails. The test that uses
  two Python workers is slowest to start, so it fails
  most.
@github-actions github-actions Bot added the engine label Jul 1, 2026
@Ma77Ball Ma77Ball force-pushed the fix/reconfiguration-spec-flaky branch from d5da5a6 to 12e0a6e Compare July 1, 2026 18:50
@github-actions github-actions Bot added the common label Jul 1, 2026
@Ma77Ball

Ma77Ball commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Is this test even worth keeping? ReconfigurationSpec already covers the mechanism (stable); this only adds the Python-executor path, and it's been our recurring flake.

  1. Keep the deterministic rewrite
  2. @ignore + tracking issue, rely on ReconfigurationSpec
  3. Remove the spec

@Yicong-Huang

@Yicong-Huang

Copy link
Copy Markdown
Contributor

It is definitely worth to keep. Reconfiguration and all interactive features are mainly used in Python UDFs.

Let's improve the test for long term stability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants