fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race by Ma77Ball · Pull Request #5915 · apache/texera

Ma77Ball · 2026-06-23T22:09:58Z

What changes were proposed in this PR?

Makes ReconfigurationIntegrationSpec deterministic instead of flaky.

The problem: the test starts a workflow, pauses it, swaps operator code, then resumes, and it relied on the amount of data to keep the workflow alive long enough to pause. That is a race with no safe setting:

too little data: the workflow finishes before pause lands, so PAUSED is never reached and the test times out;
too much data (the medium CSV is 100k rows through 1-2 Python UDFs): the workflow can't reach COMPLETED within the 1-minute wait, so it times out.

Python throughput varies run to run, so either way it was flaky.

The fix: replace the data-volume race with a slow source. slowRegionSourceOpDesc emits only 30 rows, 0.25s apart (~7.5s of steady production):

the workflow is provably still running for ~7.5s, so pause lands reliably (an explicit running window instead of racing throughput);
only 30 tiny tuples flow, so completion is near-instant and the timeout is never stressed;
the Python source generator suspends on pause and resumes the same frame, so the remaining rows flow through the reconfigured UDF(s) after resume and the assertions still hold.

Also kept from earlier: a TextInput -> Python UDF warmup in beforeAll (speeds worker cold-start) and widened control-command awaits (now just safety bounds, no longer the fix). Test 2 is unified onto the Region column.

Any related issues, documentation, discussions?

The reconfiguration mechanism itself is also covered by the pure-Scala ReconfigurationSpec (Java operators, stable amber job); this spec adds the Python-executor path.

How was this PR tested?

DAO/compile + WorkflowExecutionService/Test/compile succeed; scalafmt clean.
The amber-integration CI job runs ReconfigurationIntegrationSpec end to end (Python workers + postgres + lakekeeper + Iceberg), which is the real signal for this flake.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

… a pause is generated leading to a failed ci

github-actions · 2026-06-23T22:10:15Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @shengquan-ni
You can notify them by mentioning @shengquan-ni in a comment.

github-actions · 2026-06-23T22:15:38Z

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 2 worse · ⚪ 11 noise (<±5%) · 0 without baseline

Compared against main 24b587f benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🟢	bs=10 sw=10 sl=64	394	0.241	24,318/34,151/34,151 us	🟢 -9.1% / 🔴 +123.4%
🔴	bs=100 sw=10 sl=64	812	0.495	122,011/147,213/147,213 us	🔴 +7.6% / 🔴 +35.8%
⚪	bs=1000 sw=10 sl=64	929	0.567	1,073,409/1,119,384/1,119,384 us	⚪ within ±5% / 🔴 +8.0%

Baseline details

Latest main 24b587f from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	394 tuples/sec	391 tuples/sec	770.95 tuples/sec	+0.8%	-48.9%
bs=10 sw=10 sl=64	MB/s	0.241 MB/s	0.239 MB/s	0.471 MB/s	+0.8%	-48.8%
bs=10 sw=10 sl=64	p50	24,318 us	24,152 us	12,775 us	+0.7%	+90.3%
bs=10 sw=10 sl=64	p95	34,151 us	37,589 us	15,286 us	-9.1%	+123.4%
bs=10 sw=10 sl=64	p99	34,151 us	37,589 us	18,795 us	-9.1%	+81.7%
bs=100 sw=10 sl=64	throughput	812 tuples/sec	820 tuples/sec	976.93 tuples/sec	-1.0%	-16.9%
bs=100 sw=10 sl=64	MB/s	0.495 MB/s	0.5 MB/s	0.596 MB/s	-1.0%	-17.0%
bs=100 sw=10 sl=64	p50	122,011 us	120,692 us	102,557 us	+1.1%	+19.0%
bs=100 sw=10 sl=64	p95	147,213 us	136,873 us	108,383 us	+7.6%	+35.8%
bs=100 sw=10 sl=64	p99	147,213 us	136,873 us	115,249 us	+7.6%	+27.7%
bs=1000 sw=10 sl=64	throughput	929 tuples/sec	904 tuples/sec	1,009 tuples/sec	+2.8%	-7.9%
bs=1000 sw=10 sl=64	MB/s	0.567 MB/s	0.552 MB/s	0.616 MB/s	+2.7%	-7.9%
bs=1000 sw=10 sl=64	p50	1,073,409 us	1,102,636 us	997,695 us	-2.7%	+7.6%
bs=1000 sw=10 sl=64	p95	1,119,384 us	1,172,947 us	1,036,731 us	-4.6%	+8.0%
bs=1000 sw=10 sl=64	p99	1,119,384 us	1,172,947 us	1,069,334 us	-4.6%	+4.7%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,507.15,200,128000,394,0.241,24317.57,34150.58,34150.58
1,100,10,64,20,2464.03,2000,1280000,812,0.495,122010.87,147212.93,147212.93
2,1000,10,64,20,21532.14,20000,12800000,929,0.567,1073408.63,1119384.23,1119384.23

codecov-commenter · 2026-06-23T22:19:07Z

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.61%. Comparing base (24b587f) to head (3f02ad2).

Files with missing lines	Patch %	Lines
...g/apache/texera/amber/operator/TestOperators.scala	0.00%	7 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5915      +/-   ##
============================================
- Coverage     56.89%   56.61%   -0.29%     
  Complexity     3058     3058              
============================================
  Files          1129     1126       -3     
  Lines         43802    43342     -460     
  Branches       4743     4669      -74     
============================================
- Hits          24922    24538     -384     
+ Misses        17449    17403      -46     
+ Partials       1431     1401      -30

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`44.59% <ø> (ø)`		Carriedforward from c16058b
amber	`58.62% <0.00%> (-0.03%)`	⬇️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`52.30% <ø> (ø)`
file-service	`62.81% <ø> (ø)`
frontend	`49.38% <ø> (-0.75%)`	⬇️	Carriedforward from c16058b
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`90.51% <ø> (-0.64%)`	⬇️	Carriedforward from c16058b
python	`90.69% <ø> (ø)`		Carriedforward from c16058b
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Ma77Ball · 2026-06-27T00:20:53Z

/request-review @aglinxinyuan

Yicong-Huang · 2026-06-28T22:57:22Z

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

Copilot

Pull request overview

This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.

Changes:

Introduces a helper (boundedCsvSource) to create a CSV scan operator descriptor based on the medium CSV source.
Switches the two CSV-sourced reconfiguration tests to use the new helper instead of smallCsvScanOpDesc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Ma77Ball · 2026-06-30T04:21:21Z

that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen?

Yes, it is a quick, temporary fix. I could do a deeper dive into this problem and find a permanent solution. This pr should help mitigate the issue to avoid it blocking PRs in the meantime.

Yicong-Huang

OK. Approved for a short term fix. Please keep the issue open or create a new issue to track for long term solution.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Matthew B. <mgball@uci.edu>

Yicong-Huang · 2026-06-30T22:59:51Z

maybe we need manually backport? I removed the v1.2 label

The test that keeps failing: it starts a workflow, pauses it, swaps some code, then resumes it. To pause, it asks each worker "please pause" and waits 5 seconds for a reply. The problem: some workers run Python. Starting Python is slow (it has to boot the interpreter and load libraries). If a worker is still busy starting up Python, it can't reply "paused" within 5 seconds → the test gives up → CI fails. The test that uses two Python workers is slowest to start, so it fails most.

Ma77Ball · 2026-07-01T18:52:36Z

Is this test even worth keeping? ReconfigurationSpec already covers the mechanism (stable); this only adds the Python-executor path, and it's been our recurring flake.

Keep the deterministic rewrite
@ignore + tracking issue, rely on ReconfigurationSpec
Remove the spec

@Yicong-Huang

Yicong-Huang · 2026-07-01T20:45:06Z

It is definitely worth to keep. Reconfiguration and all interactive features are mainly used in Python UDFs.

Let's improve the test for long term stability.

increase amount of data for csv scan to avoid engine finishing before…

291e31c

… a pause is generated leading to a failed ci

github-actions Bot assigned Ma77Ball Jun 23, 2026

github-actions Bot added fix amber-integration labels Jun 23, 2026

Ma77Ball added 3 commits June 23, 2026 15:20

Merge branch 'main' into fix/reconfiguration-spec-flaky

94a697b

Merge branch 'main' into fix/reconfiguration-spec-flaky

740e963

bound the csv source spec to 10000

a0841b4

Ma77Ball marked this pull request as ready for review June 26, 2026 23:39

github-actions Bot requested a review from aglinxinyuan June 27, 2026 00:20

aglinxinyuan requested a review from Copilot June 28, 2026 23:47

aglinxinyuan approved these changes Jun 28, 2026

View reviewed changes

Copilot started reviewing on behalf of aglinxinyuan June 28, 2026 23:48 View session

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Comment thread ...src/test/integration/org/apache/texera/amber/engine/e2e/ReconfigurationIntegrationSpec.scala Outdated

xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 29, 2026

Yicong-Huang approved these changes Jun 30, 2026

View reviewed changes

Ma77Ball and others added 2 commits June 30, 2026 15:35

Potential fix for pull request finding

b1728be

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Matthew B. <mgball@uci.edu>

Merge branch 'main' into fix/reconfiguration-spec-flaky

d4e85e7

Yicong-Huang removed the release/v1.2 back porting to release/v1.2 label Jun 30, 2026

github-actions Bot added the engine label Jul 1, 2026

add slow source to avoid ci failure

12e0a6e

Ma77Ball force-pushed the fix/reconfiguration-spec-flaky branch from d5da5a6 to 12e0a6e Compare July 1, 2026 18:50

github-actions Bot added the common label Jul 1, 2026

Merge branch 'main' into fix/reconfiguration-spec-flaky

38f532f

Merge branch 'main' into fix/reconfiguration-spec-flaky

3f02ad2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915

fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915
Ma77Ball wants to merge 10 commits into
apache:mainfrom
Ma77Ball:fix/reconfiguration-spec-flaky

Ma77Ball commented Jun 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 23, 2026 •

edited

Loading

Uh oh!

Ma77Ball commented Jun 27, 2026

Uh oh!

Yicong-Huang commented Jun 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Ma77Ball commented Jun 30, 2026

Uh oh!

Yicong-Huang left a comment •

edited

Loading

Uh oh!

Yicong-Huang commented Jun 30, 2026

Uh oh!

Ma77Ball commented Jul 1, 2026 •

edited

Loading

Uh oh!

Yicong-Huang commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Uh oh!

Conversation

Ma77Ball commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Automated Reviewer Suggestions

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

codecov-commenter commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Ma77Ball commented Jun 27, 2026

Uh oh!

Yicong-Huang commented Jun 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Ma77Ball commented Jun 30, 2026

Uh oh!

Yicong-Huang left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yicong-Huang commented Jun 30, 2026

Uh oh!

Ma77Ball commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Yicong-Huang commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Ma77Ball commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading

codecov-commenter commented Jun 23, 2026 •

edited

Loading

Yicong-Huang left a comment •

edited

Loading

Ma77Ball commented Jul 1, 2026 •

edited

Loading