fix(schema-apply): concurrent optimize must not false-fail a schema apply (iss-schema-apply-optimize-occ) by ragnorc · Pull Request #254 · ModernRelay/omnigraph

ragnorc · 2026-06-15T20:00:42Z

What

The follow-up correctness-by-design fix to the iss-848 index-reconciler work (#246). It instantiates the same governing principle — logical contract over physical state: a physical operation must never fail a logical one.

The bug

A periodic optimize (Lance compaction) running concurrently with a schema apply false-failed the apply. optimize advances a table's manifest version; schema apply captured its snapshot first, then a graph-global "write lease" check compared the whole manifest version against the pre-apply version and aborted (schema apply lost its write lease: main advanced from vN to vM …) whenever any table moved — even a table the apply never touched. Operators had to wrap cluster refresh + retry around apply.

The fix (mirror what `optimize` already does)

classify_plan_steps — derive the touched-table sets from the plan snapshot-free (pure schema_table_key), so the write queues can be acquired before the snapshot.
Acquire the touched-existing-table write queues (rewritten ∪ rename-source ∪ dropped) BEFORE the snapshot, then snapshot under the held queues — the same queue-before-read order optimize uses. No in-process optimize/mutate can move a touched table during the apply, so the per-table preconditions pass by construction.
Remove the graph-global write-lease check. A compaction of an untouched table is a disjoint manifest row the publisher rebases onto cleanly; it no longer fails the apply. A compaction of a touched table serializes behind it.
Publish with a per-table expected-version CAS (commit_changes_with_expected_with_actor). The held queues make this pass in-process; the CAS is the loud cross-process backstop (typed ExpectedVersionMismatch instead of a silent clobber).

Operations that don't conflict semantically no longer conflict mechanically.

Proof (deterministic failpoint tests)

Via the new schema_apply.after_queue_acquire_pre_snapshot / schema_apply.post_snapshot_pre_commit failpoints:

schema_apply_succeeds_under_concurrent_write_to_untouched_table — lease-removal proof: an untouched table's version advancing mid-apply no longer false-fails it.
concurrent_optimize_on_touched_table_serializes_behind_schema_apply — queue-ordering proof: optimize of a touched table blocks behind the apply, then runs over the post-apply layout.

Commits

fix(schema-apply) — engine (schema_apply.rs reorder + lease removal + CAS; graph_coordinator.rs CAS wrappers).
test(schema-apply) — the two failpoint proofs.
docs(schema-apply) — writes.md concurrency subsection, invariants.md truth-matrix row + narrowed gap, testing.md.

Notes

Branched off current main (post feat(server)!: cluster-only server — remove single-graph serving (RFC-011) #250–252 RFC-011 config work); engine files untouched since Index materialization is derived state: defer off the write path, reconcile via optimize (iss-848) #246, so the design held against live code.
Verified via CI (cold local cache; following the Index materialization is derived state: defer off the write path, reconcile via optimize (iss-848) #246 "let CI build" workflow). Full suite + failpoints job exercise the proofs.

Note

Medium Risk
Changes schema-apply serialization and manifest publish preconditions on a critical migration path; design mirrors optimize and adds typed per-table CAS, but cross-process recovery/staging races remain documented.

Overview
Fixes iss-schema-apply-optimize-occ: concurrent optimize could abort schema apply with a graph-global “write lease” error when any table’s manifest version moved, even tables the migration never touched.

Engine: classify_plan_steps derives touched-table sets from the plan without a snapshot. apply_schema_with_lock then acquires write queues for existing tables (rewritten ∪ rename-source ∪ dropped, plus the schema-apply serial key when a sidecar is written) before snapshot—matching optimize’s queue-before-read order. Hard-drop cleanup URIs are resolved after the snapshot. The pre-publish global manifest version lease check is removed; publish uses commit_changes_with_expected_with_actor with per-table expected versions (graph_coordinator adds the manifest CAS wrappers).

CI: RustFS S3 integration job timeout 75 → 120 minutes so cold runs can finish all suites and save cache (avoids a timeout loop).

Docs: writes.md, invariants.md, and testing.md document the concurrency model and why an in-process failpoint race test is largely unreachable while __schema_apply_lock__ blocks new writers.

^{Reviewed by Cursor Bugbot for commit 508f13d. Bugbot is set up for automated code reviews on this repo. Configure here.}

Greptile Summary

This PR fixes a false-fail in schema apply caused by a concurrent optimize advancing the global manifest version for a table the migration never touched. The root cause was the old graph-global "write lease" check comparing the whole manifest version before and after the apply — any manifest advance (even on an unrelated table) would abort the apply with a spurious conflict.

classify_plan_steps is extracted to derive touched-table sets (added / renamed / rewritten / dropped) snapshot-free, enabling write-queue acquisition before the snapshot — the same queue-before-read order optimize already uses.
The global write-lease check is removed and replaced with a per-table expected-version CAS (commit_changes_with_expected_with_actor) at publish time. Compaction of an untouched table now advances the global manifest version without failing the apply; compaction of a touched table serializes behind the apply via the held write queues.
Two new pub(crate) helpers on GraphCoordinator mirror the existing commit_changes_with_actor pattern; a CI timeout bump (75 → 120 min) addresses a cold-cache timeout loop on the RustFS S3 job.

Confidence Score: 5/5

Safe to merge: the fix correctly narrows the concurrency conflict check from a graph-global manifest-version comparison to a per-table expected-version CAS, which removes the false-fail without introducing new clobber risk.

The queue-before-snapshot reordering, removal of the global write-lease check, and introduction of per-table CAS are all mechanically sound. The per-table queues prevent any in-process writer from moving a touched table between snapshot and publish, and the CAS is the correct loud backstop for cross-process movers. The two new GraphCoordinator helpers follow the established layering exactly. The writes_sidecar predicate is equivalent to the old guard. The acknowledged absence of a deterministic regression test is documented clearly and does not introduce any undetected defect path.

schema_apply.rs carries all the ordering and CAS logic and deserves the closest read; the GraphCoordinator additions are straightforward wrappers.

Important Files Changed

Filename	Overview
crates/omnigraph/src/db/omnigraph/schema_apply.rs	Core fix: extracts ClassifiedSteps (snapshot-free), acquires touched-table write queues before the snapshot, removes graph-global write-lease check, and switches to per-table expected-version CAS publish. Logic is sound; no regression test (documented as infeasible under the schema-apply lock).
crates/omnigraph/src/db/graph_coordinator.rs	Adds two pub(crate) methods — commit_manifest_changes_with_expected and commit_changes_with_expected_with_actor — mirroring the existing commit_manifest_changes / commit_changes_with_actor pair. Pattern is correct; failpoint hook preserved.
.github/workflows/ci.yml	RustFS S3 job timeout bumped 75 to 120 minutes with a detailed justification comment; addresses a cold-cache timeout loop.
docs/dev/writes.md	Adds subsection documenting queue-before-snapshot, per-table CAS, and removal of the global write-lease check; explains why in-process failpoint tests are not feasible under the schema-apply lock.
docs/dev/invariants.md	Adds Schema apply concurrency invariant row and narrows the multi-process OCC gap to the recovery-heal path.
docs/dev/testing.md	Documents that the bug-4 fix has no new in-process failpoint tests and explains why every spawnable concurrent writer is excluded by the schema-apply lock.

Comments Outside Diff (2)

crates/omnigraph/tests/failpoints.rs, line 662-674 (link)

Timing-sensitive synchronization before optimize spawn

The 300ms sleep is the only mechanism ensuring the apply task has actually acquired the node:Person write queue before the optimize task is spawned. If plan_schema_for_apply, validate_catalog, or queue acquisition take longer than 300ms (e.g., on a cold or loaded CI runner), the optimize task could acquire the queue first, run to completion, and make the !optimize.is_finished() assertion spuriously fail.

The first test in this pair (schema_apply_succeeds_under_concurrent_write_to_untouched_table) uses a more robust polling loop — it waits for the __recovery sidecar file to appear, which is written just after the snapshot and immediately before the post_snapshot_pre_commit park. For this test the park is after_queue_acquire_pre_snapshot, which fires before the sidecar is written, so the same file-existence signal isn't available. A comment acknowledging this timing assumption and the acceptable flake risk — or a tokio::sync::Notify / atomic set inside the failpoint hook if the infrastructure supports it — would make the constraint explicit.
crates/omnigraph/tests/failpoints.rs, line 519 (link)

Commit ordering inverts the AGENTS.md rule-12 test-first contract

AGENTS.md rule 12 requires that for bug fixes the test commit lands before the fix commit ("the red → green pair is visible in git log"). The PR commits are ordered fix → test → docs, which means a reviewer checking out the test commit alone cannot reproduce the failure against the pre-fix engine. Reordering to test → fix → docs would make the regression proof self-contained in git history, as the rule intends.

Context Used: CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (8): Last reviewed commit: "Merge branch 'main' into ragnorc/bug-4-s..." | Re-trigger Greptile}

…ent optimize can't false-fail an apply (iss-schema-apply-optimize-occ) A periodic `optimize` (Lance compaction) running concurrently with a `schema apply` false-failed the apply. optimize advances a table's manifest version; schema apply captured its snapshot first, then a graph-global "write lease" check compared the whole manifest version against the pre-apply version and aborted ("schema apply lost its write lease") whenever ANY table moved — even a table the apply never touched. Same governing principle as the iss-848 index work (logical contract over physical state): a physical operation (compaction of a disjoint table) must not fail a logical one (a migration). The fix mirrors what `optimize` already does: - classify_plan_steps: derive the touched-table sets from the plan snapshot-free (pure schema_table_key), so the write queues can be acquired before the snapshot. - Acquire the touched-existing-table write queues (rewritten ∪ rename-source ∪ dropped) BEFORE the snapshot, then snapshot under the held queues. No in-process optimize/mutate can move a touched table during the apply, so the per-table preconditions pass by construction. - Remove the graph-global write-lease check. A compaction of an untouched table is a disjoint manifest row the publisher rebases onto cleanly; it no longer fails the apply. A compaction of a touched table serializes behind it. - Publish via commit_changes_with_expected_with_actor: a per-table expected-version CAS over the touched-existing tables. The held queues make this pass in-process; the CAS is the loud cross-process backstop (typed ExpectedVersionMismatch instead of a silent clobber). "Operations that don't conflict semantically no longer conflict mechanically."

…-fail an apply Two deterministic failpoint tests (the proof for iss-schema-apply-optimize-occ), via the new schema_apply.after_queue_acquire_pre_snapshot and schema_apply.post_snapshot_pre_commit failpoints: - schema_apply_succeeds_under_concurrent_write_to_untouched_table: park the apply after its snapshot (pre-publish), advance an UNTOUCHED table's manifest version via a concurrent write, release → the apply publishes and succeeds (the removed write-lease check would have aborted it here). Re-open confirms the migration persisted and nothing recovery-shaped remains. - concurrent_optimize_on_touched_table_serializes_behind_schema_apply: park the apply just after it acquires the touched-table queues; a concurrent optimize of that table blocks (is_finished() == false) rather than racing; once the apply releases its guards, optimize completes over the post-apply layout. The failpoints are introduced by the fix, so these verify against it rather than red-then-green on a prior commit; reverting the lease-removal + reorder (keeping the failpoints) reproduces the original false-fail.

…ncurrency model - writes.md: new "Schema apply concurrency" subsection (queue-before-snapshot, per-table CAS, no global lease; the two failpoint proofs). - invariants.md: new Schema-apply-concurrency truth-matrix row; narrow the in-process recovery-serialization gap (the apply's publish is now CAS-fenced, so a foreign optimize of a touched table is loud, not silent). - testing.md: list the two new failpoint tests + their failpoints.

Resolved two conflicts against main (#253 schema-apply actor threading / RFC-011 D10; #231 self-heal manifest-unreferenced branch forks): - schema_apply.rs: my queue-before-snapshot reorder MOVED the touched-table queue-acquisition block to before the snapshot; #231's edit here was a comment-only update to that (now-relocated) block. Kept my relocation; #231's actual reconciler logic lives in recovery.rs/optimize.rs (auto-merged). - invariants.md: add/add in Known Gaps — kept BOTH my sentence narrowing the recovery-serialization gap (the apply publish is now CAS-fenced) AND #231's new "Fork reclaim is in-process-safe only" bullet. Verified the API surfaces this branch calls are unchanged post-merge: write_queue::acquire_many, manifest::commit_changes_with_expected, apply_schema_with_lock (no actor param — #253 enforces in the outer apply_schema, publish stays system-attributed).

…c (drop sleep race) Cursor Bugbot (PR #254): concurrent_optimize_on_touched_table_serializes_behind_schema_apply relied on a fixed 300ms sleep to ensure the apply had parked holding the touched-table queue before optimize started. On slow CI the apply could still be planning/acquiring, letting optimize win the queue race and finish first → spurious `!optimize.is_finished()` failure. Correct-by-design fix: park the apply at schema_apply.post_snapshot_pre_commit (where its recovery sidecar is already on disk) rather than the pre-snapshot point (no file artifact, which forced the sleep). The touched-table queues are acquired BEFORE the snapshot and held to end-of-function, so at this park point the apply provably holds the queue; poll for the sidecar as a deterministic "parked, holding the queue" barrier before spawning optimize. optimize then blocks until the apply releases — robust regardless of CI speed. Retires the now-unused schema_apply.after_queue_acquire_pre_snapshot failpoint; updates writes.md / testing.md to the single post_snapshot_pre_commit failpoint.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 7da260b. Configure here.}

…ument why (Bugbot #254) Cursor Bugbot flagged that concurrent_optimize_on_touched_table_serializes_behind_schema_apply cannot prove queue serialization: optimize_all_tables calls ensure_schema_apply_idle, which errors while the schema-apply lock branch exists, so a spawned optimize returns a conflict instead of blocking on the touched-table queue. Investigating confirmed a deeper problem: while an apply holds __schema_apply_lock__, EVERY spawnable writer refuses to start — optimize/cleanup/ensure_indices/repair via ensure_schema_apply_idle, and mutate/load via ensure_schema_apply_not_locked (table_ops.rs:511). So BOTH failpoint tests rested on a false premise (a concurrent writer running during a parked apply); neither could pass. The only reachable race is a writer already in flight before the apply took the lock — serialized by heal_pending_recovery_sidecars + the per-table queues (touched tables), or made benign by the lease removal (untouched tables). A deterministic in-process failpoint test could only assert the unreachable case, so it adds no value. - Remove both flawed tests and the now-unused post_snapshot_pre_commit failpoint (after_queue_acquire_pre_snapshot was already retired). - Bugbot low-severity note: add a comment that the queue keys use branch=None because schema apply only runs on a main-only graph (table_branch is always None), so the key matches any concurrent writer's. - Document the concurrency reachability + the no-in-process-test rationale (writes.md), and correct the truth-matrix row (invariants.md) and the failpoints coverage note (testing.md). The correct-by-design code fix (queue-before-snapshot, lease removal, per-table expected-version CAS) is unchanged and is exercised by the existing single-threaded schema_apply.rs apply/rewrite/publish tests.

…eout loop The rustfs_integration job cold-compiles each of its 5 sequential s3 suites and was hitting the 75-min limit on the last suite (recovery-sidecar lifecycle), cancelling the job before the Post-Cache step. That meant the cache was never saved, so the next run started cold again — a timeout loop that never reached a green rustfs run (observed repeatedly on #254: storage/server-smoke/cluster-e2e/ CLI-smoke all pass, recovery-sidecar always cut off at 75min). 120 gives a cold run room to finish all 5 suites and persist the cache; warm runs then complete in ~15min, far under the limit. Splitting the suites into parallel jobs is the longer-term fix (dev-graph iss-rustfs-s3-cold-timeout-loop).

Brings in #268 (perf: warm-read metadata cache), #278 (compiler NanoError rename), the CODEOWNERS removal, and the docs sweep. One conflict, in schema_apply.rs: #268 made a rustfmt one-line collapse of the serial-key push inside the queue-acquire block that bug-4 RELOCATED to before the snapshot. Kept bug-4's relocation (the block now lives earlier; the pointer comment stays here); #268's reformat of the now-moved line is moot. Verified compile-compatible: the CAS methods call the unchanged record_graph_commit + manifest.commit_changes_with_expected; #268's record_graph_commit change is internal (new e_tag-aware SnapshotId::synthetic), not a signature change. #278's NanoError rename touched only compiler-crate files and auto-merged.

ragnorc added 3 commits June 15, 2026 22:00

ragnorc requested a review from aaltshuler as a code owner June 15, 2026 20:00

cursor Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread crates/omnigraph/tests/failpoints.rs Outdated

Merge branch 'main' into ragnorc/bug-4-schema-apply-occ

0fbdafa

cursor Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread crates/omnigraph/tests/failpoints.rs Outdated

ragnorc and others added 2 commits June 15, 2026 22:44

Merge branch 'main' into ragnorc/bug-4-schema-apply-occ

7da260b

cursor Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread crates/omnigraph/src/db/omnigraph/schema_apply.rs

ragnorc and others added 4 commits June 16, 2026 16:22

Merge branch 'main' into ragnorc/bug-4-schema-apply-occ

aab0f05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(schema-apply): concurrent optimize must not false-fail a schema apply (iss-schema-apply-optimize-occ)#254

fix(schema-apply): concurrent optimize must not false-fail a schema apply (iss-schema-apply-optimize-occ)#254
ragnorc wants to merge 11 commits into
mainfrom
ragnorc/bug-4-schema-apply-occ

ragnorc commented Jun 15, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ragnorc commented Jun 15, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

The bug

The fix (mirror what optimize already does)

Proof (deterministic failpoint tests)

Commits

Notes

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ragnorc commented Jun 15, 2026 •

edited by greptile-apps Bot

Loading

The fix (mirror what `optimize` already does)