Skip to content

fix(schema-apply): concurrent optimize must not false-fail a schema apply (iss-schema-apply-optimize-occ)#254

Open
ragnorc wants to merge 11 commits into
mainfrom
ragnorc/bug-4-schema-apply-occ
Open

fix(schema-apply): concurrent optimize must not false-fail a schema apply (iss-schema-apply-optimize-occ)#254
ragnorc wants to merge 11 commits into
mainfrom
ragnorc/bug-4-schema-apply-occ

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

What

The follow-up correctness-by-design fix to the iss-848 index-reconciler work (#246). It instantiates the same governing principle — logical contract over physical state: a physical operation must never fail a logical one.

The bug

A periodic optimize (Lance compaction) running concurrently with a schema apply false-failed the apply. optimize advances a table's manifest version; schema apply captured its snapshot first, then a graph-global "write lease" check compared the whole manifest version against the pre-apply version and aborted (schema apply lost its write lease: main advanced from vN to vM …) whenever any table moved — even a table the apply never touched. Operators had to wrap cluster refresh + retry around apply.

The fix (mirror what optimize already does)

  1. classify_plan_steps — derive the touched-table sets from the plan snapshot-free (pure schema_table_key), so the write queues can be acquired before the snapshot.
  2. Acquire the touched-existing-table write queues (rewritten ∪ rename-source ∪ dropped) BEFORE the snapshot, then snapshot under the held queues — the same queue-before-read order optimize uses. No in-process optimize/mutate can move a touched table during the apply, so the per-table preconditions pass by construction.
  3. Remove the graph-global write-lease check. A compaction of an untouched table is a disjoint manifest row the publisher rebases onto cleanly; it no longer fails the apply. A compaction of a touched table serializes behind it.
  4. Publish with a per-table expected-version CAS (commit_changes_with_expected_with_actor). The held queues make this pass in-process; the CAS is the loud cross-process backstop (typed ExpectedVersionMismatch instead of a silent clobber).

Operations that don't conflict semantically no longer conflict mechanically.

Proof (deterministic failpoint tests)

Via the new schema_apply.after_queue_acquire_pre_snapshot / schema_apply.post_snapshot_pre_commit failpoints:

  • schema_apply_succeeds_under_concurrent_write_to_untouched_table — lease-removal proof: an untouched table's version advancing mid-apply no longer false-fails it.
  • concurrent_optimize_on_touched_table_serializes_behind_schema_apply — queue-ordering proof: optimize of a touched table blocks behind the apply, then runs over the post-apply layout.

Commits

  1. fix(schema-apply) — engine (schema_apply.rs reorder + lease removal + CAS; graph_coordinator.rs CAS wrappers).
  2. test(schema-apply) — the two failpoint proofs.
  3. docs(schema-apply)writes.md concurrency subsection, invariants.md truth-matrix row + narrowed gap, testing.md.

Notes


Note

Medium Risk
Changes schema-apply serialization and manifest publish preconditions on a critical migration path; design mirrors optimize and adds typed per-table CAS, but cross-process recovery/staging races remain documented.

Overview
Fixes iss-schema-apply-optimize-occ: concurrent optimize could abort schema apply with a graph-global “write lease” error when any table’s manifest version moved, even tables the migration never touched.

Engine: classify_plan_steps derives touched-table sets from the plan without a snapshot. apply_schema_with_lock then acquires write queues for existing tables (rewritten ∪ rename-source ∪ dropped, plus the schema-apply serial key when a sidecar is written) before snapshot—matching optimize’s queue-before-read order. Hard-drop cleanup URIs are resolved after the snapshot. The pre-publish global manifest version lease check is removed; publish uses commit_changes_with_expected_with_actor with per-table expected versions (graph_coordinator adds the manifest CAS wrappers).

CI: RustFS S3 integration job timeout 75 → 120 minutes so cold runs can finish all suites and save cache (avoids a timeout loop).

Docs: writes.md, invariants.md, and testing.md document the concurrency model and why an in-process failpoint race test is largely unreachable while __schema_apply_lock__ blocks new writers.

Reviewed by Cursor Bugbot for commit 508f13d. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR fixes a false-fail in schema apply caused by a concurrent optimize advancing the global manifest version for a table the migration never touched. The root cause was the old graph-global "write lease" check comparing the whole manifest version before and after the apply — any manifest advance (even on an unrelated table) would abort the apply with a spurious conflict.

  • classify_plan_steps is extracted to derive touched-table sets (added / renamed / rewritten / dropped) snapshot-free, enabling write-queue acquisition before the snapshot — the same queue-before-read order optimize already uses.
  • The global write-lease check is removed and replaced with a per-table expected-version CAS (commit_changes_with_expected_with_actor) at publish time. Compaction of an untouched table now advances the global manifest version without failing the apply; compaction of a touched table serializes behind the apply via the held write queues.
  • Two new pub(crate) helpers on GraphCoordinator mirror the existing commit_changes_with_actor pattern; a CI timeout bump (75 → 120 min) addresses a cold-cache timeout loop on the RustFS S3 job.

Confidence Score: 5/5

Safe to merge: the fix correctly narrows the concurrency conflict check from a graph-global manifest-version comparison to a per-table expected-version CAS, which removes the false-fail without introducing new clobber risk.

The queue-before-snapshot reordering, removal of the global write-lease check, and introduction of per-table CAS are all mechanically sound. The per-table queues prevent any in-process writer from moving a touched table between snapshot and publish, and the CAS is the correct loud backstop for cross-process movers. The two new GraphCoordinator helpers follow the established layering exactly. The writes_sidecar predicate is equivalent to the old guard. The acknowledged absence of a deterministic regression test is documented clearly and does not introduce any undetected defect path.

schema_apply.rs carries all the ordering and CAS logic and deserves the closest read; the GraphCoordinator additions are straightforward wrappers.

Important Files Changed

Filename Overview
crates/omnigraph/src/db/omnigraph/schema_apply.rs Core fix: extracts ClassifiedSteps (snapshot-free), acquires touched-table write queues before the snapshot, removes graph-global write-lease check, and switches to per-table expected-version CAS publish. Logic is sound; no regression test (documented as infeasible under the schema-apply lock).
crates/omnigraph/src/db/graph_coordinator.rs Adds two pub(crate) methods — commit_manifest_changes_with_expected and commit_changes_with_expected_with_actor — mirroring the existing commit_manifest_changes / commit_changes_with_actor pair. Pattern is correct; failpoint hook preserved.
.github/workflows/ci.yml RustFS S3 job timeout bumped 75 to 120 minutes with a detailed justification comment; addresses a cold-cache timeout loop.
docs/dev/writes.md Adds subsection documenting queue-before-snapshot, per-table CAS, and removal of the global write-lease check; explains why in-process failpoint tests are not feasible under the schema-apply lock.
docs/dev/invariants.md Adds Schema apply concurrency invariant row and narrows the multi-process OCC gap to the recovery-heal path.
docs/dev/testing.md Documents that the bug-4 fix has no new in-process failpoint tests and explains why every spawnable concurrent writer is excluded by the schema-apply lock.

Comments Outside Diff (2)

  1. crates/omnigraph/tests/failpoints.rs, line 662-674 (link)

    P2 Timing-sensitive synchronization before optimize spawn

    The 300ms sleep is the only mechanism ensuring the apply task has actually acquired the node:Person write queue before the optimize task is spawned. If plan_schema_for_apply, validate_catalog, or queue acquisition take longer than 300ms (e.g., on a cold or loaded CI runner), the optimize task could acquire the queue first, run to completion, and make the !optimize.is_finished() assertion spuriously fail.

    The first test in this pair (schema_apply_succeeds_under_concurrent_write_to_untouched_table) uses a more robust polling loop — it waits for the __recovery sidecar file to appear, which is written just after the snapshot and immediately before the post_snapshot_pre_commit park. For this test the park is after_queue_acquire_pre_snapshot, which fires before the sidecar is written, so the same file-existence signal isn't available. A comment acknowledging this timing assumption and the acceptable flake risk — or a tokio::sync::Notify / atomic set inside the failpoint hook if the infrastructure supports it — would make the constraint explicit.

    Fix in Claude Code

  2. crates/omnigraph/tests/failpoints.rs, line 519 (link)

    P2 Commit ordering inverts the AGENTS.md rule-12 test-first contract

    AGENTS.md rule 12 requires that for bug fixes the test commit lands before the fix commit ("the red → green pair is visible in git log"). The PR commits are ordered fix → test → docs, which means a reviewer checking out the test commit alone cannot reproduce the failure against the pre-fix engine. Reordering to test → fix → docs would make the regression proof self-contained in git history, as the rule intends.

    Context Used: CLAUDE.md (source)

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Claude Code

Reviews (8): Last reviewed commit: "Merge branch 'main' into ragnorc/bug-4-s..." | Re-trigger Greptile

ragnorc added 3 commits June 15, 2026 22:00
…ent optimize can't false-fail an apply (iss-schema-apply-optimize-occ)

A periodic `optimize` (Lance compaction) running concurrently with a
`schema apply` false-failed the apply. optimize advances a table's manifest
version; schema apply captured its snapshot first, then a graph-global "write
lease" check compared the whole manifest version against the pre-apply version
and aborted ("schema apply lost its write lease") whenever ANY table moved —
even a table the apply never touched.

Same governing principle as the iss-848 index work (logical contract over
physical state): a physical operation (compaction of a disjoint table) must not
fail a logical one (a migration). The fix mirrors what `optimize` already does:

- classify_plan_steps: derive the touched-table sets from the plan snapshot-free
  (pure schema_table_key), so the write queues can be acquired before the
  snapshot.
- Acquire the touched-existing-table write queues (rewritten ∪ rename-source ∪
  dropped) BEFORE the snapshot, then snapshot under the held queues. No
  in-process optimize/mutate can move a touched table during the apply, so the
  per-table preconditions pass by construction.
- Remove the graph-global write-lease check. A compaction of an untouched table
  is a disjoint manifest row the publisher rebases onto cleanly; it no longer
  fails the apply. A compaction of a touched table serializes behind it.
- Publish via commit_changes_with_expected_with_actor: a per-table
  expected-version CAS over the touched-existing tables. The held queues make
  this pass in-process; the CAS is the loud cross-process backstop (typed
  ExpectedVersionMismatch instead of a silent clobber).

"Operations that don't conflict semantically no longer conflict mechanically."
…-fail an apply

Two deterministic failpoint tests (the proof for iss-schema-apply-optimize-occ),
via the new schema_apply.after_queue_acquire_pre_snapshot and
schema_apply.post_snapshot_pre_commit failpoints:

- schema_apply_succeeds_under_concurrent_write_to_untouched_table: park the
  apply after its snapshot (pre-publish), advance an UNTOUCHED table's manifest
  version via a concurrent write, release → the apply publishes and succeeds
  (the removed write-lease check would have aborted it here). Re-open confirms
  the migration persisted and nothing recovery-shaped remains.
- concurrent_optimize_on_touched_table_serializes_behind_schema_apply: park the
  apply just after it acquires the touched-table queues; a concurrent optimize
  of that table blocks (is_finished() == false) rather than racing; once the
  apply releases its guards, optimize completes over the post-apply layout.

The failpoints are introduced by the fix, so these verify against it rather
than red-then-green on a prior commit; reverting the lease-removal + reorder
(keeping the failpoints) reproduces the original false-fail.
…ncurrency model

- writes.md: new "Schema apply concurrency" subsection (queue-before-snapshot,
  per-table CAS, no global lease; the two failpoint proofs).
- invariants.md: new Schema-apply-concurrency truth-matrix row; narrow the
  in-process recovery-serialization gap (the apply's publish is now CAS-fenced,
  so a foreign optimize of a touched table is loud, not silent).
- testing.md: list the two new failpoint tests + their failpoints.
@ragnorc ragnorc requested a review from aaltshuler as a code owner June 15, 2026 20:00
Resolved two conflicts against main (#253 schema-apply actor threading /
RFC-011 D10; #231 self-heal manifest-unreferenced branch forks):

- schema_apply.rs: my queue-before-snapshot reorder MOVED the touched-table
  queue-acquisition block to before the snapshot; #231's edit here was a
  comment-only update to that (now-relocated) block. Kept my relocation; #231's
  actual reconciler logic lives in recovery.rs/optimize.rs (auto-merged).
- invariants.md: add/add in Known Gaps — kept BOTH my sentence narrowing the
  recovery-serialization gap (the apply publish is now CAS-fenced) AND #231's
  new "Fork reclaim is in-process-safe only" bullet.

Verified the API surfaces this branch calls are unchanged post-merge:
write_queue::acquire_many, manifest::commit_changes_with_expected,
apply_schema_with_lock (no actor param — #253 enforces in the outer
apply_schema, publish stays system-attributed).
Comment thread crates/omnigraph/tests/failpoints.rs Outdated
Comment thread crates/omnigraph/tests/failpoints.rs Outdated
ragnorc and others added 2 commits June 15, 2026 22:44
…c (drop sleep race)

Cursor Bugbot (PR #254): concurrent_optimize_on_touched_table_serializes_behind_schema_apply
relied on a fixed 300ms sleep to ensure the apply had parked holding the
touched-table queue before optimize started. On slow CI the apply could still
be planning/acquiring, letting optimize win the queue race and finish first →
spurious `!optimize.is_finished()` failure.

Correct-by-design fix: park the apply at schema_apply.post_snapshot_pre_commit
(where its recovery sidecar is already on disk) rather than the pre-snapshot
point (no file artifact, which forced the sleep). The touched-table queues are
acquired BEFORE the snapshot and held to end-of-function, so at this park point
the apply provably holds the queue; poll for the sidecar as a deterministic
"parked, holding the queue" barrier before spawning optimize. optimize then
blocks until the apply releases — robust regardless of CI speed.

Retires the now-unused schema_apply.after_queue_acquire_pre_snapshot failpoint;
updates writes.md / testing.md to the single post_snapshot_pre_commit failpoint.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 7da260b. Configure here.

Comment thread crates/omnigraph/src/db/omnigraph/schema_apply.rs
ragnorc and others added 4 commits June 16, 2026 16:22
…ument why (Bugbot #254)

Cursor Bugbot flagged that concurrent_optimize_on_touched_table_serializes_behind_schema_apply
cannot prove queue serialization: optimize_all_tables calls ensure_schema_apply_idle,
which errors while the schema-apply lock branch exists, so a spawned optimize
returns a conflict instead of blocking on the touched-table queue.

Investigating confirmed a deeper problem: while an apply holds
__schema_apply_lock__, EVERY spawnable writer refuses to start —
optimize/cleanup/ensure_indices/repair via ensure_schema_apply_idle, and
mutate/load via ensure_schema_apply_not_locked (table_ops.rs:511). So BOTH
failpoint tests rested on a false premise (a concurrent writer running during a
parked apply); neither could pass. The only reachable race is a writer already
in flight before the apply took the lock — serialized by
heal_pending_recovery_sidecars + the per-table queues (touched tables), or
made benign by the lease removal (untouched tables). A deterministic in-process
failpoint test could only assert the unreachable case, so it adds no value.

- Remove both flawed tests and the now-unused post_snapshot_pre_commit
  failpoint (after_queue_acquire_pre_snapshot was already retired).
- Bugbot low-severity note: add a comment that the queue keys use branch=None
  because schema apply only runs on a main-only graph (table_branch is always
  None), so the key matches any concurrent writer's.
- Document the concurrency reachability + the no-in-process-test rationale
  (writes.md), and correct the truth-matrix row (invariants.md) and the
  failpoints coverage note (testing.md).

The correct-by-design code fix (queue-before-snapshot, lease removal, per-table
expected-version CAS) is unchanged and is exercised by the existing
single-threaded schema_apply.rs apply/rewrite/publish tests.
…eout loop

The rustfs_integration job cold-compiles each of its 5 sequential s3 suites and
was hitting the 75-min limit on the last suite (recovery-sidecar lifecycle),
cancelling the job before the Post-Cache step. That meant the cache was never
saved, so the next run started cold again — a timeout loop that never reached a
green rustfs run (observed repeatedly on #254: storage/server-smoke/cluster-e2e/
CLI-smoke all pass, recovery-sidecar always cut off at 75min).

120 gives a cold run room to finish all 5 suites and persist the cache; warm
runs then complete in ~15min, far under the limit. Splitting the suites into
parallel jobs is the longer-term fix (dev-graph iss-rustfs-s3-cold-timeout-loop).
Brings in #268 (perf: warm-read metadata cache), #278 (compiler NanoError
rename), the CODEOWNERS removal, and the docs sweep. One conflict, in
schema_apply.rs: #268 made a rustfmt one-line collapse of the serial-key push
inside the queue-acquire block that bug-4 RELOCATED to before the snapshot.
Kept bug-4's relocation (the block now lives earlier; the pointer comment
stays here); #268's reformat of the now-moved line is moot. Verified
compile-compatible: the CAS methods call the unchanged record_graph_commit +
manifest.commit_changes_with_expected; #268's record_graph_commit change is
internal (new e_tag-aware SnapshotId::synthetic), not a signature change.
#278's NanoError rename touched only compiler-crate files and auto-merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant