Skip to content

Fix race condition in RINS and lost nodes in branch and bound#1138

Open
hlinsen wants to merge 4 commits intoNVIDIA:mainfrom
hlinsen:fix-bb-rins
Open

Fix race condition in RINS and lost nodes in branch and bound#1138
hlinsen wants to merge 4 commits intoNVIDIA:mainfrom
hlinsen:fix-bb-rins

Conversation

@hlinsen
Copy link
Copy Markdown
Contributor

@hlinsen hlinsen commented Apr 23, 2026

Attempt at fixing a rare benchmark bug on H100.
Fixes:

  • Race condition on rins::node_callback and rins::stop
  • Re-stack node on iteration and concurrent limit

@hlinsen hlinsen requested review from aliceb-nv and nguidotti April 23, 2026 14:22
@hlinsen hlinsen requested a review from a team as a code owner April 23, 2026 14:22
@hlinsen hlinsen added the bug Something isn't working label Apr 23, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12a83088-ddf2-4443-9155-29e586d16749

📥 Commits

Reviewing files that changed from the base of the PR and between 0fed0c7 and aee6ec8.

📒 Files selected for processing (2)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/mip_heuristics/diversity/lns/rins.cu
✅ Files skipped from review due to trivial changes (1)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/mip_heuristics/diversity/lns/rins.cu

📝 Walkthrough

Walkthrough

Branch-and-bound plunge loop now marks and requeues pending nodes when LP returns concurrency/iteration limits. RINS gains a destructor, an incumbent-solution snapshot, and tightened thread synchronization: node callbacks, enable/stop, and run logic are reworked to use mutex-protected state and snapshots.

Changes

Branch-and-bound LP limit handling

Layer / File(s) Summary
Control flag / Local state
cpp/src/branch_and_bound/branch_and_bound.cpp
Adds local requeue_pending_nodes boolean in plunge_with.
Early-exit handling
cpp/src/branch_and_bound/branch_and_bound.cpp
On dual::status_t::CONCURRENT_LIMIT/ITERATION_LIMIT push current node back to local stack, set requeue_pending_nodes = true, and break from loop.
Flush / Requeue logic
cpp/src/branch_and_bound/branch_and_bound.cpp
Post-loop, broaden condition to re-queue remaining local stack to node_queue_ when gap tolerances are met OR requeue_pending_nodes is true; comment updated accordingly.

RINS thread lifecycle & snapshotting

Layer / File(s) Summary
API / Data members
cpp/src/mip_heuristics/diversity/lns/rins.cuh
Declare ~rins_t(); and add std::vector<f_t> incumbent_solution_snapshot;.
Destructor / Teardown
cpp/src/mip_heuristics/diversity/lns/rins.cu
Define ~rins_t() to call stop_rins().
Node callback / Snapshotting
cpp/src/mip_heuristics/diversity/lns/rins.cu
Rewrite node_callback to use rins_mutex-protected eligibility checks, snapshot best_feasible() host assignment into incumbent_solution_snapshot, copy callback solution into lp_optimal_solution, and start CPU solver.
Enable / Start
cpp/src/mip_heuristics/diversity/lns/rins.cu
Move thread creation/activation and problem copy under rins_mutex; set enabled = true while holding lock.
Stop / Termination
cpp/src/mip_heuristics/diversity/lns/rins.cu
Under rins_mutex set enabled = false and move rins_thread out; request termination outside the lock.
Run / Launch-time seeding
cpp/src/mip_heuristics/diversity/lns/rins.cu
At run_rins() launch, seed best_sol from incumbent_solution_snapshot and check feasibility instead of locking and copying live dm.population.best_feasible().

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: fixing a race condition in RINS and addressing lost nodes in branch and bound.
Description check ✅ Passed The description is related to the changeset, outlining specific fixes for race conditions and node re-stacking logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cpp/src/branch_and_bound/branch_and_bound.cpp (1)

1485-1492: Optional cleanup: merge identical limit branches.

Both status branches perform the same operations; collapsing them reduces maintenance drift.

♻️ Proposed simplification
-    } else if (lp_status == dual::status_t::CONCURRENT_LIMIT) {
-      stack.push_front(node_ptr);
-      requeue_pending_nodes = true;
-      break;
-    } else if (lp_status == dual::status_t::ITERATION_LIMIT) {
+    } else if (lp_status == dual::status_t::CONCURRENT_LIMIT ||
+               lp_status == dual::status_t::ITERATION_LIMIT) {
       stack.push_front(node_ptr);
       requeue_pending_nodes = true;
       break;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/branch_and_bound/branch_and_bound.cpp` around lines 1485 - 1492, The
two consecutive branches checking lp_status == dual::status_t::CONCURRENT_LIMIT
and lp_status == dual::status_t::ITERATION_LIMIT perform identical actions;
collapse them into a single condition to avoid duplication. Replace the two
else-if blocks with a single branch that checks if lp_status is either
dual::status_t::CONCURRENT_LIMIT or dual::status_t::ITERATION_LIMIT, then call
stack.push_front(node_ptr), set requeue_pending_nodes = true, and break;
referencing the existing lp_status variable and the
stack.push_front(node_ptr)/requeue_pending_nodes/break sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 1485-1492: The two consecutive branches checking lp_status ==
dual::status_t::CONCURRENT_LIMIT and lp_status ==
dual::status_t::ITERATION_LIMIT perform identical actions; collapse them into a
single condition to avoid duplication. Replace the two else-if blocks with a
single branch that checks if lp_status is either
dual::status_t::CONCURRENT_LIMIT or dual::status_t::ITERATION_LIMIT, then call
stack.push_front(node_ptr), set requeue_pending_nodes = true, and break;
referencing the existing lp_status variable and the
stack.push_front(node_ptr)/requeue_pending_nodes/break sequence.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a1bddad0-abbf-4409-901c-017c5fabffd6

📥 Commits

Reviewing files that changed from the base of the PR and between 55fb9b9 and 0fed0c7.

📒 Files selected for processing (3)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/mip_heuristics/diversity/lns/rins.cu
  • cpp/src/mip_heuristics/diversity/lns/rins.cuh

Copy link
Copy Markdown
Contributor

@nguidotti nguidotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for the fix, Hugo.

@anandhkb anandhkb added this to the 26.06 milestone Apr 28, 2026
@hlinsen hlinsen added the non-breaking Introduces a non-breaking change label May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants