
feat: add missed block gap fill worker #60

Open
pthmas wants to merge 2 commits into main from pthmas/missed-blocks-refetch-worker

Conversation

@pthmas
Collaborator

@pthmas pthmas commented Apr 14, 2026

Summary

  • add a background gap-fill worker that retries missed blocks from failed_blocks with backoff
  • make recovered block writes clear failed_blocks rows atomically and refresh the missing-blocks metric
  • add focused integration coverage for recovery, retry backoff, and missing-block metric updates

Summary by CodeRabbit

  • New Features
    • Automatic background recovery of failed block indexing with exponential backoff retry logic.

Closes #57

@coderabbitai

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough


This PR introduces a GapFillWorker background service that periodically retries blocks recorded in a failed_blocks table using exponential backoff. It refactors indexer methods for code reuse, adds wiremock testing dependency, and includes comprehensive integration tests with mocked RPC responses.
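
For illustration, here is a minimal sketch of how exponential-backoff eligibility for a failed_blocks row could be computed from its retry_count and the time since last_failed_at. The base delay, the cap, and the function names are assumptions made for this example; the PR summary does not state the values or helpers the worker actually uses.

```rust
use std::time::Duration;

/// Hypothetical base delay and cap; the real values are not given in this PR summary.
const BASE_DELAY: Duration = Duration::from_secs(30);
const MAX_DELAY: Duration = Duration::from_secs(60 * 60);

/// Delay required after `retry_count` failed attempts: BASE_DELAY * 2^retry_count, capped.
pub fn backoff_delay(retry_count: u32) -> Duration {
    // checked_shl returns None once the shift reaches the bit width; treat that as "huge".
    let factor = 1u64.checked_shl(retry_count).unwrap_or(u64::MAX);
    // saturating_mul avoids overflow for large retry counts.
    let secs = BASE_DELAY.as_secs().saturating_mul(factor);
    Duration::from_secs(secs).min(MAX_DELAY)
}

/// A row is eligible for another attempt once enough time has passed since its last failure.
pub fn is_eligible(retry_count: u32, since_last_failure: Duration) -> bool {
    since_last_failure >= backoff_delay(retry_count)
}
```

Capping the delay keeps a block that has failed many times from being postponed indefinitely, at the cost of continuing to spend RPC calls on it.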

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependencies<br>backend/Cargo.toml, backend/crates/atlas-server/Cargo.toml | Added wiremock = "0.6" workspace dependency and referenced it as dev-dependency in atlas-server crate. |
| GapFillWorker Implementation<br>backend/crates/atlas-server/src/indexer/gap_fill_worker.rs | New module implementing background retry loop that processes failed blocks with exponential backoff, fetches blocks via RPC with rate-limiting, writes recovered data, updates metrics, and broadcasts notifications. |
| Indexer Refactoring<br>backend/crates/atlas-server/src/indexer/indexer.rs | Exposed connect_copy_client, collect_block, and write_batch as pub(crate). Added write_batch_and_clear_failed_block() and extracted partition logic into free function ensure_partitions_exist() for code reuse by GapFillWorker. |
| Module Exposure<br>backend/crates/atlas-server/src/indexer/mod.rs, backend/crates/atlas-server/src/main.rs | Added gap_fill_worker module declaration and re-export. Instantiated GapFillWorker in run() with DB pool, RPC configuration, broadcast sender, and metrics; spawned as retry-wrapped Tokio task. |
| Integration Tests<br>backend/crates/atlas-server/tests/integration/gap_fill.rs, backend/crates/atlas-server/tests/integration/main.rs | Added comprehensive test suite with wiremock-mocked RPC server, testing successful recovery, metric updates, RPC failures, and exponential backoff enforcement. Serial execution via global mutex ensures test isolation. |

Sequence Diagram

sequenceDiagram
    participant GW as GapFillWorker
    participant DB as Database<br/>(PgPool)
    participant RPC as RPC Server
    participant IDX as Indexer
    participant MET as Metrics
    participant BC as Broadcast<br/>Channel

    loop run() continuous retry
        GW->>DB: SELECT failed_blocks with<br/>backoff eligibility
        alt Blocks found
            GW->>RPC: Fetch block data<br/>(rate-limited)
            alt Success
                GW->>IDX: write_batch_and_clear_<br/>failed_block()
                IDX->>DB: ensure_partitions_exist()
                IDX->>DB: COPY data + DELETE<br/>from failed_blocks
                GW->>DB: Count missing blocks
                GW->>MET: Update<br/>missing_blocks gauge
                GW->>BC: Broadcast recovery<br/>notification
            else RPC/Write Failure
                GW->>DB: Increment retry_count<br/>Update last_failed_at
            end
        else No Blocks
            GW->>GW: Sleep (idle duration)
        end
    end
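
The success and failure branches of the diagram can be sketched with an in-memory stand-in for the failed_blocks table. This is a simplified model, not the worker's code: the `FailedBlock` struct, the `FailedBlocks` map, the integer timestamp, and the `fetch` closure are all hypothetical, and backoff filtering, partition handling, metrics, and the broadcast are left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory row; the column names follow the diagram.
#[derive(Debug, Clone, PartialEq)]
pub struct FailedBlock {
    pub retry_count: u32,
    pub last_failed_at: u64, // seconds since an arbitrary epoch, for illustration only
}

/// Stand-in for the failed_blocks table, keyed by block number.
pub type FailedBlocks = BTreeMap<u64, FailedBlock>;

/// Runs one pass over all recorded blocks and returns how many were recovered.
/// `fetch` stands in for "fetch via RPC and write the block"; Err means either step failed.
pub fn process_batch<F>(failed: &mut FailedBlocks, now: u64, mut fetch: F) -> usize
where
    F: FnMut(u64) -> Result<(), String>,
{
    // Snapshot the keys first so the map can be mutated while iterating.
    let block_numbers: Vec<u64> = failed.keys().copied().collect();
    let mut succeeded = 0;
    for number in block_numbers {
        match fetch(number) {
            // Success: the row is cleared, mirroring the DELETE from failed_blocks.
            Ok(()) => {
                failed.remove(&number);
                succeeded += 1;
            }
            // Failure: bump retry_count and record when the attempt failed.
            Err(_) => {
                if let Some(row) = failed.get_mut(&number) {
                    row.retry_count += 1;
                    row.last_failed_at = now;
                }
            }
        }
    }
    succeeded
}
```

In this model a caller would refresh the missing-blocks gauge only when the returned count is greater than zero, which matches the `succeeded > 0` condition mentioned in the review comments.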

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tac0turtle

Poem

🐰 A worker hops through failed blocks with care,
Retrying with backoff—exponentially fair,
RPC calls fetch what once slipped away,
While partitions align and metrics display,
Lost data recovers, hop-hop-hooray! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: add missed block gap fill worker' accurately and specifically describes the main change: a new background worker that handles recovery of missed blocks using a gap-fill strategy. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
backend/crates/atlas-server/src/indexer/gap_fill_worker.rs (2)

103-111: Consider persisting copy_client across batches for connection reuse.

Currently, connect_copy_client is called on every process_batch() invocation, establishing a new TCP connection each time. For a worker that runs infrequently (5-minute idle sleep), this overhead is acceptable, but persisting the connection in the struct would reduce latency and resource usage.

This is a minor optimization that could be addressed in a follow-up if gap-fill volume increases.
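
One possible shape for this suggestion is a small slot that connects lazily and can be invalidated after a connection error. This is a generic sketch only: `ClientSlot` and its methods are hypothetical names, and the `connect` closure stands in for a call to connect_copy_client, whose real signature is not shown in this thread.

```rust
/// Holds an optional client that is created on first use and reused afterwards.
pub struct ClientSlot<C> {
    client: Option<C>,
}

impl<C> ClientSlot<C> {
    pub fn new() -> Self {
        Self { client: None }
    }

    /// Returns the cached client, calling `connect` only when none is stored.
    /// If `connect` fails, the slot stays empty and the error is returned.
    pub fn get_or_connect<E>(
        &mut self,
        connect: impl FnOnce() -> Result<C, E>,
    ) -> Result<&mut C, E> {
        if self.client.is_none() {
            self.client = Some(connect()?);
        }
        Ok(self.client.as_mut().expect("client was set above"))
    }

    /// Drops the cached client so the next call reconnects,
    /// for example after a write fails with a connection error.
    pub fn invalidate(&mut self) {
        self.client = None;
    }

    pub fn is_connected(&self) -> bool {
        self.client.is_some()
    }
}
```

A worker following this pattern would hold the slot as a struct field, call `get_or_connect` at the start of each batch, and call `invalidate` when a write reports a broken connection.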

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs` around lines 103
- 111, Persist the copy client instead of creating it every process_batch call:
add a field (e.g., copy_client: Option<CopyClient> or Arc<CopyClient>) to the
worker struct used in gap_fill_worker.rs, initialize it lazily by calling
Indexer::connect_copy_client(...) once (e.g., in the worker constructor or on
first process_batch run) and reuse that instance in subsequent process_batch
invocations, and handle reconnects by detecting connection errors from the
existing copy_client and re-calling Indexer::connect_copy_client to replace it;
reference the existing connect_copy_client function and the local variable
copy_client in the diff to locate where to change.

177-180: Consider using a more efficient query for a frequently accessed metric.

SELECT COUNT(*) FROM failed_blocks is O(n), but since failed_blocks is expected to be small (only unrecoverable blocks), this is acceptable. If the table grows large, consider using pg_class.reltuples as done elsewhere in the codebase.
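
A sketch of the estimate-then-fallback decision, assuming the caller has already read reltuples for failed_blocks from pg_class. The function name, the threshold parameter, and the `exact_count` closure (standing in for the SELECT COUNT(*) query) are hypothetical. It also assumes that a negative reltuples means the table has not been analyzed yet, as recent PostgreSQL versions report.

```rust
/// Chooses between the planner's row estimate and an exact count.
/// `reltuples` is the value read from pg_class for the table.
/// `exact_count` runs the precise (and more expensive) COUNT(*) query.
pub fn missing_block_count(
    reltuples: f32,
    exact_threshold: i64,
    exact_count: impl FnOnce() -> i64,
) -> i64 {
    let estimate = reltuples.round() as i64;
    // Unknown estimate (negative) or a small table: the exact count is cheap enough.
    if reltuples < 0.0 || estimate < exact_threshold {
        exact_count()
    } else {
        // Large table: accept the possibly stale estimate to avoid a full scan.
        estimate
    }
}
```

Because the estimate is refreshed only by vacuum or analyze, a gauge fed by it can lag behind the true row count, which is one reason to keep the exact path for small tables.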

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs` around lines 177
- 180, The current code calls get_missing_block_count() after successful
gap-fill which likely runs a COUNT(*) on failed_blocks (used by
set_indexer_missing_blocks); replace or guard that expensive COUNT query with a
cheaper estimate using pg_class.reltuples like elsewhere in the codebase when
table size is large: update the get_missing_block_count() implementation (or add
a new helper) to prefer querying pg_class.reltuples for relname =
'failed_blocks' and fall back to SELECT COUNT(*) only for small tables or when a
precise value is required, and ensure the call sites (including where succeeded
> 0 and set_indexer_missing_blocks is invoked) use the updated helper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e20ebd0e-64ce-40b7-a000-9cd68f12f743

📥 Commits

Reviewing files that changed from the base of the PR and between e1b3e31 and c6d38a5.

📒 Files selected for processing (8)
  • backend/Cargo.toml
  • backend/crates/atlas-server/Cargo.toml
  • backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/indexer/mod.rs
  • backend/crates/atlas-server/src/main.rs
  • backend/crates/atlas-server/tests/integration/gap_fill.rs
  • backend/crates/atlas-server/tests/integration/main.rs


Development

Successfully merging this pull request may close these issues.

feat: gap-fill sync mode
