
fix(jobs): don't fail jobs prematurely, but fail stalled jobs sooner (3 days -> 10 min)#1235

Merged
mihow merged 2 commits into main from fix/stalled-job-cutoff-and-progress-coercion
Apr 15, 2026

Conversation

@mihow
Collaborator

@mihow mihow commented Apr 15, 2026

Plain language

Two related fixes for the bug where async ML jobs sometimes flip to FAILURE in the first ~30-55 seconds and disappear from the UI while the processing service is still actively returning results.

What was happening:

  1. When a worker reports a result, we record progress in two places — Redis (a remaining-images counter) and the database (a stage's percentage and status). If 1-2 of the first few results error, the overall failed/total ratio temporarily exceeds 50%, and the code immediately stamps the stage's status as FAILURE — even though the stage is only ~5% done.
  2. Then any time the Job is saved, Job.update_progress notices "this stage has a final status but only 5% progress — that doesn't make sense, let me bump it to 100%."
  3. Now the stage looks 100% complete with FAILURE status. is_complete() returns True. Cleanup fires. NATS connection torn down. Job marked FAILURE. The remaining 95% of results that arrive over the next minute have nowhere to go.
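
The early-trip condition in step 1 can be sketched as follows. This is a hypothetical simplification (the function name, and the 0.5 threshold value implied by "exceeds 50%", are assumptions, not the actual code): a fixed failed/total ratio check is inherently noisy when `total` is small, which is exactly why 2 errors out of the first 3 results trip it.

```python
FAILURE_THRESHOLD = 0.5  # assumed value, per the "exceeds 50%" description above


def should_mark_failure(failed: int, total: int) -> bool:
    """Hypothetical sketch of the condition that stamps a stage FAILURE.

    Early in a job, `total` is tiny, so a couple of errors dominate the
    ratio: 2 failures in the first 3 results gives 0.67 > 0.5, even
    though the stage is only ~5% done overall.
    """
    return total > 0 and failed / total > FAILURE_THRESHOLD
```

With 2 errors in the first 3 results this returns True; with the same 2 errors spread over 100 results it returns False, which is why the ratio is only meaningful once enough of the job has been measured.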

The "is the stage at 100%?" check was honest. The code that wrote 100% was lying — it was a save-time janitor that "fixed" a perceived inconsistency by silently rewriting measured data.

Fix 1 — stop lying about progress. Remove the line that bumps stage.progress = 1 when status is final. Progress is a measurement; if it's 30%, it's 30%, even if status is FAILURE. The stage will reach 100% honestly when SREM drains the pending set, at which point the FAILURE trip works correctly.
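
A minimal sketch of the invariant Fix 1 restores, using plain dataclasses rather than the real Django models (the names `Stage`, `update_progress`, and `is_complete` mirror the description above but are simplified stand-ins): progress is only ever written from measurement, never coerced from status. The buggy version effectively did `if stage.status in FINAL_STATES: stage.progress = 1` inside the update path; that line is the one removed.

```python
from dataclasses import dataclass

FINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


@dataclass
class Stage:
    progress: float  # measured fraction complete, 0.0-1.0
    status: str      # may be a final status while progress < 1.0


def update_progress(stages: list[Stage]) -> float:
    """Recompute overall progress from measured stage progress only.

    Note what is deliberately absent: no branch that rewrites
    stage.progress because stage.status is final. A stage can honestly
    be 5% done with status FAILURE; it reaches 100% only when the
    pending set actually drains.
    """
    return sum(s.progress for s in stages) / len(stages)


def is_complete(stages: list[Stage]) -> bool:
    # Honest check: complete means measured progress hit 100% everywhere.
    return all(s.progress >= 1.0 for s in stages)
```

Under this invariant, a stage at 5% with status FAILURE no longer makes `is_complete()` return True, so cleanup cannot fire mid-flight.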

Fix 2 — catch genuinely stuck jobs faster. Without the false-positive coercion, a job whose workers genuinely stop pulling messages would sit visible as STARTED for 3 days before the reaper noticed. Two pieces:

  • The reaper threshold was reusing a constant called FAILED_CUTOFF_HOURS that was originally added to hide old failed jobs from UI listings (PR Error handling for jobs #368, 2022). Two unrelated concepts had collapsed into one 72-hour value. It is now split into FAILED_JOBS_DISPLAY_MAX_HOURS (UI hide, unchanged) and a new STALLED_JOBS_MAX_MINUTES (reaper, 10 min — a conservative starting value; raise it if legitimate long-running jobs get reaped).
  • The reaper already filters by Job.updated_at (added in feat(jobs): schedule periodic stale-job check #1227), which Django bumps on every save. A job actively making progress bumps its timestamp on every batch, so a healthy job is never reaped — only one whose worker pool stopped touching it.
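
The two bullets above combine into a minutes-based cutoff query. A sketch under stated assumptions — the real code is a Django ORM filter roughly like `Job.objects.filter(status__in=RUNNING_STATES, updated_at__lt=cutoff)`; this stand-in filters plain tuples so it runs without Django, and `find_stalled` is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone

STALLED_JOBS_MAX_MINUTES = 10           # new reaper threshold
FAILED_JOBS_DISPLAY_MAX_HOURS = 24 * 3  # UI hide cutoff, unchanged


def find_stalled(jobs, now=None, minutes=STALLED_JOBS_MAX_MINUTES):
    """Hypothetical stand-in for the reaper's query.

    `jobs` is a list of (status, updated_at) pairs. Because Django bumps
    updated_at on every save, a job that records any batch of results
    inside the window never matches — only one whose worker pool has
    stopped touching it.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=minutes)
    return [j for j in jobs if j[0] == "STARTED" and j[1] < cutoff]
```

The key property: staleness is measured against the last write, not job age, so a 2-hour job that is still saving progress is never a candidate.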

Reap-time diagnostic logging. When the reaper revokes a job it now logs a single WARN line with: minutes since last update vs threshold, previous_status, celery_state, dispatch_mode, and a per-stage progress/status summary. Pairs with the per-tick NATS consumer snapshots from PR #1227, so an operator can answer "why was this stalled?" by reading the revoke line plus the most recent running_job_snapshots entry for that job, without grepping back through hours of tick logs.
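
A sketch of what that single WARN line might look like. The field names mirror the description above (minutes vs threshold, previous_status, celery_state, dispatch_mode, per-stage summary), but the formatter name and exact message layout are assumptions, not the merged code:

```python
import logging

logger = logging.getLogger("ami.jobs.tasks")


def format_reap_warning(job_id, stalled_minutes, threshold_minutes,
                        previous_status, celery_state, dispatch_mode,
                        stages):
    """Hypothetical formatter for the revoke-time WARN line.

    `stages` is a list of (name, progress_fraction, status) tuples.
    Everything an operator needs for first-pass triage lands in one
    greppable line, paired with the most recent running_job_snapshots
    entry from the per-tick NATS snapshots.
    """
    stage_summary = ", ".join(
        f"{name}:{pct:.0%}/{status}" for name, pct, status in stages
    )
    return (
        f"Reaping stalled job {job_id}: "
        f"{stalled_minutes}m since last update (threshold {threshold_minutes}m), "
        f"previous_status={previous_status}, celery_state={celery_state}, "
        f"dispatch_mode={dispatch_mode}, stages=[{stage_summary}]"
    )


# At revoke time, the reaper would emit it as:
#     logger.warning(format_reap_warning(...))
```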

Net behavior change: a stuck running job gets reaped within ~10-25 min (one Beat tick) instead of 3 days, with enough log context at the moment of revocation to start triage. UI display of old failures is unchanged.

Files

  • ami/jobs/models.py — drop progress coercion, split constant
  • ami/jobs/tasks.py — check_stale_jobs(minutes=...) defaults to STALLED_JOBS_MAX_MINUTES; per-job WARN at revoke time
  • ami/jobs/views.py — listing filter uses FAILED_JOBS_DISPLAY_MAX_HOURS
  • ami/jobs/management/commands/update_stale_jobs.py — --hours becomes --minutes
  • ami/jobs/tests/test_jobs.py — regression test for the coercion fix
  • ami/jobs/tests/test_update_stale_jobs.py, test_periodic_beat_tasks.py — rescale fixtures to minutes

Test plan

  • python manage.py test ami.jobs --keepdb — 49 tests pass
  • Live e2e: dispatch an async_api job that produces early errors (PSv2 / cold ADC). Verify it does NOT flip to FAILURE in the first minute, runs to completion, and reports the actual failed-image count.
  • Live e2e: kill the ADC worker mid-job. Verify the job stays STARTED, no progress advances, and the next jobs_health_check Beat tick (≤15 min after last update) revokes it AND emits the new "Reaping stalled job" WARN with the diagnostic context.
  • Verify python manage.py update_stale_jobs --minutes 5 --dry-run still works.

Stacked on

Independent of #1234 (Fix 1: ACK/SREM ordering) — this branch is based on main, no shared commits. Either can merge first.


Footnote — future direction: managed retry pattern

Async ML jobs are designed to be idempotent (save_results dedupes on (detection, source_image), SREM is a no-op for already-removed ids, progress percentage is clamped by max()). Today, when the reaper marks a job REVOKED, recovery is manual: a user re-dispatches the same collection. Since the building blocks for safe retry already exist, a natural next step is to track retry attempts on the Job model and let the periodic jobs_health_check manage automatic re-dispatch within a budget — e.g. revoke → eligible-for-retry up to N times → terminally REVOKED with reason. Out of scope for this PR; worth a design discussion before implementing because it interacts with how user-facing retry UI and audit trails should work.
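
The revoke → eligible-for-retry → terminally-REVOKED budget described above could be sketched as a tiny state transition, to make the design-discussion starting point concrete. Everything here is hypothetical — none of these names exist in the codebase, and this PR deliberately does not implement it:

```python
MAX_RETRIES = 3  # hypothetical per-job retry budget


def next_state(status: str, retry_count: int,
               max_retries: int = MAX_RETRIES) -> tuple[str, int]:
    """Hypothetical transition the periodic health check could apply.

    A reaped (REVOKED) job with budget remaining becomes eligible for
    automatic re-dispatch; one that has exhausted the budget is parked
    terminally with a reason. Safe only because dispatch is idempotent
    (result dedupe, no-op SREM, max()-clamped progress).
    """
    if status == "REVOKED" and retry_count < max_retries:
        return ("RETRYING", retry_count + 1)
    if status == "REVOKED":
        return ("REVOKED_TERMINAL", retry_count)
    return (status, retry_count)  # healthy jobs are untouched
```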

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed an issue where failed job stage progress was incorrectly set to 100%, preventing proper job completion cleanup.
    • Stale job detection now uses minute-based timeout (10-minute default) instead of hour-based.
  • New Features

    • Enhanced logging for revoked stale jobs includes detailed diagnostics: previous status, execution state, stage progress, and timing information.

… stalled-job reaper

Two related fixes for the premature ``cleanup_async_job_resources`` path
that flips async_api jobs to FAILURE within ~30-55s of dispatch while
NATS is still delivering results:

1. ``Job.update_progress`` no longer silently bumps ``stage.progress = 1``
   when ``stage.status`` is in a final state. The bumped value made
   ``is_complete()`` return True and triggered cleanup mid-flight. The
   trigger was ``_update_job_progress`` writing ``status=FAILURE`` at
   partial progress as soon as ``failed/total`` crossed
   FAILURE_THRESHOLD (very easy early in a job — 1-2 errors out of the
   first few results). Progress is a measurement; leave it alone. The
   honest FAILURE trip happens when the stage actually reaches 100%.

2. Split ``Job.FAILED_CUTOFF_HOURS`` (originally added in PR #368 to
   hide failed jobs from UI listings after 3 days) into:
   - ``FAILED_JOBS_DISPLAY_MAX_HOURS = 24 * 3`` — original UI/API filter
   - ``STALLED_JOBS_MAX_MINUTES = 10`` — reaper threshold
   The 72h value was reused by PR #1227's stale-job reaper for an
   entirely different purpose, leaving stuck jobs invisible for 3 days.
   ``check_stale_jobs`` now defaults to 10 minutes against ``updated_at``
   (already in place from #1227), so a job whose worker pool stops
   pulling messages gets reaped within one Beat tick (~10-25 min).
   ``--hours`` arg on the management command becomes ``--minutes`` to
   match.

Together: fix #1 stops the false-positive FAILUREs; fix #2 ensures true
positives are caught quickly without depending on the misfiring coercion.

Co-Authored-By: Claude <[email protected]>
Copilot AI review requested due to automatic review settings April 15, 2026 18:58
@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 6218d38
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69dfe0edfe3d25000839b90d

@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 6218d38
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69dfe0ed7ea5bb0008af3a81

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 4686340 and 6218d38.

📒 Files selected for processing (7)
  • ami/jobs/management/commands/update_stale_jobs.py
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/tests/test_periodic_beat_tasks.py
  • ami/jobs/tests/test_update_stale_jobs.py
  • ami/jobs/views.py

📝 Walkthrough

Walkthrough

The PR migrates stale job detection from hours-based to minutes-based thresholds. Job model constants are renamed for clarity (FAILED_CUTOFF_HOURS → FAILED_JOBS_DISPLAY_MAX_HOURS), and a new STALLED_JOBS_MAX_MINUTES constant is introduced. Progress inflation logic in final job states is removed, and enhanced logging is added for revoked stale jobs.

Changes

Cohort / File(s) Summary
Job Model Constants & Logic
ami/jobs/models.py
Renamed failure retention constant for clarity, added stalled jobs minutes threshold, and removed logic that coerced stage progress to 1 in final states to prevent premature completion.
Stale Job Detection Task
ami/jobs/tasks.py
Updated check_stale_jobs signature from hours to minutes; enhanced per-job warning logs with stalled duration, stage summaries, and job state details during revocation.
Management Command & API
ami/jobs/management/commands/update_stale_jobs.py, ami/jobs/views.py
Updated CLI option and API view to use new constant names and minute-based thresholds.
Test Updates
ami/jobs/tests/test_*.py
Added regression test for progress inflation fix; updated test helpers to use minute-based backdating instead of hour-based.

Poem

🐰 Minutes matter more than hours, we now say,
No progress inflation at the end of day,
Stale jobs logged with care before they're told to go,
Cleaner code, clearer thresholds—watch the garden grow!

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main changes: stopping premature job failures and making stalled-job detection faster (changing from 3 days to 10 minutes). It accurately summarizes the PR's primary objectives.
Description check ✅ Passed The description comprehensively follows the template with all major sections completed: Summary, List of Changes, Files section, Test Plan, and Stacked on notes. It includes detailed explanation of the problem, fixes, implementation details, and testing approach.


Contributor

Copilot AI left a comment


Pull request overview

Fixes a premature cleanup failure mode in async ML jobs by ensuring stage progress remains a truthful measurement (no save-time coercion to 100%), and tightens the stale-job reaper threshold by splitting the former “failed display cutoff” constant into separate UI and operational cutoffs.

Changes:

  • Remove save-time coercion that forced stages in final status to progress=1.0, preventing premature is_complete() / cleanup.
  • Split the old FAILED_CUTOFF_HOURS concept into FAILED_JOBS_DISPLAY_MAX_HOURS (UI listing filter) and STALLED_JOBS_MAX_MINUTES (reaper), and rescale stale-job logic to minutes.
  • Update management command/tests to use minutes-based staleness and add a regression test for the progress-coercion bug.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
ami/jobs/models.py Removes progress inflation on save; introduces separate UI vs reaper cutoffs.
ami/jobs/tasks.py Updates check_stale_jobs to use minutes and default to STALLED_JOBS_MAX_MINUTES; updates related docstring.
ami/jobs/views.py Job listing cutoff default now uses the UI-only failed-job display constant.
ami/jobs/management/commands/update_stale_jobs.py Switches CLI flag/behavior from hours to minutes for stale-job reconciliation.
ami/jobs/tests/test_jobs.py Adds regression coverage to ensure partial FAILURE progress is not inflated on save.
ami/jobs/tests/test_update_stale_jobs.py Updates stale-job fixtures from hours to minutes.
ami/jobs/tests/test_periodic_beat_tasks.py Updates beat-task stale-job fixtures from hours to minutes.


Comment thread — ami/jobs/management/commands/update_stale_jobs.py
When the reaper revokes a job, log a single WARN line capturing the
state needed to triage "why was this stalled?" without grepping back
through prior tick logs:

  - minutes since last update vs threshold
  - previous_status, dispatch_mode, celery_state
  - per-stage progress + status summary
  - pointer to running_job_snapshots for prior NATS consumer state

Pairs with PR #1227's per-tick consumer snapshots: every 15 min, each
running async_api job gets a NATS state snapshot; this new line tells
operators which of those snapshots is the last one before revocation.

Co-Authored-By: Claude <[email protected]>
@mihow mihow changed the title fix(jobs): stop bumping partial progress to 100%, reap stalled jobs after 10 min fix(jobs): don't fail jobs prematurely, but fail stalled jobs sooner (3 days -> 10 min) Apr 15, 2026
@mihow
Collaborator Author

mihow commented Apr 15, 2026

@coderabbitai ready for a review

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

@mihow Sure, I'll kick off a full review of the PR now!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mihow mihow merged commit af2f688 into main Apr 15, 2026
7 checks passed
@mihow mihow deleted the fix/stalled-job-cutoff-and-progress-coercion branch April 15, 2026 20:04