Conversation
… stalled-job reaper

Two related fixes for the premature `cleanup_async_job_resources` path that flips async_api jobs to FAILURE within ~30-55s of dispatch while NATS is still delivering results:

1. `Job.update_progress` no longer silently bumps `stage.progress = 1` when `stage.status` is in a final state. The bumped value made `is_complete()` return True and triggered cleanup mid-flight. The trigger was `_update_job_progress` writing `status=FAILURE` at partial progress as soon as `failed/total` crossed `FAILURE_THRESHOLD` (very easy early in a job — 1-2 errors out of the first few results). Progress is a measurement; leave it alone. The honest FAILURE trip happens when the stage actually reaches 100%.
2. Split `Job.FAILED_CUTOFF_HOURS` (originally added in PR #368 to hide failed jobs from UI listings after 3 days) into:
   - `FAILED_JOBS_DISPLAY_MAX_HOURS = 24 * 3` — original UI/API filter
   - `STALLED_JOBS_MAX_MINUTES = 10` — reaper threshold

   The 72h value was reused by PR #1227's stale-job reaper for an entirely different purpose, leaving stuck jobs invisible for 3 days. `check_stale_jobs` now defaults to 10 minutes against `updated_at` (already in place from #1227), so a job whose worker pool stops pulling messages gets reaped within one Beat tick (~10-25 min). The `--hours` arg on the management command becomes `--minutes` to match.

Together: fix #1 stops the false-positive FAILUREs; fix #2 ensures true positives are caught quickly without depending on the misfiring coercion.

Co-Authored-By: Claude <[email protected]>
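Fix #1 can be sketched in miniature. This is an illustrative reduction, not the shipped code: `Stage`, `FINAL_STATES`, and the function signatures are hypothetical stand-ins for the real `Job.update_progress` in `ami/jobs/models.py`.

```python
from dataclasses import dataclass

# Hypothetical miniature of the stage model; names are illustrative only.
FINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


@dataclass
class Stage:
    status: str = "STARTED"
    progress: float = 0.0  # measured fraction of results processed


def update_progress(stage: Stage, processed: int, total: int) -> None:
    """Record measured progress only; never coerce it.

    The removed behavior was equivalent to:
        if stage.status in FINAL_STATES:
            stage.progress = 1  # made is_complete() fire mid-flight
    """
    stage.progress = max(stage.progress, processed / total)


def is_complete(stage: Stage) -> bool:
    # The honest check: only trips once the stage truly reaches 100%.
    return stage.status in FINAL_STATES and stage.progress >= 1.0
```

With the coercion gone, a stage marked FAILURE at 3% stays at 3% and cleanup does not fire until results genuinely drain to 100%.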
📝 Walkthrough

The PR migrates stale job detection from hours-based to minutes-based thresholds, renames the Job model constants for clarity (`FAILED_CUTOFF_HOURS` is split into `FAILED_JOBS_DISPLAY_MAX_HOURS` and `STALLED_JOBS_MAX_MINUTES`), and removes the save-time progress coercion.
Pull request overview
Fixes a premature cleanup failure mode in async ML jobs by ensuring stage progress remains a truthful measurement (no save-time coercion to 100%), and tightens the stale-job reaper threshold by splitting the former “failed display cutoff” constant into separate UI and operational cutoffs.
Changes:
- Remove save-time coercion that forced stages in final status to `progress=1.0`, preventing premature `is_complete()` / cleanup.
- Split the old `FAILED_CUTOFF_HOURS` concept into `FAILED_JOBS_DISPLAY_MAX_HOURS` (UI listing filter) and `STALLED_JOBS_MAX_MINUTES` (reaper), and rescale stale-job logic to minutes.
- Update management command/tests to use minutes-based staleness and add a regression test for the progress-coercion bug.
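The split amounts to two unrelated cutoffs that previously shared one 72-hour value. The constant names and values below come from the PR text; pairing them with `timedelta` objects is just an illustration of how differently scaled the two queries are.

```python
from datetime import timedelta

# Constant names and values as described in the PR.
FAILED_JOBS_DISPLAY_MAX_HOURS = 24 * 3   # UI/API listing filter, unchanged at 72h
STALLED_JOBS_MAX_MINUTES = 10            # stalled-job reaper threshold

# The two cutoffs back different queries and must not share a value:
DISPLAY_CUTOFF = timedelta(hours=FAILED_JOBS_DISPLAY_MAX_HOURS)  # hide old failures
REAPER_CUTOFF = timedelta(minutes=STALLED_JOBS_MAX_MINUTES)      # revoke stuck jobs
```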
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| `ami/jobs/models.py` | Removes progress inflation on save; introduces separate UI vs reaper cutoffs. |
| `ami/jobs/tasks.py` | Updates `check_stale_jobs` to use minutes and default to `STALLED_JOBS_MAX_MINUTES`; updates related docstring. |
| `ami/jobs/views.py` | Job listing cutoff default now uses the UI-only failed-job display constant. |
| `ami/jobs/management/commands/update_stale_jobs.py` | Switches CLI flag/behavior from hours to minutes for stale-job reconciliation. |
| `ami/jobs/tests/test_jobs.py` | Adds regression coverage to ensure partial FAILURE progress is not inflated on save. |
| `ami/jobs/tests/test_update_stale_jobs.py` | Updates stale-job fixtures from hours to minutes. |
| `ami/jobs/tests/test_periodic_beat_tasks.py` | Updates beat-task stale-job fixtures from hours to minutes. |
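The CLI change in `update_stale_jobs.py` amounts to swapping the flag and its unit. A plain-argparse sketch of the idea (the real command wires its flags through Django's `BaseCommand.add_arguments`, and the default shown here is an assumption mirroring `STALLED_JOBS_MAX_MINUTES`):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch of the flag change: --hours becomes --minutes.
    parser = argparse.ArgumentParser(prog="update_stale_jobs")
    parser.add_argument(
        "--minutes",
        type=int,
        default=10,  # assumed to mirror STALLED_JOBS_MAX_MINUTES
        help="Reap running jobs not updated for this many minutes",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Report stalled jobs without revoking them",
    )
    return parser
```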
When the reaper revokes a job, log a single WARN line capturing the state needed to triage "why was this stalled?" without grepping back through prior tick logs:

- minutes since last update vs threshold
- previous_status, dispatch_mode, celery_state
- per-stage progress + status summary
- pointer to running_job_snapshots for prior NATS consumer state

Pairs with PR #1227's per-tick consumer snapshots: every 15 min, each running async_api job gets a NATS state snapshot; this new line tells operators which of those snapshots is the last one before revocation.

Co-Authored-By: Claude <[email protected]>
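A sketch of what that single WARN line could look like. The field names follow the list above, but the helper name, its signature, and the exact format string are assumptions, not the shipped code:

```python
import logging

logger = logging.getLogger("ami.jobs")


def log_stalled_job_revocation(
    job_id: int,
    minutes_since_update: float,
    threshold_minutes: int,
    previous_status: str,
    dispatch_mode: str,
    celery_state: str,
    stages: list,
) -> str:
    # Pack every triage field into one line so operators never grep tick logs.
    stage_summary = ", ".join(
        f"{s['name']}={s['status']}@{s['progress']:.0%}" for s in stages
    )
    msg = (
        f"Reaping stalled job {job_id}: {minutes_since_update:.1f}m since last "
        f"update (threshold {threshold_minutes}m), previous_status={previous_status}, "
        f"dispatch_mode={dispatch_mode}, celery_state={celery_state}, "
        f"stages=[{stage_summary}]; see running_job_snapshots for prior NATS state"
    )
    logger.warning(msg)
    return msg
```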
@coderabbitai ready for a review

✅ Actions performed: Review triggered.
Plain language
Two related fixes for the bug where async ML jobs sometimes flip to FAILURE in the first ~30-55 seconds and disappear from the UI while the processing service is still actively returning results.
What was happening:
- `Job.update_progress` notices "this stage has a final status but only 5% progress — that doesn't make sense, let me bump it to 100%."
- `is_complete()` returns True. Cleanup fires. NATS connection torn down. Job marked FAILURE.
- The remaining 95% of results that arrive over the next minute have nowhere to go.

The "is the stage at 100%?" check was honest. The code that wrote 100% was lying — it was a save-time janitor that "fixed" a perceived inconsistency by silently rewriting measured data.
Fix 1 — stop lying about progress. Remove the line that bumps `stage.progress = 1` when status is final. Progress is a measurement; if it's 30%, it's 30%, even if status is FAILURE. The stage will reach 100% honestly when SREM drains the pending set, at which point the FAILURE trip works correctly.

Fix 2 — catch genuinely stuck jobs faster. Without the false-positive coercion, a job whose workers genuinely stop pulling messages would sit visible as STARTED for 3 days before the reaper noticed. Two pieces:

- Split the `FAILED_CUTOFF_HOURS` constant that was originally added to hide old failed jobs from UI listings (PR #368, "Error handling for jobs", 2022). Two unrelated concepts had collapsed into one 72-hour value. Split into `FAILED_JOBS_DISPLAY_MAX_HOURS` (UI hide, unchanged) and a new `STALLED_JOBS_MAX_MINUTES` (reaper, 10 min — conservative starting value, raise if legitimate long-running jobs get reaped).
- Staleness is measured against `Job.updated_at` (added in PR #1227, "feat(jobs): schedule periodic stale-job check"), which Django bumps on every save. A job actively making progress bumps its timestamp on every batch, so a healthy job is never reaped — only one whose worker pool stopped touching it.

Reap-time diagnostic logging. When the reaper revokes a job it now logs a single WARN line with: minutes since last update vs threshold, previous_status, celery_state, dispatch_mode, and a per-stage progress/status summary. Pairs with the per-tick NATS consumer snapshots from PR #1227, so an operator can answer "why was this stalled?" by reading the revoke line plus the most recent `running_job_snapshots` entry for that job, without grepping back through hours of tick logs.

Net behavior change: a stuck running job gets reaped within ~10-25 min (one Beat tick) instead of 3 days, with enough log context at the moment of revocation to start triage. UI display of old failures is unchanged.
Files
- `ami/jobs/models.py` — drop progress coercion, split constant
- `ami/jobs/tasks.py` — `check_stale_jobs(minutes=...)` defaults to `STALLED_JOBS_MAX_MINUTES`; per-job WARN at revoke time
- `ami/jobs/views.py` — listing filter uses `FAILED_JOBS_DISPLAY_MAX_HOURS`
- `ami/jobs/management/commands/update_stale_jobs.py` — `--hours` → `--minutes`
- `ami/jobs/tests/test_jobs.py` — regression test for the coercion fix
- `ami/jobs/tests/test_update_stale_jobs.py`, `test_periodic_beat_tasks.py` — rescale fixtures to minutes

Test plan
- `python manage.py test ami.jobs --keepdb` — 49 tests pass
- A `jobs_health_check` Beat tick (≤15 min after last update) revokes a stalled job AND emits the new "Reaping stalled job" WARN with the diagnostic context.
- `python manage.py update_stale_jobs --minutes 5 --dry-run` still works.

Stacked on

Independent of #1234 (Fix 1: ACK/SREM ordering) — this branch is based on `main`, no shared commits. Either can merge first.

Footnote — future direction: managed retry pattern
Async ML jobs are designed to be idempotent (`save_results` dedupes on `(detection, source_image)`, SREM is a no-op for already-removed ids, progress percentage is clamped by `max()`). Today, when the reaper marks a job REVOKED, recovery is manual: a user re-dispatches the same collection. Since the building blocks for safe retry already exist, a natural next step is to track retry attempts on the `Job` model and let the periodic `jobs_health_check` manage automatic re-dispatch within a budget — e.g. revoke → eligible-for-retry up to N times → terminally REVOKED with reason. Out of scope for this PR; worth a design discussion before implementing because it interacts with how user-facing retry UI and audit trails should work.

🤖 Generated with Claude Code
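The footnote's retry-budget idea as a decision-table sketch. This is purely illustrative and explicitly out of scope for the PR: `retry_count` and `max_retries` are hypothetical fields that do not exist on the `Job` model today.

```python
def next_action(status: str, retry_count: int, max_retries: int = 3) -> str:
    # Illustrative only: revoke -> eligible-for-retry up to N times ->
    # terminally REVOKED with a recorded reason.
    if status != "REVOKED":
        return "leave"
    if retry_count < max_retries:
        return "redispatch"        # idempotent save_results makes re-dispatch safe
    return "terminally_revoked"    # budget exhausted; stop retrying
```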