
fix(jobs): don't fail jobs prematurely, but fail stalled jobs sooner (3 days -> 10 min)#1235

Merged
mihow merged 2 commits into main from fix/stalled-job-cutoff-and-progress-coercion
Apr 15, 2026

Conversation

@mihow
Collaborator

@mihow mihow commented Apr 15, 2026

Plain language

Two related fixes for the bug where async ML jobs sometimes flip to FAILURE in the first ~30-55 seconds and disappear from the UI while the processing service is still actively returning results.

What was happening:

  1. When a worker reports a result, we record progress in two places — Redis (a remaining-images counter) and the database (a stage's percentage and status). If 1-2 of the first few results error, the overall failed/total ratio temporarily exceeds 50%, and the code immediately stamps the stage's status as FAILURE — even though the stage is only ~5% done.
  2. Then any time the Job is saved, Job.update_progress notices "this stage has a final status but only 5% progress — that doesn't make sense, let me bump it to 100%."
  3. Now the stage looks 100% complete with FAILURE status. is_complete() returns True. Cleanup fires. NATS connection torn down. Job marked FAILURE. The remaining 95% of results that arrive over the next minute have nowhere to go.
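
The early-trip condition in step 1 can be sketched as follows. This is a hypothetical simplification (the function name, and the 0.5 threshold value implied by "exceeds 50%", are assumptions, not the actual code): a fixed failed/total ratio check is inherently noisy when `total` is small, which is exactly why 2 errors out of the first 3 results trip it.

```python
FAILURE_THRESHOLD = 0.5  # assumed value, per the "exceeds 50%" description above


def should_mark_failure(failed: int, total: int) -> bool:
    """Hypothetical sketch of the condition that stamps a stage FAILURE.

    Early in a job, `total` is tiny, so a couple of errors dominate the
    ratio: 2 failures in the first 3 results gives 0.67 > 0.5, even
    though the stage is only ~5% done overall.
    """
    return total > 0 and failed / total > FAILURE_THRESHOLD
```

With 2 errors in the first 3 results this returns True; with the same 2 errors spread over 100 results it returns False, which is why the ratio is only meaningful once enough of the job has been measured.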

The "is the stage at 100%?" check was honest. The code that wrote 100% was lying — it was a save-time janitor that "fixed" a perceived inconsistency by silently rewriting measured data.

Fix 1 — stop lying about progress. Remove the line that bumps stage.progress = 1 when status is final. Progress is a measurement; if it's 30%, it's 30%, even if status is FAILURE. The stage will reach 100% honestly when SREM drains the pending set, at which point the FAILURE trip works correctly.
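
A minimal sketch of the invariant Fix 1 restores, using plain dataclasses rather than the real Django models (the names `Stage`, `update_progress`, and `is_complete` mirror the description above but are simplified stand-ins): progress is only ever written from measurement, never coerced from status. The buggy version effectively did `if stage.status in FINAL_STATES: stage.progress = 1` inside the update path; that line is the one removed.

```python
from dataclasses import dataclass

FINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


@dataclass
class Stage:
    progress: float  # measured fraction complete, 0.0-1.0
    status: str      # may be a final status while progress < 1.0


def update_progress(stages: list[Stage]) -> float:
    """Recompute overall progress from measured stage progress only.

    Note what is deliberately absent: no branch that rewrites
    stage.progress because stage.status is final. A stage can honestly
    be 5% done with status FAILURE; it reaches 100% only when the
    pending set actually drains.
    """
    return sum(s.progress for s in stages) / len(stages)


def is_complete(stages: list[Stage]) -> bool:
    # Honest check: complete means measured progress hit 100% everywhere.
    return all(s.progress >= 1.0 for s in stages)
```

Under this invariant, a stage at 5% with status FAILURE no longer makes `is_complete()` return True, so cleanup cannot fire mid-flight.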

Fix 2 — catch genuinely stuck jobs faster. Without the false-positive coercion, a job whose workers genuinely stop pulling messages would sit visible as STARTED for 3 days before the reaper noticed. Two pieces:

  • The reaper threshold was reusing a constant called FAILED_CUTOFF_HOURS that was originally added to hide old failed jobs from UI listings (PR Error handling for jobs #368, 2022). Two unrelated concepts had collapsed into one 72-hour value. It is now split into FAILED_JOBS_DISPLAY_MAX_HOURS (UI hide, unchanged) and a new STALLED_JOBS_MAX_MINUTES (reaper, 10 min — a conservative starting value; raise it if legitimate long-running jobs get reaped).
  • The reaper already filters by Job.updated_at (added in feat(jobs): schedule periodic stale-job check #1227), which Django bumps on every save. A job actively making progress bumps its timestamp on every batch, so a healthy job is never reaped — only one whose worker pool stopped touching it.
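
The two bullets above combine into a minutes-based cutoff query. A sketch under stated assumptions — the real code is a Django ORM filter roughly like `Job.objects.filter(status__in=RUNNING_STATES, updated_at__lt=cutoff)`; this stand-in filters plain tuples so it runs without Django, and `find_stalled` is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone

STALLED_JOBS_MAX_MINUTES = 10           # new reaper threshold
FAILED_JOBS_DISPLAY_MAX_HOURS = 24 * 3  # UI hide cutoff, unchanged


def find_stalled(jobs, now=None, minutes=STALLED_JOBS_MAX_MINUTES):
    """Hypothetical stand-in for the reaper's query.

    `jobs` is a list of (status, updated_at) pairs. Because Django bumps
    updated_at on every save, a job that records any batch of results
    inside the window never matches — only one whose worker pool has
    stopped touching it.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=minutes)
    return [j for j in jobs if j[0] == "STARTED" and j[1] < cutoff]
```

The key property: staleness is measured against the last write, not job age, so a 2-hour job that is still saving progress is never a candidate.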

Reap-time diagnostic logging. When the reaper revokes a job it now logs a single WARN line with: minutes since last update vs threshold, previous_status, celery_state, dispatch_mode, and a per-stage progress/status summary. Pairs with the per-tick NATS consumer snapshots from PR #1227, so an operator can answer "why was this stalled?" by reading the revoke line plus the most recent running_job_snapshots entry for that job, without grepping back through hours of tick logs.
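
A sketch of what that single WARN line might look like. The field names mirror the description above (minutes vs threshold, previous_status, celery_state, dispatch_mode, per-stage summary), but the formatter name and exact message layout are assumptions, not the merged code:

```python
import logging

logger = logging.getLogger("ami.jobs.tasks")


def format_reap_warning(job_id, stalled_minutes, threshold_minutes,
                        previous_status, celery_state, dispatch_mode,
                        stages):
    """Hypothetical formatter for the revoke-time WARN line.

    `stages` is a list of (name, progress_fraction, status) tuples.
    Everything an operator needs for first-pass triage lands in one
    greppable line, paired with the most recent running_job_snapshots
    entry from the per-tick NATS snapshots.
    """
    stage_summary = ", ".join(
        f"{name}:{pct:.0%}/{status}" for name, pct, status in stages
    )
    return (
        f"Reaping stalled job {job_id}: "
        f"{stalled_minutes}m since last update (threshold {threshold_minutes}m), "
        f"previous_status={previous_status}, celery_state={celery_state}, "
        f"dispatch_mode={dispatch_mode}, stages=[{stage_summary}]"
    )


# At revoke time, the reaper would emit it as:
#     logger.warning(format_reap_warning(...))
```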

Net behavior change: a stuck running job gets reaped within ~10-25 min (one Beat tick) instead of 3 days, with enough log context at the moment of revocation to start triage. UI display of old failures is unchanged.

Files

  • ami/jobs/models.py — drop progress coercion, split constant
  • ami/jobs/tasks.py — check_stale_jobs(minutes=...) defaults to STALLED_JOBS_MAX_MINUTES; per-job WARN at revoke time
  • ami/jobs/views.py — listing filter uses FAILED_JOBS_DISPLAY_MAX_HOURS
  • ami/jobs/management/commands/update_stale_jobs.py — --hours becomes --minutes
  • ami/jobs/tests/test_jobs.py — regression test for the coercion fix
  • ami/jobs/tests/test_update_stale_jobs.py, test_periodic_beat_tasks.py — rescale fixtures to minutes

Test plan

  • python manage.py test ami.jobs --keepdb — 49 tests pass
  • Live e2e: dispatch an async_api job that produces early errors (PSv2 / cold ADC). Verify it does NOT flip to FAILURE in the first minute, runs to completion, and reports the actual failed-image count.
  • Live e2e: kill the ADC worker mid-job. Verify the job stays STARTED, no progress advances, and the next jobs_health_check Beat tick (≤15 min after last update) revokes it AND emits the new "Reaping stalled job" WARN with the diagnostic context.
  • Verify python manage.py update_stale_jobs --minutes 5 --dry-run still works.

Stacked on

Independent of #1234 (Fix 1: ACK/SREM ordering) — this branch is based on main, no shared commits. Either can merge first.


Footnote — future direction: managed retry pattern

Async ML jobs are designed to be idempotent (save_results dedupes on (detection, source_image), SREM is a no-op for already-removed ids, progress percentage is clamped by max()). Today, when the reaper marks a job REVOKED, recovery is manual: a user re-dispatches the same collection. Since the building blocks for safe retry already exist, a natural next step is to track retry attempts on the Job model and let the periodic jobs_health_check manage automatic re-dispatch within a budget — e.g. revoke → eligible-for-retry up to N times → terminally REVOKED with reason. Out of scope for this PR; worth a design discussion before implementing because it interacts with how user-facing retry UI and audit trails should work.
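
The revoke → eligible-for-retry → terminally-REVOKED budget described above could be sketched as a tiny state transition, to make the design-discussion starting point concrete. Everything here is hypothetical — none of these names exist in the codebase, and this PR deliberately does not implement it:

```python
MAX_RETRIES = 3  # hypothetical per-job retry budget


def next_state(status: str, retry_count: int,
               max_retries: int = MAX_RETRIES) -> tuple[str, int]:
    """Hypothetical transition the periodic health check could apply.

    A reaped (REVOKED) job with budget remaining becomes eligible for
    automatic re-dispatch; one that has exhausted the budget is parked
    terminally with a reason. Safe only because dispatch is idempotent
    (result dedupe, no-op SREM, max()-clamped progress).
    """
    if status == "REVOKED" and retry_count < max_retries:
        return ("RETRYING", retry_count + 1)
    if status == "REVOKED":
        return ("REVOKED_TERMINAL", retry_count)
    return (status, retry_count)  # healthy jobs are untouched
```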

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed an issue where failed job stage progress was incorrectly set to 100%, preventing proper job completion cleanup.
    • Stale job detection now uses minute-based timeout (10-minute default) instead of hour-based.
  • New Features

    • Enhanced logging for revoked stale jobs includes detailed diagnostics: previous status, execution state, stage progress, and timing information.

… stalled-job reaper

Two related fixes for the premature ``cleanup_async_job_resources`` path
that flips async_api jobs to FAILURE within ~30-55s of dispatch while
NATS is still delivering results:

1. ``Job.update_progress`` no longer silently bumps ``stage.progress = 1``
   when ``stage.status`` is in a final state. The bumped value made
   ``is_complete()`` return True and triggered cleanup mid-flight. The
   trigger was ``_update_job_progress`` writing ``status=FAILURE`` at
   partial progress as soon as ``failed/total`` crossed
   FAILURE_THRESHOLD (very easy early in a job — 1-2 errors out of the
   first few results). Progress is a measurement; leave it alone. The
   honest FAILURE trip happens when the stage actually reaches 100%.

2. Split ``Job.FAILED_CUTOFF_HOURS`` (originally added in PR #368 to
   hide failed jobs from UI listings after 3 days) into:
   - ``FAILED_JOBS_DISPLAY_MAX_HOURS = 24 * 3`` — original UI/API filter
   - ``STALLED_JOBS_MAX_MINUTES = 10`` — reaper threshold
   The 72h value was reused by PR #1227's stale-job reaper for an
   entirely different purpose, leaving stuck jobs invisible for 3 days.
   ``check_stale_jobs`` now defaults to 10 minutes against ``updated_at``
   (already in place from #1227), so a job whose worker pool stops
   pulling messages gets reaped within one Beat tick (~10-25 min).
   ``--hours`` arg on the management command becomes ``--minutes`` to
   match.

Together: fix #1 stops the false-positive FAILUREs; fix #2 ensures true
positives are caught quickly without depending on the misfiring coercion.

Co-Authored-By: Claude <[email protected]>
Copilot AI review requested due to automatic review settings April 15, 2026 18:58
@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 6218d38
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69dfe0edfe3d25000839b90d

@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 6218d38
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69dfe0ed7ea5bb0008af3a81

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 4686340 and 6218d38.

📒 Files selected for processing (7)
  • ami/jobs/management/commands/update_stale_jobs.py
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/tests/test_periodic_beat_tasks.py
  • ami/jobs/tests/test_update_stale_jobs.py
  • ami/jobs/views.py

📝 Walkthrough

Walkthrough

The PR migrates stale job detection from hours-based to minutes-based thresholds. Job model constants are renamed for clarity (FAILED_CUTOFF_HOURS → FAILED_JOBS_DISPLAY_MAX_HOURS), and a new STALLED_JOBS_MAX_MINUTES constant is introduced. Progress inflation logic in final job states is removed, and enhanced logging is added for revoked stale jobs.

Changes

Cohort / File(s) Summary
Job Model Constants & Logic
ami/jobs/models.py
Renamed failure retention constant for clarity, added stalled jobs minutes threshold, and removed logic that coerced stage progress to 1 in final states to prevent premature completion.
Stale Job Detection Task
ami/jobs/tasks.py
Updated check_stale_jobs signature from hours to minutes; enhanced per-job warning logs with stalled duration, stage summaries, and job state details during revocation.
Management Command & API
ami/jobs/management/commands/update_stale_jobs.py, ami/jobs/views.py
Updated CLI option and API view to use new constant names and minute-based thresholds.
Test Updates
ami/jobs/tests/test_*.py
Added regression test for progress inflation fix; updated test helpers to use minute-based backdating instead of hour-based.

Poem

🐰 Minutes matter more than hours, we now say,
No progress inflation at the end of day,
Stale jobs logged with care before they're told to go,
Cleaner code, clearer thresholds—watch the garden grow!

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main changes: stopping premature job failures and making stalled-job detection faster (changing from 3 days to 10 minutes). It accurately summarizes the PR's primary objectives.
Description check ✅ Passed The description comprehensively follows the template with all major sections completed: Summary, List of Changes, Files section, Test Plan, and Stacked on notes. It includes detailed explanation of the problem, fixes, implementation details, and testing approach.


Contributor

Copilot AI left a comment


Pull request overview

Fixes a premature cleanup failure mode in async ML jobs by ensuring stage progress remains a truthful measurement (no save-time coercion to 100%), and tightens the stale-job reaper threshold by splitting the former “failed display cutoff” constant into separate UI and operational cutoffs.

Changes:

  • Remove save-time coercion that forced stages in final status to progress=1.0, preventing premature is_complete() / cleanup.
  • Split the old FAILED_CUTOFF_HOURS concept into FAILED_JOBS_DISPLAY_MAX_HOURS (UI listing filter) and STALLED_JOBS_MAX_MINUTES (reaper), and rescale stale-job logic to minutes.
  • Update management command/tests to use minutes-based staleness and add a regression test for the progress-coercion bug.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
ami/jobs/models.py Removes progress inflation on save; introduces separate UI vs reaper cutoffs.
ami/jobs/tasks.py Updates check_stale_jobs to use minutes and default to STALLED_JOBS_MAX_MINUTES; updates related docstring.
ami/jobs/views.py Job listing cutoff default now uses the UI-only failed-job display constant.
ami/jobs/management/commands/update_stale_jobs.py Switches CLI flag/behavior from hours to minutes for stale-job reconciliation.
ami/jobs/tests/test_jobs.py Adds regression coverage to ensure partial FAILURE progress is not inflated on save.
ami/jobs/tests/test_update_stale_jobs.py Updates stale-job fixtures from hours to minutes.
ami/jobs/tests/test_periodic_beat_tasks.py Updates beat-task stale-job fixtures from hours to minutes.


Comment thread — ami/jobs/management/commands/update_stale_jobs.py
When the reaper revokes a job, log a single WARN line capturing the
state needed to triage "why was this stalled?" without grepping back
through prior tick logs:

  - minutes since last update vs threshold
  - previous_status, dispatch_mode, celery_state
  - per-stage progress + status summary
  - pointer to running_job_snapshots for prior NATS consumer state

Pairs with PR #1227's per-tick consumer snapshots: every 15 min, each
running async_api job gets a NATS state snapshot; this new line tells
operators which of those snapshots is the last one before revocation.

Co-Authored-By: Claude <[email protected]>
@mihow mihow changed the title fix(jobs): stop bumping partial progress to 100%, reap stalled jobs after 10 min fix(jobs): don't fail jobs prematurely, but fail stalled jobs sooner (3 days -> 10 min) Apr 15, 2026
@mihow
Collaborator Author

mihow commented Apr 15, 2026

@coderabbitai ready for a review

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

@mihow Sure, I'll kick off a full review of the PR now!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mihow mihow merged commit af2f688 into main Apr 15, 2026
7 checks passed
@mihow mihow deleted the fix/stalled-job-cutoff-and-progress-coercion branch April 15, 2026 20:04