
Refactor job logging to use separate table #1259

Merged — mihow merged 5 commits into main from copilot/implement-option-a-job-logs, Apr 30, 2026

Conversation

Contributor

Copilot AI commented Apr 20, 2026

Summary

When the platform processes images through the async ML pipeline, every result the ML worker posts back triggers a small log line on the parent job ("saved 3 detections", "task acknowledged", etc). On busy jobs that meant dozens of concurrent UPDATEs hitting the same row of the jobs_job table on its logs JSONB column — a row-lock contention point that, under real load, freezes a whole job: workers stall waiting for the row lock, NATS message acks expire while waiting, redelivery storms follow, and the job gets stuck mid-run until the stale-job reaper revokes it.

This PR moves per-job log lines onto a separate append-only JobLog table. Each log line is a fast INSERT into a child table instead of a read-modify-write of the JSONB column on the parent row. No more row-lock fight, no more redelivery storms, jobs finish.

The UI is unchanged — the same {stdout, stderr} shape is reconstructed from JobLog rows by the serializer, with a fallback to the legacy column for jobs created before this PR.

Move job-log persistence off the jobs_job.logs JSON-field UPDATE path onto a joined append-only JobLog table, without changing the UI or adding new endpoints. Replaces the earlier draft of this PR (new /logs/ endpoint + UI rewire) with a narrower, serializer-compatible approach.
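For orientation, a minimal sketch of the write-path change — assuming the handler and JobLog model shape described in this thread; the commented-out old path paraphrases the JSONB read-modify-write rather than quoting the removed code:

import logging

from django.conf import settings

from ami.jobs.models import JobLog  # child table added by this PR


class JobLogHandler(logging.Handler):
    def __init__(self, job):
        super().__init__()
        self.job = job

    def emit(self, record: logging.LogRecord) -> None:
        if not getattr(settings, "JOB_LOG_PERSIST_ENABLED", True):
            return  # escape hatch from PR #1261: container stdout only
        # Old path (contended): read-modify-write of the parent row's
        # JSONB column, serialized behind a row lock per log line:
        #     self.job.refresh_from_db()
        #     self.job.logs.stdout.append(self.format(record))
        #     self.job.save(update_fields=["logs"])
        # New path: one append-only INSERT on the child table; no lock
        # is taken on the parent jobs_job row.
        JobLog.objects.create(
            job=self.job,
            level=record.levelname,
            message=self.format(record),
        )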

List of Changes

  • Added a JobLog model and migration (0021_joblog) for append-only per-job log rows
  • Switched JobLogHandler.emit() to insert one joined log row instead of refreshing and rewriting the parent jobs_job row (gated by the JOB_LOG_PERSIST_ENABLED flag introduced in PR #1261)
  • Rewired JobListSerializer.logs / JobSerializer.logs to a SerializerMethodField that reads joined rows on detail responses, falls back to the legacy jobs_job.logs JSON for jobs created before this table existed, and keeps the cheap legacy shape on list responses to avoid N+1 (see the sketch after this list)
  • Annotated get_logs with @extend_schema_field so the OpenAPI schema retains the {stdout: string[], stderr: string[]} contract that SchemaField(JobLogs) advertised
  • Added ?logs_limit= query param on the detail endpoint (default 1000, capped at 5000). Validated via SingleParamSerializer[int].clean(...) with a DRF IntegerField at the view boundary in JobViewSet.get_serializer_context. Also migrated ?cutoff_hours= to the same helper — the previous django.forms.IntegerField path was raising HTTP 500 on bad input. Pagination over older entries lands in a separate PR (#1280: Dedicated /jobs/logs/ endpoint with pagination for JobLog rows).
  • Marked Job.logs as deprecated in help_text — read-only fallback for pre-#1259 jobs only. Migration 0022_alter_job_logs_help_text.
  • Updated backend tests to assert against joined log rows. 14 new cases total: 8 in TestJobLogPersistence (emit, flag-off short-circuit, joined-table read, legacy-JSON fallback, list-vs-detail paths, logs_limit cap), 5 in TestJobLogsLimitHTTPValidation (HTTP-level validation of ?logs_limit=), 1 in TestJobView for ?cutoff_hours= 400-on-bad-input.
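A minimal sketch of the read side referenced in the list above — the legacy fallback and newest-first ordering come from the test notes later in this thread, while the stderr level split and line format here are assumptions:

from rest_framework import serializers

from ami.jobs.models import JobLog


def serialize_job_logs(job, limit=1000):
    # Read joined rows newest-first; fall back to the legacy JSONB
    # column for jobs created before the JobLog table existed.
    rows = list(JobLog.objects.filter(job=job).order_by("-created_at", "-pk")[:limit])
    if not rows:
        return _legacy_logs_shape(job)  # {stdout: [...], stderr: [...]} from jobs_job.logs
    stdout, stderr = [], []
    for row in rows:
        target = stderr if row.level in ("WARNING", "ERROR", "CRITICAL") else stdout
        target.append(f"[{row.level}] {row.message}")
    return {"stdout": stdout, "stderr": stderr}


class JobListSerializer(serializers.ModelSerializer):
    logs = serializers.SerializerMethodField()

    def get_logs(self, obj):
        view = self.context.get("view")
        if getattr(view, "action", None) == "list":
            return _legacy_logs_shape(obj)  # cheap shape, no N+1 across a page of jobs
        return serialize_job_logs(obj, limit=self.context.get("logs_limit", 1000))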

Not in this PR (deliberately deferred)

Related

Why this scope

The earlier draft of this PR added a /api/v2/jobs/{id}/logs/ endpoint and changed 5 UI files to consume it. That's more surface area than the fix requires: the only functional change needed is moving the write path off the row-locked JSON field and onto a child table. Keeping the serializer shape identical lets the UI stay on its current contract, and a SerializerMethodField with a legacy-JSON fallback covers the read side without a schema break or a transition endpoint. Rollout is one Django migration + one code change, no UI deploy coordination.

How to Test

Unit:

docker compose exec -T django python manage.py test \
  ami.jobs.tests.test_jobs \
  ami.jobs.tests.test_tasks \
  ami.ml.tests.TestPipelineWithProcessingService.test_run_pipeline_with_errors_from_processing_service \
  --keepdb

Local reproduction of the row-lock pathology (prereqs and exact steps in docs/claude/debugging/row-lock-contention-reproduction.md):

export ANTENNA_TOKEN=<TOKEN>
python scripts/load_test_result_endpoint.py <JOB_ID> \
    --batch 50 --concurrency 20 --rounds 5

Observed on the dev stack (WEB_CONCURRENCY=1, 8 Celery ML-fork workers), flag JOB_LOG_PERSIST_ENABLED=true (default with the new write path):

| Signal | PR #1261 flag=true baseline | PR #1261 flag=false baseline | This PR (flag=true, JobLog) |
|---|---|---|---|
| 10×50 POST success | 0/10 (timeout) | 10/10 | 30/30 |
| 20×50 POST success | not re-measured | not re-measured | 100/100 |
| p95 latency (10×50) | 120s+ | ~5s | ~13s |
| blocker_chain peak | 30+ sustained | 0–1 transient | 0–7 transient (mostly 0) |
| blocker_chain peak (20×50) | n/a | n/a | 1 transient |
| idle_in_tx | 8–10 sustained | 0 | tracks concurrency; clears between waves |

The latency delta vs. flag=false is the expected cost of actually persisting the rows (vs. short-circuiting emit entirely). The flag=true → JobLog path trades that small cost for no 120s-timeout stalls.

End-to-end pressure test on a dev box (live ADC worker + NATS)

Project 9, "Random 100" collection, pipeline quebec_vermont_moths_2023, 4 concurrent ADC workers, AMI_ANTENNA_API_BATCH_SIZE=2 (small batches → many small fast result POSTs):

| Run | Branch | JOB_LOG_PERSIST_ENABLED | Outcome |
|---|---|---|---|
| Job 164 | main | True (default) | stuck 18 min, reaper revoked it; 86 NATS msgs unacked, 52 redeliveries; only 5 log entries reached jobs_job.logs before everything stalled |
| Job 166 | main | True (default) | killed manually at 2 min mid-stall, same pattern: 11 RowExclusiveLock + 10 ExclusiveLock on jobs_job, ~20 backends idle in transaction (aborted), 8+ active workers waiting on Lock/tuple for 16+ s, all running UPDATE "jobs_job" SET "logs" = ... |
| Job 167 | this PR | True (writes JobLog rows) | 35.4 s SUCCESS, 1966 JobLog rows, max 1 row lock at any moment, mean 0.03 |
| Job 168 | this PR | False (escape hatch, stdout only) | 33.4 s SUCCESS, 0 DB log rows |

The 2 s gap between persist=True (35.4 s) and persist=False (33.4 s) is the cost of 1966 child-table INSERTs — roughly a millisecond per row — small enough that persist=True stays the right default and the flag remains a useful emergency switch.

Deployment Notes

Checklist

  • Tested these changes locally (unit + load test per the repro runbook).
  • Added and/or modified relevant tests.
  • Updated relevant documentation / comments.
  • Soaked under real end-to-end load on a staging-equivalent deployment (arctia dev box: JobLogHandler.emit() exercised via live Celery + ADC worker round-trip; pressure test above demonstrates contention vs. main).

@netlify

netlify Bot commented Apr 20, 2026

Deploy Preview for antenna-ssec canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 8657e16 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/69f38cc5ff0ccf0008621fb0 |

@netlify

netlify Bot commented Apr 20, 2026

Deploy Preview for antenna-preview canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 8657e16 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/69f38cc5aa5cb50008a894a6 |

@mihow mihow marked this pull request as ready for review April 21, 2026 01:17
Copilot AI review requested due to automatic review settings April 21, 2026 01:17
Contributor

Copilot AI left a comment


Pull request overview

This PR moves high-frequency per-job log persistence from the jobs_job.logs JSON field into an append-only joined JobLog table, and updates the API/UI to read job logs via a dedicated endpoint (with legacy JSON fallback).

Changes:

  • Added JobLog model + migration and updated JobLogHandler.emit() to insert joined log rows.
  • Added /api/v2/jobs/{id}/logs/ endpoint + serializer helpers to return formatted stdout/stderr with legacy JSON fallback.
  • Updated UI job detail dialogs to fetch logs separately via a new useJobLogs hook; updated backend tests accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
|---|---|
| ui/src/pages/session-details/playback/capture-details/capture-job/capture-job-dialog.tsx | Fetch logs/errors via useJobLogs and pass into JobDetails. |
| ui/src/pages/jobs/jobs.tsx | Fetch logs/errors via useJobLogs and pass into JobDetails. |
| ui/src/pages/job-details/job-details.tsx | Make JobDetails accept logs/errors props and render them instead of job.logs/job.errors. |
| ui/src/data-services/hooks/jobs/useJobLogs.ts | New hook to fetch /jobs/{id}/logs/ and expose stdout/stderr arrays. |
| ami/ml/tests.py | Update pipeline error test to assert against JobLog rows. |
| ami/jobs/views.py | Add logs action endpoint returning serialized joined logs with a limit param. |
| ami/jobs/tests/test_jobs.py | Update assertions to read from JobLog rows; add logs endpoint test. |
| ami/jobs/tasks.py | Move per-job logger writes out of the select_for_update() transaction region. |
| ami/jobs/serializers.py | Add serialize_job_logs, adjust job serializers to use method field + new JobLogsSerializer. |
| ami/jobs/models.py | Add JobLog model and refactor JobLogHandler.emit() to append joined log rows. |
| ami/jobs/migrations/0021_joblog.py | Migration creating the JobLog table + index/ordering. |


mihow added a commit that referenced this pull request Apr 21, 2026
fix(jobs): short-circuit JobLogHandler.emit to break log-UPDATE contention

Under concurrent async_api load the per-record UPDATE on jobs_job.logs
becomes the dominant bottleneck after the select_for_update fix in the
previous commit. Observed on demo: 5 concurrent jobs produced a 56-deep
pg_blocking_pids chain, all waiting on UPDATE jobs_job SET logs = ...
from either /result (view, under ATOMIC_REQUESTS) or _update_job_progress
(ml worker).

Short-circuit emit to the module-level logger only (container stdout).
The per-job UI log feed goes blank while this hotfix is active; revisit
once PR #1259 lands the append-only JobLog child table.
mihow added a commit that referenced this pull request Apr 21, 2026
feat(jobs): add JOB_LOG_PERSIST_ENABLED flag + local repro docs

Replace the unconditional noop of JobLogHandler.emit from the previous
commit with a feature flag (default True, preserves existing behavior).
Deployments hitting row-lock contention on jobs_job can set
JOB_LOG_PERSIST_ENABLED=false to short-circuit the per-record UPDATE
until PR #1259's append-only JobLog table is in place.

Validated locally (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
batched POSTs at 50 results × 10 concurrent):

  flag=true:  blocker_chain=37, 0/10 POSTs complete (120s timeout)
  flag=false: blocker_chain=1 (trivial), 20/20 POSTs complete, p95=5.5s

Also add:
- scripts/load_test_result_endpoint.py — standalone batched-POST driver.
  A single-result-per-POST curl loop does NOT reproduce the contention
  because it skips the per-iter job.logger.info call inside one
  ATOMIC_REQUESTS tx. Batching N>1 results per POST is load-bearing.
- docs/claude/debugging/row-lock-contention-reproduction.md — full
  runbook: pathology, prereqs, steps, pg_stat_activity queries,
  before/after signal table, flag usage.
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch from 4307d69 to 1cfe52a on April 21, 2026 06:56
@mihow
Collaborator

mihow commented Apr 21, 2026

Claude says (on behalf of @mihow): Force-pushed this branch to replace the earlier draft with a narrower scope. The core change — moving job-log writes off the parent row onto a child table — is preserved, but the new /logs/ endpoint and the 5 UI files it required have been dropped in favour of a SerializerMethodField that keeps the existing {stdout, stderr} wire shape.

What stays:

  • JobLog model + migration 0021_joblog (append-only: job FK, level, message, created_at — sketched after this list)
  • JobLogHandler.emit → single JobLog.objects.create(...) INSERT
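A sketch of the model those four fields imply — field options and related_name are assumptions; the ordering and index match what the review below confirms:

from django.db import models


class JobLog(models.Model):
    # One row per log line; rows are only ever INSERTed, never UPDATEd,
    # so no writer contends with another on the parent jobs_job row.
    job = models.ForeignKey("jobs.Job", on_delete=models.CASCADE, related_name="joblogs")  # related_name assumed
    level = models.CharField(max_length=32)
    message = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ["-created_at", "-pk"]
        indexes = [models.Index(fields=["job", "-created_at"])]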

What's removed:

  • GET /api/v2/jobs/{id}/logs/ action, JobLogsSerializer, and the useJobLogs.ts / job-details.tsx / jobs.tsx / capture-job-dialog.tsx changes

What's new:

  • JobListSerializer.logs / JobSerializer.logs become a SerializerMethodField that reads JobLog rows on detail responses, falls back to the legacy JSON column on jobs created before this table existed, and keeps the cheap legacy shape on list responses to avoid N+1
  • JOB_LOG_PERSIST_ENABLED flag (from PR #1261) still gates the emit-side write; setting it false short-circuits emit entirely (no DB write, container stdout unchanged)

PR body has the full scope / out-of-scope list and local validation numbers. This PR depends on #1261 landing first (for the flag) — once that merges, this will rebase onto main and its diff will collapse to just the JobLog-specific commits.

Next step on my side is soaking this under end-to-end concurrent-job load on a staging-equivalent deployment with a demo-DB copy, before requesting review.

@mihow mihow changed the title Refactor job logging to use joined log rows Refactor job logging to use separate table Apr 21, 2026
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch 3 times, most recently from 2b0d524 to c320600 on April 21, 2026 08:12
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch from c320600 to 7207a29 on April 21, 2026 09:29
mihow added a commit that referenced this pull request Apr 23, 2026
* fix(jobs): drop select_for_update in _update_job_progress to break row-lock contention

Under concurrent async_api result processing, every ML result task
queued a contending exclusive lock on the jobs_job row via
`select_for_update()`, stacking behind gunicorn view threads also
holding the row under ATOMIC_REQUESTS (from per-iter job.logger.info
writes in /result).

Observed in a validation environment under 5 concurrent async_api jobs
(2x10 + 2x100 + 1x500 images, one pipeline): py-spy stacks of
celeryworker_ml showed 7 of 8 ForkPoolWorker threads stuck in
`psycopg wait` -> `_update_job_progress (tasks.py:493)`. pg_stat_activity
consistently showed 17-20 backends "idle in transaction" holding
`SELECT ... FOR UPDATE OF jobs_job` for 0.5-5+ seconds at a time.

The `max()` guard below the dropped lock still prevents progress
regression: two racing workers that both read `progress=0.4` and both
compute a new value still write monotonically via `max(existing, new)`.
The trade-off is that accumulated counts (detections, classifications,
captures) can drift by one batch under a read-modify-write race --
cosmetic only, since the underlying Detection/Classification rows are
written authoritatively by `save_results` before this function runs.

Refs: #1256 (log refactor proposal; this PR targets
the parallel progress-update path, not the log-write path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobs): short-circuit JobLogHandler.emit to break log-UPDATE contention

Under concurrent async_api load the per-record UPDATE on jobs_job.logs
becomes the dominant bottleneck after the select_for_update fix in the
previous commit. Observed on demo: 5 concurrent jobs produced a 56-deep
pg_blocking_pids chain, all waiting on UPDATE jobs_job SET logs = ...
from either /result (view, under ATOMIC_REQUESTS) or _update_job_progress
(ml worker).

Short-circuit emit to the module-level logger only (container stdout).
The per-job UI log feed goes blank while this hotfix is active; revisit
once PR #1259 lands the append-only JobLog child table.

* feat(jobs): add JOB_LOG_PERSIST_ENABLED flag + local repro docs

Replace the unconditional noop of JobLogHandler.emit from the previous
commit with a feature flag (default True, preserves existing behavior).
Deployments hitting row-lock contention on jobs_job can set
JOB_LOG_PERSIST_ENABLED=false to short-circuit the per-record UPDATE
until PR #1259's append-only JobLog table is in place.

Validated locally (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
batched POSTs at 50 results × 10 concurrent):

  flag=true:  blocker_chain=37, 0/10 POSTs complete (120s timeout)
  flag=false: blocker_chain=1 (trivial), 20/20 POSTs complete, p95=5.5s

Also add:
- scripts/load_test_result_endpoint.py — standalone batched-POST driver.
  A single-result-per-POST curl loop does NOT reproduce the contention
  because it skips the per-iter job.logger.info call inside one
  ATOMIC_REQUESTS tx. Batching N>1 results per POST is load-bearing.
- docs/claude/debugging/row-lock-contention-reproduction.md — full
  runbook: pathology, prereqs, steps, pg_stat_activity queries,
  before/after signal table, flag usage.

* perf(jobs): wrap process_nats_pipeline_result with cachalot_disabled

Under concurrent async_api load the ML result task is a pure write path:
save_results inserts Detection/Classification rows and _update_job_progress
UPDATEs jobs_job. Cachalot's post-write invalidation added ~2.5s of
overhead per task in production measurement (issue #1256 Path 4), which
throttles the ml_results queue drain when ADC POSTs arrive faster than
Celery can clear them.

Nothing inside the task benefits from the query cache — every read is of
a freshly-written row — so skipping invalidation is strictly a throughput
win. Decorator stack: @celery_app.task wraps the cachalot-wrapped
function, so Celery sees the context manager enter/exit on each task run.

This was prototyped on demo during the 2026-04-20 investigation but
never committed (demo was reset to clean main before PR #1261 was
opened). Restoring it here closes the last documented ML-path
throughput regression.

* chore(jobs): address PR #1261 review feedback

- tasks.py: remove reference to non-existent planning doc; point at
  issue #1256 and PR #1261 instead (Copilot finding).
- load_test_result_endpoint.py: move token to $ANTENNA_TOKEN env var
  (or optional --token) to keep secrets out of argv/shell history;
  validate batch/concurrency/rounds > 0; require http(s) scheme on
  --host; capture error type+reason per-request instead of returning
  bare -1 (CodeRabbit findings).
- docs: update the repro runbook to match the new CLI shape.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(jobs): narrow job.save() to update_fields in _update_job_progress

Previously `job.save()` (no update_fields) wrote the entire row, which under
concurrent async_api load meant a stale read-modify-write could clobber
unrelated fields mutated by a concurrent worker — most notably
`progress.errors` (appended by `_reconcile_lost_images`) and `logs`
(appended by JobLogHandler.emit when persistence is enabled).

Narrow the write to the fields this function actually mutates
(progress, status, finished_at) plus updated_at so auto_now still fires.
The underlying read-modify-write race remains — this only limits blast
radius, not the race itself. The counter-drift and `max()` guard semantics
discussed in the block comment above are unchanged.

Address PR #1261 CodeRabbit review feedback on ami/jobs/tasks.py:514.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(settings): wire NATS_TASK_TTR env var with 60s default

nats_queue.py already reads getattr(settings, "NATS_TASK_TTR", 30), but
the setting was never declared in base.py — so the ack_wait was pinned
at 30s everywhere regardless of env. Wire it through and raise the
default to 60s: 30s is too tight for 24 MB images + GPU cold start +
synchronous result POST, and caused spurious NATS redeliveries under
concurrent load.

Per-env override via NATS_TASK_TTR (demo bumps this further to 120s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(jobs): fair worker polling via last-touched ordering

Workers polling GET /api/v2/jobs/?ids_only=1&incomplete_only=1 currently
get jobs in PK order (oldest first), so N concurrent pollers all race to
the same first job. Under the workshop load (4 pollers × 4 concurrent
jobs) this caused the drain-one-job-at-a-time pattern observed in the
6-job retest: j192 starved at 0% for ~500s while all pollers hammered
j191, and NATS nearly exhausted max_deliver=2 on j192.

Order ids_only responses by updated_at ASC (least-recently-touched
first). updated_at freshens on every progress write, log write, or
status transition, so actively-worked jobs drop to the bottom and
starved jobs bubble to the top — rotation without a dedicated
last_polled_at field.

Fairness is a property of the serving endpoint, not of the worker.
Third-party workers with their own consumption patterns should still
see evenly-distributed handoffs; keeping the scheduling logic in
Antenna rather than ADC preserves that guarantee.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(jobs): default ids_only to limit=1 for pop()-style polling

Worker polling (?ids_only=1) now hands back one least-recently-touched
job per request rather than the default paginated batch of 10. Concurrent
workers were caching the full list and draining it serially, which
starved later jobs until the cache drained. Forcing a re-poll per job
lets the updated_at fairness sort rotate work across jobs every
iteration.

Callers who want a batch can still pass ?limit=N or ?page_size=N
explicitly — back-compat preserved for the few non-polling ids_only
consumers.

No worker-side change required: existing clients iterate a list of zero
or more and work identically with a list of length one.

* fix(jobs): randomize ids_only ordering instead of updated_at sort

The updated_at sort had a degenerate case: freshly queued jobs all share
near-identical updated_at (ties broken by pk), so simultaneous pollers
deterministically pick the same "oldest" job. Replacing the sort with
ORDER BY RANDOM() gives probabilistic disjoint assignment across
concurrent pollers without writing a poll-stamp column.

Combined with the limit=1 default on ?ids_only=1, each worker poll is
an independent "pick any unfinished job" draw. Fairness is probabilistic
rather than strictly oldest-first, but that's an acceptable trade given
we want to avoid write-on-read.

If we later decide to enforce strict fairness, the right follow-up is a
dedicated last_polled_at column with UPDATE ... RETURNING semantics, not
a return to the updated_at sort.

* test(jobs): update ids_only test for pop()-style default

Existing test_list_jobs_with_ids_only now passes ?limit=10 explicitly to
assert the "list all" behavior. New test_list_jobs_ids_only_pops_one
covers the new default where ?ids_only=1 returns one job to support
pop()-style polling handoff across workers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(jobs): heartbeat on idle ids_only polls

The pipeline-scoped heartbeat in _mark_pipeline_pull_services_seen only
fires when a worker hits /tasks/ or /result/ on an active job, so workers
idling on GET /jobs/?ids_only=1 between jobs had no heartbeat path. Their
last_seen would age past PROCESSING_SERVICE_LAST_SEEN_MAX and the UI
would flip them offline despite being actively online.

Add _mark_async_services_seen_for_project + matching celery task that
marks every async ProcessingService attached to the polling project as
seen. Same Redis throttle pattern as the existing heartbeat
(HEARTBEAT_THROTTLE_SECONDS = 30s), scoped by project rather than
(pipeline, project) since the list endpoint has no pipeline context.

* feat(settings): raise NATS_TASK_TTR default to 300s

The 60s default worked on push-mode services but was still too tight for
pull-mode ADC workers running cold-start GPU pipelines on 24 MB images
(staging observed a nontrivial redelivery rate at 60s; demo held steady
only after being bumped to 120s, then 300s). Raise the default to 300s
so new deployments inherit the ceiling that staging/demo actually run
at. max_deliver still bounds how long a genuinely crashed worker can
hold tasks invisibly.

* fix(jobs): heartbeat on ids_only polls when only pipeline__slug__in is sent

The project-scoped heartbeat added in 44d4c32 never fires for the actual
ADC worker request: ``GET /api/v2/jobs/?pipeline__slug__in=...&ids_only=1``
does not carry a ``project_id``, so ``get_active_project()`` returns None
and the helper short-circuits. Result: async processing services remained
``last_seen_live=null`` forever on demo despite workers polling every 5s.

The commit's original rationale — that the list endpoint has no pipeline
context — was wrong: the worker's filter slugs are right there in the
query string. Add a sibling path that derives scope from those slugs:

- ``update_async_services_seen_for_pipelines(pipeline_slugs)`` celery task
  bumps ``last_seen`` on async services whose pipelines match any slug.
- ``_mark_async_services_seen_for_pipelines`` throttles dispatch via a
  Redis key scoped on the sorted slug set.
- The ``list()`` heartbeat branch prefers the pipeline-slug path when
  those slugs are present, and falls back to the project-scoped path for
  callers that send ``?project_id=``.

Tests cover the real worker shape (no project_id, slugs in query) and the
task's filter semantics against sync + async + unrelated services.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(jobs): address PR #1261 review feedback

Response to reviewer comments on the async_api concurrency tuning PR:

- views.py: mark the three `ids_only=1` overrides (random ordering,
  paginator default_limit=1, heartbeat dispatch) as temporary — they exist
  only because list() is doubling as a claim-next-job endpoint. A dedicated
  `/next` action is tracked as #1265; comments carry a hard-coded removal
  target of 2026-04-24 to keep the expiration visible.
- views.py: rewrite the `_mark_async_services_seen_for_pipelines` and
  `_mark_async_services_seen_for_project` docstrings so the contract is
  obvious — they are Redis-throttled wrappers around the matching celery
  task. No DB work in the wrapper.
- tasks.py: add TODO on both `update_async_services_seen_*` tasks pointing
  at #1194 (client-ID auth), so the "one poller falsely marks its peers
  live" invariant-break is visible in source, not just in the PR body.
- tasks.py: fix the stale comment in `_update_job_progress` — the previous
  wording claimed the narrowed save() avoids overwriting `updated_at`, but
  `updated_at` is in `update_fields` intentionally (Django skips auto_now
  when update_fields is provided).
- config/settings/base.py: add a one-line plain-language purpose for
  `JOB_LOG_PERSIST_ENABLED` so the flag reads at-a-glance.

No behavior changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch from 3381239 to a359686 on April 23, 2026 00:45
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Takeaway review before final merge. Two commits, 5 files, +263/-47. Force-pushed once — scope already trimmed correctly. Prereq #1261 merged 2026-04-23.

What this PR is FOR

Move write hot path off jobs_job.logs JSONB UPDATE → JobLog child INSERT. Read shape stays {stdout, stderr} so no UI work. Decision encoded: append-only child table over per-line UPDATE. Right call.

Takeaways (priority order)

1. Silent decision: log table now grows without bound. Old JobLogHandler.emit capped stdout at max_log_length = 1000 lines per job (models.py:355, now an orphaned constant). New path inserts unconditionally. No retention task, no per-job soft cap, no celery beat prune. Grep confirms zero rotation/cleanup in PR or codebase. High-traffic deployments → pathologically large jobs_joblog. Decide before merge: add a periodic prune task in this PR, or open a follow-up issue and link from PR body so the trail doesn't go cold.

2. OpenAPI schema regression on logs field. JobListSerializer.logs was SchemaField(schema=JobLogs, read_only=True) → typed in OpenAPI. Now serializers.SerializerMethodField() with no @extend_schema_field(JobLogsSerializer). Generated schema for logs degrades to untyped {}. Frontend codegen / external consumers lose the {stdout: string[], stderr: string[]} contract. Fix: annotate get_logs with @extend_schema_field returning the dict-of-list-of-str shape.

3. Cleanup story for jobs_job.logs JSONB column not committed. PR body says "separate cleanup PR after soak" — no follow-up issue linked. Without an explicit ticket → column + fallback path live forever. Open the follow-up issue before merging this PR, link it from the body. Otherwise drift sticks.

4. Soak still pending — own checklist flags it. Local 8-worker workstation ≠ staging-equivalent. Last unchecked box on PR. Don't merge before a real e2e load run on a deployment with realistic concurrency. Numbers in body are good; coverage is not.

5. Open thread unanswered: serialize_job_logs placement (mihow on serializers.py:32, commit a359686). Trivial to resolve — reply (it's pure shape translation, fits in serializers.py) or move to models.py next to JobLog. Don't leave dangling.

6. List-view logs empty for new jobs — verified safe. JobListSerializer.get_logs returns legacy JSON shape on action=list to skip N+1. Post-migration jobs have empty legacy column → {stdout: [], stderr: []} on list. Confirmed UI list page (ui/src/pages/jobs/jobs.tsx:135) only renders logs inside the detail dialog, which uses useJobDetails(id) (detail action, hits JobLog). No regression.

7. Mop-up. Drop the now-unused JobLogHandler.max_log_length = 1000. Consider widening the index to [job, -created_at, -pk] to match the model ordering and avoid an extra sort on tiebreak.

Land plan

Before merge:

  • (a) decide retention story (prune task in PR, or linked follow-up issue)
  • (b) add @extend_schema_field on get_logs
  • (c) open + link cleanup ticket for jobs_job.logs column drop
  • (d) resolve open serialize_job_logs thread
  • (e) run soak on staging
  • (f) trivial: kill max_log_length constant, optionally rebase onto current main (PR #1261 merged)

(a) and (c) prevent future mess. (b) prevents silent contract regression. Rest is hygiene.


@coderabbitai please review this PR with focus on the takeaways above — especially (1) unbounded growth / retention and (2) the OpenAPI schema regression from SchemaField → SerializerMethodField.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026


🧩 Analysis chain

I'll pull the relevant source files to verify each takeaway before writing the full review.


🏁 Script executed:

#!/bin/bash
# Find and read key files
echo "=== ami/jobs/models.py (relevant sections) ==="
rg -n "JobLog|max_log_length|JobLogHandler|emit|logs" ami/jobs/models.py | head -80

echo ""
echo "=== ami/jobs/serializers.py ==="
cat ami/jobs/serializers.py 2>/dev/null || fd -i "serializers.py" ami/jobs/ --exec cat {}

echo ""
echo "=== ami/jobs/migrations/0021_joblog.py ==="
fd "0021_joblog" ami/jobs/migrations/ --exec cat {}

echo ""
echo "=== Grep for extend_schema_field in jobs ==="
rg -n "extend_schema_field|SchemaField|JobLogsSerializer|serialize_job_logs" ami/jobs/

echo ""
echo "=== Grep for retention/prune/cleanup in jobs ==="
rg -n "prune|retention|rotate|cleanup|max_log_length|periodic" ami/jobs/ --type py

echo ""
echo "=== ami/jobs/models.py full JobLog and JobLogHandler sections ==="
rg -n -A 20 "class JobLog" ami/jobs/models.py
rg -n -A 30 "class JobLogHandler" ami/jobs/models.py

echo ""
echo "=== Check index definition on JobLog ==="
rg -n "class Meta|ordering|indexes|index_together" ami/jobs/models.py -A 5

Length of output: 27503


@mihow — code read complete. Confirming all takeaways against the live source, with line-level evidence and ready-to-apply fixes for the two highest-priority items.


Confirmed findings

1 🔴 Unbounded jobs_joblog growth — block on merge

models.py Line 355: max_log_length = 1000 is a dead constant. emit() (Lines 361–380) inserts unconditionally — no per-job cap, no periodic DELETE anywhere in the codebase. The grep for prune|retention|rotate|max_log_length turns up nothing relevant to JobLog.

Minimum acceptable resolution before merge: open a follow-up issue and link it from the PR body (acceptable if you're comfortable with unbounded growth during soak). Stronger resolution: add a soft cap inside emit — e.g., skip the INSERT when JobLog.objects.filter(job_id=self.job.pk).count() >= MAX_ROWS — though a periodic Celery beat prune is cleaner for long-lived jobs. The orphaned constant is the right cap value; repurpose it.
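A sketch of that emit-side soft cap, under the handler/model shape this PR adds — class and constant names here are illustrative, and the COUNT-per-emit cost is exactly why the beat-prune route is the cleaner long-term fix:

import logging

from django.conf import settings

MAX_ROWS_PER_JOB = 1000  # repurposes the old max_log_length ceiling


class JobLogHandler(logging.Handler):
    def __init__(self, job):
        super().__init__()
        self.job = job

    def emit(self, record):
        if not getattr(settings, "JOB_LOG_PERSIST_ENABLED", True):
            return
        # Soft cap: stop persisting once the job hits its row budget.
        # This costs one COUNT per log line — fine as a stopgap, but a
        # periodic beat prune avoids the per-write overhead entirely.
        if JobLog.objects.filter(job_id=self.job.pk).count() >= MAX_ROWS_PER_JOB:
            return
        JobLog.objects.create(job=self.job, level=record.levelname, message=self.format(record))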


2 🔴 OpenAPI schema regression on logs — block on merge

serializers.py Line 89: logs = serializers.SerializerMethodField() with no @extend_schema_field on get_logs (Line 185). The extend_schema_field import is absent from the file entirely. The generated schema for logs becomes {}.

One-liner fix — add the import and annotate get_logs in both JobListSerializer and JobSerializer (the latter inherits get_logs from the former, so one decoration is enough):

# top of serializers.py — add alongside existing imports
from drf_spectacular.utils import extend_schema_field
# replace the bare get_logs definition
@extend_schema_field(
    {
        "type": "object",
        "properties": {
            "stdout": {"type": "array", "items": {"type": "string"}},
            "stderr": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["stdout", "stderr"],
    }
)
def get_logs(self, obj: Job) -> dict[str, list[str]]:
    view = self.context.get("view")
    if getattr(view, "action", None) == "list":
        return _legacy_logs_shape(obj)
    return serialize_job_logs(obj)

Alternatively, since JobLogs (models.py Line 320) is already a Pydantic model with the exact shape, you can use SchemaField-style annotation if drf-spectacular is wired to pick up pydantic schemas in your project, but the explicit dict above is the safest cross-version approach.


3 🟡 jobs_job.logs column cleanup — no follow-up issue

PR body says "separate cleanup PR after soak" but no issue is linked. Without a ticket, the fallback path and the JSONB column drift indefinitely. Please open the issue before or at merge time and link from the PR body.


4 🟡 Soak still pending

Can't validate from code, but your own checklist flags it. The load-test numbers in the PR body are compelling; coverage on a staging-equivalent deployment with real DB size is the missing piece.


5 🟢 serialize_job_logs placement — safe where it is

It's a module-level function in serializers.py (Line 32). It reads JobLog rows and produces the wire shape — pure translation, fits naturally alongside the serializers that call it. No action needed unless you want it co-located with JobLog in models.py for discoverability; either is fine.


6 🟢 List-view empty logs for new jobs — verified safe

get_logs branches on view.action == "list" (Line 188) and returns the legacy JSON shape. The UI list page only renders logs in the detail dialog (useJobDetails → detail action → JobLog query). No regression.


7 🟢 Mop-up

  • max_log_length = 1000 (Line 355): dead constant now — either repurpose it as the emit-side soft cap (see item 1) or delete it.
  • Index width: JobLog.Meta index is ["job", "-created_at"] (migration confirmed). The model ordering and the serialize_job_logs query both order by ["-created_at", "-pk"]. PostgreSQL will sort -pk ties without index support. Tiebreaks are rare in practice (sub-millisecond same created_at), but widening the index in the migration to ["job", "-created_at", "-pk"] eliminates the extra sort entirely and is a zero-cost change at this row volume.
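If that widening ships alongside the #1277 retention work, a follow-up migration could look like this sketch — the migration dependency and both index names are placeholders, not the repo's actual values:

from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("jobs", "0022_alter_job_logs_help_text")]

    operations = [
        # Replace the two-column index with one that also covers the -pk
        # tiebreak used by the model ordering, so reads come straight off
        # the index with no extra sort node.
        migrations.RemoveIndex(model_name="joblog", name="jobs_joblog_job_created_idx"),  # old name is a placeholder
        migrations.AddIndex(
            model_name="joblog",
            index=models.Index(fields=["job", "-created_at", "-pk"], name="jobs_joblog_job_created_pk_idx"),
        ),
    ]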

Land checklist delta (items that need action before merge)

| # | Item | Status |
|---|---|---|
| (a) | Retention decision: prune task in PR or linked follow-up issue | ❌ open |
| (b) | @extend_schema_field on get_logs | ❌ one-liner fix above |
| (c) | Open + link cleanup ticket for jobs_job.logs column drop | ❌ open |
| (d) | Resolve serialize_job_logs thread | 🟢 safe either way |
| (e) | Soak on staging | ❌ pending |
| (f) | Drop max_log_length or repurpose as soft cap | 🟡 trivial |

(a), (b), and (c) are the three items I'd call hard gates. Everything else is hygiene.


🧠 Learnings used
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1261
File: ami/jobs/tasks.py:502-514
Timestamp: 2026-04-21T08:10:37.134Z
Learning: In the RolnickLab/antenna project, `job.progress` is stored as a single JSONB column (`jobs_job.progress`). Using `job.save(update_fields=["progress", ...])` does NOT protect against concurrent clobbering of sub-fields within that column (e.g., `progress.errors` appended by `_reconcile_lost_images`), because the entire dict is still serialized and written back. The correct mitigation for sub-field concurrency is a server-side `jsonb_set` atomic update or optimistic compare-and-swap (tracked under issue `#1256` option D). Do not flag `update_fields=["progress"]` as sufficient protection against concurrent JSONB sub-field writes.

@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Re takeaway #1 (unbounded JobLog growth) — proposing we route retention through the data integrity check framework being landed in #1188 instead of a one-off prune task.

Fits the framework shape directly. New module ami/main/checks/joblogs.py (or ami/jobs/checks/... if we move per-domain later) with:

  • get_joblog_retention_violations(project_id=None, max_rows_per_job=N, max_age_days=M) — returns rows over budget (newest-N kept per job, anything older than M days)
  • reconcile_joblog_retention(..., dry_run=True) — deletes excess rows, returns IntegrityCheckResult(checked, fixed, unfixable)

Then plugs into check_data_integrity management command + Celery beat task that #1188 already wires up. No new periodic task, no new CLI surface.

Means we can land this PR as-is once (a) #1188 merges and (b) we ship the joblogs.py check + a beat schedule entry as an immediate follow-up. Cleaner than baking retention into JobLogHandler.emit (which would re-introduce write-path overhead) and avoids the orphan-max_log_length bandaid.

Defaults to discuss: per-job cap (matches the old 1000-line ceiling? or higher now that we have a real table?), age cap (30d? 90d?), and whether the --no-dry-run mode should run on a beat schedule or stay manual until soak.

If acceptable I'll open the JobLog-retention follow-up issue now and link it from this PR body so the trail is durable. Then takeaway #3 (column-drop ticket) is the only outstanding open-issue ask before merge.
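For concreteness, a sketch of the two helpers proposed above — the IntegrityCheckResult shape and the #1188 framework hooks are assumptions until that PR lands, and the default budgets are the open questions listed:

from dataclasses import dataclass
from datetime import timedelta

from django.utils import timezone

from ami.jobs.models import JobLog


@dataclass
class IntegrityCheckResult:  # shape assumed from the proposal above
    checked: int
    fixed: int
    unfixable: int


def get_joblog_retention_violations(max_rows_per_job=1000, max_age_days=90):
    # Collect PKs of rows over budget: anything past the age cap, plus
    # anything older than the newest-N kept per job. Defaults are
    # placeholders pending the discussion above.
    cutoff = timezone.now() - timedelta(days=max_age_days)
    violation_ids = set(JobLog.objects.filter(created_at__lt=cutoff).values_list("pk", flat=True))
    for job_id in JobLog.objects.values_list("job_id", flat=True).distinct():
        per_job = JobLog.objects.filter(job_id=job_id).order_by("-created_at", "-pk")
        keep = list(per_job.values_list("pk", flat=True)[:max_rows_per_job])
        violation_ids |= set(per_job.exclude(pk__in=keep).values_list("pk", flat=True))
    return violation_ids


def reconcile_joblog_retention(dry_run=True, **budgets):
    ids = get_joblog_retention_violations(**budgets)
    fixed = 0 if dry_run else JobLog.objects.filter(pk__in=ids).delete()[0]
    return IntegrityCheckResult(checked=len(ids), fixed=fixed, unfixable=0)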

Michael Bunsen and others added 3 commits April 29, 2026 18:14
Replaces the jobs_job.logs JSON-field UPDATE path in JobLogHandler.emit
with an INSERT on a new JobLog child table. Under concurrent async_api
load every log line triggered UPDATE jobs_job SET logs = ... on the
shared job row; inside ATOMIC_REQUESTS a single batched /result POST
stacked N such UPDATEs in one tx, blocking ML workers on the same row
(issue #1256).

- JobLog model + migration 0021_joblog (FK job, level, message, created_at)
- JobLogHandler.emit -> JobLog.objects.create (gated by JOB_LOG_PERSIST_ENABLED
  flag introduced in PR #1261)
- JobSerializer.logs -> SerializerMethodField that reads JobLog rows on
  detail endpoint and falls back to the legacy jobs_job.logs JSON for
  jobs created before this table existed. List endpoint keeps the cheap
  legacy shape to avoid N+1.
- No UI changes, no new endpoint. On-wire {stdout, stderr} shape preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
- Route existing tests that assert on job.logs.stdout to read from
  JobLog rows via a joined_job_log_messages helper
- Same for ami/ml/tests.py error-logging assertion
- Add TestJobLogPersistence class with 6 new cases:
  - emit inserts one JobLog per call; legacy JSON column stays empty
  - flag=False short-circuits emit (no rows, no JSON write)
  - serialize_job_logs reads JobLog rows newest-first and splits stderr
  - serialize_job_logs falls back to legacy JSON when no rows exist
  - get_logs on list action returns legacy shape (no JobLog query)
  - get_logs on retrieve action reads JobLog rows

47/47 pass in ami.jobs.tests.test_jobs.

Co-Authored-By: Claude <noreply@anthropic.com>
refactor(jobs): tidy JobLog wrap-up

- Drop orphaned JobLogHandler.max_log_length constant — JobLog has no
  per-job storage cap (retention will be handled by the data integrity
  check framework, #1188 follow-up).
- Move serialize_job_logs / _legacy_logs_shape / JOB_LOG_* constants
  from serializers.py to models.py so they live next to JobLog itself.
- Annotate JobListSerializer.get_logs with @extend_schema_field so the
  OpenAPI schema retains the {stdout: string[], stderr: string[]}
  contract that SchemaField(JobLogs) used to advertise.
- Mark Job.logs as deprecated in help_text — read-only fallback for
  pre-#1259 jobs only. Column stays until legacy rows are backfilled
  into JobLog (separate follow-up).

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Wrap-up applied. Branch rebased onto main (post-#1261 merge), one new commit on top:

458613f refactor(jobs): tidy JobLog wrap-up

Folds in the takeaway-review actions — the items are listed in the wrap-up commit message above.

Tests: 56/56 in ami.jobs.tests.test_jobs + the ml.tests case pass on the rebased branch. PR body updated to link the two follow-up issues.

Outstanding:

  • (4) staging soak — last unchecked box on the PR.

Ready for re-review @coderabbitai.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026


🧩 Analysis chain

Let me read the current state of the files to verify all the wrap-up changes before writing a reply.


🏁 Script executed:

#!/bin/bash
echo "=== ami/jobs/models.py: JobLog, serialize_job_logs, JobLogHandler, max_log_length ==="
rg -n "class JobLog|serialize_job_logs|max_log_length|JOB_LOGS_DEFAULT_LIMIT|JobLogHandler|def emit|help_text.*logs\|deprecated" ami/jobs/models.py -A 5

echo ""
echo "=== ami/jobs/models.py: JobLog class full ==="
rg -n -A 30 "class JobLog\b" ami/jobs/models.py

echo ""
echo "=== ami/jobs/models.py: serialize_job_logs full ==="
rg -n -A 20 "def serialize_job_logs" ami/jobs/models.py

echo ""
echo "=== ami/jobs/models.py: Job.logs field ==="
rg -n -B2 -A5 "logs.*Field|logs.*help_text|deprecated" ami/jobs/models.py | head -40

echo ""
echo "=== ami/jobs/serializers.py: full file ==="
cat ami/jobs/serializers.py

echo ""
echo "=== Check extend_schema_field and get_logs in serializers.py ==="
rg -n "extend_schema_field|get_logs|SchemaField|JOB_LOGS_DEFAULT_LIMIT|serialize_job_logs|SerializerMethodField" ami/jobs/serializers.py

echo ""
echo "=== Check JOB_LOGS_DEFAULT_LIMIT defined anywhere ==="
rg -rn "JOB_LOGS_DEFAULT_LIMIT" ami/

echo ""
echo "=== Check `#1277` and `#1278` references in PR body / code comments ==="
rg -rn "1277\|1278" ami/jobs/ | head -20

Length of output: 14966


@mihow — re-review complete. All five wrap-up items confirmed against live source.


Verification results

| Takeaway | Claim | Verified |
|---|---|---|
| (2) OpenAPI | @extend_schema_field with {stdout, stderr} shape on get_logs | ✅ Lines 151–169 of serializers.py |
| (1) Retention | JOB_LOGS_DEFAULT_LIMIT = 1000 read cap; #1277 linked | models.py Line 352; #1277 in PR body |
| (3) Column drop | Job.logs deprecated; #1278 linked | ✅ PR body; _legacy_logs_shape fallback path still in place |
| (5) Placement | serialize_job_logs moved to models.py next to JobLog | models.py Lines 363–388 |
| (7) Mop-up | max_log_length = 1000 constant removed | ✅ Zero occurrences in codebase |

Two remaining notes

Minor / non-blocking:

  • Index tiebreak — JobLog.Meta.indexes is still ["job", "-created_at"]. The serialize_job_logs query orders by ["-created_at", "-pk"], so PostgreSQL will sort -pk ties without index support. At current row volumes this is a no-op cost, but widening to ["job", "-created_at", "-pk"] in the migration eliminates the off-index sort entirely. Worth doing if you're already cutting a follow-up migration for #1277 retention; otherwise fine to defer.

Hard gate (yours, not mine):

  • (4) Staging soak — still the one unchecked box on the PR. The load-test numbers in the body are solid; a staging-equivalent run with a real DB copy is the last signal before merge. Nothing in the code blocks it — this is purely on the ops side.

From a code-review standpoint this is clean. Ship it once the soak clears. 🐇


🧠 Learnings used
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1261
File: ami/jobs/tasks.py:502-514
Timestamp: 2026-04-21T08:10:37.134Z
Learning: In the RolnickLab/antenna project, `job.progress` is stored as a single JSONB column (`jobs_job.progress`). Using `job.save(update_fields=["progress", ...])` does NOT protect against concurrent clobbering of sub-fields within that column (e.g., `progress.errors` appended by `_reconcile_lost_images`), because the entire dict is still serialized and written back. The correct mitigation for sub-field concurrency is a server-side `jsonb_set` atomic update or optimistic compare-and-swap (tracked under issue `#1256` option D). Do not flag `update_fields=["progress"]` as sufficient protection against concurrent JSONB sub-field writes.

Adds an opt-in ``logs_limit`` query parameter on ``GET /api/v2/jobs/{id}/``
so callers can cap the number of JobLog rows returned in the ``logs``
field of the response. Defaults to 1000 (unchanged), capped at 5000 to
keep payload size bounded; pagination over older entries will ship with
a dedicated ``/jobs/logs/`` endpoint in a follow-up.

- Validation follows the existing pattern (``django.forms.IntegerField``
  used elsewhere in JobViewSet for ``cutoff_hours``). Bad input raises
  ValidationError → DRF 400.
- ``OpenApiParameter`` defined in ``ami/jobs/schemas.py`` and wired onto
  the ``retrieve`` action via ``extend_schema_view``.
- Tests cover: cap honoured, default applied when absent, non-int /
  zero / over-cap values rejected.
- Migration ``0022_alter_job_logs_help_text`` reflects the deprecation
  marker on ``Job.logs`` from the previous wrap-up commit. (An
  unrelated cosmetic ``job_type_key`` choices drift surfaced during
  ``makemigrations`` and was excluded from this migration so the diff
  stays focused — it predates this PR and is left for a follow-up.)

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Added ?logs_limit= to GET /api/v2/jobs/{id}/ (commit 4ea7102):

  • Default 1000 (unchanged), capped at JOB_LOGS_MAX_LIMIT = 5000
  • Validated via django.forms.IntegerField(required=False, min_value=1, max_value=JOB_LOGS_MAX_LIMIT) — same pattern as ?cutoff_hours= already in JobViewSet.get_queryset. Bad input → DRF 400.
  • OpenApiParameter declared in ami/jobs/schemas.py, wired onto the retrieve action via extend_schema_view
  • Tests cover: cap honoured, default applies when absent, non-int / zero / over-cap rejected

Migration 0022_alter_job_logs_help_text reflects the deprecation help_text on Job.logs from the prior wrap-up commit. (An unrelated job_type_key choices drift — cosmetic verbose-name fix from PR #1245 — surfaced during makemigrations; excluded from this migration so the diff stays scoped to JobLog work.)

Real pagination (/api/v2/jobs/logs/ with cursor + LimitOffsetPaginationWithPermissions) tracked separately in #1280.

Tests: 61/61 in ami.jobs.tests.test_jobs + the targeted ami.ml.tests case pass on the new commit.

Both query params used django.forms.IntegerField.clean() which raises
django.core.exceptions.ValidationError. DRF's default exception handler
does NOT convert that into HTTP 400 — it bubbles up as 500. Staging
soak surfaced this for ?logs_limit=abc; ?cutoff_hours=abc has the same
bug today on main.

Switch both to ami.base.serializers.SingleParamSerializer[int].clean
with rest_framework.serializers.IntegerField. That raises DRF
ValidationError, which the default handler converts to 400 cleanly.

For logs_limit: validation moves out of JobListSerializer.get_logs and
into JobViewSet.get_serializer_context, run once at the view boundary
and passed to the serializer via context. Serializer just reads the
cleaned int.

Tests: new HTTP-level TestJobLogsLimitHTTPValidation class asserts
status_code == 400 for "abc" / 0 / -5 / above-max via APIClient
(previous unit tests only asserted the function raised, never
exercised the HTTP path). Added equivalent assertion for cutoff_hours
on TestJobView. Helper test_logs_limit_caps_response_size /
test_logs_limit_default_when_unset stay as serializer-level tests
covering the cleaned-context shape.

Pattern matches existing SingleParamSerializer callers in
ami/base/views.py:47, ami/main/api/views.py:796, :885.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Pushed 8657e16f — fix the 500-on-bad-input issue surfaced during staging soak.

Problem. ?logs_limit=abc (and ?cutoff_hours=abc on the list endpoint, pre-existing on main) used django.forms.IntegerField.clean(), which raises django.core.exceptions.ValidationError. DRF's default exception handler does NOT convert that — it bubbles up as HTTP 500. The earlier wrap-up commit's claim that DRF surfaces this as 400 was incorrect.

Fix. Both query params now use ami.base.serializers.SingleParamSerializer[int].clean(...) with rest_framework.serializers.IntegerField. DRF ValidationError → 400 cleanly, matching the existing pattern at ami/base/views.py:47, ami/main/api/views.py:796, :885.

For ?logs_limit= specifically, validation moved out of JobListSerializer.get_logs and into JobViewSet.get_serializer_context — runs once at the view boundary, cleaned int passed to the serializer via context.
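A sketch of that view-boundary pattern — the SingleParamSerializer.clean signature is assumed from the call sites cited above (ami/base/views.py:47, ami/main/api/views.py:796, :885), so treat the argument order as illustrative:

from rest_framework import serializers, viewsets

from ami.base.serializers import SingleParamSerializer

JOB_LOGS_DEFAULT_LIMIT = 1000
JOB_LOGS_MAX_LIMIT = 5000


class JobViewSet(viewsets.ModelViewSet):
    def get_serializer_context(self):
        context = super().get_serializer_context()
        # Validate once at the view boundary. The DRF IntegerField raises
        # rest_framework's ValidationError on bad input, which the default
        # exception handler converts to HTTP 400 — unlike the old
        # django.forms path, whose ValidationError bubbled up as a 500.
        context["logs_limit"] = SingleParamSerializer[int].clean(
            "logs_limit",  # param name; argument order assumed from the cited call sites
            self.request.query_params,
            field=serializers.IntegerField(
                required=False,
                min_value=1,
                max_value=JOB_LOGS_MAX_LIMIT,
                default=JOB_LOGS_DEFAULT_LIMIT,
            ),
        )
        return context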

New tests (HTTP-level via APIClient):

  • TestJobLogsLimitHTTPValidation — 5 tests covering "abc", 0, -5, above-max, valid-200.
  • TestJobView.test_list_jobs_with_invalid_cutoff_hours_returns_400 — locks in the cutoff_hours fix.

The earlier function-level assertRaises(ValidationError) tests were structurally wrong (they passed but never exercised the HTTP path). Replaced.

Re-soak on staging (commit 8657e16f):

=== logs_limit ===
  logs_limit='abc'   -> 400
  logs_limit='0'     -> 400
  logs_limit='99999' -> 400
  logs_limit='-5'    -> 400
  logs_limit='5'     -> 200
=== cutoff_hours ===
  cutoff_hours='abc' -> 400
  cutoff_hours='-5'  -> 400
  cutoff_hours='24'  -> 200

All 168 backend tests pass locally (ami.jobs + ami.ml.tests). Staging reset back to main.

@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): arctia full-stack soak complete — ready to merge.

Live ML job through real Celery + ADC worker

Job 163 ran end-to-end on arctia (project 9, SUCCESS) and wrote 209 JobLog rows via the live JobLogHandler.emit() path. This closes the gap staging couldn't cover (staging ML services unreachable), confirming the worker-side INSERT path works under real concurrency.

Validation smoke (logs_limit + cutoff_hours)

Both params now go through SingleParamSerializer[int].clean(...) with DRF's IntegerField, validated at the view boundary in JobViewSet.get_serializer_context / get_queryset. Tested against arctia (job 163, 209 logs):

| Param | Value | Status |
|---|---|---|
| logs_limit | abc | 400 ✅ |
| logs_limit | 0 | 400 ✅ |
| logs_limit | -5 | 400 ✅ |
| logs_limit | 99999 (above max) | 400 ✅ |
| logs_limit | 5 | 200, returns 5 entries ✅ |
| cutoff_hours | abc | 400 ✅ |
| cutoff_hours | -5 | 400 ✅ |
| cutoff_hours | 24 | 200 ✅ |

Detail endpoint latency (job 163, 209 real logs)

| logs_limit | median | min | max |
|---|---|---|---|
| 10 | 105.1ms | 104.8ms | 112.6ms |
| 50 | 106.5ms | 105.8ms | 171.2ms |
| 200 | 116.1ms | 111.0ms | 173.0ms |

Latency grows gently with logs_limit — no runaway. No need for further tuning before merge.

arctia reset to main (a347d559).

Collaborator

@mihow mihow left a comment


Tested well in a staging environment under pressure. Massive improvement!

@mihow mihow merged commit c8e5c72 into main Apr 30, 2026
7 checks passed
@mihow mihow deleted the copilot/implement-option-a-job-logs branch April 30, 2026 22:41