
Refactor job logging to use separate table #1259

Merged — mihow merged 5 commits into main from copilot/implement-option-a-job-logs, Apr 30, 2026

Conversation

Contributor

Copilot AI commented Apr 20, 2026

Summary

When the platform processes images through the async ML pipeline, every result the ML worker posts back triggers a small log line on the parent job ("saved 3 detections", "task acknowledged", etc). On busy jobs that meant dozens of concurrent UPDATEs hitting the same row of the jobs_job table on its logs JSONB column — a row-lock contention point that, under real load, freezes a whole job: workers stall waiting for the row lock, NATS message acks expire while waiting, redelivery storms follow, and the job gets stuck mid-run until the stale-job reaper revokes it.

This PR moves per-job log lines onto a separate append-only JobLog table. Each log line is a fast INSERT into a child table instead of a read-modify-write of the JSONB column on the parent row. No more row-lock fight, no more redelivery storms, jobs finish.

The UI is unchanged — the same {stdout, stderr} shape is reconstructed from JobLog rows by the serializer, with a fallback to the legacy column for jobs created before this PR.

Move job-log persistence off the jobs_job.logs JSON-field UPDATE path onto a joined append-only JobLog table, without changing the UI or adding new endpoints. Replaces the earlier draft of this PR (new /logs/ endpoint + UI rewire) with a narrower, serializer-compatible approach.
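For orientation, a minimal sketch of the write-path change — assuming the handler and JobLog model shape described in this thread; the commented-out old path paraphrases the JSONB read-modify-write rather than quoting the removed code:

import logging

from django.conf import settings

from ami.jobs.models import JobLog  # child table added by this PR


class JobLogHandler(logging.Handler):
    def __init__(self, job):
        super().__init__()
        self.job = job

    def emit(self, record: logging.LogRecord) -> None:
        if not getattr(settings, "JOB_LOG_PERSIST_ENABLED", True):
            return  # escape hatch from PR #1261: container stdout only
        # Old path (contended): read-modify-write of the parent row's
        # JSONB column, serialized behind a row lock per log line:
        #     self.job.refresh_from_db()
        #     self.job.logs.stdout.append(self.format(record))
        #     self.job.save(update_fields=["logs"])
        # New path: one append-only INSERT on the child table; no lock
        # is taken on the parent jobs_job row.
        JobLog.objects.create(
            job=self.job,
            level=record.levelname,
            message=self.format(record),
        )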

List of Changes

  • Added a JobLog model and migration (0021_joblog) for append-only per-job log rows
  • Switched JobLogHandler.emit() to insert one joined log row instead of refreshing and rewriting the parent jobs_job row (gated by the JOB_LOG_PERSIST_ENABLED flag introduced in PR #1261)
  • Rewired JobListSerializer.logs / JobSerializer.logs to a SerializerMethodField that reads joined rows on detail responses, falls back to the legacy jobs_job.logs JSON for jobs created before this table existed, and keeps the cheap legacy shape on list responses to avoid N+1 (see the sketch after this list)
  • Annotated get_logs with @extend_schema_field so the OpenAPI schema retains the {stdout: string[], stderr: string[]} contract that SchemaField(JobLogs) advertised
  • Added ?logs_limit= query param on the detail endpoint (default 1000, capped at 5000). Validated via SingleParamSerializer[int].clean(...) with a DRF IntegerField at the view boundary in JobViewSet.get_serializer_context. Also migrated ?cutoff_hours= to the same helper — the previous django.forms.IntegerField path was raising HTTP 500 on bad input. Pagination over older entries lands in a separate PR (#1280: Dedicated /jobs/logs/ endpoint with pagination for JobLog rows).
  • Marked Job.logs as deprecated in help_text — read-only fallback for pre-#1259 jobs only. Migration 0022_alter_job_logs_help_text.
  • Updated backend tests to assert against joined log rows. 14 new cases total: 8 in TestJobLogPersistence (emit, flag-off short-circuit, joined-table read, legacy-JSON fallback, list-vs-detail paths, logs_limit cap), 5 in TestJobLogsLimitHTTPValidation (HTTP-level validation of ?logs_limit=), 1 in TestJobView for ?cutoff_hours= 400-on-bad-input.
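A minimal sketch of the read side referenced in the list above — the legacy fallback and newest-first ordering come from the test notes later in this thread, while the stderr level split and line format here are assumptions:

from rest_framework import serializers

from ami.jobs.models import JobLog


def serialize_job_logs(job, limit=1000):
    # Read joined rows newest-first; fall back to the legacy JSONB
    # column for jobs created before the JobLog table existed.
    rows = list(JobLog.objects.filter(job=job).order_by("-created_at", "-pk")[:limit])
    if not rows:
        return _legacy_logs_shape(job)  # {stdout: [...], stderr: [...]} from jobs_job.logs
    stdout, stderr = [], []
    for row in rows:
        target = stderr if row.level in ("WARNING", "ERROR", "CRITICAL") else stdout
        target.append(f"[{row.level}] {row.message}")
    return {"stdout": stdout, "stderr": stderr}


class JobListSerializer(serializers.ModelSerializer):
    logs = serializers.SerializerMethodField()

    def get_logs(self, obj):
        view = self.context.get("view")
        if getattr(view, "action", None) == "list":
            return _legacy_logs_shape(obj)  # cheap shape, no N+1 across a page of jobs
        return serialize_job_logs(obj, limit=self.context.get("logs_limit", 1000))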

Not in this PR (deliberately deferred)

Related

Why this scope

The earlier draft of this PR added a /api/v2/jobs/{id}/logs/ endpoint and changed 5 UI files to consume it. That's more surface area than the fix requires: the only functional change needed is moving the write path off the row-locked JSON field and onto a child table. Keeping the serializer shape identical lets the UI stay on its current contract, and a SerializerMethodField with a legacy-JSON fallback covers the read side without a schema break or a transition endpoint. Rollout is one Django migration + one code change, no UI deploy coordination.

How to Test

Unit:

docker compose exec -T django python manage.py test \
  ami.jobs.tests.test_jobs \
  ami.jobs.tests.test_tasks \
  ami.ml.tests.TestPipelineWithProcessingService.test_run_pipeline_with_errors_from_processing_service \
  --keepdb

Local reproduction of the row-lock pathology (prereqs and exact steps in docs/claude/debugging/row-lock-contention-reproduction.md):

export ANTENNA_TOKEN=<TOKEN>
python scripts/load_test_result_endpoint.py <JOB_ID> \
    --batch 50 --concurrency 20 --rounds 5

Observed on the dev stack (WEB_CONCURRENCY=1, 8 Celery ML-fork workers), flag JOB_LOG_PERSIST_ENABLED=true (default with the new write path):

| Signal | PR #1261 flag=true baseline | PR #1261 flag=false baseline | This PR (flag=true, JobLog) |
|---|---|---|---|
| 10×50 POST success | 0/10 (timeout) | 10/10 | 30/30 |
| 20×50 POST success | not re-measured | not re-measured | 100/100 |
| p95 latency (10×50) | 120s+ | ~5s | ~13s |
| blocker_chain peak | 30+ sustained | 0–1 transient | 0–7 transient (mostly 0) |
| blocker_chain peak (20×50) | n/a | n/a | 1 transient |
| idle_in_tx | 8–10 sustained | 0 | tracks concurrency; clears between waves |

The latency delta vs. flag=false is the expected cost of actually persisting the rows (vs. short-circuiting emit entirely). The flag=true → JobLog path trades that small cost for no 120s-timeout stalls.

End-to-end pressure test on a dev box (live ADC worker + NATS)

Project 9, "Random 100" collection, pipeline quebec_vermont_moths_2023, 4 concurrent ADC workers, AMI_ANTENNA_API_BATCH_SIZE=2 (small batches → many small fast result POSTs):

| Run | Branch | JOB_LOG_PERSIST_ENABLED | Outcome |
|---|---|---|---|
| Job 164 | main | True (default) | stuck 18 min, reaper revoked it; 86 NATS msgs unacked, 52 redeliveries; only 5 log entries reached jobs_job.logs before everything stalled |
| Job 166 | main | True (default) | killed manually at 2 min mid-stall, same pattern: 11 RowExclusiveLock + 10 ExclusiveLock on jobs_job, ~20 backends idle in transaction (aborted), 8+ active workers waiting on Lock/tuple for 16+ s, all running UPDATE "jobs_job" SET "logs" = ... |
| Job 167 | this PR | True (writes JobLog rows) | 35.4 s SUCCESS, 1966 JobLog rows, max 1 row lock at any moment, mean 0.03 |
| Job 168 | this PR | False (escape hatch, stdout only) | 33.4 s SUCCESS, 0 DB log rows |

The 2 s gap between persist=True (35.4 s) and persist=False (33.4 s) is the cost of 1966 child-table INSERTs — roughly a millisecond per row — small enough that persist=True stays the right default and the flag remains a useful emergency switch.

Deployment Notes

Checklist

  • Tested these changes locally (unit + load test per the repro runbook).
  • Added and/or modified relevant tests.
  • Updated relevant documentation / comments.
  • Soaked under real end-to-end load on a staging-equivalent deployment (arctia dev box: JobLogHandler.emit() exercised via live Celery + ADC worker round-trip; pressure test above demonstrates contention vs. main).

@netlify

netlify Bot commented Apr 20, 2026

Deploy Preview for antenna-ssec canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 8657e16 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/69f38cc5ff0ccf0008621fb0 |

@netlify

netlify Bot commented Apr 20, 2026

Deploy Preview for antenna-preview canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 8657e16 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/69f38cc5aa5cb50008a894a6 |

@mihow mihow marked this pull request as ready for review April 21, 2026 01:17
Copilot AI review requested due to automatic review settings April 21, 2026 01:17
Contributor

Copilot AI left a comment


Pull request overview

This PR moves high-frequency per-job log persistence from the jobs_job.logs JSON field into an append-only joined JobLog table, and updates the API/UI to read job logs via a dedicated endpoint (with legacy JSON fallback).

Changes:

  • Added JobLog model + migration and updated JobLogHandler.emit() to insert joined log rows.
  • Added /api/v2/jobs/{id}/logs/ endpoint + serializer helpers to return formatted stdout/stderr with legacy JSON fallback.
  • Updated UI job detail dialogs to fetch logs separately via a new useJobLogs hook; updated backend tests accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
|---|---|
| ui/src/pages/session-details/playback/capture-details/capture-job/capture-job-dialog.tsx | Fetch logs/errors via useJobLogs and pass into JobDetails. |
| ui/src/pages/jobs/jobs.tsx | Fetch logs/errors via useJobLogs and pass into JobDetails. |
| ui/src/pages/job-details/job-details.tsx | Make JobDetails accept logs/errors props and render them instead of job.logs/job.errors. |
| ui/src/data-services/hooks/jobs/useJobLogs.ts | New hook to fetch /jobs/{id}/logs/ and expose stdout/stderr arrays. |
| ami/ml/tests.py | Update pipeline error test to assert against JobLog rows. |
| ami/jobs/views.py | Add logs action endpoint returning serialized joined logs with a limit param. |
| ami/jobs/tests/test_jobs.py | Update assertions to read from JobLog rows; add logs endpoint test. |
| ami/jobs/tasks.py | Move per-job logger writes out of the select_for_update() transaction region. |
| ami/jobs/serializers.py | Add serialize_job_logs, adjust job serializers to use method field + new JobLogsSerializer. |
| ami/jobs/models.py | Add JobLog model and refactor JobLogHandler.emit() to append joined log rows. |
| ami/jobs/migrations/0021_joblog.py | Migration creating the JobLog table + index/ordering. |


mihow added a commit that referenced this pull request Apr 21, 2026
fix(jobs): short-circuit JobLogHandler.emit to break log-UPDATE contention

Under concurrent async_api load the per-record UPDATE on jobs_job.logs
becomes the dominant bottleneck after the select_for_update fix in the
previous commit. Observed on demo: 5 concurrent jobs produced a 56-deep
pg_blocking_pids chain, all waiting on UPDATE jobs_job SET logs = ...
from either /result (view, under ATOMIC_REQUESTS) or _update_job_progress
(ml worker).

Short-circuit emit to the module-level logger only (container stdout).
The per-job UI log feed goes blank while this hotfix is active; revisit
once PR #1259 lands the append-only JobLog child table.
mihow added a commit that referenced this pull request Apr 21, 2026
feat(jobs): add JOB_LOG_PERSIST_ENABLED flag + local repro docs

Replace the unconditional noop of JobLogHandler.emit from the previous
commit with a feature flag (default True, preserves existing behavior).
Deployments hitting row-lock contention on jobs_job can set
JOB_LOG_PERSIST_ENABLED=false to short-circuit the per-record UPDATE
until PR #1259's append-only JobLog table is in place.

Validated locally (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
batched POSTs at 50 results × 10 concurrent):

  flag=true:  blocker_chain=37, 0/10 POSTs complete (120s timeout)
  flag=false: blocker_chain=1 (trivial), 20/20 POSTs complete, p95=5.5s

Also add:
- scripts/load_test_result_endpoint.py — standalone batched-POST driver.
  A single-result-per-POST curl loop does NOT reproduce the contention
  because it skips the per-iter job.logger.info call inside one
  ATOMIC_REQUESTS tx. Batching N>1 results per POST is load-bearing.
- docs/claude/debugging/row-lock-contention-reproduction.md — full
  runbook: pathology, prereqs, steps, pg_stat_activity queries,
  before/after signal table, flag usage.
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch from 4307d69 to 1cfe52a on April 21, 2026 06:56
@mihow
Collaborator

mihow commented Apr 21, 2026

Claude says (on behalf of @mihow): Force-pushed this branch to replace the earlier draft with a narrower scope. The core change — moving job-log writes off the parent row onto a child table — is preserved, but the new /logs/ endpoint and the 5 UI files it required have been dropped in favour of a SerializerMethodField that keeps the existing {stdout, stderr} wire shape.

What stays:

  • JobLog model + migration 0021_joblog (append-only: job FK, level, message, created_at — sketched after this list)
  • JobLogHandler.emit → single JobLog.objects.create(...) INSERT
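A sketch of the model those four fields imply — field options and related_name are assumptions; the ordering and index match what the review below confirms:

from django.db import models


class JobLog(models.Model):
    # One row per log line; rows are only ever INSERTed, never UPDATEd,
    # so no writer contends with another on the parent jobs_job row.
    job = models.ForeignKey("jobs.Job", on_delete=models.CASCADE, related_name="joblogs")  # related_name assumed
    level = models.CharField(max_length=32)
    message = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ["-created_at", "-pk"]
        indexes = [models.Index(fields=["job", "-created_at"])]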

What's removed:

  • GET /api/v2/jobs/{id}/logs/ action, JobLogsSerializer, and the useJobLogs.ts / job-details.tsx / jobs.tsx / capture-job-dialog.tsx changes

What's new:

  • JobListSerializer.logs / JobSerializer.logs become a SerializerMethodField that reads JobLog rows on detail responses, falls back to the legacy JSON column on jobs created before this table existed, and keeps the cheap legacy shape on list responses to avoid N+1
  • JOB_LOG_PERSIST_ENABLED flag (from PR #1261) still gates the emit-side write; setting it false short-circuits emit entirely (no DB write, container stdout unchanged)

PR body has the full scope / out-of-scope list and local validation numbers. This PR depends on #1261 landing first (for the flag) — once that merges, this will rebase onto main and its diff will collapse to just the JobLog-specific commits.

Next step on my side is soaking this under end-to-end concurrent-job load on a staging-equivalent deployment with a demo-DB copy, before requesting review.

@mihow mihow changed the title Refactor job logging to use joined log rows Refactor job logging to use separate table Apr 21, 2026
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch 3 times, most recently from 2b0d524 to c320600 on April 21, 2026 08:12
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch from c320600 to 7207a29 on April 21, 2026 09:29
mihow added a commit that referenced this pull request Apr 23, 2026
* fix(jobs): drop select_for_update in _update_job_progress to break row-lock contention

Under concurrent async_api result processing, every ML result task
queued a contending exclusive lock on the jobs_job row via
`select_for_update()`, stacking behind gunicorn view threads also
holding the row under ATOMIC_REQUESTS (from per-iter job.logger.info
writes in /result).

Observed in a validation environment under 5 concurrent async_api jobs
(2x10 + 2x100 + 1x500 images, one pipeline): py-spy stacks of
celeryworker_ml showed 7 of 8 ForkPoolWorker threads stuck in
`psycopg wait` -> `_update_job_progress (tasks.py:493)`. pg_stat_activity
consistently showed 17-20 backends "idle in transaction" holding
`SELECT ... FOR UPDATE OF jobs_job` for 0.5-5+ seconds at a time.

The `max()` guard below the dropped lock still prevents progress
regression: two racing workers that both read `progress=0.4` and both
compute a new value still write monotonically via `max(existing, new)`.
The trade-off is that accumulated counts (detections, classifications,
captures) can drift by one batch under a read-modify-write race --
cosmetic only, since the underlying Detection/Classification rows are
written authoritatively by `save_results` before this function runs.

Refs: #1256 (log refactor proposal; this PR targets
the parallel progress-update path, not the log-write path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobs): short-circuit JobLogHandler.emit to break log-UPDATE contention

Under concurrent async_api load the per-record UPDATE on jobs_job.logs
becomes the dominant bottleneck after the select_for_update fix in the
previous commit. Observed on demo: 5 concurrent jobs produced a 56-deep
pg_blocking_pids chain, all waiting on UPDATE jobs_job SET logs = ...
from either /result (view, under ATOMIC_REQUESTS) or _update_job_progress
(ml worker).

Short-circuit emit to the module-level logger only (container stdout).
The per-job UI log feed goes blank while this hotfix is active; revisit
once PR #1259 lands the append-only JobLog child table.

* feat(jobs): add JOB_LOG_PERSIST_ENABLED flag + local repro docs

Replace the unconditional noop of JobLogHandler.emit from the previous
commit with a feature flag (default True, preserves existing behavior).
Deployments hitting row-lock contention on jobs_job can set
JOB_LOG_PERSIST_ENABLED=false to short-circuit the per-record UPDATE
until PR #1259's append-only JobLog table is in place.

Validated locally (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
batched POSTs at 50 results × 10 concurrent):

  flag=true:  blocker_chain=37, 0/10 POSTs complete (120s timeout)
  flag=false: blocker_chain=1 (trivial), 20/20 POSTs complete, p95=5.5s

Also add:
- scripts/load_test_result_endpoint.py — standalone batched-POST driver.
  A single-result-per-POST curl loop does NOT reproduce the contention
  because it skips the per-iter job.logger.info call inside one
  ATOMIC_REQUESTS tx. Batching N>1 results per POST is load-bearing.
- docs/claude/debugging/row-lock-contention-reproduction.md — full
  runbook: pathology, prereqs, steps, pg_stat_activity queries,
  before/after signal table, flag usage.

* perf(jobs): wrap process_nats_pipeline_result with cachalot_disabled

Under concurrent async_api load the ML result task is a pure write path:
save_results inserts Detection/Classification rows and _update_job_progress
UPDATEs jobs_job. Cachalot's post-write invalidation added ~2.5s of
overhead per task in production measurement (issue #1256 Path 4), which
throttles the ml_results queue drain when ADC POSTs arrive faster than
Celery can clear them.

Nothing inside the task benefits from the query cache — every read is of
a freshly-written row — so skipping invalidation is strictly a throughput
win. Decorator stack: @celery_app.task wraps the cachalot-wrapped
function, so Celery sees the context manager enter/exit on each task run.

This was prototyped on demo during the 2026-04-20 investigation but
never committed (demo was reset to clean main before PR #1261 was
opened). Restoring it here closes the last documented ML-path
throughput regression.

* chore(jobs): address PR #1261 review feedback

- tasks.py: remove reference to non-existent planning doc; point at
  issue #1256 and PR #1261 instead (Copilot finding).
- load_test_result_endpoint.py: move token to $ANTENNA_TOKEN env var
  (or optional --token) to keep secrets out of argv/shell history;
  validate batch/concurrency/rounds > 0; require http(s) scheme on
  --host; capture error type+reason per-request instead of returning
  bare -1 (CodeRabbit findings).
- docs: update the repro runbook to match the new CLI shape.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(jobs): narrow job.save() to update_fields in _update_job_progress

Previously `job.save()` (no update_fields) wrote the entire row, which under
concurrent async_api load meant a stale read-modify-write could clobber
unrelated fields mutated by a concurrent worker — most notably
`progress.errors` (appended by `_reconcile_lost_images`) and `logs`
(appended by JobLogHandler.emit when persistence is enabled).

Narrow the write to the fields this function actually mutates
(progress, status, finished_at) plus updated_at so auto_now still fires.
The underlying read-modify-write race remains — this only limits blast
radius, not the race itself. The counter-drift and `max()` guard semantics
discussed in the block comment above are unchanged.

Address PR #1261 CodeRabbit review feedback on ami/jobs/tasks.py:514.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(settings): wire NATS_TASK_TTR env var with 60s default

nats_queue.py already reads getattr(settings, "NATS_TASK_TTR", 30), but
the setting was never declared in base.py — so the ack_wait was pinned
at 30s everywhere regardless of env. Wire it through and raise the
default to 60s: 30s is too tight for 24 MB images + GPU cold start +
synchronous result POST, and caused spurious NATS redeliveries under
concurrent load.

Per-env override via NATS_TASK_TTR (demo bumps this further to 120s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(jobs): fair worker polling via last-touched ordering

Workers polling GET /api/v2/jobs/?ids_only=1&incomplete_only=1 currently
get jobs in PK order (oldest first), so N concurrent pollers all race to
the same first job. Under the workshop load (4 pollers × 4 concurrent
jobs) this caused the drain-one-job-at-a-time pattern observed in the
6-job retest: j192 starved at 0% for ~500s while all pollers hammered
j191, and NATS nearly exhausted max_deliver=2 on j192.

Order ids_only responses by updated_at ASC (least-recently-touched
first). updated_at freshens on every progress write, log write, or
status transition, so actively-worked jobs drop to the bottom and
starved jobs bubble to the top — rotation without a dedicated
last_polled_at field.

Fairness is a property of the serving endpoint, not of the worker.
Third-party workers with their own consumption patterns should still
see evenly-distributed handoffs; keeping the scheduling logic in
Antenna rather than ADC preserves that guarantee.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(jobs): default ids_only to limit=1 for pop()-style polling

Worker polling (?ids_only=1) now hands back one least-recently-touched
job per request rather than the default paginated batch of 10. Concurrent
workers were caching the full list and draining it serially, which
starved later jobs until the cache drained. Forcing a re-poll per job
lets the updated_at fairness sort rotate work across jobs every
iteration.

Callers who want a batch can still pass ?limit=N or ?page_size=N
explicitly — back-compat preserved for the few non-polling ids_only
consumers.

No worker-side change required: existing clients iterate a list of zero
or more and work identically with a list of length one.

* fix(jobs): randomize ids_only ordering instead of updated_at sort

The updated_at sort had a degenerate case: freshly queued jobs all share
near-identical updated_at (ties broken by pk), so simultaneous pollers
deterministically pick the same "oldest" job. Replacing the sort with
ORDER BY RANDOM() gives probabilistic disjoint assignment across
concurrent pollers without writing a poll-stamp column.

Combined with the limit=1 default on ?ids_only=1, each worker poll is
an independent "pick any unfinished job" draw. Fairness is probabilistic
rather than strictly oldest-first, but that's an acceptable trade given
we want to avoid write-on-read.

If we later decide to enforce strict fairness, the right follow-up is a
dedicated last_polled_at column with UPDATE ... RETURNING semantics, not
a return to the updated_at sort.

* test(jobs): update ids_only test for pop()-style default

Existing test_list_jobs_with_ids_only now passes ?limit=10 explicitly to
assert the "list all" behavior. New test_list_jobs_ids_only_pops_one
covers the new default where ?ids_only=1 returns one job to support
pop()-style polling handoff across workers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(jobs): heartbeat on idle ids_only polls

The pipeline-scoped heartbeat in _mark_pipeline_pull_services_seen only
fires when a worker hits /tasks/ or /result/ on an active job, so workers
idling on GET /jobs/?ids_only=1 between jobs had no heartbeat path. Their
last_seen would age past PROCESSING_SERVICE_LAST_SEEN_MAX and the UI
would flip them offline despite being actively online.

Add _mark_async_services_seen_for_project + matching celery task that
marks every async ProcessingService attached to the polling project as
seen. Same Redis throttle pattern as the existing heartbeat
(HEARTBEAT_THROTTLE_SECONDS = 30s), scoped by project rather than
(pipeline, project) since the list endpoint has no pipeline context.

* feat(settings): raise NATS_TASK_TTR default to 300s

The 60s default worked on push-mode services but was still too tight for
pull-mode ADC workers running cold-start GPU pipelines on 24 MB images
(staging observed a nontrivial redelivery rate at 60s; demo held steady
only after being bumped to 120s, then 300s). Raise the default to 300s
so new deployments inherit the ceiling that staging/demo actually run
at. max_deliver still bounds how long a genuinely crashed worker can
hold tasks invisibly.

* fix(jobs): heartbeat on ids_only polls when only pipeline__slug__in is sent

The project-scoped heartbeat added in 44d4c32 never fires for the actual
ADC worker request: ``GET /api/v2/jobs/?pipeline__slug__in=...&ids_only=1``
does not carry a ``project_id``, so ``get_active_project()`` returns None
and the helper short-circuits. Result: async processing services remained
``last_seen_live=null`` forever on demo despite workers polling every 5s.

The commit's original rationale — that the list endpoint has no pipeline
context — was wrong: the worker's filter slugs are right there in the
query string. Add a sibling path that derives scope from those slugs:

- ``update_async_services_seen_for_pipelines(pipeline_slugs)`` celery task
  bumps ``last_seen`` on async services whose pipelines match any slug.
- ``_mark_async_services_seen_for_pipelines`` throttles dispatch via a
  Redis key scoped on the sorted slug set.
- The ``list()`` heartbeat branch prefers the pipeline-slug path when
  those slugs are present, and falls back to the project-scoped path for
  callers that send ``?project_id=``.

Tests cover the real worker shape (no project_id, slugs in query) and the
task's filter semantics against sync + async + unrelated services.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(jobs): address PR #1261 review feedback

Response to reviewer comments on the async_api concurrency tuning PR:

- views.py: mark the three `ids_only=1` overrides (random ordering,
  paginator default_limit=1, heartbeat dispatch) as temporary — they exist
  only because list() is doubling as a claim-next-job endpoint. A dedicated
  `/next` action is tracked as #1265; comments carry a hard-coded removal
  target of 2026-04-24 to keep the expiration visible.
- views.py: rewrite the `_mark_async_services_seen_for_pipelines` and
  `_mark_async_services_seen_for_project` docstrings so the contract is
  obvious — they are Redis-throttled wrappers around the matching celery
  task. No DB work in the wrapper.
- tasks.py: add TODO on both `update_async_services_seen_*` tasks pointing
  at #1194 (client-ID auth), so the "one poller falsely marks its peers
  live" invariant-break is visible in source, not just in the PR body.
- tasks.py: fix the stale comment in `_update_job_progress` — the previous
  wording claimed the narrowed save() avoids overwriting `updated_at`, but
  `updated_at` is in `update_fields` intentionally (Django skips auto_now
  when update_fields is provided).
- config/settings/base.py: add a one-line plain-language purpose for
  `JOB_LOG_PERSIST_ENABLED` so the flag reads at-a-glance.

No behavior changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mihow mihow force-pushed the copilot/implement-option-a-job-logs branch from 3381239 to a359686 on April 23, 2026 00:45
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Takeaway review before final merge. Two commits, 5 files, +263/-47. Force-pushed once — scope already trimmed correctly. Prereq #1261 merged 2026-04-23.

What this PR is FOR

Move write hot path off jobs_job.logs JSONB UPDATE → JobLog child INSERT. Read shape stays {stdout, stderr} so no UI work. Decision encoded: append-only child table over per-line UPDATE. Right call.

Takeaways (priority order)

1. Silent decision: log table now grows without bound. Old JobLogHandler.emit capped stdout at max_log_length = 1000 lines per job (models.py:355, now an orphaned constant). New path inserts unconditionally. No retention task, no per-job soft cap, no celery beat prune. Grep confirms zero rotation/cleanup in PR or codebase. High-traffic deployments → pathologically large jobs_joblog. Decide before merge: add a periodic prune task in this PR, or open a follow-up issue and link from PR body so the trail doesn't go cold.

2. OpenAPI schema regression on logs field. JobListSerializer.logs was SchemaField(schema=JobLogs, read_only=True) → typed in OpenAPI. Now serializers.SerializerMethodField() with no @extend_schema_field(JobLogsSerializer). Generated schema for logs degrades to untyped {}. Frontend codegen / external consumers lose the {stdout: string[], stderr: string[]} contract. Fix: annotate get_logs with @extend_schema_field returning the dict-of-list-of-str shape.

3. Cleanup story for jobs_job.logs JSONB column not committed. PR body says "separate cleanup PR after soak" — no follow-up issue linked. Without an explicit ticket → column + fallback path live forever. Open the follow-up issue before merging this PR, link it from the body. Otherwise drift sticks.

4. Soak still pending — own checklist flags it. Local 8-worker workstation ≠ staging-equivalent. Last unchecked box on PR. Don't merge before a real e2e load run on a deployment with realistic concurrency. Numbers in body are good; coverage is not.

5. Open thread unanswered: serialize_job_logs placement (mihow on serializers.py:32, commit a359686). Trivial to resolve — reply (it's pure shape translation, fits in serializers.py) or move to models.py next to JobLog. Don't leave dangling.

6. List-view logs empty for new jobs — verified safe. JobListSerializer.get_logs returns legacy JSON shape on action=list to skip N+1. Post-migration jobs have empty legacy column → {stdout: [], stderr: []} on list. Confirmed UI list page (ui/src/pages/jobs/jobs.tsx:135) only renders logs inside the detail dialog, which uses useJobDetails(id) (detail action, hits JobLog). No regression.

7. Mop-up. Drop the now-unused JobLogHandler.max_log_length = 1000. Consider widening the index to [job, -created_at, -pk] to match the model ordering and avoid an extra sort on tiebreak.

Land plan

Before merge:

  • (a) decide retention story (prune task in PR, or linked follow-up issue)
  • (b) add @extend_schema_field on get_logs
  • (c) open + link cleanup ticket for jobs_job.logs column drop
  • (d) resolve open serialize_job_logs thread
  • (e) run soak on staging
  • (f) trivial: kill max_log_length constant, optionally rebase onto current main (PR #1261 merged)

(a) and (c) prevent future mess. (b) prevents silent contract regression. Rest is hygiene.


@coderabbitai please review this PR with focus on the takeaways above — especially (1) unbounded growth / retention and (2) the OpenAPI schema regression from SchemaField → SerializerMethodField.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026


🧩 Analysis chain

I'll pull the relevant source files to verify each takeaway before writing the full review.


🏁 Script executed:

#!/bin/bash
# Find and read key files
echo "=== ami/jobs/models.py (relevant sections) ==="
rg -n "JobLog|max_log_length|JobLogHandler|emit|logs" ami/jobs/models.py | head -80

echo ""
echo "=== ami/jobs/serializers.py ==="
cat ami/jobs/serializers.py 2>/dev/null || fd -i "serializers.py" ami/jobs/ --exec cat {}

echo ""
echo "=== ami/jobs/migrations/0021_joblog.py ==="
fd "0021_joblog" ami/jobs/migrations/ --exec cat {}

echo ""
echo "=== Grep for extend_schema_field in jobs ==="
rg -n "extend_schema_field|SchemaField|JobLogsSerializer|serialize_job_logs" ami/jobs/

echo ""
echo "=== Grep for retention/prune/cleanup in jobs ==="
rg -n "prune|retention|rotate|cleanup|max_log_length|periodic" ami/jobs/ --type py

echo ""
echo "=== ami/jobs/models.py full JobLog and JobLogHandler sections ==="
rg -n -A 20 "class JobLog" ami/jobs/models.py
rg -n -A 30 "class JobLogHandler" ami/jobs/models.py

echo ""
echo "=== Check index definition on JobLog ==="
rg -n "class Meta|ordering|indexes|index_together" ami/jobs/models.py -A 5

Length of output: 27503


@mihow — code read complete. Confirming all takeaways against the live source, with line-level evidence and ready-to-apply fixes for the two highest-priority items.


Confirmed findings

1 🔴 Unbounded jobs_joblog growth — block on merge

models.py Line 355: max_log_length = 1000 is a dead constant. emit() (Lines 361–380) inserts unconditionally — no per-job cap, no periodic DELETE anywhere in the codebase. The grep for prune|retention|rotate|max_log_length turns up nothing relevant to JobLog.

Minimum acceptable resolution before merge: open a follow-up issue and link it from the PR body (acceptable if you're comfortable with unbounded growth during soak). Stronger resolution: add a soft cap inside emit — e.g., skip the INSERT when JobLog.objects.filter(job_id=self.job.pk).count() >= MAX_ROWS — though a periodic Celery beat prune is cleaner for long-lived jobs. The orphaned constant is the right cap value; repurpose it.
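A sketch of that emit-side soft cap, under the handler/model shape this PR adds — class and constant names here are illustrative, and the COUNT-per-emit cost is exactly why the beat-prune route is the cleaner long-term fix:

import logging

from django.conf import settings

MAX_ROWS_PER_JOB = 1000  # repurposes the old max_log_length ceiling


class JobLogHandler(logging.Handler):
    def __init__(self, job):
        super().__init__()
        self.job = job

    def emit(self, record):
        if not getattr(settings, "JOB_LOG_PERSIST_ENABLED", True):
            return
        # Soft cap: stop persisting once the job hits its row budget.
        # This costs one COUNT per log line — fine as a stopgap, but a
        # periodic beat prune avoids the per-write overhead entirely.
        if JobLog.objects.filter(job_id=self.job.pk).count() >= MAX_ROWS_PER_JOB:
            return
        JobLog.objects.create(job=self.job, level=record.levelname, message=self.format(record))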


2 🔴 OpenAPI schema regression on logs — block on merge

serializers.py Line 89: logs = serializers.SerializerMethodField() with no @extend_schema_field on get_logs (Line 185). The extend_schema_field import is absent from the file entirely. The generated schema for logs becomes {}.

One-liner fix — add the import and annotate get_logs in both JobListSerializer and JobSerializer (the latter inherits get_logs from the former, so one decoration is enough):

# top of serializers.py — add alongside existing imports
from drf_spectacular.utils import extend_schema_field
# replace the bare get_logs definition
@extend_schema_field(
    {
        "type": "object",
        "properties": {
            "stdout": {"type": "array", "items": {"type": "string"}},
            "stderr": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["stdout", "stderr"],
    }
)
def get_logs(self, obj: Job) -> dict[str, list[str]]:
    view = self.context.get("view")
    if getattr(view, "action", None) == "list":
        return _legacy_logs_shape(obj)
    return serialize_job_logs(obj)

Alternatively, since JobLogs (models.py Line 320) is already a Pydantic model with the exact shape, you can use SchemaField-style annotation if drf-spectacular is wired to pick up pydantic schemas in your project, but the explicit dict above is the safest cross-version approach.


3 🟡 jobs_job.logs column cleanup — no follow-up issue

PR body says "separate cleanup PR after soak" but no issue is linked. Without a ticket, the fallback path and the JSONB column drift indefinitely. Please open the issue before or at merge time and link from the PR body.


4 🟡 Soak still pending

Can't validate from code, but your own checklist flags it. The load-test numbers in the PR body are compelling; coverage on a staging-equivalent deployment with real DB size is the missing piece.


5 🟢 serialize_job_logs placement — safe where it is

It's a module-level function in serializers.py (Line 32). It reads JobLog rows and produces the wire shape — pure translation, fits naturally alongside the serializers that call it. No action needed unless you want it co-located with JobLog in models.py for discoverability; either is fine.


6 🟢 List-view empty logs for new jobs — verified safe

get_logs branches on view.action == "list" (Line 188) and returns the legacy JSON shape. The UI list page only renders logs in the detail dialog (useJobDetails → detail action → JobLog query). No regression.


7 🟢 Mop-up

  • max_log_length = 1000 (Line 355): dead constant now — either repurpose it as the emit-side soft cap (see item 1) or delete it.
  • Index width: JobLog.Meta index is ["job", "-created_at"] (migration confirmed). The model ordering and the serialize_job_logs query both order by ["-created_at", "-pk"]. PostgreSQL will sort -pk ties without index support. Tiebreaks are rare in practice (sub-millisecond same created_at), but widening the index in the migration to ["job", "-created_at", "-pk"] eliminates the extra sort entirely and is a zero-cost change at this row volume.
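If that widening ships alongside the #1277 retention work, a follow-up migration could look like this sketch — the migration dependency and both index names are placeholders, not the repo's actual values:

from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("jobs", "0022_alter_job_logs_help_text")]

    operations = [
        # Replace the two-column index with one that also covers the -pk
        # tiebreak used by the model ordering, so reads come straight off
        # the index with no extra sort node.
        migrations.RemoveIndex(model_name="joblog", name="jobs_joblog_job_created_idx"),  # old name is a placeholder
        migrations.AddIndex(
            model_name="joblog",
            index=models.Index(fields=["job", "-created_at", "-pk"], name="jobs_joblog_job_created_pk_idx"),
        ),
    ]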

Land checklist delta (items that need action before merge)

| # | Item | Status |
|---|---|---|
| (a) | Retention decision: prune task in PR or linked follow-up issue | ❌ open |
| (b) | @extend_schema_field on get_logs | ❌ one-liner fix above |
| (c) | Open + link cleanup ticket for jobs_job.logs column drop | ❌ open |
| (d) | Resolve serialize_job_logs thread | 🟢 safe either way |
| (e) | Soak on staging | ❌ pending |
| (f) | Drop max_log_length or repurpose as soft cap | 🟡 trivial |

(a), (b), and (c) are the three items I'd call hard gates. Everything else is hygiene.


🧠 Learnings used
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1261
File: ami/jobs/tasks.py:502-514
Timestamp: 2026-04-21T08:10:37.134Z
Learning: In the RolnickLab/antenna project, `job.progress` is stored as a single JSONB column (`jobs_job.progress`). Using `job.save(update_fields=["progress", ...])` does NOT protect against concurrent clobbering of sub-fields within that column (e.g., `progress.errors` appended by `_reconcile_lost_images`), because the entire dict is still serialized and written back. The correct mitigation for sub-field concurrency is a server-side `jsonb_set` atomic update or optimistic compare-and-swap (tracked under issue `#1256` option D). Do not flag `update_fields=["progress"]` as sufficient protection against concurrent JSONB sub-field writes.

@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Re takeaway #1 (unbounded JobLog growth) — proposing we route retention through the data integrity check framework being landed in #1188 instead of a one-off prune task.

Fits the framework shape directly. New module ami/main/checks/joblogs.py (or ami/jobs/checks/... if we move per-domain later) with:

  • get_joblog_retention_violations(project_id=None, max_rows_per_job=N, max_age_days=M) — returns rows over budget (newest-N kept per job, anything older than M days)
  • reconcile_joblog_retention(..., dry_run=True) — deletes excess rows, returns IntegrityCheckResult(checked, fixed, unfixable)

Then plugs into check_data_integrity management command + Celery beat task that #1188 already wires up. No new periodic task, no new CLI surface.

Means we can land this PR as-is once (a) #1188 merges and (b) we ship the joblogs.py check + a beat schedule entry as an immediate follow-up. Cleaner than baking retention into JobLogHandler.emit (which would re-introduce write-path overhead) and avoids the orphan-max_log_length bandaid.

Defaults to discuss: per-job cap (matches the old 1000-line ceiling? or higher now that we have a real table?), age cap (30d? 90d?), and whether the --no-dry-run mode should run on a beat schedule or stay manual until soak.

If acceptable I'll open the JobLog-retention follow-up issue now and link it from this PR body so the trail is durable. Then takeaway #3 (column-drop ticket) is the only outstanding open-issue ask before merge.
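For concreteness, a sketch of the two helpers proposed above — the IntegrityCheckResult shape and the #1188 framework hooks are assumptions until that PR lands, and the default budgets are the open questions listed:

from dataclasses import dataclass
from datetime import timedelta

from django.utils import timezone

from ami.jobs.models import JobLog


@dataclass
class IntegrityCheckResult:  # shape assumed from the proposal above
    checked: int
    fixed: int
    unfixable: int


def get_joblog_retention_violations(max_rows_per_job=1000, max_age_days=90):
    # Collect PKs of rows over budget: anything past the age cap, plus
    # anything older than the newest-N kept per job. Defaults are
    # placeholders pending the discussion above.
    cutoff = timezone.now() - timedelta(days=max_age_days)
    violation_ids = set(JobLog.objects.filter(created_at__lt=cutoff).values_list("pk", flat=True))
    for job_id in JobLog.objects.values_list("job_id", flat=True).distinct():
        per_job = JobLog.objects.filter(job_id=job_id).order_by("-created_at", "-pk")
        keep = list(per_job.values_list("pk", flat=True)[:max_rows_per_job])
        violation_ids |= set(per_job.exclude(pk__in=keep).values_list("pk", flat=True))
    return violation_ids


def reconcile_joblog_retention(dry_run=True, **budgets):
    ids = get_joblog_retention_violations(**budgets)
    fixed = 0 if dry_run else JobLog.objects.filter(pk__in=ids).delete()[0]
    return IntegrityCheckResult(checked=len(ids), fixed=fixed, unfixable=0)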

Michael Bunsen and others added 3 commits April 29, 2026 18:14
Replaces the jobs_job.logs JSON-field UPDATE path in JobLogHandler.emit
with an INSERT on a new JobLog child table. Under concurrent async_api
load every log line triggered UPDATE jobs_job SET logs = ... on the
shared job row; inside ATOMIC_REQUESTS a single batched /result POST
stacked N such UPDATEs in one tx, blocking ML workers on the same row
(issue #1256).

- JobLog model + migration 0021_joblog (FK job, level, message, created_at)
- JobLogHandler.emit -> JobLog.objects.create (gated by JOB_LOG_PERSIST_ENABLED
  flag introduced in PR #1261)
- JobSerializer.logs -> SerializerMethodField that reads JobLog rows on
  detail endpoint and falls back to the legacy jobs_job.logs JSON for
  jobs created before this table existed. List endpoint keeps the cheap
  legacy shape to avoid N+1.
- No UI changes, no new endpoint. On-wire {stdout, stderr} shape preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
- Route existing tests that assert on job.logs.stdout to read from
  JobLog rows via a joined_job_log_messages helper
- Same for ami/ml/tests.py error-logging assertion
- Add TestJobLogPersistence class with 6 new cases:
  - emit inserts one JobLog per call; legacy JSON column stays empty
  - flag=False short-circuits emit (no rows, no JSON write)
  - serialize_job_logs reads JobLog rows newest-first and splits stderr
  - serialize_job_logs falls back to legacy JSON when no rows exist
  - get_logs on list action returns legacy shape (no JobLog query)
  - get_logs on retrieve action reads JobLog rows

47/47 pass in ami.jobs.tests.test_jobs.

Co-Authored-By: Claude <noreply@anthropic.com>
refactor(jobs): tidy JobLog wrap-up

- Drop orphaned JobLogHandler.max_log_length constant — JobLog has no
  per-job storage cap (retention will be handled by the data integrity
  check framework, #1188 follow-up).
- Move serialize_job_logs / _legacy_logs_shape / JOB_LOG_* constants
  from serializers.py to models.py so they live next to JobLog itself.
- Annotate JobListSerializer.get_logs with @extend_schema_field so the
  OpenAPI schema retains the {stdout: string[], stderr: string[]}
  contract that SchemaField(JobLogs) used to advertise.
- Mark Job.logs as deprecated in help_text — read-only fallback for
  pre-#1259 jobs only. Column stays until legacy rows are backfilled
  into JobLog (separate follow-up).

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Wrap-up applied. Branch rebased onto main (post-#1261 merge), one new commit on top:

458613f refactor(jobs): tidy JobLog wrap-up

Folds in the takeaway-review actions — the items are listed in the wrap-up commit message above.

Tests: 56/56 in ami.jobs.tests.test_jobs + the ml.tests case pass on the rebased branch. PR body updated to link the two follow-up issues.

Outstanding:

  • (4) staging soak — last unchecked box on the PR.

Ready for re-review @coderabbitai.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026


🧩 Analysis chain

Let me read the current state of the files to verify all the wrap-up changes before writing a reply.


🏁 Script executed:

#!/bin/bash
echo "=== ami/jobs/models.py: JobLog, serialize_job_logs, JobLogHandler, max_log_length ==="
rg -n "class JobLog|serialize_job_logs|max_log_length|JOB_LOGS_DEFAULT_LIMIT|JobLogHandler|def emit|help_text.*logs\|deprecated" ami/jobs/models.py -A 5

echo ""
echo "=== ami/jobs/models.py: JobLog class full ==="
rg -n -A 30 "class JobLog\b" ami/jobs/models.py

echo ""
echo "=== ami/jobs/models.py: serialize_job_logs full ==="
rg -n -A 20 "def serialize_job_logs" ami/jobs/models.py

echo ""
echo "=== ami/jobs/models.py: Job.logs field ==="
rg -n -B2 -A5 "logs.*Field|logs.*help_text|deprecated" ami/jobs/models.py | head -40

echo ""
echo "=== ami/jobs/serializers.py: full file ==="
cat ami/jobs/serializers.py

echo ""
echo "=== Check extend_schema_field and get_logs in serializers.py ==="
rg -n "extend_schema_field|get_logs|SchemaField|JOB_LOGS_DEFAULT_LIMIT|serialize_job_logs|SerializerMethodField" ami/jobs/serializers.py

echo ""
echo "=== Check JOB_LOGS_DEFAULT_LIMIT defined anywhere ==="
rg -rn "JOB_LOGS_DEFAULT_LIMIT" ami/

echo ""
echo "=== Check `#1277` and `#1278` references in PR body / code comments ==="
rg -rn "1277\|1278" ami/jobs/ | head -20

Length of output: 14966


@mihow — re-review complete. All five wrap-up items confirmed against live source.


Verification results

| Takeaway | Claim | Verified |
|---|---|---|
| (2) OpenAPI | @extend_schema_field with {stdout, stderr} shape on get_logs | ✅ Lines 151–169 of serializers.py |
| (1) Retention | JOB_LOGS_DEFAULT_LIMIT = 1000 read cap; #1277 linked | models.py Line 352; #1277 in PR body |
| (3) Column drop | Job.logs deprecated; #1278 linked | ✅ PR body; _legacy_logs_shape fallback path still in place |
| (5) Placement | serialize_job_logs moved to models.py next to JobLog | models.py Lines 363–388 |
| (7) Mop-up | max_log_length = 1000 constant removed | ✅ Zero occurrences in codebase |

Two remaining notes

Minor / non-blocking:

  • Index tiebreak — JobLog.Meta.indexes is still ["job", "-created_at"]. The serialize_job_logs query orders by ["-created_at", "-pk"], so PostgreSQL will sort -pk ties without index support. At current row volumes this is a no-op cost, but widening to ["job", "-created_at", "-pk"] in the migration eliminates the off-index sort entirely. Worth doing if you're already cutting a follow-up migration for #1277 retention; otherwise fine to defer.

Hard gate (yours, not mine):

  • (4) Staging soak — still the one unchecked box on the PR. The load-test numbers in the body are solid; a staging-equivalent run with a real DB copy is the last signal before merge. Nothing in the code blocks it — this is purely on the ops side.

From a code-review standpoint this is clean. Ship it once the soak clears. 🐇


🧠 Learnings used
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1261
File: ami/jobs/tasks.py:502-514
Timestamp: 2026-04-21T08:10:37.134Z
Learning: In the RolnickLab/antenna project, `job.progress` is stored as a single JSONB column (`jobs_job.progress`). Using `job.save(update_fields=["progress", ...])` does NOT protect against concurrent clobbering of sub-fields within that column (e.g., `progress.errors` appended by `_reconcile_lost_images`), because the entire dict is still serialized and written back. The correct mitigation for sub-field concurrency is a server-side `jsonb_set` atomic update or optimistic compare-and-swap (tracked under issue `#1256` option D). Do not flag `update_fields=["progress"]` as sufficient protection against concurrent JSONB sub-field writes.

Adds an opt-in ``logs_limit`` query parameter on ``GET /api/v2/jobs/{id}/``
so callers can cap the number of JobLog rows returned in the ``logs``
field of the response. Defaults to 1000 (unchanged), capped at 5000 to
keep payload size bounded; pagination over older entries will ship with
a dedicated ``/jobs/logs/`` endpoint in a follow-up.

- Validation follows the existing pattern (``django.forms.IntegerField``
  used elsewhere in JobViewSet for ``cutoff_hours``). Bad input raises
  ValidationError → DRF 400.
- ``OpenApiParameter`` defined in ``ami/jobs/schemas.py`` and wired onto
  the ``retrieve`` action via ``extend_schema_view``.
- Tests cover: cap honoured, default applied when absent, non-int /
  zero / over-cap values rejected.
- Migration ``0022_alter_job_logs_help_text`` reflects the deprecation
  marker on ``Job.logs`` from the previous wrap-up commit. (An
  unrelated cosmetic ``job_type_key`` choices drift surfaced during
  ``makemigrations`` and was excluded from this migration so the diff
  stays focused — it predates this PR and is left for a follow-up.)

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Added ?logs_limit= to GET /api/v2/jobs/{id}/ (commit 4ea7102):

  • Default 1000 (unchanged), capped at JOB_LOGS_MAX_LIMIT = 5000
  • Validated via django.forms.IntegerField(required=False, min_value=1, max_value=JOB_LOGS_MAX_LIMIT) — same pattern as ?cutoff_hours= already in JobViewSet.get_queryset. Bad input → DRF 400.
  • OpenApiParameter declared in ami/jobs/schemas.py, wired onto the retrieve action via extend_schema_view
  • Tests cover: cap honoured, default applies when absent, non-int / zero / over-cap rejected

Migration 0022_alter_job_logs_help_text reflects the deprecation help_text on Job.logs from the prior wrap-up commit. (An unrelated job_type_key choices drift — cosmetic verbose-name fix from PR #1245 — surfaced during makemigrations; excluded from this migration so the diff stays scoped to JobLog work.)

Real pagination (/api/v2/jobs/logs/ with cursor + LimitOffsetPaginationWithPermissions) tracked separately in #1280.

Tests: 61/61 in ami.jobs.tests.test_jobs + the targeted ami.ml.tests case pass on the new commit.

Both query params used django.forms.IntegerField.clean() which raises
django.core.exceptions.ValidationError. DRF's default exception handler
does NOT convert that into HTTP 400 — it bubbles up as 500. Staging
soak surfaced this for ?logs_limit=abc; ?cutoff_hours=abc has the same
bug today on main.

Switch both to ami.base.serializers.SingleParamSerializer[int].clean
with rest_framework.serializers.IntegerField. That raises DRF
ValidationError, which the default handler converts to 400 cleanly.

For logs_limit: validation moves out of JobListSerializer.get_logs and
into JobViewSet.get_serializer_context, run once at the view boundary
and passed to the serializer via context. Serializer just reads the
cleaned int.

Tests: new HTTP-level TestJobLogsLimitHTTPValidation class asserts
status_code == 400 for "abc" / 0 / -5 / above-max via APIClient
(previous unit tests only asserted the function raised, never
exercised the HTTP path). Added equivalent assertion for cutoff_hours
on TestJobView. Helper test_logs_limit_caps_response_size /
test_logs_limit_default_when_unset stay as serializer-level tests
covering the cleaned-context shape.

Pattern matches existing SingleParamSerializer callers in
ami/base/views.py:47, ami/main/api/views.py:796, :885.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): Pushed 8657e16f — fix the 500-on-bad-input issue surfaced during staging soak.

Problem. ?logs_limit=abc (and ?cutoff_hours=abc on the list endpoint, pre-existing on main) used django.forms.IntegerField.clean(), which raises django.core.exceptions.ValidationError. DRF's default exception handler does NOT convert that — it bubbles up as HTTP 500. The earlier wrap-up commit's claim that DRF surfaces this as 400 was incorrect.

Fix. Both query params now use ami.base.serializers.SingleParamSerializer[int].clean(...) with rest_framework.serializers.IntegerField. DRF ValidationError → 400 cleanly, matching the existing pattern at ami/base/views.py:47, ami/main/api/views.py:796, :885.

For ?logs_limit= specifically, validation moved out of JobListSerializer.get_logs and into JobViewSet.get_serializer_context — runs once at the view boundary, cleaned int passed to the serializer via context.
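A sketch of that view-boundary pattern — the SingleParamSerializer.clean signature is assumed from the call sites cited above (ami/base/views.py:47, ami/main/api/views.py:796, :885), so treat the argument order as illustrative:

from rest_framework import serializers, viewsets

from ami.base.serializers import SingleParamSerializer

JOB_LOGS_DEFAULT_LIMIT = 1000
JOB_LOGS_MAX_LIMIT = 5000


class JobViewSet(viewsets.ModelViewSet):
    def get_serializer_context(self):
        context = super().get_serializer_context()
        # Validate once at the view boundary. The DRF IntegerField raises
        # rest_framework's ValidationError on bad input, which the default
        # exception handler converts to HTTP 400 — unlike the old
        # django.forms path, whose ValidationError bubbled up as a 500.
        context["logs_limit"] = SingleParamSerializer[int].clean(
            "logs_limit",  # param name; argument order assumed from the cited call sites
            self.request.query_params,
            field=serializers.IntegerField(
                required=False,
                min_value=1,
                max_value=JOB_LOGS_MAX_LIMIT,
                default=JOB_LOGS_DEFAULT_LIMIT,
            ),
        )
        return context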

New tests (HTTP-level via APIClient):

  • TestJobLogsLimitHTTPValidation — 5 tests covering "abc", 0, -5, above-max, valid-200.
  • TestJobView.test_list_jobs_with_invalid_cutoff_hours_returns_400 — locks in the cutoff_hours fix.

The earlier function-level assertRaises(ValidationError) tests were structurally wrong (they passed but never exercised the HTTP path). Replaced.

Re-soak on staging (commit 8657e16f):

=== logs_limit ===
  logs_limit='abc'   -> 400
  logs_limit='0'     -> 400
  logs_limit='99999' -> 400
  logs_limit='-5'    -> 400
  logs_limit='5'     -> 200
=== cutoff_hours ===
  cutoff_hours='abc' -> 400
  cutoff_hours='-5'  -> 400
  cutoff_hours='24'  -> 200

All 168 backend tests pass locally (ami.jobs + ami.ml.tests). Staging reset back to main.

@mihow
Collaborator

mihow commented Apr 30, 2026

Claude says (on behalf of @mihow): arctia full-stack soak complete — ready to merge.

Live ML job through real Celery + ADC worker

Job 163 ran end-to-end on arctia (project 9, SUCCESS) and wrote 209 JobLog rows via the live JobLogHandler.emit() path. This closes the gap staging couldn't cover (staging ML services unreachable), confirming the worker-side INSERT path works under real concurrency.

Validation smoke (logs_limit + cutoff_hours)

Both params now go through SingleParamSerializer[int].clean(...) with DRF's IntegerField, validated at the view boundary in JobViewSet.get_serializer_context / get_queryset. Tested against arctia (job 163, 209 logs):

| Param | Value | Status |
|---|---|---|
| logs_limit | abc | 400 ✅ |
| logs_limit | 0 | 400 ✅ |
| logs_limit | -5 | 400 ✅ |
| logs_limit | 99999 (above max) | 400 ✅ |
| logs_limit | 5 | 200, returns 5 entries ✅ |
| cutoff_hours | abc | 400 ✅ |
| cutoff_hours | -5 | 400 ✅ |
| cutoff_hours | 24 | 200 ✅ |

Detail endpoint latency (job 163, 209 real logs)

| logs_limit | median | min | max |
|---|---|---|---|
| 10 | 105.1ms | 104.8ms | 112.6ms |
| 50 | 106.5ms | 105.8ms | 171.2ms |
| 200 | 116.1ms | 111.0ms | 173.0ms |

Latency grows gently with logs_limit — no runaway. No need for further tuning before merge.

arctia reset to main (a347d559).

Collaborator

@mihow mihow left a comment


Tested well in a staging environment under pressure. Massive improvement!

@mihow mihow merged commit c8e5c72 into main Apr 30, 2026
7 checks passed
@mihow mihow deleted the copilot/implement-option-a-job-logs branch April 30, 2026 22:41