✨ feat: create MeteredUsageRecord model and functionality, including … #283
jc99k wants to merge 6 commits into
…Celery Beat and task to fetch from Redis
Hey @jc99k, nice work on this, reviewed end-to-end. A few things worth addressing before merge:

1.
    delta_proxy_bytes=d_proxy if d_proxy else None  # hourly.py
    delta_proxy_bytes=_nonzero_int_or_none(...)  # ledger.py JOB_CLOSE & ADJUSTMENT
Effect: a job with proxy configured but no proxy bytes in a given slice writes NULL, so downstream can't distinguish "no proxy used" from "proxy used, zero delta". Totals via …

2. The three test files added in this PR can't actually be executed against the suite. Commit the settings file before merging.

3.

4.
    raw = read_scrapy_counters_from_redis(job) or {}
    storage_final = int(raw.get("storage_obj_bytes_total", 0))
    proxy_cumulative = int(raw.get("meter_proxy_redis_bytes", 0))
Not necessary to address in this PR, just a nice to have.

5.
@joaquingx added fixes for each issue you raised:

Aligned writes with the field help text: NULL only when proxy attribution does not apply (no proxy_name on the job and a zero delta); 0 when a proxy is configured and the slice/close/reconcile delta is zero. Implementation: a shared helper delta_proxy_bytes_for_flow_row(proxy_name_label, delta) in ledger.py, used from both ledger.py (JOB_CLOSE / ADJUSTMENT) and hourly.py (DELTA_SLICE). We still persist non-zero proxy deltas when there is no configured proxy name, so we do not drop Redis-reported bytes on mis-labeled jobs. Redis can show proxy traffic while the job record has no proxy name (legacy crawl, misconfiguration). Keeping non-zero values preserves bytes that Redis already counted, even when the job wasn't labeled with a provider name.

Added config/settings/test.py so DJANGO_SETTINGS_MODULE = config.settings.test (already in pytest.ini) resolves in git: SQLite in-memory DB, in-process Celery, estela_queue_adapter stubbed before importing base, and literal spiderdata settings so dummy env values do not break imports. Added config/settings/__init__.py so config.settings is a proper package. Updated pytest.ini with pythonpath = .. so database_adapters at the repo root is on the path when pytest runs from estela-api/, matching how local/CI runs are documented. No change to deployed Django settings (local / Helm).

metered_proxy_name_from_job no longer runs refresh_from_db on every call (removes an extra SELECT proxy_usage_data per hourly slice and on other hot paths). It only uses the job instance in memory; the docstring says to refresh first if DB consistency matters. append_metered_usage_for_job_close still does a single refresh_from_db(fields=["proxy_usage_data"]) before resolving the proxy label, so close/reconcile rows stay defensive without N queries per tick.

append_metered_usage_for_job_close now calls read_scrapy_counters_from_redis once at the top, uses raw = … or {}, then derives storage_final and proxy_cumulative from that dict for both the non-hourly JOB_CLOSE path and the hourly reconcile path (so no duplicate hgetall). Removed the two per-field helpers. Kept the in-function from api.utils import read_scrapy_counters_from_redis so existing tests that patch api.utils.read_scrapy_counters_from_redis still apply. Ledger tests still pass.

Documented on MeteredUsageRecord.delta_storage_bytes that storage flow is a signed Redis cumulative diff: object byte totals can shrink, so negatives are expected. The note warns downstream billing/analytics not to filter or aggregate as if values were always positive (e.g. strict > 0 filters drop legitimate corrections). No runtime or schema change.
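For readers following the thread, the NULL-versus-zero rule for delta_proxy_bytes described above can be sketched roughly as follows. This is an illustration built only from the behaviour stated in the comment, not the actual code in ledger.py; the type hints and the handling of an empty-string label are assumptions.

```python
from typing import Optional


def delta_proxy_bytes_for_flow_row(
    proxy_name_label: Optional[str], delta: int
) -> Optional[int]:
    """Return the value to store in delta_proxy_bytes for one ledger row.

    Sketch of the rule described in the thread (not the real helper):
    - non-zero delta: always persisted, even without a proxy label
    - zero delta with a proxy label: 0 ("proxy used, nothing new")
    - zero delta without a proxy label: None (attribution does not apply)
    """
    delta = int(delta)
    if delta != 0:
        # Keep bytes Redis already counted, even on mis-labeled jobs.
        return delta
    # Assumption: an empty or whitespace-only label means "no proxy configured".
    has_proxy = bool(proxy_name_label and proxy_name_label.strip())
    return 0 if has_proxy else None
```

With this shape, downstream queries can tell "no proxy attribution" (NULL) apart from "proxy configured, zero delta" (0), which was the distinction the review asked for.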
Description
This change introduces a new database table, MeteredUsageRecord, that records how much each spider job used (network traffic, requests, items, runtime, proxy traffic, and stored object bytes) in a way that is safe to sum for billing and hard to double-count on retries.
How it works
While a job is running and hourly metering is turned on, a periodic task reads Scrapy stats from Redis and writes time-slice rows (DELTA_SLICE) when counters move. Each slice stores the increase over that period for bandwidth, proxy bytes (from Scrapy’s proxy response-byte stat), processing time, storage size (items + requests + logs objects), and a copy of the job’s proxy name from proxy_usage_data so it remains after the job row is removed.
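The slice logic above can be sketched in plain Python. This is a simplified model, not the PR's implementation: the counter names and the compute_slice function are hypothetical, and the real task reads its cumulative values from Redis rather than from in-memory dicts.

```python
from typing import Dict, Optional

# Hypothetical counter names, used only for this illustration.
COUNTERS = ("network_bytes", "proxy_bytes", "processing_ms", "storage_bytes")


def compute_slice(
    previous: Dict[str, int], current: Dict[str, int]
) -> Optional[Dict[str, int]]:
    """Return per-counter deltas since the last sample, or None if nothing moved."""
    deltas = {
        name: int(current.get(name, 0)) - int(previous.get(name, 0))
        for name in COUNTERS
    }
    if not any(deltas.values()):
        # Counters did not move, so no DELTA_SLICE row would be written.
        return None
    return deltas


previous = {"network_bytes": 1_000, "proxy_bytes": 200, "processing_ms": 5_000, "storage_bytes": 4_096}
current = {"network_bytes": 1_750, "proxy_bytes": 200, "processing_ms": 9_000, "storage_bytes": 4_000}
slice_row = compute_slice(previous, current)
```

Note that the storage delta in this example is negative (stored object bytes shrank between samples), which matches the signed-diff behaviour documented on delta_storage_bytes.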
When the job finishes (completed, stopped, or error), the existing usage pipeline also writes a close row: either a single JOB_CLOSE row with full totals (hourly metering off), or an ADJUSTMENT row with any leftover amounts so totals match Redis after summing slices (hourly metering on). Close reconcile rows use adjustment_reason RECONCILE_SCRAPY_FINAL or RECONCILE_STORAGE, and a stable idempotency key so Celery retries do not create duplicates.
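A rough sketch of the close-time reconciliation, assuming hourly metering is on: the ADJUSTMENT row carries whatever is left after summing the slices, and a deterministic key lets a retried task recognise that the row already exists. The function names, the key format, and the use of SHA-256 are assumptions for illustration; the PR only states that the key is stable.

```python
import hashlib
from typing import Dict, List


def adjustment_for_close(
    final_totals: Dict[str, int], slices: List[Dict[str, int]]
) -> Dict[str, int]:
    """Leftover per counter so that sum(slices) + adjustment == final_totals."""
    return {
        name: int(total) - sum(int(s.get(name, 0)) for s in slices)
        for name, total in final_totals.items()
    }


def close_idempotency_key(job_id: str, reason: str) -> str:
    """Deterministic key: the same job and reason always produce the same key."""
    return hashlib.sha256(f"{job_id}:close:{reason}".encode("utf-8")).hexdigest()


slices = [
    {"network_bytes": 750, "storage_bytes": -96},
    {"network_bytes": 250, "storage_bytes": 500},
]
final_totals = {"network_bytes": 1_100, "storage_bytes": 404}

adjustment = adjustment_for_close(final_totals, slices)
key = close_idempotency_key("job-42", "RECONCILE_SCRAPY_FINAL")
```

If the key column carries a unique constraint (an assumption here), a Celery retry that computes the same key would hit the existing row instead of inserting a second one.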
Data deletion still gets a small DATA_DELETE marker row for auditing.
Also included
Django admin registration for the new model.
Tests for parsing Redis stats, the ledger, and hourly sampling.
Lightweight test settings and a small pytest tweak so the suite can run with SQLite and without a real Redis broker.
Operational note
Hourly sampling in code is independent of job length; how often slices appear in practice follows the Celery Beat schedule for the metering task (defaults to once per hour unless you change it).
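As a hedged illustration of where that cadence would be set, a Celery Beat entry in Django settings might look like the sketch below. The entry name and task path are hypothetical, and the CELERY_BEAT_SCHEDULE setting name assumes the Celery app loads its configuration from Django settings with the CELERY namespace.

```python
# Hypothetical settings fragment; the task path and entry name are placeholders,
# not taken from the PR.
METERING_INTERVAL_SECONDS = 60 * 60  # once per hour

CELERY_BEAT_SCHEDULE = {
    "sample-metered-usage": {
        "task": "core.tasks.sample_metered_usage",  # placeholder task path
        "schedule": METERING_INTERVAL_SECONDS,  # Beat accepts seconds as a number
    },
}
```

Lowering the interval would produce more, smaller DELTA_SLICE rows; the totals after summing should be the same either way.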
Issue
Checklist before requesting a review