Skip to content

✨ feat: create MeteredUsageRecord model and functionality, including …#283

Open
jc99k wants to merge 6 commits into
mainfrom
feat/create-metered-usage-records
Open

✨ feat: create MeteredUsageRecord model and functionality, including …#283
jc99k wants to merge 6 commits into
mainfrom
feat/create-metered-usage-records

Conversation

@jc99k
Copy link
Copy Markdown

@jc99k jc99k commented May 11, 2026

Description

This change introduces a new database table, MeteredUsageRecord, that records how much each spider job used (network traffic, requests, items, runtime, proxy traffic, and stored object bytes) in a way that is safe to sum for billing and hard to double-count on retries.

How it works

While a job is running and hourly metering is turned on, a periodic task reads Scrapy stats from Redis and writes time-slice rows (DELTA_SLICE) when counters move. Each slice stores the increase over that period for bandwidth, proxy bytes (from Scrapy’s proxy response-byte stat), processing time, storage size (items + requests + logs objects), and a copy of the job’s proxy name from proxy_usage_data so it remains after the job row is removed.
When the job finishes (completed, stopped, or error), the existing usage pipeline also writes a close row: either a single JOB_CLOSE row with full totals (hourly metering off), or an ADJUSTMENT row with any leftover amounts so totals match Redis after summing slices (hourly metering on). Close reconcile rows use adjustment_reason RECONCILE_SCRAPY_FINAL or RECONCILE_STORAGE, and a stable idempotency key so Celery retries do not create duplicates.
Data deletion still gets a small DATA_DELETE marker row for auditing.
Also included

Django admin registration for the new model.
Tests for parsing Redis stats, the ledger, and hourly sampling.
Lightweight test settings and a small pytest tweak so the suite can run with SQLite and without a real Redis broker.
Operational note

Hourly sampling in code is independent of job length; how often slices appear in practice follows the Celery Beat schedule for the metering task (defaults to once per hour unless you change it).

Issue

Checklist before requesting a review

  • I have performed a self-review of my code.
  • My code follows the style guidelines of this project.
  • I have made corresponding changes to the documentation.
  • New and existing tests pass locally with my changes.
  • If this change is a core feature, I have added thorough tests.
  • If this change affects or depends on the behavior of other estela repositories, I have created pull requests with the relevant changes in the affected repositories. Please, refer to our official documentation.
  • I understand that my pull request may be closed if it becomes obvious or I did not perform all of the steps above.

@jc99k jc99k requested review from erick-GeGe and joaquingx May 11, 2026 16:51
@joaquingx
Copy link
Copy Markdown
Contributor

joaquingx commented May 12, 2026

Hey @jc99k, nice work on this — reviewed end-to-end. A few things worth addressing before merge:

1. delta_proxy_bytes NULL semantics don't match the docstring
The model docstring states: "NULL means proxy attribution does not apply to this row (vs. 0 which would mean 'applies but no change')". But the code writes NULL whenever the delta/cumulative is 0, regardless of whether the job has proxy configured:

delta_proxy_bytes=d_proxy if d_proxy else None  # hourly.py
delta_proxy_bytes=_nonzero_int_or_none(...)     # ledger.py JOB_CLOSE & ADJUSTMENT

Effect: a job with proxy configured but no proxy bytes in a given slice writes NULL, so downstream can't distinguish "no proxy used" from "proxy used, zero delta". Totals via Sum still work, but classification queries won't. Either align the code with the docstring (use 0 when proxy applies and delta is 0; NULL only when it doesn't apply) or update the docstring to say "NULL = zero delta".

2. config/settings/test.py is missing
The PR description mentions "Lightweight test settings ... so the suite can run with SQLite", but no such file is in the diff. pytest.ini points to config.settings.test, which doesn't exist in this branch. Tried locally:

ImportError: cannot import name 'test' from 'config.settings'

The three test files added in this PR can't actually be executed against the suite. Commit the settings file before merging.

3. metered_proxy_name_from_job does refresh_from_db on every call
This runs on every DELTA_SLICE write and every job close. proxy_usage_data is set once per job and not mutated thereafter, and the caller always has a freshly fetched job instance (from the hourly queryset or SpiderJob.objects.get(jid=...)). The refresh adds N extra SELECT proxy_usage_data FROM core_spiderjob queries per tick for no functional benefit. Suggest dropping it from the hourly path; if defensiveness is wanted at close-time, keep it only in append_metered_usage_for_job_close.

4. read_scrapy_counters_from_redis is called 3× per job close
append_metered_usage_for_job_close indirectly hits Redis three times (once for storage, twice for proxy via two separate helper calls). Each opens a connection and does hgetall scrapy_stats_{key}. Refactor to one read and pass the dict around:

raw = read_scrapy_counters_from_redis(job) or {}
storage_final = int(raw.get("storage_obj_bytes_total", 0))
proxy_cumulative = int(raw.get("meter_proxy_redis_bytes", 0))

Not necessary to address in this PR, just a nice to have.

5. delta_storage_bytes is signed (intentional) — flag for downstream consumers
Per the comment in hourly.py: "Signed diff (unlike network/items/requests): stored object bytes can shrink." Worth calling this out explicitly in the model help_text so that future billing queries don't accidentally filter WHERE delta_storage_bytes > 0 and drop legitimate negative corrections.

Copy link
Copy Markdown
Contributor

@joaquingx joaquingx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check comment above

jc99k added 5 commits May 13, 2026 15:25
…es not apply, and 0 when proxy is configured and delta is zero

Aligned writes with the field help text: NULL only when proxy
attribution does not apply (no proxy_name on the job and a zero delta);
0 when a proxy is configured and the slice/close/reconcile delta is
zero.

Implementation: a shared helper
delta_proxy_bytes_for_flow_row(proxy_name_label, delta) in ledger.py
used from both ledger.py (JOB_CLOSE / ADJUSTMENT) and hourly.py
(DELTA_SLICE). We still persist non-zero proxy deltas when there is no
configured proxy name so we do not drop Redis-reported bytes on
mis-labeled jobs (Redis can show proxy traffic while the job record has
no proxy name (legacy crawl, misconfiguration. Keeping non-zero values
preserves bytes that Redis already counted, even when the job wasn’t
labeled with a provider name).
append_metered_usage_for_job_close now calls read_scrapy_counters_from_redis once at the top, uses raw = … or {}, then derives storage_final and proxy_cumulative from that dict for both the non-hourly JOB_CLOSE path and the hourly reconcile path (so no duplicate hgetall). Removed the two per-field helpers. Kept the in-function from api.utils import read_scrapy_counters_from_redis so existing tests that patch api.utils.read_scrapy_counters_from_redis still apply. Ledger tests still pass.
document on MeteredUsageRecord.delta_storage_bytes that storage flow is a signed Redis cumulative diff: object byte totals can shrink, so negatives are expected; warns downstream billing/analytics not to filter or aggregate as if values were always positive (e.g. strict > 0 filters drop legitimate corrections). No runtime or schema change.
metered_proxy_name_from_job no longer runs refresh_from_db on every call (removes an extra SELECT proxy_usage_data per hourly slice and other hot paths). It only uses the job instance in memory; the docstring says to refresh first if DB consistency matters. append_metered_usage_for_job_close still does a single refresh_from_db(fields=["proxy_usage_data"]) before resolving the proxy label so close/reconcile rows stay defensive without N queries per tick.
Add config/settings/test.py so DJANGO_SETTINGS_MODULE = config.settings.test (already in pytest.ini) resolves in git: SQLite in-memory DB, in-process Celery, estela_queue_adapter stubbed before importing base, and literal spiderdata settings so dummy env values do not break imports. Adds config/settings/__init__.py so config.settings is a proper package. Updates pytest.ini with pythonpath = .. so database_adapters at the repo root is on the path when pytest runs from estela-api/, matching how local/CI runs are documented. No change to deployed Django settings (local / Helm).
@jc99k jc99k requested a review from joaquingx May 13, 2026 20:56
@jc99k
Copy link
Copy Markdown
Author

jc99k commented May 13, 2026

@joaquingx added fixes for each issue you raised:

  1. delta_proxy_bytes NULL semantics don't match the docstring
    Commit: 1d93b9c

Aligned writes with the field help text: NULL only when proxy attribution does not apply (no proxy_name on the job and a zero delta); 0 when a proxy is configured and the slice/close/reconcile delta is zero.

Implementation: a shared helper delta_proxy_bytes_for_flow_row(proxy_name_label, delta) in ledger.py used from both ledger.py (JOB_CLOSE / ADJUSTMENT) and hourly.py (DELTA_SLICE). We still persist non-zero proxy deltas when there is no configured proxy name so we do not drop Redis-reported bytes on mis-labeled jobs (Redis can show proxy traffic while the job record has no proxy name (legacy crawl, misconfiguration. Keeping non-zero values preserves bytes that Redis already counted, even when the job wasn’t labeled with a provider name).

  1. config/settings/test.py is missing
    Commit: 14fcb7d

Add config/settings/test.py so DJANGO_SETTINGS_MODULE = config.settings.test (already in pytest.ini) resolves in git: SQLite in-memory DB, in-process Celery, estela_queue_adapter stubbed before importing base, and literal spiderdata settings so dummy env values do not break imports. Adds config/settings/init.py so config.settings is a proper package. Updates pytest.ini with pythonpath = .. so database_adapters at the repo root is on the path when pytest runs from estela-api/, matching how local/CI runs are documented. No change to deployed Django settings (local / Helm).

  1. metered_proxy_name_from_job does refresh_from_db on every call
    Commit: b3408c3

metered_proxy_name_from_job no longer runs refresh_from_db on every call (removes an extra SELECT proxy_usage_data per hourly slice and other hot paths). It only uses the job instance in memory; the docstring says to refresh first if DB consistency matters. append_metered_usage_for_job_close still does a single refresh_from_db(fields=["proxy_usage_data"]) before resolving the proxy label so close/reconcile rows stay defensive without N queries per tick.

  1. read_scrapy_counters_from_redis is called 3× per job close
    Commit: 6ecab06

append_metered_usage_for_job_close now calls read_scrapy_counters_from_redis once at the top, uses raw = … or {}, then derives storage_final and proxy_cumulative from that dict for both the non-hourly JOB_CLOSE path and the hourly reconcile path (so no duplicate hgetall). Removed the two per-field helpers. Kept the in-function from api.utils import read_scrapy_counters_from_redis so existing tests that patch api.utils.read_scrapy_counters_from_redis still apply. Ledger tests still pass.

  1. delta_storage_bytes is signed (intentional) — flag for downstream consumers
    Commit: 6c3f044

document on MeteredUsageRecord.delta_storage_bytes that storage flow is a signed Redis cumulative diff: object byte totals can shrink, so negatives are expected; warns downstream billing/analytics not to filter or aggregate as if values were always positive (e.g. strict > 0 filters drop legitimate corrections). No runtime or schema change.

Copy link
Copy Markdown
Contributor

@joaquingx joaquingx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 🔥

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants