Skip to content

refactor: Use a Rust-based filesystem storage implementation#1833

Draft
janbuchar wants to merge 10 commits into
masterfrom
rust-filesystem-storage
Draft

refactor: Use a Rust-based filesystem storage implementation#1833
janbuchar wants to merge 10 commits into
masterfrom
rust-filesystem-storage

Conversation

@janbuchar

Copy link
Copy Markdown
Collaborator
  • Go to https://github.com/apify/crawlee-storage to see the implementation
  • The motivation is to avoid duplication of the local storage implementation - there are subtle performance optimizations in Crawlee for JS v3 and it'd be difficult to maintain parity

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. adhoc Ad-hoc unplanned task added during the sprint. labels Apr 3, 2026
@github-actions github-actions Bot added this to the 137th sprint - Tooling team milestone Apr 3, 2026
@github-actions github-actions Bot added the tested Temporary label used only programatically for some analytics. label Apr 3, 2026
@codecov

codecov Bot commented Apr 9, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.62366% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.66%. Comparing base (88bd05a) to head (462f747).
⚠️ Report is 79 commits behind head on master.

Files with missing lines Patch % Lines
...rage_clients/_file_system/_request_queue_client.py 92.98% 4 Missing ⚠️
...ge_clients/_file_system/_key_value_store_client.py 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1833      +/-   ##
==========================================
+ Coverage   92.28%   92.66%   +0.37%     
==========================================
  Files         157      157              
  Lines       10919    10277     -642     
==========================================
- Hits        10077     9523     -554     
+ Misses        842      754      -88     
Flag Coverage Δ
unit 92.66% <94.62%> (+0.37%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@janbuchar janbuchar force-pushed the rust-filesystem-storage branch from 0c637bb to d64db9b Compare April 9, 2026 17:38
@janbuchar

Copy link
Copy Markdown
Collaborator Author

After apify/crawlee-storage#47, filesystem path validation is not necessary

@janbuchar

janbuchar commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

More improvement material:

🔴 Block-worthy

  1. assume_sole_owner is never passed — silent crash-recovery regression
    _request_queue_client.py:112-117. The native binding exposes assume_sole_owner (stub line 197, default False), and the consumer's open() doesn't forward it. The JS PR correctly defaults it to true for the single-process case. Python inherits False, which per AGENTS.md means a crashed crawler's in-progress requests are not reclaimed on reopen — they stall until the lock window expires (default 3 minutes). On master, recovery was instant via the old _discover_existing_requests. This is an observable behavior regression for the overwhelmingly common single-process case.

  2. test_in_progress_requests_recovered_after_crash asserts behavior the code doesn't provide
    tests/.../test_fs_rq_client.py:205-247. The test fetches a request (writing a future-dated lock to disk), "crashes," reopens by id, and asserts all 3 are immediately fetchable (line 235-242). With assume_sole_owner=False, the locked request should NOT come back until its lock expires. So this test is either currently failing, flaky (passes only if the lock window elapses during the run), or passing for a reason that masks bug Basic repository setup #1. Either way it's lying about what the code does. Run it in isolation and watch — I'd bet it's the canary for Basic repository setup #1. Fixing Basic repository setup #1 (assume_sole_owner=True) makes this test honest.

  3. persist_state event-listener machinery drives a documented no-op
    _request_queue_client.py:121-136, 214-225, and the crash test calls it explicitly at line 228. The whole apparatus — service_locator.get_event_manager(), event_manager.on(PERSIST_STATE, ...), _on_persist_state, _deregister_event_listener, the try/except import dance — exists to invoke self._native_client.persist_state(), which AGENTS.md states is "a retained no-op for binding compatibility." The docstring at line 91-92 even claims "Queue state is automatically persisted by the native Rust client" — which is true precisely because persist_state does nothing. So this is ~30 lines of event wiring, an import, and per-event FFI overhead to call a function that returns immediately. Delete the listener block (or, if you want a teardown hook for future lock-release semantics, wire it to something that exists).

🟡 Duplication / dead code

  1. validate_subdirectory duplicated in all three clients — and the comment distrusts the library
    _dataset_client.py:102-103, _key_value_store_client.py:103-104, _request_queue_client.py:109-110. Each open() validates the subdirectory in Python "before handing off to the native client, which would otherwise create directories outside the intended location." But the Rust core already has validate_subdirectory with tests (utils.rs:274-310, dataset.rs:690). So either the native open() doesn't actually validate (a real Rust bug worth confirming) or this is dead defensive code in triplicate. Pick one: trust the library and delete three copies, or file a bug against the Rust side. The comment asserting the native client is unsafe is, as far as I can tell, false.

  2. KVS set_value re-implements serialization that arguably belongs above it
    _key_value_store_client.py:159-179. This branches on Noneapplication/x-none, JSON via json_dumps, str, bytes, then a str(value).encode() fallback (line 177). Two notes:

    • The application/x-none sentinel (line 164) is library territory per AGENTS.md — the core understands it on read; here the consumer re-encodes it on write. At minimum the binding should accept set_value(value=None) so this isn't hand-rolled.
    • The str(value).encode('utf-8') fallback (line 177) is the same smell the JS PR just deleted from its client. For a non-JSON, non-str, non-bytes value with an explicit non-JSON content type, you silently str() it. That's lossy coercion masquerading as a transport. Either reject it or document it.
    • Counterpoint: unlike JS (which moved serialization into a core key_value_store_codec.ts), crawlee-python's design keeps it here. So this is defensible as placed — but the None sentinel and the str() fallback are still warts.
  3. get_value swallows decode failures as "missing record"
    _key_value_store_client.py:137-148. A JSON or UTF-8 decode failure logs a warning and return None — i.e. a corrupt-but-present record is reported as absent. That's a debuggability landmine: a user with a malformed value gets "key not found" instead of an error. Matches historical behavior, maybe, but worth a louder signal than a warning.

🟢 Petty

  1. The 999_999_999_999 magic limit, now in two places plus the base interface
    _dataset_client.py:131 (default arg) and :162 (the if limit is not None else 999_999_999_999 — which is dead, since line 131 already defaults it, so limit is never None unless a caller explicitly passes None, in which case this re-substitutes the same literal). Redundant belt-and-suspenders around a sentinel that's also baked into the base _dataset_client.py:59. The native get_data accepts limit: int | None directly — pass None through and drop the literal.

  2. get_data/iterate_items unsupported-arg warnings duplicate a 7-key dict twice
    _dataset_client.py:142-158 and :182-196. Two nearly-identical unsupported_args dicts and warning blocks. Legitimate glue (the base interface declares these kwargs; only the consumer knows the FS backend ignores them) — but it's copy-pasted. Factor it into a helper. Minor.

  3. page_size tuning knob on the native iterators is never exposed
    _dataset_client.py:198-203, _key_value_store_client.py:192-195. The native iterate_items/iterate_keys accept a page_size; the consumer never passes it, so callers can't tune paging. Not a bug, just an unreachable knob.

Unused library surface (same as before, re-confirmed)

  • assume_sole_owner — exposed, not used (see Basic repository setup #1).
  • use_test_clock / advance_clock_for_testing — exposed, not used anywhere in src/ or tests/. Either write the lock-expiry tests that would use it, or stop carrying the plumbing.
  • set_expected_request_processing_time — exposed, not used, not surfaced via Configuration.

The one cross-PR asymmetry worth a meeting

JS defaults assumeSoleOwner=true; Python leaves it false. Same library, opposite crash-recovery behavior for the same single-process scenario. Pick one default and make both bindings agree, or you'll get bug reports that reproduce in one language and not the other. My vote: true for both (single-process is the common case), with the multi-process opt-out documented — which is exactly what JS did.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants