refactor: Use a Rust-based filesystem storage implementation by janbuchar · Pull Request #1833 · apify/crawlee-python

janbuchar · 2026-04-03T16:50:58Z

Go to https://github.com/apify/crawlee-storage to see the implementation
The motivation is to avoid duplication of the local storage implementation - there are subtle performance optimizations in Crawlee for JS v3 and it'd be difficult to maintain parity

codecov · 2026-04-09T13:13:13Z

Codecov Report

❌ Patch coverage is 94.62366% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.66%. Comparing base (88bd05a) to head (462f747).
⚠️ Report is 79 commits behind head on master.

Files with missing lines	Patch %	Lines
...rage_clients/_file_system/_request_queue_client.py	92.98%	4 Missing ⚠️
...ge_clients/_file_system/_key_value_store_client.py	95.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1833      +/-   ##
==========================================
+ Coverage   92.28%   92.66%   +0.37%     
==========================================
  Files         157      157              
  Lines       10919    10277     -642     
==========================================
- Hits        10077     9523     -554     
+ Misses        842      754      -88

Flag	Coverage Δ
unit	`92.66% <94.62%> (+0.37%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…rage

janbuchar · 2026-06-19T12:10:22Z

After apify/crawlee-storage#47, filesystem path validation is not necessary

…rage

janbuchar · 2026-06-23T19:57:55Z

More improvement material:

🔴 Block-worthy

assume_sole_owner is never passed — silent crash-recovery regression
_request_queue_client.py:112-117. The native binding exposes assume_sole_owner (stub line 197, default False), and the consumer's open() doesn't forward it. The JS PR correctly defaults it to true for the single-process case. Python inherits False, which per AGENTS.md means a crashed crawler's in-progress requests are not reclaimed on reopen — they stall until the lock window expires (default 3 minutes). On master, recovery was instant via the old _discover_existing_requests. This is an observable behavior regression for the overwhelmingly common single-process case.
test_in_progress_requests_recovered_after_crash asserts behavior the code doesn't provide
tests/.../test_fs_rq_client.py:205-247. The test fetches a request (writing a future-dated lock to disk), "crashes," reopens by id, and asserts all 3 are immediately fetchable (line 235-242). With assume_sole_owner=False, the locked request should NOT come back until its lock expires. So this test is either currently failing, flaky (passes only if the lock window elapses during the run), or passing for a reason that masks bug Basic repository setup #1. Either way it's lying about what the code does. Run it in isolation and watch — I'd bet it's the canary for Basic repository setup #1. Fixing Basic repository setup #1 (assume_sole_owner=True) makes this test honest.
persist_state event-listener machinery drives a documented no-op
_request_queue_client.py:121-136, 214-225, and the crash test calls it explicitly at line 228. The whole apparatus — service_locator.get_event_manager(), event_manager.on(PERSIST_STATE, ...), _on_persist_state, _deregister_event_listener, the try/except import dance — exists to invoke self._native_client.persist_state(), which AGENTS.md states is "a retained no-op for binding compatibility." The docstring at line 91-92 even claims "Queue state is automatically persisted by the native Rust client" — which is true precisely because persist_state does nothing. So this is ~30 lines of event wiring, an import, and per-event FFI overhead to call a function that returns immediately. Delete the listener block (or, if you want a teardown hook for future lock-release semantics, wire it to something that exists).

🟡 Duplication / dead code

validate_subdirectory duplicated in all three clients — and the comment distrusts the library
_dataset_client.py:102-103, _key_value_store_client.py:103-104, _request_queue_client.py:109-110. Each open() validates the subdirectory in Python "before handing off to the native client, which would otherwise create directories outside the intended location." But the Rust core already has validate_subdirectory with tests (utils.rs:274-310, dataset.rs:690). So either the native open() doesn't actually validate (a real Rust bug worth confirming) or this is dead defensive code in triplicate. Pick one: trust the library and delete three copies, or file a bug against the Rust side. The comment asserting the native client is unsafe is, as far as I can tell, false.
KVS set_value re-implements serialization that arguably belongs above it
_key_value_store_client.py:159-179. This branches on None→application/x-none, JSON via json_dumps, str, bytes, then a str(value).encode() fallback (line 177). Two notes:
- The application/x-none sentinel (line 164) is library territory per AGENTS.md — the core understands it on read; here the consumer re-encodes it on write. At minimum the binding should accept set_value(value=None) so this isn't hand-rolled.
- The str(value).encode('utf-8') fallback (line 177) is the same smell the JS PR just deleted from its client. For a non-JSON, non-str, non-bytes value with an explicit non-JSON content type, you silently str() it. That's lossy coercion masquerading as a transport. Either reject it or document it.
- Counterpoint: unlike JS (which moved serialization into a core key_value_store_codec.ts), crawlee-python's design keeps it here. So this is defensible as placed — but the None sentinel and the str() fallback are still warts.
get_value swallows decode failures as "missing record"
_key_value_store_client.py:137-148. A JSON or UTF-8 decode failure logs a warning and return None — i.e. a corrupt-but-present record is reported as absent. That's a debuggability landmine: a user with a malformed value gets "key not found" instead of an error. Matches historical behavior, maybe, but worth a louder signal than a warning.

🟢 Petty

The 999_999_999_999 magic limit, now in two places plus the base interface
_dataset_client.py:131 (default arg) and :162 (the if limit is not None else 999_999_999_999 — which is dead, since line 131 already defaults it, so limit is never None unless a caller explicitly passes None, in which case this re-substitutes the same literal). Redundant belt-and-suspenders around a sentinel that's also baked into the base _dataset_client.py:59. The native get_data accepts limit: int | None directly — pass None through and drop the literal.
get_data/iterate_items unsupported-arg warnings duplicate a 7-key dict twice
_dataset_client.py:142-158 and :182-196. Two nearly-identical unsupported_args dicts and warning blocks. Legitimate glue (the base interface declares these kwargs; only the consumer knows the FS backend ignores them) — but it's copy-pasted. Factor it into a helper. Minor.
page_size tuning knob on the native iterators is never exposed
_dataset_client.py:198-203, _key_value_store_client.py:192-195. The native iterate_items/iterate_keys accept a page_size; the consumer never passes it, so callers can't tune paging. Not a bug, just an unreachable knob.

Unused library surface (same as before, re-confirmed)

assume_sole_owner — exposed, not used (see Basic repository setup #1).
use_test_clock / advance_clock_for_testing — exposed, not used anywhere in src/ or tests/. Either write the lock-expiry tests that would use it, or stop carrying the plumbing.
set_expected_request_processing_time — exposed, not used, not surfaced via Configuration.

The one cross-PR asymmetry worth a meeting

JS defaults assumeSoleOwner=true; Python leaves it false. Same library, opposite crash-recovery behavior for the same single-process scenario. Pick one default and make both bindings agree, or you'll get bug reports that reproduce in one language and not the other. My vote: true for both (single-process is the common case), with the multi-process opt-out documented — which is exactly what JS did.

janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. adhoc Ad-hoc unplanned task added during the sprint. labels Apr 3, 2026

github-actions Bot assigned janbuchar Apr 3, 2026

github-actions Bot added this to the 137th sprint - Tooling team milestone Apr 3, 2026

github-actions Bot added the tested Temporary label used only programatically for some analytics. label Apr 3, 2026

janbuchar force-pushed the rust-filesystem-storage branch from 0c637bb to d64db9b Compare April 9, 2026 17:38

janbuchar added 6 commits April 9, 2026 19:45

refactor: Use Rust-based filesystem storage implementation

c419614

Temporarily install crawlee_storage from github

311e85e

get_public_url is async

ee2b554

adapt to improved async iteration

8583297

Update crawlee-storage

0716c20

Attempt to un-hang tests

462f747

janbuchar force-pushed the rust-filesystem-storage branch from d64db9b to 462f747 Compare April 9, 2026 17:45

janbuchar mentioned this pull request Apr 23, 2026

"ayafileio: a truly async file I/O library with native IOCP/io_uring/Dispatch I/O support" #1856

Closed

janbuchar added 3 commits June 12, 2026 16:38

Adapt to recent crawlee-storage changes

f6cc1ab

Merge remote-tracking branch 'origin/master' into rust-filesystem-sto…

6b27a79

…rage

Update

dac3511

Merge remote-tracking branch 'origin/master' into rust-filesystem-sto…

eb58220

…rage

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Use a Rust-based filesystem storage implementation#1833

refactor: Use a Rust-based filesystem storage implementation#1833
janbuchar wants to merge 10 commits into
masterfrom
rust-filesystem-storage

janbuchar commented Apr 3, 2026

Uh oh!

codecov Bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

janbuchar commented Jun 19, 2026

Uh oh!

janbuchar commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

janbuchar commented Apr 3, 2026

Uh oh!

codecov Bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

janbuchar commented Jun 19, 2026

Uh oh!

janbuchar commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔴 Block-worthy

🟡 Duplication / dead code

🟢 Petty

Unused library surface (same as before, re-confirmed)

The one cross-PR asymmetry worth a meeting

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Apr 9, 2026 •

edited

Loading

janbuchar commented Jun 23, 2026 •

edited

Loading