perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression (hundreds of ns saving, x1.5-2.5 speedup) by mykaul · Pull Request #809 · scylladb/python-driver

mykaul · 2026-04-08T13:37:19Z

Motivation

The CQL binary protocol v4 compression path using LZ4 involves Python-level byte manipulation on every compress/decompress call:

Compress: int32_pack(len(byts)) + lz4.block.compress(byts)[4:] — byte-swaps the length header from little-endian (Python lz4 library) to big-endian (CQL protocol), allocating intermediate bytes objects via slicing and concatenation.
Decompress: lz4.block.decompress(byts[3::-1] + byts[4:]) — reverses the 4-byte header and concatenates with the payload, again allocating intermediates.

This overhead is per-call and adds up on high-throughput workloads. By calling LZ4_compress_default() and LZ4_decompress_safe() directly through Cython's C interface (cdef extern from "lz4.h"), we eliminate all intermediate Python object allocations and perform the big-endian byte-order conversion with simple C pointer operations.
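
The header manipulation described above can be illustrated without LZ4 at all. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the payload bytes are a stand-in rather than real LZ4 output, and `int32_pack` is assumed to behave like `struct.pack(">i", ...)`.

```python
import struct


def swap_header_to_big_endian(le_framed: bytes) -> bytes:
    """Mimic the Python-level compress path: drop the little-endian length
    prefix and prepend a big-endian one. The slice and the concatenation
    each allocate an intermediate bytes object."""
    (uncompressed_len,) = struct.unpack_from("<I", le_framed, 0)
    return struct.pack(">i", uncompressed_len) + le_framed[4:]


def swap_header_to_little_endian(be_framed: bytes) -> bytes:
    """Mimic the Python-level decompress path: reverse the 4 header bytes."""
    return be_framed[3::-1] + be_framed[4:]


# Stand-in payload: these bytes are NOT real LZ4 output; only the header matters.
payload = b"\x10abc"
le_framed = struct.pack("<I", 1024) + payload  # layout assumed for the Python lz4 library
be_framed = swap_header_to_big_endian(le_framed)  # layout used on the CQL wire

print(be_framed[:4].hex())  # 00000400
```

Each helper builds at least two temporary `bytes` objects per call, which is the per-call cost the Cython module is meant to avoid.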

Change

New file cassandra/cython_lz4.pyx (201 lines): Cython module with direct C linkage to liblz4.
- lz4_compress(): Writes the 4-byte big-endian uncompressed-length header + raw LZ4 compressed data directly. Uses alloca for the temporary compression buffer on typical CQL frames (≤128 KiB) to avoid heap malloc/free overhead; falls back to malloc for rare oversized frames.
- lz4_decompress(): Reads the big-endian header, allocates exact-size output, decompresses in one shot.
- Input validation: not None type guards, LZ4_MAX_INPUT_SIZE overflow check (Py_ssize_t → int truncation), INT32_MAX check on compressed payload, 256 MiB safety cap on declared uncompressed size, result size verification.
- GIL released during LZ4 C calls (with nogil:).
Modified setup.py: Adds a separate Extension('cassandra.cython_lz4', ..., libraries=['lz4']) entry and excludes the file from the generic .pyx glob, which does not pass libraries.
Modified cassandra/connection.py: Import chain tries Cython module first, falls back to Python lz4 wrappers. Also works standalone without the Python lz4 package if the Cython module is available.
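
The import chain might look roughly like the sketch below. It is an illustration of the fallback order only, not the code in connection.py: the `LZ4_BACKEND` variable is hypothetical, and `struct.pack(">i", ...)` stands in for the driver's `int32_pack`.

```python
import struct

try:
    # Preferred path: the Cython extension added by this PR.
    from cassandra.cython_lz4 import lz4_compress, lz4_decompress
    LZ4_BACKEND = "cython"  # hypothetical marker, for illustration only
except ImportError:
    try:
        import lz4.block as lz4_block

        def lz4_compress(byts):
            # Replace the library's little-endian length prefix with a big-endian one.
            return struct.pack(">i", len(byts)) + lz4_block.compress(byts)[4:]

        def lz4_decompress(byts):
            # Restore the little-endian prefix the Python lz4 library expects.
            return lz4_block.decompress(byts[3::-1] + byts[4:])

        LZ4_BACKEND = "python"
    except ImportError:
        # Neither implementation is available: LZ4 compression is disabled.
        lz4_compress = lz4_decompress = None
        LZ4_BACKEND = None

print("LZ4 backend:", LZ4_BACKEND)
```

Because the Cython branch is tried first and does not import the Python lz4 package, LZ4 stays usable when only the extension is built.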

Tests

23 unit tests in tests/unit/cython/test_cython_lz4.py:

Round-trip tests: empty, single byte, small (16B), 1 KB, 8 KB, 64 KB, all-zeros, all-ones
Cross-compatibility tests (with Python lz4 package): Cython-compressed → Python-decompressed and vice versa at multiple sizes
Header format verification: Confirms big-endian wire format
Error handling: corrupt payload, too-short input, oversized header (>256 MiB), zero-length header
Type rejection: None, bytearray, memoryview, str — all raise TypeError
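
To make the error-handling and type-rejection cases concrete, here is a pure-Python sketch of the kind of header checks those tests exercise. The function name is hypothetical, and raising `ValueError` for malformed frames is an assumption; the description only states that the type rejections raise `TypeError`.

```python
import struct

MAX_DECOMPRESSED_SIZE = 256 * 1024 * 1024  # the 256 MiB cap named in the description


def read_uncompressed_length(frame):
    """Return the declared uncompressed length, or raise on a bad frame.

    Hypothetical helper: it mirrors the checks listed above, not the
    actual Cython implementation.
    """
    if not isinstance(frame, bytes):
        # Rejects None, bytearray, memoryview and str alike.
        raise TypeError("expected bytes, got %s" % type(frame).__name__)
    if len(frame) < 4:
        raise ValueError("input too short for a 4-byte length header")
    (declared,) = struct.unpack_from(">I", frame, 0)  # big-endian header
    if declared > MAX_DECOMPRESSED_SIZE:
        raise ValueError("declared size %d exceeds the 256 MiB cap" % declared)
    return declared


print(read_uncompressed_length(struct.pack(">I", 1024) + b"x"))  # 1024
```

Checking the declared size before allocating keeps a corrupt or hostile header from triggering a very large allocation.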

All 23 tests pass.

Benchmark Results

Measured on a quiet machine (load avg <0.2), pinned to a single core (taskset -c 0), min-of-5 × 10,000 iterations. Ranges from 3 consecutive runs:

Payload	Operation	Python (ns)	Cython (ns)	Speedup
1 KB	compress	393–456	178–198	2.07–2.56x
1 KB	decompress	307–345	135–149	2.28–2.32x
8 KB	compress	556–592	368–384	1.51–1.55x
8 KB	decompress	958–977	608–621	1.57–1.61x
64 KB	compress	2200–2260	2107–2336	0.94–1.07x
64 KB	decompress	7081–7134	4570–4615	1.54–1.56x

At 1–8 KB (the typical CQL hot path), Cython is 1.5–2.6x faster for compress and 1.6–2.3x faster for decompress. At 64 KB, compress is at parity (LZ4 C compression dominates ~95% of total time) and decompress remains 1.54–1.56x faster.

Benchmark script included at benchmarks/bench_lz4.py.
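
For readers who want to reproduce the "min-of-5 × 10,000 iterations" methodology, a minimal sketch with `timeit` follows. It is not the contents of benchmarks/bench_lz4.py; the helper names are hypothetical, and the measured workload is only the Python header swap so that the example runs without any LZ4 library.

```python
import struct
import timeit


def min_ns_per_call(func, arg, repeat=5, number=10_000):
    """Best-of-`repeat` timing, reported as nanoseconds per call."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(timings) / number * 1e9


def python_header_swap(byts):
    # Stand-in workload: only the header reversal, no LZ4 call.
    return byts[3::-1] + byts[4:]


frame = struct.pack(">I", 1024) + bytes(1024)
ns = min_ns_per_call(python_header_swap, frame)
print("header swap: %.0f ns/call" % ns)
```

Taking the minimum rather than the mean reduces the influence of scheduler noise. CPU pinning (taskset -c 0) is applied outside the script, and the lambda adds a small constant overhead to every measured call.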

…sion

Implement cython_lz4.pyx that calls LZ4_compress_default() and LZ4_decompress_safe() directly via Cython's cdef extern, bypassing the Python lz4 module's object allocation overhead in the hot compress/decompress path.

Key design decisions:
- Direct C linkage (cdef extern from "lz4.h") eliminates all intermediate Python object allocations for byte-order conversion
- Zero-copy compress: uses _PyBytes_Resize to shrink the output bytes object in-place (CPython-specific; documented and safe because the object has refcount=1 during construction)
- Wire-compatible with CQL binary protocol v4 format: [4 bytes big-endian uncompressed length][raw LZ4 compressed data]
- Safety guards: LZ4_MAX_INPUT_SIZE check (prevents Py_ssize_t→int truncation), INT32_MAX compressed payload check, 256 MiB decompressed size cap, result size verification
- bytes not None parameter typing rejects None/bytearray/memoryview
- PyPy-safe: this is a Cython module (CPython only); PyPy users automatically fall back to the pure-Python lz4 wrappers via the import chain in connection.py

Integration:
- connection.py: Cython import with fallback; also enables LZ4 without the Python lz4 package when the Cython extension is built
- setup.py: separate Extension with libraries=['lz4'], excluded from the .pyx glob (which lacks the -llz4 link flag)

Benchmark results (taskset -c 0, CPython 3.14):

Payload	Operation	Python (ns)	Cython (ns)	Speedup
1KB	compress	596	360	1.66x
1KB	decompress	313	136	2.30x
8KB	compress	1192	722	1.65x
8KB	decompress	1102	825	1.34x
64KB	compress	8179	3976	2.06x
64KB	decompress	6539	4890	1.34x

mykaul marked this pull request as draft April 8, 2026 13:42

mykaul changed the title ~~perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression~~ perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression (hundreds of ns saving, x1.5-2.5 speedup) Apr 8, 2026

mykaul force-pushed the perf/cython-lz4-direct-c-linkage branch from 12eea24 to 75a953a Compare April 8, 2026 18:06

mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/cython-lz4-direct-c-linkage
