perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression (hundreds of ns saving, x1.5-2.5 speedup) #809
Draft
mykaul wants to merge 1 commit into scylladb:master from
…sion

Implement cython_lz4.pyx that calls LZ4_compress_default() and LZ4_decompress_safe() directly via Cython's cdef extern, bypassing the Python lz4 module's object allocation overhead in the hot compress/decompress path.

Key design decisions:
- Direct C linkage (cdef extern from "lz4.h") eliminates all intermediate Python object allocations for byte-order conversion
- Zero-copy compress: uses _PyBytes_Resize to shrink the output bytes object in-place (CPython-specific; documented and safe because the object has refcount=1 during construction)
- Wire-compatible with CQL binary protocol v4 format: [4 bytes big-endian uncompressed length][raw LZ4 compressed data]
- Safety guards: LZ4_MAX_INPUT_SIZE check (prevents Py_ssize_t→int truncation), INT32_MAX compressed payload check, 256 MiB decompressed size cap, result size verification
- bytes not None parameter typing rejects None/bytearray/memoryview
- PyPy-safe: this is a Cython module (CPython only); PyPy users automatically fall back to the pure-Python lz4 wrappers via the import chain in connection.py

Integration:
- connection.py: Cython import with fallback; also enables LZ4 without the Python lz4 package when the Cython extension is built
- setup.py: separate Extension with libraries=['lz4'], excluded from the .pyx glob (which lacks the -llz4 link flag)

Benchmark results (taskset -c 0, CPython 3.14):

Payload  Operation   Python (ns)  Cython (ns)  Speedup
1KB      compress    596          360          1.66x
1KB      decompress  313          136          2.30x
8KB      compress    1192         722          1.65x
8KB      decompress  1102         825          1.34x
64KB     compress    8179         3976         2.06x
64KB     decompress  6539         4890         1.34x
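The wire format and size guards listed in the commit message can be sketched in plain Python. This is only an illustration and not the PR's Cython code: the names build_frame, split_frame and MAX_DECOMPRESSED are hypothetical, and the payload bytes are treated as an opaque LZ4 block that is never decompressed here.

```python
import struct

# Illustrative constants; the cap mirrors the 256 MiB limit named in the
# commit message, but the names themselves are assumptions.
HEADER = struct.Struct(">i")          # 4-byte big-endian signed int32
MAX_DECOMPRESSED = 256 * 1024 * 1024  # 256 MiB


def build_frame(uncompressed_len: int, lz4_payload: bytes) -> bytes:
    """Prefix an opaque LZ4 block with its big-endian uncompressed length."""
    if not 0 <= uncompressed_len <= 0x7FFFFFFF:
        raise OverflowError("uncompressed length does not fit in int32")
    return HEADER.pack(uncompressed_len) + lz4_payload


def split_frame(frame: bytes) -> tuple[int, bytes]:
    """Return (declared uncompressed length, opaque LZ4 payload)."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than the 4-byte header")
    (declared,) = HEADER.unpack_from(frame)
    # Reject negative or oversized declarations before allocating output.
    if declared < 0 or declared > MAX_DECOMPRESSED:
        raise ValueError("declared uncompressed size out of range")
    return declared, frame[HEADER.size:]
```

Checking the declared size before allocating the output buffer is what keeps a corrupt or hostile header from triggering a multi-gigabyte allocation.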
Motivation
The CQL binary protocol v4 compression path using LZ4 involves Python-level byte manipulation on every compress/decompress call:
- int32_pack(len(byts)) + lz4.block.compress(byts)[4:]: byte-swaps the length header from little-endian (Python lz4 library) to big-endian (CQL protocol), allocating intermediate bytes objects via slicing and concatenation.
- lz4.block.decompress(byts[3::-1] + byts[4:]): reverses the 4-byte header and concatenates with the payload, again allocating intermediates.

This overhead is per-call and adds up on high-throughput workloads. By calling LZ4_compress_default() and LZ4_decompress_safe() directly through Cython's C interface (cdef extern from "lz4.h"), we eliminate all intermediate Python object allocations and perform the big-endian byte-order conversion with simple C pointer operations.
Change
New file cassandra/cython_lz4.pyx (201 lines): Cython module with direct C linkage to liblz4.
- lz4_compress(): Writes the 4-byte big-endian uncompressed-length header + raw LZ4 compressed data directly. Uses alloca for the temporary compression buffer on typical CQL frames (≤128 KiB) to avoid heap malloc/free overhead; falls back to malloc for rare oversized frames.
- lz4_decompress(): Reads the big-endian header, allocates exact-size output, decompresses in one shot.
- not None type guards, LZ4_MAX_INPUT_SIZE overflow check (Py_ssize_t → int truncation), INT32_MAX check on compressed payload, 256 MiB safety cap on declared uncompressed size, result size verification.
- … (with nogil:).

Modified setup.py: Separate Extension('cassandra.cython_lz4', ..., libraries=['lz4']) entry with exclude from the .pyx glob (which doesn't pass libraries).

Modified cassandra/connection.py: Import chain tries Cython module first, falls back to Python lz4 wrappers. Also works standalone without the Python lz4 package if the Cython module is available.
Tests
23 unit tests in tests/unit/cython/test_cython_lz4.py:
- … lz4 package): Cython-compressed → Python-decompressed and vice versa at multiple sizes
- None, bytearray, memoryview, str: all raise TypeError

All 23 tests pass.
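The cross-compatibility cases rely on the two framings differing only in the byte order of the 4-byte length header, as described under Motivation. A minimal sketch of that relationship, assuming a hypothetical helper swap_header and placeholder payload bytes (no real LZ4 data or library call is involved):

```python
import struct


def swap_header(frame: bytes) -> bytes:
    """Reverse the 4-byte length header and keep the payload untouched.

    Applying it twice returns the original frame, so the same helper
    converts little-endian framing to big-endian and back.
    """
    if len(frame) < 4:
        raise ValueError("frame shorter than the 4-byte header")
    return frame[3::-1] + frame[4:]


payload = b"opaque-lz4-block"  # placeholder bytes, not real LZ4 output
le_frame = struct.pack("<I", 70000) + payload  # little-endian size prefix
be_frame = swap_header(le_frame)               # big-endian size prefix (CQL)

assert be_frame == struct.pack(">I", 70000) + payload
assert swap_header(be_frame) == le_frame
```

The slice and concatenation in swap_header are the per-call allocations that the Cython module is meant to avoid by writing the header bytes directly in C.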
Benchmark Results
Measured on a quiet machine (load avg <0.2), pinned to a single core (taskset -c 0), min-of-5 × 10,000 iterations. Ranges from 3 consecutive runs:

At 1–8 KB (the typical CQL hot path), Cython is 1.5–2.6x faster for compress and 1.6–2.3x faster for decompress. At 64 KB, compress is at parity (LZ4 C compression dominates ~95% of total time) and decompress remains 1.54–1.56x faster.
Benchmark script included at benchmarks/bench_lz4.py.
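The min-of-5 × 10,000 methodology can be approximated with the standard timeit module. The sketch below is not the contents of benchmarks/bench_lz4.py; the helper ns_per_call is hypothetical, and zlib stands in for LZ4 only so the example runs without extra packages, so its numbers say nothing about the PR's results.

```python
import timeit
import zlib  # stand-in codec for illustration only; the PR benchmarks LZ4


def ns_per_call(func, repeat=5, number=10_000):
    """Best-of-`repeat` timing, reported as nanoseconds per call."""
    # Taking the minimum filters out runs disturbed by scheduler noise.
    best = min(timeit.repeat(func, repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    payload = bytes(range(256)) * 4  # 1 KB of sample data
    compressed = zlib.compress(payload)
    print("compress   ns/call:", round(ns_per_call(lambda: zlib.compress(payload))))
    print("decompress ns/call:", round(ns_per_call(lambda: zlib.decompress(compressed))))
```

Running it under taskset -c 0, as the PR does, pins the process to one core and reduces variance between runs.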