perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression (hundreds of ns saving, x1.5-2.5 speedup)#809

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/cython-lz4-direct-c-linkage

mykaul commented Apr 8, 2026

Motivation

The CQL binary protocol v4 compression path using LZ4 involves Python-level byte manipulation on every compress/decompress call:

  • Compress: int32_pack(len(byts)) + lz4.block.compress(byts)[4:] — byte-swaps the length header from little-endian (Python lz4 library) to big-endian (CQL protocol), allocating intermediate bytes objects via slicing and concatenation.
  • Decompress: lz4.block.decompress(byts[3::-1] + byts[4:]) — reverses the 4-byte header and concatenates with the payload, again allocating intermediates.

This overhead is per-call and adds up on high-throughput workloads. By calling LZ4_compress_default() and LZ4_decompress_safe() directly through Cython's C interface (cdef extern from "lz4.h"), we eliminate all intermediate Python object allocations and perform the big-endian byte-order conversion with simple C pointer operations.
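
The header swap described above can be illustrated without liblz4 at all. The following is a minimal sketch, not code from the PR: `swap_length_header` is a hypothetical helper name, and the payload is placeholder bytes standing in for real LZ4 output. It shows that the slice expression used on the Python path reverses the 4-byte length prefix in either direction, at the cost of two slices and one concatenation per call.

```python
import struct


def swap_length_header(framed: bytes) -> bytes:
    """Reverse the 4-byte length prefix and leave the payload untouched.

    The same expression converts little-endian to big-endian and back.
    It allocates a 4-byte slice, a copy of the payload, and the final
    concatenated result.
    """
    return framed[3::-1] + framed[4:]


# Stand-in for a frame with a little-endian length prefix.
payload = b"\x01\x02\x03"  # placeholder bytes, NOT real LZ4 data
lz4_style = struct.pack("<I", 1024) + payload

# Same frame with the prefix in big-endian order, as CQL v4 expects.
cql_style = swap_length_header(lz4_style)
```

Applying the function twice returns the original frame, which is why the decompress path can reuse the same slicing trick.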

Change

  • New file cassandra/cython_lz4.pyx (201 lines): Cython module with direct C linkage to liblz4.

    • lz4_compress(): Writes the 4-byte big-endian uncompressed-length header + raw LZ4 compressed data directly. Uses alloca for the temporary compression buffer on typical CQL frames (≤128 KiB) to avoid heap malloc/free overhead; falls back to malloc for rare oversized frames.
    • lz4_decompress(): Reads the big-endian header, allocates exact-size output, decompresses in one shot.
    • Input validation: not None type guards, LZ4_MAX_INPUT_SIZE overflow check (Py_ssize_t → int truncation), INT32_MAX check on compressed payload, 256 MiB safety cap on declared uncompressed size, result size verification.
    • GIL released during LZ4 C calls (with nogil:).
  • Modified setup.py: Adds a separate Extension('cassandra.cython_lz4', ..., libraries=['lz4']) entry and excludes the module from the generic .pyx glob, which does not pass libraries and so would not link against liblz4.

  • Modified cassandra/connection.py: Import chain tries Cython module first, falls back to Python lz4 wrappers. Also works standalone without the Python lz4 package if the Cython module is available.
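
The import chain described for connection.py might look roughly like the sketch below. This is an assumption about its shape, not the actual diff: `load_lz4_codec` and the returned `source` label are hypothetical, and `struct.pack(">i", ...)` stands in for the driver's `int32_pack` helper. Only the module path `cassandra.cython_lz4`, the function names `lz4_compress` and `lz4_decompress`, and the two fallback expressions come from the PR text.

```python
def load_lz4_codec():
    """Return (compress, decompress, source), preferring the Cython module."""
    try:
        # Present only when the Cython extension from this PR was built.
        from cassandra.cython_lz4 import lz4_compress, lz4_decompress
        return lz4_compress, lz4_decompress, "cython"
    except ImportError:
        pass

    try:
        import lz4.block as lz4_block
    except ImportError:
        # Neither implementation is available: LZ4 stays disabled.
        return None, None, None

    from struct import pack  # stand-in for the driver's int32_pack

    def py_compress(byts):
        # Replace lz4's little-endian size prefix with a big-endian one.
        return pack(">i", len(byts)) + lz4_block.compress(byts)[4:]

    def py_decompress(byts):
        # Reverse the big-endian prefix back to lz4's little-endian form.
        return lz4_block.decompress(byts[3::-1] + byts[4:])

    return py_compress, py_decompress, "python"


compress, decompress, source = load_lz4_codec()
```

Trying the Cython import first means a build without the Python lz4 package can still offer LZ4, which matches the standalone behavior the PR describes.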

Tests

23 unit tests in tests/unit/cython/test_cython_lz4.py:

  • Round-trip tests: empty, single byte, small (16B), 1 KB, 8 KB, 64 KB, all-zeros, all-ones
  • Cross-compatibility tests (with Python lz4 package): Cython-compressed → Python-decompressed and vice versa at multiple sizes
  • Header format verification: Confirms big-endian wire format
  • Error handling: corrupt payload, too-short input, oversized header (>256 MiB), zero-length header
  • Type rejection: None, bytearray, memoryview, str — all raise TypeError

All 23 tests pass.
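
To make the header-related checks concrete, here is a pure-Python model of the kind of validation those tests exercise. It is a sketch under stated assumptions rather than the Cython implementation: `read_declared_size`, the constant names, and the choice of ValueError for malformed frames are all hypothetical (the PR only states that wrong argument types raise TypeError). How the real module treats a zero-length header is not specified in the description, so the sketch simply returns 0.

```python
import struct

HEADER_SIZE = 4
MAX_DECOMPRESSED_SIZE = 256 * 1024 * 1024  # 256 MiB cap named in the PR


def read_declared_size(frame):
    """Validate a CQL LZ4 frame header and return the declared size."""
    # Mirror the 'bytes not None' typing: reject None, bytearray,
    # memoryview and str before touching the buffer.
    if not isinstance(frame, bytes):
        raise TypeError("expected bytes, got %s" % type(frame).__name__)
    if len(frame) < HEADER_SIZE:
        raise ValueError("frame shorter than the 4-byte length header")
    # Read as unsigned so that a negative signed length shows up as a
    # huge value and is rejected by the cap below.
    (declared,) = struct.unpack(">I", frame[:HEADER_SIZE])
    if declared > MAX_DECOMPRESSED_SIZE:
        raise ValueError("declared size %d exceeds 256 MiB cap" % declared)
    return declared


size = read_declared_size(struct.pack(">I", 1024) + b"payload")
```

Checking the declared size before allocating is what keeps a corrupt or hostile header from triggering a multi-gigabyte allocation.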

Benchmark Results

Measured on a quiet machine (load avg <0.2), pinned to a single core (taskset -c 0), min-of-5 × 10,000 iterations. Ranges from 3 consecutive runs:

Payload  Operation   Python (ns)  Cython (ns)  Speedup
1 KB     compress    393–456      178–198      2.07–2.56x
1 KB     decompress  307–345      135–149      2.28–2.32x
8 KB     compress    556–592      368–384      1.51–1.55x
8 KB     decompress  958–977      608–621      1.57–1.61x
64 KB    compress    2200–2260    2107–2336    0.94–1.07x
64 KB    decompress  7081–7134    4570–4615    1.54–1.56x

At 1–8 KB (the typical CQL hot path), Cython is 1.5–2.6x faster for compress and 1.6–2.3x faster for decompress. At 64 KB, compress is at parity (LZ4 C compression dominates ~95% of total time) and decompress remains 1.54–1.56x faster.

Benchmark script included at benchmarks/bench_lz4.py.
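
For readers who want to reproduce the "min-of-5 × 10,000 iterations" methodology, a timing harness along these lines would do it. This is a sketch, not the contents of the included benchmark script: `ns_per_call` is a hypothetical helper, and the workload shown is the pure-Python header swap used as a stand-in for a real codec. Core pinning happens outside Python, by launching the script under `taskset -c 0`.

```python
import timeit


def ns_per_call(func, arg, repeat=5, number=10_000):
    """Return the best-of-`repeat` mean time per call, in nanoseconds.

    Each of the `repeat` rounds times `number` calls; the minimum round
    is taken as the least noisy estimate. The lambda wrapper adds a small
    constant overhead to every measurement.
    """
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(timings) / number * 1e9


def header_swap(byts):
    # Stand-in workload; substitute the compress or decompress callable
    # under test.
    return byts[3::-1] + byts[4:]


payload = bytes(1024)  # 1 KB of zeros
swap_ns = ns_per_call(header_swap, payload)
```

Taking the minimum rather than the mean follows the usual timeit guidance: slower rounds mostly reflect interference from the rest of the system, not the code being measured.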

mykaul marked this pull request as draft April 8, 2026 13:42
mykaul changed the title from "perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression" to "perf: add Cython LZ4 wrapper with direct C linkage for CQL v4 compression (hundreds of ns saving, x1.5-2.5 speedup)" Apr 8, 2026
…sion

Implement cython_lz4.pyx that calls LZ4_compress_default() and
LZ4_decompress_safe() directly via Cython's cdef extern, bypassing
the Python lz4 module's object allocation overhead in the hot
compress/decompress path.

Key design decisions:
- Direct C linkage (cdef extern from "lz4.h") eliminates all
  intermediate Python object allocations for byte-order conversion
- Zero-copy compress: uses _PyBytes_Resize to shrink the output
  bytes object in-place (CPython-specific; documented and safe
  because the object has refcount=1 during construction)
- Wire-compatible with CQL binary protocol v4 format:
  [4 bytes big-endian uncompressed length][raw LZ4 compressed data]
- Safety guards: LZ4_MAX_INPUT_SIZE check (prevents Py_ssize_t→int
  truncation), INT32_MAX compressed payload check, 256 MiB
  decompressed size cap, result size verification
- bytes not None parameter typing rejects None/bytearray/memoryview
- PyPy-safe: this is a Cython module (CPython only); PyPy users
  automatically fall back to the pure-Python lz4 wrappers via the
  import chain in connection.py

Integration:
- connection.py: Cython import with fallback; also enables LZ4
  without the Python lz4 package when the Cython extension is built
- setup.py: separate Extension with libraries=['lz4'], excluded
  from the .pyx glob (which lacks the -llz4 link flag)

Benchmark results (taskset -c 0, CPython 3.14):
  Payload  Operation     Python (ns)  Cython (ns)  Speedup
  1KB      compress            596        360       1.66x
  1KB      decompress          313        136       2.30x
  8KB      compress           1192        722       1.65x
  8KB      decompress         1102        825       1.34x
  64KB     compress           8179       3976       2.06x
  64KB     decompress         6539       4890       1.34x
mykaul force-pushed the perf/cython-lz4-direct-c-linkage branch from 12eea24 to 75a953a April 8, 2026 18:06