
perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-25x better)#798

Draft
mykaul wants to merge 7 commits into scylladb:master from mykaul:perf/timeseries-optimizations

Conversation

@mykaul mykaul commented Apr 5, 2026

Summary

Optimize the serialization and deserialization hot paths most relevant to time-series workloads. Each commit is an independent, benchmarked improvement.

Changes

Commit 1: Microbenchmark suite

  • Add benchmarks/bench_timeseries.py — standalone timeit-based microbenchmarks covering DateType.serialize/deserialize, varint_pack/unpack, MonotonicTimestampGenerator, and BoundStatement.bind() for a 5-column time-series schema.

Commit 2: DateType.serialize — replace calendar.timegm with integer arithmetic

  • Replace calendar.timegm(v.utctimetuple()) with timedelta arithmetic in cqltypes.py and encoder.py, eliminating the costly struct_time allocation.
  • 3–5x faster for datetime serialization.
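The timedelta approach can be sketched as follows. This is a minimal illustration, not the driver's exact code (`EPOCH_UTC`, `EPOCH_NAIVE`, and `timestamp_ms` are names invented here; the real changes live in `cqltypes.py` and `encoder.py`):

```python
from datetime import datetime, timezone

# Hypothetical helper names for illustration only.
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_NAIVE = datetime(1970, 1, 1)

def timestamp_ms(v: datetime) -> int:
    # Subtract the epoch directly: the resulting timedelta already
    # holds days/seconds/microseconds as plain ints, so no
    # struct_time named tuple is allocated (unlike
    # calendar.timegm(v.utctimetuple())).
    delta = v - (EPOCH_UTC if v.tzinfo is not None else EPOCH_NAIVE)
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000
```

Naive datetimes are treated as UTC (matching the PR's stated behavior); timezone-aware values subtract against an aware epoch.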

Commit 3: varint_pack/varint_unpack — use int.to_bytes/int.from_bytes

  • Replace the hand-rolled hex-string loop in marshal.py and cython_marshal.pyx with Python 3 builtins int.to_bytes() / int.from_bytes().
  • 2–21x faster depending on integer size (larger values see bigger gains).
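A sketch of the builtin-based approach (assumptions noted in comments; the byte-count formula here may differ in edge cases from the driver's, e.g. -128 encodes in two bytes rather than one, but every value round-trips correctly):

```python
def varint_pack(value: int) -> bytes:
    # Two's-complement, big-endian. Size = bit_length() plus one
    # sign bit, rounded up to whole bytes; int.to_bytes does the
    # encoding in a single C-level call, no Python loop.
    nbytes = (value.bit_length() + 8) // 8
    return value.to_bytes(nbytes, byteorder="big", signed=True)

def varint_unpack(data: bytes) -> int:
    # Single C-level parse; replaces the per-byte '%02x' formatting,
    # str.join, and int(..., 16) round-trip.
    return int.from_bytes(data, byteorder="big", signed=True)
```

The gains grow with integer size because the old path did O(n) string allocations per byte, while `to_bytes`/`from_bytes` are O(n) over raw memory in C.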

Commit 4: MonotonicTimestampGenerator — use time.time_ns()

  • Replace int(time.time() * 1e6) with time.time_ns() // 1000 for exact microsecond precision and pure integer arithmetic.
  • Remove unnecessary lock from __init__.
  • ~1.16x faster single-threaded.
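The pure-integer hot path can be sketched like this (a simplified stand-in, not the driver's `MonotonicTimestampGenerator`; class and attribute names are invented here):

```python
import time
from threading import Lock

class MicrosecondGenerator:
    # Hypothetical, simplified sketch of the generator's hot path.
    def __init__(self):
        # Plain assignments, no lock: no other thread can observe
        # the object before __init__ returns.
        self.last = 0
        self.lock = Lock()

    def __call__(self) -> int:
        with self.lock:
            # time.time_ns() // 1000 stays in integer arithmetic,
            # so there is no float precision loss for timestamps
            # far from the epoch (unlike int(time.time() * 1e6)).
            now = time.time_ns() // 1000
            self.last = now if now > self.last else self.last + 1
            return self.last
```

Monotonicity is preserved: if the clock has not advanced a full microsecond, the generator hands out `last + 1` instead of repeating a value.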

Commit 5: Cython-accelerated SerDateType timestamp serializer

Benchmark Results (Python 3.14.3)

DateType.serialize (Python path)

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| datetime (2025) | 1056 ns | 256 ns | 4.1x |
| datetime (epoch) | 929 ns | 176 ns | 5.3x |
| date object | 1492 ns | 511 ns | 2.9x |

SerDateType (Cython path)

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| datetime (2025) | 237 ns | 191 ns | 1.24x |
| datetime (epoch) | 139 ns | 125 ns | 1.11x |
| date object | 452 ns | 398 ns | 1.14x |
| raw int | 641 ns | 635 ns | 1.01x |

Note: Master baseline used a stale pre-built .so — the .pyx/.pxd sources did not exist on master. The main win from Commit 5 is that serializers.pyx is now source-tracked and TimestampType correctly resolves to the Cython fast path.

varint_pack / varint_unpack

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| pack small (42) | 192 ns | 79 ns | 2.4x |
| pack large (2^127) | 1158 ns | 96 ns | 12.1x |
| unpack large (2^127) | 1966 ns | 92 ns | 21.4x |
| unpack negative | 1209 ns | 91 ns | 13.3x |

MonotonicTimestampGenerator

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| single-thread call | 432 ns | 373 ns | 1.16x |

End-to-end: BoundStatement.bind (5-col time-series row)

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| bind 5-col row | 4486 ns | 3157 ns | 1.42x |

Testing

  • 162 unit tests pass, 1 skipped (pre-existing).

@mykaul mykaul changed the title perf: optimize time-series write/read hot paths perf: optimize time-series write/read hot paths (tens to hundreds of ns savings, 2-13x better) Apr 7, 2026
@mykaul mykaul changed the title perf: optimize time-series write/read hot paths (tens to hundreds of ns savings, 2-13x better) perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-13x better) Apr 7, 2026

mykaul commented Apr 10, 2026

Follow-up: Memoize cql_parameterized_type() on all type classes

Commit: 434465b

What changed

Added lazy memoization to cql_parameterized_type() on all 6 override sites: _CassandraType (base), TupleType, UserType, CompositeType, DynamicCompositeType, and VectorType.

The computed CQL type string is cached in a _cql_type_str class attribute. Since type classes are immutable after apply_parameters(), no invalidation is needed. The pattern is a simple None-sentinel check — no functools overhead.

Benchmark results (Python 3.14, 500k iterations)

| Type | Uncached | Cached | Speedup |
| --- | --- | --- | --- |
| Int32Type (simple) | 157 ns | 23 ns | 6.9x |
| MapType&lt;text, int&gt; | 464 ns | 20 ns | 22.9x |
| SetType&lt;float&gt; | 371 ns | 20 ns | 18.2x |
| ListType&lt;double&gt; | 357 ns | 21 ns | 17.3x |
| TupleType&lt;int, text, bool&gt; | 509 ns | 20 ns | 25.0x |
| MapType&lt;text, list&lt;tuple&lt;int, float, double&gt;&gt;&gt; | 636 ns | 56 ns | 11.4x |

Testing

  • 607 unit tests passed (10.2s)
  • No regressions


mykaul commented Apr 10, 2026

Follow-up: Skip ColDesc creation in bind() when column encryption is disabled

Commit: 3015986

What changed

Split BoundStatement.bind() into two code paths based on whether column_encryption_policy is set:

  • Fast path (no encryption — the common case): Calls col_spec.type.serialize(value, proto_version) directly. Eliminates per-column ColDesc namedtuple creation, ce_policy.contains_column() check, and ce_policy.column_type() lookup.
  • Encryption path: Unchanged behavior with full ColDesc creation and encryption logic.
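The split can be sketched like this (self-contained illustration with stub types; `bind_values` and `Int32Ser` are names invented here, and the real encryption path also consults `ce_policy.column_type()` before serializing):

```python
from collections import namedtuple

# Stand-ins for the driver's metadata objects, simplified.
ColDesc = namedtuple("ColDesc", ("keyspace", "table", "column"))
ColSpec = namedtuple("ColSpec", ("keyspace", "table", "name", "type"))

class Int32Ser:
    @staticmethod
    def serialize(value, proto_version):
        return value.to_bytes(4, "big", signed=True)

def bind_values(col_specs, values, proto_version, ce_policy=None):
    if ce_policy is None:
        # Fast path (the common case): call serialize() directly.
        # No per-column ColDesc namedtuple, no contains_column()
        # check, no column_type() lookup.
        return [spec.type.serialize(v, proto_version)
                for spec, v in zip(col_specs, values)]
    # Encryption path: full ColDesc creation and policy checks,
    # behavior unchanged (simplified here).
    out = []
    for spec, v in zip(col_specs, values):
        desc = ColDesc(spec.keyspace, spec.table, spec.name)
        raw = spec.type.serialize(v, proto_version)
        if ce_policy.contains_column(desc):
            raw = ce_policy.encrypt(desc, raw)
        out.append(raw)
    return out
```

Because the fast path is a single list comprehension over direct `serialize()` calls, the per-column allocation and lookup overhead disappears entirely, which matches the ~2-2.7x numbers below.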

Benchmark results (Python 3.14, 200k iterations, inner loop)

| Schema | Old | New | Saving | Speedup |
| --- | --- | --- | --- | --- |
| 3-col (int, double, text) | 1,375 ns | 523 ns | 852 ns | 2.63x |
| 5-col time-series | 2,226 ns | 1,013 ns | 1,213 ns | 2.20x |
| 8-col wide row | 3,495 ns | 1,317 ns | 2,178 ns | 2.65x |

Testing

  • 607 unit tests passed (10.6s)
  • No regressions

@mykaul mykaul changed the title perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-13x better) perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-25x better) Apr 11, 2026
@mykaul mykaul force-pushed the perf/timeseries-optimizations branch 2 times, most recently from 6f057f4 to 3015986 Compare April 11, 2026 16:07
mykaul added 7 commits April 11, 2026 19:26
…mestamps, bind

Standalone benchmark covering the hot paths for time-series write/read
workloads. Establishes baselines before optimization:

  DateType.serialize (datetime 2025):    ~1020 ns/call
  DateType.deserialize (2025):            ~695 ns/call
  varint_pack (medium):                   ~643 ns/call
  varint_unpack (medium):                ~1086 ns/call
  MonotonicTimestampGenerator:            ~374 ns/call
  BoundStatement.bind (5-col):           ~4027 ns/call
… in DateType.serialize

Eliminate the intermediate struct_time allocation in DateType.serialize()
and Encoder.cql_encode_datetime() by using direct timedelta arithmetic
instead of calendar.timegm(v.utctimetuple()).

The old code allocated a 9-field time.struct_time named tuple via
utctimetuple(), then calendar.timegm() disassembled it back to an epoch
integer.  The new code subtracts the epoch datetime directly to get a
timedelta, then extracts days/seconds/microseconds as integers — zero
intermediate object allocations.

Handles both naive (treated as UTC) and timezone-aware datetimes.

DateType.serialize datetime:  1022 -> 232 ns/call  (4.4x faster)
DateType.serialize date:      1369 -> 471 ns/call  (2.9x faster)
BoundStatement.bind (5-col):  4027 -> 3073 ns/call (1.3x faster)
…bytes

Replace the manual string-formatting hex conversion in varint_unpack()
and the byte-by-byte bytearray loop in varint_pack() with Python 3
builtins int.from_bytes() and int.to_bytes().

varint_unpack used '%02x' formatting per byte, str.join, then
int(..., 16) to parse back — O(n) string allocations.  int.from_bytes
is a single C-level call.

varint_pack used a while loop appending individual bytes to a bytearray,
then reversing.  int.to_bytes computes the result in one C call.

Also fixes the Cython path in cython_marshal.pyx which had the same
slow pattern with a TODO comment to optimize.

Adapted from PR scylladb#689 (varint_unpack) with new varint_pack implementation.

varint_pack  medium:   643 ->  90 ns/call  (7.1x faster)
varint_pack  large:   1109 ->  96 ns/call (11.6x faster)
varint_unpack medium: 1086 -> 115 ns/call  (9.4x faster)
varint_unpack large:  1940 -> 146 ns/call (13.3x faster)
… and integer arithmetic

Replace int(time.time() * 1e6) with time.time_ns() // 1000 to avoid
float precision loss for timestamps far from epoch.  Remove unnecessary
lock acquisition in __init__ (no other thread can see the object yet).
Use integer literal 1_000_000 instead of float 1e6 in _maybe_warn
threshold/interval comparisons.  Update tests to mock time_ns and use
nanosecond input values.
Restore serializers.pyx/pxd from the PR scylladb#748 branch and add SerDateType
that serializes datetime/date/numeric values to 8-byte big-endian int64
millisecond timestamps entirely in C, avoiding Python-level struct.pack.
Uses the same timedelta arithmetic as the pure-Python DateType.serialize
(Item B) but with C-level int64 byte-swapping.

Benchmark shows ~1.5x speedup over the already-optimized Python path
for datetime serialization (253 ns vs 381 ns per call).
Cache the computed CQL type string in a _cql_type_str class attribute.
The string is computed lazily on first call and returned from cache on
subsequent calls.  Since type classes are immutable after
apply_parameters(), no invalidation logic is needed.

All 6 cql_parameterized_type() overrides are covered:
  _CassandraType (base), TupleType, UserType, CompositeType,
  DynamicCompositeType, VectorType.

Benchmark (500k iters, Python 3.14):
  Int32Type (simple):         6.9x  (157 -> 23 ns)
  MapType<text, int>:        22.9x  (464 -> 20 ns)
  SetType<float>:            18.2x  (371 -> 20 ns)
  ListType<double>:          17.3x  (357 -> 21 ns)
  TupleType<int,text,bool>:  25.0x  (509 -> 20 ns)
  Nested map/list/tuple:     11.4x  (636 -> 56 ns)
Split BoundStatement.bind() into two code paths: when
column_encryption_policy is None (the overwhelmingly common case),
skip ColDesc namedtuple creation, ce_policy.contains_column() check,
and ce_policy.column_type() lookup per column.  Call
col_spec.type.serialize() directly instead.

When column encryption IS enabled, behavior is unchanged.

Benchmark (inner loop only, 200k iters, Python 3.14):
  3-col: 1375 -> 523 ns  (2.63x, saving 852 ns/bind)
  5-col: 2226 -> 1013 ns (2.20x, saving 1213 ns/bind)
  8-col: 3495 -> 1317 ns (2.65x, saving 2178 ns/bind)
@mykaul mykaul force-pushed the perf/timeseries-optimizations branch from 3015986 to 35ee857 Compare April 11, 2026 16:28