
perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-25x better)#798

Draft
mykaul wants to merge 7 commits into scylladb:master from mykaul:perf/timeseries-optimizations

Conversation

@mykaul mykaul commented Apr 5, 2026

Summary

Optimize the serialization and deserialization hot paths most relevant to time-series workloads. Each commit is an independent, benchmarked improvement.

Changes

Commit 1: Microbenchmark suite

  • Add benchmarks/bench_timeseries.py — standalone timeit-based microbenchmarks covering DateType.serialize/deserialize, varint_pack/unpack, MonotonicTimestampGenerator, and BoundStatement.bind() for a 5-column time-series schema.

Commit 2: DateType.serialize — replace calendar.timegm with integer arithmetic

  • Replace calendar.timegm(v.utctimetuple()) with timedelta arithmetic in cqltypes.py and encoder.py, eliminating the costly struct_time allocation.
  • 3–5x faster for datetime serialization.
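The timedelta approach can be sketched as follows. This is a minimal illustration, not the driver's exact code (`EPOCH_UTC`, `EPOCH_NAIVE`, and `timestamp_ms` are names invented here; the real changes live in `cqltypes.py` and `encoder.py`):

```python
from datetime import datetime, timezone

# Hypothetical helper names for illustration only.
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_NAIVE = datetime(1970, 1, 1)

def timestamp_ms(v: datetime) -> int:
    # Subtract the epoch directly: the resulting timedelta already
    # holds days/seconds/microseconds as plain ints, so no
    # struct_time named tuple is allocated (unlike
    # calendar.timegm(v.utctimetuple())).
    delta = v - (EPOCH_UTC if v.tzinfo is not None else EPOCH_NAIVE)
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000
```

Naive datetimes are treated as UTC (matching the PR's stated behavior); timezone-aware values subtract against an aware epoch.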

Commit 3: varint_pack/varint_unpack — use int.to_bytes/int.from_bytes

  • Replace the hand-rolled hex-string loop in marshal.py and cython_marshal.pyx with Python 3 builtins int.to_bytes() / int.from_bytes().
  • 2–21x faster depending on integer size (larger values see bigger gains).
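A sketch of the builtin-based approach (assumptions noted in comments; the byte-count formula here may differ in edge cases from the driver's, e.g. -128 encodes in two bytes rather than one, but every value round-trips correctly):

```python
def varint_pack(value: int) -> bytes:
    # Two's-complement, big-endian. Size = bit_length() plus one
    # sign bit, rounded up to whole bytes; int.to_bytes does the
    # encoding in a single C-level call, no Python loop.
    nbytes = (value.bit_length() + 8) // 8
    return value.to_bytes(nbytes, byteorder="big", signed=True)

def varint_unpack(data: bytes) -> int:
    # Single C-level parse; replaces the per-byte '%02x' formatting,
    # str.join, and int(..., 16) round-trip.
    return int.from_bytes(data, byteorder="big", signed=True)
```

The gains grow with integer size because the old path did O(n) string allocations per byte, while `to_bytes`/`from_bytes` are O(n) over raw memory in C.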

Commit 4: MonotonicTimestampGenerator — use time.time_ns()

  • Replace int(time.time() * 1e6) with time.time_ns() // 1000 for exact microsecond precision and pure integer arithmetic.
  • Remove unnecessary lock from __init__.
  • ~1.16x faster single-threaded.
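The pure-integer hot path can be sketched like this (a simplified stand-in, not the driver's `MonotonicTimestampGenerator`; class and attribute names are invented here):

```python
import time
from threading import Lock

class MicrosecondGenerator:
    # Hypothetical, simplified sketch of the generator's hot path.
    def __init__(self):
        # Plain assignments, no lock: no other thread can observe
        # the object before __init__ returns.
        self.last = 0
        self.lock = Lock()

    def __call__(self) -> int:
        with self.lock:
            # time.time_ns() // 1000 stays in integer arithmetic,
            # so there is no float precision loss for timestamps
            # far from the epoch (unlike int(time.time() * 1e6)).
            now = time.time_ns() // 1000
            self.last = now if now > self.last else self.last + 1
            return self.last
```

Monotonicity is preserved: if the clock has not advanced a full microsecond, the generator hands out `last + 1` instead of repeating a value.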

Commit 5: Cython-accelerated SerDateType timestamp serializer

Benchmark Results (Python 3.14.3)

DateType.serialize (Python path)

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| datetime (2025) | 1056 ns | 256 ns | 4.1x |
| datetime (epoch) | 929 ns | 176 ns | 5.3x |
| date object | 1492 ns | 511 ns | 2.9x |

SerDateType (Cython path)

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| datetime (2025) | 237 ns | 191 ns | 1.24x |
| datetime (epoch) | 139 ns | 125 ns | 1.11x |
| date object | 452 ns | 398 ns | 1.14x |
| raw int | 641 ns | 635 ns | 1.01x |

Note: Master baseline used a stale pre-built .so — the .pyx/.pxd sources did not exist on master. The main win from Commit 5 is that serializers.pyx is now source-tracked and TimestampType correctly resolves to the Cython fast path.

varint_pack / varint_unpack

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| pack small (42) | 192 ns | 79 ns | 2.4x |
| pack large (2^127) | 1158 ns | 96 ns | 12.1x |
| unpack large (2^127) | 1966 ns | 92 ns | 21.4x |
| unpack negative | 1209 ns | 91 ns | 13.3x |

MonotonicTimestampGenerator

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| single-thread call | 432 ns | 373 ns | 1.16x |

End-to-end: BoundStatement.bind (5-col time-series row)

| Benchmark | Master | Perf | Speedup |
| --- | --- | --- | --- |
| bind 5-col row | 4486 ns | 3157 ns | 1.42x |

Testing

  • 162 unit tests pass, 1 skipped (pre-existing).

@mykaul mykaul changed the title perf: optimize time-series write/read hot paths perf: optimize time-series write/read hot paths (tens to hundreds of ns savings, 2-13x better) Apr 7, 2026
@mykaul mykaul changed the title perf: optimize time-series write/read hot paths (tens to hundreds of ns savings, 2-13x better) perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-13x better) Apr 7, 2026

mykaul commented Apr 10, 2026

Follow-up: Memoize cql_parameterized_type() on all type classes

Commit: 434465b

What changed

Added lazy memoization to cql_parameterized_type() on all 6 override sites: _CassandraType (base), TupleType, UserType, CompositeType, DynamicCompositeType, and VectorType.

The computed CQL type string is cached in a _cql_type_str class attribute. Since type classes are immutable after apply_parameters(), no invalidation is needed. The pattern is a simple None-sentinel check — no functools overhead.

Benchmark results (Python 3.14, 500k iterations)

| Type | Uncached | Cached | Speedup |
| --- | --- | --- | --- |
| Int32Type (simple) | 157 ns | 23 ns | 6.9x |
| MapType&lt;text, int&gt; | 464 ns | 20 ns | 22.9x |
| SetType&lt;float&gt; | 371 ns | 20 ns | 18.2x |
| ListType&lt;double&gt; | 357 ns | 21 ns | 17.3x |
| TupleType&lt;int, text, bool&gt; | 509 ns | 20 ns | 25.0x |
| MapType&lt;text, list&lt;tuple&lt;int, float, double&gt;&gt;&gt; | 636 ns | 56 ns | 11.4x |

Testing

  • 607 unit tests passed (10.2s)
  • No regressions


mykaul commented Apr 10, 2026

Follow-up: Skip ColDesc creation in bind() when column encryption is disabled

Commit: 3015986

What changed

Split BoundStatement.bind() into two code paths based on whether column_encryption_policy is set:

  • Fast path (no encryption — the common case): Calls col_spec.type.serialize(value, proto_version) directly. Eliminates per-column ColDesc namedtuple creation, ce_policy.contains_column() check, and ce_policy.column_type() lookup.
  • Encryption path: Unchanged behavior with full ColDesc creation and encryption logic.
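The split can be sketched like this (self-contained illustration with stub types; `bind_values` and `Int32Ser` are names invented here, and the real encryption path also consults `ce_policy.column_type()` before serializing):

```python
from collections import namedtuple

# Stand-ins for the driver's metadata objects, simplified.
ColDesc = namedtuple("ColDesc", ("keyspace", "table", "column"))
ColSpec = namedtuple("ColSpec", ("keyspace", "table", "name", "type"))

class Int32Ser:
    @staticmethod
    def serialize(value, proto_version):
        return value.to_bytes(4, "big", signed=True)

def bind_values(col_specs, values, proto_version, ce_policy=None):
    if ce_policy is None:
        # Fast path (the common case): call serialize() directly.
        # No per-column ColDesc namedtuple, no contains_column()
        # check, no column_type() lookup.
        return [spec.type.serialize(v, proto_version)
                for spec, v in zip(col_specs, values)]
    # Encryption path: full ColDesc creation and policy checks,
    # behavior unchanged (simplified here).
    out = []
    for spec, v in zip(col_specs, values):
        desc = ColDesc(spec.keyspace, spec.table, spec.name)
        raw = spec.type.serialize(v, proto_version)
        if ce_policy.contains_column(desc):
            raw = ce_policy.encrypt(desc, raw)
        out.append(raw)
    return out
```

Because the fast path is a single list comprehension over direct `serialize()` calls, the per-column allocation and lookup overhead disappears entirely, which matches the ~2-2.7x numbers below.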

Benchmark results (Python 3.14, 200k iterations, inner loop)

| Schema | Old | New | Saving | Speedup |
| --- | --- | --- | --- | --- |
| 3-col (int, double, text) | 1,375 ns | 523 ns | 852 ns | 2.63x |
| 5-col time-series | 2,226 ns | 1,013 ns | 1,213 ns | 2.20x |
| 8-col wide row | 3,495 ns | 1,317 ns | 2,178 ns | 2.65x |

Testing

  • 607 unit tests passed (10.6s)
  • No regressions

@mykaul mykaul changed the title perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-13x better) perf: optimize time-series write/read hot paths (10's to 100's of ns savings, 2-25x better) Apr 11, 2026
@mykaul mykaul force-pushed the perf/timeseries-optimizations branch 2 times, most recently from 6f057f4 to 3015986 Compare April 11, 2026 16:07
mykaul added 7 commits April 11, 2026 19:26
…mestamps, bind

Standalone benchmark covering the hot paths for time-series write/read
workloads. Establishes baselines before optimization:

  DateType.serialize (datetime 2025):    ~1020 ns/call
  DateType.deserialize (2025):            ~695 ns/call
  varint_pack (medium):                   ~643 ns/call
  varint_unpack (medium):                ~1086 ns/call
  MonotonicTimestampGenerator:            ~374 ns/call
  BoundStatement.bind (5-col):           ~4027 ns/call
… in DateType.serialize

Eliminate the intermediate struct_time allocation in DateType.serialize()
and Encoder.cql_encode_datetime() by using direct timedelta arithmetic
instead of calendar.timegm(v.utctimetuple()).

The old code allocated a 9-field time.struct_time named tuple via
utctimetuple(), then calendar.timegm() disassembled it back to an epoch
integer.  The new code subtracts the epoch datetime directly to get a
timedelta, then extracts days/seconds/microseconds as integers — zero
intermediate object allocations.

Handles both naive (treated as UTC) and timezone-aware datetimes.

DateType.serialize datetime:  1022 -> 232 ns/call  (4.4x faster)
DateType.serialize date:      1369 -> 471 ns/call  (2.9x faster)
BoundStatement.bind (5-col):  4027 -> 3073 ns/call (1.3x faster)
…bytes

Replace the manual string-formatting hex conversion in varint_unpack()
and the byte-by-byte bytearray loop in varint_pack() with Python 3
builtins int.from_bytes() and int.to_bytes().

varint_unpack used '%02x' formatting per byte, str.join, then
int(..., 16) to parse back — O(n) string allocations.  int.from_bytes
is a single C-level call.

varint_pack used a while loop appending individual bytes to a bytearray,
then reversing.  int.to_bytes computes the result in one C call.

Also fixes the Cython path in cython_marshal.pyx which had the same
slow pattern with a TODO comment to optimize.

Adapted from PR scylladb#689 (varint_unpack) with new varint_pack implementation.

varint_pack  medium:   643 ->  90 ns/call  (7.1x faster)
varint_pack  large:   1109 ->  96 ns/call (11.6x faster)
varint_unpack medium: 1086 -> 115 ns/call  (9.4x faster)
varint_unpack large:  1940 -> 146 ns/call (13.3x faster)
… and integer arithmetic

Replace int(time.time() * 1e6) with time.time_ns() // 1000 to avoid
float precision loss for timestamps far from epoch.  Remove unnecessary
lock acquisition in __init__ (no other thread can see the object yet).
Use integer literal 1_000_000 instead of float 1e6 in _maybe_warn
threshold/interval comparisons.  Update tests to mock time_ns and use
nanosecond input values.
Restore serializers.pyx/pxd from the PR scylladb#748 branch and add SerDateType
that serializes datetime/date/numeric values to 8-byte big-endian int64
millisecond timestamps entirely in C, avoiding Python-level struct.pack.
Uses the same timedelta arithmetic as the pure-Python DateType.serialize
(Item B) but with C-level int64 byte-swapping.

Benchmark shows ~1.5x speedup over the already-optimized Python path
for datetime serialization (253 ns vs 381 ns per call).
Cache the computed CQL type string in a _cql_type_str class attribute.
The string is computed lazily on first call and returned from cache on
subsequent calls.  Since type classes are immutable after
apply_parameters(), no invalidation logic is needed.

All 6 cql_parameterized_type() overrides are covered:
  _CassandraType (base), TupleType, UserType, CompositeType,
  DynamicCompositeType, VectorType.

Benchmark (500k iters, Python 3.14):
  Int32Type (simple):         6.9x  (157 -> 23 ns)
  MapType<text, int>:        22.9x  (464 -> 20 ns)
  SetType<float>:            18.2x  (371 -> 20 ns)
  ListType<double>:          17.3x  (357 -> 21 ns)
  TupleType<int,text,bool>:  25.0x  (509 -> 20 ns)
  Nested map/list/tuple:     11.4x  (636 -> 56 ns)
Split BoundStatement.bind() into two code paths: when
column_encryption_policy is None (the overwhelmingly common case),
skip ColDesc namedtuple creation, ce_policy.contains_column() check,
and ce_policy.column_type() lookup per column.  Call
col_spec.type.serialize() directly instead.

When column encryption IS enabled, behavior is unchanged.

Benchmark (inner loop only, 200k iters, Python 3.14):
  3-col: 1375 -> 523 ns  (2.63x, saving 852 ns/bind)
  5-col: 2226 -> 1013 ns (2.20x, saving 1213 ns/bind)
  8-col: 3495 -> 1317 ns (2.65x, saving 2178 ns/bind)
@mykaul mykaul force-pushed the perf/timeseries-optimizations branch from 3015986 to 35ee857 Compare April 11, 2026 16:28