perf: optimize Tablet memory layout and per-query lookup speed by mykaul · Pull Request #812 · scylladb/python-driver

mykaul · 2026-04-09T10:48:01Z

Summary

Five incremental optimizations to the tablets hot path, each in a separate commit:

1. __slots__ on Tablet: eliminates the per-instance __dict__ allocation.
2. Tuple replicas: replicas are immutable after creation, so they are stored as a tuple instead of a list.
3. Cached _replica_dict: build a {host_id: shard_id} dict once at construction and use it in both per-query hot paths (policies.py and pool.py). This also fixes a latent iterator-consumption bug where passing a generator to Tablet() would silently produce an empty _replica_dict.
4. Streamlined from_row: inline the _is_valid_tablet check to eliminate a staticmethod descriptor lookup, an extra function call, and a redundant is not None guard.
5. Parallel token index lists: maintain _first_tokens and _last_tokens as dicts of plain list[int] alongside _tablets, so bisect_left runs entirely in C on native ints instead of calling an attrgetter callback per comparison. Follow-up to PR #757 ("perf: use stdlib bisect and attrgetter in tablets.py (100's of ns, 1.5-5.6x speedup)"), which identified this opportunity in its own benchmarks.

Benchmarks (Python 3.14, best-of-5 rounds)

Memory per Tablet (deep, `pympler.asizeof`, 3 replicas)

| State | Deep size | Change vs baseline |
|---|---|---|
| Baseline (no `__slots__`, `list`, no dict) | 1856 B | n/a |
| After commit 1 (`__slots__`) | 1400 B | -456 B |
| After commit 2 (`tuple`) | 1384 B | -472 B |
| After commit 3 (`_replica_dict`) | 1616 B | -240 B |

Commits 1+2 save 472 bytes/tablet. Commit 3 spends 232 bytes back on the _replica_dict cache. Net: 240 bytes saved per tablet (13%). Commit 5 adds 16 bytes/tablet (two ints in parallel lists) — negligible.

Shallow breakdown (sys.getsizeof)

| Component | Before | After | Change |
|---|---|---|---|
| instance shell | 48 B | 64 B | +16 B (slots have fixed overhead) |
| `__dict__` | 296 B | 0 B | -296 B |
| replicas container | 88 B (list) | 72 B (tuple) | -16 B |
| `_replica_dict` (3 entries) | n/a | 224 B | +224 B |

`get_tablet_for_key` (hit — the primary per-query hot path)

| Tablets | Before | After | Saved | Speedup |
|---|---|---|---|---|
| 10 | 293 ns | 216 ns | 78 ns | 1.36x |
| 100 | 351 ns | 233 ns | 118 ns | 1.51x |
| 1,000 | 448 ns | 267 ns | 181 ns | 1.68x |
| 10,000 | 537 ns | 282 ns | 255 ns | 1.90x |

Miss path (N=1000): 458 ns -> 229 ns (2.0x).

Other per-query hot paths

| Path | Before | After | Speedup |
|---|---|---|---|
| `policies.py`: `set(map(lambda r: r[0], tablet.replicas))` | 372 ns | 18 ns (`tablet._replica_dict`) | 20.7x |
| `pool.py`: linear scan for shard_id | 199 ns | 73 ns (`dict.get`) | 2.7x |
| `replica_contains_host_id` | O(n) linear | O(1) dict `in` | n/a |

Construction (`Tablet.from_row`)

| State | Time | Change |
|---|---|---|
| Original (master) | 143 ns | n/a |
| After commit 3 (`_replica_dict` + old `from_row`) | 465 ns | +322 ns |
| After commit 4 (streamlined `from_row`) | 410 ns | +267 ns |

Commit 4 recovers ~54 ns (12%) of the construction regression. The remaining +267 ns vs master is the irreducible cost of building tuple() + dict() at construction time (~250 ns), which pays for itself on every query.
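To make the "pays for itself" claim concrete, here is a small back-of-the-envelope calculation using only the figures from the tables above. It assumes, as an illustration, that a query routed through a tablet hits both the policies.py and pool.py paths once; the variable names are made up for this sketch.

```python
# Figures copied from the benchmark tables above (nanoseconds).
construction_regression_ns = 410 - 143  # from_row after commit 4 vs. master
policies_saving_ns = 372 - 18           # policies.py replica-set path
pool_saving_ns = 199 - 73               # pool.py shard lookup path

# Assumption: each query exercises both per-query paths exactly once.
per_query_saving_ns = policies_saving_ns + pool_saving_ns
break_even_queries = construction_regression_ns / per_query_saving_ns

print(f"extra construction cost: {construction_regression_ns} ns")
print(f"saving per query:        {per_query_saving_ns} ns")
print(f"break-even after:        {break_even_queries:.2f} queries")
```

Under that assumption the extra construction cost is recovered within the first query that uses the tablet.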

Tests

All 223 unit tests pass (tablets, policies, pool, metadata, cluster, response_future). 7 new tests added for _replica_dict behavior, including an iterator edge-case regression test.

Add __slots__ to the Tablet class, removing the per-instance __dict__ allocation. Tablets are created frequently (one per token range per table) and are long-lived, so the cumulative memory savings are significant.

- Before: 416 bytes/tablet (48 instance + 96 __dict__ + 80 replicas + 192 tuples)
- After: 328 bytes/tablet (56 instance + 0 __dict__ + 80 replicas + 192 tuples)
- Saving: 88 bytes/tablet (21%)

Scale impact (3 replicas/tablet):

- 12,800 tablets (100 tables x 128): saves 1.1 MB
- 128,000 tablets (1000 tables x 128): saves 10.7 MB
- 256,000 tablets (1000 tables x 256): saves 21.5 MB

Tablet.from_row construction also improves:

- Before: 186 ns/call
- After: 147 ns/call (1.27x faster, -21%)
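As a rough illustration of the effect described in this commit, the sketch below compares a plain class with a `__slots__` class. `PlainTablet` and `SlottedTablet` are hypothetical stand-ins, not the driver's actual classes, and the exact byte counts printed depend on the Python version and platform.

```python
import sys


class PlainTablet:
    """Hypothetical stand-in: attributes live in a per-instance __dict__."""

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = replicas


class SlottedTablet:
    """Hypothetical stand-in: fixed attribute slots, no per-instance __dict__."""

    __slots__ = ("first_token", "last_token", "replicas")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = replicas


def shallow_size(obj):
    """Instance size plus its __dict__ if it has one (shallow, not deep)."""
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size


replicas = [("host-a", 0), ("host-b", 1), ("host-c", 2)]
plain = PlainTablet(-100, 100, replicas)
slotted = SlottedTablet(-100, 100, replicas)

print("plain:  ", shallow_size(plain), "bytes")
print("slotted:", shallow_size(slotted), "bytes")
```

One side effect worth keeping in mind: a slotted instance rejects attributes that are not listed in `__slots__`, so any code that attaches ad-hoc attributes to such objects would raise `AttributeError`.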

Replicas are never mutated after Tablet construction; convert to tuple in __init__ to save 8 bytes per tablet (list overallocates for future appends that never happen) and communicate immutability.

- Before: 328 bytes/tablet (replicas container: 80 bytes as list)
- After: 320 bytes/tablet (replicas container: 72 bytes as tuple)
- Saving: 8 bytes/tablet (2.4%)

Combined with __slots__ (commit 1), total savings so far: 96 bytes/tablet.

Scale impact (3 replicas/tablet):

- 128,000 tablets: saves ~1.0 MB (tuple) + 10.7 MB (slots) = 11.7 MB total
- 256,000 tablets: saves ~2.0 MB (tuple) + 21.5 MB (slots) = 23.5 MB total
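The container difference can be checked directly with `sys.getsizeof`. This is only a sketch with made-up replica values; the sizes it prints vary across CPython versions, so none are hard-coded here.

```python
import sys

# Illustrative replica rows: (host_id, shard_id) pairs with placeholder host names.
replica_rows = [("host-a", 0), ("host-b", 1), ("host-c", 2)]

as_list = list(replica_rows)
as_tuple = tuple(replica_rows)

print("list: ", sys.getsizeof(as_list), "bytes")
print("tuple:", sys.getsizeof(as_tuple), "bytes")

# A tuple also has no in-place mutators, which documents the immutability.
print("tuple has append():", hasattr(as_tuple, "append"))
```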

Build a {host_id: shard_id} dict once at Tablet construction time so that policies.py and pool.py can replace set(map(lambda ...)) and linear scans with O(1) dict operations.

- Add _replica_dict to __slots__
- Build dict from the materialized tuple (not the raw replicas arg) to avoid double-consuming a one-shot iterator
- Update DCAwareRoundRobinPolicy to use tablet._replica_dict keys
- Update HostConnection to use tablet._replica_dict.get() for shard
- Rewrite replica_contains_host_id() to use dict membership
- Add 7 unit tests covering dict construction, lookup, host membership, tuple storage, and the iterator edge case
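The following is a minimal sketch of the idea in this commit, not the driver's code. `TabletSketch` is a hypothetical, simplified class; the point is the order of operations in `__init__` (materialize the tuple first, then build the dict from it) and how the two lookup idioms compare.

```python
import uuid


class TabletSketch:
    """Hypothetical, simplified stand-in for the driver's Tablet class."""

    __slots__ = ("first_token", "last_token", "replicas", "_replica_dict")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        # Materialize first: `replicas` may be a one-shot iterator.
        self.replicas = tuple(replicas)
        # Build from the tuple, not from the (possibly exhausted) argument.
        self._replica_dict = dict(self.replicas)

    def replica_contains_host_id(self, host_id):
        return host_id in self._replica_dict


hosts = [uuid.uuid4() for _ in range(3)]
# Pass a generator on purpose to exercise the iterator edge case.
tablet = TabletSketch(-100, 100, ((h, shard) for shard, h in enumerate(hosts)))

# Older idioms: rebuild a set or scan linearly on every query.
old_hosts = set(map(lambda r: r[0], tablet.replicas))
old_shard = next((s for h, s in tablet.replicas if h == hosts[1]), None)

# Cached-dict idioms: a keys view and a single dict lookup.
new_hosts = tablet._replica_dict.keys()
new_shard = tablet._replica_dict.get(hosts[1])

print(old_hosts == set(new_hosts), old_shard == new_shard)
```

Had `__init__` built the dict from the raw `replicas` argument after calling `tuple(replicas)`, the generator would already be exhausted and the dict would come out empty, which is the edge case the commit message describes.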

Remove the _is_valid_tablet staticmethod indirection and replace the two-step from_row -> _is_valid_tablet -> Tablet() chain with a single truthiness guard and direct construction. Saves ~54 ns/call (12%) by eliminating a staticmethod descriptor lookup, an extra function call, and a redundant 'is not None' check (replicas from CQL deserialization are always a list or None).
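A sketch of the before/after shape of this change, under stated assumptions: the body of `_is_valid_tablet` and the `from_row` signature shown here are guesses for illustration, and `TabletSketch` is a hypothetical class rather than the driver's.

```python
class TabletSketch:
    """Hypothetical stand-in used to contrast the two from_row variants."""

    __slots__ = ("first_token", "last_token", "replicas")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = tuple(replicas)

    @staticmethod
    def _is_valid_tablet(replicas):
        # Assumed shape of the earlier helper: explicit None check plus length check.
        return replicas is not None and len(replicas) != 0

    @staticmethod
    def from_row_old(first_token, last_token, replicas):
        # Two steps: helper lookup and call, then construction.
        if TabletSketch._is_valid_tablet(replicas):
            return TabletSketch(first_token, last_token, replicas)
        return None

    @staticmethod
    def from_row(first_token, last_token, replicas):
        # One truthiness guard covers both None and an empty list.
        if replicas:
            return TabletSketch(first_token, last_token, replicas)
        return None


rows = [None, [], [("host-a", 0)]]
results = [
    (TabletSketch.from_row_old(0, 10, r), TabletSketch.from_row(0, 10, r))
    for r in rows
]
print([(old is None, new is None) for old, new in results])
```

The truthiness guard is only equivalent because the input is known to be a list or None; for an arbitrary iterator, `if replicas:` would be true even when it yields nothing.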

Maintain parallel _first_tokens and _last_tokens dicts alongside _tablets, each mapping (keyspace, table) to a plain list[int]. This lets bisect_left run entirely in C on native ints instead of calling an attrgetter callback on every comparison during binary search.

Follow-up to PR scylladb#757 which identified the opportunity: its own benchmarks showed bisect_left without key= is 2.7-5.7x faster than with key=attrgetter.

Results (best-of-5, Python 3.14), get_tablet_for_key (hit):

| Tablets | Before | After | Saved | Speedup |
|---|---|---|---|---|
| 10 | 293 ns | 216 ns | 78 ns | 1.36x |
| 100 | 351 ns | 233 ns | 118 ns | 1.51x |
| 1,000 | 448 ns | 267 ns | 181 ns | 1.68x |
| 10,000 | 537 ns | 282 ns | 255 ns | 1.90x |

All three dicts are kept in sync by add_tablet, drop_tablets, and drop_tablets_by_host_id. The attrgetter imports are no longer needed and have been removed.
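A simplified sketch of the parallel-list layout follows. It is not the driver's implementation: `TabletsSketch` and `TabletSketch` are hypothetical, tokens are plain ints, each tablet is assumed to cover the range (first_token, last_token], and the overlap handling a real add_tablet would need is left out.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical stand-in for Tablet; assumed to cover (first_token, last_token].
TabletSketch = namedtuple("TabletSketch", "first_token last_token replicas")


class TabletsSketch:
    """Keeps three parallel structures keyed by (keyspace, table)."""

    def __init__(self):
        self._tablets = {}       # (keyspace, table) -> list of tablets
        self._first_tokens = {}  # (keyspace, table) -> list[int]
        self._last_tokens = {}   # (keyspace, table) -> list[int]

    def add_tablet(self, keyspace, table, tablet):
        """Insert in last_token order; overlap handling is omitted in this sketch."""
        key = (keyspace, table)
        tablets = self._tablets.setdefault(key, [])
        firsts = self._first_tokens.setdefault(key, [])
        lasts = self._last_tokens.setdefault(key, [])
        pos = bisect_left(lasts, tablet.last_token)
        # The same index is applied to all three lists so they stay in sync.
        tablets.insert(pos, tablet)
        firsts.insert(pos, tablet.first_token)
        lasts.insert(pos, tablet.last_token)

    def get_tablet_for_key(self, keyspace, table, token):
        key = (keyspace, table)
        lasts = self._last_tokens.get(key)
        if not lasts:
            return None
        # Plain ints in a plain list: no key= callback during the search.
        pos = bisect_left(lasts, token)
        if pos < len(lasts) and token > self._first_tokens[key][pos]:
            return self._tablets[key][pos]
        return None


index = TabletsSketch()
for first, last in [(100, 300), (-300, -100), (-100, 100)]:
    index.add_tablet("ks", "tbl", TabletSketch(first, last, ()))

hit = index.get_tablet_for_key("ks", "tbl", 0)
miss = index.get_tablet_for_key("ks", "tbl", 500)
print(hit, miss)
```

The cost of this layout is that every mutation has to touch all three lists with the same index, which is why the commit message calls out the methods that keep them in sync.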

Replace the per-tablet reversed pop() loop (O(k*n) for each of three parallel lists) with a single-pass index filter that rebuilds the lists once. This avoids repeated list element shifting and scales better when many tablets are dropped at once.

Benchmark (3 replicas/tablet, ~33% dropped):

| Tablets | Old (triple-pop) | New (batch-filter) | Speedup |
|---|---|---|---|
| 100 | 123 us | 128 us | ~1.0x |
| 1,000 | 1,375 us | 1,113 us | 1.24x |
| 10,000 | 25,429 us | 13,079 us | 1.94x |

Add 3 unit tests for drop_tablets_by_host_id covering matching, None host_id, and nonexistent host_id.
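The two strategies can be contrasted in a few lines. This is a sketch under assumptions: the function names, the `TabletSketch` stand-in, and the choice to rebuild the lists in place via slice assignment are all illustrative and may differ from what the driver does.

```python
from collections import namedtuple

# Hypothetical stand-in for Tablet; only the cached replica dict matters here.
TabletSketch = namedtuple("TabletSketch", "first_token last_token replica_dict")


def drop_by_host_pop(tablets, firsts, lasts, host_id):
    """Older approach: pop each match from all three lists, back to front."""
    for i in reversed(range(len(tablets))):
        if host_id in tablets[i].replica_dict:
            # Each pop shifts every later element, once per list.
            tablets.pop(i)
            firsts.pop(i)
            lasts.pop(i)


def drop_by_host_filter(tablets, firsts, lasts, host_id):
    """Batch approach: collect surviving indices once, then rebuild each list."""
    keep = [i for i, t in enumerate(tablets) if host_id not in t.replica_dict]
    if len(keep) == len(tablets):
        return  # nothing matched, leave the lists untouched
    tablets[:] = [tablets[i] for i in keep]
    firsts[:] = [firsts[i] for i in keep]
    lasts[:] = [lasts[i] for i in keep]


def make_lists(n):
    """Build n contiguous tablets; every third one has a replica on 'host-x'."""
    tablets = [
        TabletSketch(i * 10, i * 10 + 10, {"host-x" if i % 3 == 0 else "host-y": 0})
        for i in range(n)
    ]
    return tablets, [t.first_token for t in tablets], [t.last_token for t in tablets]


old = make_lists(9)
new = make_lists(9)
drop_by_host_pop(*old, "host-x")
drop_by_host_filter(*new, "host-x")
print(old == new, len(new[0]))
```

Both functions leave the three lists with identical contents; the difference is only in how much element shifting happens along the way, which is consistent with the benchmark showing no gain at 100 tablets and a growing gain at larger sizes.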

mykaul added 3 commits April 9, 2026 13:43

mykaul force-pushed the perf/tablets-memory-and-lookup branch from 2640c18 to 1d25663 Compare April 9, 2026 11:18

mykaul marked this pull request as draft April 9, 2026 11:35

mykaul added 2 commits April 9, 2026 14:52

mykaul wants to merge 6 commits into scylladb:master from mykaul:perf/tablets-memory-and-lookup
