perf: optimize Tablet memory layout and per-query lookup speed#812

Draft
mykaul wants to merge 6 commits into scylladb:master from mykaul:perf/tablets-memory-and-lookup

@mykaul mykaul commented Apr 9, 2026

Summary

Five incremental optimizations to the tablets hot path, each in a separate commit:

  1. __slots__ on Tablet — eliminates per-instance __dict__ allocation
  2. tuple replicas — replicas are immutable after creation; use tuple instead of list
  3. Cached _replica_dict — build {host_id: shard_id} dict once at construction, use it in both per-query hot paths (policies.py and pool.py). Also fixes a latent iterator-consumption bug where passing a generator to Tablet() would silently produce an empty _replica_dict.
  4. Streamline from_row — inline the _is_valid_tablet check to eliminate a staticmethod descriptor lookup, an extra function call, and a redundant is not None guard
  5. Parallel token index lists — maintain _first_tokens and _last_tokens dicts alongside _tablets, each mapping (keyspace, table) to a plain list[int], so bisect_left runs entirely in C on native ints instead of calling an attrgetter callback per comparison. Follow-up to PR #757 ("perf: use stdlib bisect and attrgetter in tablets.py (100's of ns, 1.5-5.6x speedup)"), which identified this opportunity in its own benchmarks.

Benchmarks (Python 3.14, best-of-5 rounds)

Memory per Tablet (deep, pympler.asizeof, 3 replicas)

| State | Deep size | Change vs baseline |
|---|---|---|
| Baseline (no `__slots__`, list, no dict) | 1856 B | |
| After commit 1 (`__slots__`) | 1400 B | -456 B |
| After commit 2 (tuple) | 1384 B | -472 B |
| After commit 3 (`_replica_dict`) | 1616 B | -240 B |

Commits 1+2 save 472 bytes/tablet. Commit 3 spends 232 bytes back on the _replica_dict cache. Net: 240 bytes saved per tablet (13%). Commit 5 adds 16 bytes/tablet (two ints in parallel lists) — negligible.
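The net figure can be re-derived from the deep sizes in the table. A quick check, using illustrative variable names that are not part of the PR:

```python
# Deep sizes in bytes per Tablet, copied from the table above.
baseline = 1856
after_slots = 1400
after_tuple = 1384
after_replica_dict = 1616

saved_by_commits_1_and_2 = baseline - after_tuple       # bytes saved by slots + tuple
spent_by_commit_3 = after_replica_dict - after_tuple    # bytes spent on the dict cache
net_saved = baseline - after_replica_dict               # bytes saved overall
net_pct = round(100 * net_saved / baseline, 1)          # share of the baseline

print(saved_by_commits_1_and_2, spent_by_commit_3, net_saved, net_pct)
# prints: 472 232 240 12.9
```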

Shallow breakdown (sys.getsizeof)

| Component | Before | After | Change |
|---|---|---|---|
| instance shell | 48 B | 64 B | +16 B (slots have fixed overhead) |
| `__dict__` | 296 B | 0 B | -296 B |
| replicas container | 88 B (list) | 72 B (tuple) | -16 B |
| `_replica_dict` (3 entries) | | 224 B | +224 B |

get_tablet_for_key (hit — the primary per-query hot path)

| Tablets | Before | After | Saved | Speedup |
|---|---|---|---|---|
| 10 | 293 ns | 216 ns | 78 ns | 1.36x |
| 100 | 351 ns | 233 ns | 118 ns | 1.51x |
| 1,000 | 448 ns | 267 ns | 181 ns | 1.68x |
| 10,000 | 537 ns | 282 ns | 255 ns | 1.90x |

Miss path (N=1000): 458 ns -> 229 ns (2.0x).

Other per-query hot paths

| Path | Before | After | Speedup |
|---|---|---|---|
| policies.py: `set(map(lambda r: r[0], tablet.replicas))` | 372 ns | 18 ns (`tablet._replica_dict`) | 20.7x |
| pool.py: linear scan for shard_id | 199 ns | 73 ns (`dict.get`) | 2.7x |
| replica_contains_host_id | O(n) linear | O(1) dict `in` | |
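As a rough illustration of the call-site changes in this table, the sketch below puts the old expressions next to their dict-based replacements. The sample replicas and the helper name `shard_for_host_scan` are invented for the example; the actual code in policies.py and pool.py is not shown in this description.

```python
# Sample replicas as (host_id, shard_id) pairs; the values are invented.
replicas = (("host-a", 0), ("host-b", 3), ("host-c", 1))
# Stand-in for what the PR caches on each Tablet as _replica_dict.
replica_dict = dict(replicas)

# policies.py, before: build a fresh set of host ids on every query.
hosts_before = set(map(lambda r: r[0], replicas))
# After: reuse the cached dict; its keys view supports membership tests.
hosts_after = replica_dict.keys()

def shard_for_host_scan(replicas, host_id):
    """Old pool.py-style lookup (assumed shape): linear scan over the pairs."""
    for replica_host, shard_id in replicas:
        if replica_host == host_id:
            return shard_id
    return None

shard_before = shard_for_host_scan(replicas, "host-b")
# After: a single hash lookup; returns None when the host is not a replica.
shard_after = replica_dict.get("host-b")
```

Both forms give the same answers; the difference is that the dict is built once per tablet rather than once per query.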

Construction (Tablet.from_row)

| State | Time | Change |
|---|---|---|
| Original (master) | 143 ns | |
| After commit 3 (`_replica_dict` + old from_row) | 465 ns | +322 ns |
| After commit 4 (streamlined from_row) | 410 ns | +267 ns |

Commit 4 recovers ~54 ns (12%) of the construction regression. The remaining +267 ns vs master is the irreducible cost of building tuple() + dict() at construction time (~250 ns), which pays for itself on every query.

Tests

All 223 unit tests pass (tablets, policies, pool, metadata, cluster, response_future). 7 new tests added for _replica_dict behavior, including an iterator edge-case regression test.

mykaul added 3 commits April 9, 2026 13:43
Add __slots__ to the Tablet class, removing the per-instance __dict__
allocation. Tablets are created frequently (one per token range per table)
and are long-lived, so the cumulative memory savings are significant.

Before: 416 bytes/tablet (48 instance + 96 __dict__ + 80 replicas + 192 tuples)
After:  328 bytes/tablet (56 instance +  0 __dict__ + 80 replicas + 192 tuples)
Saving: 88 bytes/tablet (21%)

Scale impact (3 replicas/tablet):
  12,800 tablets (100 tables x 128): saves 1.1 MB
 128,000 tablets (1000 tables x 128): saves 10.7 MB
 256,000 tablets (1000 tables x 256): saves 21.5 MB

Tablet.from_row construction also improves:
  Before: 186 ns/call
  After:  147 ns/call (1.27x faster, -21%)
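A minimal sketch of the effect described in this commit, assuming CPython. The two classes are hypothetical stand-ins with three fields each, not the driver's real Tablet; exact byte counts vary by Python version, so none are asserted here.

```python
import sys

class TabletWithDict:
    """Hypothetical stand-in: ordinary class, attributes live in __dict__."""
    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = replicas

class TabletWithSlots:
    """Hypothetical stand-in: fixed attribute slots, no per-instance __dict__."""
    __slots__ = ("first_token", "last_token", "replicas")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = replicas

replicas = [("host-a", 0), ("host-b", 1), ("host-c", 2)]
old = TabletWithDict(-100, 100, replicas)
new = TabletWithSlots(-100, 100, replicas)

# Shallow size of the instance plus its attribute dict, versus slots only.
old_size = sys.getsizeof(old) + sys.getsizeof(old.__dict__)
new_size = sys.getsizeof(new)

print(hasattr(old, "__dict__"), hasattr(new, "__dict__"))
print(old_size, new_size)
```

One consequence of `__slots__` worth keeping in mind: assigning an attribute that is not listed in the slots raises AttributeError, which is why later commits have to add `_replica_dict` to `__slots__` explicitly.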

Replicas are never mutated after Tablet construction; convert to tuple
in __init__ to save 8 bytes per tablet (list overallocates for future
appends that never happen) and communicate immutability.

Before: 328 bytes/tablet (replicas container: 80 bytes as list)
After:  320 bytes/tablet (replicas container: 72 bytes as tuple)
Saving: 8 bytes/tablet (2.4%)

Combined with __slots__ (commit 1), total savings so far: 96 bytes/tablet.

Scale impact (3 replicas/tablet):
  128,000 tablets: saves ~1.0 MB (tuple) + 10.7 MB (slots) = 11.7 MB total
  256,000 tablets: saves ~2.0 MB (tuple) + 21.5 MB (slots) = 23.5 MB total
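The container comparison and the scale figures can be reproduced with a few lines, assuming CPython. The helper name `savings_mib` is illustrative; the scale numbers in these commit messages match when "MB" is read as 2**20 bytes.

```python
import sys

replicas_list = [("host-a", 0), ("host-b", 1), ("host-c", 2)]
replicas_tuple = tuple(replicas_list)

# Shallow container sizes only; exact values depend on the CPython version.
list_size = sys.getsizeof(replicas_list)
tuple_size = sys.getsizeof(replicas_tuple)
print(list_size, tuple_size)

def savings_mib(bytes_per_tablet, tablets):
    """Total savings in MiB (2**20 bytes) for a given tablet count."""
    return bytes_per_tablet * tablets / 2**20

# 8 bytes/tablet from the tuple change, 88 bytes/tablet from __slots__.
for tablets in (128_000, 256_000):
    print(tablets, round(savings_mib(8, tablets), 1), round(savings_mib(88, tablets), 1))
# prints: 128000 1.0 10.7
# prints: 256000 2.0 21.5
```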

Build a {host_id: shard_id} dict once at Tablet construction time so
that policies.py and pool.py can replace set(map(lambda ...)) and
linear scans with O(1) dict operations.

- Add _replica_dict to __slots__
- Build dict from the materialized tuple (not the raw replicas arg)
  to avoid double-consuming a one-shot iterator
- Update DCAwareRoundRobinPolicy to use tablet._replica_dict keys
- Update HostConnection to use tablet._replica_dict.get() for shard
- Rewrite replica_contains_host_id() to use dict membership
- Add 7 unit tests covering dict construction, lookup, host membership,
  tuple storage, and the iterator edge case
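The constructor change and the iterator edge case can be sketched as follows. This is a simplified Tablet with only the fields named in this description; the real class may carry more state.

```python
class Tablet:
    """Simplified sketch of the constructor behaviour described above."""
    __slots__ = ("first_token", "last_token", "replicas", "_replica_dict")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        # Materialize once, so a one-shot iterator is consumed exactly one time.
        self.replicas = tuple(replicas)
        # Build the cache from the tuple, not from the (now exhausted) argument.
        self._replica_dict = dict(self.replicas)

    def replica_contains_host_id(self, host_id):
        # O(1) dict membership instead of a linear scan over replicas.
        return host_id in self._replica_dict

# A generator is the edge case: building the dict from `replicas` directly,
# after tuple(replicas) had already drained it, would yield an empty dict.
gen = ((host, shard) for host, shard in [("host-a", 0), ("host-b", 3)])
tablet = Tablet(0, 100, gen)

print(tablet.replicas)
print(tablet._replica_dict)
```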

@mykaul mykaul force-pushed the perf/tablets-memory-and-lookup branch from 2640c18 to 1d25663 April 9, 2026 11:18

Remove the _is_valid_tablet staticmethod indirection and replace the
two-step from_row -> _is_valid_tablet -> Tablet() chain with a single
truthiness guard and direct construction.  Saves ~54 ns/call (12%)
by eliminating a staticmethod descriptor lookup, an extra function
call, and a redundant 'is not None' check (replicas from CQL
deserialization is always a list or None).
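A before/after sketch of this change. The body of `_is_valid_tablet` and the name `from_row_old` are assumptions made for the comparison; only the overall shape (helper call versus a single truthiness guard) comes from the commit message.

```python
class Tablet:
    """Simplified stand-in for the driver's Tablet class."""
    __slots__ = ("first_token", "last_token", "replicas")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = tuple(replicas)

    @staticmethod
    def _is_valid_tablet(replicas):
        # Assumed shape of the removed helper.
        return replicas is not None and len(replicas) != 0

    @staticmethod
    def from_row_old(first_token, last_token, replicas):
        # Two-step chain: staticmethod lookup plus an extra call per row.
        if Tablet._is_valid_tablet(replicas):
            return Tablet(first_token, last_token, replicas)
        return None

    @staticmethod
    def from_row(first_token, last_token, replicas):
        # Single truthiness guard: None and an empty list are both falsy.
        if not replicas:
            return None
        return Tablet(first_token, last_token, replicas)

row = (0, 100, [("host-a", 0)])
print(Tablet.from_row(*row).replicas)
print(Tablet.from_row(0, 100, None), Tablet.from_row(0, 100, []))
```

The guard is only equivalent to the old check because replicas is always a list or None; a non-empty container and a truthy value coincide for those two types.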

@mykaul mykaul marked this pull request as draft April 9, 2026 11:35

mykaul added 2 commits April 9, 2026 14:52
Maintain parallel _first_tokens and _last_tokens dicts alongside
_tablets, each mapping (keyspace, table) to a plain list[int].  This
lets bisect_left run entirely in C on native ints instead of calling
an attrgetter callback on every comparison during binary search.

Follow-up to PR scylladb#757 which identified the opportunity: its own
benchmarks showed bisect_left without key= is 2.7-5.7x faster than
with key=attrgetter.

Results (best-of-5, Python 3.14):

  get_tablet_for_key (hit):
  Tablets    Before    After    Saved   Speedup
       10    293ns    216ns     78ns     1.36x
      100    351ns    233ns    118ns     1.51x
    1,000    448ns    267ns    181ns     1.68x
   10,000    537ns    282ns    255ns     1.90x

All three dicts are kept in sync by add_tablet, drop_tablets, and
drop_tablets_by_host_id.  The attrgetter imports are no longer needed
and have been removed.
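A minimal sketch of the parallel-list lookup, under stated assumptions: tokens are plain ints, each tablet owns the range (first_token, last_token], and `add_tablet` here is a simplified version that assumes non-overlapping ranges. The namedtuple is a stand-in for the real Tablet class.

```python
from bisect import bisect_left
from collections import namedtuple

# Stand-in for the driver's Tablet; only the fields used here.
Tablet = namedtuple("Tablet", "first_token last_token replicas")

class Tablets:
    def __init__(self):
        # All three dicts map (keyspace, table) to lists kept in the same order.
        self._tablets = {}
        self._first_tokens = {}
        self._last_tokens = {}

    def add_tablet(self, keyspace, table, tablet):
        # Simplified: assumes the new range does not overlap existing ones.
        key = (keyspace, table)
        tablets = self._tablets.setdefault(key, [])
        firsts = self._first_tokens.setdefault(key, [])
        lasts = self._last_tokens.setdefault(key, [])
        idx = bisect_left(lasts, tablet.last_token)
        tablets.insert(idx, tablet)
        firsts.insert(idx, tablet.first_token)
        lasts.insert(idx, tablet.last_token)

    def get_tablet_for_key(self, keyspace, table, token):
        key = (keyspace, table)
        lasts = self._last_tokens.get(key)
        if not lasts:
            return None
        # Binary search over native ints, no key= callback per comparison.
        idx = bisect_left(lasts, token)
        if idx < len(lasts) and token > self._first_tokens[key][idx]:
            return self._tablets[key][idx]
        return None

ts = Tablets()
ts.add_tablet("ks", "t", Tablet(100, 200, ()))
ts.add_tablet("ks", "t", Tablet(0, 100, ()))
ts.add_tablet("ks", "t", Tablet(300, 400, ()))
print(ts.get_tablet_for_key("ks", "t", 150))
print(ts.get_tablet_for_key("ks", "t", 250))
```

The cost of this layout is that every mutation has to touch all three lists at the same index, which is the invariant the commit message calls out.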

Replace the per-tablet reversed pop() loop (O(k*n) for each of three
parallel lists) with a single-pass index filter that rebuilds the
lists once.  This avoids repeated list element shifting and scales
better when many tablets are dropped at once.

Benchmark (3 replicas/tablet, ~33% dropped):
  Tablets   Old (triple-pop)   New (batch-filter)   Speedup
     100          123 us             128 us           ~1.0x
   1,000        1,375 us           1,113 us           1.24x
  10,000       25,429 us          13,079 us           1.94x

Add 3 unit tests for drop_tablets_by_host_id covering matching,
None host_id, and nonexistent host_id.
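The single-pass filter can be sketched as a free function over one table's three parallel lists. The function name `drop_by_host_id` and the minimal Tablet class are hypothetical; the real method operates on the dicts held by the Tablets container.

```python
class Tablet:
    """Minimal stand-in carrying only what the filter needs."""
    __slots__ = ("first_token", "last_token", "_replica_dict")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self._replica_dict = dict(replicas)

def drop_by_host_id(tablets, first_tokens, last_tokens, host_id):
    """Return the parallel lists without tablets that have a replica on host_id."""
    if host_id is None:
        return tablets, first_tokens, last_tokens
    # One pass to find survivors, instead of one pop() per dropped tablet.
    keep = [i for i, t in enumerate(tablets) if host_id not in t._replica_dict]
    if len(keep) == len(tablets):
        return tablets, first_tokens, last_tokens  # nothing matched
    # Rebuild each list once, so no repeated element shifting.
    return (
        [tablets[i] for i in keep],
        [first_tokens[i] for i in keep],
        [last_tokens[i] for i in keep],
    )

tablets = [
    Tablet(0, 100, [("host-a", 0), ("host-b", 1)]),
    Tablet(100, 200, [("host-b", 0), ("host-c", 1)]),
    Tablet(200, 300, [("host-a", 2), ("host-c", 0)]),
]
firsts = [t.first_token for t in tablets]
lasts = [t.last_token for t in tablets]

kept, kept_firsts, kept_lasts = drop_by_host_id(tablets, firsts, lasts, "host-a")
print(kept_firsts, kept_lasts)
# prints: [100] [200]
```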