perf: optimize Tablet memory layout and per-query lookup speed #812
Draft
mykaul wants to merge 6 commits into scylladb:master
Add __slots__ to the Tablet class, removing the per-instance __dict__ allocation. Tablets are created frequently (one per token range per table) and are long-lived, so the cumulative memory savings are significant.

Before: 416 bytes/tablet (48 instance + 96 __dict__ + 80 replicas + 192 tuples)
After: 328 bytes/tablet (56 instance + 0 __dict__ + 80 replicas + 192 tuples)
Saving: 88 bytes/tablet (21%)

Scale impact (3 replicas/tablet):
12,800 tablets (100 tables x 128): saves 1.1 MB
128,000 tablets (1000 tables x 128): saves 10.7 MB
256,000 tablets (1000 tables x 256): saves 21.5 MB

Tablet.from_row construction also improves:
Before: 186 ns/call
After: 147 ns/call (1.27x faster, -21%)
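The effect of __slots__ can be checked in isolation. The sketch below uses two hypothetical stand-in classes (TabletWithDict and TabletWithSlots, not the driver's Tablet) and a hypothetical helper, shallow_size, that adds the size of __dict__ when an instance has one. Exact byte counts depend on the Python version, so the sketch only compares the two layouts.

```python
import sys


class TabletWithDict:
    """Hypothetical stand-in: a regular class with a per-instance __dict__."""

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = replicas


class TabletWithSlots:
    """Hypothetical stand-in: same attributes, stored in fixed slots."""

    __slots__ = ("first_token", "last_token", "replicas")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = replicas


def shallow_size(obj):
    """Size of the instance itself plus its __dict__, if it has one."""
    size = sys.getsizeof(obj)
    instance_dict = getattr(obj, "__dict__", None)
    if instance_dict is not None:
        size += sys.getsizeof(instance_dict)
    return size


replicas = [("host-a", 0), ("host-b", 1), ("host-c", 2)]
with_dict = TabletWithDict(0, 100, replicas)
with_slots = TabletWithSlots(0, 100, replicas)

# Both objects share the same replicas list, so only the instance layout differs.
print("with __dict__:", shallow_size(with_dict), "bytes")
print("with __slots__:", shallow_size(with_slots), "bytes")
```

The trade-off is that a slotted instance rejects attributes not named in __slots__, which is why later commits that add a cached attribute must also extend __slots__.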
Replicas are never mutated after Tablet construction; convert to tuple in __init__ to save 8 bytes per tablet (list overallocates for future appends that never happen) and communicate immutability.

Before: 328 bytes/tablet (replicas container: 80 bytes as list)
After: 320 bytes/tablet (replicas container: 72 bytes as tuple)
Saving: 8 bytes/tablet (2.4%)

Combined with __slots__ (commit 1), total savings so far: 96 bytes/tablet.

Scale impact (3 replicas/tablet):
128,000 tablets: saves ~1.0 MB (tuple) + 10.7 MB (slots) = 11.7 MB total
256,000 tablets: saves ~2.0 MB (tuple) + 21.5 MB (slots) = 23.5 MB total
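A quick way to see the container difference is to compare the two with sys.getsizeof. This is a minimal sketch with made-up replica values; materialize_replicas is a hypothetical helper name, and the byte counts it prints will differ between Python versions.

```python
import sys


def materialize_replicas(replicas):
    """Freeze any iterable of (host_id, shard_id) pairs into a tuple."""
    return tuple(replicas)


as_list = [("host-a", 0), ("host-b", 1), ("host-c", 2)]
as_tuple = materialize_replicas(as_list)

# Only the container is measured here, not the (host_id, shard_id) pairs inside.
print("list container:", sys.getsizeof(as_list), "bytes")
print("tuple container:", sys.getsizeof(as_tuple), "bytes")

# A tuple has no append(), so accidental mutation fails loudly.
try:
    as_tuple.append(("host-d", 3))
except AttributeError as exc:
    print("immutable:", exc)
```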
Build a {host_id: shard_id} dict once at Tablet construction time so
that policies.py and pool.py can replace set(map(lambda ...)) and
linear scans with O(1) dict operations.
- Add _replica_dict to __slots__
- Build dict from the materialized tuple (not the raw replicas arg)
to avoid double-consuming a one-shot iterator
- Update DCAwareRoundRobinPolicy to use tablet._replica_dict keys
- Update HostConnection to use tablet._replica_dict.get() for shard
- Rewrite replica_contains_host_id() to use dict membership
- Add 7 unit tests covering dict construction, lookup, host membership,
tuple storage, and the iterator edge case
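The iterator edge case is easiest to see in a small model. The class below is a simplified, hypothetical reconstruction based only on the description above (the constructor signature and default arguments are assumptions, not the driver's actual code). It materializes the replicas first and then builds the dict from the stored tuple, so a one-shot generator is consumed exactly once.

```python
class Tablet:
    """Simplified sketch; the real class in tablets.py may differ."""

    __slots__ = ("first_token", "last_token", "replicas", "_replica_dict")

    def __init__(self, first_token=0, last_token=0, replicas=None):
        self.first_token = first_token
        self.last_token = last_token
        # Materialize first: a generator can only be iterated once.
        self.replicas = tuple(replicas) if replicas is not None else ()
        # Build the lookup from the stored tuple, not from the raw argument.
        self._replica_dict = {host_id: shard_id for host_id, shard_id in self.replicas}

    def replica_contains_host_id(self, host_id):
        # O(1) dict membership instead of a linear scan over replicas.
        return host_id in self._replica_dict


# Passing a generator still yields a fully populated dict.
one_shot = ((f"host-{i}", i) for i in range(3))
tablet = Tablet(0, 100, one_shot)

print(tablet.replicas)
print(tablet._replica_dict.get("host-1"))
print(tablet.replica_contains_host_id("host-9"))
```

Had the dict been built from the raw replicas argument after tuple(replicas) had already run, the generator would be exhausted and the dict would come out empty.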
2640c18 to
1d25663
Remove the _is_valid_tablet staticmethod indirection and replace the two-step from_row -> _is_valid_tablet -> Tablet() chain with a single truthiness guard and direct construction. This saves ~54 ns/call (12%) by eliminating a staticmethod descriptor lookup, an extra function call, and a redundant 'is not None' check (replicas from CQL deserialization is always a list or None).
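The shape of this change can be sketched as a before/after pair. Both functions below are hypothetical reconstructions from the commit message, written as module-level functions around a minimal stand-in Tablet class; the real from_row signature and the body of _is_valid_tablet are assumptions.

```python
class Tablet:
    """Minimal stand-in so the sketch is self-contained."""

    __slots__ = ("first_token", "last_token", "replicas")

    def __init__(self, first_token, last_token, replicas):
        self.first_token = first_token
        self.last_token = last_token
        self.replicas = tuple(replicas)


def _is_valid_tablet(replicas):
    # Assumed form of the old helper: explicit None check plus emptiness check.
    return replicas is not None and len(replicas) != 0


def from_row_old(first_token, last_token, replicas):
    # Two steps: call the validator, then construct.
    if _is_valid_tablet(replicas):
        return Tablet(first_token, last_token, replicas)
    return None


def from_row_new(first_token, last_token, replicas):
    # One truthiness guard covers both None and an empty list.
    if not replicas:
        return None
    return Tablet(first_token, last_token, replicas)


rows = [
    (0, 100, [("host-a", 0)]),
    (100, 200, []),
    (200, 300, None),
]
for row in rows:
    old, new = from_row_old(*row), from_row_new(*row)
    print(row[2], "->", old is not None, new is not None)
```

The two versions agree only because replicas is always a list or None, as the commit message states; a truthiness guard would behave differently for an input whose truth value does not reflect its length.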
Maintain parallel _first_tokens and _last_tokens dicts alongside _tablets, each mapping (keyspace, table) to a plain list[int]. This lets bisect_left run entirely in C on native ints instead of calling an attrgetter callback on every comparison during binary search.

Follow-up to PR scylladb#757, which identified the opportunity: its own benchmarks showed bisect_left without key= is 2.7-5.7x faster than with key=attrgetter.

Results (best-of-5, Python 3.14):

get_tablet_for_key (hit):
Tablets   Before   After   Saved   Speedup
10        293ns    216ns   78ns    1.36x
100       351ns    233ns   118ns   1.51x
1,000     448ns    267ns   181ns   1.68x
10,000    537ns    282ns   255ns   1.90x

All three dicts are kept in sync by add_tablet, drop_tablets, and drop_tablets_by_host_id. The attrgetter imports are no longer needed and have been removed.
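The parallel-list idea can be illustrated with a small index. This is a sketch under stated assumptions, not the driver's code: TabletIndex is a hypothetical class, tablets are plain (first_token, last_token, replicas) tuples, ranges are assumed non-overlapping, and a token is taken to belong to a tablet when first_token < token <= last_token.

```python
from bisect import bisect_left


class TabletIndex:
    """Hypothetical index keeping three parallel lists per (keyspace, table)."""

    def __init__(self):
        self._tablets = {}
        self._first_tokens = {}
        self._last_tokens = {}

    def add_tablet(self, keyspace, table, tablet):
        key = (keyspace, table)
        first_token, last_token, _replicas = tablet
        tablets = self._tablets.setdefault(key, [])
        firsts = self._first_tokens.setdefault(key, [])
        lasts = self._last_tokens.setdefault(key, [])
        # Keep all three lists sorted by last_token and index-aligned.
        i = bisect_left(lasts, last_token)
        tablets.insert(i, tablet)
        firsts.insert(i, first_token)
        lasts.insert(i, last_token)

    def get_tablet_for_key(self, keyspace, table, token):
        key = (keyspace, table)
        lasts = self._last_tokens.get(key)
        if not lasts:
            return None
        # Plain int comparisons: no key= callback during the binary search.
        i = bisect_left(lasts, token)
        if i < len(lasts) and self._first_tokens[key][i] < token:
            return self._tablets[key][i]
        return None


index = TabletIndex()
index.add_tablet("ks", "t", (0, 100, ()))
index.add_tablet("ks", "t", (100, 200, ()))
index.add_tablet("ks", "t", (-100, 0, ()))

print(index.get_tablet_for_key("ks", "t", 50))
print(index.get_tablet_for_key("ks", "t", 201))
```

The cost of this layout is bookkeeping: every mutation has to touch all three lists at the same index, which is why the commit message calls out the methods that keep them in sync.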
Replace the per-tablet reversed pop() loop (O(k*n) for each of three
parallel lists) with a single-pass index filter that rebuilds the
lists once. This avoids repeated list element shifting and scales
better when many tablets are dropped at once.
Benchmark (3 replicas/tablet, ~33% dropped):

Tablets   Old (triple-pop)   New (batch-filter)   Speedup
100       123 us             128 us               ~1.0x
1,000     1,375 us           1,113 us             1.24x
10,000    25,429 us          13,079 us            1.94x
Add 3 unit tests for drop_tablets_by_host_id covering matching,
None host_id, and nonexistent host_id.
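The single-pass filter described above might look like the following. It is a hedged sketch: drop_by_host_id is a hypothetical free function operating on three parallel lists for one table, and each tablet is modeled as a dict with a "replica_dict" entry, which is not how the driver represents tablets.

```python
def drop_by_host_id(tablets, first_tokens, last_tokens, host_id):
    """Remove, in place, every tablet with a replica on host_id.

    Returns the number of tablets dropped.
    """
    if host_id is None:
        return 0
    # One pass to collect the indices that survive.
    keep = [
        i for i, tablet in enumerate(tablets)
        if host_id not in tablet["replica_dict"]
    ]
    dropped = len(tablets) - len(keep)
    if dropped:
        # Rebuild each list once instead of calling pop() per dropped tablet,
        # which would shift the tail of the list every time.
        tablets[:] = [tablets[i] for i in keep]
        first_tokens[:] = [first_tokens[i] for i in keep]
        last_tokens[:] = [last_tokens[i] for i in keep]
    return dropped


tablets = [
    {"replica_dict": {"host-a": 0, "host-b": 1}},
    {"replica_dict": {"host-b": 0, "host-c": 1}},
    {"replica_dict": {"host-c": 0, "host-d": 1}},
]
first_tokens = [0, 100, 200]
last_tokens = [100, 200, 300]

dropped = drop_by_host_id(tablets, first_tokens, last_tokens, "host-b")
print(dropped, first_tokens, last_tokens)
```

Slice assignment keeps the same list objects alive, so any other reference to these lists sees the filtered contents; whether the real code needs that property is not stated in the commit message.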
Summary
Five incremental optimizations to the tablets hot path, each in a separate commit:
1. __slots__ on Tablet: eliminates per-instance __dict__ allocation
2. tuple replicas: replicas are immutable after creation; use tuple instead of list
3. _replica_dict: build {host_id: shard_id} dict once at construction, use it in both per-query hot paths (policies.py and pool.py). Also fixes a latent iterator-consumption bug where passing a generator to Tablet() would silently produce an empty _replica_dict.
4. from_row: inline the _is_valid_tablet check to eliminate a staticmethod descriptor lookup, an extra function call, and a redundant is not None guard
5. _first_tokens and _last_tokens as plain list[int] dicts alongside _tablets, so bisect_left runs entirely in C on native ints instead of calling an attrgetter callback per comparison. Follow-up to PR #757 (perf: use stdlib bisect and attrgetter in tablets.py (100's of ns, 1.5-5.6x speedup)), which identified this opportunity in its own benchmarks.

Benchmarks (Python 3.14, best-of-5 rounds)
Memory per Tablet (deep, pympler.asizeof, 3 replicas)
[Table: one row per configuration; surviving row labels are "__slots__, list, no dict", "__slots__", "tuple", and "_replica_dict". Values not recoverable.]

Commits 1+2 save 472 bytes/tablet. Commit 3 spends 232 bytes back on the _replica_dict cache. Net: 240 bytes saved per tablet (13%). Commit 5 adds 16 bytes/tablet (two ints in parallel lists), which is negligible.

Shallow breakdown (sys.getsizeof)
[Table: surviving row labels are "__dict__" and "_replica_dict (3 entries)". Values not recoverable.]

get_tablet_for_key (hit: the primary per-query hot path)
Miss path (N=1000): 458 ns -> 229 ns (2.0x).
Other per-query hot paths
[Table: policies.py: set(map(lambda r: r[0], tablet.replicas)) replaced by tablet._replica_dict; pool.py: linear scan for shard_id replaced by dict.get; replica_contains_host_id replaced by in. Timings not recoverable.]

Construction (Tablet.from_row)
[Table: surviving row labels are "_replica_dict + old from_row" and "from_row". Timings not recoverable.]

Commit 4 recovers ~54 ns (12%) of the construction regression. The remaining +267 ns vs master is the irreducible cost of building tuple() + dict() at construction time (~250 ns), which pays for itself on every query.

Tests
All 223 unit tests pass (tablets, policies, pool, metadata, cluster, response_future). 7 new tests added for _replica_dict behavior, including an iterator edge-case regression test.