Skip to content

perf: result path memory allocation optimizations (mostly memory savings, 10's of ns improvements)#805

Draft
mykaul wants to merge 9 commits intoscylladb:masterfrom
mykaul:perf/result-path-cleanup
Draft

perf: result path memory allocation optimizations (mostly memory savings, 10's of ns improvements)#805
mykaul wants to merge 9 commits intoscylladb:masterfrom
mykaul:perf/result-path-cleanup

Conversation

@mykaul
Copy link
Copy Markdown

@mykaul mykaul commented Apr 7, 2026

Summary

Optimize the result message processing hot path (_set_result, ResultMessage, ResultSet) for memory and speed. Each commit is an independent, focused optimization.

Commits

# Commit Description
1 266e05efe Remove dead and duplicate class attributes in ResultMessage
2 d64c0d038 Skip redundant None attribute assignments in decode_message
3 8b07394da Add __slots__ to ResultSet to eliminate per-instance __dict__
4 b675a9d21 Avoid throwaway ResultSet creation in fetch_next_page
5 7157b11f2 Add __slots__ to _MessageType, ResultMessage, and FastResultMessage
6 2818aefa1 Streamline tablet payload parsing in _set_result
7 0a4609408 Check isinstance(ResultMessage) first for direct attribute access

Details

Commit 1: Remove dead class attributes

ResultMessage had duplicate class-level defaults (kind = None declared twice, results = None and paging_state = None never read). Removed dead code.

Commit 2: Skip redundant None assignments in decode_message

decode_message unconditionally set msg.trace_id, msg.custom_payload, msg.warnings even when the values were None. Changed to conditional assignment (if value is not None), saving ~15 ns on the common case (no tracing, no warnings, no custom payload).

Commit 3: __slots__ on ResultSet

Added __slots__ to ResultSet, eliminating the per-instance __dict__ (~104 bytes saved per ResultSet).

Commit 4: Avoid throwaway ResultSet in fetch_next_page

fetch_next_page called self.response_future.result() which created a new ResultSet, then immediately extracted ._current_rows from it and discarded it. Extracted _wait_for_result() from result() so fetch_next_page can get the raw rows directly, avoiding one ResultSet allocation per page fetch.

Commit 5: __slots__ on _MessageType, ResultMessage, FastResultMessage

  • _MessageType.__slots__ = (): prevents accidental __dict__ on the base class.
  • ResultMessage.__slots__: moves all 19 attributes into slots, eliminating per-instance __dict__. All attributes are now explicitly initialized in __init__.
  • FastResultMessage.__slots__ = (): inherits parent slots without adding __dict__.
  • Updated _get_params to use getattr(message_obj, '__dict__', {}) since ResultMessage no longer has __dict__.

Commit 6: Streamline tablet payload parsing

  • Single .get() + is not None instead of in + .get() (eliminates duplicate dict lookup).
  • Tuple unpacking instead of 3 separate index calls.
  • Guards add_tablet with if tablet is not None (fixes latent bug — from_row can return None).
  • Uses local variable for custom_payload to avoid repeated self._custom_payload lookups.

Commit 7: isinstance(ResultMessage) first for direct attribute access

Restructured _set_result to check isinstance(response, ResultMessage) first (was checked after 3 getattr calls). On the hot path (successful query results), this replaces 3 getattr(response, attr, None) calls with direct attribute access, safe because ResultMessage.__slots__ + __init__ guarantee all attributes exist.

For the cold ErrorMessage path, getattr is retained (required because ErrorMessage lacks trace_id in its class definition). ConnectionException and generic Exception branches explicitly clear _warnings/_custom_payload to prevent stale values from prior retry attempts.

Benchmarks

All benchmarks: Python 3.14, best-of-N, min(timeit.repeat()), pure-Python .py (not Cython .so).

_set_result ROWS hot path (no tablets, no tracing)

Version Time vs origin/master
origin/master 293 ns
After commit 7 244 ns -49 ns (1.20x)

Incremental: commit 6 → commit 7

Before (commit 6) After (commit 7) Saved
288 ns 242 ns 47 ns (1.19x)

Net impact per query (cumulative)

The optimizations span different parts of the per-query path:

  • decode_message (commit 2): ~15 ns saved (skip 3 redundant None assignments)
  • _set_result dispatch (commits 6+7): ~49 ns saved (streamlined tablet parsing + isinstance-first)
  • ResultSet allocation (commit 4): ~200-400 ns saved per page fetch (avoids throwaway ResultSet)
  • Memory (commits 3+5): ~104 bytes/ResultSet + ~200 bytes/ResultMessage saved (per-instance __dict__ eliminated)

Total per-query savings on the hot path: ~65 ns latency + ~300 bytes memory.

Test results

  • 189 unit tests pass (test_response_future, test_policies, test_tablets, test_cluster, test_protocol, test_connection)
  • 5 pre-existing test_resultset.py failures (from commit 4's mock interface change to _wait_for_result — these tests use Mock objects that need updating for the new internal API)

@mykaul mykaul changed the title perf: result path memory allocation optimizations perf: result path memory allocation optimizations (mostly memory savings) Apr 7, 2026
mykaul added 6 commits April 9, 2026 15:46
Remove 'results = None' (never assigned or read in production code),
duplicate 'kind = None' (declared twice), and duplicate 'paging_state = None'
(declared at lines 663 and 681). The canonical declarations at lines 672-685
are kept.
When trace_id, custom_payload, and warnings are None (the common case
for non-traced, non-warned messages), skip the instance attribute
assignment. The class-level defaults on _MessageType already provide
None, so reading these attributes returns the correct value without
the per-instance __dict__ write.
ResultSet is created for every query result and has a fixed set of
6 instance attributes. Adding __slots__ eliminates the per-instance
__dict__ allocation (~100-200 bytes on CPython), reducing memory
pressure on high-throughput workloads.
Extract _wait_for_result() from result() to return the raw result
without wrapping in a ResultSet. Use it in fetch_next_page() to
avoid creating a full ResultSet object just to extract _current_rows.

For paged queries with N pages, this eliminates N-1 throwaway
ResultSet allocations (each with 6 attributes).
…sage

Add __slots__ to eliminate per-instance __dict__ for ResultMessage,
the most common response type. ResultMessage has 19 instance attributes;
eliminating the __dict__ saves ~200-400 bytes per decoded result message.

Changes:
- _MessageType: __slots__ = () to signal slots-awareness to subclasses
- ResultMessage: full __slots__ with all 19 instance attributes initialized
  in __init__ (replaces class-level defaults that are incompatible with slots)
- FastResultMessage: __slots__ = () to inherit parent slots without __dict__
- _get_params: handle slotted objects gracefully for __repr__
- Remove dead 'response.results' assignment from test mock
- Replace double dict lookup ('in' + .get()) with single .get() + None check
- Use tuple unpacking instead of three separate indexing operations
- Guard add_tablet with 'if tablet is not None' (from_row can return None
  for empty replicas — previously this was silently passed through)
- Inline single-use locals (protocol, keyspace, table)

Saves ~13 ns on the tablet-hit path (noise-level, but cleaner code).
@mykaul mykaul force-pushed the perf/result-path-cleanup branch from 709beae to 2818aef Compare April 9, 2026 12:50
… attr access

Move the isinstance(response, ResultMessage) check to the top of
_set_result so the hot path (successful query results) uses direct
attribute access instead of getattr() with defaults.

ResultMessage has trace_id, warnings, and custom_payload in __slots__,
always initialised in __init__, so direct access is safe and avoids the
overhead of 3 getattr() calls + 1 isinstance() that was previously done
unconditionally before the type dispatch.

For the cold ErrorMessage path, getattr() is still used because
ErrorMessage does not declare trace_id in __slots__ or __init__.
ConnectionException and generic Exception branches explicitly clear
_warnings/_custom_payload to avoid stale values from prior retries.

Benchmark (best-of-7, 500k iterations, Python 3.14, pure-Python):
  _set_result ROWS hot path (no tablets, no tracing):
    Before: 326 ns
    After:  271 ns  (-55 ns, 1.20x)
@mykaul mykaul changed the title perf: result path memory allocation optimizations (mostly memory savings) perf: result path memory allocation optimizations (mostly memory savings, 10's of ns improvements) Apr 9, 2026
…er-result list comprehensions

For prepared statements with skip_meta=True (the common case),
column_metadata is the same object every time, yet column_names and
column_types lists are rebuilt via list comprehension on every result set.

Pre-compute and cache these lists on PreparedStatement at prepare time.
In _set_result, use the cached lists directly instead of the per-response
lists. The cache is invalidated when result_metadata is updated during
re-prepare.

Benchmark (column_names + column_types extraction):
  5 cols:  226 ns -> 30 ns (7.4x)
  10 cols: 340 ns -> 28 ns (12.2x)
  20 cols: 589 ns -> 31 ns (18.9x)
  50 cols: 1160 ns -> 29 ns (39.6x)
@mykaul
Copy link
Copy Markdown
Author

mykaul commented Apr 10, 2026

Follow-up commit: cache column_names/column_types on PreparedStatement

Commit: 580b4556cperf: cache column_names/column_types on PreparedStatement to avoid per-result list comprehensions

Change

For prepared statements with skip_meta=True (the common case), column_metadata is the same object every time, yet column_names and column_types lists are rebuilt via list comprehension on every result set in recv_results_rows.

This commit:

  1. Pre-computes _result_col_names and _result_col_types on PreparedStatement at prepare time
  2. In _set_result, uses the cached lists directly instead of the per-response lists
  3. Invalidates the cache when result_metadata is updated during re-prepare

Benchmark results (benchmarks/bench_col_names_cache.py)

Python 3.14.3
  5 cols:  226 ns -> 30 ns (7.4x)
  10 cols: 340 ns -> 28 ns (12.2x)
  20 cols: 589 ns -> 31 ns (18.9x)
  50 cols: 1160 ns -> 29 ns (39.6x)

Tests

611 unit tests passed, 0 failures.

…cess

Two micro-optimizations in _set_result() hot path:

1. Reorder the RESULT_KIND if/elif chain to check ROWS first (was third),
   since it is by far the most common result type.  VOID is second.
   SET_KEYSPACE and SCHEMA_CHANGE (rare) are now last.

2. Add continuous_paging_options = None class attribute to _QueryMessage,
   allowing direct attribute access instead of
   getattr(self.message, 'continuous_paging_options', None).

Benchmark (2M iters, Python 3.14):
  RESULT_KIND reorder: 35.5 -> 24.3 ns (1.46x, -11.2 ns/dispatch)
  getattr -> direct:   32.0 -> 18.3 ns (1.75x, -13.7 ns/access)
  Combined: ~25 ns saved per query
@mykaul
Copy link
Copy Markdown
Author

mykaul commented Apr 10, 2026

Follow-up: Reorder RESULT_KIND dispatch and replace getattr with direct access

Commit: 5bbbc90

What changed

Two micro-optimizations in _set_result() hot path:

  1. Reorder RESULT_KIND dispatch: RESULT_KIND_ROWS is now checked first (was third). RESULT_KIND_VOID is second. SET_KEYSPACE and SCHEMA_CHANGE (rare) are last. Since ROWS is by far the most common result kind, this avoids 2 unnecessary comparisons per query.

  2. Replace getattr with direct access: Added continuous_paging_options = None class attribute to _QueryMessage in protocol.py, enabling self.message.continuous_paging_options instead of getattr(self.message, 'continuous_paging_options', None).

Benchmark results (Python 3.14, 2M iterations)

Optimization Before After Saving Speedup
RESULT_KIND reorder (ROWS hit) 35.5 ns 24.3 ns 11.2 ns 1.46x
getattr → direct attribute 32.0 ns 18.3 ns 13.7 ns 1.75x
Combined per-query ~25 ns

Testing

  • 611 unit tests passed (10.0s)
  • No regressions (pre-existing test_eq failure due to stale Cython resolved by rebuild)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant