Summary
Free-threaded Python 3.14t disables the GIL, exposing numerous thread-safety issues throughout the driver that were previously "accidentally safe" under the GIL. This issue tracks all identified problems beyond the shutdown segfault (#717).
The driver extensively uses shared mutable state (dicts, sets, counters) accessed from multiple threads without proper synchronization. Under CPython with the GIL, many of these were benign. Under free-threaded Python, they cause segfaults, data corruption, lost updates, and race conditions.
1. CRITICAL: Load Balancing Policy Counter/Host Races
Files: cassandra/policies.py
All round-robin-based policies have unprotected _position counter increments and unsynchronized reads of host lists during make_query_plan():
- RoundRobinPolicy (lines 190-191): _position read-modify-write without lock; _live_hosts read at line 193 without lock while on_up()/on_down() modify it concurrently
- DCAwareRoundRobinPolicy (lines 279-280): same _position pattern; _dc_live_hosts.get() at line 282 without lock
- RackAwareRoundRobinPolicy (lines 395-396): same pattern
The code even has a comment acknowledging this: "not thread-safe, but we don't care much about lost increments" — that tolerance assumed the GIL serialized each read-modify-write, so the worst case was a lost increment; without the GIL the assumption no longer holds.
Impact: Duplicate round-robin positions, queries seeing inconsistent host lists, possible crashes in islice(cycle(hosts)) if host set changes mid-iteration.
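One possible shape of the fix is to take a lock around both the position bump and the host-list read, and to treat the host list as an immutable snapshot that is replaced whole rather than mutated in place. The sketch below is illustrative only; the class and method names mirror RoundRobinPolicy but this is not the driver's code:

```python
import threading
from itertools import cycle, islice

class LockedRoundRobin:
    """Sketch of a lock-protected round-robin policy (hypothetical names)."""

    def __init__(self, hosts):
        self._lock = threading.Lock()
        self._live_hosts = tuple(hosts)  # immutable snapshot, swapped whole
        self._position = 0

    def make_query_plan(self):
        with self._lock:
            hosts = self._live_hosts         # read list and position together
            pos = self._position
            self._position = (pos + 1) % len(hosts) if hosts else 0
        # iterate a private snapshot, so on_up/on_down cannot mutate it mid-plan
        return list(islice(cycle(hosts), pos, pos + len(hosts)))

    def on_up(self, host):
        with self._lock:
            if host not in self._live_hosts:
                self._live_hosts += (host,)  # replace, never mutate in place

    def on_down(self, host):
        with self._lock:
            self._live_hosts = tuple(h for h in self._live_hosts if h != host)
```

Because make_query_plan() iterates its own tuple snapshot, the islice(cycle(...)) crash mode disappears even if a host goes down mid-plan.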
2. CRITICAL: Connection._requests Dict Race
File: cassandra/connection.py
- Line 1104: self._requests[request_id] = (cb, decoder, result_metadata) — written outside self.lock
- Lines 1029-1030: error_all_requests() snapshots and clears _requests inside lock, but send_msg() can write to it concurrently without the lock
- Lines 1290-1297: response handling pops from _requests without consistent locking
Impact: Dict corruption, lost requests, segfaults during concurrent dict mutation.
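A minimal sketch of one fix: funnel every _requests access through self.lock, and dispatch callbacks only after the lock is released. The class below is a hypothetical mirror of the relevant Connection bookkeeping, not the driver's actual code:

```python
import threading

class Conn:
    """Sketch: all _requests mutations happen under self.lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self._requests = {}

    def send_msg(self, request_id, cb):
        with self.lock:                      # was: unlocked dict write
            self._requests[request_id] = cb

    def error_all_requests(self, exc):
        with self.lock:                      # snapshot-and-clear stays atomic
            requests, self._requests = self._requests, {}
        for cb in requests.values():         # run user callbacks outside lock
            cb(exc)
```

With both writers under the same lock, error_all_requests() can no longer race a concurrent send_msg() into a corrupted dict.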
3. CRITICAL: _requests Pop / Request ID Recycling Race
File: cassandra/connection.py
Response handling at line 1291 calls self._requests.pop(stream_id) outside self.lock. Combined with issue #2 (where send_msg() writes to _requests outside the lock), this creates a race between request registration and response dispatch. Two threads could pop and process the same stream ID concurrently.
Once a stream ID is popped, it is recycled back into request_ids (lines 1296, 1332, 1344). While the recycling appends themselves are inside with self.lock, the preceding pop and callback dispatch happen outside the lock. This creates a window where:
- Thread A pops stream ID N from _requests (line 1291, no lock)
- Thread A runs the callback and then recycles N back into request_ids (line 1332, with lock)
- Thread B calls get_request_id(), gets N, and sends a new request
- A late response for the original request N arrives and is misrouted to the new request's callback
Note: get_request_id() itself (lines 1067-1078) is correctly documented as requiring self.lock, and the highest_request_id increment is safe as long as callers hold the lock as required. The issue is the unprotected window between pop and recycle, not the increment.
Impact: Duplicate stream IDs on the same connection → protocol errors, response routing to wrong callbacks, silent data corruption.
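One way to close the window is to make the pop and the recycle a single critical section, dispatching the callback only after both are done. This is a hypothetical sketch of the stream-ID bookkeeping, not the driver's implementation:

```python
import threading
from collections import deque

class StreamIds:
    """Sketch: pop and recycle a stream ID under one lock (hypothetical)."""

    def __init__(self, n):
        self.lock = threading.Lock()
        self.request_ids = deque(range(n))
        self._requests = {}

    def register(self, cb):
        with self.lock:
            sid = self.request_ids.popleft()
            self._requests[sid] = cb
            return sid

    def dispatch(self, sid, response):
        with self.lock:
            cb = self._requests.pop(sid, None)  # pop...
            if cb is not None:
                self.request_ids.append(sid)    # ...and recycle atomically
        if cb is not None:
            cb(response)                        # user code outside the lock
```

Because a second dispatch of the same stream ID finds no entry, late or duplicate responses become no-ops rather than misrouted callbacks.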
4. HIGH: Session._pools Dict Races
File: cassandra/cluster.py
- Line 3214: self._pools.get(host) outside lock
- Line 3234: self._pools[host] = new_pool inside lock (but earlier read was outside)
- Line 3245: self._pools.pop(host, None) in remove_pool() without lock
- Line 3369: get_pools() returns self._pools.values() — a live view, not a snapshot
Impact: Dict corruption during concurrent pool addition/removal, RuntimeError during iteration.
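A sketch of the locked pattern: combine the read and the write into one critical section, and return a copied snapshot instead of a live view. Names mirror Session but are illustrative:

```python
import threading

class Session:
    """Sketch: every _pools access under one lock (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}

    def add_or_get_pool(self, host, factory):
        with self._lock:              # check-then-insert in one critical section
            pool = self._pools.get(host)
            if pool is None:
                pool = self._pools[host] = factory(host)
            return pool

    def remove_pool(self, host):
        with self._lock:
            return self._pools.pop(host, None)

    def get_pools(self):
        with self._lock:
            return list(self._pools.values())  # snapshot, not a live view
```

Callers can then iterate the snapshot freely while other threads add or remove pools.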
5. HIGH: HostConnection State Races
File: cassandra/pool.py
- _is_replacing flag (lines 578-580): check-then-act without lock — two threads can both read False, both set True, both submit _replace() → double replacement
- _trash set (lines 582-591): membership check and remove without atomicity → KeyError or double-close
- _connections dict (lines 450, 512-515): read without lock while _replace() modifies it → NoConnectionsAvailable or choice from empty dict
- _excess_connections set (lines 827-848): size check and add/close without lock
- in_flight counter (line 781): read without lock for comparison → stale value → premature connection close
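The _is_replacing race in particular is a classic check-then-act bug; a sketch of an atomic test-and-set version (hypothetical names mirroring pool.py):

```python
import threading

class HostConnection:
    """Sketch: atomic test-and-set so only one thread replaces (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._is_replacing = False
        self.replacements = 0

    def _maybe_replace(self):
        with self._lock:
            if self._is_replacing:    # check and set under the same lock
                return False
            self._is_replacing = True
        self._replace()
        return True

    def _replace(self):
        self.replacements += 1        # stand-in for the real replacement work
        with self._lock:
            self._is_replacing = False
```

Only the thread that wins the test-and-set submits _replace(); the loser returns immediately instead of starting a duplicate replacement.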
6. HIGH: concurrent.py Executor Shared State
File: cassandra/concurrent.py
- _exception (lines 193-194): written from multiple callback threads without lock → lost errors
- _results_queue (line 189): append() without lock while _results() (line 207) sorts/reads it → list corruption
- _exec_depth counter (lines 130, 145): += 1 / -= 1 from multiple threads → wrong recursion depth tracking
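A sketch of a lock-guarded aggregator: results and the first-exception slot are only touched under a lock, and completion is signalled through an Event. This is a hypothetical mirror of concurrent.py's result collection, not its actual code:

```python
import threading

class ConcurrentExecutor:
    """Sketch: guard shared results/exception state with a lock (hypothetical)."""

    def __init__(self, expected):
        self._lock = threading.Lock()
        self._results = []
        self._exception = None
        self._expected = expected
        self._done = threading.Event()

    def _on_result(self, idx, value):
        with self._lock:
            self._results.append((idx, value))
            finished = len(self._results) == self._expected
        if finished:
            self._done.set()

    def _on_error(self, exc):
        with self._lock:
            if self._exception is None:   # keep the first error, never lose it
                self._exception = exc
        self._done.set()

    def results(self):
        self._done.wait()
        with self._lock:
            if self._exception is not None:
                raise self._exception
            return [v for _, v in sorted(self._results)]
```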
7. MEDIUM: Metadata / Token Map Races
File: cassandra/metadata.py
- token_map replacement (lines 311-312): self.token_map = TokenMap(...) without lock while query threads read self.token_map at line 319 → queries route using partially-built or freed map
- keyspaces dict (lines 208, 223, 231, 238): accessed without locks from both schema refresh (ControlConnection thread) and user queries
- _tablets access (lines 269, 278): drop_tablets() called without synchronization during topology changes
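For the token map, one common remedy is build-then-swap: construct the new map fully off to the side, publish it with a single attribute assignment, and have readers take exactly one local reference. A hedged sketch with hypothetical names:

```python
import threading

class Metadata:
    """Sketch: publish a fully-built token map via one reference swap."""

    def __init__(self):
        self._lock = threading.Lock()
        self.token_map = {}

    def rebuild_token_map(self, ring):
        new_map = dict(ring)          # fully built before anyone can see it
        with self._lock:
            self.token_map = new_map  # single atomic reference swap

    def route(self, token):
        tm = self.token_map           # one read; never a half-built map
        return tm.get(token)
```

Readers either see the old complete map or the new complete map, never an intermediate state.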
8. MEDIUM: Cluster._prepared_statements WeakValueDictionary
File: cassandra/cluster.py, lines 1448-1449
Writes are locked (_prepared_statement_lock), but reads during query execution may not hold the lock. WeakValueDictionary is not thread-safe — values can be GC'd on another thread during iteration.
9. MEDIUM: Global _clusters_for_shutdown Set
File: cassandra/cluster.py, lines 243-256
Module-level _clusters_for_shutdown set is modified via add()/discard() without any lock. The atexit handler _shutdown_clusters() calls .copy() but races with concurrent register/unregister.
10. LOW: Session.__del__() Accessing Shared State
File: cassandra/cluster.py, lines 3181-3188
__del__ calls shutdown() which accesses _lock, _pools, is_shutdown. In free-threaded Python, __del__ can run on any thread at any time.
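One mitigation is to make shutdown() an idempotent test-and-set so a finalizer running on any thread is a harmless no-op after the first call. A hedged sketch with hypothetical names:

```python
import threading

class Session:
    """Sketch: idempotent shutdown safe to call from __del__ (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.is_shutdown = False
        self._pools = {}

    def shutdown(self):
        with self._lock:
            if self.is_shutdown:      # second caller (e.g. __del__) is a no-op
                return
            self.is_shutdown = True
            pools, self._pools = self._pools, {}
        for pool in pools.values():   # close outside the lock
            pool.close()

    def __del__(self):
        try:
            self.shutdown()
        except Exception:
            pass                      # never raise from a finalizer
```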
Summary

| # | Severity | Area | Core Problem |
|---|----------|------|--------------|
| 1 | CRITICAL | policies.py | Unprotected counter increment + host list reads in query plan |
| 2 | CRITICAL | connection.py | _requests dict written outside lock |
| 3 | CRITICAL | connection.py | _requests pop and stream ID recycling outside lock |
| 4 | HIGH | cluster.py | _pools dict concurrent mutation |
| 5 | HIGH | pool.py | _is_replacing, _trash, _connections races |
| 6 | HIGH | concurrent.py | Callback shared state without synchronization |
| 7 | MEDIUM | metadata.py | Token map and keyspaces replaced during reads |
| 8 | MEDIUM | cluster.py | WeakValueDictionary not thread-safe |
| 9 | MEDIUM | cluster.py | Global set mutations without lock |
| 10 | LOW | cluster.py | __del__ on arbitrary thread |
Related