Summary
Free-threaded Python 3.14t disables the GIL, exposing numerous thread-safety issues throughout the driver that were previously "accidentally safe" under the GIL. This issue tracks all identified problems beyond the shutdown segfault (#717).
The driver extensively uses shared mutable state (dicts, sets, counters) accessed from multiple threads without proper synchronization. Under CPython with the GIL, many of these were benign. Under free-threaded Python, they cause segfaults, data corruption, lost updates, and race conditions.
1. CRITICAL: Load Balancing Policy Counter/Host Races
Files: cassandra/policies.py
All round-robin-based policies have unprotected _position counter increments and unsynchronized reads of host lists during make_query_plan():
- RoundRobinPolicy (lines 190-191): _position read-modify-write without lock; _live_hosts read at line 193 without lock while on_up()/on_down() modify it concurrently
- DCAwareRoundRobinPolicy (lines 279-280): same _position pattern; _dc_live_hosts.get() at line 282 without lock
- RackAwareRoundRobinPolicy (lines 395-396): same pattern
The code even has a comment acknowledging this: "not thread-safe, but we don't care much about lost increments" — that tolerance assumed the GIL serialized each read-modify-write, so the worst case was a lost increment; without the GIL the assumption no longer holds.
Impact: Duplicate round-robin positions, queries seeing inconsistent host lists, possible crashes in islice(cycle(hosts)) if host set changes mid-iteration.
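One possible shape of the fix is to take a lock around both the position bump and the host-list read, and to treat the host list as an immutable snapshot that is replaced whole rather than mutated in place. The sketch below is illustrative only; the class and method names mirror RoundRobinPolicy but this is not the driver's code:

```python
import threading
from itertools import cycle, islice

class LockedRoundRobin:
    """Sketch of a lock-protected round-robin policy (hypothetical names)."""

    def __init__(self, hosts):
        self._lock = threading.Lock()
        self._live_hosts = tuple(hosts)  # immutable snapshot, swapped whole
        self._position = 0

    def make_query_plan(self):
        with self._lock:
            hosts = self._live_hosts         # read list and position together
            pos = self._position
            self._position = (pos + 1) % len(hosts) if hosts else 0
        # iterate a private snapshot, so on_up/on_down cannot mutate it mid-plan
        return list(islice(cycle(hosts), pos, pos + len(hosts)))

    def on_up(self, host):
        with self._lock:
            if host not in self._live_hosts:
                self._live_hosts += (host,)  # replace, never mutate in place

    def on_down(self, host):
        with self._lock:
            self._live_hosts = tuple(h for h in self._live_hosts if h != host)
```

Because make_query_plan() iterates its own tuple snapshot, the islice(cycle(...)) crash mode disappears even if a host goes down mid-plan.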
2. CRITICAL: Connection._requests Dict Race
File: cassandra/connection.py
- Line 1104: self._requests[request_id] = (cb, decoder, result_metadata) — written outside self.lock
- Lines 1029-1030: error_all_requests() snapshots and clears _requests inside lock, but send_msg() can write to it concurrently without the lock
- Lines 1290-1297: response handling pops from _requests without consistent locking
Impact: Dict corruption, lost requests, segfaults during concurrent dict mutation.
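A minimal sketch of one fix: funnel every _requests access through self.lock, and dispatch callbacks only after the lock is released. The class below is a hypothetical mirror of the relevant Connection bookkeeping, not the driver's actual code:

```python
import threading

class Conn:
    """Sketch: all _requests mutations happen under self.lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self._requests = {}

    def send_msg(self, request_id, cb):
        with self.lock:                      # was: unlocked dict write
            self._requests[request_id] = cb

    def error_all_requests(self, exc):
        with self.lock:                      # snapshot-and-clear stays atomic
            requests, self._requests = self._requests, {}
        for cb in requests.values():         # run user callbacks outside lock
            cb(exc)
```

With both writers under the same lock, error_all_requests() can no longer race a concurrent send_msg() into a corrupted dict.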
3. CRITICAL: _requests Pop / Request ID Recycling Race
File: cassandra/connection.py
Response handling at line 1291 calls self._requests.pop(stream_id) outside self.lock. Combined with issue #2 (where send_msg() writes to _requests outside the lock), this creates a race between request registration and response dispatch. Two threads could pop and process the same stream ID concurrently.
Once a stream ID is popped, it is recycled back into request_ids (lines 1296, 1332, 1344). While the recycling appends themselves are inside with self.lock, the preceding pop and callback dispatch happen outside the lock. This creates a window where:
- Thread A pops stream ID N from _requests (line 1291, no lock)
- Thread A runs the callback and then recycles N back into request_ids (line 1332, with lock)
- Thread B calls get_request_id(), gets N, and sends a new request
- A late response for the original request N arrives and is misrouted to the new request's callback
Note: get_request_id() itself (lines 1067-1078) is correctly documented as requiring self.lock, and the highest_request_id increment is safe as long as callers hold the lock as required. The issue is the unprotected window between pop and recycle, not the increment.
Impact: Duplicate stream IDs on the same connection → protocol errors, response routing to wrong callbacks, silent data corruption.
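One way to close the window is to make the pop and the recycle a single critical section, dispatching the callback only after both are done. This is a hypothetical sketch of the stream-ID bookkeeping, not the driver's implementation:

```python
import threading
from collections import deque

class StreamIds:
    """Sketch: pop and recycle a stream ID under one lock (hypothetical)."""

    def __init__(self, n):
        self.lock = threading.Lock()
        self.request_ids = deque(range(n))
        self._requests = {}

    def register(self, cb):
        with self.lock:
            sid = self.request_ids.popleft()
            self._requests[sid] = cb
            return sid

    def dispatch(self, sid, response):
        with self.lock:
            cb = self._requests.pop(sid, None)  # pop...
            if cb is not None:
                self.request_ids.append(sid)    # ...and recycle atomically
        if cb is not None:
            cb(response)                        # user code outside the lock
```

Because a second dispatch of the same stream ID finds no entry, late or duplicate responses become no-ops rather than misrouted callbacks.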
4. HIGH: Session._pools Dict Races
File: cassandra/cluster.py
- Line 3214: self._pools.get(host) outside lock
- Line 3234: self._pools[host] = new_pool inside lock (but earlier read was outside)
- Line 3245: self._pools.pop(host, None) in remove_pool() without lock
- Line 3369: get_pools() returns self._pools.values() — a live view, not a snapshot
Impact: Dict corruption during concurrent pool addition/removal, RuntimeError during iteration.
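A sketch of the locked pattern: combine the read and the write into one critical section, and return a copied snapshot instead of a live view. Names mirror Session but are illustrative:

```python
import threading

class Session:
    """Sketch: every _pools access under one lock (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}

    def add_or_get_pool(self, host, factory):
        with self._lock:              # check-then-insert in one critical section
            pool = self._pools.get(host)
            if pool is None:
                pool = self._pools[host] = factory(host)
            return pool

    def remove_pool(self, host):
        with self._lock:
            return self._pools.pop(host, None)

    def get_pools(self):
        with self._lock:
            return list(self._pools.values())  # snapshot, not a live view
```

Callers can then iterate the snapshot freely while other threads add or remove pools.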
5. HIGH: HostConnection State Races
File: cassandra/pool.py
- _is_replacing flag (lines 578-580): check-then-act without lock — two threads can both read False, both set True, both submit _replace() → double replacement
- _trash set (lines 582-591): membership check and remove without atomicity → KeyError or double-close
- _connections dict (lines 450, 512-515): read without lock while _replace() modifies it → NoConnectionsAvailable or choice from empty dict
- _excess_connections set (lines 827-848): size check and add/close without lock
- in_flight counter (line 781): read without lock for comparison → stale value → premature connection close
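The _is_replacing race in particular is a classic check-then-act bug; a sketch of an atomic test-and-set version (hypothetical names mirroring pool.py):

```python
import threading

class HostConnection:
    """Sketch: atomic test-and-set so only one thread replaces (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._is_replacing = False
        self.replacements = 0

    def _maybe_replace(self):
        with self._lock:
            if self._is_replacing:    # check and set under the same lock
                return False
            self._is_replacing = True
        self._replace()
        return True

    def _replace(self):
        self.replacements += 1        # stand-in for the real replacement work
        with self._lock:
            self._is_replacing = False
```

Only the thread that wins the test-and-set submits _replace(); the loser returns immediately instead of starting a duplicate replacement.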
6. HIGH: concurrent.py Executor Shared State
File: cassandra/concurrent.py
- _exception (lines 193-194): written from multiple callback threads without lock → lost errors
- _results_queue (line 189): append() without lock while _results() (line 207) sorts/reads it → list corruption
- _exec_depth counter (lines 130, 145): += 1 / -= 1 from multiple threads → wrong recursion depth tracking
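A sketch of a lock-guarded aggregator: results and the first-exception slot are only touched under a lock, and completion is signalled through an Event. This is a hypothetical mirror of concurrent.py's result collection, not its actual code:

```python
import threading

class ConcurrentExecutor:
    """Sketch: guard shared results/exception state with a lock (hypothetical)."""

    def __init__(self, expected):
        self._lock = threading.Lock()
        self._results = []
        self._exception = None
        self._expected = expected
        self._done = threading.Event()

    def _on_result(self, idx, value):
        with self._lock:
            self._results.append((idx, value))
            finished = len(self._results) == self._expected
        if finished:
            self._done.set()

    def _on_error(self, exc):
        with self._lock:
            if self._exception is None:   # keep the first error, never lose it
                self._exception = exc
        self._done.set()

    def results(self):
        self._done.wait()
        with self._lock:
            if self._exception is not None:
                raise self._exception
            return [v for _, v in sorted(self._results)]
```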
7. MEDIUM: Metadata / Token Map Races
File: cassandra/metadata.py
- token_map replacement (lines 311-312): self.token_map = TokenMap(...) without lock while query threads read self.token_map at line 319 → queries route using partially-built or freed map
- keyspaces dict (lines 208, 223, 231, 238): accessed without locks from both schema refresh (ControlConnection thread) and user queries
- _tablets access (lines 269, 278): drop_tablets() called without synchronization during topology changes
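For the token map, one common remedy is build-then-swap: construct the new map fully off to the side, publish it with a single attribute assignment, and have readers take exactly one local reference. A hedged sketch with hypothetical names:

```python
import threading

class Metadata:
    """Sketch: publish a fully-built token map via one reference swap."""

    def __init__(self):
        self._lock = threading.Lock()
        self.token_map = {}

    def rebuild_token_map(self, ring):
        new_map = dict(ring)          # fully built before anyone can see it
        with self._lock:
            self.token_map = new_map  # single atomic reference swap

    def route(self, token):
        tm = self.token_map           # one read; never a half-built map
        return tm.get(token)
```

Readers either see the old complete map or the new complete map, never an intermediate state.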
8. MEDIUM: Cluster._prepared_statements WeakValueDictionary
File: cassandra/cluster.py, lines 1448-1449
Writes are locked (_prepared_statement_lock), but reads during query execution may not hold the lock. WeakValueDictionary is not thread-safe — values can be GC'd on another thread during iteration.
9. MEDIUM: Global _clusters_for_shutdown Set
File: cassandra/cluster.py, lines 243-256
Module-level _clusters_for_shutdown set is modified via add()/discard() without any lock. The atexit handler _shutdown_clusters() calls .copy() but races with concurrent register/unregister.
10. LOW: Session.__del__() Accessing Shared State
File: cassandra/cluster.py, lines 3181-3188
__del__ calls shutdown() which accesses _lock, _pools, is_shutdown. In free-threaded Python, __del__ can run on any thread at any time.
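One mitigation is to make shutdown() an idempotent test-and-set so a finalizer running on any thread is a harmless no-op after the first call. A hedged sketch with hypothetical names:

```python
import threading

class Session:
    """Sketch: idempotent shutdown safe to call from __del__ (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.is_shutdown = False
        self._pools = {}

    def shutdown(self):
        with self._lock:
            if self.is_shutdown:      # second caller (e.g. __del__) is a no-op
                return
            self.is_shutdown = True
            pools, self._pools = self._pools, {}
        for pool in pools.values():   # close outside the lock
            pool.close()

    def __del__(self):
        try:
            self.shutdown()
        except Exception:
            pass                      # never raise from a finalizer
```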
Summary

| # | Severity | Area | Core Problem |
|---|----------|------|--------------|
| 1 | CRITICAL | policies.py | Unprotected counter increment + host list reads in query plan |
| 2 | CRITICAL | connection.py | _requests dict written outside lock |
| 3 | CRITICAL | connection.py | _requests pop and stream ID recycling outside lock |
| 4 | HIGH | cluster.py | _pools dict concurrent mutation |
| 5 | HIGH | pool.py | _is_replacing, _trash, _connections races |
| 6 | HIGH | concurrent.py | Callback shared state without synchronization |
| 7 | MEDIUM | metadata.py | Token map and keyspaces replaced during reads |
| 8 | MEDIUM | cluster.py | WeakValueDictionary not thread-safe |
| 9 | MEDIUM | cluster.py | Global set mutations without lock |
| 10 | LOW | cluster.py | __del__ on arbitrary thread |
Related