SIGSEGV in thread-sampling transaction profiler under concurrent HTTP load (Python 3.11, sentry-sdk 2.58.0) #6119

@saurabh-statisfy

Description

How do you use Sentry?

Sentry SaaS (sentry.io)

Version

2.58.0 (also reproduced on 2.17.0)

Steps to Reproduce

  1. Run a FastAPI app under uvicorn on Python 3.11, containerised (Linux, x86_64), with many concurrent HTTP requests.
  2. Each request performs synchronous outbound I/O (e.g. google-cloud-storage blob.download_as_bytes, arbitrary requests.Session.post / get) from an anyio worker thread (standard FastAPI run_in_threadpool pattern).
  3. sentry_sdk.init(dsn=..., traces_sampler=..., profiles_sample_rate=1.0) — any non-zero profiles_sample_rate is sufficient to trigger.
  4. Wait on the order of seconds under steady load.
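Step 3, as a hedged sketch (the DSN and sampler below are placeholders, not the real values; any non-zero profiles_sample_rate is sufficient):

```python
import sentry_sdk

sentry_sdk.init(
    dsn="...",                        # placeholder: your real DSN
    traces_sampler=lambda ctx: 1.0,   # placeholder sampler: sample every transaction
    profiles_sample_rate=1.0,         # any value > 0 starts the thread-sampling profiler
)
```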

Expected Result

No process-level crash from the profiler.

Actual Result

Worker process dies with SIGSEGV. Fault address is a small integer (observed: 0, 1, 2, 7, 8, 68, 72, 80, 111, 228) — classic use-after-free / null-deref-with-offset. Multiple different threads crash across different incidents; every crash has the profiler thread actively sampling frames at the moment of death.

With PYTHONFAULTHANDLER=1 enabled, faulthandler consistently shows the pattern:

Sibling thread — profiler (running at the moment of crash, every time):

File ".../sentry_sdk/profiler/transaction_profiler.py", line 711, in run
File ".../sentry_sdk/profiler/transaction_profiler.py", line 601, in _sample_stack
File ".../sentry_sdk/profiler/transaction_profiler.py", line 602, in <listcomp>
File ".../sentry_sdk/profiler/utils.py", line 167, in extract_stack
File ".../sentry_sdk/profiler/utils.py", line 167, in <genexpr>
File ".../sentry_sdk/profiler/utils.py", line 114, in frame_id

Current thread — mid-HTTP I/O via a Sentry stdlib patch (one representative stack; the exact app code above the stdlib patch varies, but the Sentry patch frame is always present):

File "/usr/local/lib/python3.11/ssl.py", line 1166, in read
File "/usr/local/lib/python3.11/ssl.py", line 1314, in recv_into
File "/usr/local/lib/python3.11/socket.py", line 718, in readinto
File "/usr/local/lib/python3.11/http/client.py", line 291, in _read_status
File "/usr/local/lib/python3.11/http/client.py", line 330, in begin
File "/usr/local/lib/python3.11/http/client.py", line 1415, in getresponse
File ".../sentry_sdk/integrations/stdlib.py", line 146, in getresponse       <-- Sentry patch
File ".../urllib3/connection.py", line 571, in getresponse
File ".../urllib3/connectionpool.py", line 534, in _make_request
File ".../urllib3/connectionpool.py", line 787, in urlopen
File ".../requests/adapters.py", line 644, in send
File ".../opentelemetry_instrumentation_requests/__init__.py", line 432, in instrumented_send
File ".../requests/sessions.py", line 703, in send
File ".../requests/sessions.py", line 589, in request
File ".../google/auth/transport/requests.py", line 543, in request
File ".../google/cloud/storage/_media/requests/download.py", line 253, in retriable_request
File ".../google/api_core/retry/retry_unary.py", line 147, in retry_target
File ".../google/cloud/storage/blob.py", line 1094, in _do_download
File ".../google/cloud/storage/blob.py", line 1530, in download_as_bytes
File ".../google/cloud/storage/blob.py", line 1651, in download_as_string
File ".../<app>/handler.py", in <app_handler>
File ".../starlette/concurrency.py", line 42, in run_in_threadpool
File ".../anyio/to_thread.py", line 63, in run_sync
File ".../anyio/_backends/_asyncio.py", line 1002, in run
File ".../sentry_sdk/integrations/fastapi.py", line 90, in _sentry_call
File ".../sentry_sdk/integrations/threading.py", line 133, in _run_old_run_func
File ".../sentry_sdk/integrations/threading.py", line 140, in run
File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap

Previously observed variant: same profiler sibling stack, but the crashing mutator frame was in tracing.Span.__init__ → uuid.uuid4(), entered via the sentry_sdk/integrations/stdlib.py:91 putrequest patch instead of getresponse. Both stdlib patch points reproduce the crash.

Analysis

The profiler thread in transaction_profiler.py samples the PyFrameObjects of all other threads at the configured frequency, via extract_stack → frame_id (utils.py:167 → 114). frame_id reads fields from a PyFrameObject that another thread may be actively mutating or freeing. There is no GIL-level synchronisation across the sample boundary: the sampler is scheduled cooperatively with mutator threads, but the C-level attribute reads inside frame_id can see a partially-freed object if the mutator releases or reclaims a frame or code object mid-sample.

The signature (always tiny fault_addr, always in frame_id/extract_stack while a mutator is mid-Span construction around HTTP I/O) is consistent with that race. The Sentry stdlib.putrequest / getresponse patches are a frequent entry point because every outbound HTTP call constructs a new Span → allocates a uuid4() → lots of short-lived frame/code churn right in the profiler's sampling window.
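The sampling pattern described above can be approximated with a stdlib-only sketch. This is an illustrative analog, not the SDK's actual code: frame_id here is a stand-in for the real utils.py:114 function, but the shape of the walk (snapshot every thread's top frame, then follow f_back while the owning thread keeps running) is the same window in which the race occurs:

```python
import sys
import threading
import time

def frame_id(frame):
    # Illustrative analog of the SDK's frame_id: reads attributes of a frame
    # object owned by another thread. Nothing pins the frame for the duration
    # of the walk; the owning thread may tear it down concurrently.
    return (frame.f_code.co_name, frame.f_code.co_filename, frame.f_lineno)

def sample_once():
    # sys._current_frames() returns the topmost frame of every thread; the
    # sampler then walks f_back. This is the profiler's sampling window.
    stacks = {}
    for tid, frame in sys._current_frames().items():
        stack = []
        while frame is not None:
            stack.append(frame_id(frame))
            frame = frame.f_back
        stacks[tid] = stack
    return stacks

def worker():
    time.sleep(0.2)  # stands in for blocking HTTP/SSL I/O in the thread pool

t = threading.Thread(target=worker)
t.start()
time.sleep(0.05)          # let the worker reach its blocking call
snapshot = sample_once()  # sample while the worker is mid-"I/O"
t.join()
```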

Workarounds

  • profiles_sample_rate=0.0 — stops the profiler thread entirely. Confirmed to eliminate crashes.
  • profiles_sample_rate=0.1 (reduced from 1.0) — reduces crash rate proportionally but does not eliminate it. A single sampled transaction hitting the race is enough to kill the container.
  • Upgrading sentry-sdk from 2.17.0 to 2.58.0 does not fix it. The legacy thread-sampling profiler is still used whenever profiles_sample_rate > 0; 2.24.1+ added the continuous profiler as an opt-in, but did not replace the transaction profiler.
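The confirmed workaround, as a hedged sketch (placeholder DSN and sampler; tracing itself can stay enabled):

```python
import sentry_sdk

sentry_sdk.init(
    dsn="...",                        # placeholder: your real DSN
    traces_sampler=lambda ctx: 1.0,   # placeholder: tracing unaffected
    profiles_sample_rate=0.0,         # 0.0 prevents the profiler thread from starting
)
```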

Ask

  • Can the Sentry team confirm whether the thread-sampling transaction profiler is considered safe for production use on Python 3.11+ under concurrent HTTP load?
  • If not, would the docs acknowledge this (it's currently presented as a general-purpose option)?
  • Could frame_id / extract_stack be hardened against concurrent mutation, or is migration to the continuous profiler the official path forward?

Environment

  • OS: Linux (Debian slim base, x86_64), managed container platform
  • Python: 3.11 (CPython, stock)
  • Runtime: FastAPI 0.114, Starlette 0.37, uvicorn 0.34, anyio 4.13
  • Third-party in-stack at crash: requests 2.32, urllib3 2.6, google-cloud-storage 3.10, google-auth 2.49, opentelemetry-instrumentation-requests 0.62b0
  • PYTHONFAULTHANDLER=1 was set to capture the traces above; without it, the crash is logged only as "Container terminated on signal 11"
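For anyone reproducing this on a platform where environment variables are awkward to set, faulthandler can equally be enabled in-process at startup; this is equivalent to PYTHONFAULTHANDLER=1:

```python
import faulthandler
import sys

# Dump the stacks of all threads to stderr on SIGSEGV, SIGFPE, SIGABRT,
# SIGBUS and SIGILL, just like PYTHONFAULTHANDLER=1.
faulthandler.enable(file=sys.stderr, all_threads=True)
```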
