
feat(transport): add HTTP retry with exponential backoff #1520

Draft

jpnurmi wants to merge 50 commits into master from jpnurmi/feat/http-retry

Conversation

@jpnurmi (Collaborator) commented Feb 13, 2026

Add `sentry_options_set_http_retries()` to configure retry attempts for transient network errors. Failed envelopes are stored as `<db>/cache/<ts>-<n>-<uuid>.envelope` and retried with exponential backoff (15min, 30min, 1h, 2h, 8h), modeled after Crashpad's upload retry behavior. When retries are exhausted and offline caching is enabled, envelopes are stored as `<db>/cache/<uuid>.envelope` instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2["<db>/cache/<ts>-<n>-<uuid>.envelope"]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE["<db>/cache/<uuid>.envelope"]
    C -->|no| discard
```

See also: https://develop.sentry.dev/sdk/expected-features/#buffer-to-disk


@github-actions bot commented Feb 13, 2026

Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against 0902da2

jpnurmi force-pushed the jpnurmi/feat/http-retry branch from df2be97 to b083a57 on February 13, 2026 16:59
jpnurmi and others added 28 commits February 13, 2026 18:47
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The deferred startup retry scan (100ms delay) could pick up files
written by the current session. Filter by startup_time so only
previous-session files are processed. Also ensure the cache directory
exists when cache_keep is enabled, since sentry__process_old_runs
only creates it conditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Monotonic time is process-relative and doesn't work across restarts.
Retry envelope timestamps need to persist across sessions, so use
time() (seconds since epoch) for file timestamps, startup_time, and
backoff comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename SENTRY_RETRY_BACKOFF_BASE_MS to SENTRY_RETRY_BACKOFF_BASE_S
and sentry__retry_backoff_ms to sentry__retry_backoff, since file
timestamps are now in seconds. The bgworker delay sites multiply
by 1000 to convert to the milliseconds it expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move startup_time initialization into sentry__retry_new and remove the
unnecessary sentry__retry_set_startup_time indirection. Tests now use
write_retry_file with timestamps well in the past to match production
behavior where retry files are from previous sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When files exist but aren't eligible yet (backoff not elapsed),
foreach was returning 0, causing the retry polling task to stop.
Return the total number of valid retry files found, instead of just
the eligible count, so the caller keeps rescheduling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make handle_result return bool (true = file rescheduled for retry,
false = file consumed) and use it in foreach to decrement the total
count. This avoids one extra no-op poll cycle after the last retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Y_THROTTLE

Replace SENTRY_RETRY_BACKOFF_BASE_S and SENTRY_RETRY_STARTUP_DELAY_MS
with ms-based constants so the transport uses them directly without
leaking unit conversion details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Give the retry module a bgworker ref and send callback so it owns all
scheduling. Transport just calls _start and _enqueue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sentry__retry_new only returns NULL on failure, not based on options.
sentry__retry_start and _enqueue require non-NULL retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicate prepare/send/free sequence shared by retry_send_cb and
http_send_task.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Already covered by retry_throttle and retry_result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ashpad

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass startup_time directly to _foreach as a `before` filter instead of
a bool. Clear it after the first run so subsequent polls use backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicate filename construction across write_envelope, handle_result,
and tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the transport supports retry and http_retries > 0,
sentry__process_old_runs now skips caching .envelope files from old
runs. The retry system handles persistence, so duplicating into
cache/ is unnecessary.

Also simplifies sentry__retry_handle_result: only cache on max
retries exhausted, not on successful send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the retry-aware check before cache_dir creation so we avoid
mkdir when the retry system handles persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry callback now receives a sentry_envelope_t and returns a
status code. The retry system handles deserialization and file
lifecycle internally, keeping path concerns out of the transport.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add test case for successful send at max retry count with cache_keep
enabled, confirming envelopes are cached regardless of send outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi and others added 2 commits February 13, 2026 18:47
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lopes

The startup poll used `ts >= startup_time` to skip envelopes written
after startup. With second-precision timestamps, this also skipped
cross-session envelopes written in the same second as a fast restart.

Reset `startup_time` in `sentry__retry_enqueue` so the startup poll
falls through to the backoff path for same-session envelopes. The
bgworker processes the send task (immediate) before the startup poll
(delayed), so by the time the poll fires, `startup_time` is already 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi force-pushed the jpnurmi/feat/http-retry branch from b083a57 to a264f66 on February 13, 2026 17:47
jpnurmi and others added 20 commits February 14, 2026 10:40
Submit a one-shot retry send task before bgworker shutdown to ensure
pre-existing retry files are sent even if the startup poll hasn't
fired yet. The flush checks startup_time on the worker thread to
avoid re-sending files already handled by enqueue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `time(NULL)` (1-second granularity) with `sentry__usec_time() / 1000`
(millisecond granularity) to avoid timestamp collisions that caused flaky
`>=` vs `>` comparison behavior in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make sentry__retry_flush block until the flush task completes by adding
a bgworker_flush call, and subtract the elapsed time from the shutdown
timeout. This ensures retries are actually sent before the worker stops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Break out of the send loop on the first network error to avoid wasting
time on a dead connection. Remaining envelopes stay untouched for the
next retry poll.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When bgworker shutdown times out, persist any remaining queued envelopes
to the retry directory so they are not lost. The retry module provides
sentry__retry_dump_queue to keep retry internals out of the transport.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After shutdown timeout, the bgworker thread is detached but may still
be executing an http_send_task. Since dump_queue already saves that
task's envelope to the retry dir, the worker's subsequent call to
retry_enqueue would create a duplicate file. Seal the retry module
after dumping so that any late enqueue calls are silently skipped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logic

Remove count_eligible_files helper that duplicated filtering logic
from sentry__retry_send. The retry_backoff test now exercises the
actual send path for both backoff and startup modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store parsed fields (ts, count, uuid) alongside the path during the
filter phase so handle_result and future debug logging can use them
without re-parsing. Also improves sort performance by comparing
numeric fields before falling back to string comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log retry attempts at DEBUG level and max-retries-reached at WARN
level to make retry behavior observable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes

Three places independently constructed <database>/cache and wrote
envelopes there. Add cache_path to sentry_run_t and introduce
sentry__run_write_cache() and sentry__run_move_cache() to centralize
the cache directory creation and file operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CURLOPT_TIMEOUT_MS is a total transfer timeout that could cut off large
envelopes. Use CURLOPT_CONNECTTIMEOUT_MS instead so only connection
establishment is bounded. For winhttp, limit resolve and connect to 15s
but leave send/receive at their defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without this, sentry__retry_send overcounts remaining files, causing an
unnecessary extra poll cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure handle_result so "max retries reached" warnings only fire
on actual network failures, not on successful delivery at the last
attempt. Separate the warning logic from the cache/discard actions and
put the re-enqueue branch first for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `can_retry` bool on the transport with a `retry_func`
callback, and expose `sentry_transport_retry()` as an experimental
public API for explicitly retrying all pending envelopes, e.g. when
coming back online.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move retry envelopes from a separate retry/ directory into cache/ so
that sentry__cleanup_cache() enforces disk limits for both file formats
out of the box. The two formats are distinguishable by length: retry
files use <ts>-<count>-<uuid>.envelope (49+ chars) while cache files
use <uuid>.envelope (45 chars). Default http_retries to 0 (opt-in).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>