A bulletproof bash script for streaming large tar archives directly to disk with automatic retry logic, stall detection, and progress monitoring.
- Streaming extraction - Downloads and extracts simultaneously, no temporary files
- Production-grade reliability - Automatic retries with exponential backoff
- Stall detection - Watchdog automatically kills and retries stalled downloads
- Progress monitoring - Real-time download ETA via pv, or extraction speed fallback
- Compression support - Auto-detects zstd, lz4, gzip, bzip2, xz, and plain tar (by extension or magic bytes)
- Minimal disk usage - No temporary tar file, extracts on-the-fly
- Connection resilience - TCP keepalive, nodelay, and aggressive timeout handling
- Multipart support - Split archives (.partNNNN) concatenated and extracted as one stream
- Structured logging - Text or JSON output for log aggregation
- Checksum verification - Optional SHA-256 integrity check
Perfect for:
- Blockchain snapshot restoration (Ethereum, Cosmos, etc.)
- Large database backups
- CI/CD deployment of large archives
- Any scenario where disk space is limited but reliability is critical
Requirements:
- bash 4.4+
- curl
- tar
- Compression tools (zstd, lz4, gzip, bzip2, xz) - only needed if using compressed archives
- Standard Unix utilities: du, awk, numfmt
- Optional for accurate ETA: pv (auto-detected)
- Optional for checksum verification: sha256sum or shasum, tee, mktemp
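If you want to confirm these tools are present before kicking off a long restore, a quick check along these lines works (this snippet is illustrative and not part of the script itself):

```bash
# Report any required or optional tool that is missing from PATH.
for cmd in bash curl tar du awk numfmt pv sha256sum; do
  command -v "$cmd" >/dev/null 2>&1 || echo "not found: $cmd"
done
```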
Quick start:

```bash
export RESTORE_SNAPSHOT=true
export URL="https://example.com/snapshot.tar"
export DIR="/data"
./stream-download.sh
```

Example Dockerfile:

```dockerfile
FROM alpine:3.22
RUN apk add --no-cache \
bash curl tar \
zstd lz4 gzip bzip2 xz \
coreutils pv ca-certificates dumb-init
COPY stream-download.sh /usr/local/bin/stream-download.sh
RUN chmod +x /usr/local/bin/stream-download.sh
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/bin/bash", "-c", "/usr/local/bin/stream-download.sh"]
```

Example Kubernetes init container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: snapshot-restore
spec:
  initContainers:
    - name: restore-snapshot
      image: your-image:latest
      env:
        - name: RESTORE_SNAPSHOT
          value: "true"
        - name: URL
          value: "https://snapshot.arbitrum.foundation/arb1/classic-archive.tar"
        - name: DIR
          value: "/storage"
        - name: SUBPATH
          value: "db"
        - name: TAR_ARGS
          value: "--strip-components=1"
      volumeMounts:
        - name: data
          mountPath: /storage
  containers:
    - name: main
      image: your-app:latest
      volumeMounts:
        - name: data
          mountPath: /storage
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: your-pvc
```

Configuration environment variables:

| Variable | Default | Description |
|---|---|---|
| RESTORE_SNAPSHOT | false | Set to true to enable snapshot restore |
| URL | - | Required (or use URLS). URL of the snapshot to download |
| URLS | "" | Comma-delimited list of URLs to download sequentially (takes precedence over URL) |
| DIR | - | Required. Absolute path to extract snapshot to |
| SUBPATH | "" | Subdirectory within DIR to extract to (e.g., db) |
| TAR_ARGS | "" | Additional arguments to pass to tar (e.g., --strip-components=1) |
| COMPRESSION | auto | Compression format: auto, none, gzip, bzip2, xz, zstd, lz4 |
| RM_SUBPATH | true | Remove SUBPATH directory before extraction (set to false to keep) |
| MAX_RETRIES | 10 | Number of retry attempts before giving up |
| STALL_MINUTES | 3 | Minutes of no progress before watchdog kills curl |
| CURL_SPEED_LIMIT | 102400 | Minimum bytes/sec before curl considers connection stalled |
| CURL_SPEED_TIME | 180 | Seconds at low speed before curl aborts |
| DEBUG | false | Set to true to enable verbose shell tracing |
| CURL_INSECURE | false | Set to true to skip TLS verification |
| CACERT | "" | Path to CA bundle for TLS verification |
| CURL_EXTRA_ARGS | "" | Extra arguments appended to curl (advanced use) |
| MULTIPART | false | Set to true for split archives (.partNNNN files concatenated as one stream) |
| CHECKSUM_SHA256 | "" | Expected SHA-256 of the downloaded stream (single URL or multipart whole-stream) |
| CHECKSUMS_SHA256 | "" | Comma-delimited SHA-256 checksums, positionally matching URLS (not supported with MULTIPART) |
| USE_PV | auto | Use pv for download ETA when available |
| STATUS_INTERVAL_SECONDS | 30 | Progress update interval in seconds |
| LOG_FORMAT | text | Logging format: text or json |
How it works:

```
┌──────┐    ┌──────────────┐    ┌─────┐    ┌────────────┐
│ curl │───▶│ decompressor │───▶│ tar │───▶│ /storage/* │
└──────┘    └──────────────┘    └─────┘    └────────────┘
    │              │               │             │
    └──────────────┴───────────────┴─────────────┘
                          │
                    ┌─────▼──────┐
                    │ monitors   │
                    │ watchdog + │
                    │ status     │
                    └────────────┘
```
- curl streams data from URL with connection monitoring
- decompressor (if needed) decompresses on-the-fly
- tar extracts files directly to disk
- monitors track progress and detect stalls
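Conceptually, the whole flow reduces to one shell pipeline. The sketch below illustrates the idea only; the actual script wraps this in retries, monitoring, and cleanup, and picks the decompressor dynamically:

```bash
#!/usr/bin/env bash
set -euo pipefail

URL="https://example.com/snapshot.tar.zst"   # illustrative values
DIR="/data"

# download | decompress | extract, all streaming, no temporary archive on disk
curl --fail --location --silent --show-error "$URL" \
  | zstd -d --stdout \
  | tar -xf - -C "$DIR"
```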
- Automatic retry - Up to 10 attempts with exponential backoff (10s, 20s, 30s...)
- Stall detection - Watchdog kills download if no progress for 3 minutes (configurable)
- Connection monitoring - curl aborts if speed drops below 100KB/s for 3 minutes
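Together, these mechanisms behave roughly like the following sketch, which uses the documented defaults; the real watchdog also tracks extracted bytes, which this simplified loop omits:

```bash
set -o pipefail

MAX_RETRIES="${MAX_RETRIES:-10}"
CURL_SPEED_LIMIT="${CURL_SPEED_LIMIT:-102400}"   # bytes/sec
CURL_SPEED_TIME="${CURL_SPEED_TIME:-180}"        # seconds below the limit before curl aborts

attempt=1
until curl --fail --location --silent --show-error \
        --speed-limit "$CURL_SPEED_LIMIT" --speed-time "$CURL_SPEED_TIME" \
        "$URL" | tar -xf - -C "$DIR"; do
  if (( attempt >= MAX_RETRIES )); then
    echo "Giving up after $MAX_RETRIES attempts" >&2
    exit 1
  fi
  sleep $(( attempt * 10 ))   # 10s, 20s, 30s, ... backoff
  attempt=$(( attempt + 1 ))
done
```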
With pv available and known file size (text mode):

```
Download: 45% | 278GiB / 613GiB | Speed: 245MiB/s | ETA: 23m
```

With pv available and known file size (JSON mode):

```json
{"ts":"2026-02-05T12:00:00Z","level":"info","event":"download","percent":45,"bytes":298521149440,"total":658280898560,"speed_bps":256901120,"eta_seconds":1380}
```

Stall warnings:

```
No progress detected for 1 minute(s) (278GiB extracted)
No progress detected for 2 minute(s) (278GiB extracted)
WATCHDOG: Detected stall for 3 minutes, killing download to trigger retry
```
Restore a snapshot into a subdirectory, stripping the leading path component:

```bash
export RESTORE_SNAPSHOT=true
export URL="https://snapshot.arbitrum.foundation/arb1/classic-archive.tar"
export DIR="/storage"
export SUBPATH="db"
export TAR_ARGS="--strip-components=1"
./stream-download.sh
```

Restore a compressed snapshot with an explicit format and a lower retry limit:

```bash
export RESTORE_SNAPSHOT=true
export URL="https://example.com/snapshot.tar.zst"
export DIR="/data"
export COMPRESSION="zstd" # or use "auto" to auto-detect
export MAX_RETRIES=5
./stream-download.sh
```

Download and extract multiple tar archives sequentially to the same directory:

```bash
export RESTORE_SNAPSHOT=true
export URLS="https://example.com/data.tar.zst,https://example.com/indexes.tar.gz"
export DIR="/storage"
export CHECKSUMS_SHA256="abc123...,def456..." # optional, positional match
./stream-download.sh
```

All archives must succeed before the completion stamp is written. If any URL fails, the script exits immediately and the next run retries all URLs.
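That all-or-nothing behaviour is roughly equivalent to the sketch below. The stamp file path shown here is invented for illustration; check the script for the real location:

```bash
# Hypothetical illustration of the multi-archive loop; not the script's actual code.
STAMP="$DIR/.restore-complete"   # assumed stamp location, for illustration only
IFS=',' read -r -a urls <<< "$URLS"

if [ ! -f "$STAMP" ]; then
  for url in "${urls[@]}"; do
    # any failure aborts the whole run; nothing is stamped as complete
    curl --fail --location --silent --show-error "$url" | tar -xf - -C "$DIR" || exit 1
  done
  touch "$STAMP"   # written only after every archive extracted successfully
fi
```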
For large snapshots split into multiple parts (e.g., .part0000, .part0001, ...), parts are downloaded sequentially and concatenated into a single stream for extraction.
Auto-detection from base URL:

```bash
export RESTORE_SNAPSHOT=true
export URL="https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar"
export DIR="/storage"
export MULTIPART=true
./stream-download.sh
```

The script probes for .part0000, .part0001, etc. via HEAD requests until a part returns non-200. You can also pass a URL that already includes the part suffix (e.g., archive-path.tar.part0000); the suffix is stripped automatically before probing.
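A simplified version of that probing loop looks like this (sketch only; the real script also deals with redirects, retries, and servers that reject HEAD):

```bash
BASE_URL="${URL%.part[0-9][0-9][0-9][0-9]}"   # strip any .partNNNN suffix the caller supplied
parts=()
i=0
while :; do
  part_url="$(printf '%s.part%04d' "$BASE_URL" "$i")"
  code="$(curl --silent --output /dev/null --head --write-out '%{http_code}' "$part_url")"
  [ "$code" = "200" ] || break          # first non-200 ends the probe
  parts+=("$part_url")
  i=$(( i + 1 ))
done
echo "Found ${#parts[@]} parts"
```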
Explicit part URLs:

```bash
export RESTORE_SNAPSHOT=true
export URLS="https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0000,\
https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0001,\
https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0002,\
https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0003,\
https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0004,\
https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0005,\
https://snapshot.arbitrum.io/arb1/2026-04-07-7a6e8fe1/archive-path.tar.part0006"
export DIR="/storage"
export MULTIPART=true
./stream-download.sh
```

Key differences from URLS without MULTIPART:
| Feature | URLS (multi-archive) | URLS + MULTIPART=true |
|---|---|---|
| Archives | Each URL is a separate archive | All URLs are parts of one archive |
| Extraction | Each extracted independently | Parts concatenated into single stream |
| Compression | Detected per URL | Detected from base name (strips .partNNNN) |
| Progress | Per-file | Tracks total across all parts |
| Checksums | Per-URL via CHECKSUMS_SHA256 | Whole-stream via CHECKSUM_SHA256 |
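The "parts concatenated into a single stream" behaviour boils down to something like this sketch, assuming the part URLs have already been resolved:

```bash
# Stream every part back-to-back into one tar extraction; no part is ever written to disk.
{
  for part_url in "${parts[@]}"; do
    curl --fail --location --silent --show-error "$part_url"
  done
} | tar -xf - -C "$DIR"
```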
Extract into a subdirectory while keeping any existing data in it:

```bash
export RESTORE_SNAPSHOT=true
export URL="https://example.com/snapshot.tar"
export DIR="/data"
export SUBPATH="database"
export RM_SUBPATH="false" # Don't delete existing data
./stream-download.sh
```

Check connection stability:

```bash
# Test download speed
curl -o /dev/null https://your-snapshot-url.tar

# Check if server supports HTTP keepalive
curl -I https://your-snapshot-url.tar | grep -i "keep-alive"
```

Increase retry attempts:

```bash
export MAX_RETRIES=20
```

The watchdog detects stalls after 3 minutes of no progress by default. Tune it:

```bash
export STALL_MINUTES=5  # Wait 5 minutes instead of 3
```

This script uses minimal space (extracts on-the-fly), but you need enough space for the extracted data. The script warns on startup if free space is less than the file size.

```bash
df -h /storage
```

This streaming approach cannot resume from a specific byte position. If the download fails, it restarts from the beginning.
Why? Tar archives must be read sequentially. Jumping to a mid-point causes tar to see garbage data and fail to extract correctly.
Mitigation:
- Automatic retries with exponential backoff
- Stall detection and auto-recovery
- Connection speed monitoring
- Most downloads succeed on first attempt with good internet
This streaming approach prioritizes:
- Minimal disk space usage
- Immediate file availability
- Simple, predictable behavior
The trade-off is that failed downloads restart from the beginning. However, with retry logic, stall detection, and connection monitoring, the vast majority of downloads complete successfully.
| Snapshot Size | Network Speed | Extraction Time |
|---|---|---|
| 100GB | 100Mbps | ~2.5 hours |
| 500GB | 100Mbps | ~12 hours |
| 1TB | 1Gbps | ~2.5 hours |
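These figures are plain bandwidth arithmetic plus some headroom for decompression and disk I/O. For example, the first row:

```bash
# 100 GB over a 100 Mbps link: bytes * 8 bits / link speed in bits per second
echo $(( (100 * 1000**3 * 8) / (100 * 1000**2) )) seconds   # 8000 s ≈ 2.2 h, ~2.5 h with overhead
```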
- Network - Usually the limiting factor
- Disk I/O - Can bottleneck on slow disks (HDD vs SSD)
- CPU - Decompression (zstd, lz4) can be CPU-intensive
- TLS verification is enabled by default; use CURL_INSECURE=true only when required
- Prefer CACERT=/path/to/ca-bundle.crt for custom CAs
- No authentication - assumes public snapshot URLs
- Optional SHA-256 verification via CHECKSUM_SHA256
- DIR must be an absolute path; SUBPATH must be relative (no ..)
- --compressed is intentionally omitted from curl to avoid double-decode on misconfigured CDNs
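How those TLS-related variables map onto curl options is presumably along these lines (an assumption for illustration; the exact flag handling lives in the script):

```bash
# Assumed mapping of CACERT / CURL_INSECURE onto curl flags; verify against the script.
curl_args=(--fail --location --silent --show-error)
[ "${CURL_INSECURE:-false}" = "true" ] && curl_args+=(--insecure)
[ -n "${CACERT:-}" ] && curl_args+=(--cacert "$CACERT")
# CURL_EXTRA_ARGS is appended last, so it can override earlier options
read -r -a extra <<< "${CURL_EXTRA_ARGS:-}"
curl "${curl_args[@]}" "${extra[@]}" "$URL" | tar -xf - -C "$DIR"
```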
Download ETA is computed from bytes received, not extracted. For accurate ETA:
- Ensure the server provides Content-Length or supports HTTP Range requests
- pv is used automatically when available and file size is known (USE_PV=auto)
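To check whether a given server exposes the size (and therefore whether percentage and ETA will be shown), a quick header probe is enough; the URL below is a placeholder:

```bash
# Does the server report the archive size?
curl --silent --head --location "https://example.com/snapshot.tar" | grep -i '^content-length'
```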
Set LOG_FORMAT=json for line-delimited JSON logs (useful for log aggregation):

```bash
export LOG_FORMAT="json"
```

Use environment variables to customize curl:

```bash
export CACERT="/path/to/ca-bundle.crt"
export CURL_EXTRA_ARGS="--retry 2 --retry-delay 5"
```

Verify the download stream with SHA-256:

```bash
export CHECKSUM_SHA256="abc123...yourchecksum..."
```

Changing the checksum re-triggers the download even if the stamp file exists.
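Because no temporary tar file exists, the hash has to be computed on the stream itself. Conceptually the verification works like the sketch below; the real script also synchronises the extraction side and handles multipart streams:

```bash
# Hash the raw download while simultaneously feeding it to tar via process substitution.
actual="$(curl --fail --location --silent --show-error "$URL" \
  | tee >(tar -xf - -C "$DIR") \
  | sha256sum | awk '{print $1}')"

if [ "$actual" != "$CHECKSUM_SHA256" ]; then
  echo "Checksum mismatch: expected $CHECKSUM_SHA256, got $actual" >&2
  exit 1
fi
```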
Tune stall detection threshold:

```bash
export STALL_MINUTES=5  # Wait 5 minutes instead of 3
```

To disable the watchdog entirely, comment it out in the stream_and_extract function:

```bash
# watchdog &
# WATCHDOG_PID=$!
```

For issues, questions, or contributions, please refer to your internal documentation or contact your DevOps team.
This project is based on the excellent init-stream-download tool by GraphOps. We've extended it to support additional compression formats while maintaining full backward compatibility with the original.