Universal agnostic web crawler with multi-level AI extraction.
Given a URL and a natural-language goal, the crawler auto-escalates through 4 extraction levels — CSS selectors → browser interactions → visual (screenshot + Claude Vision) → captcha — until enough data is extracted.
- L1 (CSS selectors): Claude reads raw HTML, discovers container and field selectors, validates the extraction, and produces a reusable `SiteProfile`
- L2 (Interactive): when L1 yields too few records, Claude suggests browser actions (clicks, scrolls, overlay dismissal) and the crawler executes them via dig2browser, then re-extracts
- L3 (Visual): when L2 still falls short, the crawler takes a screenshot, sends it to Claude Vision for coordinate-based actions (click at x,y), executes them, and re-extracts
- L4 (Captcha): architecture stub with a `CaptchaSolver` trait; not implemented until a real need arises
- Auto-navigation: if the crawler lands on a homepage or landing page without data, Claude finds the right sub-page (e.g. `/vps/`) and the crawler redirects automatically (up to 2 hops)
- SPA JSON extraction: detects `__NEXT_DATA__` (Next.js), `__NUXT_DATA__` (Nuxt), `window.__NUXT__`, and `window.__INITIAL_STATE__` before HTML stripping removes `<script>` tags; Claude can return `json_path` extraction mode instead of CSS selectors
- Detects anti-bot protection and extracts JSON-LD and page metadata as bonus context
- Saves a `SiteProfile` and exports a `DaemonSpec` for scheduled monitoring
Single-crate layout — all modules live under src/:
dig2crawl/
├── src/
│ ├── main.rs — CLI binary (clap) + escalation coordinator
│ ├── lib.rs — Library root
│ ├── core/ — Types, traits, error types, rate limiter, engine
│ ├── fetch/ — HttpFetcher, BrowserFetcher, interactive action executor
│ ├── agent/ — AgentSession, prompts, protocol, actions, visual, captcha
│ ├── parser/ — SelectorExtractor, JsonLdExtractor, AntiBotDetector, MetadataExtractor, LinkExtractor, SpaJsonExtractor
│ ├── config/ — TOML job config, SiteProfile, DaemonSpec serialisation
│ └── storage/ — SQLite + JSONL output backends
Discovery runs in 6 steps, followed by escalation when needed:
- Fetch: the page is fetched via HTTP or headless browser; the anti-bot check runs immediately
- Context extraction: SPA JSON blocks (`__NEXT_DATA__`, `__NUXT_DATA__`) are extracted before HTML cleaning; JSON-LD and page metadata are injected as bonus context
- Auto-navigation: if Claude determines the page is a homepage without target data, it returns a `navigate` response with the target URL; the crawler fetches the new page and restarts discovery (max 2 hops)
- Discovery (L1): Claude reads `page.html` from disk, analyses the DOM, and writes selectors plus a confidence score; for SPA sites it may return `json_path` extraction mode instead of CSS selectors
- Validation: the pure-Rust `SelectorExtractor` applies the discovered selectors; Claude reviews the sample
- Save: `SiteProfile` is written to `output/<domain>/profile.json`
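The context-extraction step depends on reading embedded SPA JSON before the `<script>` tags are stripped. A minimal sketch of that idea for the Next.js case follows; `extract_next_data` is a hypothetical helper written for illustration, and the real `SpaJsonExtractor` may locate the payload differently.

```rust
/// Returns the body of `<script id="__NEXT_DATA__" ...>` if the page has one.
/// Hypothetical helper: it assumes the id attribute is written with double
/// quotes and that the payload contains no literal `</script>`.
fn extract_next_data(html: &str) -> Option<&str> {
    let marker = "id=\"__NEXT_DATA__\"";
    let attr_pos = html.find(marker)?;
    // The JSON payload starts right after the `>` that closes the opening tag.
    let body_start = attr_pos + html[attr_pos..].find('>')? + 1;
    // It ends at the next closing script tag.
    let body_len = html[body_start..].find("</script>")?;
    Some(html[body_start..body_start + body_len].trim())
}
```

Calling this on the raw HTML first, and only then running the cleaning pass, keeps the payload available for a `json_path` style extraction.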
Escalation: if L1 yields fewer records than `--min-records` or a confidence below `--min-confidence`:
- L2 Interactive: Claude suggests `BrowserAction`s (Click, ScrollTo, DismissOverlay, Type, WaitForElement, etc.), the crawler executes them on a live browser page, then re-extracts with the same selectors
- L3 Visual: a screenshot is taken and sent to Claude Vision; Claude responds with coordinate-based `VisualAction`s (click at x,y), which are converted to browser actions and executed
- L4 Captcha: if anti-bot protection is detected and `--max-level 4` is set, the crawler prints a warning (the solver trait exists but is not implemented)
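The escalation rule above can be sketched as a small decision function. All type and function names here (`Level`, `Thresholds`, `Outcome`, `next_level`) are assumptions made for illustration, as is the choice to consider L4 only after L3 and only when anti-bot protection was detected; the coordinator in `main.rs` may be structured differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Css = 1,
    Interactive = 2,
    Visual = 3,
    Captcha = 4,
}

/// Mirrors --min-records, --min-confidence and --max-level.
struct Thresholds {
    min_records: usize,
    min_confidence: f64,
    max_level: u8,
}

/// Result of one extraction attempt.
struct Outcome {
    records: usize,
    confidence: f64,
}

fn is_sufficient(outcome: &Outcome, t: &Thresholds) -> bool {
    outcome.records >= t.min_records && outcome.confidence >= t.min_confidence
}

/// Returns the level to try next, or `None` when extraction should stop.
fn next_level(current: Level, outcome: &Outcome, t: &Thresholds, anti_bot: bool) -> Option<Level> {
    if is_sufficient(outcome, t) {
        return None; // enough data, no escalation needed
    }
    let candidate = match current {
        Level::Css => Level::Interactive,
        Level::Interactive => Level::Visual,
        // Assumption: the captcha level is only reached after L3 with anti-bot detected.
        Level::Visual if anti_bot => Level::Captcha,
        _ => return None,
    };
    // Never go past the configured ceiling.
    if candidate as u8 <= t.max_level {
        Some(candidate)
    } else {
        None
    }
}
```

With the documented defaults (`min_records: 1`, `min_confidence: 0.5`, `max_level: 3`), an L1 run that returns zero records moves to L2, while an L3 run that still falls short stops without attempting L4.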
After discovery, extract applies the saved profile in pure Rust — no agent needed.
dig2crawl uses a stealth headless browser (dig2browser) by default for all commands.
Supported browsers:
| Browser | Backend | Detection | Notes |
|---|---|---|---|
| Chrome | CDP | Auto (first priority) | Full stealth: UA override, Client Hints, canvas noise, WebGL spoof |
| Edge | CDP | Auto (second priority) | Same Chromium flags as Chrome |
| Firefox | BiDi | `"browser": "firefox"` in fingerprint | Stealth via `moz:firefoxOptions.prefs`, requires geckodriver |
- Uses CDP (Chrome/Edge) or BiDi (Firefox) with stealth scripts that patch `navigator.webdriver` and related fingerprint vectors
- Bypasses Cloudflare and other WAF challenges that block plain HTTP clients
- Waits for a CSS selector to appear before capturing HTML (`--wait-selector`)
- Auto-creates a persistent browser profile per domain at `%TEMP%/dig2crawl-profiles/<domain>/`
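As an illustration of the per-domain profile path, the sketch below derives the directory from a URL using only the standard library. `host_of` and `profile_dir` are hypothetical names, and the assumption that the host is used verbatim (no `www.` stripping, no IPv6 handling) is not confirmed by the source.

```rust
use std::path::PathBuf;

/// Extracts the host part of a URL without pulling in a URL crate.
/// Simplified: does not handle IPv6 literals such as `[::1]`.
fn host_of(url: &str) -> Option<&str> {
    // Drop the scheme if there is one.
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    // The authority ends at the first path, query or fragment delimiter.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    // Drop optional userinfo and port.
    let host = authority.rsplit_once('@').map(|(_, h)| h).unwrap_or(authority);
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Builds `<temp dir>/dig2crawl-profiles/<domain>/` for the given URL.
fn profile_dir(url: &str) -> Option<PathBuf> {
    let host = host_of(url)?;
    Some(std::env::temp_dir().join("dig2crawl-profiles").join(host))
}
```

For `https://yandex.cloud/ru/prices` this yields a path ending in `dig2crawl-profiles/yandex.cloud`, matching the layout described above.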
By default, dig2browser auto-detects Chrome → Edge. To force a specific browser, set "browser" in the fingerprint config.
Pass --http-only to fall back to a plain reqwest HTTP client — faster, sufficient for static or server-rendered pages.
Use --fingerprint <path> to load a JSON file that configures the browser fingerprint. All fields are optional — omitted fields use defaults (en-US locale, Standard stealth level, 1920x1080 viewport).
{
"browser": "chrome",
"level": "full",
"locale": "ru-RU",
"timezone": "Europe/Moscow",
"viewport": [1440, 900],
"hardware_concurrency": 4,
"device_memory_gb": 4,
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
}

| Field | Type | Default | Description |
|---|---|---|---|
| `browser` | string | `"auto"` | Browser: `"auto"`, `"chrome"`, `"edge"`, `"firefox"` |
| `level` | string | `"standard"` | Stealth level: `"basic"`, `"standard_no_webgl"`, `"standard"`, `"full"` |
| `locale` | string | `"en-US"` | BCP-47 locale tag |
| `timezone` | string | `null` | IANA timezone (e.g. `"Europe/Moscow"`) |
| `viewport` | `[w, h]` | `[1920, 1080]` | Screen resolution |
| `hardware_concurrency` | int | `8` | `navigator.hardwareConcurrency` |
| `device_memory_gb` | int | `8` | `navigator.deviceMemory` (GB) |
| `user_agent` | string | Chrome 131 | Full User-Agent string |
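The "omitted fields use defaults" behaviour can be pictured as a struct with a `Default` implementation. This is a sketch only: the `Fingerprint` struct, its Rust types, and the use of `None` for the built-in User-Agent are assumptions; only the field names and default values come from the table.

```rust
/// Hypothetical mirror of the fingerprint JSON file.
#[derive(Debug, Clone, PartialEq)]
struct Fingerprint {
    browser: String,
    level: String,
    locale: String,
    timezone: Option<String>,
    viewport: (u32, u32),
    hardware_concurrency: u32,
    device_memory_gb: u32,
    /// `None` stands for the built-in default (a Chrome 131 UA per the table).
    user_agent: Option<String>,
}

impl Default for Fingerprint {
    fn default() -> Self {
        Fingerprint {
            browser: "auto".to_string(),
            level: "standard".to_string(),
            locale: "en-US".to_string(),
            timezone: None,
            viewport: (1920, 1080),
            hardware_concurrency: 8,
            device_memory_gb: 8,
            user_agent: None,
        }
    }
}

/// Overrides only the fields a config file would set; the rest keep defaults.
fn russian_fingerprint() -> Fingerprint {
    Fingerprint {
        level: "full".to_string(),
        locale: "ru-RU".to_string(),
        timezone: Some("Europe/Moscow".to_string()),
        viewport: (1440, 900),
        ..Fingerprint::default()
    }
}
```

If the config is deserialized with serde, a container-level `#[serde(default)]` attribute on a struct like this would give the same fallback behaviour for missing keys.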
Firefox note: Firefox uses the BiDi protocol via geckodriver (must be running at http://localhost:4444). Stealth is applied via moz:firefoxOptions.prefs — no CDP-level overrides. set_extra_http_headers is not supported on Firefox.
The fingerprint applies to both auth (visible browser) and all headless commands. This ensures the auth session and subsequent crawling share the same fingerprint — critical for sites that compare session fingerprints against cookie fingerprints (e.g. Yandex SmartCaptcha).
# Discover selectors and produce a SiteProfile (browser by default)
dig2crawl discover <url> --goal "Extract VPS plans: name, price, cpu, ram, disk"
dig2crawl discover <url> --goal "..." --wait-selector "div.tariffs"
dig2crawl discover <url> --goal "..." --http-only --output-dir ./profiles/mysite
# Discover with escalation control
dig2crawl discover <url> --goal "..." --max-level 2 # stop at L2 (no visual)
dig2crawl discover <url> --goal "..." --min-records 5 # escalate until 5+ records
dig2crawl discover <url> --goal "..." --min-confidence 0.8 # escalate until 80% confidence
# Extract data using a saved profile (pure Rust, no agent)
dig2crawl extract <url> --profile output/<domain>/profile.json
dig2crawl extract <url> --profile output/<domain>/profile.json --max-pages 5 --output records.jsonl
# Export a DaemonSpec for scheduled monitoring
dig2crawl export-spec output/<domain>/profile.json --schedule "0 6 * * *" --output spec.json
dig2crawl export-spec output/<domain>/profile.json --schedule "0 6 * * *" --output spec.toml
# Cookie auth — separate binary, see below
cookie-auth <url>
cookie-auth <url> --profile %TEMP%/custom-profile # explicit profile path
cookie-auth <url> --fingerprint russian.json # custom fingerprint for auth
# Debug tools (browser by default, add --http-only for plain HTTP)
dig2crawl fetch <url> [--output page.html] [--metadata] [--jsonld] [--antibot]
dig2crawl test-selector <url> --selector "div.item" --field "title:h2.name" --field "price:.price"
dig2crawl collect-links <url> [--depth 2] [--domain-only]

Global flags:
- `--verbose` / `-v`: debug logging
- `--headed`: launch browser in visible (non-headless) mode
- `--browser-profile <PATH>`: explicit persistent profile directory (default: auto `%TEMP%/dig2crawl-profiles/<domain>/`)
- `--fingerprint <PATH>`: JSON fingerprint config (locale, timezone, viewport, stealth level, etc.)
- `--bot-auth <JWKS_URL>`: enable Web Bot Auth signing
- `--bot-key <PATH>`: Ed25519 private key for bot auth (default: `keys/bot.key`)
- `--model <MODEL>`: Claude model to use (default: `claude-sonnet-4-6`)
- `--max-level <N>`: maximum extraction level: 1=CSS, 2=interactive, 3=visual, 4=captcha stub (default: 3)
- `--min-records <N>`: minimum records before considering L1 successful (default: 1)
- `--min-confidence <F>`: minimum confidence threshold (default: 0.5)
Separate binary for sites behind captcha or login walls (e.g. Yandex SmartCaptcha). It opens a visible browser where you log in manually, then saves cookies to a persistent profile that subsequent discover/extract/fetch commands reuse.
# Build
cargo build --release --bin cookie-auth
# Step 1: Open browser, pass captcha, close the window
# Profile auto-created at %TEMP%/dig2crawl-profiles/yandex.cloud/
cookie-auth https://yandex.cloud/ru/prices --fingerprint russian.json
# Step 2: Use the saved profile for headless crawling (same fingerprint!)
dig2crawl discover https://yandex.cloud/ru/prices \
--goal "Extract cloud VM pricing" \
--fingerprint russian.json

| Flag | Description |
|---|---|
| `--fingerprint <PATH>` | JSON fingerprint config (browser, locale) |
| `--profile <PATH>` | Explicit profile directory (default: `%TEMP%/dig2crawl-profiles/<domain>/`) |
The `--fingerprint` flag ensures auth and headless sessions share the same browser fingerprint. Without it, the default fingerprint (en-US) is used for both.
Under the hood, cookie-auth calls dig2browser::cookies::open_auth_session_with_locale() which launches the browser with the same stealth args as headless mode. The profile directory is passed via --user-data-dir.
For quick browser testing without the full crawler, use dev-fetch from dig2browser — DevTools in your terminal:
cargo install dig2browser
dev-fetch https://cloud.vk.com/pricing --fingerprint russian.json --network-log --cookies --save-html out.html

See dig2browser README for full flag reference.
AgentSession wraps gate4agent PipeSession — the same library used by agent2overlay.
Pipe mode — Claude CLI runs in headless NDJSON-streaming mode:
- Prompts are delivered via stdin (no file intermediary, no cmd.exe argument length limits)
- Claude responds via stdout as NDJSON `stream-json` events (`PipeText`, `PipeToolStart`, `PipeSessionEnd`, etc.)
- The session ID is captured from `PipeSessionStart` events and passed via `--resume <session_id>` on subsequent calls, so L1 → validation → L2 → L3 all share the same conversational context
- `--dangerously-skip-permissions` enables tool use (Read, Grep, Bash) in one-shot mode
- Claude reads HTML/screenshot files referenced in the prompt via its Read tool (with `offset`/`limit` for large files)
HTML cleaning — before saving to disk, <script>, <style>, <svg>, <noscript> tags are stripped and whitespace collapsed (typically 70-90% size reduction). Claude reads the cleaned file in chunks if needed.
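A minimal sketch of such a cleaning pass, using only the standard library, is shown below. The function names are hypothetical and the approach (plain substring search) is an assumption, not the crate's actual implementation; in particular it does not special-case self-closing forms such as `<svg/>`.

```rust
/// Removes every `<tag ...>...</tag>` element, including its contents.
/// Matching is ASCII case-insensitive.
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{tag}");
    let close = format!("</{tag}>");
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(rel) = lower[pos..].find(&open) {
        let start = pos + rel;
        let name_end = start + open.len();
        // Require a delimiter after the tag name so `<style` does not match `<styled-x`.
        let after = lower.as_bytes().get(name_end).copied();
        if !matches!(after, Some(b'>' | b' ' | b'\t' | b'\n' | b'\r' | b'/')) {
            out.push_str(&html[pos..name_end]);
            pos = name_end;
            continue;
        }
        out.push_str(&html[pos..start]);
        pos = match lower[start..].find(&close) {
            Some(end_rel) => start + end_rel + close.len(),
            None => html.len(), // unterminated element: drop the remainder
        };
    }
    out.push_str(&html[pos..]);
    out
}

/// Collapses every run of whitespace into a single space.
fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Strips the four tag types named above, then collapses whitespace.
fn clean_html(html: &str) -> String {
    let mut cleaned = html.to_string();
    for tag in ["script", "style", "svg", "noscript"] {
        cleaned = strip_element(&cleaned, tag);
    }
    collapse_whitespace(&cleaned)
}
```

Because inline scripts and styles often dominate page weight, even this naive pass removes most of the bytes on script-heavy pages, which is what makes chunked reading by the agent practical.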
Temp directory layout during a discover run:
%TEMP%/dig2crawl_<pid>/
├── page.html — cleaned fetched HTML (scripts/styles stripped)
├── spa_data.json — extracted SPA JSON (if __NEXT_DATA__ / __NUXT_DATA__ found)
├── l2_page.html — post-action HTML (if L2 escalated)
└── l3_screenshot.png — page screenshot (if L3 escalated)
The directory is deleted when the session closes.
Robust JSON parsing — the agent protocol tolerates several classes of model output variation:
- Regex escape sequences in `transform` patterns (e.g. `\\d+`, `\\\s*`) are sanitized: the state machine consumes both characters of a valid JSON escape (`\\`, `\"`) before inspecting the next character
- `null` selectors for fields absent from the page are skipped gracefully, not treated as errors
- Untagged `next_urls` values are accepted as either a plain string or a `{"url": "..."}` object
- Mixed-type `url_patterns` and field specs: arrays and scalars are both accepted at every field
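The escape handling in the first bullet can be sketched as a single pass over the raw model output. `sanitize_escapes` is a hypothetical name and this is one possible reading of the behaviour described, not the crate's code. A known limitation of the approach: regex escapes that happen to be valid JSON escapes (for example `\b`) pass through unchanged and keep their JSON meaning.

```rust
/// Doubles backslashes that do not start a valid JSON escape, but only
/// inside string literals, so `"\d+"` becomes `"\\d+"`.
fn sanitize_escapes(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len() + 8);
    let mut in_string = false;
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // An unescaped quote toggles string state; escaped quotes are
            // consumed in the branch below and never reach this arm.
            '"' => {
                in_string = !in_string;
                out.push(c);
            }
            '\\' if in_string => match chars.peek().copied() {
                // Valid JSON escape: emit both characters and consume the
                // second one so it is not inspected again.
                Some(n) if matches!(n, '"' | '\\' | '/' | 'b' | 'f' | 'n' | 'r' | 't' | 'u') => {
                    out.push('\\');
                    out.push(n);
                    chars.next();
                }
                // Invalid escape such as `\d` or `\s`: double the backslash.
                _ => out.push_str("\\\\"),
            },
            _ => out.push(c),
        }
    }
    out
}
```

Consuming the second character of `\\` matters for inputs like `\\\s*`: the first two backslashes are kept as one valid escape, and only the third, which starts the invalid `\s`, is doubled.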
Tested against VPS hosting providers with auto-navigation and SPA extraction (v0.3.19):
| Provider | Confidence | Records | Notes |
|---|---|---|---|
| serverspace.io | 0.88 | 9 | L1 CSS selectors |
| ruvds.com | 0.93 | 3 | L1, browser mode |
| 4dedic.io | 0.92 | 4233 | L1, large catalog |
| robovps.biz | 0.91 | 7 | Auto-nav homepage → /vps/ |
| ishosting.com | 0.92 | 6 | Auto-nav ×2, through Cloudflare |
| timeweb.cloud | 0.90 | 4 | L3 visual (Nuxt SPA) |
| adminvps.ru | 0.88 | — | L1 CSS selectors |
Auto-navigation successfully finds VPS pricing pages from homepages. SPA sites (Next.js, Nuxt) are handled via embedded JSON extraction or L3 visual fallback.
output/<domain>/
└── profile.json
Example (output/adminvps.ru/profile.json):
{
"domain": "adminvps.ru",
"container_selector": "#tariffs .slider__block-content .swiper-slide .slider__item",
"fields": [
{ "name": "name", "selector": ".products-new__item-head-title", "extract": "text", "transform": "trim" },
{ "name": "price", "selector": ".slide-title p", "extract": "text", "transform": "parse_number" },
{ "name": "cpu", "selector": ".slide-params-list li:nth-child(1) .char", "extract": "text", "transform": "trim" }
],
"pagination": null,
"requires_browser": true,
"confidence": 0.9,
"validated": true
}

`output/` is gitignored.
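To make the `transform` values in the profile concrete, here is a sketch of how `trim` and `parse_number` might be applied to extracted text. The exact semantics of `parse_number` are not documented here, so the behaviour below (keep digits, treat a comma as a decimal point, drop everything else) is an assumption, as are the `Value` type and the `apply_transform` name.

```rust
#[derive(Debug, PartialEq)]
enum Value {
    Text(String),
    Number(f64),
}

/// Applies a named transform to raw extracted text.
/// Returns `None` for unknown transforms or unparseable numbers.
fn apply_transform(raw: &str, transform: &str) -> Option<Value> {
    match transform {
        "trim" => Some(Value::Text(raw.trim().to_string())),
        "parse_number" => {
            // Assumed semantics: keep digits and dots, map ',' to '.',
            // and discard currency symbols, spaces and other text.
            let digits: String = raw
                .chars()
                .filter_map(|c| match c {
                    '0'..='9' | '.' => Some(c),
                    ',' => Some('.'),
                    _ => None,
                })
                .collect();
            digits.parse::<f64>().ok().map(Value::Number)
        }
        _ => None,
    }
}
```

Under these assumptions a price cell such as `от 199,50 ₽/мес` parses to `199.5`, while a value with both thousands and decimal separators (`1,290.50`) is rejected, so a production version would need locale-aware handling.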
- Claude CLI: `npm install -g @anthropic-ai/claude-code`
- Chrome: required for `--browser` mode (auto-detected by dig2browser)
- Rust 1.75+

cargo build --release --bin dig2crawl

| Crate | Purpose |
|---|---|
| dig2browser | Stealth browser automation (CDP + BiDi) — crates.io |
| gate4agent | Claude CLI pipe session with NDJSON streaming — crates.io |
| scraper | CSS selector engine |
| reqwest | HTTP client |
| rusqlite | SQLite storage |
| clap | CLI framework |
| governor | Rate limiter |
| tokio | Async runtime |
If you find this tool useful, consider supporting development:
| Currency | Network | Address |
|---|---|---|
| USDT | TRC20 | TNxMKsvVLYViQ5X5sgCYmkzH4qjhhh5U7X |
| USDC | Arbitrum | 0xEF3B94Fe845E21371b4C4C5F2032E1f23A13Aa6e |
| ETH | Ethereum | 0xEF3B94Fe845E21371b4C4C5F2032E1f23A13Aa6e |
| BTC | Bitcoin | bc1qjgzthxja8umt5tvrp5tfcf9zeepmhn0f6mnt40 |
| SOL | Solana | DZJjmH8Cs5wEafz5Ua86wBBkurSA4xdWXa3LWnBUR94c |
MIT