feat(bench): bench robustness — TTL fixes, interleaved HOT rounds, path-set parity checks#369

Merged
githubrobbi merged 68 commits into main from feat/bench-preflight-robustness on Jun 8, 2026
@githubrobbi (Collaborator) commented Jun 7, 2026

Summary

Batch of bench-harness improvements accumulated on feat/bench-preflight-robustness.

Daemon TTL defaults (fix(bench))

  • HOT_TO_WARM_IDLE_SECS: 60 s → 600 s (10 min)
  • WARM_TO_PARKED_IDLE_SECS: 300 s → 1 800 s (30 min)
  • Previous 1 min / 5 min defaults caused spurious mid-bench demotions.
  • Both bench scripts (cold-parity-per-drive.rs, cross-tool-benchmark.rs) now start the daemon with bench-safe overrides scoped to the daemon child process:
    • UFFS_HOT_TO_WARM_IDLE_SECS=3600
    • UFFS_WARM_TO_PARKED_IDLE_SECS=7200
  • config.rs const-pin and lib.rs env-var doc table updated.
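
A minimal sketch of how overrides can be scoped to the daemon child process only, assuming the scripts launch `uffs daemon start` through `std::process::Command`. The names `BENCH_TTL_ENV` and `daemon_start_command` are hypothetical; the variable names and values are the ones listed above.

```rust
use std::process::Command;

/// Bench-safe TTL overrides quoted in this PR description.
const BENCH_TTL_ENV: [(&str, &str); 2] = [
    ("UFFS_HOT_TO_WARM_IDLE_SECS", "3600"),
    ("UFFS_WARM_TO_PARKED_IDLE_SECS", "7200"),
];

/// Build (but do not spawn) a daemon start command carrying the overrides.
/// `Command::envs` only changes the environment of the child, so the bench
/// process itself and any later daemon start keep the production defaults.
fn daemon_start_command(uffs_bin: &str) -> Command {
    let mut cmd = Command::new(uffs_bin);
    cmd.args(["daemon", "start"]).envs(BENCH_TTL_ENV);
    cmd
}
```

Calling `.spawn()` or `.status()` on the returned command would start the daemon with the longer TTLs.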

Interleaved HOT rounds with random tool order (feat(bench))

  • Previously all N rounds of UFFS ran first, then C++, then Everything — later tools benefited from OS FS cache warmed by earlier tools.
  • New structure: for each (sink, pattern), every round shuffles [uffs, cpp, es] with a fresh LCG seed, runs them in that order, then reports.
  • Zero new dependencies — LCG Fisher-Yates on 3 elements using glibc constants.

Path-set superset check per round (fix(bench))

  • Row counts can match by coincidence.
  • Each tool now writes to a separate per-pattern output file (bench_uffs_<pat>.csv etc.) so all three coexist within a round.
  • After every round (File sink): extract normalised path sets, verify:
    es.exe paths ⊆ uffs.com paths ⊆ uffs.exe paths
    
  • Violations printed immediately with first 3 missing paths as examples.
  • Files cleaned up after each round check.

Stage 3 / preflight CLI arg order fix (fix(bench))

  • DEFAULT_PATTERNS and preflight test fixture corrected to match actual uffs.exe CLI: pattern first, then --drives DRIVE, then --count.

Cold-parity and warm-up polish

  • cold-parity-per-drive.rs always restarts daemon with 3-pass warm-up.
  • Full COLD→WARM→HOT trajectory shown in warm-up output.

Testing

All CI gates pass: fmt, typos, reuse, lint-ci, lint-prod, lint-tests, rustdoc, doc-tests, tests, smoke, lint-ci-windows.

Covers prerequisites, the full Stage 0-5 guided/auto/dry-run/preflight
modes, bundle layout, resume, crash recovery (restore + verify), stage
and drive selection flags, publishing steps, and a troubleshooting
section. Wires into docs/benchmarks/README.md.
… is current

- Replace 'build from source' prereq with the actual default_binary()
  cascade (USERPROFILE\bin → target/release → PATH), matching all
  validation scripts. Add --bin flag override note.
- Confirm Everything CLI 1.1.0.30 is still the latest (voidtools.com
  downloads page checked 2026-06-07); no pin change needed.
- Add --bin to flags reference table.
…CLI 1.1.0.30

es.exe is a thin IPC wrapper; the GUI daemon version determines search
behaviour and is part of the benchmark environment.

- competitors.toml: add engine_version = "1.4.1.1032" field + comment
  explaining the CLI/engine split; note latest GUI download URL
- runbook.md: prereq step 4 now says install GUI engine first, then
  fetch-competitors for the ES CLI
- methodology.md: binary versions disclosure updated to list both
  GUI engine and CLI versions with explanation
es.exe per-drive scoping uses the bare drive letter and colon (e.g. C:)
with no trailing backslash, consistent with everything_capacity_probe.rs
L1+ convention. The backslash form worked incidentally but deviated from
the established pattern in the scripts.
…fail-fast on missing tools

- ToolVersion gains an exe field; render_md shows it as inline code alongside
  the version string so operators know exactly which binary was used
- ToolProbe gains version_line_prefix: uffs_cpp uses 'UFFS version:' to
  extract the semantic version line from the multi-line banner instead of
  the first (URL) line
- probe_tool: use map_or_else + bool::then idioms; rename single-char params
- check_es_available probes tasklist/pgrep to distinguish DaemonNotRunning
  (process absent) from DaemonStarting (process running but IPC not ready);
  solo_reason in matrix carries matching user-facing messages
- Orchestrator::capture fails fast with BenchError::MissingTools listing every
  tool whose version probe returned unknown so the run aborts before any
  measurements instead of silently degrading to UFFS-only
- run.rs split into run/mod.rs + run/tests.rs to stay under 800 LOC policy
- Tests: new preflight daemon_not_running/daemon_starting tests,
  probe_tool_version_line_prefix golden test, split dry_run_host/autopilot_host
  fixtures with documented call-order slots
resolve::es_exe: replace run-based PATH probe with where.exe/which lookup.
The old probe ran `es.exe -get-everything-version` which exits 0 even when
the Everything daemon is not running, so is_ok() was always true and the
function always returned bare "es.exe" instead of the full path. The new
probe calls where.exe (Windows) / which (Unix) to get the resolved absolute
path from the OS PATH, with ~/bin and Program Files as fallbacks.

tool_probe for everything: switch version arg from -get-everything-version
to -version. The -version flag prints the ES CLI version (e.g. 1.1.0.30)
without requiring the Everything daemon to be running, so the env fingerprint
always shows a real version string regardless of daemon state.

ToolProbe gains daemon_error_markers: if the combined output contains any
marker substring, probe_tool returns "not running" rather than raw error
text — guards against any remaining IPC-error bleed-through.

Remove stale crates/uffs-bench/src/run.rs (untracked leftover after the
run.rs -> run/mod.rs rename; was causing E0761 in doc tests).

Tests: add probe_tool_daemon_error_markers_returns_not_running golden test;
update dry_run_host/autopilot_host to reflect where.exe slot + -version output.
Each tool in the env fingerprint now carries a state field alongside its
version. State is determined by a lightweight StateProbe that runs a
process-presence command:

  uffs           — uffs daemon status, running_marker="running"
  uffs_cpp       — no daemon, state = n/a
  everything     — tasklist /FI "IMAGENAME eq Everything.exe" CSV, checks
                   for "Everything.exe" in output
  everything_gui — same tasklist probe for state, but version comes from
                   es.exe -get-everything-version (IPC) so the version
                   reflects the running daemon build. daemon_error_markers
                   turns IPC-error output into "not running" rather than
                   raw error text.

New resolve::everything_exe uses where.exe/which first (full PATH hit),
then %ProgramFiles(x86)%/ProgramFiles known locations, then bare fallback.

ToolProbe gains display_exe: Option<String> so the version probe binary
(es.exe) can differ from the path shown in the report (Everything.exe).

ToolVersion gains state: String; render_md now shows
  - **tool:** version (state: running) `/path/to/exe`

DEFAULT_TOOLS grows to [uffs, uffs_cpp, everything, everything_gui].
EVERYTHING_GUI_TOOL constant added to matrix.rs.

All 65 tests pass; no clippy warnings.
uffs tool probe:
- version_line_prefix: Some("uffs ") strips prefix so "uffs 0.5.117"
  becomes "0.5.117" in the fingerprint and report.
- state_probe removed: daemon (uffsd) is started/restarted by the bench
  itself; pre-run state is irrelevant, so it reports n/a.
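
A sketch of the prefix-based version extraction described above, under the assumption that the probe works on the captured `--version` text. `extract_version` is a hypothetical name, not the function in env.rs.

```rust
/// Pick the version string out of a tool's version output.
/// With a prefix: the first line that starts with it, prefix stripped.
/// Without a prefix: the first non-empty line, unchanged.
fn extract_version(output: &str, version_line_prefix: Option<&str>) -> Option<String> {
    let mut lines = output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty());
    match version_line_prefix {
        Some(prefix) => lines
            .find_map(|line| line.strip_prefix(prefix))
            .map(|rest| rest.trim().to_string()),
        None => lines.next().map(str::to_string),
    }
}
```

With `Some("uffs ")` the input `uffs 0.5.117` yields `0.5.117`; with `Some("UFFS version:")` a multi-line banner yields only the semantic version line.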

Missing-tool soft gate (Orchestrator::capture):
- When tools are not found (version = "unknown"), print a per-tool
  install hint (URL/instruction) via env::tool_install_hint.
- If >= 2 tools remain, show a confirm gate: proceed with available
  tools or abort and install first.
- Hard-fail only when < 2 tools are available.
- New cards::missing_tools_card builds the Card for the gate.

Tool-version table output:
- render_md now emits a padded GFM table instead of a bullet list.
  Columns: Tool | Version | State | Path
  Column widths computed from actual data so the table stays aligned
  regardless of exe path or version string length.
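
One way to compute column widths from the data, as a sketch only: `render_table` and `pad_row` are hypothetical helpers, and the real render_md may differ in padding and alignment rules.

```rust
/// Pad each cell to its column width and join the cells into one table row.
fn pad_row<'a>(cells: impl Iterator<Item = &'a str>, widths: &[usize]) -> String {
    let padded: Vec<String> = cells
        .zip(widths)
        .map(|(cell, &width)| format!("{cell:<width$}"))
        .collect();
    format!("| {} |\n", padded.join(" | "))
}

/// Render a GFM table whose columns are as wide as their longest cell.
fn render_table(headers: &[&str], rows: &[Vec<String>]) -> String {
    // Count chars, not bytes, so non-ASCII cells do not over-pad.
    // Start at 3 so the delimiter row always has at least three dashes.
    let mut widths: Vec<usize> = headers
        .iter()
        .map(|head| head.chars().count().max(3))
        .collect();
    for row in rows {
        for (width, cell) in widths.iter_mut().zip(row) {
            *width = (*width).max(cell.chars().count());
        }
    }
    let rules: Vec<String> = widths.iter().map(|&width| "-".repeat(width)).collect();
    let mut out = pad_row(headers.iter().copied(), &widths);
    out.push_str(&pad_row(rules.iter().map(String::as_str), &widths));
    for row in rows {
        out.push_str(&pad_row(row.iter().map(String::as_str), &widths));
    }
    out
}
```

Char counts are only an approximation of display width; wide glyphs such as emoji can still shift a column in a terminal.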

All 65 tests pass.
render_md now shows only Tool | Version | Path. The state field on
ToolVersion is preserved — it will be used downstream to decide how
to proceed (e.g. whether the daemon needs starting before a run).
Instead of a separate warning block, missing tools now appear as
ordinary rows in the Tool versions table:

  | uffs_cpp | ⚠️ not found | https://...install-url... |

Version cell: ⚠️ not found
Path cell: install URL / artifact location from tool_install_hint()

Column widths are computed from the hint string length so the table
stays aligned. No pre-amble host.out() noise before the confirm gate.
tool_install_hint import removed from run/mod.rs (only used in env.rs).
- uffs_cpp: direct download of uffs.com v1.0.0 release artifact
- everything (ES CLI): voidtools CLI downloads page
- everything_gui (Everything.exe): voidtools installer page
dev profile (unoptimized) is not appropriate for the benchmark harness
itself — it runs the timing loops and process orchestration. All bench-*
recipes now use `cargo run --release`.
Move render_md() + matrix::render_md() from run_stage0 into capture()
so the tool-version table and matrix are shown immediately after probing,
before either the missing-tool gate or the Stage 0 plan gate fires.

Previously the table was invisible when tools were missing because
run_stage0 never ran — the missing-tool confirm fired first with no
context shown to the operator.
- show_full: blank lines between sections, skip empty commands header,
  render resources one-per-line, expand prompt to spaced [y]/[a]/[b]/[q]
- missing_tools_card: title now reads 'Benchmarking UFFS and ES — proceed
  or quit?'; resources list available tools with checkmarks; takes
  available &[&str] param so run/mod.rs passes real tool names
everything + everything_gui → 'Everything' (one product, two probes)
uffs → 'UFFS', uffs_cpp → 'UFFS (C++ ref)'

Card title/resources now read 'Benchmarking UFFS and Everything' instead
of listing raw probe IDs. Deduplication via unique_product_names().
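
A sketch of the probe-ID to product-name mapping with order-preserving deduplication. The mapping is the one listed above; the bodies of `product_name` and `unique_product_names` are assumptions about how it could be written.

```rust
/// Map a raw probe ID to the product name shown to the operator.
fn product_name(probe_id: &str) -> &str {
    match probe_id {
        "everything" | "everything_gui" => "Everything",
        "uffs" => "UFFS",
        "uffs_cpp" => "UFFS (C++ ref)",
        other => other, // unknown IDs pass through unchanged
    }
}

/// Product names in first-seen order, each listed once.
fn unique_product_names<'a>(probe_ids: &[&'a str]) -> Vec<&'a str> {
    let mut names: Vec<&'a str> = Vec::new();
    for &id in probe_ids {
        let name = product_name(id);
        if !names.contains(&name) {
            names.push(name);
        }
    }
    names
}
```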
…ore any gate

capture() now prints: env fingerprint → matrix → (optional missing-tool
gate) → returns. run_stage0 no longer re-prints the matrix; it goes
straight to the plan-gate card. Single, unambiguous output sequence.
Previously the gate only fired when tools were missing. Now it always
fires so the operator confirms (or in future deselects) which products
will be benchmarked — UFFS, Everything, etc — before any measurement.

- Rename missing_tools_card → tool_selection_card(available, missing)
- All-present case: 'Benchmark X and Y — confirm tool selection'
- Missing case: 'Benchmark X — proceed or quit to install missing first?'
- Card id changed to 'tool-selection'
- Prompt line now reads '[Enter/y] proceed' making the default obvious
- After read_key returns a terminal decision, echo '[Enter] -> proceeding'
  / '[y] -> proceeding' / '[q] -> aborting' etc. so the user sees their
  choice reflected before the next output appears
- Enter (\n/\r) already mapped to Proceed; just needed the UX hint
…ction

- Add uffs_record_count to DrivePreflight (UFFS full-scan count per drive,
  always populated regardless of ES state)
- Add es_ram_budget_bytes to PreflightSpec and MatrixSpec (default: 50% of
  system RAM, derived from new EnvFingerprint::ram_bytes field)
- Add UFFS_BYTES_PER_RECORD = 100 constant (voidtools: ~100 MB / 1M files)
- Replace simple everything_serves() capable-drive check with greedy RAM-
  budget selector: sort drives by uffs_record_count ascending, accumulate
  until es_ram_budget_bytes exceeded -- avoids OOM-ing Everything's index
- Add render_drive_table() to preflight: GFM table showing Drive, UFFS
  records, Est. RAM, ES status, ES capable (checkmark = fits in budget)
- Print drive inventory table between tool-selection gate and matrix
- Update all tests for new struct fields and revised mock call sequences
- fix: sort_unstable on char, alloc BTreeSet, integer-only fmt_ram,
  doc on probe_drive, rename single-char ident in fmt_count
…table

- daemon_start_if_needed(): fired at top of capture() before env probe;
  checks 'uffs daemon status' for Status:Ready; if not ready, fires
  'uffs daemon start' and returns immediately so index loads in parallel
  with env capture + tool-selection gate
- ensure_daemon_ready(): called only right before preflight::capture where
  UFFS drive counts are needed; polls 'uffs daemon status' up to 3 min
  (90x2s); prints live status line each tick so operator sees progress;
  hard-fails with BenchError::Command if never reaches Status:Ready
- Extract both helpers + constants to run/daemon.rs to keep mod.rs under
  the 800 LOC file-size policy limit
- Update run/tests.rs mock sequences for the two new daemon status calls
  and the full preflight drive-probe chain (es availability + uffs count
  + es result-count per candidate drive)
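
The bounded poll (90 attempts, 2 s apart) can be sketched as below. The status source and the sleep are injected so the loop is testable without a daemon; `wait_until_ready` is a hypothetical name, and the `Status:Ready` marker is taken literally from the message above, so the exact daemon text is an assumption.

```rust
use std::time::Duration;

const POLL_ATTEMPTS: u32 = 90;
const POLL_INTERVAL: Duration = Duration::from_secs(2);

/// Poll `status` until its output contains the ready marker.
/// Returns the attempt number that succeeded, or an error after the
/// last attempt. No sleep happens after the final failed attempt.
fn wait_until_ready(
    mut status: impl FnMut() -> String,
    mut sleep: impl FnMut(Duration),
    attempts: u32,
    interval: Duration,
) -> Result<u32, String> {
    for attempt in 1..=attempts {
        let output = status();
        if output.contains("Status:Ready") {
            return Ok(attempt);
        }
        // Show the operator a live status line on every tick.
        println!("[{attempt}/{attempts}] {}", output.trim());
        if attempt < attempts {
            sleep(interval);
        }
    }
    Err(format!("daemon not ready after {attempts} polls"))
}
```

In production the closures would wrap the real status command and `std::thread::sleep`, called with `POLL_ATTEMPTS` and `POLL_INTERVAL`.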
…ve IPC

- Replace uffs_drive_count() (ran 'uffs <D>:\ * --count' per drive) with
  parse_daemon_status_drives() which parses the 'uffs daemon status' stdout
  emitted in capture() once; drives absent from output get count=0
- daemon_status_output() helper runs 'uffs daemon status' once in capture()
  and distributes the BTreeMap<char,u64> to probe_drive() via parameter
- Rename 'ES status' column to 'ES index' in render_drive_table() — the
  column reflects whether ES has the drive indexed, not daemon run-state
- Add parse_daemon_status_drives_extracts_counts test covering the real
  status format with comma-separated counts and em-dash separators
- Update all preflight and run/tests.rs mocks: replace per-drive count
  mocks with single daemon status mock (one fewer IPC call per drive)
Parked drives appear in 'uffs daemon status' without a record count:
  [Parked] G: — bloom + trie kept resident; body released

warm_parked_drives() fires 'uffs daemon preload <DRIVE>' for each
candidate drive absent from the initial warm-map, then re-reads
'uffs daemon status' once so the drive appears as [Warm] with a live
record count.  Preload failure is non-fatal (count stays 0).

- Add warm_parked_drives() helper in preflight/mod.rs
- capture() calls it between first status read and probe_drive loop;
  second status read only runs when at least one drive was absent
- Add a [Warm] line for drive C to DAEMON_READY_STATUS in run/tests.rs so
  the C test drive is already warm (no preload injected there)
- Add parked_drive_is_preloaded_and_count_populated test
- Update unconfigured_drive_probed_once_without_sleep: mock preload +
  second status; assert 5 Run calls instead of 3
- Extract preflight test module to preflight/tests.rs (mod.rs 559 LOC,
  tests.rs 295 LOC — both under 800 LOC policy limit)
…ives

Adds parse_daemon_status_drives_handles_hot_and_parked_tiers using real
daemon output with mixed tiers:
- [Parked] drives (no record count line) → absent from map
- [Hot]    drives → parsed identically to [Warm] (same 'N records (live)' format)
- [Warm]   drives → baseline already covered, now asserted alongside Hot
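
A rough sketch of a tier-aware status parser. The line shape is an assumption pieced together from these commit messages: a bracketed tier tag, a drive letter with colon, a separator, then `N records (live)` with comma-grouped digits on the same line. The real `uffs daemon status` output may differ, and this function only shares its name with the one in preflight.

```rust
use std::collections::BTreeMap;

/// Map drive letter -> record count for every tier line that carries a
/// "<N> records" token. Lines without a count (e.g. Parked) are skipped.
fn parse_daemon_status_drives(status: &str) -> BTreeMap<char, u64> {
    let mut counts = BTreeMap::new();
    for line in status.lines() {
        // Expect "[Tier] X: ..." and ignore everything else.
        let Some((_, rest)) = line
            .trim()
            .strip_prefix('[')
            .and_then(|tail| tail.split_once(']'))
        else {
            continue;
        };
        let rest = rest.trim_start();
        let mut chars = rest.chars();
        let (Some(drive), Some(':')) = (chars.next(), chars.next()) else {
            continue;
        };
        if !drive.is_ascii_alphabetic() {
            continue;
        }
        // The count is the token immediately before "records".
        let tokens: Vec<&str> = rest.split_whitespace().collect();
        let Some(pos) = tokens.iter().position(|token| *token == "records") else {
            continue;
        };
        if pos == 0 {
            continue;
        }
        let digits: String = tokens[pos - 1].chars().filter(|ch| *ch != ',').collect();
        if let Ok(count) = digits.parse::<u64>() {
            counts.insert(drive.to_ascii_uppercase(), count);
        }
    }
    counts
}
```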
ES OOMs empirically at ~1.3 GiB (C+D+E); C+D=998 MiB works fine.
Replace ram_bytes/2 heuristic (~32 GiB on 64 GiB host, meaningless) with
ES_RAM_BUDGET_BYTES = 1_073_741_824 (1 GiB conservative ceiling).

Drive table changes:
- Title: 'Drive inventory' -> 'ES RAM budget'
- Drop 'ES capable' column (every drive is individually capable; the
  column was redundant with Fits budget)
- Rename 'ES status' column var es_status -> es_index (already done in
  header; align the local variable name)
- Add 'Fits budget' column: greedy smallest-first fill up to 1 GiB cap
- Add blockquote footer: 'ES RAM budget: X used of 1 GiB cap (N drives)'

Expected output for the 6-drive run:
  ### ES RAM budget
  | Drive | UFFS records | Est. RAM | ES index    | Fits budget   |
  | C     |    3,409,074 |  325 MiB | not running | ✓             |
  | D     |    7,066,034 |  673 MiB | not running | ✓             |
  | E     |    2,929,741 |  279 MiB | not running | ✗ over budget |
  ...
  > ES RAM budget: 998 MiB used of 1024 MiB cap (2 drive(s) fit)
Smallest-first maximized drive count within budget but violated operator
intent: running 'just bench-suite --drives C,D,E,F,M,S' should fill
C first, then D (C+D=998 MiB ≤ 1 GiB ✓), then E tips over the cap.

Both render_drive_table() and ram_budget_capable_drives() now iterate
drives in the order the operator specified them (candidate order).
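
The candidate-order fill can be sketched as follows, using the two constants named in this PR. Stopping at the first drive that exceeds the cap is an assumption (the alternative would be to skip it and keep trying later drives), as is treating a budget of 0 as unlimited here.

```rust
const UFFS_BYTES_PER_RECORD: u64 = 100;
const ES_RAM_BUDGET_BYTES: u64 = 1_073_741_824; // 1 GiB

/// Accept drives in the order given until the estimated RAM would exceed
/// `budget_bytes`. Each candidate is (drive letter, UFFS record count).
fn ram_budget_capable_drives(candidates: &[(char, u64)], budget_bytes: u64) -> Vec<char> {
    let mut used: u64 = 0;
    let mut capable = Vec::new();
    for &(drive, records) in candidates {
        let estimate = records.saturating_mul(UFFS_BYTES_PER_RECORD);
        let total = used.saturating_add(estimate);
        if budget_bytes != 0 && total > budget_bytes {
            break; // assumption: the first overflow ends the fill
        }
        used = total;
        capable.push(drive);
    }
    capable
}
```

With the counts from the table above, C and D total 1,047,510,800 bytes (about 998 MiB) and fit; adding E would reach 1,340,484,900 bytes, which is over the 1 GiB cap.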
Candidate drives passed via --drives that the UFFS daemon has no record of
(neither Warm/Hot/Parked nor recoverable via preload) were shown as rows
with 0 records. This fabricates status for non-existent drives.

After preload + second status re-read, filter candidates to only those
present in uffs_counts. Unknown drives emit a WARNING and are skipped
entirely — no table row, no ES probe, no budget allocation.

Update test: rename unconfigured_drive_probed_once_without_sleep to
drive_unknown_to_daemon_is_skipped and assert drives.is_empty() with
4 Run calls (no es-probe on a dropped drive).
…tate

ES running state (loaded+hot) should not prevent a drive from being listed
as capable. Whether Everything is currently running is an execution-time
concern; the capacity decision belongs to the RAM budget alone.

- Remove everything_serves() — its only caller was ram_budget_capable_drives
- ram_budget_capable_drives: drop the everything_serves() filter; accept
  all candidate drives in order up to the RAM cap (budget=0 = unlimited)
- compute_matrix: per-cell cross vs solo remains gated on es_feasible
  (the feasibility cell check), which correctly handles not-running ES
- Update tests: rename and correct assertions to reflect new semantics
  * everything_only_serves_loaded_hot_drives -> ram_budget_gates_capable_drives_not_es_running_state
  * configured_but_indexing_drive_is_uffs_only -> indexing_drive_is_capable_but_cell_is_uffs_only
… 0-record drives

preflight/mod.rs:
- Also skip drives with 0 records in the UFFS daemon index (warn + skip),
  not just drives absent entirely. A 0-record drive is not usefully indexed.

matrix.rs:
- ram_budget_capable_drives and compute_matrix now filter to only drives
  confirmed in preflight.drives (candidate drives that were absent or had
  0 records were already dropped during capture and must not reappear).
- Rename single-char closure params (d, dp) to satisfy min_ident_chars lint.
- Update tests: without_everything adds preflight records; budget test
  asserts F/M/S absent from preflight produce capable=[C,D,E] solo=[E].
The negotiated matrix now appears before the ES launch confirmation so
the operator sees the full plan (capable drives, UFFS-only reasons) and
can make an informed decision. The redundant post-launch matrix render
is dropped — the plan gate that follows immediately shows the locked plan.

Also: add ES_STARTUP_GRACE_MS (5s) before first poll, increase kill grace
to 3s, log full spawn command, show per-drive counts in poll messages.
1. Skip preload for drives UFFS has never heard of (H, I)
   Add parse_daemon_known_drives() which captures all drive letters from
   bracketed tier lines ([Warm], [Hot], [Parked]) regardless of record
   count. warm_parked_drives now skips any candidate not in known_drives —
   completely unknown drives (H, I) no longer trigger a preload attempt
   or a spurious second daemon-status call.

2. ES startup/poll diagnostics
   Add ES_STARTUP_GRACE_MS (5 s) sleep before first IPC poll, increase
   kill grace to 3 s, log full spawn command, show per-drive counts in
   every poll message [C:N D:N G:N].
…cond pass

Before ES launch gate: show only capable drives (not the UFFS-only
cell list which was all 'ES not started/starting' noise). The full
matrix with cross-tool cells is shown after second-pass preflight.

Second-pass preflight: restrict candidate_drives to the drives that
survived first-pass UFFS filtering so H/I and other unknown drives
never generate warnings on the re-probe.
The operator already confirmed tool selection and the ES launch by the
time PLAN was presented.  Replace the confirm loop with a direct write:
stage0 artifacts are written silently and the done panel is shown.
Dry-run mode is preserved: skips the write and shows the noop result.
tool_selection_card now takes a step_total parameter.  When Everything
is available the tool-selection card shows step 1/2 and the ES launch
card shows 2/2.  When Everything is not installed both are absent and
step_total is 1 (1/1).
Add Host::run_streaming — inherits parent stdout/stderr via
Command::status() so the child's output flows to the operator in
real time rather than being buffered until process exit.

Stage 1 (cross-tool) and Stage 2 (parity) now use run_streaming so
the operator sees benchmark progress during a multi-minute harness run.
Stage 3 already emits progress via host.out() and is unchanged.

MockHost records RunStreaming calls; SystemHost uses Command::status().
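
For reference, a minimal sketch of the streaming behaviour using only the standard library. `run_streaming` is written here as a free function; the Host trait shape is not shown in this PR text.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Run a child process and wait for it to exit. `Command::status()` leaves
/// stdin/stdout/stderr inherited from the parent by default, so the child's
/// output reaches the operator's terminal as it is produced instead of
/// being buffered the way `Command::output()` buffers it.
fn run_streaming(program: &str, args: &[&str]) -> io::Result<ExitStatus> {
    Command::new(program).args(args).status()
}
```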
The cross-tool script has no everything-gui tool token — it only knows
uffs, uffs-cpp, and everything (es.exe).  everything_gui (Everything.exe)
is the GUI process used for ES indexing, not a CLI benchmarking tool.

harness_tool now returns Option<String> and returns None for
everything_gui/everything-gui; cross_tool_invocation uses filter_map
to silently exclude it from the --tools list passed to the harness.
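
A sketch of the Option-returning mapper combined with filter_map. Converting underscores to hyphens for the harness token and joining with commas are assumptions; `harness_tools_arg` is a hypothetical helper.

```rust
/// Translate a probe ID into the token the cross-tool harness understands.
/// The GUI probe has no harness token, so it maps to `None`.
fn harness_tool(probe_id: &str) -> Option<String> {
    match probe_id {
        "everything_gui" | "everything-gui" => None,
        other => Some(other.replace('_', "-")),
    }
}

/// Build the value for the harness `--tools` list, dropping IDs with no token.
fn harness_tools_arg(probe_ids: &[&str]) -> String {
    probe_ids
        .iter()
        .filter_map(|id| harness_tool(id))
        .collect::<Vec<_>>()
        .join(",")
}
```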
…stance

When the bench launches Everything.exe with -instance uffs-bench, the
default IPC window is absent and every es.exe call returns 'Error 8:
Everything IPC window not found'.

cross-tool-benchmark.rs: add --es-instance <name> arg; prepend
  -instance <name> to all es.exe invocations in run_es when set.
StageCfg: add es_instance_name field (None = system instance).
stage_cfg(): populate from cap.es_ini_path — Some(INSTANCE_NAME)
  when the bench launched its own private Everything.exe.
cross_tool_invocation(): pass --es-instance to the harness when set.
--skip-stages 1,2  skips those stages without prompting, emitting
'-> STAGE N: ... skipped (--skip-stages)' for each.  Useful during
development to iterate quickly without waiting for multi-minute runs.

Complements --only-stage and --from-stage; all three compose correctly.
s/S was already wired to Decision::Skip in interpret_key but was not
shown in the prompt line, leaving operators unaware of the option.
…upported

uffs.com returns nothing/errors for trailing-wildcard glob (win*).
Use empty cpp_pattern as sentinel to skip C++ for that pattern cell.
UFFS Rust vs Everything head-to-head is preserved; C++ prints SKIP.
find_p50 already returns "SKIP" for absent rows so summary is correct.
Remove daemon_start_if_needed, daemon_is_ready, uffs_needs_restart —
the daemon is always killed and restarted (not conditionally) so
measurements are confined to the negotiated drive set regardless of
what was loaded before.

daemon.rs: drop unused functions; keep kill_and_restart_with_drives
  + ensure_daemon_ready only.
run/mod.rs: remove early daemon_start_if_needed call from capture();
  uffs_needs_restart flag is now always !capable_drives.is_empty().
run/tests.rs: update mock call sequences to match new flow (no early
  start; dry-run skips kill/restart via ProceedNoop; autopilot queues
  daemon kill + start --drive C + poll + second-pass preflight).
run/bootstrap.rs: extract resolve_bundle_dir, tool_disposition,
  run_fetch_competitors, load_or_new_state to bring mod.rs under
  the 800-line policy limit (816 -> 745 LOC).
…ate instance

When es.exe returns an IPC error (Error 8) but the Everything.exe
process is in the tasklist, the instance is running — just not the
default one es.exe knows how to address (e.g. our private bench
instance). Reporting 'not running' in that case was misleading.

env.rs: compute state_probe before version probe so the result is
  available without a second tasklist call; in the daemon_error_markers
  branch return 'ipc unavailable' when state == 'running', 'not
  running' only when the process is genuinely absent.
env.rs tests: add ipc_error_with_process_running_reports_ipc_unavailable
  and ipc_error_with_process_absent_reports_not_running to pin both
  branches.
run/tests.rs: update mock call order — state (tasklist) now fires
  before the version probe for both 'everything' and 'everything_gui'.
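
The two branches can be sketched as a small pure function. `version_or_state` is a hypothetical name; the three inputs stand in for the probe output, the daemon_error_markers list, and the tasklist-derived state.

```rust
/// Decide what to report for a tool whose version probe may have hit an
/// IPC error: the version text, "ipc unavailable", or "not running".
fn version_or_state(output: &str, daemon_error_markers: &[&str], process_running: bool) -> String {
    let ipc_error = daemon_error_markers
        .iter()
        .any(|&marker| output.contains(marker));
    match (ipc_error, process_running) {
        (false, _) => output.trim().to_string(),
        // The process exists but is not reachable on the default instance.
        (true, true) => "ipc unavailable".to_string(),
        (true, false) => "not running".to_string(),
    }
}
```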
Step numbers were wrong in two ways:

1. tool_selection_card used preflight_steps = if es_available { 2 } else
   { 1 }, missing the UFFS restart step entirely — showed '1/2' when
   there were actually 3 steps.

2. uffs_restart_card computed uffs_step_num as preflight_steps - es_step
   which evaluated to 0 when both gates were active — showed '0/3'.

Fix: compute preflight_steps in capture() as:
  1 + u32::from(!cli.drives.is_empty()) + u32::from(es_available)
using cli.drives as a conservative proxy for 'UFFS restart will fire'
(if no drives are specified, the matrix will be empty and the restart
gate is skipped anyway).

Hardcode uffs_step_num = 2 (tool selection is always step 1).

In execute(), compute total_steps from the actual post-matrix flags
cap.uffs_needs_restart and cap.es_needs_launch for precise ES launch
step numbering.
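
The step-count formula, isolated as a sketch. The function name matches the variable used above, but the `&[char]` drive type is an assumption.

```rust
/// Step 1 (tool selection) always exists; the UFFS restart and ES launch
/// gates each add one step when they will fire.
fn preflight_steps(drives: &[char], es_available: bool) -> u32 {
    1 + u32::from(!drives.is_empty()) + u32::from(es_available)
}
```

With drives given and Everything available this yields 3, so the cards read 1/3, 2/3, 3/3 rather than the earlier '1/2' and '0/3'.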
Before negotiation the daemon may have been restricted to a previous
drive set, causing drives to be silently missing from the first preflight
probe and producing a narrower matrix than the candidate set.

daemon.rs: add kill_and_restart_all_drives() -- same pattern as
  kill_and_restart_with_drives but with no --drive args so uffsd
  self-discovers every available NTFS drive.

mod.rs: at the start of capture(), if drives are specified:
  1. kill_and_restart_all_drives (background warm-up)
  2. run env probes + show tool-selection gate (parallel with load)
  3. ensure_daemon_ready (blocks until index ready)
  4. first preflight (now sees the full drive set)
  The existing gated step-2 restart (--drive <capable>) is unchanged.

tests.rs: update both mock sequences to include the two new run calls
  (daemon kill + daemon start) before env probes, and the additional
  ensure_daemon_ready poll before first preflight.
parity_invocation() was passing cfg.drives (-Drives C,D,E,F,G,H,I,M,S)
to the PowerShell harness instead of cfg.capable_drives (the negotiated
subset all tools can serve, e.g. C,D,G).

Also fix the Stage 2 plan card: the cache resource label and backup note
were also referencing cfg.drives for the drop-cache path.
Replaces the PowerShell cold-parity-per-drive.ps1 with a self-contained
rust-script that uses the same binary-resolution cascade as uffs-bench:

  1. explicit --uffs-bin / --cpp-bin
  2. %USERPROFILE%\bin\uffs.exe / uffs.com
  3. target\release\uffs.exe  (Rust only)
  4. bare PATH fallback

If uffs.exe is missing the script exits with an actionable error and the
release download URL.  If uffs.com is missing it prints a warning with the
C++ download URL and continues with the Rust-only column.
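
The four-step cascade can be sketched with an injected existence check so it is testable off-Windows. `resolve_bin` and its parameters are hypothetical; the real script presumably reads %USERPROFILE% and probes the filesystem directly.

```rust
use std::path::{Path, PathBuf};

/// Resolve a binary: explicit flag, then <profile>/bin, then (Rust binary
/// only) <repo>/target/release, then the bare name for a PATH lookup.
fn resolve_bin(
    explicit: Option<&Path>,
    user_profile: Option<&Path>,
    repo_root: &Path,
    file_name: &str,
    check_target_release: bool,
    exists: impl Fn(&Path) -> bool,
) -> PathBuf {
    if let Some(path) = explicit {
        return path.to_path_buf(); // step 1: operator override wins
    }
    let mut candidates: Vec<PathBuf> = Vec::new();
    if let Some(home) = user_profile {
        candidates.push(home.join("bin").join(file_name)); // step 2
    }
    if check_target_release {
        candidates.push(repo_root.join("target").join("release").join(file_name)); // step 3
    }
    candidates
        .into_iter()
        .find(|path| exists(path.as_path()))
        .unwrap_or_else(|| PathBuf::from(file_name)) // step 4: bare PATH fallback
}
```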

ES / Everything is explicitly out of scope: the C++ binary re-reads all
MFTs on every invocation so it cannot sustain a stable IPC connection; the
parity test is strictly uffs.exe (daemon HOT) vs uffs.com (MFT re-read).

Features:
  --drives C,D,G       drives to test (default: C,D)
  --rounds N           rounds per drive per tool (default: 1)
  --purge-cache        stop daemon + delete all cache files (true COLD)
  --skip-cpp           skip the C++ reference column
  --dump-raw           print raw stderr per invocation
  --sleep-ms N         inter-round sleep (default: 1000 ms)
  --output-file <path> tee full output to a file

Emits two markdown tables: per-drive parity wall-clock p50 and the Rust
wall / daemon / CLI-overhead breakdown.  No external crate dependencies
(no chrono) — timestamp computed via a simple epoch decomposition.
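
For illustration, one dependency-free way to decompose epoch seconds into a UTC timestamp, using the well-known civil-from-days arithmetic. This is a sketch; the script's actual decomposition and output format are not shown in this PR text, and `format_utc` is a hypothetical name.

```rust
/// Convert seconds since the Unix epoch to "YYYY-MM-DD HH:MM:SS" (UTC).
fn format_utc(epoch_secs: u64) -> String {
    let days = epoch_secs / 86_400;
    let secs_of_day = epoch_secs % 86_400;

    // Shift the origin to 0000-03-01 so leap days fall at the end of a year.
    let shifted = days + 719_468;
    let era = shifted / 146_097; // 400-year cycles
    let day_of_era = shifted % 146_097;
    let year_of_era =
        (day_of_era - day_of_era / 1_460 + day_of_era / 36_524 - day_of_era / 146_096) / 365;
    let day_of_year = day_of_era - (365 * year_of_era + year_of_era / 4 - year_of_era / 100);
    let month_index = (5 * day_of_year + 2) / 153; // 0 = March
    let day = day_of_year - (153 * month_index + 2) / 5 + 1;
    let month = if month_index < 10 { month_index + 3 } else { month_index - 9 };
    // January and February belong to the following calendar year.
    let year = year_of_era + era * 400 + u64::from(month <= 2);

    format!(
        "{year:04}-{month:02}-{day:02} {:02}:{:02}:{:02}",
        secs_of_day / 3_600,
        (secs_of_day % 3_600) / 60,
        secs_of_day % 60
    )
}
```

The input would typically come from `SystemTime::now().duration_since(UNIX_EPOCH)`.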
Switch the Stage 2 parity invocation from:
  powershell -NoProfile -ExecutionPolicy Bypass -File
  scripts/windows/cold-parity-per-drive.ps1
  -Drives C,D,G -Rounds N -OutputFile ... [-PurgeCacheFirst]

to:
  rust-script scripts/windows/cold-parity-per-drive.rs
  --drives C,D,G --rounds N --output-file ... [--purge-cache]

Update PARITY_SCRIPT constant to point at the .rs file, reuse
RUST_SCRIPT_EXE (same launcher as Stage 1), and remove the now-dead
POWERSHELL_EXE constant. Module doc comment updated accordingly.
run_native() and plan() (stage 3 branch) were both iterating cfg.drives
(all CLI-provided drives: C,D,E,F,G,H,I,M,S) instead of cfg.capable_drives
(the negotiated subset every tool can serve).

Also fixes snapshot_cache() which was backing up — and registering restores
for — cache files on the full drive list rather than only the capable ones.

After this commit every reference to cfg.drives in stages.rs is gone;
all three stages and their cache snapshot/restore now operate exclusively
on cfg.capable_drives.
- Remove --hide-system / --hide-ads: those flags exist only to align row
  counts with Everything (which skips system files and ADS).  This bench
  is uffs.exe vs uffs.com only — the full unfiltered MFT corpus is the
  correct and honest baseline.

- Keep --profile: it adds 'daemon: N ms' on stderr powering Table 2's
  wall/daemon/overhead breakdown.  Explain in doc comment why it is kept
  (non-default code path, negligible overhead at >100 ms query scale).

- Fix C++ version display: scan --version output for the line starting
  with 'UFFS version:' (same prefix the bench suite uses), emit as
  'uffs.com X.Y.Z' instead of the full multi-line banner.

- Fix uffs.exe version display: print first non-empty trimmed line only.

- Fix binary resolution source labels: use .display() instead of {:?}
  so paths are shown as plain 'C:\Users\rnio\bin\uffs.exe' without
  Rust Debug escaping.

- Fix table alignment: compute every column width from the actual data
  (drive names, ms values, speedup strings, row counts) before rendering
  headers and rows, so all columns are exactly wide enough and properly
  right-aligned with no overflow or under-padding.  Add ms_str() and
  speedup_str() helpers used by both tables.
Phase 0a now unconditionally kills and restarts the daemon with the
requested --drives (or bare start to auto-discover all drives when none
are given).  This guarantees a known COLD starting state regardless of
whatever daemon was running before.  --purge-cache additionally deletes
the on-disk cache files before the restart, forcing a true MFT re-read.

Phase 0b now issues three warm-up passes ('uffs * --limit 1 --profile'):
  pass 1 — cold load path, await_ready, MFT read (or cache hit)
  pass 2 — first HOT query, JIT query state primed
  pass 3 — fully primed, daemon in steady-state query mode
Only pass-1 wall + await_ready are reported; total_records is read once
after all three settle.

Summary warm-up recap: 'Mode: WARM' is replaced with the correct
'COLD (daemon restarted ...)' label with a note on whether the
on-disk cache was purged or retained.
warmup_daemon_primed now returns WarmupResult carrying all three pass
wall times, per-pass daemon ms (from --profile), await_ready from
pass 1, and total_records.

Phase 0b and the summary recap both print the full trajectory:
  Pass 1 [COLD]  wall = X.XX s  daemon = N ms  (daemon load + MFT read / cache hit)
  Pass 2 [WARM]  wall = X.XX s  daemon = N ms  (first HOT query — JIT structures built)
  Pass 3 [HOT ]  wall = X.XX s  daemon = N ms  (fully primed — steady-state latency)

This makes the COLD/WARM/HOT cost difference visible at a glance and
shows the progression from a cold-started daemon through full priming.
DEFAULT_PATTERNS was emitting 'uffs.exe C:\ *.dll --count' (positional
path arg) instead of the correct CLI form:

  uffs.exe "*.dll" --drives C --count

Fix: drop the '{DRIVE}:\' positional arg, use '--drives {DRIVE}' flag.
This fixes:
  - Stage 3 plan card (commands shown verbatim were wrong)
  - Stage 3 run_native (cells were failing with exit 1)
  - preflight estimate_rows (row-count estimation was also broken)

Also update the spec_for test fixture in preflight/tests.rs to match.
Production defaults (policy.rs):
  HOT_TO_WARM:    60 s  ->  600 s (10 min)
  WARM_TO_PARKED: 300 s -> 1800 s (30 min)
  PARKED_TO_COLD: 86400 s (24 h, unchanged)

The previous 1 min / 5 min defaults caused spurious demotes mid-bench
(HOT queries demote after 1 min idle between sink rotations; WARM
restarts between Stage 2 and Stage 3 demote to Parked after 5 min).

Bench scripts (cold-parity-per-drive.rs, cross-tool-benchmark.rs):
Both now start the daemon with bench-safe TTL overrides:
  UFFS_HOT_TO_WARM_IDLE_SECS=3600    (1 hr)
  UFFS_WARM_TO_PARKED_IDLE_SECS=7200 (2 hr)
Scoped to the daemon child process only; teardown's next start gets
production defaults.

cross-tool-benchmark.rs adds uffs_start() helper and wires it at:
  - skip_cold warm-up  : kill+restart before HOT
  - WARM per-round     : stop+start instead of stop-only
  - HOT per-drive      : start before probe query

Update config.rs const-pin and lib.rs env-var doc table.
… superset check

Previously all N rounds of UFFS ran first, then all N rounds of C++,
then all N of Everything. The later tools consistently benefited from
OS filesystem cache warmed by the earlier tool — biasing timings.

New HOT loop structure for each (sink, pattern):
  for round in 1..=N:
    shuffle [uffs, cpp, es] with a fresh LCG seed each round
    run the three tools in that order
    immediately check: uffs.exe rows >= uffs.com rows >= es.exe rows
    print per-round row counts + superset verdict (ok / VIOLATION)
  print per-tool p50/p95 summary

Superset semantics:
  uffs.exe >= uffs.com  — both read same MFT; with --hide-system
                          counts should match; violation = parity bug
  uffs.exe >= es.exe    — Everything skips NTFS system files / ADS
  uffs.com >= es.exe    — same reasoning for C++ tool

Violations are warnings only (bench continues) so timing data is
always preserved. The es.exe fast-fail short-circuit is preserved
(checked on round 0; remaining es rounds skipped on fast failure).

Also adds lcg_shuffle3() — Fisher-Yates on 3 elements using a
minimal LCG (glibc constants), zero external dependencies.
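
A sketch of what a three-element Fisher-Yates shuffle over a glibc-style LCG can look like. The constants (a = 1103515245, c = 12345, m = 2^31) are the classic glibc `rand()` ones; the seeding scheme, the use of the high bits, and the function bodies are assumptions rather than the script's code.

```rust
/// One LCG step: state = (a * state + c) mod 2^31.
fn lcg_next(state: &mut u32) -> u32 {
    *state = state.wrapping_mul(1_103_515_245).wrapping_add(12_345) & 0x7FFF_FFFF;
    *state
}

/// Fisher-Yates shuffle of exactly three items, driven by the LCG above.
fn lcg_shuffle3<T>(items: &mut [T; 3], seed: u32) {
    let mut state = seed;
    // Walk from the last slot down, swapping each slot with a randomly
    // chosen slot at or below it.
    for i in (1..items.len()).rev() {
        // Use the high bits: the low bits of a power-of-two LCG cycle quickly.
        let j = ((lcg_next(&mut state) >> 16) as usize) % (i + 1);
        items.swap(i, j);
    }
}
```

The modulo introduces a slight bias, which is harmless for decorrelating tool order but would matter for statistical sampling. A fresh seed per round could come from the clock or a round counter.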
Row counts can match by coincidence (different files, same count).
Replace the per-round count comparison with an actual path-set check:

  es.exe paths ⊆ uffs.com paths ⊆ uffs.exe paths

Implementation:
- Each run_* function gains a run_*_to variant accepting an explicit
  output file path so all three tools write to separate files
  (bench_uffs_<pat>.csv, bench_cpp_<pat>.csv, bench_es_<pat>.csv)
  within the same round without clobbering each other.
- After every round (File sink): read each file into a normalised
  HashSet<String> of lowercased, header-stripped paths; call
  check_subset() for each pair; print violations with first 3
  missing paths as examples.
- Stdout/Null sinks: no output file retained, skip path check
  (row counts still shown).
- Dead code removed: run_es and run_uffs_cpp wrappers,
  extract_paths_from_bytes.
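
A sketch of the normalise-then-compare step. It assumes each output file holds one path per line after a single header row, which may not match the real CSV layout; `path_set` is a hypothetical helper and `check_subset` only mirrors the name used above.

```rust
use std::collections::HashSet;

/// Build a normalised path set: drop the header row, trim, lowercase,
/// and ignore empty lines.
fn path_set(csv: &str) -> HashSet<String> {
    csv.lines()
        .skip(1)
        .map(|line| line.trim().to_lowercase())
        .filter(|path| !path.is_empty())
        .collect()
}

/// Paths present in `subset` but missing from `superset`, sorted and capped
/// at `limit` so a violation report stays short. Empty means the check holds.
fn check_subset(subset: &HashSet<String>, superset: &HashSet<String>, limit: usize) -> Vec<String> {
    let mut missing: Vec<String> = subset.difference(superset).cloned().collect();
    missing.sort();
    missing.truncate(limit);
    missing
}
```

Calling `check_subset(&es, &cpp, 3)` and `check_subset(&cpp, &uffs, 3)` covers the chain es.exe ⊆ uffs.com ⊆ uffs.exe.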
@githubrobbi changed the title from "feat(bench): state probes + Everything GUI version entry" to "feat(bench): bench robustness — TTL fixes, interleaved HOT rounds, path-set parity checks" on Jun 8, 2026
@githubrobbi enabled auto-merge (squash) on June 8, 2026 18:00
@githubrobbi merged commit f1bfcc7 into main on Jun 8, 2026
29 checks passed
@githubrobbi deleted the feat/bench-preflight-robustness branch on June 8, 2026 18:02