Conversation
Pull request overview
This PR adds first-class A/B regression checking to diskann-benchmark-runner by introducing a regression-capable benchmark API (tolerance inputs + before/after comparison) and a new check CLI surface, with an example adoption in diskann-benchmark-simd.
Changes:
- Introduce `benchmark::Regression` + registry plumbing to register and discover regression-capable benchmarks and their tolerance input types.
- Add tolerance-file parsing, subset-based matching, and the `check verify`/`check run` execution + reporting pipeline.
- Extend the UX test harness and add many new golden tests covering success/failure/error paths; add SIMD example tolerances + regression implementation.
Reviewed changes
Copilot reviewed 103 out of 143 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| diskann-benchmark-simd/src/lib.rs | Adds SIMD regression tolerance type and Regression implementation for kernels; updates run result schema. |
| diskann-benchmark-simd/src/bin.rs | Adds integration coverage for check verify on the SIMD example. |
| diskann-benchmark-simd/examples/tolerance.json | Example tolerance file for SIMD regression checks. |
| diskann-benchmark-runner/src/lib.rs | Exposes benchmark publicly and adds internal module for shared helpers. |
| diskann-benchmark-runner/src/benchmark.rs | Adds Regression trait, PassFail, and type-erased regression support in the registry layer. |
| diskann-benchmark-runner/src/registry.rs | Adds register_regression and tolerance discovery/mapping to regression-capable benchmarks. |
| diskann-benchmark-runner/src/internal/mod.rs | Adds shared load_from_disk helper and internal module organization. |
| diskann-benchmark-runner/src/internal/regression.rs | Implements tolerance parsing, subset matching, check job creation, execution, JSON output, and reporting. |
| diskann-benchmark-runner/src/app.rs | Adds check subcommand (skeleton, tolerances, verify, run) and upgrades UX test harness for multi-step scenarios. |
| diskann-benchmark-runner/src/jobs.rs | Refactors input loading/parsing and exposes raw job list/partial parsing for regression pipeline. |
| diskann-benchmark-runner/src/result.rs | Adds RawResult loader for reading previously saved benchmark outputs. |
| diskann-benchmark-runner/src/checker.rs | Tightens Checker::any tagging expectation (with clippy annotation). |
| diskann-benchmark-runner/src/input.rs | Adds Wrapper::<T>::INSTANCE const for tolerance type-erasure usage. |
| diskann-benchmark-runner/src/ux.rs | Adds scrub_path helper for deterministic UX test output. |
| diskann-benchmark-runner/src/utils/mod.rs | Exports new num utilities module. |
| diskann-benchmark-runner/src/utils/num.rs | Adds relative_change and NonNegativeFinite for regression/tolerance validation and comparisons. |
| diskann-benchmark-runner/src/utils/percentiles.rs | Adds minimum percentile value to output structure (and marks struct #[non_exhaustive]). |
| diskann-benchmark-runner/src/utils/fmt.rs | Adds clippy expectation annotation for panic-based bounds checks. |
| diskann-benchmark-runner/src/test/mod.rs | Reorganizes test benchmark registration and marks regression-capable test benchmarks. |
| diskann-benchmark-runner/src/test/dim.rs | Adds regression checks to dim benchmarks and introduces a non-regression “simple” benchmark. |
| diskann-benchmark-runner/src/test/typed.rs | Adds regression checks to typed benchmarks and introduces tolerance input used by typed regression tests. |
| diskann-benchmark-runner/Cargo.toml | Switches to explicit clippy lint configuration. |
| diskann-benchmark-runner/.clippy.toml | Allows unwrap/expect/panic in tests under clippy. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt | Adds UX test for check skeleton. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt | Golden output for check skeleton. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md | Documents check skeleton UX test scenario. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt | Adds UX test for listing tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt | Golden output for listing tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md | Documents tolerance listing UX test. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt | Adds UX test for describing a specific tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt | Golden output for tolerance kind description/skeleton. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md | Documents tolerance kind description UX test. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt | Adds UX test for requesting an unknown tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt | Golden output for unknown tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md | Documents unknown tolerance kind behavior. |
| diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt | Adds UX test for successful check verify. |
| diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json | Test tolerance file for successful verification. |
| diskann-benchmark-runner/tests/regression/check-verify-0/input.json | Test input file used for successful verification. |
| diskann-benchmark-runner/tests/regression/check-verify-0/README.md | Documents successful verification behavior (no stdout). |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt | Adds UX test for unknown tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt | Golden output for unknown tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json | Test tolerance file with unknown tolerance tag. |
| diskann-benchmark-runner/tests/regression/check-verify-1/input.json | Input file used by unknown tolerance tag test. |
| diskann-benchmark-runner/tests/regression/check-verify-1/README.md | Documents unknown tolerance tag error scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt | Adds UX test for tolerance/input match but no regression benchmark dispatch. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt | Golden output for “no matching regression benchmark” in verify. |
| diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json | Tolerance file used by the dispatch-failure verify test. |
| diskann-benchmark-runner/tests/regression/check-verify-2/input.json | Input file used by the dispatch-failure verify test. |
| diskann-benchmark-runner/tests/regression/check-verify-2/README.md | Documents dispatch-failure verify scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt | Adds UX test covering ambiguous/orphaned/uncovered tolerance matching problems. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt | Golden output for matching failure diagnostics. |
| diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json | Tolerance file constructed to trigger matching errors. |
| diskann-benchmark-runner/tests/regression/check-verify-3/input.json | Input file constructed to trigger matching errors. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt | Adds UX test for incompatible tolerance/input tag pairing. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt | Golden output for incompatible tag pairing error. |
| diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json | Tolerance file with mismatched input tag type. |
| diskann-benchmark-runner/tests/regression/check-verify-4/input.json | Input file used by incompatible tag pairing test. |
| diskann-benchmark-runner/tests/regression/check-verify-4/README.md | Documents incompatible tag pairing test (currently needs alignment with actual scenario). |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt | Adds UX test for successful check run and checks.json generation. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt | Golden output for successful regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json | Input file for passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json | Tolerance file for passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json | Golden output.json produced during setup in passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json | Golden checks.json produced by check run in pass scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md | Documents pass scenario coverage. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt | Adds UX test where regression check fails (but checks.json still written). |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt | Golden output for failing regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json | Input file for failing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json | Tolerance file that triggers a failure. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json | Golden output.json for failing run scenario setup. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json | Golden checks.json for failing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md | Documents failing regression run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt | Adds UX test where check execution errors are surfaced and recorded. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt | Golden output for erroring regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/input.json | Input file for error scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json | Tolerance file that triggers check errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/output.json | Golden output.json for error scenario setup. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json | Golden checks.json showing error entries. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/README.md | Documents error triage behavior and checks.json writing. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt | Adds UX test for before/after length mismatch detection. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt | Golden output for length mismatch error. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/input.json | Setup input file for generating output.json with 2 entries. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json | Regression input file with 1 job used to trigger length mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json | Tolerance file for length mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/output.json | Golden output.json with 2 entries for mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/README.md | Documents length mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt | Adds UX test for input drift causing “no matching regression benchmark” during run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt | Golden output for drift/no-match error. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/input.json | Setup input file for generating output.json. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json | Drifted input file used for check run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json | Tolerance file for drift scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/output.json | Golden output.json from setup run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/README.md | Documents drift scenario expectations. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt | Adds UX test for before/after schema drift (deserialization error) handling. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt | Golden output for schema drift error. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/input.json | Setup input file used to generate integer results. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json | Regression input expecting string results used to trigger schema mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json | Tolerance file for schema drift scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/output.json | Golden output.json containing integer results. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json | Golden checks.json containing structured error output. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/README.md | Documents schema drift scenario and expectations. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt | Adds baseline UX test for skeleton. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt | Golden output for skeleton. |
| diskann-benchmark-runner/tests/benchmark/test-0/README.md | Documents skeleton UX test. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt | Adds baseline UX test for inputs. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt | Golden output for inputs list. |
| diskann-benchmark-runner/tests/benchmark/test-1/README.md | Documents inputs list UX test. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt | Adds baseline UX test for inputs <NAME>. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt | Golden output for inputs test-input-dim. |
| diskann-benchmark-runner/tests/benchmark/test-2/README.md | Documents inputs <NAME> behavior. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt | Adds baseline UX test for inputs test-input-types. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt | Golden output for inputs test-input-types. |
| diskann-benchmark-runner/tests/benchmark/test-3/README.md | Documents typed input example output. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt | Adds baseline UX test for benchmarks. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt | Golden output for benchmark listing (now includes simple-bench). |
| diskann-benchmark-runner/tests/benchmark/test-4/README.md | Documents benchmarks listing behavior. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt | Adds baseline “successful run generates output.json” UX test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt | Golden stdout for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/input.json | Input file for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/output.json | Golden output.json for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/README.md | Documents successful run behavior. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt | Adds baseline --dry-run UX test. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt | Golden stdout for dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/input.json | Input file for dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/README.md | Documents dry-run behavior. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt | Adds UX test for MatchScore overload resolution. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt | Golden stdout for overload resolution. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json | Input file for overload resolution test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json | Golden output.json for overload resolution test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md | Documents overload resolution expectations. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt | Adds UX test for mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt | Golden mismatch diagnostics output. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json | Input file that triggers mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md | Documents mismatch diagnostics output. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt | Adds UX test for ExactTypeBench mismatch reason paths. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt | Golden output for ExactTypeBench mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json | Input file for ExactTypeBench mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md | Documents ExactTypeBench mismatch reasons. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt | Adds UX test for input deserialization error reporting. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt | Golden output for deserialization errors. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json | Input file containing an invalid datatype variant. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md | Documents deserialization error reporting. |
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #900      +/-   ##
==========================================
+ Coverage   89.31%   89.38%    +0.06%
==========================================
  Files         445      450        +5
  Lines       84095    85166     +1071
==========================================
+ Hits        75113    76129     +1016
- Misses       8982     9037       +55
```

Flags with carried forward coverage won't be shown.
```rust
        }
        (Value::Bool(h), Value::Bool(n)) => false_if!(h != n),
        (Value::Number(h), Value::Number(n)) => false_if!(h != n),
        (Value::String(h), Value::String(n)) => false_if!(h != n),
```
Is there any scenario where we don't want to consider upper/lower case differences?
That's a good question - I lean towards "no" for the following reasons:
- The decision to normalize case in strings is type dependent and even system dependent. For example, it's valid to normalize file-path cases on windows because the OS is generally case insensitive, but most Unix filesystems are case sensitive. So the decision to normalize case needs to be made by the input type we are matching against, which leads to problem 2.
- My desire to have fuzzy matching is basically to make the tolerance files easier to write and maintain. One consequence is that the input JSON that accompanies a tolerance JSON cannot properly deserialize into a valid input instance, meaning we can't use something like `PartialEq` to do the comparison. I had thought about allowing inputs to define an acceptance criterion like

  ```rust
  trait Accept {
      fn accept(&self, raw: &serde_json::Value, partial: &serde_json::Value) -> bool {
          is_subset(raw, partial)
      }
  }
  ```

  where we could delegate the entire matching process to the inputs, rather than using this blanket match for everything. The downside is that this adds yet another trait that inputs need to implement, even if it does end up being just a marker trait most of the time due to the provided implementation.
But this then opens a can of worms of extending the trait in the future to provide better diagnostics and potentially unintuitive matching behavior, and so I just kept it simple.
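For what it's worth, the case-sensitivity question and the `Accept` idea can be combined into a small sketch. Everything below is hypothetical (none of these types exist in the PR), and `Value` is a simplified stand-in for `serde_json::Value`:

```rust
// Sketch of the delegation idea: inputs get a default subset-style `accept`,
// so `Accept` stays a marker trait for most types, while a type that knows its
// strings are case-insensitive (e.g. Windows paths) can override it.

#[derive(PartialEq)]
enum Value {
    String(String),
}

// Stand-in for the blanket subset matcher described in the thread.
fn is_subset(raw: &Value, partial: &Value) -> bool {
    raw == partial
}

trait Accept {
    fn accept(&self, raw: &Value, partial: &Value) -> bool {
        is_subset(raw, partial)
    }
}

// Most inputs would only need the marker impl.
struct DefaultInput;
impl Accept for DefaultInput {}

// A Windows-path input could opt into case-insensitive comparison.
struct CaseInsensitivePath;
impl Accept for CaseInsensitivePath {
    fn accept(&self, raw: &Value, partial: &Value) -> bool {
        let (Value::String(a), Value::String(b)) = (raw, partial);
        a.eq_ignore_ascii_case(b)
    }
}

fn main() {
    let raw = Value::String(r"C:\Data\Index".into());
    let partial = Value::String(r"c:\data\index".into());
    assert!(!DefaultInput.accept(&raw, &partial)); // exact match fails
    assert!(CaseInsensitivePath.accept(&raw, &partial)); // normalized match passes
}
```

This keeps the default behavior identical to the blanket match while letting the type that owns the semantics decide about normalization.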
```rust
 * Licensed under the MIT license.
 */

//! This module contains the tolerance parsing, matching, and running logic.
```
This design separates benchmarking and regression checking.
Did we consider combining them? For example, each benchmark could have one output file, and we could simply add a `--baseline` flag for the target build/config, then compare the baseline and current output.
I had thought about that approach but went with the current one for a few reasons. First, adding a `--baseline` mode to the normal benchmark flow introduces two potential problems:
- Either it requires that all benchmarks implement regression checks. This is not something I wanted to do: mandatory regression checks mean that they will often be implemented hurriedly to get things to compile, and a bad regression check is worse than no regression check.
- Or it requires that the regression-capability matching is fused with the normal benchmark dispatching, which seemed like an unnecessary complication.
Reporting is also messier with a fused approach. A separate check flow allows all pass/fail results to be aggregated at once. With a fused regression check, I don't know whether regression checks should be displayed inline with the normal results or rolled up at the very end, and if a check fails, do we abort early or not? These should probably all be options, but that makes the implementation more complicated.
Finally, the embedded flow is tedious for tuning tolerances. A separate after-the-fact check allows multiple tolerances to be tried and the matching scheme to be refined without constantly rerunning the benchmark.
Overall, I felt a separate flow with pre-validation gave a cleaner separation of concerns.
```json
  "tolerance": {
    "type": "simd-tolerance",
    "content": {
      "min_time_regression": 0.05
```
@hildebrandmw, is this the expected usage of the proposed pattern?
1. Create a `tolerance.json` similar to this:
```json
{
  "checks": [
    {
      "input": {
        "type": "disk-index",
        "content": {}
      },
      "tolerance": {
        "type": "disk-index-tolerance",
        "content": {
          "build_time_regression": 0.10,
          "qps_regression": 0.10,
          "recall_regression": 0.01,
          "mean_ios_regression": 0.01,
          "mean_comps_regression": 0.01,
          "mean_latency_regression": 0.15,
          "p95_latency_regression": 0.15
        }
      }
    }
  ]
}
```
2. Add ~300 lines of Rust code to wire it up, including implementations of `check_lower_is_better()` and `check_higher_is_better()`:
https://github.com/microsoft/DiskANN/pull/912/changes#diff-f77d4bbaebfd48c9785d878c56f274b1aa3414d88b3896aed961acdd3904e194R131-R450
```rust
// ...
comparisons.push(check_higher_is_better(
    "qps",
    b_sr.qps as f64,
    a_sr.qps as f64,
    tolerances.qps_regression,
    &mut passed,
));
// ...
```
Is it possible to avoid step 2? It feels like the higher-the-better knowledge belongs in `tolerance.json`.
I would argue that the higher-the-better knowledge is exactly what shouldn't live in the `tolerance.json`. The writer of the benchmark knows which direction the change should be in and can structurally enforce that in the code. That said, I would recommend using a macro to stamp out much of the boilerplate. I would also say that reusing `NonNegativeFinite` to describe negative bounds is a little confusing.
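To illustrate the macro suggestion, here is a hypothetical sketch. None of these names are from the PR, and `relative_change` below is a local stand-in for the utility in `utils/num.rs` (the real one reportedly handles more corner cases):

```rust
/// Relative change from `before` to `after` (assumes non-negative metrics).
fn relative_change(before: f64, after: f64) -> f64 {
    if before == 0.0 {
        if after == 0.0 { 0.0 } else { f64::INFINITY }
    } else {
        (after - before) / before
    }
}

/// For higher-is-better metrics, fail when the value drops by more than `tol`.
fn check_higher_is_better(name: &str, before: f64, after: f64, tol: f64, passed: &mut bool) -> String {
    let change = relative_change(before, after);
    let ok = change >= -tol;
    *passed &= ok;
    format!("{name}: {before} -> {after} ({:+.2}%) {}", change * 100.0, if ok { "pass" } else { "FAIL" })
}

/// Stamp out several higher-is-better checks at once.
macro_rules! check_all_higher {
    ($passed:expr, $out:expr, $( ($name:literal, $b:expr, $a:expr, $tol:expr) ),+ $(,)?) => {
        $( $out.push(check_higher_is_better($name, $b, $a, $tol, $passed)); )+
    };
}

fn main() {
    let mut passed = true;
    let mut report: Vec<String> = Vec::new();
    check_all_higher!(&mut passed, report,
        ("qps", 1000.0, 980.0, 0.10),  // -2% drop, within 10% tolerance
        ("recall", 0.95, 0.90, 0.01),  // ~-5.3% drop, exceeds 1% tolerance
    );
    for line in &report {
        println!("{line}");
    }
    assert!(!passed);
}
```

The macro keeps the direction of each metric enforced in code, per the argument above, while reducing the per-metric call-site noise.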
```json
      }
    }
  ]
}
```
Thinking aloud:
I would love us to make benchmarking effortless by default, with more advanced capabilities unlocking incrementally as needed, where common workflows work out of the box with little to no user configuration.
I'm thinking about these scenarios (each builds on top of the previous one):

1. Single run, easy sharing: run a benchmark and export results to Excel, Word, or Markdown for quick sharing with the team.
2. Zero-config before/after comparisons: run benchmarks before and after a change and automatically produce a table with absolute and relative metric diffs; export the table to Excel, Word, or Markdown with no extra configuration and no diff reasoning needed.
3. Optional passed/failed signal for each metric: introduce metric metadata (higher_better/lower_better, thresholds) to enable green/red highlighting with minimal effort.
4. Add CI support to (3): via something like --fail_on_error, so the benchmark fails when thresholds are violated.
Thank you Alex. Some of these are indeed limitations, but some of these are already achievable:

1. This is the biggest weakness of the current approach of dumping things to JSON, and I'd really like to improve the story here. I've learned from experience that generating as rich a set of results as possible (i.e. `output.json`) is very helpful when it comes to data analysis, but not so much data presentation. Would love for more cycles to be put into this for the tool!
2. You can see a more detailed response here, but the TLDR is that the split into a separate executable keeps regression testing opt-in and makes tolerance iteration a little easier. That said, the `Display` impls on the `PassFail` structs are where rich formatting can happen in this PR. We could definitely work on infrastructure to make these `Display` implementations prettier, though, to make the output more useful by default.
3. This is already possible with proper `Display` implementations on the returned `PassFail` structs. The benchmark crate could make these a little more accessible, but the `colored` crate is available pretty easily for this.
4. This should already work: the `check` sub-executable will exit with an `Error` when at least one inner check fails with an error or doesn't pass. This in turn exits from `main` with a non-zero exit code that CI can use to indicate a failure. Indeed, any error encountered this way will result in a non-zero exit code 😄.
This adds support for native A/B testing in `diskann-benchmark-runner` with an example implementation added to `diskann-benchmark-simd`. The benefits of having this infrastructure at the Rust level are:

Concept
The idea is to use one `input.json` file to generate two output JSON files, `before.json` and `after.json`, for different builds or configurations of the library. The `input.json` is then accompanied by a `tolerances.json`, which contains runtime thresholds for values of interest to help accommodate runtime variability. A regression check takes all such files and performs the following steps:

1. Parse `tolerances.json` and `input.json`.
2. Match entries in `tolerances.json` to entries in `input.json`. I went with matching semantics rather than requiring a one-to-one correspondence to make it easier to have a single tolerance entry work as a blanket entry for multiple benchmark runs.
3. Take the `tolerance`/`input` pairs and match them with a regression-checkable benchmark.
4. Deserialize the corresponding entries of `before/after.json` into the benchmark's output type, or error gracefully if this cannot be done due to an incorrect environment.

The matching semantics work like this. Each tolerance entry looks like the following:
```json
{
  "input": {
    "type": "type-tag",
    "content": {}
  },
  "tolerance": {
    "type": "tolerance-type-tag",
    "content": "defined-per-tolerance"
  }
}
```

The content field of "input" need not be deserializable to its corresponding value. Instead, we use the raw JSON of the input and match it as a "subset" against the raw JSON of the `input.json` using the following rules:

- An array `x` is a subset of an array `y` if `x.len() <= y.len()` and each entry `i` in `x` is a subset of its corresponding entry in `y` (i.e., we match prefixes).
- An object `x` is a subset of an object `y` if each key in `x` is a key in `y` and each value associated with a key in `x` is a subset of the value of the same entry in `y`.
- Mismatched types (e.g., a `bool` cannot be a subset of an `integer`) break the match instantly.

This means that an empty "content" field will match any struct and thus be a blanket implementation for the input with the same type tag. Or, the "content" field can be refined to be more specific as needed.
Since a single tolerance entry may match multiple inputs, we ensure that the matching is unambiguous using the following rules:
If there is any ambiguity, the app stops with an error.
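For concreteness, the subset rules above can be sketched as follows, using a simplified stand-in for `serde_json::Value` (the real logic lives in `internal/regression.rs` and may differ in detail):

```rust
use std::collections::BTreeMap;

// Simplified stand-in for `serde_json::Value`, enough to express the rules.
#[derive(Clone, PartialEq)]
enum Value {
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// True if `x` is a "subset" of `y` under the matching rules.
fn is_subset(x: &Value, y: &Value) -> bool {
    match (x, y) {
        // Arrays match as prefixes: `x` may be shorter; entries match pairwise.
        (Value::Array(xs), Value::Array(ys)) => {
            xs.len() <= ys.len() && xs.iter().zip(ys).all(|(a, b)| is_subset(a, b))
        }
        // Every key in `x` must exist in `y` with a matching (subset) value.
        (Value::Object(xs), Value::Object(ys)) => xs
            .iter()
            .all(|(k, v)| ys.get(k).map_or(false, |w| is_subset(v, w))),
        // Scalars must be equal; mismatched types break the match instantly.
        (a, b) => a == b,
    }
}

fn main() {
    // An empty object is a blanket match for any object with the same tag.
    let blanket = Value::Object(BTreeMap::new());
    let input = Value::Object(BTreeMap::from([
        ("dim".to_string(), Value::Number(128.0)),
        ("name".to_string(), Value::String("bench".to_string())),
    ]));
    assert!(is_subset(&blanket, &input));
    // Refined content must agree with the input.
    let refined = Value::Object(BTreeMap::from([("dim".to_string(), Value::Number(64.0))]));
    assert!(!is_subset(&refined, &input));
    // Type mismatches break the match.
    assert!(!is_subset(&Value::Bool(true), &Value::Number(1.0)));
}
```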
CLI Changes
This adds a new `check` subcommand to the runner CLI with the following options:

- `check skeleton`: Print a skeleton tolerance JSON file.
- `check tolerances [NAME]`: List tolerance kinds, or describe one by name. This is similar to `inputs [NAME]`.
- `check verify --tolerances <FILE> --input-file <FILE>`: Validate a tolerance file against an input file. This runs up to step 3 of the checklist above and serves as a pre-flight check before any CI jobs are run. Errors with the setup that can be caught early will be, and can thus save CI time.
- `check run --tolerances <FILE> --input-file <FILE> --before <FILE> --after <FILE> [--output-file <FILE>]`: Run regression checks.

Benchmark Registration Changes
Regression checks are opt-in. Benchmarks that wish to opt in implement `benchmark::Regression` and its single `check` method. All logic for the before/after comparison lives in the `check` method. Such benchmarks also need to use `registry::Benchmarks::register_regression` to be correctly tracked as regression-compatible. No independent registration of the `Tolerance` associated type is needed. That is it.
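Since the trait itself isn't reproduced here, the following is a hedged sketch of what opting in might look like. Every name and signature below is an illustrative guess, not the actual `diskann-benchmark-runner` API:

```rust
// Illustrative sketch only: the real `benchmark::Regression`, `PassFail`, and
// registry signatures may differ. The point is the shape: one `check` method
// holding all of the before/after comparison logic.
use std::fmt;

struct MyTolerance {
    max_time_regression: f64, // e.g. 0.05 allows up to a 5% slowdown
}

struct MyOutput {
    time_secs: f64,
}

enum PassFail {
    Pass,
    Fail(String),
}

impl fmt::Display for PassFail {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PassFail::Pass => write!(f, "pass"),
            PassFail::Fail(reason) => write!(f, "FAIL: {reason}"),
        }
    }
}

trait Regression {
    type Tolerance;
    type Output;
    fn check(&self, tol: &Self::Tolerance, before: &Self::Output, after: &Self::Output) -> PassFail;
}

struct MyBench;

impl Regression for MyBench {
    type Tolerance = MyTolerance;
    type Output = MyOutput;

    // Returns a result rather than printing, so the caller owns reporting.
    fn check(&self, tol: &MyTolerance, before: &MyOutput, after: &MyOutput) -> PassFail {
        let change = (after.time_secs - before.time_secs) / before.time_secs;
        if change <= tol.max_time_regression {
            PassFail::Pass
        } else {
            PassFail::Fail(format!("time regressed by {:.1}%", change * 100.0))
        }
    }
}

fn main() {
    let tol = MyTolerance { max_time_regression: 0.05 };
    let before = MyOutput { time_secs: 1.00 };
    let after = MyOutput { time_secs: 1.10 }; // 10% slower, exceeds tolerance
    println!("{}", MyBench.check(&tol, &before, &after));
}
```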
Note that `check` should not print anything to `stdout` and should instead communicate success/failure solely through its return type, to avoid spamming the output.

Example
You can see this in action in `diskann-benchmark-simd`. Depending on the noise in your system, the output will vary (note that the `ci` profile should be used for more reliable measurements). This particular implementation has decided that a regression cannot be meaningfully detected if an execution time got rounded to zero.
Suggested Reviewing Order
`diskann-benchmark-runner`:

- `benchmark.rs`: This contains the new `Regression` trait and the internal type-erasure machinery that is used inside the registry to mark a benchmark as regression-capable.
- `registry.rs`: The new `register_regression` API as well as methods for retrieving and interrogating regression-capable benchmarks.
- `internal/regression.rs`: This is the meat and potatoes of this PR. It contains all the logic to perform the outlined steps and has pretty good module-level documentation that outlines the approach taken.
- `app.rs`: The new `Check` subcommand and routing. This is pretty straightforward save for the changes to tests (see below).
- `utils/num.rs`: I added two new opinionated utilities to help with writing checks:
  - `relative_change`: Compute the relative change between two values, handling corner cases.
  - `NonNegativeFinite`: Useful for serde-compatible assertions that tolerance entries have the obvious properties the name implies.
- `jobs.rs`: The changes here are mainly to make it easier to interact with the JSON patterns used by the input files, giving more uniform handling of said files.

Changes to the testing infrastructure:
The UX tests in `app.rs` have been upgraded to handle tolerance checks as well. This involves having more magically escaped patterns for input/output files and supporting multi-line `stdin.txt` files to allow regression tests to set up their respective environment.

Much of the logic in `internal/regression.rs` is tested via the UX tests because (1) setting up a proper environment for this functionality is challenging and (2) the UX tests provide much better visual information on the expected behavior. A lot of new UX tests have been added.

`diskann-benchmark-simd`:

The changes here are largely meant to work as an example of adding regression tests. Since
`diskann-benchmark-simd` is not a production-critical crate, feel free to just skim this or ignore it entirely.

Future Ideas
There are parts of this that are not perfect:
- One could collect multiple `before` files and build a statistical picture of trends. That opens a whole can of worms like schema stability, machine stability (if comparing run times), etc. that are way beyond the scope of what's needed to get basic regression tests running.
- A `--verbose` flag could print out all diagnostics rather than using the opinionated triage of errors, then failures, then successes.

Disclaimer: This PR (though not the PR description) was written with the help of AI but has been heavily edited and is something that I wouldn't be annoyed at reviewing outside of it being +3500 lines (I am sorry).