Conversation
Pull request overview
This PR adds first-class A/B regression checking to diskann-benchmark-runner by introducing a regression-capable benchmark API (tolerance inputs + before/after comparison) and a new check CLI surface, with an example adoption in diskann-benchmark-simd.
Changes:
- Introduce `benchmark::Regression` + registry plumbing to register and discover regression-capable benchmarks and their tolerance input types.
- Add tolerance-file parsing, subset-based matching, and the `check verify`/`check run` execution + reporting pipeline.
- Extend the UX test harness and add many new golden tests covering success/failure/error paths; add SIMD example tolerances + regression implementation.
Reviewed changes
Copilot reviewed 103 out of 143 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| diskann-benchmark-simd/src/lib.rs | Adds SIMD regression tolerance type and Regression implementation for kernels; updates run result schema. |
| diskann-benchmark-simd/src/bin.rs | Adds integration coverage for check verify on the SIMD example. |
| diskann-benchmark-simd/examples/tolerance.json | Example tolerance file for SIMD regression checks. |
| diskann-benchmark-runner/src/lib.rs | Exposes benchmark publicly and adds internal module for shared helpers. |
| diskann-benchmark-runner/src/benchmark.rs | Adds Regression trait, PassFail, and type-erased regression support in the registry layer. |
| diskann-benchmark-runner/src/registry.rs | Adds register_regression and tolerance discovery/mapping to regression-capable benchmarks. |
| diskann-benchmark-runner/src/internal/mod.rs | Adds shared load_from_disk helper and internal module organization. |
| diskann-benchmark-runner/src/internal/regression.rs | Implements tolerance parsing, subset matching, check job creation, execution, JSON output, and reporting. |
| diskann-benchmark-runner/src/app.rs | Adds check subcommand (skeleton, tolerances, verify, run) and upgrades UX test harness for multi-step scenarios. |
| diskann-benchmark-runner/src/jobs.rs | Refactors input loading/parsing and exposes raw job list/partial parsing for regression pipeline. |
| diskann-benchmark-runner/src/result.rs | Adds RawResult loader for reading previously saved benchmark outputs. |
| diskann-benchmark-runner/src/checker.rs | Tightens Checker::any tagging expectation (with clippy annotation). |
| diskann-benchmark-runner/src/input.rs | Adds Wrapper::<T>::INSTANCE const for tolerance type-erasure usage. |
| diskann-benchmark-runner/src/ux.rs | Adds scrub_path helper for deterministic UX test output. |
| diskann-benchmark-runner/src/utils/mod.rs | Exports new num utilities module. |
| diskann-benchmark-runner/src/utils/num.rs | Adds relative_change and NonNegativeFinite for regression/tolerance validation and comparisons. |
| diskann-benchmark-runner/src/utils/percentiles.rs | Adds minimum percentile value to output structure (and marks struct #[non_exhaustive]). |
| diskann-benchmark-runner/src/utils/fmt.rs | Adds clippy expectation annotation for panic-based bounds checks. |
| diskann-benchmark-runner/src/test/mod.rs | Reorganizes test benchmark registration and marks regression-capable test benchmarks. |
| diskann-benchmark-runner/src/test/dim.rs | Adds regression checks to dim benchmarks and introduces a non-regression “simple” benchmark. |
| diskann-benchmark-runner/src/test/typed.rs | Adds regression checks to typed benchmarks and introduces tolerance input used by typed regression tests. |
| diskann-benchmark-runner/Cargo.toml | Switches to explicit clippy lint configuration. |
| diskann-benchmark-runner/.clippy.toml | Allows unwrap/expect/panic in tests under clippy. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt | Adds UX test for check skeleton. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt | Golden output for check skeleton. |
| diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md | Documents check skeleton UX test scenario. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt | Adds UX test for listing tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt | Golden output for listing tolerance kinds. |
| diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md | Documents tolerance listing UX test. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt | Adds UX test for describing a specific tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt | Golden output for tolerance kind description/skeleton. |
| diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md | Documents tolerance kind description UX test. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt | Adds UX test for requesting an unknown tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt | Golden output for unknown tolerance kind. |
| diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md | Documents unknown tolerance kind behavior. |
| diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt | Adds UX test for successful check verify. |
| diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json | Test tolerance file for successful verification. |
| diskann-benchmark-runner/tests/regression/check-verify-0/input.json | Test input file used for successful verification. |
| diskann-benchmark-runner/tests/regression/check-verify-0/README.md | Documents successful verification behavior (no stdout). |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt | Adds UX test for unknown tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt | Golden output for unknown tolerance tag error. |
| diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json | Test tolerance file with unknown tolerance tag. |
| diskann-benchmark-runner/tests/regression/check-verify-1/input.json | Input file used by unknown tolerance tag test. |
| diskann-benchmark-runner/tests/regression/check-verify-1/README.md | Documents unknown tolerance tag error scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt | Adds UX test for tolerance/input match but no regression benchmark dispatch. |
| diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt | Golden output for “no matching regression benchmark” in verify. |
| diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json | Tolerance file used by the dispatch-failure verify test. |
| diskann-benchmark-runner/tests/regression/check-verify-2/input.json | Input file used by the dispatch-failure verify test. |
| diskann-benchmark-runner/tests/regression/check-verify-2/README.md | Documents dispatch-failure verify scenario. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt | Adds UX test covering ambiguous/orphaned/uncovered tolerance matching problems. |
| diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt | Golden output for matching failure diagnostics. |
| diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json | Tolerance file constructed to trigger matching errors. |
| diskann-benchmark-runner/tests/regression/check-verify-3/input.json | Input file constructed to trigger matching errors. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt | Adds UX test for incompatible tolerance/input tag pairing. |
| diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt | Golden output for incompatible tag pairing error. |
| diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json | Tolerance file with mismatched input tag type. |
| diskann-benchmark-runner/tests/regression/check-verify-4/input.json | Input file used by incompatible tag pairing test. |
| diskann-benchmark-runner/tests/regression/check-verify-4/README.md | Documents incompatible tag pairing test (currently needs alignment with actual scenario). |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt | Adds UX test for successful check run and checks.json generation. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt | Golden output for successful regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json | Input file for passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json | Tolerance file for passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json | Golden output.json produced during setup in passing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json | Golden checks.json produced by check run in pass scenario. |
| diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md | Documents pass scenario coverage. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt | Adds UX test where regression check fails (but checks.json still written). |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt | Golden output for failing regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json | Input file for failing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json | Tolerance file that triggers a failure. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json | Golden output.json for failing run scenario setup. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json | Golden checks.json for failing run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md | Documents failing regression run scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt | Adds UX test where check execution errors are surfaced and recorded. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt | Golden output for erroring regression checks. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/input.json | Input file for error scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json | Tolerance file that triggers check errors. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/output.json | Golden output.json for error scenario setup. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json | Golden checks.json showing error entries. |
| diskann-benchmark-runner/tests/regression/check-run-error-0/README.md | Documents error triage behavior and checks.json writing. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt | Adds UX test for before/after length mismatch detection. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt | Golden output for length mismatch error. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/input.json | Setup input file for generating output.json with 2 entries. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json | Regression input file with 1 job used to trigger length mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json | Tolerance file for length mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/output.json | Golden output.json with 2 entries for mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-1/README.md | Documents length mismatch scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt | Adds UX test for input drift causing “no matching regression benchmark” during run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt | Golden output for drift/no-match error. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/input.json | Setup input file for generating output.json. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json | Drifted input file used for check run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json | Tolerance file for drift scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/output.json | Golden output.json from setup run. |
| diskann-benchmark-runner/tests/regression/check-run-error-2/README.md | Documents drift scenario expectations. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt | Adds UX test for before/after schema drift (deserialization error) handling. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt | Golden output for schema drift error. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/input.json | Setup input file used to generate integer results. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json | Regression input expecting string results used to trigger schema mismatch. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json | Tolerance file for schema drift scenario. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/output.json | Golden output.json containing integer results. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json | Golden checks.json containing structured error output. |
| diskann-benchmark-runner/tests/regression/check-run-error-3/README.md | Documents schema drift scenario and expectations. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt | Adds baseline UX test for skeleton. |
| diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt | Golden output for skeleton. |
| diskann-benchmark-runner/tests/benchmark/test-0/README.md | Documents skeleton UX test. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt | Adds baseline UX test for inputs. |
| diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt | Golden output for inputs list. |
| diskann-benchmark-runner/tests/benchmark/test-1/README.md | Documents inputs list UX test. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt | Adds baseline UX test for inputs <NAME>. |
| diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt | Golden output for inputs test-input-dim. |
| diskann-benchmark-runner/tests/benchmark/test-2/README.md | Documents inputs <NAME> behavior. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt | Adds baseline UX test for inputs test-input-types. |
| diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt | Golden output for inputs test-input-types. |
| diskann-benchmark-runner/tests/benchmark/test-3/README.md | Documents typed input example output. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt | Adds baseline UX test for benchmarks. |
| diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt | Golden output for benchmark listing (now includes simple-bench). |
| diskann-benchmark-runner/tests/benchmark/test-4/README.md | Documents benchmarks listing behavior. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt | Adds baseline “successful run generates output.json” UX test. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt | Golden stdout for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/input.json | Input file for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/output.json | Golden output.json for successful run. |
| diskann-benchmark-runner/tests/benchmark/test-success-0/README.md | Documents successful run behavior. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt | Adds baseline --dry-run UX test. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt | Golden stdout for dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/input.json | Input file for dry-run. |
| diskann-benchmark-runner/tests/benchmark/test-success-1/README.md | Documents dry-run behavior. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt | Adds UX test for MatchScore overload resolution. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt | Golden stdout for overload resolution. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json | Input file for overload resolution test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json | Golden output.json for overload resolution test. |
| diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md | Documents overload resolution expectations. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt | Adds UX test for mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt | Golden mismatch diagnostics output. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json | Input file that triggers mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md | Documents mismatch diagnostics output. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt | Adds UX test for ExactTypeBench mismatch reason paths. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt | Golden output for ExactTypeBench mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json | Input file for ExactTypeBench mismatch diagnostics. |
| diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md | Documents ExactTypeBench mismatch reasons. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt | Adds UX test for input deserialization error reporting. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt | Golden output for deserialization errors. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json | Input file containing an invalid datatype variant. |
| diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md | Documents deserialization error reporting. |
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #900      +/-   ##
==========================================
+ Coverage   89.31%   89.38%    +0.06%
==========================================
  Files         445      450        +5
  Lines       84095    85166     +1071
==========================================
+ Hits        75113    76129     +1016
- Misses       8982     9037       +55
```

Flags with carried forward coverage won't be shown.
```rust
        }
        (Value::Bool(h), Value::Bool(n)) => false_if!(h != n),
        (Value::Number(h), Value::Number(n)) => false_if!(h != n),
        (Value::String(h), Value::String(n)) => false_if!(h != n),
```
Is there any scenario where we don't want to consider upper/lower case differences?
That's a good question - I lean towards "no" for the following reasons:
- The decision to normalize case in strings is type dependent and even system dependent. For example, it's valid to normalize file-path cases on windows because the OS is generally case insensitive, but most Unix filesystems are case sensitive. So the decision to normalize case needs to be made by the input type we are matching against, which leads to problem 2.
- My desire to have fuzzy matching is basically to make the tolerance files easier to write and maintain. One consequence is that the input JSON that accompanies a tolerance JSON cannot properly deserialize into a valid input instance, meaning we can't use something like `PartialEq` to do the comparison. I had thought about allowing inputs to define an acceptance criterion like

  ```rust
  trait Accept {
      fn accept(&self, raw: &serde_json::Value, partial: &serde_json::Value) -> bool {
          is_subset(raw, partial)
      }
  }
  ```

  where we could delegate the entire matching process to the inputs, rather than using this blanket match for everything. The downside is that this adds yet another trait that inputs need to implement, even if it does end up being just a marker trait most of the time due to the provided implementation.
But this then opens a can of worms of extending the trait in the future to provide better diagnostics and potentially unintuitive matching behavior, and so I just kept it simple.
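For what it's worth, the case-sensitivity question and the `Accept` idea can be combined into a small sketch. Everything below is hypothetical (none of these types exist in the PR), and `Value` is a simplified stand-in for `serde_json::Value`:

```rust
// Sketch of the delegation idea: inputs get a default subset-style `accept`,
// so `Accept` stays a marker trait for most types, while a type that knows its
// strings are case-insensitive (e.g. Windows paths) can override it.

#[derive(PartialEq)]
enum Value {
    String(String),
}

// Stand-in for the blanket subset matcher described in the thread.
fn is_subset(raw: &Value, partial: &Value) -> bool {
    raw == partial
}

trait Accept {
    fn accept(&self, raw: &Value, partial: &Value) -> bool {
        is_subset(raw, partial)
    }
}

// Most inputs would only need the marker impl.
struct DefaultInput;
impl Accept for DefaultInput {}

// A Windows-path input could opt into case-insensitive comparison.
struct CaseInsensitivePath;
impl Accept for CaseInsensitivePath {
    fn accept(&self, raw: &Value, partial: &Value) -> bool {
        let (Value::String(a), Value::String(b)) = (raw, partial);
        a.eq_ignore_ascii_case(b)
    }
}

fn main() {
    let raw = Value::String(r"C:\Data\Index".into());
    let partial = Value::String(r"c:\data\index".into());
    assert!(!DefaultInput.accept(&raw, &partial)); // exact match fails
    assert!(CaseInsensitivePath.accept(&raw, &partial)); // normalized match passes
}
```

This keeps the default behavior identical to the blanket match while letting the type that owns the semantics decide about normalization.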
```rust
 * Licensed under the MIT license.
 */

//! This module contains the tolerance parsing, matching, and running logic.
```
This design separates benchmarking and regression checking.
Did we consider combining them? For example, each benchmark could have one output file, and we could simply add a `--baseline` flag for the target build/config, then compare the baseline and current output.
I had thought about that approach but went with the current one for a few reasons. First, adding a `--baseline` mode to the normal benchmark flow introduces two potential problems:
- Either it requires that all benchmarks implement regression checks. This is not something I wanted to do: mandatory regression checks mean that they will often be implemented hurriedly to get things to compile, and a bad regression check is worse than no regression check.
- Or it requires that the regression-capability matching is fused with the normal benchmark dispatching, which seemed like an unnecessary complication.
Reporting is also messier with a fused approach. A separate check flow allows all pass/fail results to be aggregated at once. With a fused regression check, I don't know whether regression checks should be displayed inline with the normal results or rolled up at the very end, and if a check fails, do we abort early or not? These should probably all be options, but that makes the implementation more complicated.
Finally, the embedded flow is tedious for tuning tolerances. A separate after-the-fact check allows multiple tolerances to be tried and the matching scheme to be refined without constantly rerunning the benchmark.
Overall, I felt a separate flow with pre-validation gave a cleaner separation of concerns.
```json
  "tolerance": {
    "type": "simd-tolerance",
    "content": {
      "min_time_regression": 0.05
```
@hildebrandmw, is this the expected usage of the proposed pattern?
1. Create a `tolerance.json` similar to this:
```json
{
  "checks": [
    {
      "input": {
        "type": "disk-index",
        "content": {}
      },
      "tolerance": {
        "type": "disk-index-tolerance",
        "content": {
          "build_time_regression": 0.10,
          "qps_regression": 0.10,
          "recall_regression": 0.01,
          "mean_ios_regression": 0.01,
          "mean_comps_regression": 0.01,
          "mean_latency_regression": 0.15,
          "p95_latency_regression": 0.15
        }
      }
    }
  ]
}
```
2. Add ~300 lines of Rust code to wire it up, including implementations of `check_lower_is_better()` and `check_higher_is_better()`:
https://github.com/microsoft/DiskANN/pull/912/changes#diff-f77d4bbaebfd48c9785d878c56f274b1aa3414d88b3896aed961acdd3904e194R131-R450
```rust
// ...
comparisons.push(check_higher_is_better(
    "qps",
    b_sr.qps as f64,
    a_sr.qps as f64,
    tolerances.qps_regression,
    &mut passed,
));
// ...
```
Is it possible to avoid step 2? It feels like the higher-the-better knowledge belongs in `tolerance.json`.
I would argue that the higher-the-better knowledge is exactly what shouldn't live in the `tolerance.json`. The writer of the benchmark knows which direction the change should be in and can structurally enforce that in the code. That said, I would recommend using a macro to stamp out much of the boilerplate. I would also say that reusing `NonNegativeFinite` to describe negative bounds is a little confusing.
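To illustrate the macro suggestion, here is a hypothetical sketch. None of these names are from the PR, and `relative_change` below is a local stand-in for the utility in `utils/num.rs` (the real one reportedly handles more corner cases):

```rust
/// Relative change from `before` to `after` (assumes non-negative metrics).
fn relative_change(before: f64, after: f64) -> f64 {
    if before == 0.0 {
        if after == 0.0 { 0.0 } else { f64::INFINITY }
    } else {
        (after - before) / before
    }
}

/// For higher-is-better metrics, fail when the value drops by more than `tol`.
fn check_higher_is_better(name: &str, before: f64, after: f64, tol: f64, passed: &mut bool) -> String {
    let change = relative_change(before, after);
    let ok = change >= -tol;
    *passed &= ok;
    format!("{name}: {before} -> {after} ({:+.2}%) {}", change * 100.0, if ok { "pass" } else { "FAIL" })
}

/// Stamp out several higher-is-better checks at once.
macro_rules! check_all_higher {
    ($passed:expr, $out:expr, $( ($name:literal, $b:expr, $a:expr, $tol:expr) ),+ $(,)?) => {
        $( $out.push(check_higher_is_better($name, $b, $a, $tol, $passed)); )+
    };
}

fn main() {
    let mut passed = true;
    let mut report: Vec<String> = Vec::new();
    check_all_higher!(&mut passed, report,
        ("qps", 1000.0, 980.0, 0.10),  // -2% drop, within 10% tolerance
        ("recall", 0.95, 0.90, 0.01),  // ~-5.3% drop, exceeds 1% tolerance
    );
    for line in &report {
        println!("{line}");
    }
    assert!(!passed);
}
```

The macro keeps the direction of each metric enforced in code, per the argument above, while reducing the per-metric call-site noise.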
```json
      }
    }
  ]
}
```
Thinking aloud:
I would love us to make benchmarking effortless by default, with more advanced capabilities unlocking incrementally as needed, where common workflows work out of the box with little to no user configuration.
I'm thinking about these scenarios (each builds on top of the previous one):

1. Single run, easy sharing: run a benchmark and export results to Excel, Word, or Markdown for quick sharing with the team.
2. Zero-config before/after comparisons: run benchmarks before and after a change and automatically produce a table with absolute and relative metric diffs; export the table to Excel, Word, or Markdown with no extra configuration and no diff reasoning needed.
3. Optional passed/failed signal for each metric: introduce metric metadata (higher_better/lower_better, thresholds) to enable green/red highlighting with minimal effort.
4. Add CI support to (3): via something like --fail_on_error, so the benchmark fails when thresholds are violated.
Thank you Alex. Some of these are indeed limitations, but some of these are already achievable:

1. This is the biggest weakness of the current approach of dumping things to JSON, and I'd really like to improve the story here. I've learned from experience that generating as rich a set of results as possible (i.e. `output.json`) is very helpful when it comes to data analysis, but not so much data presentation. Would love for more cycles to be put into this for the tool!
2. You can see a more detailed response here, but the TLDR is that the split into a separate executable keeps regression testing opt-in and makes tolerance iteration a little easier. That said, the `Display` impls on the `PassFail` structs are where rich formatting can happen in this PR. We could definitely work on infrastructure to make these `Display` implementations prettier, though, to make the output more useful by default.
3. This is already possible with proper `Display` implementations on the returned `PassFail` structs. The benchmark crate could make these a little more accessible, but the `colored` crate is available pretty easily for this.
4. This should already work: the `check` sub-executable will exit with an `Error` when at least one inner check fails with an error or doesn't pass. This in turn exits from `main` with a non-zero exit code that CI can use to indicate a failure. Indeed, any error encountered this way will result in a non-zero exit code 😄.
This adds support for native A/B testing in `diskann-benchmark-runner` with an example implementation added to `diskann-benchmark-simd`. The benefits of having this infrastructure at the Rust level are:

Concept
The idea is to use one `input.json` file to generate two output JSON files, `before.json` and `after.json`, for different builds or configurations of the library. The `input.json` is then accompanied by a `tolerances.json`, which contains runtime thresholds for values of interest to help accommodate runtime variability. A regression check takes all such files and performs the following steps:

1. Parse `tolerances.json` and `input.json`.
2. Match entries in `tolerances.json` to entries in `input.json`. I went with matching semantics rather than requiring a one-to-one correspondence to make it easier to have a single tolerance entry work as a blanket entry for multiple benchmark runs.
3. Take the `tolerance`/`input` pairs and match them with a regression-checkable benchmark.
4. Deserialize the corresponding entries of `before/after.json` into the benchmark's output type, or error gracefully if this cannot be done due to an incorrect environment.

The matching semantics work like this. Each tolerance entry looks like the following:
```json
{
  "input": {
    "type": "type-tag",
    "content": {}
  },
  "tolerance": {
    "type": "tolerance-type-tag",
    "content": "defined-per-tolerance"
  }
}
```

The content field of "input" need not be deserializable to its corresponding value. Instead, we use the raw JSON of the input and match it as a "subset" against the raw JSON of the `input.json` using the following rules:

- An array `x` is a subset of an array `y` if `x.len() <= y.len()` and each entry `i` in `x` is a subset of its corresponding entry in `y` (i.e., we match prefixes).
- An object `x` is a subset of an object `y` if each key in `x` is a key in `y` and each value associated with a key in `x` is a subset of the value of the same entry in `y`.
- Mismatched types (e.g., a `bool` cannot be a subset of an `integer`) break the match instantly.

This means that an empty "content" field will match any struct and thus be a blanket implementation for the input with the same type tag. Or, the "content" field can be refined to be more specific as needed.
Since a single tolerance entry may match multiple inputs, we ensure that the matching is unambiguous using the following rules:
If there is any ambiguity, the app stops with an error.
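For concreteness, the subset rules above can be sketched as follows, using a simplified stand-in for `serde_json::Value` (the real logic lives in `internal/regression.rs` and may differ in detail):

```rust
use std::collections::BTreeMap;

// Simplified stand-in for `serde_json::Value`, enough to express the rules.
#[derive(Clone, PartialEq)]
enum Value {
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// True if `x` is a "subset" of `y` under the matching rules.
fn is_subset(x: &Value, y: &Value) -> bool {
    match (x, y) {
        // Arrays match as prefixes: `x` may be shorter; entries match pairwise.
        (Value::Array(xs), Value::Array(ys)) => {
            xs.len() <= ys.len() && xs.iter().zip(ys).all(|(a, b)| is_subset(a, b))
        }
        // Every key in `x` must exist in `y` with a matching (subset) value.
        (Value::Object(xs), Value::Object(ys)) => xs
            .iter()
            .all(|(k, v)| ys.get(k).map_or(false, |w| is_subset(v, w))),
        // Scalars must be equal; mismatched types break the match instantly.
        (a, b) => a == b,
    }
}

fn main() {
    // An empty object is a blanket match for any object with the same tag.
    let blanket = Value::Object(BTreeMap::new());
    let input = Value::Object(BTreeMap::from([
        ("dim".to_string(), Value::Number(128.0)),
        ("name".to_string(), Value::String("bench".to_string())),
    ]));
    assert!(is_subset(&blanket, &input));
    // Refined content must agree with the input.
    let refined = Value::Object(BTreeMap::from([("dim".to_string(), Value::Number(64.0))]));
    assert!(!is_subset(&refined, &input));
    // Type mismatches break the match.
    assert!(!is_subset(&Value::Bool(true), &Value::Number(1.0)));
}
```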
CLI Changes
This adds a new `check` subcommand to the runner CLI with the following options:

- `check skeleton`: Print a skeleton tolerance JSON file.
- `check tolerances [NAME]`: List tolerance kinds, or describe one by name. This is similar to `inputs [NAME]`.
- `check verify --tolerances <FILE> --input-file <FILE>`: Validate a tolerance file against an input file. This runs up to step 3 of the checklist above and serves as a pre-flight check before any CI jobs are run. Errors with the setup that can be caught early will be, and can thus save CI time.
- `check run --tolerances <FILE> --input-file <FILE> --before <FILE> --after <FILE> [--output-file <FILE>]`: Run regression checks.

Benchmark Registration Changes
Regression checks are opt-in. Benchmarks that wish to opt in implement `benchmark::Regression` and its single `check` method. All logic for the before/after comparison lives in the `check` method. Such benchmarks also need to use `registry::Benchmarks::register_regression` to be correctly tracked as regression-compatible. No independent registration of the `Tolerance` associated type is needed. That is it.
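Since the trait itself isn't reproduced here, the following is a hedged sketch of what opting in might look like. Every name and signature below is an illustrative guess, not the actual `diskann-benchmark-runner` API:

```rust
// Illustrative sketch only: the real `benchmark::Regression`, `PassFail`, and
// registry signatures may differ. The point is the shape: one `check` method
// holding all of the before/after comparison logic.
use std::fmt;

struct MyTolerance {
    max_time_regression: f64, // e.g. 0.05 allows up to a 5% slowdown
}

struct MyOutput {
    time_secs: f64,
}

enum PassFail {
    Pass,
    Fail(String),
}

impl fmt::Display for PassFail {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PassFail::Pass => write!(f, "pass"),
            PassFail::Fail(reason) => write!(f, "FAIL: {reason}"),
        }
    }
}

trait Regression {
    type Tolerance;
    type Output;
    fn check(&self, tol: &Self::Tolerance, before: &Self::Output, after: &Self::Output) -> PassFail;
}

struct MyBench;

impl Regression for MyBench {
    type Tolerance = MyTolerance;
    type Output = MyOutput;

    // Returns a result rather than printing, so the caller owns reporting.
    fn check(&self, tol: &MyTolerance, before: &MyOutput, after: &MyOutput) -> PassFail {
        let change = (after.time_secs - before.time_secs) / before.time_secs;
        if change <= tol.max_time_regression {
            PassFail::Pass
        } else {
            PassFail::Fail(format!("time regressed by {:.1}%", change * 100.0))
        }
    }
}

fn main() {
    let tol = MyTolerance { max_time_regression: 0.05 };
    let before = MyOutput { time_secs: 1.00 };
    let after = MyOutput { time_secs: 1.10 }; // 10% slower, exceeds tolerance
    println!("{}", MyBench.check(&tol, &before, &after));
}
```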
Note that `check` should not print anything to `stdout` and should instead communicate success/failure solely through its return type, to avoid spamming the output.

Example
You can see this in action in `diskann-benchmark-simd`. Depending on the noise in your system, the output will vary (note that the `ci` profile should be used for more reliable measurements). This particular implementation has decided that a regression cannot be meaningfully detected if an execution time got rounded to zero.
Suggested Reviewing Order
`diskann-benchmark-runner`:

- `benchmark.rs`: This contains the new `Regression` trait and the internal type-erasure machinery that is used inside the registry to mark a benchmark as regression-capable.
- `registry.rs`: The new `register_regression` API as well as methods for retrieving and interrogating regression-capable benchmarks.
- `internal/regression.rs`: This is the meat and potatoes of this PR. It contains all the logic to perform the outlined steps and has pretty good module-level documentation that outlines the approach taken.
- `app.rs`: The new `Check` subcommand and routing. This is pretty straightforward save for the changes to tests (see below).
- `utils/num.rs`: I added two new opinionated utilities to help with writing checks:
  - `relative_change`: Compute the relative change between two values, handling corner cases.
  - `NonNegativeFinite`: Useful for serde-compatible assertions that tolerance entries have the obvious properties the name implies.
- `jobs.rs`: The changes here are mainly to make it easier to interact with the JSON patterns used by the input files, giving more uniform handling of said files.

Changes to the testing infrastructure:
The UX tests in `app.rs` have been upgraded to handle tolerance checks as well. This involves having more magically escaped patterns for input/output files and supporting multi-line `stdin.txt` files to allow regression tests to set up their respective environment.

Much of the logic in `internal/regression.rs` is tested via the UX tests because (1) setting up a proper environment for this functionality is challenging and (2) the UX tests provide much better visual information on the expected behavior. A lot of new UX tests have been added.

`diskann-benchmark-simd`:

The changes here are largely meant to work as an example of adding regression tests. Since
`diskann-benchmark-simd` is not a production-critical crate, feel free to just skim this or ignore it entirely.

Future Ideas
There are parts of this that are not perfect:
- One could collect multiple `before` files and build a statistical picture of trends. That opens a whole can of worms like schema stability, machine stability (if comparing run times), etc. that are way beyond the scope of what's needed to get basic regression tests running.
- A `--verbose` flag could print out all diagnostics rather than using the opinionated triage of errors, then failures, then successes.

Disclaimer: This PR (though not the PR description) was written with the help of AI but has been heavily edited and is something that I wouldn't be annoyed at reviewing outside of it being +3500 lines (I am sorry).