Add reproducibility data pipeline (schema v1, lib, scripts, CI) #4
Merged
radinhamidi merged 1 commit into main on Apr 29, 2026
Adds a self-contained reproducibility/ umbrella that backs
leaderboard.querygym.com and the SIGIR 2026 reproducibility paper.
Nothing in querygym/ is touched; the wheel is unchanged.
Schema v1 (reproducibility/schema.json) is the language-neutral
contract every run JSON must satisfy. Three validation passes
(emit / submit / aggregate) prevent drift. Hashes are embedded:
params_hash (8 hex over the tuning surface, doubles as filename)
and run_id (16 hex over the payload minus volatile fields).
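A minimal sketch of how the two embedded hashes could be derived and checked. The hash function (SHA-256), the canonical JSON serialization, the set of volatile fields, and every field name in the example payload are assumptions for illustration; the actual rules live in reproducibility/lib/ and schema.json.

```python
import hashlib
import json

# Hypothetical volatile fields; the real set is defined by the repo's tooling.
VOLATILE_FIELDS = {"run_id", "created_at"}


def _canonical(obj) -> bytes:
    """Serialize deterministically: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def params_hash(params: dict) -> str:
    """8 hex characters over the tuning surface (assumed here to be one dict)."""
    return hashlib.sha256(_canonical(params)).hexdigest()[:8]


def run_id(payload: dict) -> str:
    """16 hex characters over the payload minus volatile fields."""
    stable = {k: v for k, v in payload.items() if k not in VOLATILE_FIELDS}
    return hashlib.sha256(_canonical(stable)).hexdigest()[:16]


def verify(payload: dict) -> None:
    """Raise if the embedded run_id no longer matches the payload contents."""
    if payload.get("run_id") != run_id(payload):
        raise ValueError("run_id mismatch")
```

Under these assumptions, changing a volatile field such as a timestamp leaves the run_id intact, while editing a metric value by hand makes verify() raise.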
Layout: reproducibility/data/runs/{dataset_id}/{method_id}/{model}/
{params_hash}.{json,run.txt,queries.tsv}
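The layout above can be sketched as a small path builder. The function name run_paths and the example identifiers in the usage note are hypothetical; only the directory pattern and the three file suffixes come from this description.

```python
from pathlib import Path

RUNS_ROOT = Path("reproducibility/data/runs")
SUFFIXES = (".json", ".run.txt", ".queries.tsv")


def run_paths(dataset_id: str, method_id: str, model: str,
              params_hash: str, root: Path = RUNS_ROOT) -> dict:
    """Map each artifact suffix to its path under {dataset_id}/{method_id}/{model}/."""
    base = root / dataset_id / method_id / model
    return {suffix: base / f"{params_hash}{suffix}" for suffix in SUFFIXES}
```

For example, run_paths("some-dataset", "some-method", "some-model", "abcd1234") returns three sibling paths that share the abcd1234 stem, so a run's summary, ranking, and queries always sit next to each other.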
Tooling:
- reproducibility/lib/: build_run_summary, validate, hash helpers
(private to this repo's tooling; external consumers read schema.json)
- reproducibility/scripts/aggregate_runs.py with --check for CI
- reproducibility/scripts/submit_run.py for both trusted contribs
and fork PR submitters
- reproducibility/tests/: 19 tests covering hashing, validation,
and hostile inputs
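One common way to implement a --check mode like the one listed for aggregate_runs.py is to regenerate the aggregate in memory and compare it with the committed file, exiting non-zero when they differ. The sketch below illustrates that pattern only; the --runs-dir and --out flags, the CSV columns, and the function names are assumptions, not the script's actual interface.

```python
import argparse
import csv
import io
import json
import sys
from pathlib import Path


def render_csv(runs_dir: Path) -> str:
    """Build the aggregate CSV text deterministically from run JSON files."""
    rows = []
    for path in sorted(runs_dir.rglob("*.json")):  # sorted for stable row order
        run = json.loads(path.read_text(encoding="utf-8"))
        rows.append({"run_id": run["run_id"],
                     "path": path.relative_to(runs_dir).as_posix()})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["run_id", "path"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--runs-dir", type=Path, required=True)  # hypothetical flag
    parser.add_argument("--out", type=Path, required=True)       # hypothetical flag
    parser.add_argument("--check", action="store_true")
    args = parser.parse_args(argv)

    fresh = render_csv(args.runs_dir)
    if args.check:
        # Compare only; never write in check mode so CI cannot mask drift.
        current = args.out.read_text(encoding="utf-8") if args.out.exists() else None
        if current != fresh:
            print(f"{args.out} is stale; re-run the aggregator", file=sys.stderr)
            return 1
        return 0
    args.out.write_text(fresh, encoding="utf-8")
    return 0
```

A script entry point would typically end with sys.exit(main()), so a stale aggregate fails the CI job.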
Wires examples/querygym_pyserini/pipeline.py to call
build_run_summary at the end of run_pipeline; falls back to a
pipeline_partial.json for incomplete runs.
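The fallback described above might look roughly like the following. This is a sketch under stated assumptions: the step names, the summary.json file name, and the shape of the partial payload are hypothetical, and build_summary is a stand-in callable because the real signature of build_run_summary is not shown here.

```python
import json
from pathlib import Path

# Hypothetical step names; the example pipeline defines its own.
REQUIRED_STEPS = ("reformulate", "retrieve", "evaluate")


def write_summary(step_meta: dict, out_dir: Path, build_summary) -> Path:
    """Write a full summary when every step ran, else pipeline_partial.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = [step for step in REQUIRED_STEPS if step not in step_meta]
    if missing:
        # Incomplete run: record what exists without claiming a full summary.
        target = out_dir / "pipeline_partial.json"
        payload = {"completed_steps": sorted(step_meta), "missing_steps": missing}
    else:
        target = out_dir / "summary.json"  # hypothetical file name
        payload = build_summary(step_meta)
    target.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return target
```

Writing the partial file under a different name keeps incomplete runs out of anything that scans for full run summaries.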
CI workflow runs on PRs touching reproducibility/** or the
example pipeline. No pyserini/trec_eval in CI by design;
fork PRs are verified manually by maintainers re-running locally.
MANIFEST.in prunes reproducibility/ and web/ from sdist;
pyproject.toml adds a 'repro' extra (pandas, jsonschema) and
extends testpaths. .gitignore protects CLAUDE.local.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a self-contained `reproducibility/` umbrella that backs leaderboard.querygym.com and the SIGIR 2026 reproducibility paper. `querygym/` is untouched; the wheel is unchanged.
- Schema v1 (`reproducibility/schema.json`) is the language-neutral contract every run JSON must satisfy. It is validated three times (emit / submit / aggregate) so drift can't leak into the leaderboard.
- Embedded hashes: `params_hash` (8 hex over the tuning surface, doubles as filename) and `run_id` (16 hex over the payload minus volatile fields). Hand-edits to a metric value fail validation with a clear `run_id mismatch`.
- Layout: `reproducibility/data/runs/{dataset_id}/{method_id}/{model}/{params_hash}.{json,run.txt,queries.tsv}`.
- Tooling: `reproducibility/lib/` (private helpers), `scripts/aggregate_runs.py` with `--check` for CI, `scripts/submit_run.py` used by both trusted and fork contributors.
- `examples/querygym_pyserini/pipeline.py` calls `build_run_summary` for full pipelines; partial pipelines write `pipeline_partial.json` instead.
- CI runs `--check` on PRs touching `reproducibility/**` or the example pipeline. No pyserini/trec_eval in CI by design; fork PRs are verified manually by maintainers re-running locally.

See `reproducibility/README.md` and `docs/user-guide/reproducibility.md` for contributor flows; `reproducibility/schema.md` for the field-by-field schema.

Test plan
- `pytest reproducibility/tests -v --no-cov`: 19 tests cover hashing, schema rejections, registry checks, hash tampering, and silent metric edits.
- `python -m reproducibility.scripts.aggregate_runs` on empty `runs/` produces deterministic CSV + manifest; `--check` exits 0.
- […] `runs/` via `submit_run.py`, ran `aggregate_runs`, confirmed 3 rows; tampered a metric → `--check` fails with clear error.
- `python -m build --sdist` confirms `reproducibility/`, `web/`, and `runs/` content are NOT in the sdist; `querygym/` files ship as before.
- `_build_v1_summary` helper in `pipeline.py` validates against synthetic per-step metadata.

Out of scope (separate PRs)
- […] (`reproducibility/site/`).
- […] (`_config.yml`, `_layouts/`, `docs/leaderboard.html`).

🤖 Generated with Claude Code