Fix CI for reproducibility data pipeline #6

Merged
radinhamidi merged 5 commits into main from fix/repro-pipeline-ci
Apr 29, 2026

@radinhamidi

Summary

PR #4 was merged while its CI was failing, so main is currently broken. This hotfix collects the five fixes that were developed against the (already-closed) feat/reproducibility-pipeline branch but did not make it into the merge:

  1. Add pythonpath = ["."] to pytest config: defensive only. The python -m pytest change in item 3 is what actually puts the repo root on sys.path.
  2. Trigger repro CI on pyproject.toml + workflow changes: broaden the path filter so future config-only fixes actually re-run CI.
  3. Run repro tests with python -m pytest: guarantees the repo root is on sys.path regardless of pytest version, so from reproducibility.lib import ... resolves once the package files exist.
  4. Add missing reproducibility/lib/ files: this was the root cause. The standard Python .gitignore line lib/ (no leading slash) matches a directory named lib at any depth, so it silently caught reproducibility/lib/, and __init__.py, emit.py, and validate.py were absent from the merge commit. This PR anchors that pattern and the other Python build dirs (build/, dist/, eggs/, lib64/, etc.) to the repo root with leading slashes.
  5. Drop querygym_version from aggregator --check comparison: it is informational provenance, not data correctness. Its value differs between contributor machines and CI, so it should not fail the check. content_hash still pins the actual data byte-for-byte.

Test plan

  • python -m pytest reproducibility/tests -q --no-cov → 19 passed locally.
  • python -m reproducibility.scripts.aggregate_runs --check → OK locally.
  • python -m build after applying these fixes → sdist + wheel clean (no leak), wheel still 36 files all under querygym/ (Python package isolation preserved).
  • CI on this PR runs Reproducibility Check workflow and passes.

Notes

🤖 Generated with Claude Code

radinhamidi and others added 5 commits April 29, 2026 16:54
CI runs `pytest reproducibility/tests` directly (not `python -m
pytest`), which doesn't add the repo root to sys.path. Without the
explicit pythonpath, `from reproducibility.lib import ...` fails
with ModuleNotFoundError. Pin it in [tool.pytest.ini_options] so
the test setup works regardless of invocation style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workflow's path filter previously skipped any change that didn't
touch reproducibility/**, the example pipeline, or dataset_registry.yaml.
That meant the pyproject.toml pytest config fix (previous commit)
didn't actually re-run CI. Add pyproject.toml and the workflow file
itself to the path filter, plus workflow_dispatch for manual reruns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even with pythonpath = ["."] in pytest config, CI's Python 3.9 +
pytest 8.4 + cov plugin combination doesn't make `reproducibility`
importable as a top-level package when invoked as `pytest`.
`python -m pytest` always prepends cwd to sys.path, which fixes it.
This is the canonical workaround for "from <repo-root-pkg> import"
in CI without making the package pip-installable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
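
The sys.path difference described in this commit message can be reproduced without pytest. The following is a minimal sketch, not code from this repository: the `tools/probe.py` layout and the `cwd_on_sys_path` helper are hypothetical names chosen for illustration. It runs the same probe file once as a module (`python -m tools.probe`) and once by script path, and reports whether the working directory ended up on `sys.path`. The assumption is that the bare `pytest` entry point behaves like the script-path case, which is consistent with the failure described above.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Probe program: prints "True" if the current working directory is on sys.path.
PROBE = textwrap.dedent(
    """
    import os, sys
    entries = {os.path.realpath(p or os.curdir) for p in sys.path}
    print(os.path.realpath(os.getcwd()) in entries)
    """
)


def cwd_on_sys_path(as_module: bool) -> bool:
    """Run the probe from a scratch 'repo root' and report what it saw.

    as_module=True  -> python -m tools.probe      (like `python -m pytest`)
    as_module=False -> python tools/probe.py      (script-style invocation)
    """
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, "tools")  # hypothetical package name
        os.mkdir(pkg)
        with open(os.path.join(pkg, "__init__.py"), "w"):
            pass
        with open(os.path.join(pkg, "probe.py"), "w") as fh:
            fh.write(PROBE)

        # Drop variables that would change sys.path behind our back.
        env = {
            k: v
            for k, v in os.environ.items()
            if k not in ("PYTHONPATH", "PYTHONSAFEPATH")
        }
        if as_module:
            cmd = [sys.executable, "-m", "tools.probe"]
        else:
            cmd = [sys.executable, os.path.join("tools", "probe.py")]
        result = subprocess.run(
            cmd, cwd=root, env=env, capture_output=True, text=True, check=True
        )
        return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("python -m tools.probe  :", cwd_on_sys_path(as_module=True))
    print("python tools/probe.py  :", cwd_on_sys_path(as_module=False))
```

With `-m`, the interpreter puts the working directory at the front of `sys.path`. With a script path, it puts the script's own directory there instead, so a package that lives at the repo root is only importable in the first case.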
The standard Python .gitignore line `lib/` (no leading slash) matches a
directory named lib at any depth in the tree, including
reproducibility/lib/, which silently kept emit.py, validate.py, and
__init__.py out of the previous commit. CI couldn't import
reproducibility.lib because the directory itself didn't exist on the
remote, which explains the persistent ModuleNotFoundError despite the
pytest config + python -m pytest fixes.

Anchor the Python distribution directories (build/, dist/, lib/,
lib64/, eggs/, etc.) to the repo root with a leading slash so they
match only top-level dirs, not nested project directories that
legitimately use the same names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
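
To make the anchoring rule concrete, here is a deliberately simplified model of how a directory-only ignore pattern behaves with and without a leading slash. It is an illustration, not git's matcher: the `dir_pattern_ignores` helper is hypothetical, handles only single-name patterns such as `lib/` or `/lib/` (no wildcards, no negation, no nested .gitignore files), and expects file paths relative to the repo root.

```python
from pathlib import PurePosixPath


def dir_pattern_ignores(pattern: str, file_path: str) -> bool:
    """Simplified model of a single-name, directory-only ignore pattern.

    'lib/'  -> ignores files under a directory named lib at any depth.
    '/lib/' -> ignores files only under the top-level lib directory.
    """
    name = pattern.strip("/")
    # Directories that contain the file, outermost first.
    ancestors = PurePosixPath(file_path).parts[:-1]
    if pattern.startswith("/"):
        # Anchored: only the first path component may match.
        return len(ancestors) > 0 and ancestors[0] == name
    # Unanchored: any containing directory may match.
    return name in ancestors


if __name__ == "__main__":
    for pattern in ("lib/", "/lib/"):
        for path in ("lib/site.py", "reproducibility/lib/emit.py"):
            print(f"{pattern!r:8} {path!r:32} ignored={dir_pattern_ignores(pattern, path)}")
```

In this model, the unanchored `lib/` ignores both paths, while `/lib/` ignores only `lib/site.py` and leaves `reproducibility/lib/emit.py` trackable. On a real checkout, `git check-ignore -v <path>` reports which pattern, if any, is responsible for ignoring a file.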
Contributors generate manifest.json on machines that may or may not
have querygym installed (the lib reads __version__ via lazy import
and falls back to "unknown" on ImportError). CI always has querygym
installed via `pip install -e ".[repro,dev]"`, so the regenerated
manifest's querygym_version differed from the committed one and
--check failed on a purely informational field.

content_hash still pins the actual aggregate data byte-for-byte.
schema_version, run_count, and row_count remain in the comparison —
those are real correctness signals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
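
The comparison described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration, not the aggregator's actual code: the function names `detect_querygym_version` and `manifest_mismatches`, the `COMPARED_FIELDS` tuple, and the sample manifest values are all hypothetical. Only the field names and the "unknown" fallback on ImportError come from the text above.

```python
# Fields that remain in the --check comparison, per the commit message.
# querygym_version is intentionally absent: it is informational only.
COMPARED_FIELDS = ("schema_version", "run_count", "row_count", "content_hash")


def detect_querygym_version() -> str:
    """Lazy import with a fallback, so the manifest can be built without querygym."""
    try:
        import querygym  # may not be installed on a contributor machine
    except ImportError:
        return "unknown"
    return str(getattr(querygym, "__version__", "unknown"))


def manifest_mismatches(committed: dict, regenerated: dict) -> list:
    """Return the names of compared fields whose values differ."""
    return [
        field
        for field in COMPARED_FIELDS
        if committed.get(field) != regenerated.get(field)
    ]


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the real manifest.json.
    committed = {
        "schema_version": 1,
        "run_count": 3,
        "row_count": 120,
        "content_hash": "abc123",
        "querygym_version": "unknown",
    }
    regenerated = dict(committed, querygym_version=detect_querygym_version())
    print("mismatches:", manifest_mismatches(committed, regenerated))
```

Comparing against an explicit allow-list of fields means that a new informational field added to the manifest later does not start failing the check by accident. A change in the data itself still surfaces through `content_hash`.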
@radinhamidi radinhamidi merged commit 7ebaac9 into main Apr 29, 2026
1 check passed
@radinhamidi radinhamidi deleted the fix/repro-pipeline-ci branch April 30, 2026 05:20