fix: preview Hugging Face dry-run scans #1645
Add an opt-in Hugging Face false-positive regression harness with pinned live cases and synthetic malicious controls.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05fb6a454c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Reviewed commit: 086e576836
@codex review
💡 Codex Review
Reviewed commit: 982f74dfa3
💡 Codex Review
Reviewed commit: f178df9913
@codex review
💡 Codex Review
Reviewed commit: b6b113e1a2
@codex review
💡 Codex Review
Reviewed commit: e2fdd6e7be
CI diagnosis for exact head
@codex review
Codex Review: Didn't find any major issues. Hooray!
Current head
Run/job: https://github.com/promptfoo/modelaudit/actions/runs/27322160248/job/80715503773
All other required jobs passed.
@codex reproduce under Windows path/URL semantics, fix without weakening dry-run's no-download/no-scan invariant, add/adjust the cross-platform regression, push, and request exact-head review.
💡 Codex Review
Reviewed commit: f78836c99a
@codex review
💡 Codex Review
Reviewed commit: 0c3024f18f
Independent adversarial review: promptfoo/modelaudit PR #1645
Findings
1. High —
| Thread | Live state | Independent disposition |
|---|---|---|
| r3392638132 preserve issue exit | resolved/outdated | Fixed by the preview-only exit guard; malicious local dry-run test passes. |
| r3392638135 recursive metadata | resolved/outdated | Fixed with recursive tree listing, subject to finding 9. |
| r3392687672 preserve local no-files exit | resolved/current | Fixed; empty local dry-run exits 2. |
| r3392687673 skip recursive folders | resolved/current | Fixed for current Hub RepoFolder objects, which lack size. |
| r3392687676 structured stdout | resolved/outdated | Fixed; preview text is routed to stderr for JSON/SARIF stdout. |
| r3392687679 validate direct-file existence | resolved/current | Fixed through repository/path metadata calls. |
| r3392764356 reject direct folders | resolved/current | Fixed because folder metadata has no valid size. |
| r3392764357 honor repo max-size | resolved/outdated | Partially fixed; finding 4 remains. |
| r3392801228 use real file set for max-size | resolved/outdated | Partially fixed; findings 1, 4, and 9 remain. |
| r3392801231 no scannable files | resolved/outdated | Fixed for unselected non-streaming previews. |
| r3392860440 mirror full-download selection | resolved/outdated | Partially fixed; findings 1, 4, and 5 remain. |
| r3392977222 content-routed selection | resolved/current | Partially fixed; finding 5 remains with scanner selection. |
| r3392977223 streaming unfiltered limit | resolved/current | Fixed; the 129-file regression exits 2. |
| r3393239818 avoid content probing | unresolved/current | Validated; finding 1. |
| r3393239822 POSIX Hub paths | unresolved/outdated | Fixed by PurePosixPath and the new backslash-path regression, but unrelated to the prior failing test; see finding 2. |
| r3393239824 unsupported direct files | unresolved/current | Partially validated with script.py; cover.png is not a valid reproducer; see finding 6. |
| r3393330907 scanner-selected content routes | unresolved/current | Validated; finding 5. |
| r3393330909 timeout propagation | unresolved/current | Validated; finding 3. |
| r3393330913 README size budget | unresolved/current | Validated; finding 4. |
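The POSIX Hub paths row turns on how `pathlib` derives a basename. The following is a minimal sketch of the standard-library behavior only, not the PR's code; the path string is a hypothetical example of a Hub filename that contains a literal backslash.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical repo-relative Hub path. Hub paths use "/" as the separator,
# so the backslash here is part of the file name, not a directory boundary.
hub_path = "weights/sub\\model.bin"

# POSIX semantics: only "/" separates components.
posix_name = PurePosixPath(hub_path).name

# Windows semantics: both "/" and "\" separate components.
windows_name = PureWindowsPath(hub_path).name

print(repr(posix_name))    # 'sub\\model.bin'
print(repr(windows_name))  # 'model.bin'
```

Using `PurePosixPath` for remote paths keeps the derived name identical on every host OS, which is what a cross-platform regression would need to assert.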
Validation performed
Review inputs:
- Fetched live PR metadata first and checked out the exact PR ref in an isolated detached worktree; the primary checkout was dirty and was not modified.
- Read exact-head root `AGENTS.md`, the `pr-review` skill, the full changed production/test surface, surrounding source selectors/downloaders, all live comments/reviews/threads, commit history, and prior/current Windows job evidence.
- `git diff --check 8d6c4864...0c3024f1`: passed.
- No scoped `AGENTS.md` beyond root applies to the changed paths; no unambiguous policy violation was found.
Focused exact-head tests:
- `tests/test_huggingface_fp_regression_harness.py`, two local dry-run exit tests, `TestGetModelInfo`, and `TestGetHuggingFaceFileInfo`: 23 passed, 7 skipped (the seven live Hub cases are opt-in).
- `tests/test_cli.py -k "huggingface and (streaming or dry_run or scanner)"`: 13 passed, 246 deselected.
- `tests/utils/sources/test_huggingface.py`: 271 passed.
- Focused xdist harness/exit subset: 18 passed, 7 skipped.
Adversarial exact-head probes:
- Dry-run content route: two remote-prefix-read attempts observed.
- Direct `script.py`: normal exit 2 vs dry-run exit 0.
- README budget: dry-run exit 0 under 50 B while full selector includes `README.md`.
- Renamed PyTorch route: scanner-selected dry-run exit 2 while full selector includes the file.
- Timeout: `--timeout 1` was not passed to metadata lookup.
- Empty cloud directory preview: exit 0.
- Old Windows basename semantics and new POSIX semantics selected the same failed-test fixture.
- Replacing `modelaudit.cli` as done by `test_cli_logging_handlers.py` made string patches target the new module while the previously imported Click command retained old globals, reproducing the exact Windows mock bypass.
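The last probe describes a general Python hazard that can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the repository's test code: `demo_cli` is a hypothetical stand-in module, and `get_model_info` and `command` are placeholder functions defined inside it.

```python
import sys
import types
from unittest import mock

MODULE_NAME = "demo_cli"  # hypothetical stand-in for a CLI module


def make_module(name: str) -> types.ModuleType:
    """Build a tiny module whose command() calls a module-level helper."""
    module = types.ModuleType(name)
    source = (
        "def get_model_info():\n"
        "    return 'real'\n"
        "\n"
        "def command():\n"
        "    return get_model_info()\n"
    )
    exec(source, module.__dict__)
    return module


# 1. The module is imported once and a caller keeps a reference to command().
sys.modules[MODULE_NAME] = make_module(MODULE_NAME)
held_command = sys.modules[MODULE_NAME].command

# 2. Another test replaces the process-global sys.modules entry.
sys.modules[MODULE_NAME] = make_module(MODULE_NAME)

# 3. A string patch resolves the name through sys.modules, so it lands on the
#    replacement module. held_command still reads the old module's globals.
with mock.patch(f"{MODULE_NAME}.get_model_info", return_value="mocked"):
    bypassed = held_command()                    # helper from the old module
    patched = sys.modules[MODULE_NAME].command()  # helper from the new module

del sys.modules[MODULE_NAME]  # leave the interpreter state clean

print(bypassed, patched)  # real mocked
```

Running the module-replacing test in a subprocess avoids the split, because the replaced entry never reaches the process that holds the earlier reference.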
Broader validation:
- A local broad fast-suite attempt using the primary checkout's existing virtualenv was stopped after native-package/cache failures unrelated to this diff (`modelaudit-picklescan` was not resolvable from the isolated exact-head worktree, causing scanner cascades). It reached 3,658 passes before interruption. These results are not used as PR findings.
- The exact-head GitHub matrix is the broad-suite authority. It completed with only Windows and aggregate `CI Success` failing; the Windows failure is finding 2.
- The pinned live Hub matrix was not independently run because `huggingface.co` is outside this environment's network allowlist. The PR body reports one rank-2 live case passed and six cases skipped, but that self-reported run exercises only the limited path described in finding 7.
Side-effect and malformed-input assessment
- No new model-content disk write or cache/temp-directory creation was found before the Hugging Face dry-run branches return. Explicit `--output`/`--sbom` writes remain user-requested behavior.
- The network non-execution guarantee is not met because content prefix bodies are fetched; metadata API calls and normal telemetry are separate network activity.
- Missing direct files, direct folders, credential-bearing errors, recursive folders, unknown selected sizes under `--max-size`, local findings, and empty local paths have focused passing regressions.
- Unsupported direct-file prefiltering, mutable-revision metadata, and malformed/empty cloud preview semantics remain incorrect as described above.
Merge disposition
Request changes / do not merge 0c3024f18f10286d4a8649432837aa2cf2e68fac.
Minimum merge gates:
- Make Hugging Face dry-run metadata-only: no model-content GETs, downloads, scans, or hidden cache/temp writes. Fail closed when exact content routing cannot be known from metadata.
- Propagate the end-to-end timeout and use one immutable repository manifest for names, sizes, and revision.
- Match full-download/streaming size and selection semantics, including README files, scanner-selected renamed artifacts, and direct files rejected by the normal prefilter.
- Replace declarative mode strings with actual pinned CLI dry-run, streaming, scanner-selective, local-directory, and malicious-control executions; bound acquisition before transfer.
- Stop `test_cli_logging_handlers.py` from replacing the process-global CLI module (prefer a subprocess), make the new harness independent of worker ordering/live network, and require green exact-head Windows evidence.
- Restore the existing cloud dry-run no-files/error contract or explicitly scope, document, and test the intended cross-provider change.
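One way to read the metadata-only and fail-closed gates is as a size-budget check that never touches file content. The sketch below is an assumption-laden illustration, not modelaudit's implementation: `ManifestEntry`, `PreviewRejected`, and `preview_total_size` are hypothetical names, and the byte counts are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical metadata record: repo-relative path and reported size."""

    path: str
    size: Optional[int]  # None when the metadata carries no usable size


class PreviewRejected(Exception):
    """Raised when the preview cannot be decided from metadata alone."""


def preview_total_size(
    entries: Iterable[ManifestEntry], max_size: Optional[int]
) -> int:
    """Sum reported sizes, failing closed on unknown sizes or budget overrun."""
    total = 0
    for entry in entries:
        # Fail closed: an unknown size is an error, not an implicit zero.
        if entry.size is None or entry.size < 0:
            raise PreviewRejected(f"unknown size for {entry.path!r}")
        total += entry.size
        if max_size is not None and total > max_size:
            raise PreviewRejected(
                f"selected files exceed the {max_size}-byte budget"
            )
    return total


# Every selected file counts toward the budget, README files included.
manifest = [
    ManifestEntry("README.md", 60),
    ManifestEntry("model.safetensors", 1_000),
]
print(preview_total_size(manifest, max_size=None))  # 1060
```

Counting every entry the full selector would pick, rather than a filtered subset, is what keeps the preview and the real download in agreement on `--max-size`.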
@codex review
💡 Codex Review
Reviewed commit: d0ac58c6e4
…t01-hf-fp-regression-harness-20260610
…t01-hf-fp-regression-harness-20260610
💡 Codex Review
Reviewed commit: e1baa07330
…t01-hf-fp-regression-harness-20260610 # Conflicts: # modelaudit/cli.py # modelaudit/utils/sources/huggingface.py
💡 Codex Review
Reviewed commit: 4fc88b3363
…t01-hf-fp-regression-harness-20260610 # Conflicts: # modelaudit/cli.py # modelaudit/utils/sources/huggingface.py # tests/utils/sources/test_huggingface.py
💡 Codex Review
Reviewed commit: aafefb118d
…t01-hf-fp-regression-harness-20260610 # Conflicts: # modelaudit/cli.py # tests/test_cli.py
…t01-hf-fp-regression-harness-20260610
…t01-hf-fp-regression-harness-20260610
…t01-hf-fp-regression-harness-20260610 # Conflicts: # modelaudit/cli.py # modelaudit/utils/sources/huggingface.py # tests/test_cli.py
…t01-hf-fp-regression-harness-20260610
…t01-hf-fp-regression-harness-20260610 # Conflicts: # modelaudit/utils/sources/_huggingface_download_worker.py
Summary
- `--dry-run` previews that skip downloads and scans
- `[Unreleased]` changelog
Review Notes
- `modelaudit.cli.get_model_info` now delegates through the Hugging Face source module so existing source-module patch points still affect metadata preview tests.
Validation
- `uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/`
- `uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/`
- `uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/`
- `PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_huggingface_fp_regression_harness.py tests/test_cli.py::test_scan_huggingface_metadata_preview_escapes_model_id tests/test_cli.py::test_scan_huggingface_metadata_preflight_verbose_log_is_sanitized -q`
- `PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_cli.py -k "huggingface and (streaming or dry_run or scanner)" --maxfail=1 -q`
- `PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/sources/test_huggingface.py --maxfail=1 -q`
- `PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1` (17,373 passed, 1,292 skipped)
- `git diff --check`
Live QA
- `PROMPTFOO_DISABLE_TELEMETRY=1 MODELAUDIT_RUN_HF_FP_LIVE=1 uv run pytest tests/test_huggingface_fp_regression_harness.py::test_live_pinned_hf_manifest_case_end_to_end -q --maxfail=1` (rank-2 live case passed, 6 full-matrix cases skipped)
- `PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan --dry-run --format json --scanners safetensors hf://sentence-transformers/all-MiniLM-L6-v2` (exit 0; 0 files scanned; download/scan skipped)