fix: calibrate legacy PyTorch storage persistent IDs by mldangelo-oai · Pull Request #1652 · promptfoo/modelaudit

mldangelo-oai · 2026-06-11T01:41:42Z

Summary

Downgrade canonical PyTorch storage BINPERSID findings to informational only after ModelAudit has validated legacy PyTorch framing, the canonical storage persistent-ID tuple, and the referenced storage payload.
Keep arbitrary/custom PERSID and non-storage BINPERSID records fail-closed by marking the legacy storage layout incomplete instead of applying contextual trust.
Add focused regression coverage for trusted canonical storage IDs and malformed/malicious controls, plus a changelog entry.

Root cause and security tradeoff

Legacy PyTorch binary files store tensor payload references as pickle persistent IDs. The standalone pickle engine correctly reports PERSISTENT_ID/BINPERSID as S212 because, outside context, persistent loaders can be attacker-controlled. ModelAudit already validates the broader legacy PyTorch container and storage payload layout, but it did not use that validated context to calibrate the canonical storage BINPERSID warning. That left canonical torch.LongStorage references in a fully validated legacy PyTorch binary as actionable false positives.

This PR applies trust only in the Python ModelAudit scanner, after all of these hold:

legacy PyTorch control streams and storage records validate
the persistent ID is canonical storage BINPERSID
the storage key is present in the validated storage-key list
the referenced storage payload is present and sized correctly

The standalone modelaudit-picklescan package remains conservative. Ambiguous evidence still fails closed.

Real-model QA

Pinned model:

ProsusAI/finbert @ 4556d13015211d73dccd3fdd39d39232506f3e43
Verified current hf://ProsusAI/finbert SHA: 4556d13015211d73dccd3fdd39d39232506f3e43

Before the fix on the required baseline, the pinned pytorch_model.bin reported actionable S212 for canonical torch.LongStorage BINPERSID key 94591118788240.

Final local pinned artifact:

PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan /tmp/modelaudit-t20-pytorch_model.bin --format json --output /tmp/modelaudit-t20-final-local-bin.json --max-size 1GB --timeout 900 --no-whitelist --no-cache

Outcome: exit 0, success=true, issues=0, failed_checks=0, pickle_verdict=clean, pickle_report_status=complete, legacy_pytorch_storage_key_count=202, legacy_pytorch_trusted_storage_persistent_id_count=1. The S212 BINPERSID for key 94591118788240 is present as passed/info with trusted_legacy_pytorch_context=true.

Final pinned HF file:

PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan https://huggingface.co/ProsusAI/finbert/resolve/4556d13015211d73dccd3fdd39d39232506f3e43/pytorch_model.bin --format json --output /tmp/modelaudit-t20-final-hf-file.json --cache-dir /tmp/modelaudit-t20-final-hf-file-cache --max-size 1GB --timeout 900 --no-whitelist

Outcome: exit 0, success=true, issues=0, failed_checks=0, pickle_verdict=clean, pickle_report_status=complete, legacy_pytorch_storage_key_count=202, legacy_pytorch_trusted_storage_persistent_id_count=1.

Final HF streaming scan:

PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan hf://ProsusAI/finbert --stream --format json --output /tmp/modelaudit-t20-final-hf-stream.json --cache-dir /tmp/modelaudit-t20-final-hf-stream-cache --max-size 2GB --timeout 900 --no-whitelist

Outcome: exit 0, success=true, has_errors=false, files_scanned=7, bytes_scanned=876193055, failed_checks=0; the PyTorch file has pickle_verdict=clean, pickle_report_status=complete, legacy_pytorch_storage_key_count=202, legacy_pytorch_trusted_storage_persistent_id_count=1. The remaining 7 issues are informational README S309 URL/domain findings.

Tests

PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_pickle_scanner.py -k "legacy_pytorch" --maxfail=1 -q
PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_pickle_scanner.py -q
PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest packages/modelaudit-picklescan/tests/test_api.py -k "persistent_id or pytorch_storage" -q
PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_picklescan_adapter.py tests/scanners/test_pytorch_zip_scanner.py -k "persistent_id or PERSISTENT_ID or storage_persistent" -q
PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_core.py -k "persistent or pytorch or pickle" -q
PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_streaming_scan.py -q
uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto --maxschedchunk=1 -m "not slow and not integration" --maxfail=1
git diff --check

Broad suite outcome: 18549 passed, 793 skipped, 40 warnings in 635.72s.

Malicious and malformed controls

Regression tests cover:

.bin extension alone does not grant trust
custom PERSID
extra non-storage BINPERSID
noncanonical storage tuple shapes and types
missing storage-key records
unlisted storage keys
truncated storage payloads
malformed legacy storage headers
truncated control streams

All ambiguous cases retain warning or inconclusive behavior and do not set trusted_legacy_pytorch_context.

mldangelo-oai · 2026-06-11T01:41:49Z

@codex review

github-actions · 2026-06-11T01:43:51Z

Workflow run and artifacts

Performance Benchmarks

Compared 12 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 12 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 1.422s -> 1.413s (-0.6%).

Workload	Benchmark	Target	Size	Files	Baseline	Current	Change	Status
`nested-payload-review`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_base64]`	`nested_base64`	98 B	1	586.5us	598.2us	+2.0%	stable
`padded-multi-stream-upload`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_padded_multi_stream_upload`	`multi_stream_padded`	4.1 KiB	1	653.4us	640.7us	-1.9%	stable
`nested-payload-review`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_hex]`	`nested_hex`	130 B	1	622.7us	611.3us	-1.8%	stable
`direct-malicious-upload`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_direct_malicious_upload`	`malicious_reduce`	52 B	1	541.0us	533.1us	-1.5%	stable
`nested-payload-review`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_raw]`	`nested_raw`	78 B	1	585.6us	578.2us	-1.3%	stable
`mixed-model-repository`	`tests/benchmarks/test_scan_benchmarks.py::test_scan_release_candidate_repository`	`release-candidate`	547.3 KiB	32	484.72ms	479.40ms	-1.1%	stable
`single-checkpoint-preflight`	`tests/benchmarks/test_scan_benchmarks.py::test_scan_single_checkpoint_before_load`	`single_checkpoint.pkl`	183.0 KiB	1	71.59ms	70.82ms	-1.1%	stable
`chunked-upload-stream`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_upload_stream`	`chunked_stream`	278.2 KiB	1	112.92ms	112.37ms	-0.5%	stable
`duplicate-heavy-registry`	`tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_registry_snapshot`	`registry-snapshot`	915.2 KiB	13	390.26ms	388.70ms	-0.4%	stable
`warm-cache-rescan`	`tests/benchmarks/test_scan_benchmarks.py::test_scan_warm_cached_repository_rescan`	`release-candidate`	547.3 KiB	32	107.09ms	106.67ms	-0.4%	stable
`clean-training-checkpoint`	`tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_clean_training_checkpoint`	`safe_large`	278.2 KiB	1	109.05ms	109.46ms	+0.4%	stable
`suspicious-pickle-intake`	`tests/benchmarks/test_scan_benchmarks.py::test_scan_suspicious_pickle_intake`	`suspicious-intake`	183.8 KiB	4	142.89ms	142.66ms	-0.2%	stable

chatgpt-codex-connector · 2026-06-11T01:45:34Z

Codex Review: Didn't find any major issues. 🎉

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ianw-oai

Approved as a focused legacy PyTorch storage persistent-ID calibration with fail-closed regressions and green checks.

fix: calibrate legacy PyTorch storage persistent IDs

23ce29e

mldangelo-oai requested a review from mldangelo June 11, 2026 02:00

mldangelo-oai enabled auto-merge (squash) June 11, 2026 02:00

ianw-oai approved these changes Jun 11, 2026

View reviewed changes

mldangelo-oai merged commit 4a28b7f into main Jun 11, 2026
29 checks passed

mldangelo-oai deleted the mdangelo/codex/hf-fp-t20-legacy-pytorch-persistent-ids-20260610 branch June 11, 2026 16:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: calibrate legacy PyTorch storage persistent IDs#1652

fix: calibrate legacy PyTorch storage persistent IDs#1652
mldangelo-oai merged 1 commit into
mainfrom
mdangelo/codex/hf-fp-t20-legacy-pytorch-persistent-ids-20260610

mldangelo-oai commented Jun 11, 2026

Uh oh!

mldangelo-oai commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 11, 2026

Uh oh!

ianw-oai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mldangelo-oai commented Jun 11, 2026

Summary

Root cause and security tradeoff

Real-model QA

Tests

Malicious and malformed controls

Uh oh!

mldangelo-oai commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026

Performance Benchmarks

Uh oh!

chatgpt-codex-connector Bot commented Jun 11, 2026

Uh oh!

ianw-oai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants