[WIP][Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015
Hale423 wants to merge 5 commits into NVIDIA:main
Conversation
cjluo-nv
left a comment
This PR introduces 16k+ lines of changes. Please consider sharing a design and getting a design review.
Thanks for the feedback. Sharing this design, please kindly take a look.

Design: ONNX Q/DQ Autotuning with Subgraph Mode

Design review document for PR #1015

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:
This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements. The design intentionally supports two workflows:
2. Goals
3. Non-goals
4. Scope Relative to
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the relevant setting.
📝 Walkthrough

Adds a subgraph-based Q/DQ placement autotuner: docs and examples, a CLI workflow switch to "subgraph", TensorRT benchmarking (trtexec + Python API), fusion-group analysis from TensorRT graph.json, subgraph extraction and QDQ insertion, heuristic scheme generation, multi-phase profiling/validation with caching/resume, and many unit tests.
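For orientation, a Q/DQ pair is numerically a fake-quantization of the tensor it wraps; a minimal numpy sketch of INT8 QuantizeLinear/DequantizeLinear semantics (illustrative, not code from this PR):

```python
import numpy as np

def fake_quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Numerics of a QuantizeLinear/DequantizeLinear pair for INT8:
    q  = clip(round(x / scale), -128, 127)   # QuantizeLinear
    x' = q * scale                           # DequantizeLinear
    """
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q.astype(np.float32) * np.float32(scale)

x = np.array([0.1, -0.5, 2.0], dtype=np.float32)
y = fake_quantize_int8(x, scale=0.01)  # 2.0 saturates at 127 * 0.01 = 1.27
```

Where the autotuner inserts such a pair changes which TensorRT layer fusions survive, which is why placement (not just presence) drives latency.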
```mermaid
sequenceDiagram
participant CLI as CLI / User
participant WF as Subgraph Workflow
participant GA as Graph Analysis
participant SG as Subgraph Extraction
participant BM as Benchmarking
participant FS as FileSystem
CLI->>WF: subgraph_autotuning_workflow(model_path, ...)
rect rgba(100,200,150,0.5)
note over WF: Phase 1: Setup
WF->>GA: parse graph.json + create_fusion_groups
GA->>GA: Map TRT layers → ONNX nodes
GA-->>WF: FusionGroup list
end
rect rgba(100,150,200,0.5)
note over WF: Phase 2: Per-Subgraph Profiling
loop For each FusionGroup
WF->>WF: generate_heuristic_schemes
loop For each scheme
WF->>SG: extract_subgraph_by_nodes
SG-->>WF: subgraph bytes
WF->>WF: insert_qdq_on_graph
WF->>BM: benchmark_onnx_model
BM-->>WF: latency_ms
end
WF->>WF: select best_scheme
end
WF->>FS: cache phase2 results
end
rect rgba(200,150,100,0.5)
note over WF: Phase 3: Full-Model Validation
WF->>BM: baseline FP16 benchmark full model
BM-->>WF: baseline_latency_ms
loop Incremental Validation (if enabled)
WF->>WF: apply candidate QDQ schemes
WF->>BM: benchmark full model per candidate
BM-->>WF: test latency_ms
WF->>WF: keep if latency improves
end
WF->>FS: save optimized_final.onnx and cache phase3 progress
end
WF-->>CLI: return optimized_model_path
```
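The two profiling phases in the diagram can be condensed into a toy skeleton. All names (`schemes_for`, `benchmark`, the per-group loop) are illustrative stand-ins, not the PR's actual API; the real workflow also extracts subgraphs, inserts Q/DQ nodes, and caches results. The 1.02 gate mirrors the documented default `performance_threshold` (at least 2% speedup):

```python
# Toy skeleton of Phase 2 (per-subgraph profiling) and Phase 3 (validation gate).
def autotune(groups, schemes_for, benchmark, baseline_latency, threshold=1.02):
    best = {}
    for group in groups:  # Phase 2: pick the fastest Q/DQ scheme per fusion group
        latency, scheme = min((benchmark(group, s), s) for s in schemes_for(group))
        best[group] = (latency, scheme)
    accepted = {}
    for group, (latency, scheme) in best.items():  # Phase 3: keep only real wins
        if baseline_latency / latency >= threshold:
            accepted[group] = scheme
    return accepted

# Stub benchmark: pretend "int8_inputs" is the fast placement.
result = autotune(
    groups=["conv_block"],
    schemes_for=lambda g: ["int8_all", "int8_inputs"],
    benchmark=lambda g, s: {"int8_all": 1.2, "int8_inputs": 0.9}[s],
    baseline_latency=1.0,
)
print(result)  # {'conv_block': 'int8_inputs'}
```

A scheme that only matches baseline within the threshold (e.g. 0.99 ms vs a 1.0 ms baseline) is rejected, which is what "keep if latency improves" in the diagram enforces.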
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
34-34: ⚠️ Potential issue | 🟡 Minor

`# nosec` comments require explicit justification and approval.

As per coding guidelines, use of `# nosec` comments to bypass Bandit security checks is not allowed. If this security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with explicit justification in the PR description.

The subprocess import itself is safe when used properly (with list arguments, not shell=True), but the `# nosec B404` comment should be documented or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` at line 34, Remove the inline "# nosec B404" on the subprocess import (import subprocess) unless you add an explicit justification and approval from `@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is required, keep the comment but add the justification text to the PR and ensure all uses of subprocess in this module (functions/methods that call subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a list, avoid shell=True) and reference that approval in the PR description.
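For reference, the safe subprocess pattern this comment alludes to — arguments as a list, no shell=True — can be demonstrated with a stand-in command (`echo` here, not the module's real trtexec invocation):

```python
import subprocess

# Arguments passed as a list with shell=False (the default): nothing is
# interpreted by a shell, so a model path containing spaces or metacharacters
# cannot inject extra commands.
result = subprocess.run(
    ["echo", "model.onnx; rm -rf /tmp/x"],
    capture_output=True,
    text=True,
    check=True,
)
# The argument is echoed literally; the ";" never reached a shell.
print(result.stdout.strip())
```

With `shell=True` and string interpolation, the same `;` would have been interpreted as a command separator.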
🧹 Nitpick comments (6)
examples/qdq_placement/set_batch_size.py (1)
62-65: Consider handling large models (>2GB) for model verification.

The direct call to `onnx.checker.check_model(output_path)` may fail for models larger than 2GB due to protobuf size limits. The codebase has a utility in `modelopt/onnx/utils.py` (see the `check_model` function) that handles this case by using external data storage.

Suggested approach

```diff
+import tempfile
+import os
+
 def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
     # ... existing code ...
     # Verify the saved model
     print("Verifying model...")
-    onnx.checker.check_model(output_path)
+    # Handle large models that exceed protobuf 2GB limit
+    saved_model = onnx.load(output_path)
+    if saved_model.ByteSize() > (2 * (1024**3)):
+        # For large models, check_model needs the file path with external data
+        onnx.checker.check_model(output_path)
+    else:
+        onnx.checker.check_model(saved_model)
     print("✓ Model saved and verified successfully!")
```

Alternatively, consider importing and using `modelopt.onnx.utils.check_model` for consistency with the rest of the codebase.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/set_batch_size.py` around lines 62 - 65, Replace the direct call to onnx.checker.check_model(output_path) with the project's robust checker that handles >2GB models by importing and calling modelopt.onnx.utils.check_model; locate the verification block around the print("Verifying model...") and change the call to use modelopt.onnx.utils.check_model(output_path) (or import the function as check_model and call check_model(output_path)) so external-data large protobuf models are supported.

modelopt/onnx/quantization/autotune/qdq_utils.py (2)
60-60: Simplify redundant condition.

The check `node.input and len(node.input) > 0` is redundant since `node.input` being truthy already implies it's non-empty.

Suggested simplification

```diff
-    if node.input and len(node.input) > 0:
+    if node.input:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 60, The condition checking node inputs is redundant: replace the `if node.input and len(node.input) > 0:` guard with a simpler truthy check `if node.input:` in the function where this appears (look for the code handling `node.input` in qdq_utils.py) so the branch behavior remains identical but the expression is simplified.
1-1: Update copyright year for consistency.

The copyright year is 2024, but other new files in this PR use 2026. Consider updating for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 1, Update the SPDX header year in modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR (change "2024" to "2026"); locate the SPDX comment at the top of the file and modify the year in the copyright line so it is consistent across files.

tests/unit/onnx/quantization/autotune/test_config.py (1)
12-13: Remove unnecessary sys.path manipulation.

The `sys.path.insert` is unnecessary if the package is properly installed in the test environment. This pattern can cause import issues and is not recommended.

Suggested fix

```diff
-# Add parent directory to path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
 from modelopt.onnx.quantization.autotune.common import Config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 12 - 13, Remove the ad-hoc sys.path manipulation in the test file: delete the sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in tests/unit/onnx/quantization/autotune/test_config.py and rely on the package being installed in the test environment (or use test runner configuration / PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in this file.

modelopt/onnx/quantization/autotune/subgraph_extractor.py (1)
170-181: Consider using `collections.deque` for BFS queue.

Using `list.pop(0)` is O(n) per operation. For potentially large graphs, using `collections.deque` with `popleft()` provides O(1) performance.

♻️ Suggested improvement

```diff
+from collections import deque
+
 def _find_reachable_graph_inputs(
     graph: gs.Graph,
     target_nodes: List[gs.Node],
 ) -> List[str]:
     """BFS backward from target_nodes to find graph inputs that feed into them."""
     graph_input_names = {t.name for t in graph.inputs}
     visited = set()
-    queue = []
+    queue = deque()
     result = []
     for node in target_nodes:
         for inp in node.inputs:
             if isinstance(inp, gs.Variable) and inp.name not in visited:
                 visited.add(inp.name)
                 queue.append(inp)
     while queue:
-        tensor = queue.pop(0)
+        tensor = queue.popleft()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 170 - 181, The BFS loop in subgraph_extractor.py uses a Python list and pop(0), causing O(n) dequeuing; replace the list-based queue with collections.deque: import deque, initialize queue = deque(...) where the queue is created, and change queue.pop(0) to queue.popleft(), keeping queue.append(...) for enqueueing; update any code that constructs the initial queue and ensure imports include collections.deque so the BFS in the function that contains this loop uses O(1) dequeue operations.

examples/qdq_placement/README.md (1)
152-162: Add language specifier to fenced code blocks.

The directory structure code blocks should have a language specifier for consistency. Consider using `text` or `plaintext` for directory listings.

📝 Suggested fix

````diff
-```
+```text
 resnet50_results/
 ├── optimized_final.onnx          # Optimized model

-```
+```text
 <output_dir>/
 ├── optimized_final.onnx          # Incrementally validated model (if --incremental-validation)
````

Also applies to: 166-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/README.md` around lines 152 - 162, Update the fenced code blocks that show directory listings so they include a language specifier (use ```text) — specifically change the blocks containing the "resnet50_results/" tree and the block containing the "<output_dir>/" tree to start with ```text instead of ``` so the directory listings render consistently; locate the blocks by the directory headers "resnet50_results/" and "<output_dir>/" in the README and replace their opening fences accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/reference/2_qdq_placement.rst`:
- Line 828: The docs claim "Default threshold: 1.01" but the actual default is
defined as performance_threshold: float = 1.02 on the Config class in common.py;
update the documentation string in 2_qdq_placement.rst to state "Default
threshold: 1.02 (2% improvement minimum)" so the docs match the implementation
(or alternatively change Config.performance_threshold to 1.01 if you intend the
docs to be canonical).
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 1-10: Add the required NVIDIA Apache 2.0 license header to the top
of the examples/qdq_placement/set_batch_size.py script (above or immediately
after the existing shebang and before the module docstring) so the file includes
the full license boilerplate mandated for examples/**/*.py; ensure the header
explicitly names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the
rest of the script (including the module docstring and usage comments)
unchanged.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of the module modelopt/onnx/quantization/autotune/fusion_grouping.py by
inserting the standard multi-line NVIDIA Apache 2.0 header (including copyright
line, SPDX identifier and license text reference) as used across other files in
modelopt; ensure the header appears before the existing module docstring so
classes/functions in this file (e.g., the fusion grouping logic in
fusion_grouping.py) are properly licensed.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 1-6: Add the NVIDIA Apache 2.0 license header to the top of the
module file modelopt/onnx/quantization/autotune/subgraph_extractor.py by
inserting the standard multi-line license comment before the existing module
docstring; ensure the header includes the SPDX identifier and full NVIDIA Apache
2.0 text (or the project-standard short header) so the file begins with the
required license block followed by the current docstring and existing code in
subgraph_extractor.py.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 1-8: This file is missing the NVIDIA Apache 2.0 license header;
prepend the standard NVIDIA Apache-2.0 license block to the very top of
subgraph_workflow.py (above the existing module docstring) including the
copyright line and license text/notice, ensuring the SPDX identifier
(Apache-2.0) or full license header is present and that the existing file
docstring remains unchanged below the header.
- Around line 99-105: The loops that mutate the protobuf repeated field use
nonstandard pop()/add() calls on dim_proto; replace popping with explicit
deletion (e.g., use del dim_proto[-1] in the shrink loop) and call
dim_proto.add() when growing but capture the returned element if you need to
initialize it (e.g., new_dim = dim_proto.add()); keep the subsequent loop that
sets dim_proto[i].ClearField("dim_param") and dim_proto[i].dim_value = d, but
ensure dim_proto has been resized using del and dim_proto.add() rather than
pop()/add() without using the returned message.
- Around line 39-42: QUANT_DTYPES currently sets "fp8" to np.int8 when the FP8
dtype isn't present, causing silent fallback; change this by removing the silent
fallback from the module-level QUANT_DTYPES and instead detect/import ml_dtypes
(or np.float8 variant) conditionally at runtime where quantization is requested
(e.g., inside the function(s) that handle quantization/autotune and any call
sites referencing QUANT_DTYPES), and when a user requests "fp8" but the FP8
dtype isn't available emit a clear user-facing warning or raise a descriptive
error; specifically, update references to QUANT_DTYPES to validate availability
at usage time, attempt a conditional import of ml_dtypes (or check hasattr(np,
"float8_e4m3fn")) there, and log/warn if FP8 cannot be supported rather than
silently substituting int8.
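A sketch of the runtime resolution this item asks for (hypothetical helper; the PR's actual fallback logic lives in subgraph_workflow.py and may differ):

```python
import numpy as np

def resolve_fp8_dtype():
    """Resolve an FP8 dtype at the point of use instead of import time.

    Raises a descriptive error instead of silently falling back to int8.
    """
    try:
        import ml_dtypes  # optional dependency providing float8 numpy types
        return np.dtype(ml_dtypes.float8_e4m3fn)
    except ImportError:
        raise RuntimeError(
            "quant_type='fp8' requested but no FP8 dtype is available; "
            "install ml_dtypes or use quant_type='int8'"
        )
```

Because the lookup happens inside the FP8 call path, INT8-only sessions never trigger the import or the warning.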
In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 1-41: The license header at the top of the module (the
module-level docstring and SPDX header in tensorrt_utils.py) uses an unusual
year range "1993-2025"; update the copyright/SPDX header to the project's
standard single-year format (e.g., "Copyright (c) 2026 NVIDIA CORPORATION &
AFFILIATES") and adjust the SPDX-License-Identifier line if needed so the header
matches other files; edit the top-of-file docstring/SPDX block where the current
year range appears to replace it with the standard format.
- Around line 987-1000: Fix _save_timing_cache: the condition and method call
are wrong and the finally cleans up an undefined name. Change the inverted check
so you only call combine when self._timing_cache is NOT None, correct the typo
`combline` to `combine` for the timing cache object returned by
config.create_timing_cache, and in the finally block remove the undefined
`builder` deletion (either delete only `config` or explicitly reference
`self.builder` if you intended to delete it); ensure you still serialize and
write timing_cache_data to self.timing_cache_file when a timing cache exists.
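The general shape of this fix — bind the name before `try` so `finally` can test it safely — applies to any resource cleanup; a minimal sketch with hypothetical names (not the PR's actual `_save_timing_cache`):

```python
def save_with_cleanup(make_resource, save):
    resource = None  # defined even if creation fails below
    try:
        resource = make_resource()
        save(resource)
        return True
    except Exception:
        return False
    finally:
        if resource is not None:  # clean up only what was actually created
            del resource

def failing_factory():
    raise RuntimeError("create_builder_config failed")

# Creation failure no longer raises UnboundLocalError in the finally block:
print(save_with_cleanup(failing_factory, lambda r: None))   # False
print(save_with_cleanup(lambda: object(), lambda r: None))  # True
```

Without the `resource = None` pre-binding, the `finally` block would mask the original exception with an `UnboundLocalError`.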
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 311-318: The _build_id_to_region_map function currently uses a
mutable default dict for id_to_region_map which can be shared across calls;
change the signature to default id_to_region_map to None and inside
_build_id_to_region_map initialize id_to_region_map = {} when it's None, then
proceed to assign id_to_region_map[region.id] = region and recursively call
self._build_id_to_region_map(child, id_to_region_map) so each top-level call
gets a fresh map; keep the return type dict[int, Region] and behavior otherwise
unchanged.
- Around line 16-24: The module docstring in torch_region_builder.py duplicates
the license/copyright header (the "SPDX-FileCopyrightText" lines and
years)—remove the duplicate license block from the top of the module docstring
so only the canonical file header remains; keep the descriptive docstring text
("Torch Region Builder - Hierarchical Region Discovery...") and ensure
functions/classes in this module (e.g., any top-level descriptions used by
torch_region_builder.py) are unaffected by the removal.
- Around line 752-754: The parameter only_quantizable is being ignored because
the code unconditionally sets only_quantizable = True; fix by either removing
that assignment so the function respects the incoming only_quantizable
parameter, or remove the only_quantizable parameter from the function signature
if it's not needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
- Around line 320-331: The _build_tensor_to_regions_map function uses a mutable
default argument (dict = {}), which can be shared across calls; change the
signature to accept tensor_to_regions_map: Optional[dict[str, set[int]]] = None
and inside the function initialize tensor_to_regions_map = {} if None, mirroring
the fix used in _build_id_to_region_map; keep the recursive behavior and return
the map as before so each top-level call gets a fresh dict.
- Line 333: The _merge_neighboring_regions function currently uses a mutable
default argument for to_remove (set()); change the signature to accept None
(e.g., to_remove: Optional[set[int]] = None or to_remove: set[int] | None =
None) and inside the body initialize it to an empty set when None (e.g., if
to_remove is None: to_remove = set()), updating any type imports as needed and
leaving the rest of the logic in _merge_neighboring_regions unchanged.
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 71-82: The test test_performance_threshold_validation is
incomplete: it only asserts valid values for Config.performance_threshold but
doesn't check that invalid values are rejected; either add validation in Config
or update the test to assert the expected failure. If Config should validate,
implement a check in Config.__init__ or the performance_threshold setter to
raise ValueError for values < 1.0 and then update
test_performance_threshold_validation to include a with
self.assertRaises(ValueError): Config(performance_threshold=0.9) case; otherwise
remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of tests/unit/onnx/quantization/autotune/test_config.py: insert the standard
NVIDIA Apache-2.0 license block (including copyright notice/SPDX identifier and
license text) immediately after the existing shebang (#!/usr/bin/env python3)
and before the module docstring so the file complies with the repository's
license/header convention.
---
Outside diff comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Line 34: Remove the inline "# nosec B404" on the subprocess import (import
subprocess) unless you add an explicit justification and approval from
`@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is
required, keep the comment but add the justification text to the PR and ensure
all uses of subprocess in this module (functions/methods that call
subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a
list, avoid shell=True) and reference that approval in the PR description.
---
Nitpick comments:
In `@examples/qdq_placement/README.md`:
- Around line 152-162: Update the fenced code blocks that show directory
listings so they include a language specifier (use ```text) — specifically
change the blocks containing the "resnet50_results/" tree and the block
containing the "<output_dir>/" tree to start with ```text instead of ``` so the
directory listings render consistently; locate the blocks by the directory
headers "resnet50_results/" and "<output_dir>/" in the README and replace their
opening fences accordingly.
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 62-65: Replace the direct call to
onnx.checker.check_model(output_path) with the project's robust checker that
handles >2GB models by importing and calling modelopt.onnx.utils.check_model;
locate the verification block around the print("Verifying model...") and change
the call to use modelopt.onnx.utils.check_model(output_path) (or import the
function as check_model and call check_model(output_path)) so external-data
large protobuf models are supported.
In `@modelopt/onnx/quantization/autotune/qdq_utils.py`:
- Line 60: The condition checking node inputs is redundant: replace the `if
node.input and len(node.input) > 0:` guard with a simpler truthy check `if
node.input:` in the function where this appears (look for the code handling
`node.input` in qdq_utils.py) so the branch behavior remains identical but the
expression is simplified.
- Line 1: Update the SPDX header year in
modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR
(change "2024" to "2026"); locate the SPDX comment at the top of the file and
modify the year in the copyright line so it is consistent across files.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 170-181: The BFS loop in subgraph_extractor.py uses a Python list
and pop(0), causing O(n) dequeuing; replace the list-based queue with
collections.deque: import deque, initialize queue = deque(...) where the queue
is created, and change queue.pop(0) to queue.popleft(), keeping
queue.append(...) for enqueueing; update any code that constructs the initial
queue and ensure imports include collections.deque so the BFS in the function
that contains this loop uses O(1) dequeue operations.
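The complexity point in this item can be seen in a toy backward BFS (names and the adjacency encoding are illustrative, not the extractor's real API):

```python
from collections import deque

def reachable_inputs(producer_inputs, seeds, graph_inputs):
    """Walk backward from seed tensors to the graph inputs they depend on.
    producer_inputs maps a tensor name to the input tensors of the node
    that produces it."""
    visited = set(seeds)
    queue = deque(seeds)      # deque: popleft() is O(1); list.pop(0) is O(n)
    found = []
    while queue:
        tensor = queue.popleft()
        if tensor in graph_inputs:
            found.append(tensor)
        for upstream in producer_inputs.get(tensor, ()):
            if upstream not in visited:
                visited.add(upstream)
                queue.append(upstream)
    return found

producer_inputs = {"out": ["mid"], "mid": ["x", "w"]}
print(reachable_inputs(producer_inputs, ["out"], {"x"}))  # ['x']
```

On a graph with n tensors, `list.pop(0)` makes the traversal O(n²) in the worst case; `popleft()` keeps it linear.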
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 12-13: Remove the ad-hoc sys.path manipulation in the test file:
delete the sys.path.insert(0,
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in
tests/unit/onnx/quantization/autotune/test_config.py and rely on the package
being installed in the test environment (or use test runner configuration /
PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in
this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 48cf674c-cea7-4964-b1ed-c9c27e9d6e6e
📥 Commits
Reviewing files that changed from the base of the PR and between bc87981 and 48eed5d778e4899ba16d8db76a23b22309416fc7.
📒 Files selected for processing (15)
docs/source/guides/9_qdq_placement.rst
docs/source/reference/2_qdq_placement.rst
examples/qdq_placement/README.md
examples/qdq_placement/set_batch_size.py
modelopt/onnx/quantization/autotune/__init__.py
modelopt/onnx/quantization/autotune/__main__.py
modelopt/onnx/quantization/autotune/benchmark.py
modelopt/onnx/quantization/autotune/fusion_grouping.py
modelopt/onnx/quantization/autotune/qdq_utils.py
modelopt/onnx/quantization/autotune/subgraph_extractor.py
modelopt/onnx/quantization/autotune/subgraph_workflow.py
modelopt/onnx/quantization/autotune/tensorrt_utils.py
modelopt/onnx/quantization/autotune/torch_region_builder.py
modelopt/onnx/quantization/autotune/workflows.py
tests/unit/onnx/quantization/autotune/test_config.py
```python
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should not be accepted
    # Note: This test assumes validation exists, if not we should add it
    print("✓ Config performance_threshold validation")
```
Incomplete validation test - invalid values not actually tested.
The test `test_performance_threshold_validation` only tests valid values, but the comment on lines 80-81 indicates invalid values should be rejected. Either implement the validation in `Config` and test it, or remove the misleading comment.
Option 1: If validation exists, add the test

```python
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should raise ValueError
    with self.assertRaises(ValueError):
        Config(performance_threshold=0.9)

    print("✓ Config performance_threshold validation")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 71 - 82,
The test test_performance_threshold_validation is incomplete: it only asserts
valid values for Config.performance_threshold but doesn't check that invalid
values are rejected; either add validation in Config or update the test to
assert the expected failure. If Config should validate, implement a check in
Config.__init__ or the performance_threshold setter to raise ValueError for
values < 1.0 and then update test_performance_threshold_validation to include a
with self.assertRaises(ValueError): Config(performance_threshold=0.9) case;
otherwise remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
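If validation is added, a minimal sketch of what the check might look like (hypothetical; the real `Config` lives in `modelopt.onnx.quantization.autotune.common` and may be structured differently):

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Documented default: require at least a 2% speedup to accept a scheme.
    performance_threshold: float = 1.02

    def __post_init__(self):
        if self.performance_threshold < 1.0:
            raise ValueError(
                f"performance_threshold must be >= 1.0, "
                f"got {self.performance_threshold}"
            )

Config(performance_threshold=1.5)  # accepted
# Config(performance_threshold=0.9) would raise ValueError
```

With this in place, the `assertRaises(ValueError)` case the review suggests becomes meaningful rather than a comment.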
@Hale423 please find my comments below:
Thanks
Force-pushed from 48eed5d to cf36938 (Compare)
Actionable comments posted: 14
♻️ Duplicate comments (3)
tests/unit/onnx/quantization/autotune/test_config.py (1)
75-87: ⚠️ Potential issue | 🟡 Minor

“Validation” tests currently don’t validate rejection paths.
These tests assert valid values only, but they don’t check that invalid inputs are rejected. Either add failure assertions or rename tests to avoid implying validation behavior that isn’t tested.
Suggested assertion additions
```diff
 def test_performance_threshold_validation(self):
     """Test that performance_threshold must be >= 1.0."""
     config1 = Config(performance_threshold=1.0)
     self.assertEqual(config1.performance_threshold, 1.0)

     config2 = Config(performance_threshold=1.5)
     self.assertEqual(config2.performance_threshold, 1.5)
+
+    with self.assertRaises(ValueError):
+        Config(performance_threshold=0.9)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 75 - 87, The tests test_performance_threshold_validation and test_region_size_validation only assert that valid values are accepted; add negative-path assertions to ensure invalid inputs are rejected by Config: in test_performance_threshold_validation wrap creating Config(performance_threshold=0.9) (and maybe 0.0 or -1.0) in self.assertRaises(ValueError) to assert a rejection, and in test_region_size_validation use self.assertRaises(ValueError) for Config(maximum_sequence_region_size=0) and Config(minimum_topdown_search_size=0) (and optionally negative values) so the class Config enforces >0 for region sizes; keep the original valid-value assertions and use the Config symbol and the test method names to locate where to add the assertRaises blocks.

modelopt/onnx/quantization/autotune/tensorrt_utils.py (1)
987-999: ⚠️ Potential issue | 🟠 Major

Guard cleanup in `_save_timing_cache()`.

`config` is only assigned inside the `if self._timing_cache is not None` block, but the `finally` block always deletes it. If `create_builder_config()` fails before assignment, or this method is ever called with no cache object, the cleanup path raises `UnboundLocalError` and masks the real failure.

🐛 Proposed fix

```diff
 def _save_timing_cache(self):
+    config = None
     try:
         if self._timing_cache is not None:
             config = self.builder.create_builder_config()
             output_cache = config.create_timing_cache(b"")
             output_cache.combine(self._timing_cache, ignore_errors=True)
             timing_cache_data = output_cache.serialize()
             with open(self.timing_cache_file, "wb") as f:
                 f.write(timing_cache_data)
             self.logger.debug(f"Saved timing cache to: {self.timing_cache_file}")
     except Exception as e:
         self.logger.warning(f"Failed to save timing cache: {e}")
     finally:
-        del config
+        if config is not None:
+            del config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py` around lines 987 - 999, The finally block in _save_timing_cache references config even when it may never be assigned (e.g., when self._timing_cache is None or create_builder_config() raises), causing UnboundLocalError; fix by initializing config = None before the try and only delete or cleanup config if it is not None (or use a conditional cleanup) after the try/except so that config is only referenced when created; update references to config creation in _save_timing_cache and ensure create_builder_config() and output_cache are guarded similarly to avoid the same issue.

modelopt/onnx/quantization/autotune/subgraph_workflow.py (1)
54-71: ⚠️ Potential issue | 🔴 Critical

`quant_type="fp8"` still silently degrades to INT8.

`QUANT_DTYPES["fp8"]` is resolved at import time and `_get_fp8_dtype()` returns `np.int8` on `ImportError`, so every INT8-only session logs an FP8 warning at import and every FP8 request quantizes as INT8. Resolve and validate the FP8 dtype inside the FP8 call path instead of freezing the fallback in `QUANT_DTYPES`.

Also applies to: 381-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 54 - 71, QUANT_DTYPES currently calls _get_fp8_dtype() at import time causing FP8 to silently fallback to int8 and emit warnings for all runs; change QUANT_DTYPES to store a lazy indicator (e.g., map "fp8" to a sentinel or the function _get_fp8_dtype itself) instead of the resolved dtype, and update the FP8 quantization call path (where "fp8" is looked up) to call _get_fp8_dtype() at runtime, validate the returned dtype is FP8-compatible, and only then proceed or emit the fallback warning and convert to int8; update usages that reference QUANT_DTYPES["fp8"] to resolve via the new lazy mechanism.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/qdq_placement/README.md`:
- Around line 5-8: The README contains hardcoded branch names
"dev-wahao-autotune-subgraph-profile" and
"dev-willg-add-auto-qdq-placement-tool"; remove these branch-specific strings
and replace them with generic, version-agnostic wording (e.g., "current branch"
or describe the feature differences without naming branches) so the permanent
example docs don't reference ephemeral branch names; update the heading and the
sentence that lists differences to use neutral language describing the added
autotuner/subgraph workflow instead of the explicit branch identifiers.
- Around line 112-126: The README currently always passes --graph-json
"$GRAPH_JSON" which contradicts "optional --graph-json"; update the example to
conditionally include the flag when GRAPH_JSON is non-empty (leave GRAPH_JSON
unset or empty to trigger auto-generation). Specifically, change the command
invocation shown for python3 -m modelopt.onnx.quantization.autotune so it only
appends the --graph-json argument if GRAPH_JSON is set (or show the alternate
bracketed pattern like [--graph-json "$GRAPH_JSON"]) and document that leaving
GRAPH_JSON empty omits the flag; refer to the MODEL, GRAPH_JSON, OUTPUT_DIR
variables and the --graph-json flag to locate the snippet to edit.
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 307-326: Only set self._profile_unsupported when the failure is
actually caused by profiling flags: after running _exec_and_log and seeing
"Unknown option" in result.stderr, check that export_profile_path is truthy
(meaning profiling was requested) AND that the stderr or the original cmd
contains one of the profiling-specific options ("--separateProfileRun",
"--profilingVerbosity", "--exportProfile"); only then log the warning, set
self._profile_unsupported = True, strip those profiling flags from cmd and retry
via _exec_and_log. This ensures unrelated "Unknown option" errors (e.g., from
extra_run_args) don't permanently disable profiling.
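A sketch of that gating, using the flag names listed in the prompt (the surrounding trtexec invocation and retry machinery are assumed, not shown):

```python
PROFILING_FLAGS = ("--separateProfileRun", "--profilingVerbosity", "--exportProfile")

def should_retry_without_profiling(stderr: str, cmd: list[str], export_profile_path) -> bool:
    """Retry only when the 'Unknown option' complaint plausibly targets profiling."""
    if "Unknown option" not in stderr:
        return False
    if not export_profile_path:
        return False  # profiling was never requested, so it cannot be the culprit
    haystack = stderr + " " + " ".join(cmd)
    return any(flag in haystack for flag in PROFILING_FLAGS)

def strip_profiling_flags(cmd: list[str]) -> list[str]:
    """Drop profiling flags, including their --flag=value forms."""
    return [c for c in cmd if c.split("=")[0] not in PROFILING_FLAGS]
```

With this check in place, an unrelated `Unknown option` error coming from `extra_run_args` no longer disables profiling permanently.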
- Around line 296-305: The strip_shape_args branch only filters
--optShapes/--minShapes/--maxShapes but misses other shape flags like --shapes
or calibration shape flags; update the filter inside the strip_shape_args block
(where cmd is rebuilt) to drop any argument that refers to shapes (e.g., any c
where "shapes" is in c.lower() or
c.startswith("--shape")/c.startswith("--shapes")/c.startswith("--calibration")
), so the subgraph run command no longer carries original-model shape bindings
before extra_run_args (refer to strip_shape_args, cmd, extra_run_args, and
model_path to locate the change).
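The broader filter could look like this; it is a sketch of the predicate the prompt describes, and makes no claim about the exact flag set trtexec accepts:

```python
def strip_shape_related_args(cmd: list[str]) -> list[str]:
    """Drop any argument that binds shapes, not just --optShapes/--minShapes/--maxShapes."""
    def is_shape_arg(arg: str) -> bool:
        key = arg.split("=")[0].lower()
        return "shapes" in key or key.startswith("--shape") or key.startswith("--calibration")
    return [c for c in cmd if not is_shape_arg(c)]
```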
- Around line 236-239: Remove the inline Bandit suppression by deleting the "#
nosec B603" comment on the subprocess.run call inside the _exec_and_log method;
leave the call as subprocess.run(cmd, capture_output=True, text=True) (ensure it
remains a list-based invocation, not shell=True), and if Bandit flags this as a
false positive prepare a documented justification and request formal review from
`@NVIDIA/modelopt-setup-codeowners` per the security-review process instead of
using an in-code bypass.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 267-273: Check that the generated graph JSON file actually exists
before returning: after verifying result.returncode in the trtexec invocation
(the block referencing result and logger) add a Path(graph_json_path).exists()
check and if missing log an error via logger.error (include graph_json_path and
result.returncode/context) and raise a RuntimeError (similar to the existing
trtexec FP16 build failure) so callers won’t get a false success from the
function that returns graph_json_path.
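A sketch of that post-build guard (the logger name and message wording are placeholders, not the module's actual strings):

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def ensure_graph_json(graph_json_path: str, returncode: int) -> str:
    """Raise instead of returning a path to a file trtexec never wrote."""
    if not Path(graph_json_path).exists():
        logger.error(
            "trtexec exited with code %d but %s was not created",
            returncode, graph_json_path,
        )
        raise RuntimeError(f"graph JSON not generated: {graph_json_path}")
    return graph_json_path
```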
In `@modelopt/onnx/quantization/autotune/qdq_utils.py`:
- Around line 49-51: The doc example uses an undefined variable new_model when
constructing QDQAutotuner; update the example to pass the initialized model (the
variable named model) or explicitly define new_model before use so
QDQAutotuner(model) is called correctly; locate the example lines with
QDQAutotuner and autotuner.initialize() and replace new_model with model (or add
a definition for new_model) to ensure the example is valid.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 660-667: The resume logic currently trusts any loaded cache;
update the code around _load_cache(cache_path) so after loading you validate
cached metadata (at minimum cache.get("version") == CACHE_VERSION and
cache.get("model_path") == model_path) and also compare tuning inputs used by
this run (e.g. quant_type, graph_json_path, extra_trtexec_args or any other
run-specific keys you expect) against values stored in the cache; if any
mismatch is detected, invalidate the cache by replacing it with a fresh dict
containing the current metadata (include CACHE_VERSION, model_path and the
current tuning inputs) so that cached_p2 / cached_p3 are empty and stale
phase2/phase3 decisions are not reused (refer to _load_cache, cache_path,
CACHE_VERSION, model_path, quant_type, graph_json_path, extra_trtexec_args, and
the "phase2"/"phase3" keys to locate where to add this validation).
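One way to shape that validation — `CACHE_VERSION` and the key names here are assumptions mirroring the prompt, not the module's actual values:

```python
CACHE_VERSION = 1  # assumed; the real constant lives in subgraph_workflow.py

def validate_or_reset_cache(cache, model_path: str, run_inputs: dict) -> dict:
    """Return the loaded cache only if its metadata matches this run."""
    fresh = {
        "version": CACHE_VERSION,
        "model_path": model_path,
        **run_inputs,
        "phase2": {},
        "phase3": {},
    }
    if not isinstance(cache, dict):
        return fresh
    if cache.get("version") != CACHE_VERSION or cache.get("model_path") != model_path:
        return fresh
    if any(cache.get(k) != v for k, v in run_inputs.items()):
        return fresh  # tuning inputs changed; stale phase2/phase3 must not be reused
    return cache
```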
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 773-783: The code currently mutates children but always returns
the original regions list, so include_all_regions=False has no effect; update
the loop that iterates over regions to build a new filtered list (e.g.,
filtered_regions = []) and for each region: perform the existing child-pruning
using region.get_children() and region.remove_child(child), then if
include_all_regions is True or has_quantizable_operations(region, graph) is
True, append the region to filtered_regions; finally return filtered_regions
instead of regions; apply the same change to the analogous block that handles
regions later in the file (the block that currently mirrors lines ~791-817) so
both code paths honor include_all_regions.
- Around line 624-629: The torch_node_ratio method can divide by zero when there
are no non-constant nodes; update torch_node_ratio to guard against empty
non_constant_nodes by checking if non_constant_nodes is empty (len == 0) and
returning 0.0 (or another safe default) instead of performing the division; keep
the existing logic that builds non_constant_nodes from self.graph.nodes
(filtering on n.op != "Constant") and the slash_count calculation (checking
n.name and n.name.startswith("/")) and only compute slash_count /
len(non_constant_nodes) when the length is > 0.
- Around line 549-551: The loop in _probe_epilogues_recursive unconditionally
appends consumer_idx to epilogue_ops, which allows non-fusible or divergent
consumers into regions; change it to first verify the consumer meets the
method's fusibility/divergence predicates (the same checks used elsewhere in
this module—e.g., the file's existing fusibility/divergence helper(s) such as
the region/component fusibility check and divergence test) and only append and
recurse when those checks pass (i.e., if consumer is fusible with the current
region and not divergent), otherwise skip that consumer.
In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 69-75: The call to _benchmark_instance.run is forwarding
trtexec-specific kwargs (strip_shape_args, extra_run_args, export_profile_path)
to all backends causing a TypeError when _benchmark_instance is a
TensorRTPyBenchmark; update the call to only include those kwargs when the
backend is TrtExecBenchmark (or otherwise supports them) by either checking
isinstance(_benchmark_instance, TrtExecBenchmark) or testing for
attribute/method support before adding
strip_shape_args/extra_run_args/export_profile_path to the kwargs dict, then
call _benchmark_instance.run with the filtered kwargs so non-trt backends
receive only supported parameters.
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 19-112: Replace the unittest-based TestConfig class/tests with
pytest-style tests: convert each test_* method (test_default_values,
test_custom_values, test_performance_threshold_validation,
test_region_size_validation, test_genetic_algorithm_params,
test_pattern_cache_params) into standalone functions that instantiate Config and
use plain assert statements (e.g., assert config.verbose is False) instead of
self.assertEqual/self.assertGreater, remove the unittest.TestCase inheritance
and the if __name__ == "__main__": unittest.main() block, and keep references to
the Config symbol so the same behavior is tested under pytest.
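The conversion looks roughly like this; `Config` here is a minimal dataclass stand-in (with only a few of the fields tested above) so the sketch is self-contained — the real class lives in `modelopt.onnx.quantization.autotune.common`:

```python
from dataclasses import dataclass

@dataclass
class Config:  # minimal stand-in for modelopt's Config, for illustration only
    verbose: bool = False
    performance_threshold: float = 1.02
    default_quant_type: str = "int8"

def test_default_values():
    """pytest style: plain functions and bare asserts, no TestCase scaffolding."""
    config = Config()
    assert config.verbose is False
    assert config.performance_threshold == 1.02
    assert config.default_quant_type == "int8"

def test_custom_values():
    config = Config(verbose=True, performance_threshold=1.05)
    assert config.verbose is True
    assert config.performance_threshold == 1.05
```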
---
Duplicate comments:
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 54-71: QUANT_DTYPES currently calls _get_fp8_dtype() at import
time causing FP8 to silently fallback to int8 and emit warnings for all runs;
change QUANT_DTYPES to store a lazy indicator (e.g., map "fp8" to a sentinel or
the function _get_fp8_dtype itself) instead of the resolved dtype, and update
the FP8 quantization call path (where "fp8" is looked up) to call
_get_fp8_dtype() at runtime, validate the returned dtype is FP8-compatible, and
only then proceed or emit the fallback warning and convert to int8; update
usages that reference QUANT_DTYPES["fp8"] to resolve via the new lazy mechanism.
In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 987-999: The finally block in _save_timing_cache references config
even when it may never be assigned (e.g., when self._timing_cache is None or
create_builder_config() raises), causing UnboundLocalError; fix by initializing
config = None before the try and only delete or cleanup config if it is not None
(or use a conditional cleanup) after the try/except so that config is only
referenced when created; update references to config creation in
_save_timing_cache and ensure create_builder_config() and output_cache are
guarded similarly to avoid the same issue.
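The guarded-cleanup shape this prompt asks for can be sketched as below; the builder/cache calls mimic TensorRT's Python API (`create_builder_config`, `set_timing_cache`, `serialize`), which is an assumption here rather than the module's exact code:

```python
def save_timing_cache_safely(builder, timing_cache, output_path: str) -> bool:
    """Only reference config in cleanup if it was actually created."""
    config = None
    try:
        if timing_cache is None:
            return False  # early exit: config is never created
        config = builder.create_builder_config()
        config.set_timing_cache(timing_cache, ignore_mismatch=False)
        with open(output_path, "wb") as f:
            f.write(timing_cache.serialize())
        return True
    finally:
        if config is not None:  # guard: avoids UnboundLocalError on early exits
            del config
```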
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 75-87: The tests test_performance_threshold_validation and
test_region_size_validation only assert that valid values are accepted; add
negative-path assertions to ensure invalid inputs are rejected by Config: in
test_performance_threshold_validation wrap creating
Config(performance_threshold=0.9) (and maybe 0.0 or -1.0) in
self.assertRaises(ValueError) to assert a rejection, and in
test_region_size_validation use self.assertRaises(ValueError) for
Config(maximum_sequence_region_size=0) and Config(minimum_topdown_search_size=0)
(and optionally negative values) so the class Config enforces >0 for region
sizes; keep the original valid-value assertions and use the Config symbol and
the test method names to locate where to add the assertRaises blocks.
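The negative-path checks could take roughly this shape. The validator below is a stand-in for `Config.__init__` (which, per this comment, should enforce these bounds), and the tiny raise-checking helper keeps the sketch free of any test-framework dependency:

```python
def make_config(performance_threshold=1.02, maximum_sequence_region_size=10):
    """Stand-in for Config construction with the validation this comment requests."""
    if performance_threshold < 1.0:
        raise ValueError("performance_threshold must be >= 1.0")
    if maximum_sequence_region_size <= 0:
        raise ValueError("maximum_sequence_region_size must be > 0")
    return {"performance_threshold": performance_threshold,
            "maximum_sequence_region_size": maximum_sequence_region_size}

def assert_raises_value_error(fn, **kwargs):
    """Minimal equivalent of assertRaises(ValueError) / pytest.raises(ValueError)."""
    try:
        fn(**kwargs)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Keeping the original valid-value assertions alongside these rejections gives the validation tests actual teeth.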
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 46f738dd-cec0-48e7-b297-30a197b0f4fd
📥 Commits
Reviewing files that changed from the base of the PR and between 48eed5d778e4899ba16d8db76a23b22309416fc7 and cf3693855ff344d3ef389672629d38947f788c00.
📒 Files selected for processing (15)
- docs/source/guides/9_qdq_placement.rst
- docs/source/reference/2_qdq_placement.rst
- examples/qdq_placement/README.md
- examples/qdq_placement/set_batch_size.py
- modelopt/onnx/quantization/autotune/__init__.py
- modelopt/onnx/quantization/autotune/__main__.py
- modelopt/onnx/quantization/autotune/benchmark.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py
- modelopt/onnx/quantization/autotune/qdq_utils.py
- modelopt/onnx/quantization/autotune/subgraph_extractor.py
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
- modelopt/onnx/quantization/autotune/tensorrt_utils.py
- modelopt/onnx/quantization/autotune/torch_region_builder.py
- modelopt/onnx/quantization/autotune/workflows.py
- tests/unit/onnx/quantization/autotune/test_config.py
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/source/guides/9_qdq_placement.rst
```markdown
## Branch: Subgraph Mode (This Branch)

On branch **dev-wahao-autotune-subgraph-profile**, the autotuner adds a second workflow and related features not present on **dev-willg-add-auto-qdq-placement-tool**:
```
Remove branch-specific naming from permanent example docs.
Hardcoding dev-wahao-autotune-subgraph-profile / dev-willg-add-auto-qdq-placement-tool will age quickly and confuse users reading this from main.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qdq_placement/README.md` around lines 5 - 8, The README contains
hardcoded branch names "dev-wahao-autotune-subgraph-profile" and
"dev-willg-add-auto-qdq-placement-tool"; remove these branch-specific strings
and replace them with generic, version-agnostic wording (e.g., "current branch"
or describe the feature differences without naming branches) so the permanent
example docs don't reference ephemeral branch names; update the heading and the
sentence that lists differences to use neutral language describing the added
autotuner/subgraph workflow instead of the explicit branch identifiers.
```bash
# Set output dir (e.g. <model_prefix>_autotune_subgraph) and optional graph.json
MODEL="path/to/your_model.fp16.opt.onnx"
GRAPH_JSON="path/to/your_model.fp16.opt.graph.json"  # optional; omit to auto-generate
OUTPUT_DIR="${MODEL%.onnx}_autotune_subgraph"
mkdir -p "$OUTPUT_DIR"

# Optional: optShapes for dynamic inputs (comma-separated name:shape)
OPT_SHAPES='--optShapes=agents_feature:21x33x10x16,agents_mask:21x33x10'

python3 -m modelopt.onnx.quantization.autotune \
    --model "$MODEL" \
    --output "$OUTPUT_DIR" \
    --mode subgraph \
    --graph-json "$GRAPH_JSON" \
    --quant-type fp8 \
```
The sample contradicts “optional --graph-json” behavior.
The text says graph.json is optional, but the command always passes `--graph-json "$GRAPH_JSON"`. This makes copy/paste usage brittle when users want auto-generation.
Suggested command pattern

```diff
-MODEL="path/to/your_model.fp16.opt.onnx"
-GRAPH_JSON="path/to/your_model.fp16.opt.graph.json"  # optional; omit to auto-generate
+MODEL="path/to/your_model.fp16.opt.onnx"
+GRAPH_JSON=""  # optional; set to existing graph.json to reuse
 OUTPUT_DIR="${MODEL%.onnx}_autotune_subgraph"
 mkdir -p "$OUTPUT_DIR"
+
+GRAPH_JSON_ARGS=()
+if [ -n "$GRAPH_JSON" ]; then
+  GRAPH_JSON_ARGS=(--graph-json "$GRAPH_JSON")
+fi
 python3 -m modelopt.onnx.quantization.autotune \
   --model "$MODEL" \
   --output "$OUTPUT_DIR" \
   --mode subgraph \
-  --graph-json "$GRAPH_JSON" \
+  "${GRAPH_JSON_ARGS[@]}" \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
# Set output dir (e.g. <model_prefix>_autotune_subgraph) and optional graph.json
MODEL="path/to/your_model.fp16.opt.onnx"
GRAPH_JSON=""  # optional; set to existing graph.json to reuse
OUTPUT_DIR="${MODEL%.onnx}_autotune_subgraph"
mkdir -p "$OUTPUT_DIR"

GRAPH_JSON_ARGS=()
if [ -n "$GRAPH_JSON" ]; then
  GRAPH_JSON_ARGS=(--graph-json "$GRAPH_JSON")
fi

# Optional: optShapes for dynamic inputs (comma-separated name:shape)
OPT_SHAPES='--optShapes=agents_feature:21x33x10x16,agents_mask:21x33x10'

python3 -m modelopt.onnx.quantization.autotune \
    --model "$MODEL" \
    --output "$OUTPUT_DIR" \
    --mode subgraph \
    "${GRAPH_JSON_ARGS[@]}" \
    --quant-type fp8 \
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qdq_placement/README.md` around lines 112 - 126, The README
currently always passes --graph-json "$GRAPH_JSON" which contradicts "optional
--graph-json"; update the example to conditionally include the flag when
GRAPH_JSON is non-empty (leave GRAPH_JSON unset or empty to trigger
auto-generation). Specifically, change the command invocation shown for python3
-m modelopt.onnx.quantization.autotune so it only appends the --graph-json
argument if GRAPH_JSON is set (or show the alternate bracketed pattern like
[--graph-json "$GRAPH_JSON"]) and document that leaving GRAPH_JSON empty omits
the flag; refer to the MODEL, GRAPH_JSON, OUTPUT_DIR variables and the
--graph-json flag to locate the snippet to edit.
```python
# Get the input tensor
graph = model.graph
input_tensor = graph.input[0]

print(f"Original input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}")

# Modify the batch dimension (first dimension)
if len(input_tensor.type.tensor_type.shape.dim) > 0:
    input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
    # Clear any symbolic dimension parameter
    input_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')

# Also update output shapes if needed
for output_tensor in graph.output:
    if len(output_tensor.type.tensor_type.shape.dim) > 0:
        output_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
        output_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')
```
Do not blindly rewrite output dim[0], and avoid single-input assumptions.
Current logic updates only the first model input but forces dim[0] on all outputs. That can silently produce invalid shapes for multi-input models or models where output dim[0] is not batch.
Suggested robustness fix

```diff
-    # Get the input tensor
-    graph = model.graph
-    input_tensor = graph.input[0]
+    graph = model.graph
+    if batch_size <= 0:
+        raise ValueError("batch_size must be a positive integer")
+    if not graph.input:
+        raise ValueError("Model has no graph inputs to rewrite")
+    input_tensors = list(graph.input)

-    print(f"Original input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}")
+    print("Original input shapes:")
+    for input_tensor in input_tensors:
+        print(f"  {input_tensor.name}: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}")

-    # Modify the batch dimension (first dimension)
-    if len(input_tensor.type.tensor_type.shape.dim) > 0:
-        input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
-        # Clear any symbolic dimension parameter
-        input_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')
-
-    # Also update output shapes if needed
-    for output_tensor in graph.output:
-        if len(output_tensor.type.tensor_type.shape.dim) > 0:
-            output_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
-            output_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')
+    # Modify batch dimension on model inputs only
+    for input_tensor in input_tensors:
+        if len(input_tensor.type.tensor_type.shape.dim) > 0:
+            input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
+            input_tensor.type.tensor_type.shape.dim[0].ClearField("dim_param")
```

```python
def _exec_and_log(self, cmd: list[str], log_file: str | None = None):
    """Execute trtexec command and write logs."""
    self.logger.debug(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)  # nosec B603
```
Remove the new # nosec bypass.
Inline Bandit suppressions are disallowed in this repo. This addition needs to be removed or routed through the explicit security-review path instead of suppressing B603 in-code.
As per coding guidelines, "Bandit security checks are enforced via pre-commit hooks. # nosec comments are not allowed as bypasses for security checks; contributors must request formal review from @NVIDIA/modelopt-setup-codeowners with documented justification instead."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 236 - 239,
Remove the inline Bandit suppression by deleting the "# nosec B603" comment on
the subprocess.run call inside the _exec_and_log method; leave the call as
subprocess.run(cmd, capture_output=True, text=True) (ensure it remains a
list-based invocation, not shell=True), and if Bandit flags this as a false
positive prepare a documented justification and request formal review from
`@NVIDIA/modelopt-setup-codeowners` per the security-review process instead of
using an in-code bypass.
```python
for consumer_idx in consumer_indices:
    epilogue_ops.append(consumer_idx)
    self._probe_epilogues_recursive(consumer_idx, current_step + 1, max_steps, epilogue_ops)
```
Epilogue candidates are added without fusibility/divergence gating.
consumer_idx is appended unconditionally, so non-fusible or divergent consumers can be added to regions, which contradicts the method’s own contract and over-expands epilogues.
Proposed fix

```diff
-        for consumer_idx in consumer_indices:
-            epilogue_ops.append(consumer_idx)
-            self._probe_epilogues_recursive(consumer_idx, current_step + 1, max_steps, epilogue_ops)
+        for consumer_idx in consumer_indices:
+            if consumer_idx >= len(self.graph.nodes):
+                continue
+            consumer = self.graph.nodes[consumer_idx]
+            if not self.is_fusible_node(consumer.op):
+                continue
+            if self._is_node_divergent(consumer_idx):
+                continue
+            if consumer_idx in epilogue_ops:
+                continue
+            epilogue_ops.append(consumer_idx)
+            self._probe_epilogues_recursive(consumer_idx, current_step + 1, max_steps, epilogue_ops)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 549
- 551, The loop in _probe_epilogues_recursive unconditionally appends
consumer_idx to epilogue_ops, which allows non-fusible or divergent consumers
into regions; change it to first verify the consumer meets the method's
fusibility/divergence predicates (the same checks used elsewhere in this
module—e.g., the file's existing fusibility/divergence helper(s) such as the
region/component fusibility check and divergence test) and only append and
recurse when those checks pass (i.e., if consumer is fusible with the current
region and not divergent), otherwise skip that consumer.
```python
def torch_node_ratio(self) -> float:
    """Count the number of nodes that are exported from PyTorch."""
    non_constant_nodes = [n for n in self.graph.nodes if n.op != "Constant"]
    slash_count = sum(1 for n in non_constant_nodes if n.name and n.name.startswith("/"))
    return slash_count / len(non_constant_nodes)
```
Guard torch_node_ratio() against empty non-constant graphs.
If all nodes are Constant (or graph has no nodes), this divides by zero and crashes inspection.
Proposed fix

```diff
     def torch_node_ratio(self) -> float:
         """Count the number of nodes that are exported from PyTorch."""
         non_constant_nodes = [n for n in self.graph.nodes if n.op != "Constant"]
+        if not non_constant_nodes:
+            return 0.0
         slash_count = sum(1 for n in non_constant_nodes if n.name and n.name.startswith("/"))
         return slash_count / len(non_constant_nodes)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 624
- 629, The torch_node_ratio method can divide by zero when there are no
non-constant nodes; update torch_node_ratio to guard against empty
non_constant_nodes by checking if non_constant_nodes is empty (len == 0) and
returning 0.0 (or another safe default) instead of performing the division; keep
the existing logic that builds non_constant_nodes from self.graph.nodes
(filtering on n.op != "Constant") and the slash_count calculation (checking
n.name and n.name.startswith("/")) and only compute slash_count /
len(non_constant_nodes) when the length is > 0.
```python
for i, region in enumerate(regions):
    if not include_all_regions:
        children_to_remove = [
            c for c in region.get_children() if not has_quantizable_operations(c, graph)
        ]
        for child in children_to_remove:
            region.remove_child(child)

    if not include_all_regions and not has_quantizable_operations(region, graph):
        logger.debug(f"Filtered out region {i} (no quantizable operations)")
        continue
```
include_all_regions=False does not actually filter the returned regions.
The loop skips logging for non-quantizable regions, but `return regions` still returns the unfiltered list. This makes `include_all_regions` behavior inconsistent with the function docs.
Proposed fix

```diff
-    logger.info("Analyzing region structure")
-    for i, region in enumerate(regions):
+    logger.info("Analyzing region structure")
+    filtered_regions: list[Region] = []
+    for i, region in enumerate(regions):
         if not include_all_regions:
             children_to_remove = [
                 c for c in region.get_children() if not has_quantizable_operations(c, graph)
             ]
             for child in children_to_remove:
                 region.remove_child(child)
         if not include_all_regions and not has_quantizable_operations(region, graph):
             logger.debug(f"Filtered out region {i} (no quantizable operations)")
             continue
+        filtered_regions.append(region)
         logger.debug(
             f"Region {i}: {region.get_type().value}, {len(region.get_all_nodes_recursive())} nodes, "
             f"path={region.metadata.get('path', 'N/A')}"
         )
         builder.print_tree(region, indent=2)
-    leaf_regions = sum(1 for r in regions if r.get_type() == RegionType.LEAF)
-    composite_regions = sum(1 for r in regions if r.get_type() == RegionType.COMPOSITE)
-    root_regions = sum(1 for r in regions if r.get_type() == RegionType.ROOT)
+    result_regions = regions if include_all_regions else filtered_regions
+    leaf_regions = sum(1 for r in result_regions if r.get_type() == RegionType.LEAF)
+    composite_regions = sum(1 for r in result_regions if r.get_type() == RegionType.COMPOSITE)
+    root_regions = sum(1 for r in result_regions if r.get_type() == RegionType.ROOT)
     all_nodes = set()
-    for region in regions:
+    for region in result_regions:
         all_nodes.update(region.get_all_nodes_recursive())
     total_nodes = len(all_nodes)
     coverage_pct = 100 * total_nodes / len(graph.nodes) if graph.nodes else 0
@@
-    paths = [r.metadata.get("path", "") for r in regions if r.metadata.get("path")]
+    paths = [r.metadata.get("path", "") for r in result_regions if r.metadata.get("path")]
@@
-    return regions
+    return result_regions
```

Also applies to: 791-817
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 773
- 783, The code currently mutates children but always returns the original
regions list, so include_all_regions=False has no effect; update the loop that
iterates over regions to build a new filtered list (e.g., filtered_regions = [])
and for each region: perform the existing child-pruning using
region.get_children() and region.remove_child(child), then if
include_all_regions is True or has_quantizable_operations(region, graph) is
True, append the region to filtered_regions; finally return filtered_regions
instead of regions; apply the same change to the analogous block that handles
regions later in the file (the block that currently mirrors lines ~791-817) so
both code paths honor include_all_regions.
```diff
         latency = _benchmark_instance.run(
-            model_path, log_file=log_file, flush_timing_cache=flush_timing_cache
+            model_path,
+            log_file=log_file,
+            flush_timing_cache=flush_timing_cache,
+            strip_shape_args=strip_shape_args,
+            extra_run_args=extra_run_args,
+            export_profile_path=export_profile_path,
         )
```
Don't forward trtexec-only kwargs to every benchmark backend.
`_benchmark_instance` can still be a `TensorRTPyBenchmark`, and that implementation does not accept `strip_shape_args`, `extra_run_args`, or `export_profile_path`. On the first subgraph benchmark this raises `TypeError`, gets caught here, and the workflow records `inf` instead of a real latency. Gate these kwargs to `TrtExecBenchmark` (or make the Python backend accept/ignore them).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/workflows.py` around lines 69 - 75, The
call to _benchmark_instance.run is forwarding trtexec-specific kwargs
(strip_shape_args, extra_run_args, export_profile_path) to all backends causing
a TypeError when _benchmark_instance is a TensorRTPyBenchmark; update the call
to only include those kwargs when the backend is TrtExecBenchmark (or otherwise
supports them) by either checking isinstance(_benchmark_instance,
TrtExecBenchmark) or testing for attribute/method support before adding
strip_shape_args/extra_run_args/export_profile_path to the kwargs dict, then
call _benchmark_instance.run with the filtered kwargs so non-trt backends
receive only supported parameters.
| import unittest | ||
|
|
||
| from modelopt.onnx.quantization.autotune.common import Config | ||
|
|
||
|
|
||
| class TestConfig(unittest.TestCase): | ||
| """Test Config class functionality.""" | ||
|
|
||
| def test_default_values(self): | ||
| """Test that Config has correct default values.""" | ||
| config = Config() | ||
|
|
||
| # Logging | ||
| self.assertEqual(config.verbose, False) | ||
|
|
||
| # Performance thresholds | ||
| self.assertEqual(config.performance_threshold, 1.02) | ||
|
|
||
| # Q/DQ defaults | ||
| self.assertEqual(config.default_q_scale, 0.1) | ||
| self.assertEqual(config.default_q_zero_point, 0) | ||
| self.assertEqual(config.default_quant_type, "int8") | ||
|
|
||
| # Region builder settings | ||
| self.assertEqual(config.maximum_sequence_region_size, 10) | ||
| self.assertEqual(config.minimum_topdown_search_size, 10) | ||
|
|
||
| # Scheme generation parameters | ||
| self.assertEqual(config.top_percent_to_mutate, 0.1) | ||
| self.assertEqual(config.minimum_schemes_to_mutate, 10) | ||
| self.assertEqual(config.maximum_mutations, 3) | ||
| self.assertEqual(config.maximum_generation_attempts, 100) | ||
|
|
||
| # Pattern cache parameters | ||
| self.assertEqual(config.pattern_cache_minimum_distance, 4) | ||
| self.assertEqual(config.pattern_cache_max_entries_per_pattern, 32) | ||
|
|
||
|
|
||
| def test_custom_values(self): | ||
| """Test creating Config with custom values.""" | ||
| config = Config( | ||
| verbose=True, | ||
| performance_threshold=1.05, | ||
| default_q_scale=0.05, | ||
| default_q_zero_point=128, | ||
| default_quant_type="fp8", | ||
| maximum_sequence_region_size=20, | ||
| ) | ||
|
|
||
| self.assertEqual(config.verbose, True) | ||
| self.assertEqual(config.performance_threshold, 1.05) | ||
| self.assertEqual(config.default_q_scale, 0.05) | ||
| self.assertEqual(config.default_q_zero_point, 128) | ||
| self.assertEqual(config.default_quant_type, "fp8") | ||
| self.assertEqual(config.maximum_sequence_region_size, 20) | ||
|
|
||
| def test_performance_threshold_validation(self): | ||
| """Test that performance_threshold must be >= 1.0.""" | ||
| config1 = Config(performance_threshold=1.0) | ||
| self.assertEqual(config1.performance_threshold, 1.0) | ||
|
|
||
| config2 = Config(performance_threshold=1.5) | ||
| self.assertEqual(config2.performance_threshold, 1.5) | ||
|
|
||
| def test_region_size_validation(self): | ||
| """Test that region size parameters are positive.""" | ||
| config = Config(maximum_sequence_region_size=50, minimum_topdown_search_size=5) | ||
| self.assertGreater(config.maximum_sequence_region_size, 0) | ||
| self.assertGreater(config.minimum_topdown_search_size, 0) | ||
|
|
||
| def test_genetic_algorithm_params(self): | ||
| """Test genetic algorithm parameters.""" | ||
| config = Config( | ||
| top_percent_to_mutate=0.2, | ||
| minimum_schemes_to_mutate=2, | ||
| maximum_mutations=5, | ||
| maximum_generation_attempts=50, | ||
| ) | ||
|
|
||
| self.assertEqual(config.top_percent_to_mutate, 0.2) | ||
| self.assertEqual(config.minimum_schemes_to_mutate, 2) | ||
| self.assertEqual(config.maximum_mutations, 5) | ||
| self.assertEqual(config.maximum_generation_attempts, 50) | ||
|
|
||
| def test_pattern_cache_params(self): | ||
| """Test pattern cache parameters.""" | ||
| config = Config(pattern_cache_minimum_distance=3, pattern_cache_max_entries_per_pattern=10) | ||
|
|
||
| self.assertEqual(config.pattern_cache_minimum_distance, 3) | ||
| self.assertEqual(config.pattern_cache_max_entries_per_pattern, 10) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| unittest.main() |
Use pytest style for this new test module.
This new test file is written in unittest, but repository guidance requires new tests under tests/**/*.py to be authored with pytest style.
As per coding guidelines: tests/**/*.py: “Write tests using pytest for all new features and examples”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 19 - 112,
Replace the unittest-based TestConfig class/tests with pytest-style tests:
convert each test_* method (test_default_values, test_custom_values,
test_performance_threshold_validation, test_region_size_validation,
test_genetic_algorithm_params, test_pattern_cache_params) into standalone
functions that instantiate Config and use plain assert statements (e.g., assert
config.verbose is False) instead of self.assertEqual/self.assertGreater, remove
the unittest.TestCase inheritance and the if __name__ == "__main__":
unittest.main() block, and keep references to the Config symbol so the same
behavior is tested under pytest.
Thanks @gcunhase for the review!
🧹 Nitpick comments (4)
tests/unit/onnx/quantization/autotune/test_fusion_grouping.py (1)
231-238: Consider a more specific assertion.

The assertion `len(outputs) > 0` is weak and doesn't verify the expected output tensor names. For a full graph group, the outputs should be the graph's output tensors (`relu_out` based on the fixture).

💡 Suggested improvement

```diff
 def test_full_group_boundaries(self):
     graph = _build_conv_relu_graph()
     all_nodes = list(graph.nodes)
     names = {n.name for n in all_nodes}
     inputs, outputs = _find_boundary_tensors(all_nodes, names, graph)
     assert "input" in inputs
-    assert len(outputs) > 0
+    assert "relu_out" in outputs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_fusion_grouping.py` around lines 231 - 238, the test_full_group_boundaries uses a weak assertion (len(outputs) > 0); update it to assert the expected output tensor name for the full conv-relu graph. In the test_full_group_boundaries function (which calls _build_conv_relu_graph and _find_boundary_tensors), replace the length check with a specific assertion that outputs equals or contains the set {'relu_out'} (or assert 'relu_out' in outputs) so the test verifies the graph's actual output tensor name returned by _find_boundary_tensors.

tests/unit/onnx/quantization/autotune/test_subgraph_extractor.py (1)
130-133: Consider clarifying test name or adding an empty list test.

The test `test_empty_node_list_raises` tests with a list containing a nonexistent node (`["nonexistent"]`), not an empty list. The test logic is correct for validating that no matching nodes triggers the error, but the name could be more precise (e.g., `test_nonexistent_nodes_raises`). Additionally, consider whether passing an actual empty list `[]` should also be tested for completeness.

🔧 Suggested rename and additional test

```diff
-    def test_empty_node_list_raises(self):
+    def test_nonexistent_nodes_raises(self):
         graph = _build_linear_graph()
         with pytest.raises(ValueError, match="None of the specified nodes"):
             extract_subgraph_by_nodes(graph, ["nonexistent"])
+
+    def test_empty_node_list_raises(self):
+        graph = _build_linear_graph()
+        with pytest.raises(ValueError, match="None of the specified nodes"):
+            extract_subgraph_by_nodes(graph, [])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_subgraph_extractor.py` around lines 130 - 133, rename the test `test_empty_node_list_raises` to `test_nonexistent_nodes_raises` to reflect that it passes a non-matching node list, and add a separate test (e.g., `test_empty_nodes_list_raises`) that calls extract_subgraph_by_nodes(graph, []) and asserts it raises ValueError with the same "None of the specified nodes" message; update any test references accordingly and keep the existing case that passes ["nonexistent"] to validate non-matching node behavior.

tests/unit/onnx/quantization/autotune/test_cli.py (1)
131-135: Avoid hardcoding preset internals in assertions.

Use `MODE_PRESETS["extensive"]` values (as done in quick preset tests) so this test tracks preset changes without churn.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_cli.py` around lines 131 - 135, the test_extensive_preset currently asserts hardcoded preset values; change it to reference the preset constants instead: call args = self._make_args("extensive"), apply_mode_presets(args), then assert args.num_schemes == MODE_PRESETS["extensive"]["num_schemes"] and args.timing_runs == MODE_PRESETS["extensive"]["timing_runs"] so the test uses the canonical MODE_PRESETS values and will follow future preset changes; update imports or module references as needed to access MODE_PRESETS in the test file.

tests/unit/onnx/quantization/autotune/test_subgraph_workflow.py (1)
321-328: Add ONNX model structural validation after export.

Include `onnx.checker.check_model(model)` to validate the exported model for structural consistency. This strengthens the test by catching malformed graphs earlier, beyond just verifying Q/DQ node presence.

Proposed enhancement

```diff
 def test_qdq_produces_valid_onnx(self):
     graph = _build_simple_graph()
     insert_qdq_on_graph(graph, ["input"], quant_type="int8")
     model = gs.export_onnx(graph)
+    onnx.checker.check_model(model)
     assert len(model.graph.node) > 0
     op_types = {n.op_type for n in model.graph.node}
     assert "QuantizeLinear" in op_types
     assert "DequantizeLinear" in op_types
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_subgraph_workflow.py` around lines 321 - 328, The test test_qdq_produces_valid_onnx currently exports a graph via gs.export_onnx(graph) and only asserts node presence; add structural validation by calling onnx.checker.check_model(model) after export to ensure the exported model is well-formed. Import or reference onnx.checker, then invoke onnx.checker.check_model(model) immediately after model = gs.export_onnx(graph) (keeping the existing assertions about QuantizeLinear/DequantizeLinear and insert_qdq_on_graph intact) so malformed ONNX graphs fail the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cde4c50f-1806-4f91-bed7-d224e1d18ad3
📥 Commits
Reviewing files that changed from the base of the PR and between cf3693855ff344d3ef389672629d38947f788c00 and 847398815434814c50c1c9117a2d81a7b2bbfc9f.
📒 Files selected for processing (4)
tests/unit/onnx/quantization/autotune/test_cli.pytests/unit/onnx/quantization/autotune/test_fusion_grouping.pytests/unit/onnx/quantization/autotune/test_subgraph_extractor.pytests/unit/onnx/quantization/autotune/test_subgraph_workflow.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
232-236: ⚠️ Potential issue | 🔴 Critical

Critical: constructor returns early and leaves object partially initialized

Line 236 exits `__init__` when `--remoteAutoTuningConfig` is supported, so `self._base_cmd` and `self._profile_unsupported` are never initialized. `run()` will then fail at runtime.

💡 Proposed fix

```diff
 if has_remote_config:
     try:
         _check_for_tensorrt(min_version="10.16")
         self.logger.debug("TensorRT Python API version >= 10.16 detected")
-        return
     except ImportError:
         self.logger.warning(
             "Remote autotuning is not supported with TensorRT version < 10.16"
             "Removing --remoteAutoTuningConfig from trtexec arguments"
         )
```
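The failure mode is easy to reproduce with a toy class (a minimal repro, not the actual benchmark code):

```python
# An early return in __init__ silently skips the assignments below it,
# so later method calls fail with AttributeError.
class Runner:
    def __init__(self, has_remote_config: bool):
        if has_remote_config:
            return  # bug: self._base_cmd is never set on this path
        self._base_cmd = ["trtexec"]

    def run(self):
        return self._base_cmd


assert Runner(False).run() == ["trtexec"]
try:
    Runner(True).run()
    failed = False
except AttributeError:
    failed = True
assert failed  # the early-return path leaves the object unusable
```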
♻️ Duplicate comments (4)
modelopt/onnx/quantization/autotune/fusion_grouping.py (1)
285-290: ⚠️ Potential issue | 🟠 Major

Validate generated `graph.json` before returning success.

After Line 285 checks the exit code, the function still returns success on Line 290 without confirming the file exists. `trtexec` can exit successfully while not emitting `--exportLayerInfo`, causing delayed downstream failures.

Suggested fix

```diff
 if result.returncode != 0:
     logger.error(f"trtexec failed (exit {result.returncode}), see {log_path}")
     raise RuntimeError(f"trtexec FP16 build failed: {log_path}")
+if not Path(graph_json_path).is_file():
+    logger.error(
+        f"trtexec exited with code 0 but graph.json was not generated: {graph_json_path}"
+    )
+    raise RuntimeError(f"graph.json was not generated: {graph_json_path}")
+logger.info(f"graph.json generated: {graph_json_path}")
 return graph_json_path
```

```bash
#!/bin/bash
# Verify whether generate_graph_json has an output-file existence guard before returning.
rg -n -C4 'def generate_graph_json|returncode != 0|return graph_json_path|is_file\(|exists\(' modelopt/onnx/quantization/autotune/fusion_grouping.py
```

Expected result: `return graph_json_path` appears without a preceding `is_file()`/`exists()` check in the success path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 285 - 290, the function generate_graph_json returns graph_json_path after checking trtexec exit code but doesn't verify the file was actually created; update generate_graph_json to check that graph_json_path exists and is a regular file (e.g., use Path(graph_json_path).is_file() or os.path.exists/isfile) before logging success and returning, and if the file is missing log an error with the log_path and raise a RuntimeError indicating the missing graph.json so downstream callers don't get a false success.

modelopt/onnx/quantization/autotune/benchmark.py (3)
260-264: ⚠️ Potential issue | 🔴 Critical

Remove inline Bandit suppression (`# nosec`)

Line 263 adds an in-code Bandit bypass. This repo disallows `# nosec` comments; keep the list-arg subprocess call, but remove the suppression.

💡 Proposed fix

```diff
-result = subprocess.run(cmd, capture_output=True, text=True)  # nosec B603
+result = subprocess.run(cmd, capture_output=True, text=True)
```

As per coding guidelines, "Bandit security checks are enforced via pre-commit hooks. `# nosec` comments are not allowed as bypasses for security checks; contributors must request formal review from `@NVIDIA/modelopt-setup-codeowners` with documented justification instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 260 - 264, In _exec_and_log, remove the inline Bandit suppression by deleting the "# nosec B603" trailing comment on the subprocess.run(...) call (keep the list-style arg usage: subprocess.run(cmd, capture_output=True, text=True)); do not replace it with another inline bypass—if security exceptions are needed, raise the issue for formal review with the codeowners as per policy. Ensure the call remains in the _exec_and_log method and that logging and _write_log_file behavior are unchanged.
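For context, a list-argv `subprocess.run` call never invokes a shell, which is the property Bandit's B603 check is concerned with, so no suppression is needed. A small sketch (the helper name is illustrative; `echo` is assumed to exist, as on POSIX systems):

```python
import subprocess


def run_and_capture(cmd: list[str]) -> str:
    # List-style argv: no shell interpolation, so untrusted strings in
    # cmd are passed as literal arguments rather than shell syntax.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


out = run_and_capture(["echo", "hello"])
assert out.strip() == "hello"
```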
339-348: ⚠️ Potential issue | 🟠 Major

Only disable profiling for profiling-related unknown-option failures

Lines 340-343 mark profiling unsupported for any "Unknown option" error, including unrelated `extra_run_args`. That can permanently disable profiling after a non-profiling CLI mistake.

💡 Proposed fix

```diff
 if result.returncode != 0:
-    if "Unknown option" in (result.stderr or ""):
+    stderr = result.stderr or ""
+    profiling_prefixes = (
+        "--separateProfileRun",
+        "--profilingVerbosity",
+        "--exportProfile",
+    )
+    profiling_requested = export_profile_path is not None
+    profiling_flags_in_cmd = any(
+        any(c.startswith(p) for p in profiling_prefixes) for c in cmd
+    )
+    profiling_error = any(p in stderr for p in profiling_prefixes)
+
+    if (
+        "Unknown option" in stderr
+        and profiling_requested
+        and (profiling_flags_in_cmd or profiling_error)
+    ):
         self.logger.warning("Profiling flags unsupported, retrying without them")
         self._profile_unsupported = True
         cmd = [
             c
             for c in cmd
             if not any(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 339 - 348, The code currently treats any "Unknown option" stderr as proof profiling is unsupported and sets self._profile_unsupported; change this to only set self._profile_unsupported (and strip profiling flags from cmd) when the stderr specifically names one of the profiling flags. Update the conditional around result.returncode != 0 to inspect result.stderr for any of the profiling flag tokens used later (e.g., "--separateProfileRun", "--profilingVerbosity", "--exportProfile" or their short forms) and only then run the block that logs, sets self._profile_unsupported, and filters cmd; leave other "Unknown option" errors (e.g., from extra_run_args) to be handled normally.
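The targeted stripping amounts to a prefix filter over the argv list. A sketch assuming the three flag names named in the comment are the only profiling options involved:

```python
PROFILING_PREFIXES = ("--separateProfileRun", "--profilingVerbosity", "--exportProfile")


def strip_profiling_flags(cmd: list[str]) -> list[str]:
    # Drop only arguments that begin with a known profiling flag,
    # leaving unrelated extra_run_args untouched.
    return [c for c in cmd if not any(c.startswith(p) for p in PROFILING_PREFIXES)]


cmd = ["trtexec", "--onnx=model.onnx", "--exportProfile=prof.json", "--fp16"]
assert strip_profiling_flags(cmd) == ["trtexec", "--onnx=model.onnx", "--fp16"]
```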
320-324: ⚠️ Potential issue | 🟠 Major

`strip_shape_args` still leaves other shape flags in place

Lines 320-324 only strip `--optShapes`/`--minShapes`/`--maxShapes`. Flags like `--shapes` and calibration-shape options can still leak full-model bindings into subgraph runs.

💡 Proposed fix

```diff
 if strip_shape_args:
     cmd = [
         c
         for c in cmd
-        if not any(c.startswith(p) for p in ("--optShapes", "--minShapes", "--maxShapes"))
+        if not (
+            "shapes" in c.lower()
+            or c.startswith("--shape")
+            or c.startswith("--shapes")
+            or c.startswith("--calibration")
+        )
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 320 - 324, In benchmark.py the strip_shape_args branch that filters cmd currently only removes "--optShapes", "--minShapes", and "--maxShapes" which lets other shape-related flags like "--shapes" or calibration-shape options slip through; update the filtering logic in the strip_shape_args block that processes the cmd list (same scope where cmd is rebuilt) to also exclude "--shapes" and any other shape/calibration shape flags by matching flag names case-insensitively (e.g., any c that startswith "--shapes" or where c.lower() contains "shape" or "calib" combined with "shape"), so all shape-related arguments are removed before spawning subgraph runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 90e63dfd-dee0-4e93-a164-5574ed54b7c2
📥 Commits
Reviewing files that changed from the base of the PR and between 847398815434814c50c1c9117a2d81a7b2bbfc9f and fa67ad118e4e2b05b953813934ba6777a453e7c2.
📒 Files selected for processing (2)
modelopt/onnx/quantization/autotune/benchmark.pymodelopt/onnx/quantization/autotune/fusion_grouping.py
♻️ Duplicate comments (3)
modelopt/onnx/quantization/autotune/fusion_grouping.py (1)
295-300: ⚠️ Potential issue | 🟡 Minor

Add existence check for generated graph.json file.

The function checks `returncode != 0` but doesn't verify that `graph.json` was actually created. `trtexec` may exit with code 0 even when output files weren't generated (e.g., due to permission issues or disk space). This can cause obscured downstream failures.

🛡️ Suggested fix

```diff
 if result.returncode != 0:
     logger.error(f"trtexec failed (exit {result.returncode}), see {log_path}")
     raise RuntimeError(f"trtexec FP16 build failed: {log_path}")
+if not Path(graph_json_path).is_file():
+    logger.error(f"trtexec exited successfully but no graph.json at {graph_json_path}")
+    raise RuntimeError(f"graph.json was not generated: {graph_json_path}")
+logger.info(f"graph.json generated: {graph_json_path}")
 return graph_json_path
```

🤖 Prompt for AI Agents
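The pattern generalizes to any subprocess that is expected to emit an artifact; a self-contained sketch (the helper name is made up for illustration):

```python
import tempfile
from pathlib import Path


def build_succeeded(returncode: int, output_path: str) -> bool:
    # Exit code 0 alone is not proof of success: the expected
    # artifact must also exist on disk.
    return returncode == 0 and Path(output_path).is_file()


with tempfile.TemporaryDirectory() as tmp:
    graph_json = Path(tmp) / "graph.json"
    assert build_succeeded(0, str(graph_json)) is False  # exit 0, but no file
    graph_json.write_text("{}")
    assert build_succeeded(0, str(graph_json)) is True
    assert build_succeeded(1, str(graph_json)) is False  # nonzero exit
```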
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 295 - 300, the code currently only checks trtexec's returncode but not that the expected output file exists; after the trtexec run (where variables result, log_path and graph_json_path are available and logger is used) add an explicit existence check (e.g., os.path.exists(graph_json_path)) and if the file is missing log a clear error including graph_json_path and log_path and raise a RuntimeError indicating the FP16 build produced no graph.json; ensure this check is placed before returning graph_json_path so downstream callers never receive a non-existent path.

modelopt/onnx/quantization/autotune/subgraph_workflow.py (2)
820-828: ⚠️ Potential issue | 🟠 Major

Add validation for cached metadata before resuming.

The cache loading at line 821 doesn't validate that the cached `version` and `model_path` match the current run. Reusing an output directory with a different model or after code changes could replay stale Phase 2/3 decisions.

🛡️ Suggested fix

```diff
 # ─── Load cache ──────────────────────────────────────────────────────
-cache = _load_cache(cache_path) or {
-    "version": CACHE_VERSION, "model_path": model_path,
-}
+cache = _load_cache(cache_path)
+if cache is not None:
+    # Validate cache matches current run parameters
+    if (cache.get("version") != CACHE_VERSION
+            or cache.get("model_path") != model_path):
+        logger.warning(
+            "Cache mismatch (version or model_path changed), starting fresh"
+        )
+        cache = None
+if cache is None:
+    cache = {"version": CACHE_VERSION, "model_path": model_path}
 cached_p2: dict = cache.get("phase2", {})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 820 - 828, The loaded cache from _load_cache must be validated: check cache.get("version") equals CACHE_VERSION and cache.get("model_path") equals model_path before trusting cached_p2 / cached_p3; if either mismatches, discard or reinitialize the cache (e.g., set cache = {"version": CACHE_VERSION, "model_path": model_path} and clear phase2/phase3 results) and emit a warning log; implement this validation immediately after the cache = _load_cache(...) line and before using cached_p2, cached_p2_ids, or cached_p3 so stale Phase 2/3 decisions are not replayed.
128-134: ⚠️ Potential issue | 🟡 Minor

Protobuf dimension manipulation uses non-standard pattern.

Using `pop()` on protobuf repeated fields is non-standard and may not work consistently across protobuf versions. The recommended approach is to clear and rebuild.

♻️ Safer approach

```diff
 for inp in model.graph.input:
     if inp.name not in input_shapes:
         continue
     shape = input_shapes[inp.name]
     dim_proto = inp.type.tensor_type.shape.dim
-    while len(dim_proto) > len(shape):
-        dim_proto.pop()
-    while len(dim_proto) < len(shape):
-        dim_proto.add()
+    # Clear and rebuild dimensions
+    del dim_proto[:]
     for i, d in enumerate(shape):
-        dim_proto[i].ClearField("dim_param")
-        dim_proto[i].dim_value = d
+        new_dim = dim_proto.add()
+        new_dim.dim_value = d
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 128 - 134, Replace the non-standard pop/add loop on the protobuf repeated field with a clear-and-rebuild pattern: call dim_proto.Clear() (or dim_proto.clear() depending on proto API) to remove existing entries, then iterate over shape and for each value call new_dim = dim_proto.add(); new_dim.ClearField("dim_param"); new_dim.dim_value = d. Update the block that currently manipulates dim_proto and shape so it uses this clear-and-repopulate approach (keeping the same field names dim_proto, dim_param, dim_value) to ensure consistent behavior across protobuf versions.
🧹 Nitpick comments (3)
modelopt/onnx/quantization/autotune/__main__.py (1)
55-72: Consider explicit `allow_pickle=False` for security clarity.

While NumPy defaults to `allow_pickle=False` since version 1.16.3, explicitly specifying it makes the security posture clear and guards against potential future changes or environments with older NumPy.

♻️ Suggested improvement

```diff
 if path_str.endswith(".npz"):
-    data = np.load(path_str)
+    data = np.load(path_str, allow_pickle=False)
     return dict(data)
 elif path_str.endswith(".npy"):
-    return np.load(path_str)
+    return np.load(path_str, allow_pickle=False)
```

🤖 Prompt for AI Agents
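A quick round-trip shows the behavior: plain numeric arrays load fine with the flag set, while pickled object arrays are rejected:

```python
import os
import tempfile

import numpy as np

# Plain numeric array: loads fine with pickling explicitly disabled.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "calib.npy")
np.save(path, np.arange(4, dtype=np.float32))
loaded = np.load(path, allow_pickle=False)
assert loaded.tolist() == [0.0, 1.0, 2.0, 3.0]

# Object array: requires pickle, so the explicit flag rejects it.
obj_path = os.path.join(tmp, "obj.npy")
np.save(obj_path, np.array([{"a": 1}], dtype=object), allow_pickle=True)
try:
    np.load(obj_path, allow_pickle=False)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```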
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/__main__.py` around lines 55 - 72, the _load_calibration_data function uses np.load to read .npz and .npy files; update both calls to np.load in that function (the np.load call when path_str.endswith(".npz") and the return np.load(path_str) for .npy) to pass allow_pickle=False explicitly to make the security posture explicit and prevent pickled object loading.

modelopt/onnx/quantization/autotune/subgraph_workflow.py (1)
54-71: FP8 fallback now warns user, but consider runtime validation.

The warning at import time addresses the silent fallback concern. However, users requesting `--quant_type fp8` may not notice the import-time warning. Consider adding validation at workflow entry to fail early with a clear error when FP8 is requested but unavailable.

♻️ Consider adding runtime validation

```python
# At the start of subgraph_autotuning_workflow, add:
if quant_type == "fp8" and QUANT_DTYPES["fp8"] == np.int8:
    raise ValueError(
        "FP8 quantization requested but float8_e4m3fn dtype is unavailable. "
        "Install ml_dtypes package for native FP8 support."
    )
```

🤖 Prompt for AI Agents
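That early check behaves like this small sketch, where the `QUANT_DTYPES` table is a stand-in simulating the int8 fallback (the real table lives in the workflow module):

```python
import numpy as np

# Simulated fallback table: fp8 degraded to int8, as when ml_dtypes
# is not installed (stand-in for the real module-level mapping).
QUANT_DTYPES = {"int8": np.int8, "fp8": np.int8}


def validate_quant_type(quant_type: str) -> None:
    # Fail fast at workflow entry instead of warning only at import time.
    if quant_type == "fp8" and QUANT_DTYPES["fp8"] == np.int8:
        raise ValueError(
            "FP8 quantization requested but float8_e4m3fn dtype is unavailable. "
            "Install ml_dtypes package for native FP8 support."
        )


validate_quant_type("int8")  # no-op for supported types
try:
    validate_quant_type("fp8")
    raised = False
except ValueError:
    raised = True
assert raised
```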
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 54 - 71, add a runtime check at the start of subgraph_autotuning_workflow to validate FP8 availability: if the caller requested quant_type == "fp8" but QUANT_DTYPES["fp8"] resolved to np.int8 (as returned by _get_fp8_dtype()), raise a clear ValueError instructing the user to install ml_dtypes for native FP8 support; this ensures explicit failure when FP8 was requested but unavailable rather than relying solely on the import-time warning.

modelopt/onnx/quantization/autotune/fusion_grouping.py (1)
122-127: Remove or document the unused `_get_op_type_from_node_name` helper function.

The `_get_op_type_from_node_name` function is defined but not called anywhere in the codebase. Either remove it to reduce dead code, or add a comment documenting why it is retained for future use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 122 - 127, The helper function _get_op_type_from_node_name is defined but unused; either remove the function to eliminate dead code, or retain it with a clear comment/docstring explaining its intended future use (e.g., lookup helper for debugging or future fusion logic) and add an annotation (like TODO or rationale) so linters/maintainers know it is intentionally kept; update the docstring/comment inside fusion_grouping.py adjacent to _get_op_type_from_node_name to reflect the chosen action.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 295-300: The code currently only checks trtexec's returncode but
not that the expected output file exists; after the trtexec run (where variables
result, log_path and graph_json_path are available and logger is used) add an
explicit existence check (e.g., os.path.exists(graph_json_path)) and if the file
is missing log a clear error including graph_json_path and log_path and raise a
RuntimeError indicating the FP16 build produced no graph.json; ensure this check
is placed before returning graph_json_path so downstream callers never receive a
non-existent path.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 820-828: The loaded cache from _load_cache must be validated:
check cache.get("version") equals CACHE_VERSION and cache.get("model_path")
equals model_path before trusting cached_p2 / cached_p3; if either mismatches,
discard or reinitialize the cache (e.g., set cache = {"version": CACHE_VERSION,
"model_path": model_path} and clear phase2/phase3 results) and emit a warning
log; implement this validation immediately after the cache = _load_cache(...)
line and before using cached_p2, cached_p2_ids, or cached_p3 so stale Phase 2/3
decisions are not replayed.
- Around line 128-134: Replace the non-standard pop/add loop on the protobuf
repeated field with a clear-and-rebuild pattern: call dim_proto.Clear() (or
dim_proto.clear() depending on proto API) to remove existing entries, then
iterate over shape and for each value call new_dim = dim_proto.add();
new_dim.ClearField("dim_param"); new_dim.dim_value = d. Update the block that
currently manipulates dim_proto and shape so it uses this clear-and-repopulate
approach (keeping the same field names dim_proto, dim_param, dim_value) to
ensure consistent behavior across protobuf versions.
---
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/__main__.py`:
- Around line 55-72: The _load_calibration_data function uses np.load to read
.npz and .npy files; update both calls to np.load in that function (the np.load
call when path_str.endswith(".npz") and the return np.load(path_str) for .npy)
to pass allow_pickle=False explicitly to make the security posture explicit and
prevent pickled object loading.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 122-127: The helper function _get_op_type_from_node_name is
defined but unused; either remove the function to eliminate dead code, or retain
it with a clear comment/docstring explaining its intended future use (e.g.,
lookup helper for debugging or future fusion logic) and add an annotation (like
TODO or rationale) so linters/maintainers know it is intentionally kept; update
the docstring/comment inside fusion_grouping.py adjacent to
_get_op_type_from_node_name to reflect the chosen action.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 54-71: Add a runtime check at the start of
subgraph_autotuning_workflow to validate FP8 availability: if the caller
requested quant_type == "fp8" but QUANT_DTYPES["fp8"] resolved to np.int8 (as
returned by _get_fp8_dtype()), raise a clear ValueError instructing the user to
install ml_dtypes for native FP8 support; this ensures explicit failure when FP8
was requested but unavailable rather than relying solely on the import-time
warning.
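The explicit FP8 check can be sketched as below; `_get_fp8_dtype` and `QUANT_DTYPES` are stand-ins mirroring the names in the comment, not the actual definitions in the module:

```python
import numpy as np

def _get_fp8_dtype():
    """Return a native FP8 dtype if ml_dtypes is available, else int8."""
    try:
        import ml_dtypes
        return np.dtype(ml_dtypes.float8_e4m3fn)
    except ImportError:
        return np.int8  # fallback when ml_dtypes is unavailable

QUANT_DTYPES = {"int8": np.int8, "fp8": _get_fp8_dtype()}

def check_fp8_available(quant_type: str) -> None:
    """Raise early when FP8 was requested but only the int8 fallback exists."""
    if quant_type == "fp8" and QUANT_DTYPES["fp8"] == np.int8:
        raise ValueError(
            "quant_type='fp8' requested but ml_dtypes is not installed; "
            "install ml_dtypes for native FP8 support."
        )
```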
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9a87f60a-3119-4ffb-8d28-3099c54cc8a6
📥 Commits
Reviewing files that changed from the base of the PR and between fa67ad118e4e2b05b953813934ba6777a453e7c2 and fc4d2f65d2230e5bb319d3f54f406a797d60b965.
📒 Files selected for processing (3)
- modelopt/onnx/quantization/autotune/__main__.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
Expand ONNX Q/DQ autotune with a new subgraph workflow (--workflow subgraph) that uses TensorRT fusion boundaries for faster, fusion-aware QDQ placement. Key additions:
- Subgraph workflow: fusion grouping from graph.json, per-subgraph heuristic scheme profiling, optional per-layer timing, incremental full-model validation
- Fusion grouping (fusion_grouping.py): parse TRT graph.json, auto-generate via trtexec if not provided
- Subgraph extraction (subgraph_extractor.py): extract standalone ONNX subgraphs per fusion group
- Torch region builder (torch_region_builder.py): PyTorch-style hierarchical region discovery
- TensorRT utils (tensorrt_utils.py): TRT Python API benchmark with timing cache and plugin support
- QDQ utilities (qdq_utils.py): quantized tensor discovery helpers
- Cache/resume via autotune_cache.json for long-running jobs
- trtexec profiling-flag compatibility: retry without --exportProfile on older TRT versions
- CLI: --workflow, --graph_json, --incremental_validation flags
- Documentation and examples (qdq_placement/)

Signed-off-by: wahao <wahao@nvidia.com>
Made-with: Cursor
…, extractor, and CLI

Cover shape parsing, profile parsing, cache I/O, QDQ insertion, heuristic scheme generation, boundary tensor resolution, subgraph extraction, metadata parsing, graph.json parsing, and CLI argument parsing with mode presets.

Signed-off-by: wahao <wahao@nvidia.com>
Made-with: Cursor
Signed-off-by: wahao <wahao@nvidia.com>
Force-pushed from fc4d2f6 to 26f8621
Pull Request: Expand ONNX Q/DQ Autotuning with Subgraph Mode
Branch: dev-wahao-autotune-subgraph-profile → main
Type: Feature
Summary
This PR expands on the existing ONNX Q/DQ Autotune framework by adding a subgraph workflow (--workflow subgraph) that uses TensorRT fusion boundaries for faster, fusion-aware QDQ placement optimization. It also integrates @willg-nv's TorchRegionBuilder (from Draft PR #963) for PyTorch-style hierarchical region discovery, and adds TensorRT Python API benchmarking utilities, documentation, and examples.

Relation to existing PRs: TorchRegionBuilder is included here along with the subgraph workflow.

What's New
- --workflow subgraph: fusion-aware grouping from TensorRT graph.json; per-subgraph QDQ scheme profiling; optional per-layer timing with fallback to total latency.
- fusion_grouping.py: parse TRT graph.json, build fusion groups, infer shapes. Auto-generates graph.json via trtexec FP16 build if not provided.
- subgraph_extractor.py: extract standalone ONNX subgraphs per fusion group for isolated benchmarking.
- torch_region_builder.py: PyTorch-style hierarchical region discovery using node name conventions (from #963).
- tensorrt_utils.py: TRT Python API benchmark with timing cache, plugin support, and configurable warmup/timing runs.
- Output models: optimized_raw.onnx + optimized_final.onnx.
- Cache/resume: autotune_cache.json for Phase 2 (subgraph profiling) and Phase 3 (incremental validation).
- trtexec compatibility: detect unsupported --exportProfile/--profilingVerbosity and retry with total latency.
- CLI: --workflow {region,subgraph}, --graph_json, --incremental_validation/--no-incremental-validation.
- examples/qdq_placement/: README (Quick Start, region vs subgraph, best practices) and set_batch_size.py.

Key Files
- modelopt/onnx/quantization/autotune/__main__.py: CLI flags --workflow, --graph-json, --incremental-validation, --use-trtexec, --trtexec-args, etc.
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py: parse graph.json, create fusion groups, generate_graph_json() (trtexec FP16 build when no graph is provided).
- modelopt/onnx/quantization/autotune/subgraph_extractor.py
- modelopt/onnx/quantization/autotune/tensorrt_utils.py
- modelopt/onnx/quantization/autotune/torch_region_builder.py
- modelopt/onnx/quantization/autotune/benchmark.py: export_profile_path, profiling-flag dedup and "Unknown option" retry.
- modelopt/onnx/quantization/autotune/workflows.py: benchmark_onnx_model(); passes through export_profile_path when using trtexec.
- modelopt/onnx/quantization/autotune/qdq_utils.py
- examples/qdq_placement/README.md
- examples/qdq_placement/set_batch_size.py
- tests/unit/onnx/quantization/autotune/test_config.py

How to Test
Region mode (default):
Subgraph mode with trtexec (FP8, optional graph.json):
Resume: Kill the subgraph run mid-way, then re-run the same command; it should resume from autotune_cache.json.

Checklist
- main with DCO sign-off
- --use-trtexec verified
- examples/qdq_placement/README.md matches behavior

Documentation
- examples/qdq_placement/README.md — Quick Start, subgraph best practices, output layout, optional graph generation.
- docs/source/guides/9_qdq_placement.rst and docs/source/reference/2_qdq_placement.rst align with the CLI and behavior above.

Notes
Older TensorRT versions that reject --exportProfile/--profilingVerbosity are handled by retrying without those flags and using total latency for scheme selection.

Summary by CodeRabbit
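The PyTorch-style hierarchical region discovery attributed to torch_region_builder.py above can be illustrated with a toy grouper. This is not the actual TorchRegionBuilder from #963; the node-name format and the fixed grouping depth are assumptions, shown only to convey the idea that PyTorch-exported ONNX node names encode module hierarchy:

```python
from collections import defaultdict

def group_by_region(node_names, depth=2):
    """Group ONNX node names by their first `depth` module-path components.

    Toy illustration: names like "/encoder/layer.0/attn/MatMul" encode the
    exporting PyTorch module hierarchy, so a shared prefix identifies a region.
    """
    regions = defaultdict(list)
    for name in node_names:
        parts = [p for p in name.split("/") if p]
        if len(parts) > depth:
            key = "/".join(parts[:depth])
        else:
            key = "/".join(parts[:-1])  # shallow node: group by its parent
        regions[key].append(name)
    return dict(regions)

nodes = [
    "/encoder/layer.0/attn/MatMul",
    "/encoder/layer.0/attn/Softmax",
    "/encoder/layer.1/mlp/Gemm",
    "/head/Gemm",
]
print(group_by_region(nodes))
```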