[WIP][Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015
Hale423 wants to merge 5 commits into NVIDIA:main
Conversation
cjluo-nv
left a comment
This PR introduces 16k+ lines of changes. Please consider sharing a design and getting a design review.
Thanks for the feedback. Sharing this design, please kindly take a look.

Design: ONNX Q/DQ Autotuning with Subgraph Mode

Design review document for PR #1015

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:
This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements. The design intentionally supports two workflows:
2. Goals
3. Non-goals
4. Scope Relative to
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the relevant setting.
📝 Walkthrough

Adds a subgraph-based Q/DQ placement autotuner: docs and examples, a CLI workflow switch to "subgraph", TensorRT benchmarking (trtexec + Python API), fusion-group analysis from TensorRT graph.json, subgraph extraction and QDQ insertion, heuristic scheme generation, multi-phase profiling/validation with caching/resume, and many unit tests.
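For orientation, a Q/DQ pair is numerically a fake-quantization of the tensor it wraps; a minimal numpy sketch of INT8 QuantizeLinear/DequantizeLinear semantics (illustrative, not code from this PR):

```python
import numpy as np

def fake_quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Numerics of a QuantizeLinear/DequantizeLinear pair for INT8:
    q  = clip(round(x / scale), -128, 127)   # QuantizeLinear
    x' = q * scale                           # DequantizeLinear
    """
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q.astype(np.float32) * np.float32(scale)

x = np.array([0.1, -0.5, 2.0], dtype=np.float32)
y = fake_quantize_int8(x, scale=0.01)  # 2.0 saturates at 127 * 0.01 = 1.27
```

Where the autotuner inserts such a pair changes which TensorRT layer fusions survive, which is why placement (not just presence) drives latency.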
```mermaid
sequenceDiagram
participant CLI as CLI / User
participant WF as Subgraph Workflow
participant GA as Graph Analysis
participant SG as Subgraph Extraction
participant BM as Benchmarking
participant FS as FileSystem
CLI->>WF: subgraph_autotuning_workflow(model_path, ...)
rect rgba(100,200,150,0.5)
note over WF: Phase 1: Setup
WF->>GA: parse graph.json + create_fusion_groups
GA->>GA: Map TRT layers → ONNX nodes
GA-->>WF: FusionGroup list
end
rect rgba(100,150,200,0.5)
note over WF: Phase 2: Per-Subgraph Profiling
loop For each FusionGroup
WF->>WF: generate_heuristic_schemes
loop For each scheme
WF->>SG: extract_subgraph_by_nodes
SG-->>WF: subgraph bytes
WF->>WF: insert_qdq_on_graph
WF->>BM: benchmark_onnx_model
BM-->>WF: latency_ms
end
WF->>WF: select best_scheme
end
WF->>FS: cache phase2 results
end
rect rgba(200,150,100,0.5)
note over WF: Phase 3: Full-Model Validation
WF->>BM: baseline FP16 benchmark full model
BM-->>WF: baseline_latency_ms
loop Incremental Validation (if enabled)
WF->>WF: apply candidate QDQ schemes
WF->>BM: benchmark full model per candidate
BM-->>WF: test latency_ms
WF->>WF: keep if latency improves
end
WF->>FS: save optimized_final.onnx and cache phase3 progress
end
WF-->>CLI: return optimized_model_path
```
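The two profiling phases in the diagram can be condensed into a toy skeleton. All names (`schemes_for`, `benchmark`, the per-group loop) are illustrative stand-ins, not the PR's actual API; the real workflow also extracts subgraphs, inserts Q/DQ nodes, and caches results. The 1.02 gate mirrors the documented default `performance_threshold` (at least 2% speedup):

```python
# Toy skeleton of Phase 2 (per-subgraph profiling) and Phase 3 (validation gate).
def autotune(groups, schemes_for, benchmark, baseline_latency, threshold=1.02):
    best = {}
    for group in groups:  # Phase 2: pick the fastest Q/DQ scheme per fusion group
        latency, scheme = min((benchmark(group, s), s) for s in schemes_for(group))
        best[group] = (latency, scheme)
    accepted = {}
    for group, (latency, scheme) in best.items():  # Phase 3: keep only real wins
        if baseline_latency / latency >= threshold:
            accepted[group] = scheme
    return accepted

# Stub benchmark: pretend "int8_inputs" is the fast placement.
result = autotune(
    groups=["conv_block"],
    schemes_for=lambda g: ["int8_all", "int8_inputs"],
    benchmark=lambda g, s: {"int8_all": 1.2, "int8_inputs": 0.9}[s],
    baseline_latency=1.0,
)
print(result)  # {'conv_block': 'int8_inputs'}
```

A scheme that only matches baseline within the threshold (e.g. 0.99 ms vs a 1.0 ms baseline) is rejected, which is what "keep if latency improves" in the diagram enforces.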
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
34-34: ⚠️ Potential issue | 🟡 Minor

`# nosec` comments require explicit justification and approval.

As per coding guidelines, use of `# nosec` comments to bypass Bandit security checks is not allowed. If this security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with explicit justification in the PR description.

The subprocess import itself is safe when used properly (with list arguments, not shell=True), but the `# nosec B404` comment should be documented or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` at line 34, Remove the inline "# nosec B404" on the subprocess import (import subprocess) unless you add an explicit justification and approval from `@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is required, keep the comment but add the justification text to the PR and ensure all uses of subprocess in this module (functions/methods that call subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a list, avoid shell=True) and reference that approval in the PR description.
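For reference, the safe subprocess pattern this comment alludes to — arguments as a list, no shell=True — can be demonstrated with a stand-in command (`echo` here, not the module's real trtexec invocation):

```python
import subprocess

# Arguments passed as a list with shell=False (the default): nothing is
# interpreted by a shell, so a model path containing spaces or metacharacters
# cannot inject extra commands.
result = subprocess.run(
    ["echo", "model.onnx; rm -rf /tmp/x"],
    capture_output=True,
    text=True,
    check=True,
)
# The argument is echoed literally; the ";" never reached a shell.
print(result.stdout.strip())
```

With `shell=True` and string interpolation, the same `;` would have been interpreted as a command separator.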
🧹 Nitpick comments (6)
examples/qdq_placement/set_batch_size.py (1)
62-65: Consider handling large models (>2GB) for model verification.

The direct call to `onnx.checker.check_model(output_path)` may fail for models larger than 2GB due to protobuf size limits. The codebase has a utility in `modelopt/onnx/utils.py` (see the `check_model` function) that handles this case by using external data storage.

Suggested approach

```diff
+import tempfile
+import os
+
 def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
     # ... existing code ...
     # Verify the saved model
     print("Verifying model...")
-    onnx.checker.check_model(output_path)
+    # Handle large models that exceed protobuf 2GB limit
+    saved_model = onnx.load(output_path)
+    if saved_model.ByteSize() > (2 * (1024**3)):
+        # For large models, check_model needs the file path with external data
+        onnx.checker.check_model(output_path)
+    else:
+        onnx.checker.check_model(saved_model)
     print("✓ Model saved and verified successfully!")
```

Alternatively, consider importing and using `modelopt.onnx.utils.check_model` for consistency with the rest of the codebase.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/set_batch_size.py` around lines 62 - 65, Replace the direct call to onnx.checker.check_model(output_path) with the project's robust checker that handles >2GB models by importing and calling modelopt.onnx.utils.check_model; locate the verification block around the print("Verifying model...") and change the call to use modelopt.onnx.utils.check_model(output_path) (or import the function as check_model and call check_model(output_path)) so external-data large protobuf models are supported.

modelopt/onnx/quantization/autotune/qdq_utils.py (2)
60-60: Simplify redundant condition.

The check `node.input and len(node.input) > 0` is redundant since `node.input` being truthy already implies it's non-empty.

Suggested simplification

```diff
-    if node.input and len(node.input) > 0:
+    if node.input:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 60, The condition checking node inputs is redundant: replace the `if node.input and len(node.input) > 0:` guard with a simpler truthy check `if node.input:` in the function where this appears (look for the code handling `node.input` in qdq_utils.py) so the branch behavior remains identical but the expression is simplified.
1-1: Update copyright year for consistency.

The copyright year is 2024, but other new files in this PR use 2026. Consider updating for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 1, Update the SPDX header year in modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR (change "2024" to "2026"); locate the SPDX comment at the top of the file and modify the year in the copyright line so it is consistent across files.

tests/unit/onnx/quantization/autotune/test_config.py (1)
12-13: Remove unnecessary sys.path manipulation.

The `sys.path.insert` is unnecessary if the package is properly installed in the test environment. This pattern can cause import issues and is not recommended.

Suggested fix

```diff
-# Add parent directory to path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
 from modelopt.onnx.quantization.autotune.common import Config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 12 - 13, Remove the ad-hoc sys.path manipulation in the test file: delete the sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in tests/unit/onnx/quantization/autotune/test_config.py and rely on the package being installed in the test environment (or use test runner configuration / PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in this file.

modelopt/onnx/quantization/autotune/subgraph_extractor.py (1)
170-181: Consider using `collections.deque` for BFS queue.

Using `list.pop(0)` is O(n) per operation. For potentially large graphs, using `collections.deque` with `popleft()` provides O(1) performance.

♻️ Suggested improvement

```diff
+from collections import deque
+
 def _find_reachable_graph_inputs(
     graph: gs.Graph,
     target_nodes: List[gs.Node],
 ) -> List[str]:
     """BFS backward from target_nodes to find graph inputs that feed into them."""
     graph_input_names = {t.name for t in graph.inputs}
     visited = set()
-    queue = []
+    queue = deque()
     result = []
     for node in target_nodes:
         for inp in node.inputs:
             if isinstance(inp, gs.Variable) and inp.name not in visited:
                 visited.add(inp.name)
                 queue.append(inp)
     while queue:
-        tensor = queue.pop(0)
+        tensor = queue.popleft()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 170 - 181, The BFS loop in subgraph_extractor.py uses a Python list and pop(0), causing O(n) dequeuing; replace the list-based queue with collections.deque: import deque, initialize queue = deque(...) where the queue is created, and change queue.pop(0) to queue.popleft(), keeping queue.append(...) for enqueueing; update any code that constructs the initial queue and ensure imports include collections.deque so the BFS in the function that contains this loop uses O(1) dequeue operations.

examples/qdq_placement/README.md (1)
152-162: Add language specifier to fenced code blocks.

The directory structure code blocks should have a language specifier for consistency. Consider using `text` or `plaintext` for directory listings.

📝 Suggested fix

````diff
-```
+```text
 resnet50_results/
 ├── optimized_final.onnx          # Optimized model

-```
+```text
 <output_dir>/
 ├── optimized_final.onnx          # Incrementally validated model (if --incremental-validation)
````

Also applies to: 166-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/README.md` around lines 152 - 162, Update the fenced code blocks that show directory listings so they include a language specifier (use ```text) — specifically change the blocks containing the "resnet50_results/" tree and the block containing the "<output_dir>/" tree to start with ```text instead of ``` so the directory listings render consistently; locate the blocks by the directory headers "resnet50_results/" and "<output_dir>/" in the README and replace their opening fences accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/reference/2_qdq_placement.rst`:
- Line 828: The docs claim "Default threshold: 1.01" but the actual default is
defined as performance_threshold: float = 1.02 on the Config class in common.py;
update the documentation string in 2_qdq_placement.rst to state "Default
threshold: 1.02 (2% improvement minimum)" so the docs match the implementation
(or alternatively change Config.performance_threshold to 1.01 if you intend the
docs to be canonical).
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 1-10: Add the required NVIDIA Apache 2.0 license header to the top
of the examples/qdq_placement/set_batch_size.py script (above or immediately
after the existing shebang and before the module docstring) so the file includes
the full license boilerplate mandated for examples/**/*.py; ensure the header
explicitly names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the
rest of the script (including the module docstring and usage comments)
unchanged.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of the module modelopt/onnx/quantization/autotune/fusion_grouping.py by
inserting the standard multi-line NVIDIA Apache 2.0 header (including copyright
line, SPDX identifier and license text reference) as used across other files in
modelopt; ensure the header appears before the existing module docstring so
classes/functions in this file (e.g., the fusion grouping logic in
fusion_grouping.py) are properly licensed.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 1-6: Add the NVIDIA Apache 2.0 license header to the top of the
module file modelopt/onnx/quantization/autotune/subgraph_extractor.py by
inserting the standard multi-line license comment before the existing module
docstring; ensure the header includes the SPDX identifier and full NVIDIA Apache
2.0 text (or the project-standard short header) so the file begins with the
required license block followed by the current docstring and existing code in
subgraph_extractor.py.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 1-8: This file is missing the NVIDIA Apache 2.0 license header;
prepend the standard NVIDIA Apache-2.0 license block to the very top of
subgraph_workflow.py (above the existing module docstring) including the
copyright line and license text/notice, ensuring the SPDX identifier
(Apache-2.0) or full license header is present and that the existing file
docstring remains unchanged below the header.
- Around line 99-105: The loops that mutate the protobuf repeated field use
nonstandard pop()/add() calls on dim_proto; replace popping with explicit
deletion (e.g., use del dim_proto[-1] in the shrink loop) and call
dim_proto.add() when growing but capture the returned element if you need to
initialize it (e.g., new_dim = dim_proto.add()); keep the subsequent loop that
sets dim_proto[i].ClearField("dim_param") and dim_proto[i].dim_value = d, but
ensure dim_proto has been resized using del and dim_proto.add() rather than
pop()/add() without using the returned message.
- Around line 39-42: QUANT_DTYPES currently sets "fp8" to np.int8 when the FP8
dtype isn't present, causing silent fallback; change this by removing the silent
fallback from the module-level QUANT_DTYPES and instead detect/import ml_dtypes
(or np.float8 variant) conditionally at runtime where quantization is requested
(e.g., inside the function(s) that handle quantization/autotune and any call
sites referencing QUANT_DTYPES), and when a user requests "fp8" but the FP8
dtype isn't available emit a clear user-facing warning or raise a descriptive
error; specifically, update references to QUANT_DTYPES to validate availability
at usage time, attempt a conditional import of ml_dtypes (or check hasattr(np,
"float8_e4m3fn")) there, and log/warn if FP8 cannot be supported rather than
silently substituting int8.
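A sketch of the runtime resolution this item asks for (hypothetical helper; the PR's actual fallback logic lives in subgraph_workflow.py and may differ):

```python
import numpy as np

def resolve_fp8_dtype():
    """Resolve an FP8 dtype at the point of use instead of import time.

    Raises a descriptive error instead of silently falling back to int8.
    """
    try:
        import ml_dtypes  # optional dependency providing float8 numpy types
        return np.dtype(ml_dtypes.float8_e4m3fn)
    except ImportError:
        raise RuntimeError(
            "quant_type='fp8' requested but no FP8 dtype is available; "
            "install ml_dtypes or use quant_type='int8'"
        )
```

Because the lookup happens inside the FP8 call path, INT8-only sessions never trigger the import or the warning.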
In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 1-41: The license header at the top of the module (the
module-level docstring and SPDX header in tensorrt_utils.py) uses an unusual
year range "1993-2025"; update the copyright/SPDX header to the project's
standard single-year format (e.g., "Copyright (c) 2026 NVIDIA CORPORATION &
AFFILIATES") and adjust the SPDX-License-Identifier line if needed so the header
matches other files; edit the top-of-file docstring/SPDX block where the current
year range appears to replace it with the standard format.
- Around line 987-1000: Fix _save_timing_cache: the condition and method call
are wrong and the finally cleans up an undefined name. Change the inverted check
so you only call combine when self._timing_cache is NOT None, correct the typo
`combline` to `combine` for the timing cache object returned by
config.create_timing_cache, and in the finally block remove the undefined
`builder` deletion (either delete only `config` or explicitly reference
`self.builder` if you intended to delete it); ensure you still serialize and
write timing_cache_data to self.timing_cache_file when a timing cache exists.
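The general shape of this fix — bind the name before `try` so `finally` can test it safely — applies to any resource cleanup; a minimal sketch with hypothetical names (not the PR's actual `_save_timing_cache`):

```python
def save_with_cleanup(make_resource, save):
    resource = None  # defined even if creation fails below
    try:
        resource = make_resource()
        save(resource)
        return True
    except Exception:
        return False
    finally:
        if resource is not None:  # clean up only what was actually created
            del resource

def failing_factory():
    raise RuntimeError("create_builder_config failed")

# Creation failure no longer raises UnboundLocalError in the finally block:
print(save_with_cleanup(failing_factory, lambda r: None))   # False
print(save_with_cleanup(lambda: object(), lambda r: None))  # True
```

Without the `resource = None` pre-binding, the `finally` block would mask the original exception with an `UnboundLocalError`.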
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 311-318: The _build_id_to_region_map function currently uses a
mutable default dict for id_to_region_map which can be shared across calls;
change the signature to default id_to_region_map to None and inside
_build_id_to_region_map initialize id_to_region_map = {} when it's None, then
proceed to assign id_to_region_map[region.id] = region and recursively call
self._build_id_to_region_map(child, id_to_region_map) so each top-level call
gets a fresh map; keep the return type dict[int, Region] and behavior otherwise
unchanged.
- Around line 16-24: The module docstring in torch_region_builder.py duplicates
the license/copyright header (the "SPDX-FileCopyrightText" lines and
years)—remove the duplicate license block from the top of the module docstring
so only the canonical file header remains; keep the descriptive docstring text
("Torch Region Builder - Hierarchical Region Discovery...") and ensure
functions/classes in this module (e.g., any top-level descriptions used by
torch_region_builder.py) are unaffected by the removal.
- Around line 752-754: The parameter only_quantizable is being ignored because
the code unconditionally sets only_quantizable = True; fix by either removing
that assignment so the function respects the incoming only_quantizable
parameter, or remove the only_quantizable parameter from the function signature
if it's not needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
- Around line 320-331: The _build_tensor_to_regions_map function uses a mutable
default argument (dict = {}), which can be shared across calls; change the
signature to accept tensor_to_regions_map: Optional[dict[str, set[int]]] = None
and inside the function initialize tensor_to_regions_map = {} if None, mirroring
the fix used in _build_id_to_region_map; keep the recursive behavior and return
the map as before so each top-level call gets a fresh dict.
- Line 333: The _merge_neighboring_regions function currently uses a mutable
default argument for to_remove (set()); change the signature to accept None
(e.g., to_remove: Optional[set[int]] = None or to_remove: set[int] | None =
None) and inside the body initialize it to an empty set when None (e.g., if
to_remove is None: to_remove = set()), updating any type imports as needed and
leaving the rest of the logic in _merge_neighboring_regions unchanged.
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 71-82: The test test_performance_threshold_validation is
incomplete: it only asserts valid values for Config.performance_threshold but
doesn't check that invalid values are rejected; either add validation in Config
or update the test to assert the expected failure. If Config should validate,
implement a check in Config.__init__ or the performance_threshold setter to
raise ValueError for values < 1.0 and then update
test_performance_threshold_validation to include a with
self.assertRaises(ValueError): Config(performance_threshold=0.9) case; otherwise
remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of tests/unit/onnx/quantization/autotune/test_config.py: insert the standard
NVIDIA Apache-2.0 license block (including copyright notice/SPDX identifier and
license text) immediately after the existing shebang (#!/usr/bin/env python3)
and before the module docstring so the file complies with the repository's
license/header convention.
---
Outside diff comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Line 34: Remove the inline "# nosec B404" on the subprocess import (import
subprocess) unless you add an explicit justification and approval from
`@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is
required, keep the comment but add the justification text to the PR and ensure
all uses of subprocess in this module (functions/methods that call
subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a
list, avoid shell=True) and reference that approval in the PR description.
---
Nitpick comments:
In `@examples/qdq_placement/README.md`:
- Around line 152-162: Update the fenced code blocks that show directory
listings so they include a language specifier (use ```text) — specifically
change the blocks containing the "resnet50_results/" tree and the block
containing the "<output_dir>/" tree to start with ```text instead of ``` so the
directory listings render consistently; locate the blocks by the directory
headers "resnet50_results/" and "<output_dir>/" in the README and replace their
opening fences accordingly.
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 62-65: Replace the direct call to
onnx.checker.check_model(output_path) with the project's robust checker that
handles >2GB models by importing and calling modelopt.onnx.utils.check_model;
locate the verification block around the print("Verifying model...") and change
the call to use modelopt.onnx.utils.check_model(output_path) (or import the
function as check_model and call check_model(output_path)) so external-data
large protobuf models are supported.
In `@modelopt/onnx/quantization/autotune/qdq_utils.py`:
- Line 60: The condition checking node inputs is redundant: replace the `if
node.input and len(node.input) > 0:` guard with a simpler truthy check `if
node.input:` in the function where this appears (look for the code handling
`node.input` in qdq_utils.py) so the branch behavior remains identical but the
expression is simplified.
- Line 1: Update the SPDX header year in
modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR
(change "2024" to "2026"); locate the SPDX comment at the top of the file and
modify the year in the copyright line so it is consistent across files.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 170-181: The BFS loop in subgraph_extractor.py uses a Python list
and pop(0), causing O(n) dequeuing; replace the list-based queue with
collections.deque: import deque, initialize queue = deque(...) where the queue
is created, and change queue.pop(0) to queue.popleft(), keeping
queue.append(...) for enqueueing; update any code that constructs the initial
queue and ensure imports include collections.deque so the BFS in the function
that contains this loop uses O(1) dequeue operations.
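The complexity point in this item can be seen in a toy backward BFS (names and the adjacency encoding are illustrative, not the extractor's real API):

```python
from collections import deque

def reachable_inputs(producer_inputs, seeds, graph_inputs):
    """Walk backward from seed tensors to the graph inputs they depend on.
    producer_inputs maps a tensor name to the input tensors of the node
    that produces it."""
    visited = set(seeds)
    queue = deque(seeds)      # deque: popleft() is O(1); list.pop(0) is O(n)
    found = []
    while queue:
        tensor = queue.popleft()
        if tensor in graph_inputs:
            found.append(tensor)
        for upstream in producer_inputs.get(tensor, ()):
            if upstream not in visited:
                visited.add(upstream)
                queue.append(upstream)
    return found

producer_inputs = {"out": ["mid"], "mid": ["x", "w"]}
print(reachable_inputs(producer_inputs, ["out"], {"x"}))  # ['x']
```

On a graph with n tensors, `list.pop(0)` makes the traversal O(n²) in the worst case; `popleft()` keeps it linear.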
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 12-13: Remove the ad-hoc sys.path manipulation in the test file:
delete the sys.path.insert(0,
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in
tests/unit/onnx/quantization/autotune/test_config.py and rely on the package
being installed in the test environment (or use test runner configuration /
PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in
this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 48cf674c-cea7-4964-b1ed-c9c27e9d6e6e
📥 Commits
Reviewing files that changed from the base of the PR and between bc87981 and 48eed5d778e4899ba16d8db76a23b22309416fc7.
📒 Files selected for processing (15)
docs/source/guides/9_qdq_placement.rst
docs/source/reference/2_qdq_placement.rst
examples/qdq_placement/README.md
examples/qdq_placement/set_batch_size.py
modelopt/onnx/quantization/autotune/__init__.py
modelopt/onnx/quantization/autotune/__main__.py
modelopt/onnx/quantization/autotune/benchmark.py
modelopt/onnx/quantization/autotune/fusion_grouping.py
modelopt/onnx/quantization/autotune/qdq_utils.py
modelopt/onnx/quantization/autotune/subgraph_extractor.py
modelopt/onnx/quantization/autotune/subgraph_workflow.py
modelopt/onnx/quantization/autotune/tensorrt_utils.py
modelopt/onnx/quantization/autotune/torch_region_builder.py
modelopt/onnx/quantization/autotune/workflows.py
tests/unit/onnx/quantization/autotune/test_config.py
```python
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should not be accepted
    # Note: This test assumes validation exists, if not we should add it
    print("✓ Config performance_threshold validation")
```
Incomplete validation test - invalid values not actually tested.
The test `test_performance_threshold_validation` only tests valid values, but the comment on lines 80-81 indicates invalid values should be rejected. Either implement the validation in `Config` and test it, or remove the misleading comment.
Option 1: If validation exists, add the test

```python
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should raise ValueError
    with self.assertRaises(ValueError):
        Config(performance_threshold=0.9)

    print("✓ Config performance_threshold validation")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 71 - 82,
The test test_performance_threshold_validation is incomplete: it only asserts
valid values for Config.performance_threshold but doesn't check that invalid
values are rejected; either add validation in Config or update the test to
assert the expected failure. If Config should validate, implement a check in
Config.__init__ or the performance_threshold setter to raise ValueError for
values < 1.0 and then update test_performance_threshold_validation to include a
with self.assertRaises(ValueError): Config(performance_threshold=0.9) case;
otherwise remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
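If validation is added, a minimal sketch of what the check might look like (hypothetical; the real `Config` lives in `modelopt.onnx.quantization.autotune.common` and may be structured differently):

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Documented default: require at least a 2% speedup to accept a scheme.
    performance_threshold: float = 1.02

    def __post_init__(self):
        if self.performance_threshold < 1.0:
            raise ValueError(
                f"performance_threshold must be >= 1.0, "
                f"got {self.performance_threshold}"
            )

Config(performance_threshold=1.5)  # accepted
# Config(performance_threshold=0.9) would raise ValueError
```

With this in place, the `assertRaises(ValueError)` case the review suggests becomes meaningful rather than a comment.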
@Hale423 please find my comments below:
Thanks
Force-pushed from 48eed5d to cf36938 (Compare)
Actionable comments posted: 14
♻️ Duplicate comments (3)
tests/unit/onnx/quantization/autotune/test_config.py (1)
75-87: ⚠️ Potential issue | 🟡 Minor

“Validation” tests currently don’t validate rejection paths.
These tests assert valid values only, but they don’t check that invalid inputs are rejected. Either add failure assertions or rename tests to avoid implying validation behavior that isn’t tested.
Suggested assertion additions
```diff
 def test_performance_threshold_validation(self):
     """Test that performance_threshold must be >= 1.0."""
     config1 = Config(performance_threshold=1.0)
     self.assertEqual(config1.performance_threshold, 1.0)

     config2 = Config(performance_threshold=1.5)
     self.assertEqual(config2.performance_threshold, 1.5)
+
+    with self.assertRaises(ValueError):
+        Config(performance_threshold=0.9)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 75 - 87, The tests test_performance_threshold_validation and test_region_size_validation only assert that valid values are accepted; add negative-path assertions to ensure invalid inputs are rejected by Config: in test_performance_threshold_validation wrap creating Config(performance_threshold=0.9) (and maybe 0.0 or -1.0) in self.assertRaises(ValueError) to assert a rejection, and in test_region_size_validation use self.assertRaises(ValueError) for Config(maximum_sequence_region_size=0) and Config(minimum_topdown_search_size=0) (and optionally negative values) so the class Config enforces >0 for region sizes; keep the original valid-value assertions and use the Config symbol and the test method names to locate where to add the assertRaises blocks.

modelopt/onnx/quantization/autotune/tensorrt_utils.py (1)
987-999: ⚠️ Potential issue | 🟠 Major

Guard cleanup in `_save_timing_cache()`.

`config` is only assigned inside the `if self._timing_cache is not None` block, but the `finally` block always deletes it. If `create_builder_config()` fails before assignment, or this method is ever called with no cache object, the cleanup path raises `UnboundLocalError` and masks the real failure.

🐛 Proposed fix

```diff
 def _save_timing_cache(self):
+    config = None
     try:
         if self._timing_cache is not None:
             config = self.builder.create_builder_config()
             output_cache = config.create_timing_cache(b"")
             output_cache.combine(self._timing_cache, ignore_errors=True)
             timing_cache_data = output_cache.serialize()
             with open(self.timing_cache_file, "wb") as f:
                 f.write(timing_cache_data)
             self.logger.debug(f"Saved timing cache to: {self.timing_cache_file}")
     except Exception as e:
         self.logger.warning(f"Failed to save timing cache: {e}")
     finally:
-        del config
+        if config is not None:
+            del config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py` around lines 987 - 999, The finally block in _save_timing_cache references config even when it may never be assigned (e.g., when self._timing_cache is None or create_builder_config() raises), causing UnboundLocalError; fix by initializing config = None before the try and only delete or cleanup config if it is not None (or use a conditional cleanup) after the try/except so that config is only referenced when created; update references to config creation in _save_timing_cache and ensure create_builder_config() and output_cache are guarded similarly to avoid the same issue.

modelopt/onnx/quantization/autotune/subgraph_workflow.py (1)
54-71: ⚠️ Potential issue | 🔴 Critical

`quant_type="fp8"` still silently degrades to INT8.

`QUANT_DTYPES["fp8"]` is resolved at import time and `_get_fp8_dtype()` returns `np.int8` on `ImportError`, so every INT8-only session logs an FP8 warning at import and every FP8 request quantizes as INT8. Resolve and validate the FP8 dtype inside the FP8 call path instead of freezing the fallback in `QUANT_DTYPES`.

Also applies to: 381-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 54 - 71, QUANT_DTYPES currently calls _get_fp8_dtype() at import time causing FP8 to silently fallback to int8 and emit warnings for all runs; change QUANT_DTYPES to store a lazy indicator (e.g., map "fp8" to a sentinel or the function _get_fp8_dtype itself) instead of the resolved dtype, and update the FP8 quantization call path (where "fp8" is looked up) to call _get_fp8_dtype() at runtime, validate the returned dtype is FP8-compatible, and only then proceed or emit the fallback warning and convert to int8; update usages that reference QUANT_DTYPES["fp8"] to resolve via the new lazy mechanism.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/qdq_placement/README.md`:
- Around line 5-8: The README contains hardcoded branch names
"dev-wahao-autotune-subgraph-profile" and
"dev-willg-add-auto-qdq-placement-tool"; remove these branch-specific strings
and replace them with generic, version-agnostic wording (e.g., "current branch"
or describe the feature differences without naming branches) so the permanent
example docs don't reference ephemeral branch names; update the heading and the
sentence that lists differences to use neutral language describing the added
autotuner/subgraph workflow instead of the explicit branch identifiers.
- Around line 112-126: The README currently always passes --graph-json
"$GRAPH_JSON" which contradicts "optional --graph-json"; update the example to
conditionally include the flag when GRAPH_JSON is non-empty (leave GRAPH_JSON
unset or empty to trigger auto-generation). Specifically, change the command
invocation shown for python3 -m modelopt.onnx.quantization.autotune so it only
appends the --graph-json argument if GRAPH_JSON is set (or show the alternate
bracketed pattern like [--graph-json "$GRAPH_JSON"]) and document that leaving
GRAPH_JSON empty omits the flag; refer to the MODEL, GRAPH_JSON, OUTPUT_DIR
variables and the --graph-json flag to locate the snippet to edit.
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 307-326: Only set self._profile_unsupported when the failure is
actually caused by profiling flags: after running _exec_and_log and seeing
"Unknown option" in result.stderr, check that export_profile_path is truthy
(meaning profiling was requested) AND that the stderr or the original cmd
contains one of the profiling-specific options ("--separateProfileRun",
"--profilingVerbosity", "--exportProfile"); only then log the warning, set
self._profile_unsupported = True, strip those profiling flags from cmd and retry
via _exec_and_log. This ensures unrelated "Unknown option" errors (e.g., from
extra_run_args) don't permanently disable profiling.
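A sketch of that gating, using the flag names listed in the prompt (the surrounding trtexec invocation and retry machinery are assumed, not shown):

```python
PROFILING_FLAGS = ("--separateProfileRun", "--profilingVerbosity", "--exportProfile")

def should_retry_without_profiling(stderr: str, cmd: list[str], export_profile_path) -> bool:
    """Retry only when the 'Unknown option' complaint plausibly targets profiling."""
    if "Unknown option" not in stderr:
        return False
    if not export_profile_path:
        return False  # profiling was never requested, so it cannot be the culprit
    haystack = stderr + " " + " ".join(cmd)
    return any(flag in haystack for flag in PROFILING_FLAGS)

def strip_profiling_flags(cmd: list[str]) -> list[str]:
    """Drop profiling flags, including their --flag=value forms."""
    return [c for c in cmd if c.split("=")[0] not in PROFILING_FLAGS]
```

With this check in place, an unrelated `Unknown option` error coming from `extra_run_args` no longer disables profiling permanently.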
- Around line 296-305: The strip_shape_args branch only filters
--optShapes/--minShapes/--maxShapes but misses other shape flags like --shapes
or calibration shape flags; update the filter inside the strip_shape_args block
(where cmd is rebuilt) to drop any argument that refers to shapes (e.g., any c
where "shapes" is in c.lower() or
c.startswith("--shape")/c.startswith("--shapes")/c.startswith("--calibration")
), so the subgraph run command no longer carries original-model shape bindings
before extra_run_args (refer to strip_shape_args, cmd, extra_run_args, and
model_path to locate the change).
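The broader filter could look like this; it is a sketch of the predicate the prompt describes, and makes no claim about the exact flag set trtexec accepts:

```python
def strip_shape_related_args(cmd: list[str]) -> list[str]:
    """Drop any argument that binds shapes, not just --optShapes/--minShapes/--maxShapes."""
    def is_shape_arg(arg: str) -> bool:
        key = arg.split("=")[0].lower()
        return "shapes" in key or key.startswith("--shape") or key.startswith("--calibration")
    return [c for c in cmd if not is_shape_arg(c)]
```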
- Around line 236-239: Remove the inline Bandit suppression by deleting the "#
nosec B603" comment on the subprocess.run call inside the _exec_and_log method;
leave the call as subprocess.run(cmd, capture_output=True, text=True) (ensure it
remains a list-based invocation, not shell=True), and if Bandit flags this as a
false positive prepare a documented justification and request formal review from
`@NVIDIA/modelopt-setup-codeowners` per the security-review process instead of
using an in-code bypass.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 267-273: Check that the generated graph JSON file actually exists
before returning: after verifying result.returncode in the trtexec invocation
(the block referencing result and logger) add a Path(graph_json_path).exists()
check and if missing log an error via logger.error (include graph_json_path and
result.returncode/context) and raise a RuntimeError (similar to the existing
trtexec FP16 build failure) so callers won’t get a false success from the
function that returns graph_json_path.
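A sketch of that post-build guard (the logger name and message wording are placeholders, not the module's actual strings):

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def ensure_graph_json(graph_json_path: str, returncode: int) -> str:
    """Raise instead of returning a path to a file trtexec never wrote."""
    if not Path(graph_json_path).exists():
        logger.error(
            "trtexec exited with code %d but %s was not created",
            returncode, graph_json_path,
        )
        raise RuntimeError(f"graph JSON not generated: {graph_json_path}")
    return graph_json_path
```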
In `@modelopt/onnx/quantization/autotune/qdq_utils.py`:
- Around line 49-51: The doc example uses an undefined variable new_model when
constructing QDQAutotuner; update the example to pass the initialized model (the
variable named model) or explicitly define new_model before use so
QDQAutotuner(model) is called correctly; locate the example lines with
QDQAutotuner and autotuner.initialize() and replace new_model with model (or add
a definition for new_model) to ensure the example is valid.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 660-667: The resume logic currently trusts any loaded cache;
update the code around _load_cache(cache_path) so after loading you validate
cached metadata (at minimum cache.get("version") == CACHE_VERSION and
cache.get("model_path") == model_path) and also compare tuning inputs used by
this run (e.g. quant_type, graph_json_path, extra_trtexec_args or any other
run-specific keys you expect) against values stored in the cache; if any
mismatch is detected, invalidate the cache by replacing it with a fresh dict
containing the current metadata (include CACHE_VERSION, model_path and the
current tuning inputs) so that cached_p2 / cached_p3 are empty and stale
phase2/phase3 decisions are not reused (refer to _load_cache, cache_path,
CACHE_VERSION, model_path, quant_type, graph_json_path, extra_trtexec_args, and
the "phase2"/"phase3" keys to locate where to add this validation).
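One way to shape that validation — `CACHE_VERSION` and the key names here are assumptions mirroring the prompt, not the module's actual values:

```python
CACHE_VERSION = 1  # assumed; the real constant lives in subgraph_workflow.py

def validate_or_reset_cache(cache, model_path: str, run_inputs: dict) -> dict:
    """Return the loaded cache only if its metadata matches this run."""
    fresh = {
        "version": CACHE_VERSION,
        "model_path": model_path,
        **run_inputs,
        "phase2": {},
        "phase3": {},
    }
    if not isinstance(cache, dict):
        return fresh
    if cache.get("version") != CACHE_VERSION or cache.get("model_path") != model_path:
        return fresh
    if any(cache.get(k) != v for k, v in run_inputs.items()):
        return fresh  # tuning inputs changed; stale phase2/phase3 must not be reused
    return cache
```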
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 773-783: The code currently mutates children but always returns
the original regions list, so include_all_regions=False has no effect; update
the loop that iterates over regions to build a new filtered list (e.g.,
filtered_regions = []) and for each region: perform the existing child-pruning
using region.get_children() and region.remove_child(child), then if
include_all_regions is True or has_quantizable_operations(region, graph) is
True, append the region to filtered_regions; finally return filtered_regions
instead of regions; apply the same change to the analogous block that handles
regions later in the file (the block that currently mirrors lines ~791-817) so
both code paths honor include_all_regions.
- Around line 624-629: The torch_node_ratio method can divide by zero when there
are no non-constant nodes; update torch_node_ratio to guard against empty
non_constant_nodes by checking if non_constant_nodes is empty (len == 0) and
returning 0.0 (or another safe default) instead of performing the division; keep
the existing logic that builds non_constant_nodes from self.graph.nodes
(filtering on n.op != "Constant") and the slash_count calculation (checking
n.name and n.name.startswith("/")) and only compute slash_count /
len(non_constant_nodes) when the length is > 0.
- Around line 549-551: The loop in _probe_epilogues_recursive unconditionally
appends consumer_idx to epilogue_ops, which allows non-fusible or divergent
consumers into regions; change it to first verify the consumer meets the
method's fusibility/divergence predicates (the same checks used elsewhere in
this module—e.g., the file's existing fusibility/divergence helper(s) such as
the region/component fusibility check and divergence test) and only append and
recurse when those checks pass (i.e., if consumer is fusible with the current
region and not divergent), otherwise skip that consumer.
In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 69-75: The call to _benchmark_instance.run is forwarding
trtexec-specific kwargs (strip_shape_args, extra_run_args, export_profile_path)
to all backends causing a TypeError when _benchmark_instance is a
TensorRTPyBenchmark; update the call to only include those kwargs when the
backend is TrtExecBenchmark (or otherwise supports them) by either checking
isinstance(_benchmark_instance, TrtExecBenchmark) or testing for
attribute/method support before adding
strip_shape_args/extra_run_args/export_profile_path to the kwargs dict, then
call _benchmark_instance.run with the filtered kwargs so non-trt backends
receive only supported parameters.
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 19-112: Replace the unittest-based TestConfig class/tests with
pytest-style tests: convert each test_* method (test_default_values,
test_custom_values, test_performance_threshold_validation,
test_region_size_validation, test_genetic_algorithm_params,
test_pattern_cache_params) into standalone functions that instantiate Config and
use plain assert statements (e.g., assert config.verbose is False) instead of
self.assertEqual/self.assertGreater, remove the unittest.TestCase inheritance
and the if __name__ == "__main__": unittest.main() block, and keep references to
the Config symbol so the same behavior is tested under pytest.
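The conversion looks roughly like this; `Config` here is a minimal dataclass stand-in (with only a few of the fields tested above) so the sketch is self-contained — the real class lives in `modelopt.onnx.quantization.autotune.common`:

```python
from dataclasses import dataclass

@dataclass
class Config:  # minimal stand-in for modelopt's Config, for illustration only
    verbose: bool = False
    performance_threshold: float = 1.02
    default_quant_type: str = "int8"

def test_default_values():
    """pytest style: plain functions and bare asserts, no TestCase scaffolding."""
    config = Config()
    assert config.verbose is False
    assert config.performance_threshold == 1.02
    assert config.default_quant_type == "int8"

def test_custom_values():
    config = Config(verbose=True, performance_threshold=1.05)
    assert config.verbose is True
    assert config.performance_threshold == 1.05
```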
---
Duplicate comments:
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 54-71: QUANT_DTYPES currently calls _get_fp8_dtype() at import
time causing FP8 to silently fallback to int8 and emit warnings for all runs;
change QUANT_DTYPES to store a lazy indicator (e.g., map "fp8" to a sentinel or
the function _get_fp8_dtype itself) instead of the resolved dtype, and update
the FP8 quantization call path (where "fp8" is looked up) to call
_get_fp8_dtype() at runtime, validate the returned dtype is FP8-compatible, and
only then proceed or emit the fallback warning and convert to int8; update
usages that reference QUANT_DTYPES["fp8"] to resolve via the new lazy mechanism.
In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 987-999: The finally block in _save_timing_cache references config
even when it may never be assigned (e.g., when self._timing_cache is None or
create_builder_config() raises), causing UnboundLocalError; fix by initializing
config = None before the try and only delete or cleanup config if it is not None
(or use a conditional cleanup) after the try/except so that config is only
referenced when created; update references to config creation in
_save_timing_cache and ensure create_builder_config() and output_cache are
guarded similarly to avoid the same issue.
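The guarded-cleanup shape this prompt asks for can be sketched as below; the builder/cache calls mimic TensorRT's Python API (`create_builder_config`, `set_timing_cache`, `serialize`), which is an assumption here rather than the module's exact code:

```python
def save_timing_cache_safely(builder, timing_cache, output_path: str) -> bool:
    """Only reference config in cleanup if it was actually created."""
    config = None
    try:
        if timing_cache is None:
            return False  # early exit: config is never created
        config = builder.create_builder_config()
        config.set_timing_cache(timing_cache, ignore_mismatch=False)
        with open(output_path, "wb") as f:
            f.write(timing_cache.serialize())
        return True
    finally:
        if config is not None:  # guard: avoids UnboundLocalError on early exits
            del config
```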
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 75-87: The tests test_performance_threshold_validation and
test_region_size_validation only assert that valid values are accepted; add
negative-path assertions to ensure invalid inputs are rejected by Config: in
test_performance_threshold_validation wrap creating
Config(performance_threshold=0.9) (and maybe 0.0 or -1.0) in
self.assertRaises(ValueError) to assert a rejection, and in
test_region_size_validation use self.assertRaises(ValueError) for
Config(maximum_sequence_region_size=0) and Config(minimum_topdown_search_size=0)
(and optionally negative values) so the class Config enforces >0 for region
sizes; keep the original valid-value assertions and use the Config symbol and
the test method names to locate where to add the assertRaises blocks.
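The negative-path checks could take roughly this shape. The validator below is a stand-in for `Config.__init__` (which, per this comment, should enforce these bounds), and the tiny raise-checking helper keeps the sketch free of any test-framework dependency:

```python
def make_config(performance_threshold=1.02, maximum_sequence_region_size=10):
    """Stand-in for Config construction with the validation this comment requests."""
    if performance_threshold < 1.0:
        raise ValueError("performance_threshold must be >= 1.0")
    if maximum_sequence_region_size <= 0:
        raise ValueError("maximum_sequence_region_size must be > 0")
    return {"performance_threshold": performance_threshold,
            "maximum_sequence_region_size": maximum_sequence_region_size}

def assert_raises_value_error(fn, **kwargs):
    """Minimal equivalent of assertRaises(ValueError) / pytest.raises(ValueError)."""
    try:
        fn(**kwargs)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Keeping the original valid-value assertions alongside these rejections gives the validation tests actual teeth.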
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 46f738dd-cec0-48e7-b297-30a197b0f4fd
📥 Commits
Reviewing files that changed from the base of the PR and between 48eed5d778e4899ba16d8db76a23b22309416fc7 and cf3693855ff344d3ef389672629d38947f788c00.
📒 Files selected for processing (15)
- docs/source/guides/9_qdq_placement.rst
- docs/source/reference/2_qdq_placement.rst
- examples/qdq_placement/README.md
- examples/qdq_placement/set_batch_size.py
- modelopt/onnx/quantization/autotune/__init__.py
- modelopt/onnx/quantization/autotune/__main__.py
- modelopt/onnx/quantization/autotune/benchmark.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py
- modelopt/onnx/quantization/autotune/qdq_utils.py
- modelopt/onnx/quantization/autotune/subgraph_extractor.py
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
- modelopt/onnx/quantization/autotune/tensorrt_utils.py
- modelopt/onnx/quantization/autotune/torch_region_builder.py
- modelopt/onnx/quantization/autotune/workflows.py
- tests/unit/onnx/quantization/autotune/test_config.py
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/source/guides/9_qdq_placement.rst
```markdown
## Branch: Subgraph Mode (This Branch)

On branch **dev-wahao-autotune-subgraph-profile**, the autotuner adds a second workflow and related features not present on **dev-willg-add-auto-qdq-placement-tool**:
```
Remove branch-specific naming from permanent example docs.
Hardcoding dev-wahao-autotune-subgraph-profile / dev-willg-add-auto-qdq-placement-tool will age quickly and confuse users reading this from main.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qdq_placement/README.md` around lines 5 - 8, The README contains
hardcoded branch names "dev-wahao-autotune-subgraph-profile" and
"dev-willg-add-auto-qdq-placement-tool"; remove these branch-specific strings
and replace them with generic, version-agnostic wording (e.g., "current branch"
or describe the feature differences without naming branches) so the permanent
example docs don't reference ephemeral branch names; update the heading and the
sentence that lists differences to use neutral language describing the added
autotuner/subgraph workflow instead of the explicit branch identifiers.
```bash
# Set output dir (e.g. <model_prefix>_autotune_subgraph) and optional graph.json
MODEL="path/to/your_model.fp16.opt.onnx"
GRAPH_JSON="path/to/your_model.fp16.opt.graph.json"  # optional; omit to auto-generate
OUTPUT_DIR="${MODEL%.onnx}_autotune_subgraph"
mkdir -p "$OUTPUT_DIR"

# Optional: optShapes for dynamic inputs (comma-separated name:shape)
OPT_SHAPES='--optShapes=agents_feature:21x33x10x16,agents_mask:21x33x10'

python3 -m modelopt.onnx.quantization.autotune \
    --model "$MODEL" \
    --output "$OUTPUT_DIR" \
    --mode subgraph \
    --graph-json "$GRAPH_JSON" \
    --quant-type fp8 \
```
The sample contradicts “optional --graph-json” behavior.
The text says graph.json is optional, but the command always passes `--graph-json "$GRAPH_JSON"`. This makes copy/paste usage brittle when users want auto-generation.
Suggested command pattern

```diff
-MODEL="path/to/your_model.fp16.opt.onnx"
-GRAPH_JSON="path/to/your_model.fp16.opt.graph.json"  # optional; omit to auto-generate
+MODEL="path/to/your_model.fp16.opt.onnx"
+GRAPH_JSON=""  # optional; set to existing graph.json to reuse
 OUTPUT_DIR="${MODEL%.onnx}_autotune_subgraph"
 mkdir -p "$OUTPUT_DIR"
+
+GRAPH_JSON_ARGS=()
+if [ -n "$GRAPH_JSON" ]; then
+  GRAPH_JSON_ARGS=(--graph-json "$GRAPH_JSON")
+fi
 python3 -m modelopt.onnx.quantization.autotune \
   --model "$MODEL" \
   --output "$OUTPUT_DIR" \
   --mode subgraph \
-  --graph-json "$GRAPH_JSON" \
+  "${GRAPH_JSON_ARGS[@]}" \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
# Set output dir (e.g. <model_prefix>_autotune_subgraph) and optional graph.json
MODEL="path/to/your_model.fp16.opt.onnx"
GRAPH_JSON=""  # optional; set to existing graph.json to reuse
OUTPUT_DIR="${MODEL%.onnx}_autotune_subgraph"
mkdir -p "$OUTPUT_DIR"

GRAPH_JSON_ARGS=()
if [ -n "$GRAPH_JSON" ]; then
  GRAPH_JSON_ARGS=(--graph-json "$GRAPH_JSON")
fi

# Optional: optShapes for dynamic inputs (comma-separated name:shape)
OPT_SHAPES='--optShapes=agents_feature:21x33x10x16,agents_mask:21x33x10'

python3 -m modelopt.onnx.quantization.autotune \
    --model "$MODEL" \
    --output "$OUTPUT_DIR" \
    --mode subgraph \
    "${GRAPH_JSON_ARGS[@]}" \
    --quant-type fp8 \
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qdq_placement/README.md` around lines 112 - 126, The README
currently always passes --graph-json "$GRAPH_JSON" which contradicts "optional
--graph-json"; update the example to conditionally include the flag when
GRAPH_JSON is non-empty (leave GRAPH_JSON unset or empty to trigger
auto-generation). Specifically, change the command invocation shown for python3
-m modelopt.onnx.quantization.autotune so it only appends the --graph-json
argument if GRAPH_JSON is set (or show the alternate bracketed pattern like
[--graph-json "$GRAPH_JSON"]) and document that leaving GRAPH_JSON empty omits
the flag; refer to the MODEL, GRAPH_JSON, OUTPUT_DIR variables and the
--graph-json flag to locate the snippet to edit.
```python
# Get the input tensor
graph = model.graph
input_tensor = graph.input[0]

print(f"Original input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}")

# Modify the batch dimension (first dimension)
if len(input_tensor.type.tensor_type.shape.dim) > 0:
    input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
    # Clear any symbolic dimension parameter
    input_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')

# Also update output shapes if needed
for output_tensor in graph.output:
    if len(output_tensor.type.tensor_type.shape.dim) > 0:
        output_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
        output_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')
```
Do not blindly rewrite output dim[0], and avoid single-input assumptions.
Current logic updates only the first model input but forces dim[0] on all outputs. That can silently produce invalid shapes for multi-input models or models where output dim[0] is not batch.
Suggested robustness fix

```diff
-    # Get the input tensor
-    graph = model.graph
-    input_tensor = graph.input[0]
+    graph = model.graph
+    if batch_size <= 0:
+        raise ValueError("batch_size must be a positive integer")
+    if not graph.input:
+        raise ValueError("Model has no graph inputs to rewrite")
+    input_tensors = list(graph.input)

-    print(f"Original input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}")
+    print("Original input shapes:")
+    for input_tensor in input_tensors:
+        print(f"  {input_tensor.name}: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}")

-    # Modify the batch dimension (first dimension)
-    if len(input_tensor.type.tensor_type.shape.dim) > 0:
-        input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
-        # Clear any symbolic dimension parameter
-        input_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')
-
-    # Also update output shapes if needed
-    for output_tensor in graph.output:
-        if len(output_tensor.type.tensor_type.shape.dim) > 0:
-            output_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
-            output_tensor.type.tensor_type.shape.dim[0].ClearField('dim_param')
+    # Modify batch dimension on model inputs only
+    for input_tensor in input_tensors:
+        if len(input_tensor.type.tensor_type.shape.dim) > 0:
+            input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
+            input_tensor.type.tensor_type.shape.dim[0].ClearField("dim_param")
```

```python
def _exec_and_log(self, cmd: list[str], log_file: str | None = None):
    """Execute trtexec command and write logs."""
    self.logger.debug(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)  # nosec B603
```
Remove the new # nosec bypass.
Inline Bandit suppressions are disallowed in this repo. This addition needs to be removed or routed through the explicit security-review path instead of suppressing B603 in-code.
As per coding guidelines, "Bandit security checks are enforced via pre-commit hooks. # nosec comments are not allowed as bypasses for security checks; contributors must request formal review from @NVIDIA/modelopt-setup-codeowners with documented justification instead."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 236 - 239,
Remove the inline Bandit suppression by deleting the "# nosec B603" comment on
the subprocess.run call inside the _exec_and_log method; leave the call as
subprocess.run(cmd, capture_output=True, text=True) (ensure it remains a
list-based invocation, not shell=True), and if Bandit flags this as a false
positive prepare a documented justification and request formal review from
`@NVIDIA/modelopt-setup-codeowners` per the security-review process instead of
using an in-code bypass.
```python
for consumer_idx in consumer_indices:
    epilogue_ops.append(consumer_idx)
    self._probe_epilogues_recursive(consumer_idx, current_step + 1, max_steps, epilogue_ops)
```
Epilogue candidates are added without fusibility/divergence gating.
consumer_idx is appended unconditionally, so non-fusible or divergent consumers can be added to regions, which contradicts the method’s own contract and over-expands epilogues.
Proposed fix

```diff
-        for consumer_idx in consumer_indices:
-            epilogue_ops.append(consumer_idx)
-            self._probe_epilogues_recursive(consumer_idx, current_step + 1, max_steps, epilogue_ops)
+        for consumer_idx in consumer_indices:
+            if consumer_idx >= len(self.graph.nodes):
+                continue
+            consumer = self.graph.nodes[consumer_idx]
+            if not self.is_fusible_node(consumer.op):
+                continue
+            if self._is_node_divergent(consumer_idx):
+                continue
+            if consumer_idx in epilogue_ops:
+                continue
+            epilogue_ops.append(consumer_idx)
+            self._probe_epilogues_recursive(consumer_idx, current_step + 1, max_steps, epilogue_ops)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 549
- 551, The loop in _probe_epilogues_recursive unconditionally appends
consumer_idx to epilogue_ops, which allows non-fusible or divergent consumers
into regions; change it to first verify the consumer meets the method's
fusibility/divergence predicates (the same checks used elsewhere in this
module—e.g., the file's existing fusibility/divergence helper(s) such as the
region/component fusibility check and divergence test) and only append and
recurse when those checks pass (i.e., if consumer is fusible with the current
region and not divergent), otherwise skip that consumer.
```python
def torch_node_ratio(self) -> float:
    """Count the number of nodes that are exported from PyTorch."""
    non_constant_nodes = [n for n in self.graph.nodes if n.op != "Constant"]
    slash_count = sum(1 for n in non_constant_nodes if n.name and n.name.startswith("/"))
    return slash_count / len(non_constant_nodes)
```
Guard torch_node_ratio() against empty non-constant graphs.
If all nodes are Constant (or graph has no nodes), this divides by zero and crashes inspection.
Proposed fix

```diff
     def torch_node_ratio(self) -> float:
         """Count the number of nodes that are exported from PyTorch."""
         non_constant_nodes = [n for n in self.graph.nodes if n.op != "Constant"]
+        if not non_constant_nodes:
+            return 0.0
         slash_count = sum(1 for n in non_constant_nodes if n.name and n.name.startswith("/"))
         return slash_count / len(non_constant_nodes)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 624
- 629, The torch_node_ratio method can divide by zero when there are no
non-constant nodes; update torch_node_ratio to guard against empty
non_constant_nodes by checking if non_constant_nodes is empty (len == 0) and
returning 0.0 (or another safe default) instead of performing the division; keep
the existing logic that builds non_constant_nodes from self.graph.nodes
(filtering on n.op != "Constant") and the slash_count calculation (checking
n.name and n.name.startswith("/")) and only compute slash_count /
len(non_constant_nodes) when the length is > 0.
```python
for i, region in enumerate(regions):
    if not include_all_regions:
        children_to_remove = [
            c for c in region.get_children() if not has_quantizable_operations(c, graph)
        ]
        for child in children_to_remove:
            region.remove_child(child)

    if not include_all_regions and not has_quantizable_operations(region, graph):
        logger.debug(f"Filtered out region {i} (no quantizable operations)")
        continue
```
include_all_regions=False does not actually filter the returned regions.
The loop skips logging for non-quantizable regions, but `return regions` still returns the unfiltered list. This makes `include_all_regions` behavior inconsistent with the function docs.
Proposed fix

```diff
-    logger.info("Analyzing region structure")
-    for i, region in enumerate(regions):
+    logger.info("Analyzing region structure")
+    filtered_regions: list[Region] = []
+    for i, region in enumerate(regions):
         if not include_all_regions:
             children_to_remove = [
                 c for c in region.get_children() if not has_quantizable_operations(c, graph)
             ]
             for child in children_to_remove:
                 region.remove_child(child)
         if not include_all_regions and not has_quantizable_operations(region, graph):
             logger.debug(f"Filtered out region {i} (no quantizable operations)")
             continue
+        filtered_regions.append(region)
         logger.debug(
             f"Region {i}: {region.get_type().value}, {len(region.get_all_nodes_recursive())} nodes, "
             f"path={region.metadata.get('path', 'N/A')}"
         )
         builder.print_tree(region, indent=2)
-    leaf_regions = sum(1 for r in regions if r.get_type() == RegionType.LEAF)
-    composite_regions = sum(1 for r in regions if r.get_type() == RegionType.COMPOSITE)
-    root_regions = sum(1 for r in regions if r.get_type() == RegionType.ROOT)
+    result_regions = regions if include_all_regions else filtered_regions
+    leaf_regions = sum(1 for r in result_regions if r.get_type() == RegionType.LEAF)
+    composite_regions = sum(1 for r in result_regions if r.get_type() == RegionType.COMPOSITE)
+    root_regions = sum(1 for r in result_regions if r.get_type() == RegionType.ROOT)
     all_nodes = set()
-    for region in regions:
+    for region in result_regions:
         all_nodes.update(region.get_all_nodes_recursive())
     total_nodes = len(all_nodes)
     coverage_pct = 100 * total_nodes / len(graph.nodes) if graph.nodes else 0
@@
-    paths = [r.metadata.get("path", "") for r in regions if r.metadata.get("path")]
+    paths = [r.metadata.get("path", "") for r in result_regions if r.metadata.get("path")]
@@
-    return regions
+    return result_regions
```

Also applies to: 791-817
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 773
- 783, The code currently mutates children but always returns the original
regions list, so include_all_regions=False has no effect; update the loop that
iterates over regions to build a new filtered list (e.g., filtered_regions = [])
and for each region: perform the existing child-pruning using
region.get_children() and region.remove_child(child), then if
include_all_regions is True or has_quantizable_operations(region, graph) is
True, append the region to filtered_regions; finally return filtered_regions
instead of regions; apply the same change to the analogous block that handles
regions later in the file (the block that currently mirrors lines ~791-817) so
both code paths honor include_all_regions.
```diff
         latency = _benchmark_instance.run(
-            model_path, log_file=log_file, flush_timing_cache=flush_timing_cache
+            model_path,
+            log_file=log_file,
+            flush_timing_cache=flush_timing_cache,
+            strip_shape_args=strip_shape_args,
+            extra_run_args=extra_run_args,
+            export_profile_path=export_profile_path,
         )
```
Don't forward trtexec-only kwargs to every benchmark backend.
`_benchmark_instance` can still be a `TensorRTPyBenchmark`, and that implementation does not accept `strip_shape_args`, `extra_run_args`, or `export_profile_path`. On the first subgraph benchmark this raises `TypeError`, gets caught here, and the workflow records `inf` instead of a real latency. Gate these kwargs to `TrtExecBenchmark` (or make the Python backend accept/ignore them).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/workflows.py` around lines 69 - 75, The
call to _benchmark_instance.run is forwarding trtexec-specific kwargs
(strip_shape_args, extra_run_args, export_profile_path) to all backends causing
a TypeError when _benchmark_instance is a TensorRTPyBenchmark; update the call
to only include those kwargs when the backend is TrtExecBenchmark (or otherwise
supports them) by either checking isinstance(_benchmark_instance,
TrtExecBenchmark) or testing for attribute/method support before adding
strip_shape_args/extra_run_args/export_profile_path to the kwargs dict, then
call _benchmark_instance.run with the filtered kwargs so non-trt backends
receive only supported parameters.
| import unittest | ||
|
|
||
| from modelopt.onnx.quantization.autotune.common import Config | ||
|
|
||
|
|
||
| class TestConfig(unittest.TestCase): | ||
| """Test Config class functionality.""" | ||
|
|
||
| def test_default_values(self): | ||
| """Test that Config has correct default values.""" | ||
| config = Config() | ||
|
|
||
| # Logging | ||
| self.assertEqual(config.verbose, False) | ||
|
|
||
| # Performance thresholds | ||
| self.assertEqual(config.performance_threshold, 1.02) | ||
|
|
||
| # Q/DQ defaults | ||
| self.assertEqual(config.default_q_scale, 0.1) | ||
| self.assertEqual(config.default_q_zero_point, 0) | ||
| self.assertEqual(config.default_quant_type, "int8") | ||
|
|
||
| # Region builder settings | ||
| self.assertEqual(config.maximum_sequence_region_size, 10) | ||
| self.assertEqual(config.minimum_topdown_search_size, 10) | ||
|
|
||
| # Scheme generation parameters | ||
| self.assertEqual(config.top_percent_to_mutate, 0.1) | ||
| self.assertEqual(config.minimum_schemes_to_mutate, 10) | ||
| self.assertEqual(config.maximum_mutations, 3) | ||
| self.assertEqual(config.maximum_generation_attempts, 100) | ||
|
|
||
| # Pattern cache parameters | ||
| self.assertEqual(config.pattern_cache_minimum_distance, 4) | ||
| self.assertEqual(config.pattern_cache_max_entries_per_pattern, 32) | ||
|
|
||
|
|
||
| def test_custom_values(self): | ||
| """Test creating Config with custom values.""" | ||
| config = Config( | ||
| verbose=True, | ||
| performance_threshold=1.05, | ||
| default_q_scale=0.05, | ||
| default_q_zero_point=128, | ||
| default_quant_type="fp8", | ||
| maximum_sequence_region_size=20, | ||
| ) | ||
|
|
||
| self.assertEqual(config.verbose, True) | ||
| self.assertEqual(config.performance_threshold, 1.05) | ||
| self.assertEqual(config.default_q_scale, 0.05) | ||
| self.assertEqual(config.default_q_zero_point, 128) | ||
| self.assertEqual(config.default_quant_type, "fp8") | ||
| self.assertEqual(config.maximum_sequence_region_size, 20) | ||
|
|
||
| def test_performance_threshold_validation(self): | ||
| """Test that performance_threshold must be >= 1.0.""" | ||
| config1 = Config(performance_threshold=1.0) | ||
| self.assertEqual(config1.performance_threshold, 1.0) | ||
|
|
||
| config2 = Config(performance_threshold=1.5) | ||
| self.assertEqual(config2.performance_threshold, 1.5) | ||
|
|
||
| def test_region_size_validation(self): | ||
| """Test that region size parameters are positive.""" | ||
| config = Config(maximum_sequence_region_size=50, minimum_topdown_search_size=5) | ||
| self.assertGreater(config.maximum_sequence_region_size, 0) | ||
| self.assertGreater(config.minimum_topdown_search_size, 0) | ||
|
|
||
| def test_genetic_algorithm_params(self): | ||
| """Test genetic algorithm parameters.""" | ||
| config = Config( | ||
| top_percent_to_mutate=0.2, | ||
| minimum_schemes_to_mutate=2, | ||
| maximum_mutations=5, | ||
| maximum_generation_attempts=50, | ||
| ) | ||
|
|
||
| self.assertEqual(config.top_percent_to_mutate, 0.2) | ||
| self.assertEqual(config.minimum_schemes_to_mutate, 2) | ||
| self.assertEqual(config.maximum_mutations, 5) | ||
| self.assertEqual(config.maximum_generation_attempts, 50) | ||
|
|
||
| def test_pattern_cache_params(self): | ||
| """Test pattern cache parameters.""" | ||
| config = Config(pattern_cache_minimum_distance=3, pattern_cache_max_entries_per_pattern=10) | ||
|
|
||
| self.assertEqual(config.pattern_cache_minimum_distance, 3) | ||
| self.assertEqual(config.pattern_cache_max_entries_per_pattern, 10) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| unittest.main() |
Use pytest style for this new test module.
This new test file is written in unittest, but repository guidance requires new tests under tests/**/*.py to be authored with pytest style.
As per coding guidelines: tests/**/*.py: “Write tests using pytest for all new features and examples”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 19 - 112,
Replace the unittest-based TestConfig class/tests with pytest-style tests:
convert each test_* method (test_default_values, test_custom_values,
test_performance_threshold_validation, test_region_size_validation,
test_genetic_algorithm_params, test_pattern_cache_params) into standalone
functions that instantiate Config and use plain assert statements (e.g., assert
config.verbose is False) instead of self.assertEqual/self.assertGreater, remove
the unittest.TestCase inheritance and the if __name__ == "__main__":
unittest.main() block, and keep references to the Config symbol so the same
behavior is tested under pytest.
Thanks @gcunhase for the review!
🧹 Nitpick comments (4)
tests/unit/onnx/quantization/autotune/test_fusion_grouping.py (1)
231-238: Consider a more specific assertion.

The assertion `len(outputs) > 0` is weak and doesn't verify the expected output tensor names. For a full graph group, the outputs should be the graph's output tensors (`relu_out` based on the fixture).

💡 Suggested improvement

```diff
 def test_full_group_boundaries(self):
     graph = _build_conv_relu_graph()
     all_nodes = list(graph.nodes)
     names = {n.name for n in all_nodes}
     inputs, outputs = _find_boundary_tensors(all_nodes, names, graph)
     assert "input" in inputs
-    assert len(outputs) > 0
+    assert "relu_out" in outputs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_fusion_grouping.py` around lines 231 - 238, the test_full_group_boundaries uses a weak assertion (len(outputs) > 0); update it to assert the expected output tensor name for the full conv-relu graph. In the test_full_group_boundaries function (which calls _build_conv_relu_graph and _find_boundary_tensors), replace the length check with a specific assertion that outputs equals or contains the set {'relu_out'} (or assert 'relu_out' in outputs) so the test verifies the graph's actual output tensor name returned by _find_boundary_tensors.

tests/unit/onnx/quantization/autotune/test_subgraph_extractor.py (1)
130-133: Consider clarifying test name or adding an empty list test.

The test `test_empty_node_list_raises` tests with a list containing a nonexistent node (`["nonexistent"]`), not an empty list. The test logic is correct for validating that no matching nodes triggers the error, but the name could be more precise (e.g., `test_nonexistent_nodes_raises`). Additionally, consider whether passing an actual empty list `[]` should also be tested for completeness.

🔧 Suggested rename and additional test

```diff
-    def test_empty_node_list_raises(self):
+    def test_nonexistent_nodes_raises(self):
         graph = _build_linear_graph()
         with pytest.raises(ValueError, match="None of the specified nodes"):
             extract_subgraph_by_nodes(graph, ["nonexistent"])
+
+    def test_empty_node_list_raises(self):
+        graph = _build_linear_graph()
+        with pytest.raises(ValueError, match="None of the specified nodes"):
+            extract_subgraph_by_nodes(graph, [])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_subgraph_extractor.py` around lines 130 - 133, rename the test `test_empty_node_list_raises` to `test_nonexistent_nodes_raises` to reflect that it passes a non-matching node list, and add a separate test (e.g., `test_empty_nodes_list_raises`) that calls extract_subgraph_by_nodes(graph, []) and asserts it raises ValueError with the same "None of the specified nodes" message; update any test references accordingly and keep the existing case that passes ["nonexistent"] to validate non-matching node behavior.

tests/unit/onnx/quantization/autotune/test_cli.py (1)
131-135: Avoid hardcoding preset internals in assertions.

Use `MODE_PRESETS["extensive"]` values (as done in quick preset tests) so this test tracks preset changes without churn.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_cli.py` around lines 131 - 135, the test_extensive_preset currently asserts hardcoded preset values; change it to reference the preset constants instead: call args = self._make_args("extensive"), apply_mode_presets(args), then assert args.num_schemes == MODE_PRESETS["extensive"]["num_schemes"] and args.timing_runs == MODE_PRESETS["extensive"]["timing_runs"] so the test uses the canonical MODE_PRESETS values and will follow future preset changes; update imports or module references as needed to access MODE_PRESETS in the test file.

tests/unit/onnx/quantization/autotune/test_subgraph_workflow.py (1)
321-328: Add ONNX model structural validation after export.

Include `onnx.checker.check_model(model)` to validate the exported model for structural consistency. This strengthens the test by catching malformed graphs earlier, beyond just verifying Q/DQ node presence.

Proposed enhancement

```diff
 def test_qdq_produces_valid_onnx(self):
     graph = _build_simple_graph()
     insert_qdq_on_graph(graph, ["input"], quant_type="int8")
     model = gs.export_onnx(graph)
+    onnx.checker.check_model(model)
     assert len(model.graph.node) > 0
     op_types = {n.op_type for n in model.graph.node}
     assert "QuantizeLinear" in op_types
     assert "DequantizeLinear" in op_types
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_subgraph_workflow.py` around lines 321 - 328, The test test_qdq_produces_valid_onnx currently exports a graph via gs.export_onnx(graph) and only asserts node presence; add structural validation by calling onnx.checker.check_model(model) after export to ensure the exported model is well-formed. Import or reference onnx.checker, then invoke onnx.checker.check_model(model) immediately after model = gs.export_onnx(graph) (keeping the existing assertions about QuantizeLinear/DequantizeLinear and insert_qdq_on_graph intact) so malformed ONNX graphs fail the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cde4c50f-1806-4f91-bed7-d224e1d18ad3
📥 Commits
Reviewing files that changed from the base of the PR and between cf3693855ff344d3ef389672629d38947f788c00 and 847398815434814c50c1c9117a2d81a7b2bbfc9f.
📒 Files selected for processing (4)
tests/unit/onnx/quantization/autotune/test_cli.pytests/unit/onnx/quantization/autotune/test_fusion_grouping.pytests/unit/onnx/quantization/autotune/test_subgraph_extractor.pytests/unit/onnx/quantization/autotune/test_subgraph_workflow.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
232-236: ⚠️ Potential issue | 🔴 Critical

Critical: constructor returns early and leaves object partially initialized

Line 236 exits `__init__` when `--remoteAutoTuningConfig` is supported, so `self._base_cmd` and `self._profile_unsupported` are never initialized. `run()` will then fail at runtime.

💡 Proposed fix

```diff
 if has_remote_config:
     try:
         _check_for_tensorrt(min_version="10.16")
         self.logger.debug("TensorRT Python API version >= 10.16 detected")
-        return
     except ImportError:
         self.logger.warning(
             "Remote autotuning is not supported with TensorRT version < 10.16"
             "Removing --remoteAutoTuningConfig from trtexec arguments"
         )
```
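The failure mode is easy to reproduce with a toy class (a minimal repro, not the actual benchmark code):

```python
# An early return in __init__ silently skips the assignments below it,
# so later method calls fail with AttributeError.
class Runner:
    def __init__(self, has_remote_config: bool):
        if has_remote_config:
            return  # bug: self._base_cmd is never set on this path
        self._base_cmd = ["trtexec"]

    def run(self):
        return self._base_cmd


assert Runner(False).run() == ["trtexec"]
try:
    Runner(True).run()
    failed = False
except AttributeError:
    failed = True
assert failed  # the early-return path leaves the object unusable
```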
♻️ Duplicate comments (4)
modelopt/onnx/quantization/autotune/fusion_grouping.py (1)
285-290: ⚠️ Potential issue | 🟠 Major

Validate generated `graph.json` before returning success.

After Line 285 checks the exit code, the function still returns success on Line 290 without confirming the file exists. `trtexec` can exit successfully while not emitting `--exportLayerInfo`, causing delayed downstream failures.

Suggested fix

```diff
 if result.returncode != 0:
     logger.error(f"trtexec failed (exit {result.returncode}), see {log_path}")
     raise RuntimeError(f"trtexec FP16 build failed: {log_path}")
+if not Path(graph_json_path).is_file():
+    logger.error(
+        f"trtexec exited with code 0 but graph.json was not generated: {graph_json_path}"
+    )
+    raise RuntimeError(f"graph.json was not generated: {graph_json_path}")
+logger.info(f"graph.json generated: {graph_json_path}")
 return graph_json_path
```

```bash
#!/bin/bash
# Verify whether generate_graph_json has an output-file existence guard before returning.
rg -n -C4 'def generate_graph_json|returncode != 0|return graph_json_path|is_file\(|exists\(' modelopt/onnx/quantization/autotune/fusion_grouping.py
```

Expected result: `return graph_json_path` appears without a preceding `is_file()`/`exists()` check in the success path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 285 - 290, the function generate_graph_json returns graph_json_path after checking trtexec exit code but doesn't verify the file was actually created; update generate_graph_json to check that graph_json_path exists and is a regular file (e.g., use Path(graph_json_path).is_file() or os.path.exists/isfile) before logging success and returning, and if the file is missing log an error with the log_path and raise a RuntimeError indicating the missing graph.json so downstream callers don't get a false success.

modelopt/onnx/quantization/autotune/benchmark.py (3)
260-264: ⚠️ Potential issue | 🔴 Critical

Remove inline Bandit suppression (`# nosec`)

Line 263 adds an in-code Bandit bypass. This repo disallows `# nosec` comments; keep the list-arg subprocess call, but remove the suppression.

💡 Proposed fix

```diff
-result = subprocess.run(cmd, capture_output=True, text=True)  # nosec B603
+result = subprocess.run(cmd, capture_output=True, text=True)
```

As per coding guidelines, "Bandit security checks are enforced via pre-commit hooks. `# nosec` comments are not allowed as bypasses for security checks; contributors must request formal review from `@NVIDIA/modelopt-setup-codeowners` with documented justification instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 260 - 264, In _exec_and_log, remove the inline Bandit suppression by deleting the "# nosec B603" trailing comment on the subprocess.run(...) call (keep the list-style arg usage: subprocess.run(cmd, capture_output=True, text=True)); do not replace it with another inline bypass—if security exceptions are needed, raise the issue for formal review with the codeowners as per policy. Ensure the call remains in the _exec_and_log method and that logging and _write_log_file behavior are unchanged.
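For context, a list-argv `subprocess.run` call never invokes a shell, which is the property Bandit's B603 check is concerned with, so no suppression is needed. A small sketch (the helper name is illustrative; `echo` is assumed to exist, as on POSIX systems):

```python
import subprocess


def run_and_capture(cmd: list[str]) -> str:
    # List-style argv: no shell interpolation, so untrusted strings in
    # cmd are passed as literal arguments rather than shell syntax.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


out = run_and_capture(["echo", "hello"])
assert out.strip() == "hello"
```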
339-348: ⚠️ Potential issue | 🟠 Major

Only disable profiling for profiling-related unknown-option failures

Lines 340-343 mark profiling unsupported for any "Unknown option" error, including unrelated `extra_run_args`. That can permanently disable profiling after a non-profiling CLI mistake.

💡 Proposed fix

```diff
 if result.returncode != 0:
-    if "Unknown option" in (result.stderr or ""):
+    stderr = result.stderr or ""
+    profiling_prefixes = (
+        "--separateProfileRun",
+        "--profilingVerbosity",
+        "--exportProfile",
+    )
+    profiling_requested = export_profile_path is not None
+    profiling_flags_in_cmd = any(
+        any(c.startswith(p) for p in profiling_prefixes) for c in cmd
+    )
+    profiling_error = any(p in stderr for p in profiling_prefixes)
+
+    if (
+        "Unknown option" in stderr
+        and profiling_requested
+        and (profiling_flags_in_cmd or profiling_error)
+    ):
         self.logger.warning("Profiling flags unsupported, retrying without them")
         self._profile_unsupported = True
         cmd = [
             c
             for c in cmd
             if not any(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 339 - 348, The code currently treats any "Unknown option" stderr as proof profiling is unsupported and sets self._profile_unsupported; change this to only set self._profile_unsupported (and strip profiling flags from cmd) when the stderr specifically names one of the profiling flags. Update the conditional around result.returncode != 0 to inspect result.stderr for any of the profiling flag tokens used later (e.g., "--separateProfileRun", "--profilingVerbosity", "--exportProfile" or their short forms) and only then run the block that logs, sets self._profile_unsupported, and filters cmd; leave other "Unknown option" errors (e.g., from extra_run_args) to be handled normally.
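The targeted stripping amounts to a prefix filter over the argv list. A sketch assuming the three flag names named in the comment are the only profiling options involved:

```python
PROFILING_PREFIXES = ("--separateProfileRun", "--profilingVerbosity", "--exportProfile")


def strip_profiling_flags(cmd: list[str]) -> list[str]:
    # Drop only arguments that begin with a known profiling flag,
    # leaving unrelated extra_run_args untouched.
    return [c for c in cmd if not any(c.startswith(p) for p in PROFILING_PREFIXES)]


cmd = ["trtexec", "--onnx=model.onnx", "--exportProfile=prof.json", "--fp16"]
assert strip_profiling_flags(cmd) == ["trtexec", "--onnx=model.onnx", "--fp16"]
```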
320-324: ⚠️ Potential issue | 🟠 Major

`strip_shape_args` still leaves other shape flags in place

Lines 320-324 only strip `--optShapes`/`--minShapes`/`--maxShapes`. Flags like `--shapes` and calibration-shape options can still leak full-model bindings into subgraph runs.

💡 Proposed fix

```diff
 if strip_shape_args:
     cmd = [
         c
         for c in cmd
-        if not any(c.startswith(p) for p in ("--optShapes", "--minShapes", "--maxShapes"))
+        if not (
+            "shapes" in c.lower()
+            or c.startswith("--shape")
+            or c.startswith("--shapes")
+            or c.startswith("--calibration")
+        )
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` around lines 320 - 324, In benchmark.py the strip_shape_args branch that filters cmd currently only removes "--optShapes", "--minShapes", and "--maxShapes" which lets other shape-related flags like "--shapes" or calibration-shape options slip through; update the filtering logic in the strip_shape_args block that processes the cmd list (same scope where cmd is rebuilt) to also exclude "--shapes" and any other shape/calibration shape flags by matching flag names case-insensitively (e.g., any c that startswith "--shapes" or where c.lower() contains "shape" or "calib" combined with "shape"), so all shape-related arguments are removed before spawning subgraph runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 90e63dfd-dee0-4e93-a164-5574ed54b7c2
📥 Commits
Reviewing files that changed from the base of the PR and between 847398815434814c50c1c9117a2d81a7b2bbfc9f and fa67ad118e4e2b05b953813934ba6777a453e7c2.
📒 Files selected for processing (2)
modelopt/onnx/quantization/autotune/benchmark.pymodelopt/onnx/quantization/autotune/fusion_grouping.py
♻️ Duplicate comments (3)
modelopt/onnx/quantization/autotune/fusion_grouping.py (1)
295-300: ⚠️ Potential issue | 🟡 Minor

Add existence check for generated graph.json file.

The function checks `returncode != 0` but doesn't verify that `graph.json` was actually created. `trtexec` may exit with code 0 even when output files weren't generated (e.g., due to permission issues or disk space). This can cause obscured downstream failures.

🛡️ Suggested fix

```diff
 if result.returncode != 0:
     logger.error(f"trtexec failed (exit {result.returncode}), see {log_path}")
     raise RuntimeError(f"trtexec FP16 build failed: {log_path}")
+if not Path(graph_json_path).is_file():
+    logger.error(f"trtexec exited successfully but no graph.json at {graph_json_path}")
+    raise RuntimeError(f"graph.json was not generated: {graph_json_path}")
+logger.info(f"graph.json generated: {graph_json_path}")
 return graph_json_path
```

🤖 Prompt for AI Agents
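The pattern generalizes to any subprocess that is expected to emit an artifact; a self-contained sketch (the helper name is made up for illustration):

```python
import tempfile
from pathlib import Path


def build_succeeded(returncode: int, output_path: str) -> bool:
    # Exit code 0 alone is not proof of success: the expected
    # artifact must also exist on disk.
    return returncode == 0 and Path(output_path).is_file()


with tempfile.TemporaryDirectory() as tmp:
    graph_json = Path(tmp) / "graph.json"
    assert build_succeeded(0, str(graph_json)) is False  # exit 0, but no file
    graph_json.write_text("{}")
    assert build_succeeded(0, str(graph_json)) is True
    assert build_succeeded(1, str(graph_json)) is False  # nonzero exit
```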
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 295 - 300, the code currently only checks trtexec's returncode but not that the expected output file exists; after the trtexec run (where variables result, log_path and graph_json_path are available and logger is used) add an explicit existence check (e.g., os.path.exists(graph_json_path)) and if the file is missing log a clear error including graph_json_path and log_path and raise a RuntimeError indicating the FP16 build produced no graph.json; ensure this check is placed before returning graph_json_path so downstream callers never receive a non-existent path.

modelopt/onnx/quantization/autotune/subgraph_workflow.py (2)
820-828: ⚠️ Potential issue | 🟠 Major

Add validation for cached metadata before resuming.

The cache loading at line 821 doesn't validate that the cached `version` and `model_path` match the current run. Reusing an output directory with a different model or after code changes could replay stale Phase 2/3 decisions.

🛡️ Suggested fix

```diff
 # ─── Load cache ──────────────────────────────────────────────────────
-cache = _load_cache(cache_path) or {
-    "version": CACHE_VERSION, "model_path": model_path,
-}
+cache = _load_cache(cache_path)
+if cache is not None:
+    # Validate cache matches current run parameters
+    if (cache.get("version") != CACHE_VERSION
+            or cache.get("model_path") != model_path):
+        logger.warning(
+            "Cache mismatch (version or model_path changed), starting fresh"
+        )
+        cache = None
+if cache is None:
+    cache = {"version": CACHE_VERSION, "model_path": model_path}
 cached_p2: dict = cache.get("phase2", {})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 820 - 828, The loaded cache from _load_cache must be validated: check cache.get("version") equals CACHE_VERSION and cache.get("model_path") equals model_path before trusting cached_p2 / cached_p3; if either mismatches, discard or reinitialize the cache (e.g., set cache = {"version": CACHE_VERSION, "model_path": model_path} and clear phase2/phase3 results) and emit a warning log; implement this validation immediately after the cache = _load_cache(...) line and before using cached_p2, cached_p2_ids, or cached_p3 so stale Phase 2/3 decisions are not replayed.
128-134: ⚠️ Potential issue | 🟡 Minor

Protobuf dimension manipulation uses non-standard pattern.

Using `pop()` on protobuf repeated fields is non-standard and may not work consistently across protobuf versions. The recommended approach is to clear and rebuild.

♻️ Safer approach

```diff
 for inp in model.graph.input:
     if inp.name not in input_shapes:
         continue
     shape = input_shapes[inp.name]
     dim_proto = inp.type.tensor_type.shape.dim
-    while len(dim_proto) > len(shape):
-        dim_proto.pop()
-    while len(dim_proto) < len(shape):
-        dim_proto.add()
+    # Clear and rebuild dimensions
+    del dim_proto[:]
     for i, d in enumerate(shape):
-        dim_proto[i].ClearField("dim_param")
-        dim_proto[i].dim_value = d
+        new_dim = dim_proto.add()
+        new_dim.dim_value = d
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 128 - 134, Replace the non-standard pop/add loop on the protobuf repeated field with a clear-and-rebuild pattern: call dim_proto.Clear() (or dim_proto.clear() depending on proto API) to remove existing entries, then iterate over shape and for each value call new_dim = dim_proto.add(); new_dim.ClearField("dim_param"); new_dim.dim_value = d. Update the block that currently manipulates dim_proto and shape so it uses this clear-and-repopulate approach (keeping the same field names dim_proto, dim_param, dim_value) to ensure consistent behavior across protobuf versions.
🧹 Nitpick comments (3)
modelopt/onnx/quantization/autotune/__main__.py (1)
55-72: Consider explicit `allow_pickle=False` for security clarity.

While NumPy defaults to `allow_pickle=False` since version 1.16.3, explicitly specifying it makes the security posture clear and guards against potential future changes or environments with older NumPy.

♻️ Suggested improvement

```diff
 if path_str.endswith(".npz"):
-    data = np.load(path_str)
+    data = np.load(path_str, allow_pickle=False)
     return dict(data)
 elif path_str.endswith(".npy"):
-    return np.load(path_str)
+    return np.load(path_str, allow_pickle=False)
```

🤖 Prompt for AI Agents
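A quick round-trip shows the behavior: plain numeric arrays load fine with the flag set, while pickled object arrays are rejected:

```python
import os
import tempfile

import numpy as np

# Plain numeric array: loads fine with pickling explicitly disabled.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "calib.npy")
np.save(path, np.arange(4, dtype=np.float32))
loaded = np.load(path, allow_pickle=False)
assert loaded.tolist() == [0.0, 1.0, 2.0, 3.0]

# Object array: requires pickle, so the explicit flag rejects it.
obj_path = os.path.join(tmp, "obj.npy")
np.save(obj_path, np.array([{"a": 1}], dtype=object), allow_pickle=True)
try:
    np.load(obj_path, allow_pickle=False)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```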
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/__main__.py` around lines 55 - 72, the _load_calibration_data function uses np.load to read .npz and .npy files; update both calls to np.load in that function (the np.load call when path_str.endswith(".npz") and the return np.load(path_str) for .npy) to pass allow_pickle=False explicitly to make the security posture explicit and prevent pickled object loading.

modelopt/onnx/quantization/autotune/subgraph_workflow.py (1)
54-71: FP8 fallback now warns user, but consider runtime validation.

The warning at import time addresses the silent fallback concern. However, users requesting `--quant_type fp8` may not notice the import-time warning. Consider adding validation at workflow entry to fail early with a clear error when FP8 is requested but unavailable.

♻️ Consider adding runtime validation

```python
# At the start of subgraph_autotuning_workflow, add:
if quant_type == "fp8" and QUANT_DTYPES["fp8"] == np.int8:
    raise ValueError(
        "FP8 quantization requested but float8_e4m3fn dtype is unavailable. "
        "Install ml_dtypes package for native FP8 support."
    )
```

🤖 Prompt for AI Agents
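That early check behaves like this small sketch, where the `QUANT_DTYPES` table is a stand-in simulating the int8 fallback (the real table lives in the workflow module):

```python
import numpy as np

# Simulated fallback table: fp8 degraded to int8, as when ml_dtypes
# is not installed (stand-in for the real module-level mapping).
QUANT_DTYPES = {"int8": np.int8, "fp8": np.int8}


def validate_quant_type(quant_type: str) -> None:
    # Fail fast at workflow entry instead of warning only at import time.
    if quant_type == "fp8" and QUANT_DTYPES["fp8"] == np.int8:
        raise ValueError(
            "FP8 quantization requested but float8_e4m3fn dtype is unavailable. "
            "Install ml_dtypes package for native FP8 support."
        )


validate_quant_type("int8")  # no-op for supported types
try:
    validate_quant_type("fp8")
    raised = False
except ValueError:
    raised = True
assert raised
```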
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 54 - 71, add a runtime check at the start of subgraph_autotuning_workflow to validate FP8 availability: if the caller requested quant_type == "fp8" but QUANT_DTYPES["fp8"] resolved to np.int8 (as returned by _get_fp8_dtype()), raise a clear ValueError instructing the user to install ml_dtypes for native FP8 support; this ensures explicit failure when FP8 was requested but unavailable rather than relying solely on the import-time warning.

modelopt/onnx/quantization/autotune/fusion_grouping.py (1)
122-127: Remove or document the unused `_get_op_type_from_node_name` helper function.

The `_get_op_type_from_node_name` function is defined but not called anywhere in the codebase. Either remove it to reduce dead code, or add a comment documenting why it is retained for future use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 122 - 127, The helper function _get_op_type_from_node_name is defined but unused; either remove the function to eliminate dead code, or retain it with a clear comment/docstring explaining its intended future use (e.g., lookup helper for debugging or future fusion logic) and add an annotation (like TODO or rationale) so linters/maintainers know it is intentionally kept; update the docstring/comment inside fusion_grouping.py adjacent to _get_op_type_from_node_name to reflect the chosen action.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 295-300: The code currently only checks trtexec's returncode but
not that the expected output file exists; after the trtexec run (where variables
result, log_path and graph_json_path are available and logger is used) add an
explicit existence check (e.g., os.path.exists(graph_json_path)) and if the file
is missing log a clear error including graph_json_path and log_path and raise a
RuntimeError indicating the FP16 build produced no graph.json; ensure this check
is placed before returning graph_json_path so downstream callers never receive a
non-existent path.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 820-828: The loaded cache from _load_cache must be validated:
check cache.get("version") equals CACHE_VERSION and cache.get("model_path")
equals model_path before trusting cached_p2 / cached_p3; if either mismatches,
discard or reinitialize the cache (e.g., set cache = {"version": CACHE_VERSION,
"model_path": model_path} and clear phase2/phase3 results) and emit a warning
log; implement this validation immediately after the cache = _load_cache(...)
line and before using cached_p2, cached_p2_ids, or cached_p3 so stale Phase 2/3
decisions are not replayed.
- Around line 128-134: Replace the non-standard pop/add loop on the protobuf
repeated field with a clear-and-rebuild pattern: call dim_proto.Clear() (or
dim_proto.clear() depending on proto API) to remove existing entries, then
iterate over shape and for each value call new_dim = dim_proto.add();
new_dim.ClearField("dim_param"); new_dim.dim_value = d. Update the block that
currently manipulates dim_proto and shape so it uses this clear-and-repopulate
approach (keeping the same field names dim_proto, dim_param, dim_value) to
ensure consistent behavior across protobuf versions.
---
Nitpick comments:
In `@modelopt/onnx/quantization/autotune/__main__.py`:
- Around line 55-72: The _load_calibration_data function uses np.load to read
.npz and .npy files; update both calls to np.load in that function (the np.load
call when path_str.endswith(".npz") and the return np.load(path_str) for .npy)
to pass allow_pickle=False explicitly to make the security posture explicit and
prevent pickled object loading.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 122-127: The helper function _get_op_type_from_node_name is
defined but unused; either remove the function to eliminate dead code, or retain
it with a clear comment/docstring explaining its intended future use (e.g.,
lookup helper for debugging or future fusion logic) and add an annotation (like
TODO or rationale) so linters/maintainers know it is intentionally kept; update
the docstring/comment inside fusion_grouping.py adjacent to
_get_op_type_from_node_name to reflect the chosen action.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 54-71: Add a runtime check at the start of
subgraph_autotuning_workflow to validate FP8 availability: if the caller
requested quant_type == "fp8" but QUANT_DTYPES["fp8"] resolved to np.int8 (as
returned by _get_fp8_dtype()), raise a clear ValueError instructing the user to
install ml_dtypes for native FP8 support; this ensures explicit failure when FP8
was requested but unavailable rather than relying solely on the import-time
warning.
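The explicit FP8 check can be sketched as below; `_get_fp8_dtype` and `QUANT_DTYPES` are stand-ins mirroring the names in the comment, not the actual definitions in the module:

```python
import numpy as np

def _get_fp8_dtype():
    """Return a native FP8 dtype if ml_dtypes is available, else int8."""
    try:
        import ml_dtypes
        return np.dtype(ml_dtypes.float8_e4m3fn)
    except ImportError:
        return np.int8  # fallback when ml_dtypes is unavailable

QUANT_DTYPES = {"int8": np.int8, "fp8": _get_fp8_dtype()}

def check_fp8_available(quant_type: str) -> None:
    """Raise early when FP8 was requested but only the int8 fallback exists."""
    if quant_type == "fp8" and QUANT_DTYPES["fp8"] == np.int8:
        raise ValueError(
            "quant_type='fp8' requested but ml_dtypes is not installed; "
            "install ml_dtypes for native FP8 support."
        )
```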
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9a87f60a-3119-4ffb-8d28-3099c54cc8a6
📥 Commits
Reviewing files that changed from the base of the PR and between fa67ad118e4e2b05b953813934ba6777a453e7c2 and fc4d2f65d2230e5bb319d3f54f406a797d60b965.
📒 Files selected for processing (3)
- modelopt/onnx/quantization/autotune/__main__.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
Expand ONNX Q/DQ autotune with a new subgraph workflow (--workflow subgraph) that uses TensorRT fusion boundaries for faster, fusion-aware QDQ placement. Key additions:
- Subgraph workflow: fusion grouping from graph.json, per-subgraph heuristic scheme profiling, optional per-layer timing, incremental full-model validation
- Fusion grouping (fusion_grouping.py): parse TRT graph.json, auto-generate via trtexec if not provided
- Subgraph extraction (subgraph_extractor.py): extract standalone ONNX subgraphs per fusion group
- Torch region builder (torch_region_builder.py): PyTorch-style hierarchical region discovery
- TensorRT utils (tensorrt_utils.py): TRT Python API benchmark with timing cache and plugin support
- QDQ utilities (qdq_utils.py): quantized tensor discovery helpers
- Cache/resume via autotune_cache.json for long-running jobs
- trtexec profiling-flag compatibility: retry without --exportProfile on older TRT versions
- CLI: --workflow, --graph_json, --incremental_validation flags
- Documentation and examples (qdq_placement/)

Signed-off-by: wahao <wahao@nvidia.com>
Made-with: Cursor
…, extractor, and CLI

Cover shape parsing, profile parsing, cache I/O, QDQ insertion, heuristic scheme generation, boundary tensor resolution, subgraph extraction, metadata parsing, graph.json parsing, and CLI argument parsing with mode presets.

Signed-off-by: wahao <wahao@nvidia.com>
Made-with: Cursor
Signed-off-by: wahao <wahao@nvidia.com>
Force-pushed from fc4d2f6 to 26f8621
Pull Request: Expand ONNX Q/DQ Autotuning with Subgraph Mode
Branch: dev-wahao-autotune-subgraph-profile → main
Type: Feature
Summary
This PR expands on the existing ONNX Q/DQ Autotune framework by adding a subgraph workflow (--workflow subgraph) that uses TensorRT fusion boundaries for faster, fusion-aware QDQ placement optimization. It also integrates @willg-nv's TorchRegionBuilder (from Draft PR #963) for PyTorch-style hierarchical region discovery, and adds TensorRT Python API benchmarking utilities, documentation, and examples.

Relation to existing PRs: TorchRegionBuilder is included here along with the subgraph workflow.

What's New
- --workflow subgraph: fusion-aware grouping from TensorRT graph.json; per-subgraph QDQ scheme profiling; optional per-layer timing with fallback to total latency.
- fusion_grouping.py: parse TRT graph.json, build fusion groups, infer shapes. Auto-generates graph.json via trtexec FP16 build if not provided.
- subgraph_extractor.py: extract standalone ONNX subgraphs per fusion group for isolated benchmarking.
- torch_region_builder.py: PyTorch-style hierarchical region discovery using node name conventions (from #963).
- tensorrt_utils.py: TRT Python API benchmark with timing cache, plugin support, and configurable warmup/timing runs.
- Output models: optimized_raw.onnx + optimized_final.onnx.
- Cache/resume: autotune_cache.json for Phase 2 (subgraph profiling) and Phase 3 (incremental validation).
- trtexec compatibility: detect unsupported --exportProfile/--profilingVerbosity and retry with total latency.
- CLI: --workflow {region,subgraph}, --graph_json, --incremental_validation/--no-incremental-validation.
- examples/qdq_placement/: README (Quick Start, region vs subgraph, best practices) and set_batch_size.py.

Key Files
- modelopt/onnx/quantization/autotune/__main__.py: CLI flags --workflow, --graph-json, --incremental-validation, --use-trtexec, --trtexec-args, etc.
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py: parse graph.json, create fusion groups, generate_graph_json() (trtexec FP16 build when no graph is provided).
- modelopt/onnx/quantization/autotune/subgraph_extractor.py
- modelopt/onnx/quantization/autotune/tensorrt_utils.py
- modelopt/onnx/quantization/autotune/torch_region_builder.py
- modelopt/onnx/quantization/autotune/benchmark.py: export_profile_path, profiling-flag dedup and "Unknown option" retry.
- modelopt/onnx/quantization/autotune/workflows.py: benchmark_onnx_model(); passes through export_profile_path when using trtexec.
- modelopt/onnx/quantization/autotune/qdq_utils.py
- examples/qdq_placement/README.md
- examples/qdq_placement/set_batch_size.py
- tests/unit/onnx/quantization/autotune/test_config.py

How to Test
Region mode (default):
Subgraph mode with trtexec (FP8, optional graph.json):
Resume: Kill the subgraph run mid-way, then re-run the same command; it should resume from autotune_cache.json.

Checklist
- main with DCO sign-off
- --use-trtexec verified
- examples/qdq_placement/README.md matches behavior

Documentation
- examples/qdq_placement/README.md — Quick Start, subgraph best practices, output layout, optional graph generation.
- docs/source/guides/9_qdq_placement.rst and docs/source/reference/2_qdq_placement.rst align with the CLI and behavior above.

Notes
Older TensorRT versions that reject --exportProfile/--profilingVerbosity are handled by retrying without those flags and using total latency for scheme selection.

Summary by CodeRabbit
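The PyTorch-style hierarchical region discovery attributed to torch_region_builder.py above can be illustrated with a toy grouper. This is not the actual TorchRegionBuilder from #963; the node-name format and the fixed grouping depth are assumptions, shown only to convey the idea that PyTorch-exported ONNX node names encode module hierarchy:

```python
from collections import defaultdict

def group_by_region(node_names, depth=2):
    """Group ONNX node names by their first `depth` module-path components.

    Toy illustration: names like "/encoder/layer.0/attn/MatMul" encode the
    exporting PyTorch module hierarchy, so a shared prefix identifies a region.
    """
    regions = defaultdict(list)
    for name in node_names:
        parts = [p for p in name.split("/") if p]
        if len(parts) > depth:
            key = "/".join(parts[:depth])
        else:
            key = "/".join(parts[:-1])  # shallow node: group by its parent
        regions[key].append(name)
    return dict(regions)

nodes = [
    "/encoder/layer.0/attn/MatMul",
    "/encoder/layer.0/attn/Softmax",
    "/encoder/layer.1/mlp/Gemm",
    "/head/Gemm",
]
print(group_by_region(nodes))
```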