
Integrate Automated QDQ autotuner - part 3.2 #838

Open
willg-nv wants to merge 6 commits into NVIDIA:main from willg-nv:dev-willg-integrate-auto-qdq-placement-part3.2

Conversation


willg-nv (Contributor) commented on Feb 2, 2026

What does this PR do?

This PR implements the QDQAutotuner class, which drives the main autotuner workflow.

The workflow is (a minimal sketch of the driver loop follows the list):

  1. use RegionSearch to build regions
  2. generate QDQ ONNX models and evaluate their performance
  3. save the best-performing model
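
To make this concrete, here is a minimal, hypothetical sketch of the driver loop. The class and method names (QDQAutotuner, Config, CombinedRegionSearch, set_profile_region, generate, submit, export_onnx) follow the public API and walkthrough described later in this thread, while the constructor arguments, method signatures, and the run_benchmark() helper are illustrative assumptions rather than the final API:

# Hypothetical driver loop: constructor arguments, method signatures, and the
# run_benchmark() helper are placeholders, not the final API.
from modelopt.onnx.quantization.autotune import (
    CombinedRegionSearch,
    Config,
    QDQAutotuner,
)

autotuner = QDQAutotuner("model.onnx", Config())            # load model, init state
regions = CombinedRegionSearch(autotuner.graph).search()    # 1. build regions

for region in regions:
    autotuner.set_profile_region(region)
    for _ in range(16):                                      # 2. generate candidate schemes
        candidate_onnx = autotuner.generate()                # insert QDQ nodes
        autotuner.submit(run_benchmark(candidate_onnx))      # report measured latency (ms)

best_onnx = autotuner.export_onnx(best=True)                 # 3. export the best model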

This PR is part 2/4 of #703.

PR 3.1: #837
PR 3.2: #838
PR 3.3: #839

Overview: ?

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Not in this part.
  • Did you add or update any necessary documentation?: No, the documentation will be updated in part 4.
  • Did you update Changelog?: No, the changelog will be updated when all changes are ready.

Additional Information

Summary by CodeRabbit

  • New Features
    • Introduced ONNX Q/DQ autotuning framework with automatic region discovery and pattern-based optimization.
    • Added model profiling and quantization scheme generation capabilities.
    • Enabled state persistence and quantization model export functionality.
    • Introduced configuration management for quantization parameters and profiling workflows.


willg-nv requested a review from a team as a code owner on February 2, 2026 02:58
willg-nv requested a review from ajrasane on February 2, 2026 02:58

copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

📝 Walkthrough

Introduces a new ONNX quantization autotuning module that enables automatic Q/DQ (Quantize/Dequantize) node insertion and optimization using pattern-based region analysis. Provides a comprehensive framework for discovering optimal insertion points, profiling schemes, and exporting quantized models.

Changes

Module Initialization (modelopt/onnx/quantization/autotune/__init__.py):
Exposes public API surface: QDQAutotuner class, configuration/exception types (Config, InsertionScheme, PatternSchemes, RegionType), insertion point abstractions, and utility classes (PatternCache, Region, RegionPattern, CombinedRegionSearch).

Core Autotuner Implementation (modelopt/onnx/quantization/autotune/autotuner.py):
Implements QDQAutotunerBase and QDQAutotuner with region discovery, pattern-based Q/DQ insertion logic, profiling workflow, state management, graph mutation, insertion point resolution, and ONNX export capabilities. Supports scheme generation, convergence tracking, FP8 conversion, and pattern cache integration.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Autotuner as QDQAutotuner
    participant RegionSearch as CombinedRegionSearch
    participant Profiler as Profiling System
    participant Inserter as Q/DQ Insertion
    participant Exporter as ONNX Exporter

    User->>Autotuner: initialize(config, pattern_cache)
    Autotuner->>Autotuner: Load model & init state

    User->>RegionSearch: discover regions
    RegionSearch-->>Autotuner: return regions

    loop For each region
        User->>Autotuner: set_profile_region(region)
        Autotuner->>Autotuner: Commit profiling outcomes
        Autotuner->>Profiler: Prepare region-pattern pairs
        
        loop Generate candidates
            User->>Autotuner: generate()
            Autotuner->>Inserter: Build insertion scheme
            Inserter->>Inserter: Insert Q/DQ nodes
            User->>Autotuner: submit(latency_ms)
            Autotuner->>Autotuner: Track performance metrics
        end
    end

    User->>Autotuner: export_onnx(best=True)
    Autotuner->>Inserter: Apply best scheme
    Inserter->>Exporter: Finalize Q/DQ graph
    Exporter-->>User: return quantized ONNX bytes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title 'Integrate Automated QDQ autotuner - part 3.2' accurately describes the PR's main objective: integrating the QDQAutotuner class implementation into the codebase as part 3.2 of a larger feature series.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `modelopt/onnx/quantization/autotune/autotuner.py`:
- Around lines 1024-1029: The try/except around graph.cleanup().toposort() catches all exceptions (except Exception as e) and merely logs a warning, which can hide serious graph corruption. Either narrow the except clause to the expected cleanup/toposort exception types, or log the error and re-raise it so execution stops on unexpected failures (a sketch of this fix follows the list).
- Line 622: Remove the redundant local import "from datetime import datetime" in autotuner.py; the module already imports datetime at the top of the file, so the local import only duplicates it and risks shadowing.
- Around lines 912-918: The zero-point arrays q_zp_values and dq_zp_values are created with a hardcoded dtype of np.int8, which can mismatch the QuantizeLinear/DequantizeLinear output type when quant_type is "uint8" or another type. Construct them with the computed quant_dtype so they match the quantized element type used when building q_inputs and dq_inputs (see the sketch after this list).
- Around lines 1013-1021: The import of get_tensor_consumer_node_indices is wrong and causes an import error. Import get_tensor_consumer_nodes from modelopt.onnx.quantization.graph_utils instead and update the usages accordingly; the code that builds tensor_users_map already expects a defaultdict(list), so no KeyError handling is needed (see the sketch after this list).
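
For the cleanup/toposort comment, a minimal sketch of the re-raise variant, assuming the graph and logger objects referenced in the comment; this is illustrative, not the exact code in autotuner.py:

try:
    graph.cleanup().toposort()
except Exception as e:
    # Log the failure, then re-raise so unexpected graph corruption is not
    # silently ignored.
    logger.error(f"Failed to clean up and toposort graph: {e}")
    raise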
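
For the zero-point dtype comment, a sketch of the suggested change; quant_type, q_scale_values, and dq_scale_values are assumed from the comment, and the surrounding code is hypothetical:

import numpy as np

# Derive the zero-point dtype from the requested quantization type instead of
# hardcoding np.int8, so the zero points match the Q/DQ output element type.
quant_dtype = np.uint8 if quant_type == "uint8" else np.int8
q_zp_values = np.zeros_like(q_scale_values, dtype=quant_dtype)
dq_zp_values = np.zeros_like(dq_scale_values, dtype=quant_dtype)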
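
For the import comment, a sketch of the corrected import and lookup, assuming get_tensor_consumer_nodes takes the graph and returns the defaultdict(list) that tensor_users_map expects:

from modelopt.onnx.quantization.graph_utils import get_tensor_consumer_nodes

# defaultdict(list): unknown tensor names yield an empty list instead of a KeyError.
tensor_users_map = get_tensor_consumer_nodes(graph)
consumer_nodes = tensor_users_map[tensor_name]
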
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/autotuner.py (4)

229-229: Consider defining config attributes explicitly.

Using getattr(self.config, "maximum_generation_attempts", 100) with defaults (also seen at lines 718-719 and 744) suggests these attributes may not be formally defined on the Config class. This pattern makes it harder to discover available configuration options.

💡 Suggestion

Consider adding these attributes to the Config class with documented defaults rather than relying on getattr fallbacks:

# In Config class
maximum_generation_attempts: int = 100
top_percent_to_mutate: float = 0.1
minimum_schemes_to_mutate: int = 1
maximum_mutations: int = 3

333-335: Replace assertions with explicit checks for runtime validation.

Assertions on lines 333-335 (and similarly at line 314) are used for validating runtime conditions. Since assertions can be disabled with python -O, these should be explicit checks for production code.

🛡️ Proposed fix
-                full_insertion_scheme = pattern.get_full_insertion_scheme(region, self.graph)
-                assert full_insertion_scheme is not None
-                all_region_ips = pattern.matches(region, self.graph, full_insertion_scheme)
-                assert isinstance(all_region_ips, set)
+                full_insertion_scheme = pattern.get_full_insertion_scheme(region, self.graph)
+                if full_insertion_scheme is None:
+                    logger.warning(f"Failed to get full insertion scheme for region {region.id}")
+                    continue
+                all_region_ips = pattern.matches(region, self.graph, full_insertion_scheme)
+                if not isinstance(all_region_ips, set):
+                    raise TypeError(f"Expected set from pattern.matches, got {type(all_region_ips)}")

972-985: Assertions used for critical runtime validation.

These assertions validate critical invariants (node index bounds, input index bounds, tensor name matching) but can be disabled with python -O. Consider using explicit checks with ValueError/IndexError for production safety.

🛡️ Proposed fix
             if node_index is not None:
-                assert node_index < len(graph.nodes), "Node index out of range"
+                if node_index >= len(graph.nodes):
+                    raise IndexError(f"Node index {node_index} out of range (max: {len(graph.nodes) - 1})")
                 target_node = graph.nodes[node_index]
-                assert input_index is not None, "Input index must be set when node index is set"
-                assert input_index < len(target_node.inputs), (
-                    f"Input index out of range for node {target_node.name}"
-                )
+                if input_index is None:
+                    raise ValueError("Input index must be set when node index is set")
+                if input_index >= len(target_node.inputs):
+                    raise IndexError(f"Input index {input_index} out of range for node {target_node.name}")
                 original_tensor = target_node.inputs[input_index]
-                assert tensor_name == original_tensor.name, (
-                    f"Tensor name mismatch for node {target_node.name} input {input_index}"
-                )
+                if tensor_name != original_tensor.name:
+                    raise ValueError(f"Tensor name mismatch: expected '{tensor_name}', got '{original_tensor.name}'")
             else:
-                assert tensor_name in tensor_map, f"Tensor {tensor_name} not found in tensor map"
-                assert input_index is None, "Input index must be None when node index is None"
+                if tensor_name not in tensor_map:
+                    raise KeyError(f"Tensor {tensor_name} not found in tensor map")
+                if input_index is not None:
+                    raise ValueError("Input index must be None when node index is None")

1042-1049: Consider iterative approach for deep region hierarchies.

_visit_region_recursively uses recursion which could hit Python's stack limit for very deep region hierarchies. While this is unlikely for typical ONNX models, an iterative approach would be more robust.

♻️ Iterative alternative
def _visit_region_recursively(self, region: Region) -> list[Region]:
    """Iteratively traverse region hierarchy and collect all regions."""
    regions = []
    stack = [region]
    while stack:
        current = stack.pop()
        regions.append(current)
        stack.extend(current.get_children())
    return regions

willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.2 branch 2 times, most recently from b5032ed to 1ffcf7f on February 3, 2026 01:55
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.2 branch from 1ffcf7f to bd18dfa on February 9, 2026 08:36
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
ajrasane (Contributor) commented:

/ok to test b02fef1

Signed-off-by: Will Guo <willg@nvidia.com>
willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.2 branch from c5340bb to 68ffa5a on February 16, 2026 23:51
gcunhase (Contributor) commented:

/ok to test 68ffa5a
