Integrate Automated QDQ autotuner - part 3.2 #838
willg-nv wants to merge 6 commits into NVIDIA:main
📝 Walkthrough

Introduces a new ONNX quantization autotuning module that enables automatic Q/DQ (Quantize/Dequantize) node insertion and optimization using pattern-based region analysis. It provides a framework for discovering optimal insertion points, profiling candidate schemes, and exporting quantized models.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Autotuner as QDQAutotuner
    participant RegionSearch as CombinedRegionSearch
    participant Profiler as Profiling System
    participant Inserter as Q/DQ Insertion
    participant Exporter as ONNX Exporter
    User->>Autotuner: initialize(config, pattern_cache)
    Autotuner->>Autotuner: Load model & init state
    User->>RegionSearch: discover regions
    RegionSearch-->>Autotuner: return regions
    loop For each region
        User->>Autotuner: set_profile_region(region)
        Autotuner->>Autotuner: Commit profiling outcomes
        Autotuner->>Profiler: Prepare region-pattern pairs
        loop Generate candidates
            User->>Autotuner: generate()
            Autotuner->>Inserter: Build insertion scheme
            Inserter->>Inserter: Insert Q/DQ nodes
            User->>Autotuner: submit(latency_ms)
            Autotuner->>Autotuner: Track performance metrics
        end
    end
    User->>Autotuner: export_onnx(best=True)
    Autotuner->>Inserter: Apply best scheme
    Inserter->>Exporter: Finalize Q/DQ graph
    Exporter-->>User: return quantized ONNX bytes
```
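The inner "Generate candidates" loop in the diagram is a generate, measure, submit, pick-best cycle. The sketch below illustrates only that cycle with a self-contained toy; `ToyAutotuner`, the scheme names, and `fake_latency` are hypothetical and are not the `QDQAutotuner` API added by this PR (which, per the diagram, also iterates over regions and builds real Q/DQ insertion schemes).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyAutotuner:
    """Hypothetical stand-in for the generate/submit/best cycle (not QDQAutotuner)."""

    candidates: List[str]
    # Measured latency (ms) for each scheme that has been submitted.
    results: Dict[str, float] = field(default_factory=dict)
    _pending: Optional[str] = field(default=None, init=False)

    def generate(self) -> Optional[str]:
        """Return the next untried scheme, or None once every scheme is profiled."""
        for scheme in self.candidates:
            if scheme not in self.results:
                self._pending = scheme
                return scheme
        return None

    def submit(self, latency_ms: float) -> None:
        """Record the latency measured for the most recently generated scheme."""
        if self._pending is None:
            raise RuntimeError("submit() called without a pending scheme")
        self.results[self._pending] = latency_ms
        self._pending = None

    def best(self) -> str:
        """Scheme with the lowest recorded latency."""
        return min(self.results, key=self.results.get)


def fake_latency(scheme: str) -> float:
    # Hypothetical benchmark numbers standing in for a real profiler run.
    return {"no_qdq": 2.0, "qdq_conv_inputs": 1.2, "qdq_all_inputs": 1.5}[scheme]


tuner = ToyAutotuner(["no_qdq", "qdq_conv_inputs", "qdq_all_inputs"])
while (scheme := tuner.generate()) is not None:
    tuner.submit(fake_latency(scheme))
print(tuner.best())  # qdq_conv_inputs
```

Keeping generation and measurement as separate calls, as the diagram does, lets the caller benchmark each candidate model however it likes between `generate()` and `submit()`.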
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@modelopt/onnx/quantization/autotune/autotuner.py`:
- Around lines 1024-1029: The try/except around graph.cleanup().toposort()
catches every exception (`except Exception as e`) and only logs a warning, which
can hide serious graph corruption. Update the handler in autotuner.py to either
catch only the expected exception types (e.g., specific cleanup/toposort
errors) or log the error and re-raise it, so that execution stops on unexpected
failures: locate the graph.cleanup().toposort() call and either narrow the
except clause to known recoverable exceptions or add a `raise` after the
logger.warning call to propagate the error.
- Line 622: Remove the redundant local import "from datetime import datetime"
inside the function in autotuner.py. The module already imports datetime at the
top of the file, so this local import only duplicates it and risks shadowing;
delete it.
- Around lines 912-918: The zero-point arrays q_zp_values and dq_zp_values are
created with a hardcoded np.int8 dtype, which mismatches the
QuantizeLinear/DequantizeLinear output type when quant_type is "uint8" or
another type. Construct them with the computed quant_dtype instead of np.int8 so
that q_zp_values and dq_zp_values match the quantized element type used when
building q_inputs and dq_inputs (see q_scale_values, q_zp_values, q_inputs, and
the corresponding dq_* variables).
- Around lines 1013-1021: The import of get_tensor_consumer_node_indices is
wrong and causes an import error. Import get_tensor_consumer_nodes from
modelopt.onnx.quantization.graph_utils instead, and assign tensor_users_map from
get_tensor_consumer_nodes(...) wherever it is used in the autotuner. The code
that uses tensor_users_map already expects a defaultdict(list), so no KeyError
handling is needed (references: get_tensor_consumer_node_indices,
get_tensor_consumer_nodes, tensor_users_map).
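Background for the zero-point item (lines 912-918): in ONNX, the element type of QuantizeLinear's output is taken from its `y_zero_point` input, so the zero point has to be built with the target quantized dtype rather than a fixed `np.int8`. A minimal sketch of that rule follows; `QUANT_DTYPES` and `make_qdq_constants` are hypothetical helpers, not the code in autotuner.py.

```python
import numpy as np

# Hypothetical mapping from a quant_type string to the NumPy dtype of the quantized tensor.
QUANT_DTYPES = {"int8": np.int8, "uint8": np.uint8}


def make_qdq_constants(quant_type: str, scale: float = 1.0, zero_point: int = 0):
    """Build scale/zero-point arrays whose zero-point dtype matches quant_type."""
    try:
        quant_dtype = QUANT_DTYPES[quant_type]
    except KeyError:
        raise ValueError(f"Unsupported quant_type: {quant_type!r}") from None
    scale_values = np.array(scale, dtype=np.float32)
    # Use quant_dtype instead of a hardcoded np.int8: the zero point's element
    # type determines the QuantizeLinear output type.
    zp_values = np.array(zero_point, dtype=quant_dtype)
    return scale_values, zp_values


_, zp = make_qdq_constants("uint8")
print(zp.dtype)  # uint8
```

Deriving both the Q and DQ zero points from the same `quant_dtype` keeps the pair consistent, since DequantizeLinear's input and zero point must share that type as well.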
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/autotuner.py (4)
229-229: Consider defining config attributes explicitly. Using `getattr(self.config, "maximum_generation_attempts", 100)` with defaults (also seen at lines 718-719 and 744) suggests these attributes may not be formally defined on the `Config` class. This pattern makes it harder to discover the available configuration options.

💡 Suggestion

Consider adding these attributes to the `Config` class with documented defaults rather than relying on `getattr` fallbacks:

```python
# In Config class
maximum_generation_attempts: int = 100
top_percent_to_mutate: float = 0.1
minimum_schemes_to_mutate: int = 1
maximum_mutations: int = 3
```
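To show the discoverability benefit, here is a minimal sketch that assumes `Config` can be expressed as a dataclass (its actual definition is not shown in this thread). With explicit fields, defaults live in one place, plain attribute access replaces the `getattr` fallbacks, and `dataclasses.fields` lists every option.

```python
from dataclasses import dataclass, fields


@dataclass
class Config:
    """Hypothetical Config; defaults are the values suggested above."""

    maximum_generation_attempts: int = 100
    top_percent_to_mutate: float = 0.1
    minimum_schemes_to_mutate: int = 1
    maximum_mutations: int = 3


config = Config(maximum_mutations=5)
# Plain attribute access: no getattr(config, "...", default) fallback needed.
print(config.maximum_generation_attempts)  # 100
# Every available option can be enumerated from the class itself.
print([f.name for f in fields(Config)])
```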
333-335: Replace assertions with explicit checks for runtime validation. The assertions on lines 333-335 (and similarly at line 314) validate runtime conditions. Since assertions are disabled when Python runs with `python -O`, these should be explicit checks in production code.

🛡️ Proposed fix

```diff
- full_insertion_scheme = pattern.get_full_insertion_scheme(region, self.graph)
- assert full_insertion_scheme is not None
- all_region_ips = pattern.matches(region, self.graph, full_insertion_scheme)
- assert isinstance(all_region_ips, set)
+ full_insertion_scheme = pattern.get_full_insertion_scheme(region, self.graph)
+ if full_insertion_scheme is None:
+     logger.warning(f"Failed to get full insertion scheme for region {region.id}")
+     continue
+ all_region_ips = pattern.matches(region, self.graph, full_insertion_scheme)
+ if not isinstance(all_region_ips, set):
+     raise TypeError(f"Expected set from pattern.matches, got {type(all_region_ips)}")
```
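The `python -O` concern can be checked directly: under `-O`, `assert` statements are compiled away, so a failing assertion passes silently while an explicit `if ...: raise` still fires. A small self-contained demonstration; the helper name `exits_cleanly` is illustrative only.

```python
import subprocess
import sys

SNIPPET_ASSERT = "assert False, 'invariant violated'"
SNIPPET_RAISE = "raise ValueError('invariant violated')"


def exits_cleanly(code: str, optimize: bool) -> bool:
    """Run code in a fresh interpreter (optionally with -O); True if it exits with status 0."""
    cmd = [sys.executable] + (["-O"] if optimize else []) + ["-c", code]
    return subprocess.run(cmd, capture_output=True).returncode == 0


print(exits_cleanly(SNIPPET_ASSERT, optimize=False))  # False: the assertion fires
print(exits_cleanly(SNIPPET_ASSERT, optimize=True))   # True: -O strips the assertion
print(exits_cleanly(SNIPPET_RAISE, optimize=True))    # False: the explicit check still fires
```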
972-985: Assertions used for critical runtime validation. These assertions validate critical invariants (node index bounds, input index bounds, tensor name matching) but are disabled under `python -O`. Consider explicit checks that raise `ValueError`/`IndexError`/`KeyError` for production safety.

🛡️ Proposed fix

```diff
 if node_index is not None:
-    assert node_index < len(graph.nodes), "Node index out of range"
+    if node_index >= len(graph.nodes):
+        raise IndexError(f"Node index {node_index} out of range (max: {len(graph.nodes) - 1})")
     target_node = graph.nodes[node_index]
-    assert input_index is not None, "Input index must be set when node index is set"
-    assert input_index < len(target_node.inputs), (
-        f"Input index out of range for node {target_node.name}"
-    )
+    if input_index is None:
+        raise ValueError("Input index must be set when node index is set")
+    if input_index >= len(target_node.inputs):
+        raise IndexError(f"Input index {input_index} out of range for node {target_node.name}")
     original_tensor = target_node.inputs[input_index]
-    assert tensor_name == original_tensor.name, (
-        f"Tensor name mismatch for node {target_node.name} input {input_index}"
-    )
+    if tensor_name != original_tensor.name:
+        raise ValueError(f"Tensor name mismatch: expected '{tensor_name}', got '{original_tensor.name}'")
 else:
-    assert tensor_name in tensor_map, f"Tensor {tensor_name} not found in tensor map"
-    assert input_index is None, "Input index must be None when node index is None"
+    if tensor_name not in tensor_map:
+        raise KeyError(f"Tensor {tensor_name} not found in tensor map")
+    if input_index is not None:
+        raise ValueError("Input index must be None when node index is None")
```
1042-1049: Consider an iterative approach for deep region hierarchies. `_visit_region_recursively` uses recursion, which could hit Python's recursion limit for very deep region hierarchies. While this is unlikely for typical ONNX models, an iterative approach would be more robust.

♻️ Iterative alternative

```python
def _visit_region_recursively(self, region: Region) -> list[Region]:
    """Iteratively traverse the region hierarchy and collect all regions."""
    regions = []
    stack = [region]
    while stack:
        current = stack.pop()
        regions.append(current)
        # Push children in reverse so they are popped in their original order,
        # preserving the pre-order of the recursive version.
        stack.extend(reversed(list(current.get_children())))
    return regions
```
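A quick way to confirm that an explicit-stack traversal keeps the same visiting order as the recursive one is to compare both on a toy tree. `ToyRegion` is a hypothetical stand-in exposing only `get_children()`, and `visit_recursive` assumes the current implementation collects a region before its children, in child order. Note that a plain `stack.extend(children)` would visit siblings in reverse; pushing them reversed restores pre-order.

```python
from __future__ import annotations


class ToyRegion:
    """Hypothetical stand-in for Region: a name plus ordered children."""

    def __init__(self, name: str, children: list[ToyRegion] | None = None):
        self.name = name
        self.children = children or []

    def get_children(self) -> list[ToyRegion]:
        return self.children


def visit_recursive(region: ToyRegion) -> list[ToyRegion]:
    """Pre-order: the region itself, then each child's subtree in order."""
    regions = [region]
    for child in region.get_children():
        regions.extend(visit_recursive(child))
    return regions


def visit_iterative(region: ToyRegion) -> list[ToyRegion]:
    """Same pre-order using an explicit stack instead of the call stack."""
    regions, stack = [], [region]
    while stack:
        current = stack.pop()
        regions.append(current)
        stack.extend(reversed(list(current.get_children())))
    return regions


tree = ToyRegion("root", [ToyRegion("a", [ToyRegion("a1"), ToyRegion("a2")]), ToyRegion("b")])
print([r.name for r in visit_iterative(tree)])  # ['root', 'a', 'a1', 'a2', 'b']
```

Because the iterative version never grows the call stack, it also handles a long chain of nested regions that would exceed the default recursion limit.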
Branch updated from b5032ed to 1ffcf7f
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Branch updated from 1ffcf7f to bd18dfa
Review thread on tests/unit/onnx/quantization/autotune/autotune/test_autotuner.py (outdated, resolved)
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
/ok to test b02fef1
Signed-off-by: Will Guo <willg@nvidia.com>
Branch updated from c5340bb to 68ffa5a
/ok to test 68ffa5a
What does this PR do?
This PR implements the QDQAutotuner class, which drives the main autotuner workflow.
The workflow is:
This PR is part 2/4 of #703.
- PR 3.1: #837
- PR 3.2: #838
- PR 3.3: #839
Overview: ?
Testing
Before your PR is "Ready for review"
Additional Information