
Improve topology detection performance #514

Open
artulab wants to merge 1 commit into main from artulab/improve_topo_perf


artulab commented Apr 21, 2026

Motivation

Technical Details

Currently, every helper function that touches the vendor library performs its own init() + shutdown() pair inside a try/finally block. Within a single topology discovery the pair is repeated for every helper call, which performs very poorly. With this change, init() and shutdown() are each called once, and initialization is deferred until it is first required.
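
The pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeVendorLib`, `VendorSession`, `vendor_scope`, and `query_device_count` are hypothetical names, and the fake library stands in for the real NVML/AMDSMI bindings, which are not shown.

```python
from contextlib import contextmanager


class FakeVendorLib:
    """Stand-in for a vendor library (hypothetical); counts lifecycle calls."""

    def __init__(self):
        self.init_calls = 0
        self.shutdown_calls = 0

    def init(self):
        self.init_calls += 1

    def shutdown(self):
        self.shutdown_calls += 1

    def device_count(self):
        return 8


class VendorSession:
    """Initializes the library lazily, at most once per session."""

    def __init__(self, lib):
        self._lib = lib
        self._initialized = False

    def lib(self):
        # The first query triggers init(); later queries reuse the live library.
        if not self._initialized:
            self._lib.init()
            self._initialized = True
        return self._lib

    def close(self):
        # Shut down only if init() actually ran.
        if self._initialized:
            self._lib.shutdown()
            self._initialized = False


@contextmanager
def vendor_scope(lib):
    """Outer scope: one session shared by every helper inside the block."""
    session = VendorSession(lib)
    try:
        yield session
    finally:
        session.close()


def query_device_count(session):
    # Query helper: no init()/shutdown() pair of its own.
    return session.lib().device_count()


fake = FakeVendorLib()
with vendor_scope(fake) as session:
    counts = [query_device_count(session) for _ in range(3)]
```

In this sketch, three queries cost one init() and one shutdown(), and a scope in which no helper runs never touches the library at all.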

This PR also updates device_utils.py: unit tests that need no GPU currently fail on machines without a supported AMD GPU, because the extra.hip package is only available on systems with AMD GPUs.
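
One common way to keep a module importable when an optional backend is missing is to guard the import and defer the failure to the point of use. The sketch below is an assumption about the general technique, not the change made in device_utils.py: `load_optional`, `require_hip`, and the module name `no_such_hip_backend_for_demo` are hypothetical and exist only to exercise the fallback path.

```python
import importlib


def load_optional(module_name):
    """Return the module, or None if it cannot be imported on this machine."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Hypothetical module name, chosen so the fallback path runs everywhere.
_hip = load_optional("no_such_hip_backend_for_demo")
HIP_AVAILABLE = _hip is not None


def require_hip(feature):
    """Fail at call time, not import time, when the backend is absent."""
    if not HIP_AVAILABLE:
        raise RuntimeError(f"{feature} requires an AMD GPU backend")
    return _hip
```

With this shape, tests that never call a HIP-dependent function can import the module on any machine, and only code that actually needs the backend sees an error.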

Test Plan

python -m pytest tests/unittests/test_topology.py -v

Test Result

tests/unittests/test_topology.py::TestFabricInfo::test_empty_fabric_info PASSED                                                         [  1%]
tests/unittests/test_topology.py::TestFabricInfo::test_valid_fabric_info PASSED                                                         [  2%]
tests/unittests/test_topology.py::TestFabricInfo::test_domain_key_comparison PASSED                                                     [  3%]
tests/unittests/test_topology.py::TestFabricInfo::test_empty_domain_keys_do_not_match PASSED                                            [  5%]
tests/unittests/test_topology.py::TestFabricInfo::test_serialization_roundtrip PASSED                                                   [  6%]
tests/unittests/test_topology.py::TestFabricInfo::test_from_dict_missing_keys PASSED                                                    [  7%]
tests/unittests/test_topology.py::TestGPUInfo::test_serialization_roundtrip PASSED                                                      [  8%]
tests/unittests/test_topology.py::TestGPUInfo::test_from_dict_does_not_mutate_input PASSED                                              [ 10%]
tests/unittests/test_topology.py::TestGPUInfo::test_from_dict_missing_fabric PASSED                                                     [ 11%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_standard_format PASSED                                                    [ 12%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_uppercase PASSED                                                          [ 13%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_nvidia_8char_domain PASSED                                                [ 15%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_prefix_junk PASSED                                                        [ 16%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_no_match PASSED                                                           [ 17%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_nvml_query_helpers_do_not_manage_lifecycle PASSED                    [ 18%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_amdsmi_query_helpers_do_not_manage_lifecycle PASSED                  [ 20%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_nvml_outer_helpers_initialize_and_shutdown PASSED                    [ 21%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_amdsmi_outer_helpers_initialize_and_shutdown PASSED                  [ 22%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_get_gpu_fabric_info_initializes_and_shuts_down_standalone PASSED     [ 23%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_missing_library_fallbacks_preserve_logs PASSED                       [ 25%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_discover_uses_outer_vendor_lifecycle PASSED                          [ 26%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_discover_shuts_down_outer_vendor_lifecycle_on_exception PASSED       [ 27%]
tests/unittests/test_topology.py::TestVendorLibraryLifecycle::test_discover_unknown_vendor_uses_noop_outer_lifecycle PASSED             [ 28%]
tests/unittests/test_topology.py::TestNodeInfo::test_get_link_type_self PASSED                                                          [ 30%]
tests/unittests/test_topology.py::TestNodeInfo::test_get_link_type_out_of_bounds PASSED                                                 [ 31%]
tests/unittests/test_topology.py::TestNodeInfo::test_get_link_type_no_matrix PASSED                                                     [ 32%]
tests/unittests/test_topology.py::TestNodeInfo::test_p2p_access_out_of_bounds PASSED                                                    [ 33%]
tests/unittests/test_topology.py::TestNodeInfo::test_p2p_access_self_always_true PASSED                                                 [ 35%]
tests/unittests/test_topology.py::TestTopologyMap::test_same_rank_is_intra_node PASSED                                                  [ 36%]
tests/unittests/test_topology.py::TestTopologyMap::test_same_node_is_intra_node PASSED                                                  [ 37%]
tests/unittests/test_topology.py::TestTopologyMap::test_same_fabric_different_node_is_fabric PASSED                                     [ 38%]
tests/unittests/test_topology.py::TestTopologyMap::test_no_fabric_is_rdma PASSED                                                        [ 40%]
tests/unittests/test_topology.py::TestTopologyMap::test_node_peers PASSED                                                               [ 41%]
tests/unittests/test_topology.py::TestTopologyMap::test_fabric_domain_peers PASSED                                                      [ 42%]
tests/unittests/test_topology.py::TestTopologyMap::test_rdma_peers PASSED                                                               [ 43%]
tests/unittests/test_topology.py::TestTopologyMap::test_peer_groups_partition_world PASSED                                              [ 45%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_intra_node PASSED                                                   [ 46%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_fabric_includes_standalone PASSED                                   [ 47%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_fabric_domain_group_content PASSED                                  [ 48%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_fabric_is_not_empty_when_domains_exist PASSED                       [ 50%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_rdma_is_world PASSED                                                [ 51%]
tests/unittests/test_topology.py::TestTopologyMap::test_heap_plan_completeness PASSED                                                   [ 52%]
tests/unittests/test_topology.py::TestTopologyMap::test_heap_plan_rank4 PASSED                                                          [ 53%]
tests/unittests/test_topology.py::TestTopologyMap::test_heap_plan_no_peer_overlap PASSED                                                [ 55%]
tests/unittests/test_topology.py::TestTopologyMap::test_summary_contains_all_nodes PASSED                                               [ 56%]
tests/unittests/test_topology.py::TestTopologyMap::test_ranks_for_fabric_domain PASSED                                                  [ 57%]
tests/unittests/test_topology.py::TestTopologyMap::test_ranks_for_nonexistent_domain PASSED                                             [ 58%]
tests/unittests/test_topology.py::TestOversubscription::test_num_gpus_is_physical_count PASSED                                          [ 60%]
tests/unittests/test_topology.py::TestOversubscription::test_link_type_by_gpu_id PASSED                                                 [ 61%]
tests/unittests/test_topology.py::TestOversubscription::test_p2p_by_gpu_id PASSED                                                       [ 62%]
tests/unittests/test_topology.py::TestOversubscription::test_all_ranks_are_node_peers PASSED                                            [ 63%]
tests/unittests/test_topology.py::TestIsolationCollapse::test_num_gpus_not_collapsed PASSED                                             [ 65%]
tests/unittests/test_topology.py::TestIsolationCollapse::test_both_ranks_are_node_peers PASSED                                          [ 66%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_no_fabric_all_rdma PASSED                                                   [ 67%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_fabric_peers_empty PASSED                                                   [ 68%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_comm_groups_no_fabric_is_empty PASSED                                       [ 70%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_comm_groups_intra_node_still_correct PASSED                                 [ 71%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_comm_groups_rdma_still_covers_world PASSED                                  [ 72%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_heap_plan_no_fabric_peers PASSED                                            [ 73%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_comm_groups_fabric_spans_nodes PASSED                                      [ 75%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_comm_groups_no_standalone_groups PASSED                                    [ 76%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_comm_groups_intra_node_still_per_host PASSED                               [ 77%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_heap_plan_fabric_peers_cross_node PASSED                                   [ 78%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_interconnect_cross_node_is_fabric PASSED                                   [ 80%]
tests/unittests/test_topology.py::TestDistributed::test_all_gather_strings SKIPPED (No distributed process group)                       [ 81%]
tests/unittests/test_topology.py::TestDistributed::test_all_gather_strings_empty SKIPPED (No distributed process group)                 [ 82%]
tests/unittests/test_topology.py::TestDistributed::test_all_gather_strings_large_payload SKIPPED (No distributed process group)         [ 83%]
tests/unittests/test_topology.py::TestFullDiscovery::test_discover_returns_topology SKIPPED (No distributed process group)              [ 85%]
tests/unittests/test_topology.py::TestFullDiscovery::test_local_rank_is_unique_per_node SKIPPED (No distributed process group)          [ 86%]
tests/unittests/test_topology.py::TestFullDiscovery::test_own_rank_info_correct SKIPPED (No distributed process group)                  [ 87%]
tests/unittests/test_topology.py::TestFullDiscovery::test_interconnect_symmetry SKIPPED (No distributed process group)                  [ 88%]
tests/unittests/test_topology.py::TestFullDiscovery::test_peer_partition_exhaustive SKIPPED (No distributed process group)              [ 90%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_no_env_var_returns_logical PASSED                                 [ 91%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_nvidia_remapping PASSED                                           [ 92%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_amd_hip_visible PASSED                                            [ 93%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_amd_rocr_fallback PASSED                                          [ 95%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_hip_takes_priority_over_rocr PASSED                               [ 96%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_logical_out_of_range_returns_logical PASSED                       [ 97%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_uuid_style_entry_returns_logical PASSED                           [ 98%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_negative_index_passthrough PASSED                                 [100%]

======================================================== 72 passed, 8 skipped in 0.87s ========================================================

Submission Checklist

Copilot AI review requested due to automatic review settings April 21, 2026 21:56
github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels Apr 21, 2026

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves GPU topology discovery performance by centralizing vendor-library initialization/shutdown into outer scopes, and makes Triton HIP utilities import-safe on non-AMD systems.

Changes:

  • Move NVML/AMDSMI lifecycle management out of low-level query helpers and into outer “scope” helpers.
  • Add unit tests validating vendor lifecycle behavior and fallbacks when vendor libs are missing.
  • Make iris/device_utils.py resilient to missing triton.language.extra.hip symbols.
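
The lifecycle tests mentioned in the list above could look roughly like this. It is a sketch only: `fake_vendor`, `query_helper`, and `outer_helper` are hypothetical stand-ins, not the fake NVML/AMDSMI modules or the helpers in iris/topology.py. The fake module is injected through `sys.modules` so the code under test imports it like a real library, and an ordered call log shows who owns init() and shutdown().

```python
import sys
import types
from unittest import mock


def make_fake_vendor_module(calls):
    """Build a fake module (hypothetical API) that logs lifecycle calls."""
    module = types.ModuleType("fake_vendor")
    module.init = lambda: calls.append("init")
    module.shutdown = lambda: calls.append("shutdown")
    module.device_count = lambda: (calls.append("query"), 2)[1]
    return module


def query_helper():
    # Code under test: queries only, no lifecycle management.
    import fake_vendor
    return fake_vendor.device_count()


def outer_helper():
    # Code under test: owns init()/shutdown() around all queries.
    import fake_vendor
    fake_vendor.init()
    try:
        return [query_helper(), query_helper()]
    finally:
        fake_vendor.shutdown()


calls = []
with mock.patch.dict(sys.modules, {"fake_vendor": make_fake_vendor_module(calls)}):
    result = outer_helper()
```

Asserting on the ordered log, rather than only on call counts, also catches a helper that shuts the library down before a later query runs.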

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
tests/unittests/test_topology.py | Adds lifecycle-focused unit tests and fake NVML/AMDSMI modules to validate new behavior.
iris/topology.py | Introduces outer init/shutdown helpers and removes repeated init/shutdown from query helpers to reduce overhead.
iris/device_utils.py | Wraps HIP-only Triton imports and adds compile-time assertions for missing symbols.

Comment thread iris/topology.py Outdated
Comment thread iris/device_utils.py Outdated
Comment thread iris/topology.py Outdated
artulab force-pushed the artulab/improve_topo_perf branch from a32781a to 6a53fbc April 22, 2026 18:51