From 279a22c988ff265c83f942b6cfc90281ed621262 Mon Sep 17 00:00:00 2001
From: Eric Christenson
Date: Wed, 15 Apr 2026 17:00:00 -0500
Subject: [PATCH 1/2] nvidia: make Resizable BAR resize failure non-fatal

nv_resize_pcie_bars() is an optimization: it tries to grow BAR1 to the
largest size the hardware advertises so the CPU can address the full
VRAM directly. When the resize fails -- typically because the upstream
bridge's prefetchable MMIO window is too small to accommodate the
requested size -- the driver currently treats this as a fatal probe
error and bails out via err_zero_dev, preventing the GPU from binding
at all.

This is overly aggressive. The GPU is still perfectly usable with its
existing (un-resized) BAR allocation; that is the entire point of
Resizable BAR being an optional enhancement rather than a hard
requirement. Systems that cannot accommodate the full resize include:

  - Thunderbolt / USB4 eGPU enclosures, where the hotplug PCIe bridge
    prefetchable window is typically hundreds of MiB, not tens of GiB.
    With a modern GPU advertising a maximum BAR1 size of 16-32 GiB,
    pci_resize_resource() returns -ENOENT and nv_pci_probe() fails the
    whole device, so the eGPU silently never appears in nvidia-smi.

  - Hypervisor guests where the host has passed a constrained MMIO
    window through to the guest.

  - Older chipsets with small prefetchable windows.

  - Platforms where the firmware has locked resources conservatively
    (preserve_config set). The existing code already detects
    preserve_config and returns early without failure -- this patch
    extends the same "skip but keep going" principle to all other
    failure modes.

Replace the goto err_zero_dev with a warning print and continue probe.
The device will bind with whatever BAR size was allocated at PCI
enumeration time, which for constrained bridges is already the largest
size that fits.

Tested on an RTX 5090 in a Gigabyte Aorus RTX 5090 AI Box (TB5) on
Fedora 43 / kernel 7.0.0-rc7 + open-gpu-kernel-modules 595.58.03.
Without this patch, the eGPU fails to bind during probe with the
"Fatal Error while attempting to resize PCIe BARs" message and no
further action is possible. With this patch, the eGPU binds
successfully with its initial 256 MiB BAR1 (the largest that fits the
Thunderbolt hotplug bridge prefetch window) and works normally for CUDA
compute workloads.

Signed-off-by: Eric Christenson
---
 kernel-open/nvidia/nv-pci.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel-open/nvidia/nv-pci.c b/kernel-open/nvidia/nv-pci.c
index 996d5c0e5..60a7b9ccb 100644
--- a/kernel-open/nvidia/nv-pci.c
+++ b/kernel-open/nvidia/nv-pci.c
@@ -1949,9 +1949,18 @@ nv_pci_probe
         goto err_zero_dev;
 
     if (nv_resize_pcie_bars(pci_dev)) {
-        nv_printf(NV_DBG_ERRORS,
-            "NVRM: Fatal Error while attempting to resize PCIe BARs.\n");
-        goto err_zero_dev;
+        /*
+         * Resizable BAR is an enhancement, not a requirement. When
+         * the resize fails (commonly because the upstream bridge
+         * prefetchable window is too small to accommodate a GiB-scale
+         * BAR, as seen with Thunderbolt/USB4 hotplug bridges), the
+         * device is still functional with its existing BAR
+         * allocation. Do not turn a minor performance degradation
+         * into a hard probe failure -- log a warning and continue.
+         */
+        nv_printf(NV_DBG_WARNINGS,
+            "NVRM: PCIe BAR resize failed; continuing with the existing "
+            "BAR allocation.\n");
     }
 
     nvl->all_mappings_revoked = NV_TRUE;

From 9e0563d51fb99dc7569dda27ffadc44bc8f6d353 Mon Sep 17 00:00:00 2001
From: Eric Christenson
Date: Wed, 15 Apr 2026 19:28:11 -0500
Subject: [PATCH 2/2] nvidia: skip Resizable BAR for Thunderbolt-attached
 devices

The previous commit ("nvidia: make Resizable BAR resize failure
non-fatal") is the primary bug fix: it ensures that a failed resize no
longer prevents device binding. This commit is a complementary
optimization on top of that fix.

Thunderbolt / USB4 hotplug PCIe bridges fundamentally cannot host a
GiB-scale prefetchable MMIO window: the bridge prefetchable allocation
on these buses is typically bounded to hundreds of MiB, which is far
smaller than the multi-GiB BAR1 a modern NVIDIA GPU advertises.
Attempting the resize on such a device wastes probe time, emits an
uninformative -ENOENT in the kernel log, and then takes the failure
path (now softened to a warning by the previous commit).

Avoid all of that by detecting Thunderbolt attachment up front via
pci_is_thunderbolt_attached(), which walks the parent bridge chain
looking for any bridge with is_thunderbolt set (set by the PCI core's
existing quirks table for known Intel TB host controllers).

The helper has been available in <linux/pci.h> since Linux v4.15
(2017-12-04). For older kernels, the code is gated behind a conftest
check (NV_PCI_IS_THUNDERBOLT_ATTACHED_PRESENT) and the original resize
attempt is used unchanged; older kernels also predate most of the
hardware this optimization targets, so the protection is low-value
there but the guard keeps the driver build-clean on ancient trees.

Non-Thunderbolt devices are unaffected: pci_is_thunderbolt_attached()
returns false for any GPU on a native PCIe slot (CPU root complex or
chipset downstream port), so normal ReBAR continues to run and GPUs
keep their full resized BAR1.

Tested together with the previous commit on an RTX 5090 in a Gigabyte
Aorus RTX 5090 AI Box (TB5), with a second, internal RTX 5090 installed
in a PCIe 5.0 x16 slot. Result: the internal card keeps its full 32 GiB
resized BAR1 (verified by lspci and /sys/bus/pci/devices/.../resource);
the eGPU stays at 256 MiB BAR1 (the largest that fits the TB5 hotplug
bridge window) and binds cleanly without the resize attempt.

Signed-off-by: Eric Christenson
---
 kernel-open/conftest.sh          | 17 +++++++++++++++++
 kernel-open/nvidia/nv-pci.c      | 19 +++++++++++++++++++
 kernel-open/nvidia/nvidia.Kbuild |  1 +
 3 files changed, 37 insertions(+)

diff --git a/kernel-open/conftest.sh b/kernel-open/conftest.sh
index 6df97a5be..0ffa53b4a 100755
--- a/kernel-open/conftest.sh
+++ b/kernel-open/conftest.sh
@@ -4067,6 +4067,23 @@ compile_test() {
             compile_check_conftest "$CODE" "NV_PCI_REBAR_GET_POSSIBLE_SIZES_PRESENT" "" "functions"
         ;;
 
+        pci_is_thunderbolt_attached)
+            #
+            # Determine if the pci_is_thunderbolt_attached() function is
+            # present.
+            #
+            # Added by commit 0d4ef7ce7e78 ("PCI: Identify Thunderbolt
+            # devices") in v4.15.
+            #
+            CODE="
+            #include <linux/pci.h>
+            void conftest_pci_is_thunderbolt_attached(void) {
+                pci_is_thunderbolt_attached();
+            }"
+
+            compile_check_conftest "$CODE" "NV_PCI_IS_THUNDERBOLT_ATTACHED_PRESENT" "" "functions"
+        ;;
+
         pci_resize_resource_has_exclude_bars_arg)
             #
             # Determine if pci_resize_resource() has exclude_bars argument.
diff --git a/kernel-open/nvidia/nv-pci.c b/kernel-open/nvidia/nv-pci.c
index 60a7b9ccb..7c0f1fa6b 100644
--- a/kernel-open/nvidia/nv-pci.c
+++ b/kernel-open/nvidia/nv-pci.c
@@ -208,6 +208,25 @@ static int nv_resize_pcie_bars(struct pci_dev *pci_dev) {
         return 0;
     }
 
+#if defined(NV_PCI_IS_THUNDERBOLT_ATTACHED_PRESENT)
+    /*
+     * Thunderbolt / USB4 hotplug bridges have a small prefetchable MMIO
+     * window that cannot accommodate a GiB-scale resized BAR. Skip
+     * the resize attempt proactively rather than trying and failing,
+     * which avoids an uninformative -ENOENT in the kernel log and
+     * sidesteps the failure path entirely.
+     */
+    if (pci_is_thunderbolt_attached(pci_dev))
+    {
+        nv_printf(NV_DBG_INFO,
+            "NVRM: %04x:%02x:%02x.%x: device is downstream of Thunderbolt, "
+            "skipping BAR1 resize\n",
+            NV_PCI_DOMAIN_NUMBER(pci_dev), NV_PCI_BUS_NUMBER(pci_dev),
+            NV_PCI_SLOT_NUMBER(pci_dev), PCI_FUNC(pci_dev->devfn));
+        return 0;
+    }
+#endif
+
     // Check if BAR1 has PCIe rebar capabilities
     sizes = pci_rebar_get_possible_sizes(pci_dev, NV_GPU_BAR1);
     if (sizes == 0) {
diff --git a/kernel-open/nvidia/nvidia.Kbuild b/kernel-open/nvidia/nvidia.Kbuild
index 6996bad11..d2e04549a 100644
--- a/kernel-open/nvidia/nvidia.Kbuild
+++ b/kernel-open/nvidia/nvidia.Kbuild
@@ -120,6 +120,7 @@ NV_CONFTEST_FUNCTION_COMPILE_TESTS += pde_data
 NV_CONFTEST_FUNCTION_COMPILE_TESTS += xen_ioemu_inject_msi
 NV_CONFTEST_FUNCTION_COMPILE_TESTS += phys_to_dma
 NV_CONFTEST_FUNCTION_COMPILE_TESTS += pci_rebar_get_possible_sizes
+NV_CONFTEST_FUNCTION_COMPILE_TESTS += pci_is_thunderbolt_attached
 NV_CONFTEST_FUNCTION_COMPILE_TESTS += get_backlight_device_by_name
 NV_CONFTEST_FUNCTION_COMPILE_TESTS += dma_direct_map_resource
 NV_CONFTEST_FUNCTION_COMPILE_TESTS += flush_cache_all
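
Note for reviewers: the combined behaviour of the two patches can be modelled in userspace roughly as below. This is only a sketch under stated assumptions, not driver code. struct mock_gpu, mock_resize_bar1() and mock_probe() are hypothetical names, and the failure rule (the resize fails with -ENOENT whenever the advertised maximum exceeds the bridge window) is a simplification of what the PCI core actually does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Hypothetical stand-in for the few device properties this flow cares about. */
struct mock_gpu {
    bool     thunderbolt_attached; /* what pci_is_thunderbolt_attached() would report */
    uint64_t bridge_window;        /* upstream prefetchable window, in bytes */
    uint64_t bar1_size;            /* BAR1 size assigned at enumeration, in bytes */
    uint64_t bar1_max;             /* largest BAR1 size the GPU advertises, in bytes */
};

/*
 * Models nv_resize_pcie_bars(): returns 0 when the resize succeeded or was
 * skipped, and a negative errno when it was attempted and failed.
 */
static int mock_resize_bar1(struct mock_gpu *gpu)
{
    assert(gpu != NULL);

    /* Patch 2/2: behind Thunderbolt, do not even attempt the resize. */
    if (gpu->thunderbolt_attached)
        return 0;

    /* Assumed shortcut: nothing to do if BAR1 is already at its maximum. */
    if (gpu->bar1_size >= gpu->bar1_max)
        return 0;

    /* Simplified failure model: the request must fit the bridge window. */
    if (gpu->bar1_max > gpu->bridge_window)
        return -ENOENT;

    gpu->bar1_size = gpu->bar1_max;
    return 0;
}

/* Models the relevant step of nv_pci_probe() after patch 1/2. */
static int mock_probe(struct mock_gpu *gpu)
{
    int rc = mock_resize_bar1(gpu);

    /* A failed resize is logged but no longer aborts the probe. */
    if (rc != 0)
        fprintf(stderr,
                "BAR resize failed (%d); continuing with %llu MiB BAR1\n",
                rc, (unsigned long long)(gpu->bar1_size / MIB));

    return 0;
}
```

In this model a Thunderbolt-attached GPU keeps its enumeration-time BAR1 without any resize attempt, a non-Thunderbolt GPU behind a small window logs a warning and still binds, and a GPU in a native slot with a large enough window is resized to its advertised maximum.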