@bdandy bdandy commented Dec 11, 2025

Thunderbolt eGPU Surprise Removal Support

Prevents kernel crashes when a Thunderbolt eGPU is unplugged unexpectedly.

Changes

| Module | Files | Description |
|---|---|---|
| nvidia | nv-pci.c, nv.c, nv-acpi.c, nv-i2c.c, nv-rsync.c | Device removal tracking, nvInvalidateDeviceReferences() |
| nvidia-modeset | nvkms.c, nvkms-kapi.c, nvkms-evo.c, nvkms-dma.c | nvKmsIsDeviceValid() checks, safe event dispatch |
| nvidia-drm | nvidia-drm-drv.c, nvidia-drm-gem-*.c, nvidia-drm-fb.c | inSurpriseRemoval flag, skip nvKms calls on unplug |
| nvidia-uvm | uvm_gpu_isr.c, uvm_gpu.c, uvm_pmm_gpu.c, uvm_channel.c, uvm_gpu_semaphore.c | uvm_parent_gpu_is_accessible() guards in ISR, cleanup, and memory paths |
| RM core | nv_gpu_ops.c, rs_server.c, kernel_gsp.c, intr.c | gpuIsLost checks, graceful session teardown |
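
As a rough illustration of the "removal flag, then skip hardware calls" pattern listed for nvidia-drm, here is a minimal userspace sketch. It is not the code from this PR: `demo_device`, `demo_mark_removed`, `demo_hw_free`, and `demo_object_free` are hypothetical names made up for the example, and a real kernel implementation would also need proper locking or atomics around the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-device structure; not the driver's real type. */
struct demo_device {
    bool in_surprise_removal; /* set once by the removal path, never cleared */
    int  hw_free_calls;       /* counts calls that would have touched hardware */
};

/* Hypothetical stand-in for a call that needs a live GPU (e.g. a modeset free). */
static void demo_hw_free(struct demo_device *dev)
{
    dev->hw_free_calls++;
}

/* Removal path: mark the device first so that later teardown can check it. */
static void demo_mark_removed(struct demo_device *dev)
{
    assert(dev != NULL);
    dev->in_surprise_removal = true;
}

/*
 * Object teardown: host-side state can always be released, but the hardware
 * is only contacted if the device is still present.
 * Returns true if the hardware call was made, false if it was skipped.
 */
static bool demo_object_free(struct demo_device *dev)
{
    assert(dev != NULL);
    if (dev->in_surprise_removal) {
        fprintf(stderr, "demo: device removed, skipping hardware free\n");
        return false;
    }
    demo_hw_free(dev);
    return true;
}
```

The point of the design is that teardown still runs to completion after an unplug; only the step that would touch the missing device is skipped.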

Key Protections

  • ISR handlers: Skip HAL calls when the GPU is not accessible
  • Memory cleanup: Skip PMA/fault buffer operations for a removed GPU
  • Semaphore reads: Return cached values instead of reading GPU memory
  • DRM objects: Skip nvKms free calls during surprise removal
  • Session destroy: Warn instead of assert on orphaned devices

Testing

RTX 3060 + Thunderbolt 3: idle unplug, workload unplug, reconnect, module reload

fixes #842

CLAassistant commented Dec 11, 2025

All committers have signed the CLA.

@roger-pmta

Thanks! FYI, this appears related to some of the crashes I was seeing in #979 and worked around in #984.

bdandy commented Dec 12, 2025

I think #984 is unrelated, as it is about incorrect detection of the external GPU.

The PR was tested on Thunderbolt 3 with an RTX 3060 GPU and everything works perfectly now (I had been waiting for a fix for more than a few years).

As the PR was initially created for 580.105.08, I merged it with master.

Please review the changes and apply if possible!

I also created an AUR package for those who need it right now: https://aur.archlinux.org/packages/nvidia-open-egpu-dkms

@roger-pmta

> I think that #984 is not related as it's about wrong detection of external gpu.

Sorry, I should have been clearer: I'm also facing crashes on hot unplug and driver unload; mine just happened to be on a TB5 enclosure (which also happened to fail eGPU detection). I will try your patch alongside mine (#984) when I get a moment and see if it resolves the crashes on TB5 as well 👍

@bdandy force-pushed the fix/hotunplug branch 3 times, most recently from e941598 to 1ba8ccb (December 12, 2025 07:09)
@bdandy changed the title from "Fix: Thunderbolt eGPU hot-plug/unplug kernel crash support" to "Fix: Thunderbolt eGPU hot-plug/unplug kernel support" (Dec 13, 2025)
@bdandy changed the title from "Fix: Thunderbolt eGPU hot-plug/unplug kernel support" to "Fix: Thunderbolt eGPU hot-unplug kernel support" (Dec 18, 2025)
@Hi-Angel

I don't think this repository is maintained… I looked through the list of commits and PRs, and it turns out they never merge pull requests; going at least as far back as 2023, they have not merged a single one.

