NVIDIA Open GPU Kernel Modules Version
580.126.20 (Ubuntu package `nvidia-driver-580-open 580.126.20-1ubuntu1`, built Feb 18 2026)

```
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  580.126.20  Release Build
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04.1)
```
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 24.04.4 LTS
Kernel Release
Linux TRX5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-93808743-a5f5-6561-b7ea-e70aa68a7e75)
Describe the bug
nvidia-bug-report.log.gz
Hardware: GPU
$ nvidia-smi -L
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-93808743-a5f5-6561-b7ea-e70aa68a7e75)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-5251258f-f189-8197-1523-06bc955c221d)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-61483aef-c24c-1dd1-d09e-7323c87380d8)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-ecd9b283-ad80-8315-b9d0-09af02b663bc)
The crash is specific to GPU 0 (Blackwell GB202, sm_120). The three 3090s (sm_86, Ampere) run the same sustained workload on separate ports without issue.
Summary
Sustained zero-gap LLM inference against a local llama.cpp server on GPU 0 (RTX Pro 6000 Blackwell, GB202, sm_120) causes a silent system-wide hard hang
after ~45 minutes / ~3000 sequential chat-completion requests. The entire host becomes unresponsive: SSH drops, nvidia-smi doesn't return, the physical display
freezes. Only a hard power-cycle / BIOS watchdog reboot recovers.
Distinguishing signature: no Xid, no NVRM error, no kernel log before freeze
Unlike #1080 (GB202 + Vulkan, logs Xid 109 / Xid 8) and #1045 (RTX 5080, logs Xid 62 / 45 / 119 / 154), my hang produces zero diagnostic output:
- No Xid in `dmesg`
- No NVRM / `nvidia-modeset` error
- `journalctl` output after recovery shows the journal simply truncating at the hang moment — the kernel cannot flush before the freeze
- `nvidia-smi` never returns after onset
- `krcWatchdog` never fires
This is a harder failure mode than the other GSP timeout bugs already filed — the GSP firmware appears to take the kernel down with it before any signal can be
raised.
Reproduced across 5+ occurrences
Over 3 weeks of batch inference workloads, the same hang has reproduced at least 5 separate times, always on GPU 0 (Blackwell) and always under sustained
chat-completion load. Switching identical workload to a tensor-parallel pair of sm_86 Ampere GPUs (RTX 3090×2 on the same machine, NVLink NV4 56 GB/s) is stable for
20+ hour runs with the same request rate.
Mitigations tried (none fix the hang)
| Attempt | Effect |
| --- | --- |
| BIOS ASUS WRX90E-SAGE SE 0901 → 1317 | No change |
| llama.cpp rebuild from HEAD (SHA ff5ef8278, version 8763) | No change |
| `--parallel 2` → `--parallel 1` | No change |
| Stopped secondary llama.cpp instance on port 8012 (same GPU) | No change |
| CUDA 12.8.0 runtime | No change |
Workaround in production
Route all sustained LLM workloads to a separate tensor-parallel llama.cpp instance on GPU 2 + GPU 3 (RTX 3090 × 2, NVLink NV4). GPU 0 is restricted to bursty /
interactive use only. This works reliably for multi-hour batch inference.
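The routing itself is a one-liner; a hypothetical helper showing the split (the port numbers match the instances described in this report, the 50-request threshold is an arbitrary illustrative cutoff):

```python
# Hypothetical routing helper. Endpoints match the setup in this
# report; the sustained_threshold value is illustrative only.
AMPERE_PAIR = "http://127.0.0.1:8009"  # GPU 2 + GPU 3, tensor-parallel
BLACKWELL = "http://127.0.0.1:8008"    # GPU 0, bursty / interactive only

def endpoint_for(n_requests, sustained_threshold=50):
    """Route anything that looks like a sustained batch to the Ampere
    pair; reserve GPU 0 for short interactive bursts."""
    if n_requests >= sustained_threshold:
        return AMPERE_PAIR
    return BLACKWELL
```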
Related existing issues (different workload / different failure trace, likely same family)
To Reproduce
Environment
- GPU under test: GPU 0 only — NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202, sm_120, 97 GB VRAM, PCIe 5.0 x16)
- Motherboard / CPU / PSU: ASUS WRX90E-SAGE SE, BIOS 1317 (latest, 2026-02-03) / AMD Threadripper Pro / 1600 W Platinum
- Driver: 580.126.20 open kernel module (package `nvidia-driver-580-open 580.126.20-1ubuntu1`)
- CUDA runtime: 12.8.0 (inside `nvidia/cuda:12.8.0-runtime-ubuntu24.04` container)
- Secure Boot: disabled (required for nvidia-open DKMS)
- llama.cpp: commit `ff5ef8278`, version tag 8763, flash-attn ON
Reproduction
1. Pull a large GGUF (reproduced on `Qwen_Qwen3-32B-Q4_K_M.gguf`, ~19 GB).

2. Run llama.cpp bound to GPU 0 with exactly these flags:

   ```
   llama-server \
     --model /models/Qwen_Qwen3-32B-Q4_K_M.gguf \
     --host 0.0.0.0 --port 8008 \
     --ctx-size 32768 \
     --n-gpu-layers 999 \
     --threads 8 \
     --parallel 1 \
     --batch-size 2048 \
     --cont-batching \
     --flash-attn on \
     --reasoning-budget 0
   ```

3. Issue a tight loop of sequential chat-completion requests with no idle gap:

   ```python
   import asyncio

   import httpx

   async def main():
       async with httpx.AsyncClient() as c:
           for i in range(10_000):
               r = await c.post(
                   "http://127.0.0.1:8008/v1/chat/completions",
                   json={
                       "model": "qwen3-32b",
                       "messages": [
                           {"role": "system", "content": "You extract structured legal obligations."},
                           {"role": "user", "content": "Clause text: <1-2 KB body>. Return JSON array."},
                       ],
                       "temperature": 0.0,
                       "max_tokens": 2048,
                       "response_format": {"type": "json_schema", "json_schema": {...}},
                   },
                   timeout=300,
               )
               r.raise_for_status()

   asyncio.run(main())
   ```
Each request: ~1–2 K prompt tokens in, ~500–1500 tokens out. No asyncio.sleep, no backoff, no batching pause — each iteration fires as soon as the previous
response returns.
4. Watch nvidia-smi dmon in another terminal. GPU 0 sits at 60–90 % util continuously.
5. After roughly 3,000 requests / ~45 minutes wall-clock, the system freezes silently. SSH drops; physical console is non-responsive; hard reset required.
Control experiment (same workload, different GPU)
Identical client code, identical model, same machine, pointing at a tensor-parallel llama.cpp on GPU 2 + GPU 3 (RTX 3090 × 2, sm_86, NVLink NV4):
```
llama-server \
  --model /models/Qwen_Qwen3-32B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8009 \
  --tensor-split 0,0,1,1 \
  --ctx-size 16384 --parallel 1 \
  --n-gpu-layers 999 --flash-attn on
```
Stable for 20+ hour batch runs at the same request rate. No hang, no Xid, no warnings.
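One variable I have not isolated is whether the zero-gap request pattern itself, rather than cumulative request count, is the trigger. A hypothetical diagnostic variant of the client loop, parameterized with an idle gap (the 0.5 s default is an arbitrary starting point, not something I have validated against a live hang):

```python
import asyncio

async def run_batch(send, n_requests, idle_gap_s=0.5):
    """Drive n_requests sequential requests through `send`, sleeping
    idle_gap_s between completions. idle_gap_s=0 reproduces the
    zero-gap pattern; raising it tests whether brief GPU idle periods
    avoid the hang."""
    for i in range(n_requests):
        await send(i)  # send: async callable issuing one chat-completion request
        if idle_gap_s:
            await asyncio.sleep(idle_gap_s)
```

If a small gap avoids the hang at the same total request count, that would point at a sustained-utilization condition in the GSP rather than a resource leak.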
### Bug Incidence
Always (when GPU 0 serves sustained zero-gap inference for >30 min).
Reproduced 5+ times over ~3 weeks. Mean time to hang: ~45 min.
Zero occurrences on sm_86 (Ampere) with identical workload on the same host.
### nvidia-bug-report.log.gz
[nvidia-bug-report.log.gz](https://github.com/user-attachments/files/26814657/nvidia-bug-report.log.gz)
```
nvidia-bug-report.sh will now collect information about your
system and create the file 'nvidia-bug-report.log.gz' in the current
directory.  It may take several seconds to run.  In some
cases, it may hang trying to capture data generated dynamically
by the Linux kernel and/or the NVIDIA kernel module.  While
the bug report log file will be incomplete if this happens, it
may still contain enough data to diagnose your problem.

If nvidia-bug-report.sh hangs, consider running with the --safe-mode
and --extra-system-data command line arguments.

Please include the 'nvidia-bug-report.log.gz' log file when reporting
your bug via the NVIDIA Linux forum (see forums.developer.nvidia.com)
or by sending email to '[email protected]'.

By delivering 'nvidia-bug-report.log.gz' to NVIDIA, you acknowledge
and agree that personal information may inadvertently be included in
the output.  Notwithstanding the foregoing, NVIDIA will use the
output only for the purpose of investigating your reported issue.

Running nvidia-bug-report.sh...
Detected driver type: Running RM Driver on GPUs
complete.

Summary of Skipped Sections:
Skipped Component          | Details
================================================================================
ldd output                 | glxinfo not found
--------------------------------------------------------------------------------
vulkaninfo output          | vulkaninfo not found
--------------------------------------------------------------------------------
ibstat output              | ibstat not found
--------------------------------------------------------------------------------
acpidump output            | acpidump not found
--------------------------------------------------------------------------------
mst output                 | mst not found
--------------------------------------------------------------------------------
nvlsm-bug-report.sh output | nvlsm-bug-report.sh not found
--------------------------------------------------------------------------------

Summary of Errors:
Error Component | Details | Resolution
=========================================================================================================================
```
### More Info
### Additional observations
- **Not a thermal issue**: GPU 0 package temp sits at 60–65 °C throughout, with fan curve responding normally. Full thermal headroom.
- **Not PSU / PCIe power**: 1600 W Platinum PSU with all 12V-2×6 rails dedicated per vendor spec; no dropouts, no correctable errors on GPU 0's PCIe link until after
the hang.
- **Post-reboot `dmesg` shows correctable AER errors on an unrelated PCIe device** (`0000:02:00.0` — Intel NIC, not the GPU), which appear to be a downstream
side-effect of the hard reset, not the cause.
- **Can't test against the proprietary driver** because Blackwell (sm_120) is supported only via the open kernel module; the `.run` proprietary package does not
build against this silicon. This forecloses the standard "try proprietary" debugging path for anyone using RTX Pro 6000 Blackwell.
### What I expected
Either:
1. The GSP firmware to self-recover (as it does on my sm_86 GPUs under equivalent or heavier sustained load), **or**
2. At minimum, raise an Xid + recoverable kernel error — even Xid 119 / 154 would let the kernel log and let operators write a watchdog around it.
Instead the failure mode is an un-diagnosable silent lockup, which is a significant step backward in operability compared to Ampere and Hopper GSP behavior.
### Ask
- Confirmation whether the silent-hang class is tracked internally, or if it's a new variant of the #1080 / #1045 GSP-firmware-death family.
- Any workaround flag / firmware parameter that would either
(a) lower the sustained-inference timeout so the GSP resets before locking the host, or
(b) force Xid/NVRM reporting on GSP death so a user-space watchdog can act.
- Guidance on whether this is expected to be addressed in the 595.x / 600.x driver line; Blackwell support regressed in 590 and driver 590 dropped sm_120, so 580-open is currently the only path forward for this card.
Thank you for maintaining the open driver. Happy to run any additional diagnostics, capture perf counters before a controlled hang, or test patches against this
reproducer.