+++
title = "NVIDIA graphics card faulting after sleep/wake cycle on Void Linux"
date = "2025-09-04T19:44:19-04:00"
topics = ["software development", "CUDA"]
+++

I run Void Linux on a ThinkPad T480 with integrated Intel graphics and a discrete NVIDIA MX150. I recently started doing more CUDA development and kept running into an issue where the NVIDIA GPU would suddenly stop being detected by CUDA applications after a sleep/wake cycle. The only way I could get it back online was to reboot. Eventually I became frustrated enough to dive in and find a solution.

## The Problem

Here's what I was experiencing:

- GPU appeared normal in `nvidia-smi`
- All NVIDIA kernel modules were loaded correctly
- Device files in `/dev/nvidia*` existed with proper permissions
- But checking CUDA availability with PyTorch by running the following command would return `False`:

```bash
python -c "import torch; print(torch.cuda.is_available())"
```

The real clue eventually turned up in the kernel log (`dmesg`): an Xid 31 MMU fault, followed by an Xid 154 reporting that the GPU needed a reboot to recover:

```
NVRM: Xid (PCI:0000:01:00): 31, pid=1596, name=modprobe, Ch 00000003, 
intr 10000000. MMU Fault: ENGINE HOST6 HUBCLIENT_HOST faulted @ 0x1_01010000. 
Fault is of type FAULT_PDE ACCESS_TYPE_READ

NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) 
to 0x2 (Node Reboot Required)
```
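If you want to check for this after each resume without scrolling through the whole log, the Xid lines are easy to pick out with a regular expression. Here is a minimal sketch in Python; the function name `find_xid_events` is my own, and the pattern only assumes the `NVRM: Xid (PCI:...): <code>,` prefix seen in the log above:

```python
import re

# Matches the start of an NVRM Xid report as it appears in the log above,
# e.g. "NVRM: Xid (PCI:0000:01:00): 31, pid=1596, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]


# Two shortened lines taken from the dmesg output in this post.
sample = (
    "NVRM: Xid (PCI:0000:01:00): 31, pid=1596, name=modprobe, Ch 00000003,\n"
    "NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None)\n"
)
print(find_xid_events(sample))
```

To run it against the live log, pass in the text of `dmesg` instead of `sample`. Depending on your kernel settings, reading `dmesg` may require root.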

## The Root Cause

The issue stems from how the NVIDIA driver handles GPU memory across suspend/resume cycles. As far as I can tell, this is what happens when the system suspends and later wakes up:

1. The GPU's memory mappings and contexts can become corrupted
2. The GPU's Memory Management Unit (MMU) enters a faulted state
3. While the driver stack appears to reload correctly, the GPU hardware itself is in an inconsistent state
4. CUDA runtime fails to initialize because it can't establish proper memory contexts

This seems to be common on mobile NVIDIA GPUs (like the MX series) in laptops with hybrid graphics running Linux. At least, I've seen a few forum posts describing the same symptoms.

## The Solution

The fix is to add these parameters to your kernel command line in `/etc/default/grub`:

```bash
GRUB_CMDLINE_LINUX_DEFAULT="... nvidia-drm.modeset=1 nvidia.NVreg_PreserveVideoMemoryAllocations=1"
```

Then update GRUB:

```bash
sudo update-grub
```

And reboot to apply the changes.
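After rebooting, it is worth confirming that both parameters actually reached the kernel. The sketch below compares a command line string against the two parameters from this post. The helper names (`missing_params`, `check_running_kernel`) are hypothetical, and it normalizes `-` to `_` in parameter names because the kernel treats the two as equivalent there:

```python
from pathlib import Path

# The two parameters added to GRUB_CMDLINE_LINUX_DEFAULT in this post.
REQUIRED_PARAMS = (
    "nvidia-drm.modeset=1",
    "nvidia.NVreg_PreserveVideoMemoryAllocations=1",
)


def _normalize(param):
    """Treat '-' and '_' in the parameter name as the same character."""
    key, sep, value = param.partition("=")
    return key.replace("-", "_") + sep + value


def missing_params(cmdline, required=REQUIRED_PARAMS):
    """Return the required parameters that are absent from a kernel command line."""
    present = {_normalize(p) for p in cmdline.split()}
    return [p for p in required if _normalize(p) not in present]


def check_running_kernel(path="/proc/cmdline"):
    """Read the booted kernel's command line and report missing parameters."""
    return missing_params(Path(path).read_text())
```

Calling `check_running_kernel()` on the rebooted machine should return an empty list. Anything it does return is a parameter that did not make it onto the command line.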

### What These Parameters Do

- **`nvidia-drm.modeset=1`**: Enables kernel mode setting for the NVIDIA driver, providing better integration with the display subsystem and more robust power management
- **`nvidia.NVreg_PreserveVideoMemoryAllocations=1`**: Tells the NVIDIA driver to preserve GPU memory allocations across suspend/resume cycles, preventing the MMU faults
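The kernel command line only shows what was requested. To see what the loaded driver reports, you can read `/proc/driver/nvidia/params`. The following is a rough sketch that assumes the file uses simple `Name: value` lines; that format, and the helper names, are assumptions on my part and not something I have checked against every driver version:

```python
def parse_driver_params(text):
    """Parse 'Name: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            params[name.strip()] = value.strip()
    return params


def preserve_enabled(text):
    """True if the driver reports PreserveVideoMemoryAllocations as 1."""
    return parse_driver_params(text).get("PreserveVideoMemoryAllocations") == "1"


# Illustrative input in the assumed format, not copied from a real system.
sample = "EnableMSI: 1\nPreserveVideoMemoryAllocations: 1\n"
print(preserve_enabled(sample))
```

On a real system you would pass in the contents of `/proc/driver/nvidia/params` instead of `sample`.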

## Why This Works

These parameters ensure that:

1. GPU memory contexts are properly preserved during sleep
2. The kernel's display management system maintains better control over the GPU state
3. Memory mappings remain consistent across suspend/resume cycles
4. The GPU's MMU doesn't enter the faulted state that breaks CUDA

## Conclusion

I haven't had an issue since making this change. Hopefully someone else can benefit from this as well!