I run Void Linux on a ThinkPad T480 with integrated Intel graphics and a discrete NVIDIA MX150. When I started doing more CUDA development, I kept running into an issue where my NVIDIA GPU would suddenly stop being detected by CUDA applications. The only way I found to get it back online was to reboot. Eventually I became frustrated enough to dive in and find a solution.
## The Problem
Here's what I was experiencing:
- GPU appeared normal in `nvidia-smi`
- All NVIDIA kernel modules were loaded correctly
- Device files in `/dev/nvidia*` existed with proper permissions
- But testing CUDA availability with PyTorch returned `False`
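
A check along these lines reproduces the symptom in the last bullet. It is a minimal sketch, assuming PyTorch is installed, and not necessarily the exact command I ran; the helper name `cuda_status` is made up for illustration.

```python
import importlib.util


def cuda_status():
    """Return a short string describing whether PyTorch can see a CUDA device."""
    # Avoid a hard failure on machines without PyTorch (assumption: the
    # package is importable as "torch").
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch

    if torch.cuda.is_available():
        # Device 0 is assumed to be the discrete GPU.
        return f"available: {torch.cuda.get_device_name(0)}"
    return "unavailable"


if __name__ == "__main__":
    print(cuda_status())
```

When the GPU is in the broken state, this kind of check reports `unavailable` even though `nvidia-smi` still lists the card.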

The fix that worked for me was adding two kernel parameters:

- **`nvidia-drm.modeset=1`**: Enables kernel mode setting for the NVIDIA driver, providing better integration with the display subsystem and more robust power management
- **`nvidia.NVreg_PreserveVideoMemoryAllocations=1`**: Tells the NVIDIA driver to preserve GPU memory allocations across suspend/resume cycles, preventing the MMU faults
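
After a reboot, it is worth confirming that both parameters actually reached the kernel. The sketch below assumes they were passed on the kernel command line, which Linux exposes at `/proc/cmdline`; the function name `missing_params` is hypothetical.

```python
from pathlib import Path

REQUIRED = (
    "nvidia-drm.modeset=1",
    "nvidia.NVreg_PreserveVideoMemoryAllocations=1",
)


def _normalize(token):
    # The kernel treats '-' and '_' as equivalent in parameter names,
    # so nvidia-drm.modeset and nvidia_drm.modeset are the same option.
    return token.replace("-", "_")


def missing_params(cmdline, required=REQUIRED):
    """Return the required parameters that are absent from a kernel command line."""
    present = {_normalize(token) for token in cmdline.split()}
    return [param for param in required if _normalize(param) not in present]


if __name__ == "__main__":
    missing = missing_params(Path("/proc/cmdline").read_text())
    print("all set" if not missing else "missing: " + ", ".join(missing))
```

An empty result means both parameters are on the running kernel's command line. Anything listed as missing usually means the bootloader configuration was not regenerated or the wrong boot entry was edited.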
## Why This Works
These parameters ensure that:
1. GPU memory contexts are properly preserved during sleep
2. The kernel's display management system maintains better control over the GPU state
3. Memory mappings remain consistent across suspend/resume cycles
4. The GPU's MMU doesn't enter the faulted state that breaks CUDA
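
To see whether the loaded driver actually picked up these settings, the live values can be read back. This is a rough sketch that assumes the driver reports its options as `Name: value` lines in `/proc/driver/nvidia/params` and exposes the modeset flag at `/sys/module/nvidia_drm/parameters/modeset`. Both paths are assumptions about the installed driver and may differ between versions, and the helper names are made up.

```python
from pathlib import Path


def parse_nvidia_params(text):
    """Parse 'Name: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            params[name.strip()] = value.strip()
    return params


def read_text_or_none(path):
    """Return the file contents, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    # Assumed location of the driver's option dump.
    raw = read_text_or_none("/proc/driver/nvidia/params")
    if raw is None:
        print("nvidia params not readable (driver not loaded?)")
    else:
        value = parse_nvidia_params(raw).get("PreserveVideoMemoryAllocations")
        print("PreserveVideoMemoryAllocations =", value)

    # Assumed sysfs location of the nvidia-drm modeset flag.
    modeset = read_text_or_none("/sys/module/nvidia_drm/parameters/modeset")
    print("nvidia_drm modeset =", modeset.strip() if modeset else "unknown")
```

Running the CUDA check again after a suspend/resume cycle is the real test. This only confirms that the settings are in effect.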
## Conclusion
Haven't had an issue since making this fix! Hopefully someone else can benefit from this as well!