https://www.nvidia.com/en-us/drivers/results/
Step 1: Remove all NVIDIA drivers
sudo apt purge '^nvidia-.*' '^libnvidia-.*'
sudo apt autoremove
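To confirm the purge left nothing behind, the installed-package list can be filtered for NVIDIA entries. A minimal sketch, assuming `dpkg -l` output on stdin; the function name `list_nvidia_pkgs` is made up for illustration.

```shell
# Hypothetical helper: print the names of still-installed packages whose
# name contains "nvidia". Reads `dpkg -l` output from stdin.
list_nvidia_pkgs() {
    # Rows for installed packages start with "ii"; column 2 is the package name.
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }'
}
```

Usage: `dpkg -l | list_nvidia_pkgs`. Empty output means no NVIDIA package is still installed.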
Step 2: Update the system and install the driver
sudo apt update
sudo ubuntu-drivers devices
List the available versions
apt-cache search nvidia-driver
sudo apt install nvidia-driver-580
#sudo ubuntu-drivers autoinstall
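If picking the version by hand is not desired, the package that `ubuntu-drivers devices` flags as recommended can be extracted from its output. A rough sketch: the helper name `recommended_driver` is hypothetical, and it assumes driver lines of the form `driver : <package> - distro non-free recommended`.

```shell
# Hypothetical helper: print the package that `ubuntu-drivers devices`
# marks as "recommended". Reads that command's output from stdin.
recommended_driver() {
    # Assumed line shape: "driver   : nvidia-driver-XXX - distro non-free recommended"
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}
```

Usage: `pkg=$(ubuntu-drivers devices | recommended_driver)`, then `[ -n "$pkg" ] && sudo apt install "$pkg"` so that nothing is installed when no recommendation is found.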
Step 3: Rebuild the initramfs and reboot
sudo update-initramfs -u
sudo reboot
Step 4: Verify
After rebooting, run:
watch -n 1 nvidia-smi
nvidia-smi and nvtop both work for monitoring. Sample nvidia-smi output:
nvidia-smi
Sun Jan  4 16:24:29 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:00:10.0 Off |                    0 |
| N/A   32C    P0             23W /  300W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
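For a scripted check instead of reading the table by eye, the driver version can be queried on its own and compared with the branch that was installed. A small sketch; `driver_matches` is a hypothetical helper, not part of any NVIDIA tool.

```shell
# Hypothetical helper: succeed if a driver version string (e.g. "580.95.05")
# belongs to the expected major branch (e.g. "580").
driver_matches() {
    version=$1
    expected_major=$2
    # Strip everything from the first dot onward and compare the remainder.
    [ "${version%%.*}" = "$expected_major" ]
}
```

Usage: `v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader -i 0)`, then `driver_matches "$v" 580 && echo "driver OK: $v"`.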
Miscellaneous
Check ECC status (full query for GPU 0; append -d ECC to print only the ECC section)
nvidia-smi -q -i 0
Detailed V100 monitoring (temperature, power draw, utilization)
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,power.draw,memory.used,memory.total --format=csv -l 1
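The same CSV query can feed a simple alert instead of being watched by hand. A minimal sketch, assuming rows of `index, name, temperature` as produced with `--format=csv,noheader,nounits`; the helper name `warn_hot_gpus` and the 80 °C threshold in the usage line are illustrative.

```shell
# Hypothetical helper: read "index, name, temperature" CSV rows from stdin
# and print a warning for every GPU at or above the given temperature (C).
warn_hot_gpus() {
    awk -F', ' -v limit="$1" \
        '$3 + 0 >= limit + 0 { printf "GPU %s (%s) is at %s C\n", $1, $2, $3 }'
}
```

Usage: `nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader,nounits | warn_hot_gpus 80`.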
Log power draw and temperature (one sample per second; stop with Ctrl+C)
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw --format=csv -l 1 -f gpu_log.csv
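Once gpu_log.csv exists, a short awk pass can reduce it to a peak temperature and an average power draw. This is a sketch under the assumption that the file has one header row followed by rows of `timestamp, name, temperature, power` with the power written like `23.45 W`; `summarize_gpu_log` is a hypothetical name.

```shell
# Hypothetical helper: summarize a log with the columns
# timestamp, name, temperature.gpu, power.draw (first row is the header).
summarize_gpu_log() {
    awk -F', ' 'NR > 1 {
        temp = $3 + 0; power = $4 + 0      # "+ 0" drops the trailing " W" unit
        if (temp > max_temp) max_temp = temp
        total_power += power; n++
    }
    END {
        if (n) printf "samples=%d max_temp=%dC avg_power=%.1fW\n", n, max_temp, total_power / n
    }' "$1"
}
```

Usage: `summarize_gpu_log gpu_log.csv`.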
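Limit power
The power supply is marginal: under sustained full load its protection trips and cuts power, so cap the GPU at 220 W.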
sudo nvidia-smi -pl 220
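The card only accepts limits inside its supported range, which `nvidia-smi --query-gpu=power.min_limit,power.max_limit --format=csv,noheader,nounits` reports. A rough sketch of clamping a requested value into that range before applying it; `clamp_power_limit` is a hypothetical helper and the numbers in the usage line are placeholders.

```shell
# Hypothetical helper: clamp a requested power limit (W) into [min, max]
# and print the result as a whole number of watts.
clamp_power_limit() {
    awk -v req="$1" -v min="$2" -v max="$3" 'BEGIN {
        if (req < min) req = min
        if (req > max) req = max
        printf "%d\n", req
    }'
}
```

Usage: `sudo nvidia-smi -pl "$(clamp_power_limit 220 150 300)"`, with 150 and 300 replaced by the min and max the query returns. The limit set this way is not kept across reboots, so it has to be reapplied at boot.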
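Thermal limit (GTX 1070)
When the GPU reaches 80 °C at about 100 W, the power draw is locked there, yet the fan is not actually running at 100%. This has probably triggered the **thermal throttling protection**; raising the fan speed manually lowers the temperature and lets the overall power go back up.
Enable manual control and set the fan to 80%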
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=80"
Restore automatic control
nvidia-settings -a "[gpu:0]/GPUFanControlState=0"
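Instead of a fixed 80%, the fan speed could follow the temperature in a few steps. A minimal sketch of such a curve; the function name `fan_for_temp` and all thresholds are illustrative assumptions, not values measured on the GTX 1070.

```shell
# Hypothetical fan curve: map a GPU temperature (C) to a fan speed (%).
# The thresholds are illustrative only.
fan_for_temp() {
    temp=$1
    if   [ "$temp" -ge 80 ]; then echo 100
    elif [ "$temp" -ge 70 ]; then echo 80
    elif [ "$temp" -ge 60 ]; then echo 60
    else echo 40
    fi
}
```

Usage: `t=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0)`, then `nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=$(fan_for_temp "$t")"`. This needs a running X session, and on many setups fan control only works after the Coolbits option has been enabled in the X configuration.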