When performing data science on an Ubuntu Linux machine remotely, the fans on NVIDIA GPUs may not spin up in response to increased load. This is because NVIDIA's fan-control software generally requires a logged-in GUI desktop session. Simply forwarding X11 over SSH has little effect, so an alternative method is required.
After poking around for a while, I came up with the solution below. In short, it runs the nvidia-settings command with the xauth credentials of the GNOME Display Manager (gdm) account.
(Note that this solution isn't technically "headless" using the exact definition of the term. It does require an X11 desktop to be running, but the connection into the server is remote via SSH.)
First step: Get the userid for the "gdm" user.
$ id gdm
uid=124(gdm) gid=128(gdm) groups=128(gdm)
Second step: Verify the "gdm" user has an Xauthority file (the path contains the uid from the first step)
$ sudo ls -AFlh /run/user/124/gdm/
total 4.0K
-rwx------ 1 gdm gdm 96 Sep 8 14:58 Xauthority*
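The uid (124 here) will likely differ from machine to machine, so it may be handy to derive the path rather than hard-code it. Here is a minimal sketch, assuming the /run/user/<uid>/gdm/Xauthority layout shown above; the function names are my own and not part of any NVIDIA or GDM tooling.

```shell
# Build the Xauthority path for a given uid, following the layout shown above.
gdm_xauthority_path() {
    printf '/run/user/%s/gdm/Xauthority\n' "$1"
}

# Hypothetical helper: print the path for the local "gdm" user, or fail if
# that account does not exist on this machine.
gdm_xauthority() {
    local uid
    uid=$(id -u gdm 2>/dev/null) || return 1
    gdm_xauthority_path "$uid"
}
```

On the machine above, gdm_xauthority should print /run/user/124/gdm/Xauthority. The file itself is only readable by gdm and root, which is why the listing above needs sudo.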
Third step: Discover the GPUs on your system
$ nvidia-smi --list-gpus
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-aafbcced-b891-932a-1e58-33ead37229b4)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-85fca2d3-b11d-e748-1d39-bd235de5334e)
Fourth step: For each GPU, enable the Fan Control State (this switches the fan to manual control). The attribute is quoted so the shell does not treat the square brackets as a glob pattern.
$ sudo DISPLAY=:0 XAUTHORITY=/run/user/124/gdm/Xauthority nvidia-settings -a '[gpu:0]/GPUFanControlState=1'
Attribute 'GPUFanControlState' (server:0[gpu:0]) assigned value 1.
$ sudo DISPLAY=:0 XAUTHORITY=/run/user/124/gdm/Xauthority nvidia-settings -a '[gpu:1]/GPUFanControlState=1'
Attribute 'GPUFanControlState' (server:0[gpu:1]) assigned value 1.
Fifth step: For each GPU, set a fan speed (as a percentage)
$ sudo DISPLAY=:0 XAUTHORITY=/run/user/124/gdm/Xauthority nvidia-settings -a '[fan:0]/GPUTargetFanSpeed=25'
Attribute 'GPUTargetFanSpeed' (server:0[fan:0]) assigned value 25.
$ sudo DISPLAY=:0 XAUTHORITY=/run/user/124/gdm/Xauthority nvidia-settings -a '[fan:1]/GPUTargetFanSpeed=25'
Attribute 'GPUTargetFanSpeed' (server:0[fan:1]) assigned value 25.
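With more than a couple of cards, typing steps four and five by hand gets tedious. The sketch below loops over the GPUs instead. It assumes that fan index N belongs to GPU index N, as on my two-card system; cards that expose several fans per GPU would need a different mapping. The function names fan_assignments and apply_fan_speed are hypothetical helpers, not part of nvidia-settings.

```shell
# Print the nvidia-settings assignments for GPUs 0..count-1, one per line.
# Assumption: fan index N belongs to GPU index N.
fan_assignments() {
    local count="$1" speed="$2" i=0
    while [ "$i" -lt "$count" ]; do
        printf '[gpu:%d]/GPUFanControlState=1\n' "$i"
        printf '[fan:%d]/GPUTargetFanSpeed=%d\n' "$i" "$speed"
        i=$((i + 1))
    done
}

# Apply the assignments using the given Xauthority file and display :0.
# The GPU count is taken from the number of lines nvidia-smi lists.
apply_fan_speed() {
    local xauth="$1" speed="$2" count assignment
    count=$(nvidia-smi --list-gpus | wc -l)
    fan_assignments "$count" "$speed" | while IFS= read -r assignment; do
        sudo DISPLAY=:0 XAUTHORITY="$xauth" \
            nvidia-settings -a "$assignment" </dev/null
    done
}
```

For the machine in this post, the call would be apply_fan_speed /run/user/124/gdm/Xauthority 25. Keeping the assignment list separate from the part that runs sudo makes it easy to print and inspect the list before touching the hardware.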
Sixth step: Verify fan speeds
$ nvidia-smi --query-gpu=timestamp,gpu_bus_id,utilization.gpu,utilization.memory,temperature.gpu,fan.speed,power.draw --format=csv
timestamp, pci.bus_id, utilization.gpu [%], utilization.memory [%], temperature.gpu, fan.speed [%], power.draw [W]
2019/09/21 16:08:20.136, 00000000:01:00.0, 0 %, 3 %, 29, 27 %, 13.95 W
2019/09/21 16:08:20.139, 00000000:02:00.0, 0 %, 0 %, 31, 25 %, 11.93 W
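Eyeballing the CSV works for two cards, but the check can also be scripted. The sketch below reads "index, fan percentage" rows and reports any GPU whose fan is below a minimum. It assumes input in the form produced by nvidia-smi --query-gpu=index,fan.speed --format=csv,noheader,nounits; the fans_below name is a helper I made up.

```shell
# Read "index, fan_percent" rows on stdin. Print each GPU whose fan speed is
# below the minimum given as $1, and return non-zero if any were found.
# A non-numeric reading (for example a card with no fan sensor) counts as 0.
fans_below() {
    awk -F', *' -v min="$1" '
        $2 + 0 < min + 0 {
            print "GPU " $1 ": fan at " $2 "% (< " min "%)"
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

A typical call would be nvidia-smi --query-gpu=index,fan.speed --format=csv,noheader,nounits | fans_below 25, which prints nothing and succeeds when every fan is at or above 25%.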
If you need to modify the fan speeds, simply re-run step 5 with the new speed; there's no need to re-enable the Fan Control State. These settings persist as long as the server and the X11 session remain up and running. If the server is rebooted, you'll need to re-run those commands. You could make the settings persistent by using an rc.d boot script.
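One thing to watch for in a boot script: it may run before the display manager has started, in which case the Xauthority file will not exist yet. I haven't tested this on every setup, so treat the following as a rough sketch. It defines a hypothetical wait_for_file helper that polls for a path once per second; the default of 30 tries is an arbitrary choice.

```shell
# Poll once per second until the path exists or the tries run out.
# Returns 0 if the path appeared, 1 otherwise.
wait_for_file() {
    local path="$1" tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

A boot script running as root could call wait_for_file /run/user/124/gdm/Xauthority 60 and, if it succeeds, issue the same nvidia-settings commands from steps four and five. It needs to run as root because the gdm runtime directory is not readable by ordinary users.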