Douglas Holt
2015-02-18 15:54:29 UTC
Configuring Torque to use NVML instead of nvidia-smi prevents resetting GPUs because pbs_mom keeps devices loaded.
Resetting the GPU is needed, for example, to change ECC modes without rebooting the node, since a reboot ends the job.
Is there some method for getting pbs_mom to release the driver other than sending SIGKILL and recovering with -p?
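For reference, the workaround mentioned in the question can be sketched roughly as below. This is only an illustration under several assumptions: that pbs_mom and nvidia-smi are on the PATH, that the script runs as root, and that nothing else is using the GPU. The function names (workaround_commands, run_workaround) are hypothetical and not part of Torque. It defaults to a dry run that only prints the commands.

```python
import subprocess


def workaround_commands(gpu_index, ecc_enabled):
    """Build the command sequence for the kill / reset / recover workaround.

    Hypothetical helper: the ordering is an assumption, not a documented
    Torque procedure.
    """
    gpu = str(gpu_index)
    ecc_flag = "1" if ecc_enabled else "0"
    return [
        # Stop pbs_mom so that it no longer holds the devices.
        ["pkill", "-KILL", "-x", "pbs_mom"],
        # Set the pending ECC mode on the chosen GPU.
        ["nvidia-smi", "-i", gpu, "-e", ecc_flag],
        # Reset the GPU so the pending ECC mode takes effect.
        ["nvidia-smi", "-i", gpu, "--gpu-reset"],
        # Restart pbs_mom, asking it to preserve jobs that were running.
        ["pbs_mom", "-p"],
    ]


def run_workaround(gpu_index, ecc_enabled, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    commands = workaround_commands(gpu_index, ecc_enabled)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first failing step; note that pkill
            # exits non-zero when no pbs_mom process was found.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    run_workaround(gpu_index=0, ecc_enabled=True)
```

The question stands: this sequence is exactly what a cleaner release mechanism in pbs_mom would make unnecessary.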
(previous subject)
David Beer dbeer at adaptivecomputing.com
Fri Oct 31 10:01:06 MDT 2014
Previous message: [torqueusers] pbs_mom segfault caused by unexpected output from nvidia-smi
Douglas,
For an immediate workaround, you can configure TORQUE to use NVIDIA's API (NVML)
instead of the nvidia-smi command to get the data. This code path does not copy to
a fixed-size buffer and therefore won't segfault on you. The documentation
on how to configure it this way is here:
http://docs.adaptivecomputing.com/suite/8-0/basic/help.htm#topics/moabWorkloadManager/topics/accelerators/nvidiaGpus.htm?Highlight=nvml
Note: using the API is also faster than using the smi command.
We will also fix the issue of copying to the fixed buffer here, but I would
advise anyone to switch to the API version instead of the smi command.
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------