[torqueusers] Using NVML prevents GPU reset

Discussion:

Douglas Holt

2015-02-18 15:54:29 UTC

Configuring Torque to use NVML instead of nvidia-smi prevents resetting GPUs because pbs_mom keeps devices loaded.

Resetting the GPU is needed, for example, to change ECC modes without rebooting, since a reboot ends the job.

Is there some method for getting pbs_mom to release the driver other than sending SIGKILL and recovering with -p?

(previous subject)

David Beer dbeer at adaptivecomputing.com
Fri Oct 31 10:01:06 MDT 2014
Previous message: [torqueusers] pbs_mom segfault caused by unexpected output from nvidia-smi
Douglas,

For an immediate workaround, you can configure TORQUE to use nvidia's API
instead of the smi command to get the data. This code path does not copy to
a fixed-sized buffer and therefore won't segfault on you. The documentation
for how to configure this way is here:
http://docs.adaptivecomputing.com/suite/8-0/basic/help.htm#topics/moabWorkloadManager/topics/accelerators/nvidiaGpus.htm?Highlight=nvml

Note: using the API is also faster than using the smi command.

We will also fix the issue of copying to the fixed buffer here, but I would
advise anyone to switch to the API version instead of the smi command.

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------

David Beer

2015-02-24 16:54:30 UTC

This seems like it might be a bug in how we're handling GPUs through the
API, but I think we need more information to work out how to fix it.

Post by Douglas Holt
Configuring Torque to use NVML instead of nvidia-smi prevents resetting
GPUs because pbs_mom keeps devices loaded.

1. Does "prevents resetting" refer to changing GPU modes?
2. What do you mean by "keeps devices loaded"?

Post by Douglas Holt
Resetting the GPU is needed, for example, to change ECC modes without
rebooting, which ends the job.
Is there some method for getting pbs_mom to release the driver other than
sending SIGKILL and recovering with -p?

Is there a way to specify that we are releasing the driver through the API?

--
David Beer | Senior Software Engineer
Adaptive Computing
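
Editor's note on the question above about releasing the driver through the API: NVML exposes nvmlInit() and nvmlShutdown(), so one possible pattern is to initialize the library only for the duration of each query and shut it down afterwards, rather than holding it open for the life of the daemon. The following is a minimal sketch of that pattern, not a description of how pbs_mom works or of how the bug was eventually fixed. The names nvml_session, poll_gpu_count and FakeNvml are hypothetical; FakeNvml is a stand-in that mimics three pynvml function names so the sketch runs on a machine without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def nvml_session(nvml):
    """Initialize NVML for one query and always shut it down afterwards."""
    nvml.nvmlInit()
    try:
        yield nvml
    finally:
        # Runs even if the query raises, so the library is never left open.
        nvml.nvmlShutdown()


def poll_gpu_count(nvml):
    """Return the GPU count, holding NVML only while the query runs."""
    with nvml_session(nvml) as session:
        return session.nvmlDeviceGetCount()


class FakeNvml:
    """Hypothetical stand-in for the pynvml module (assumption, not real NVML)."""

    def __init__(self, gpu_count):
        self.gpu_count = gpu_count
        self.initialized = False
        self.calls = []  # records the order of calls for inspection

    def nvmlInit(self):
        self.initialized = True
        self.calls.append("init")

    def nvmlShutdown(self):
        self.initialized = False
        self.calls.append("shutdown")

    def nvmlDeviceGetCount(self):
        if not self.initialized:
            raise RuntimeError("NVML not initialized")
        self.calls.append("count")
        return self.gpu_count


if __name__ == "__main__":
    fake = FakeNvml(gpu_count=2)
    print(poll_gpu_count(fake))  # GPU count reported by the stand-in
    print(fake.initialized)      # False: the session was closed after the query
```

On a real node, assuming the pynvml bindings are installed, the module itself could be passed in place of the stand-in (import pynvml; poll_gpu_count(pynvml)). The trade-off is that re-initializing on every poll costs some time, so whether this suits a given polling interval would need to be measured.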

David Beer

2017-02-15 16:27:45 UTC

Douglas,

I thought I'd let you know, this bug has been fixed in 6.0-dev, 6.1-dev,
and develop. It will be released with the 6.0.4 and 6.1.1 releases.

David

--
David Beer | Torque Architect
Adaptive Computing

Jonas Mansoor

2017-02-24 10:06:55 UTC

Hi David
Is there any estimate on when 6.1.1 will be released?

Thanks!

Best regards / Med venlig hilsen

Jonas Jan Mansoor

DTU Chemistry | IT Department

Danmarks Tekniske Universitet
Kemitorvet Byg.206 Lok. 248 | DK-2800 Kgs. Lyngby | Telefon +45 4525 2452 | Mobil +45 4068 0452
