Discussion:
[torqueusers] Status of per-job accounting for NVIDIA GPUs?
Dave Ulrick
2016-11-01 18:42:52 UTC
A while back, I inquired about support for per-job TORQUE accounting for
jobs that use NVIDIA GPUs. My understanding is that the metrics provided
by the NVIDIA GPU drivers have been node-wide, but that NVIDIA was working
on a new GPU driver feature that was going to make it possible for TORQUE
to gather GPU metrics for an individual job.

Questions:

1. Any idea of what NVIDIA might be calling the new GPU driver feature?
Per-process or per-thread accounting???

2. Any idea of when TORQUE's support for this feature would ship?

3. What CUDA release and GPU driver release would be required to leverage
the TORQUE GPU support?

Thanks,
Dave
--
Dave Ulrick
d-***@comcast.net
David Beer
2016-11-01 20:49:29 UTC
Post by Dave Ulrick
1. Any idea of what NVIDIA might be calling the new GPU driver feature?
Per-process or per-thread accounting???
This is NVIDIA's DCGM (Data Center GPU Manager) tool. If I understand correctly, this is going to
replace NVML.
Post by Dave Ulrick
2. Any idea of when TORQUE's support for this feature would ship?
We were hoping to get this into the 6.1.0 release (coming out before SC
'16) but for reasons I don't want to discuss here, it did not make this
release. I would expect that we can release it soon, although I don't know
if we've committed to a date yet. I will try to remember to update the list
once we have a committed date.

Post by Dave Ulrick
3. What CUDA release and GPU driver release would be required to leverage
the TORQUE GPU support?
Torque will be linking against libdcgm, so from our perspective, installing
that library (NVIDIA probably calls it dcgm toolkit or something like that)
will be all that we require. Offhand, I don't know what driver / CUDA
requirements exist for DCGM on NVIDIA's end.

HTH,

David
_______________________________________________
torqueusers mailing list
http://www.supercluster.org/mailman/listinfo/torqueusers
--
David Beer | Torque Architect
Adaptive Computing
Dave Ulrick
2016-11-01 21:37:13 UTC
This is very useful information. I'd not heard of DCGM before. I'll be
eagerly watching the list for updates. :-) Thanks!!!

Dave
--
Dave Ulrick
d-***@comcast.net