
[HTCondor-users] Roadmap for support of NVIDIA Volta's new MPS capability



I'm looking at the "Multi-Process Service" (MPS) for NVIDIA GPUs, which gains new capabilities starting with the recently launched Volta architecture and the CUDA 9.0 Toolkit.

This allows the driver to arbitrate sharing a single GPU among multiple processes. Looks like they finally heeded Miron's advice!

Here's the overview documentation:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

It looks like it should be reasonably straightforward to pull the GPU shares managed by MPS into the HTCondor machine ClassAd and have condor_gpu_discovery advertise them as separate "virtual" GPUs. What would be more interesting, though, is the possibility of treating the GPU as a partitionable resource, for example by having HTCondor manage the MPS daemon.
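
To make the "virtual GPU" idea a bit more concrete, here is a rough Python sketch of what a discovery wrapper might print. It is only an illustration under assumptions: the function name virtual_gpu_ads, the CUDA0_0-style share names, and the MpsThreadPercentage attribute are all hypothetical and are not something condor_gpu_discovery emits today. Only the DetectedGPUs attribute name and the CUDA0-style device names mirror the existing tool's output.

```python
def virtual_gpu_ads(physical_gpus, shares_per_gpu):
    """Return ClassAd-style lines that advertise each MPS share of each
    physical GPU as a separate "virtual" GPU.

    The naming scheme and the MpsThreadPercentage attribute are
    hypothetical; they are not part of HTCondor.
    """
    if shares_per_gpu < 1:
        raise ValueError("shares_per_gpu must be at least 1")

    # One virtual name per (physical GPU, share index), e.g. CUDA0_0, CUDA0_1.
    names = [
        f"{gpu}_{share}"
        for gpu in physical_gpus
        for share in range(shares_per_gpu)
    ]

    # Assumed policy: split each GPU evenly among its shares.
    percent = 100 // shares_per_gpu

    joined = ", ".join(names)
    lines = [f'DetectedGPUs="{joined}"']
    lines += [f"{name}MpsThreadPercentage={percent}" for name in names]
    return lines


if __name__ == "__main__":
    for line in virtual_gpu_ads(["CUDA0", "CUDA1"], 2):
        print(line)
```

An even split is just the simplest policy to show; a partitionable-resource approach would presumably size each share per job rather than fixing it up front.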

We'll be getting a new machine with half a dozen NVIDIA V100 cards in it at the end of November for the Machine Learning folks, so I'll keep folks apprised.



Michael V. Pelletier
Principal Engineer
Information Technology
Future Technologies & Cloud
Integrated Defense Systems
Raytheon Company

+1 978-858-9681   (office)
+1 339-293-9149   (cell)
7-225-9681   (tie line)
Michael.V.Pelletier@xxxxxxxxxxxx

50 Apple Hill Drive
Tewksbury, MA 01876 USA
www.raytheon.com



