
Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory



You may not need to: I will shortly have an experimental Windows version of condor_gpu_monitor for you to try out. I’ll send info about this in a separate email.

 

If you want to do this yourself, you would use a feature we call STARTD_CRON, which is explained here:

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToInsertClassAdIntoMachineAds

http://research.cs.wisc.edu/htcondor/manual/v8.8/Hooks.html#x51-4450004.4.3

 

There is a metaknob for setting up a kind of startd cron job we call a MONITOR.

 

use feature : Monitor( GPUs, WaitForExit, 1, $(LIBEXEC)/condor_gpu_utilization, SUM:GPUs, PEAK:GPUsMemory )

 

GPUs is the name of the monitor. The name is arbitrary, but you should call it GPUs in this case.

 

WaitForExit is the type of STARTD_CRON job. A WaitForExit job runs forever and periodically writes an update classad to stdout.

 

$(LIBEXEC)/condor_gpu_utilization is the program to run. This program needs to write out a classad periodically; that ad will get merged into the slot ads. You can call this anything you like, but you do need to use an absolute path here.

 

SUM:GPUs, PEAK:GPUsMemory is a set of aggregation commands to the STARTD (called METRICS). These commands assume that your program will emit something like:

 

GPUs=0.9

GPUsMemory=1234

 

See STARTD_CRON_<jobname>_METRICS in the manual for more information.
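
As a rough illustration of the kind of program the monitor expects, here is a minimal Python sketch (not part of the original message). It assumes nvidia-smi is on the PATH, that GPUs should be a 0-1 usage fraction summed over all cards, that GPUsMemory should be the largest per-card memory use in MiB, and that a line containing a single dash ends each update; check the STARTD_CRON section of the manual for the exact output protocol. The function names are hypothetical.

```python
import subprocess
import sys
import time

# Standard nvidia-smi query options: one CSV row per GPU, no header, no units.
SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_smi_csv(text):
    """Turn rows of "<util %>, <MiB used>" into (fraction, MiB) pairs."""
    samples = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        try:
            samples.append((float(parts[0]) / 100.0, int(parts[1])))
        except ValueError:
            continue  # skip non-numeric values such as "[Not Supported]"
    return samples


def format_ad(samples):
    """Build one update: summed GPU usage and the largest per-GPU memory use."""
    usage = sum(u for u, _ in samples)
    memory = max((m for _, m in samples), default=0)
    # ASSUMPTION: a line with a single "-" marks the end of one update.
    return "GPUs = {:.2f}\nGPUsMemory = {}\n-\n".format(usage, memory)


def run_forever(interval_seconds=10):
    """Sample nvidia-smi every interval_seconds and write an update to stdout."""
    while True:
        result = subprocess.run(SMI_CMD, capture_output=True, text=True, check=False)
        sys.stdout.write(format_ad(parse_smi_csv(result.stdout)))
        sys.stdout.flush()  # stdout is a pipe, so flush after every update
        time.sleep(interval_seconds)
```

To try it as the monitor program, add a call to run_forever() at the end of the script and give the Monitor metaknob the absolute path to it in place of $(LIBEXEC)/condor_gpu_utilization.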

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Thursday, January 31, 2019 1:10 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Ok, I will try this. Could you maybe give me a pointer/example of what would be the recommended way to do something like this? E.g. how to make HTCondor regularly run some external tool and update a machine ad based on the result, etc.

 

Thanks a lot,

 

Jens

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Wednesday, January 30, 2019 5:01 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Yes, that should work.

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Wednesday, January 30, 2019 3:23 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Regarding the gpu_monitor: I could imagine that, in the meantime, I could script something (based on nvidia-smi for example) that regularly updates the machine ad with the current GPU usage. Would you agree? And what would be the best way to do something like that?

 

Thanks and best regards,

 

Jens

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Tuesday, January 29, 2019 4:53 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

> Regarding the gpu_monitor: I assume there is a reason why this is not built on Windows? Is there a fundamental limitation of the OS or will this be activated in one of the next versions?

 

Not a limitation of the OS that I’m aware of; it’s just a matter of changing the code to build for Windows as well as Linux. We plan on doing that work as part of the 8.9 series.

 

> In my case, condor_gpu_discovery reports CUDADriverVersion=10.0, but fails to detect the run time version since it fails to access the registry key I mentioned (see condor_gpu_discovery.cpp, line 386).

 

That code is a backstop for situations when cudart.dll does not load. Did you try adding the CUDA runtime to the path?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Tuesday, January 29, 2019 7:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Hi,

 

In my case, condor_gpu_discovery reports CUDADriverVersion=10.0, but fails to detect the run time version since it fails to access the registry key I mentioned (see condor_gpu_discovery.cpp, line 386). This key is not present in any of our systems, although we definitely have a working installation of CUDA 10. What else do I need to install to make it work?

 

Regarding the gpu_monitor: I assume there is a reason why this is not built on Windows? Is there a fundamental limitation of the OS or will this be activated in one of the next versions?

 

Thanks and best regards,

 

Jens

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Monday, January 28, 2019 5:41 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

condor_gpu_discovery does not access any registry keys. It just loads cudart.dll or nvcuda.dll and calls functions from those DLLs.

If the registry is being accessed, it is by those DLLs, so you would need to refer to the documentation from NVIDIA about registry keys.

 

What does condor_gpu_discovery report as your driver version and runtime version?

 

The value that condor_gpu_discovery reports comes from these DLLs. If the value is wrong, it is because your version of the CUDA libraries is incompatible with programs built with older versions of their SDK. We are looking into what it would take to make our gpu_discovery work with CUDA 10 without breaking backward compatibility, but so far we do not have a solution for this problem.

 

As for the gpu_monitor, this tool is currently only being built on Linux; the documentation needs to be updated to reflect that.

 

-tj

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Michael Pelletier
Sent: Friday, January 25, 2019 11:54 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

I think I ran into this same problem with CUDA 10.0 in a recent 8.6 release, and I think it had something to do with a change to the interface between the 9.x and 10.0 CUDA libraries. It was giving some really off-the-wall numbers to the collector. I believe there’s a ticket open for the issue as a result of my inquiry to support.

 

In the meantime, you can also install the 9.2 release, and then set up the library path for condor_gpu_discovery to refer to /usr/local/cuda-9.2 instead of /usr/local/cuda, and that should get you through until they come out with an update for CUDA 10 support.

 

 

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Friday, January 25, 2019 10:04 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Dear all,

 

we are using HTCondor 8.8 on Windows (Win 10 and Win 2016 specifically) with CUDA 10.0 installed. Some systems do have large GPUs, e.g. with 12 GB or even 32 GB of memory. Nevertheless, condor_gpu_discovery will only show a maximum of

CUDA0GlobalMemoryMb=4096

for these cards. I have tried to run cudaGetDeviceProperties from my own code and the memory is correctly returned, so I am not sure what is going on here. Any ideas what might be the reason? Btw: I am using the 64-bit build of HTCondor.

 

Besides that, I discovered that condor_gpu_discovery tries to access the registry key

 

"SOFTWARE\\NVIDIA Corporation\\GPU Computing Toolkit\\CUDA"

 

which does not seem to exist on any of our systems. Could you please tell me under which circumstances you would expect this key to exist?

 

Thanks a lot,

 

Jens