
Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory



this is now ticket 6883

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6883

 

NVIDIA added some fields near the beginning of the cudaDeviceProp structure, but neither the structure nor the API that fills it out has a version or size field. At present, the CUDA 10 runtime DLL is just writing past the end of the buffer that condor_gpu_discovery is passing it when we query device properties.
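
To make the failure mode concrete, here is a minimal Python ctypes sketch. The two structures are simplified, hypothetical stand-ins for the caller's older view of cudaDeviceProp and the library's newer one (the real structure has many more fields). The inserted `uuid` field, its fill pattern, and the 12 GiB value are assumptions chosen only for illustration.

```python
import ctypes


class OldProp(ctypes.Structure):
    """Hypothetical caller-side layout (what the tool was compiled against)."""
    _fields_ = [
        ("name", ctypes.c_char * 256),
        ("totalGlobalMem", ctypes.c_uint64),
    ]


class NewProp(ctypes.Structure):
    """Hypothetical library-side layout with a field inserted near the start."""
    _fields_ = [
        ("name", ctypes.c_char * 256),
        ("uuid", ctypes.c_ubyte * 16),  # assumed new field
        ("totalGlobalMem", ctypes.c_uint64),
    ]


def simulate_mismatch(real_bytes):
    # The buffer is sized for the larger layout so this simulation stays in
    # bounds; a caller that allocated only sizeof(OldProp) would be overrun.
    buf = ctypes.create_string_buffer(ctypes.sizeof(NewProp))
    written = NewProp.from_buffer(buf)
    written.uuid = (ctypes.c_ubyte * 16)(*([0xAB] * 16))
    written.totalGlobalMem = real_bytes
    # The caller interprets the same bytes using the old layout.
    seen = OldProp.from_buffer(buf)
    return seen.totalGlobalMem


overrun = ctypes.sizeof(NewProp) - ctypes.sizeof(OldProp)
misread = simulate_mismatch(12 * 1024**3)
print("bytes written past an old-sized buffer:", overrun)
print("totalGlobalMem offset (old, new):",
      OldProp.totalGlobalMem.offset, NewProp.totalGlobalMem.offset)
print("value seen through the old layout:", hex(misread))
```

Two things go wrong in this toy model: the write is longer than the old buffer, and every field after the insertion point is read from the wrong offset, so the memory size seen by the caller is unrelated to the real one.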

 

If anyone out there knows how to ask the NVIDIA runtime libraries what version of the cudaDeviceProp structure they are expecting, please let us know.
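
One avenue that may be worth checking, offered as a sketch rather than a confirmed fix: the runtime API provides cudaRuntimeGetVersion, which reports the runtime version as an integer encoded as 1000*major + 10*minor (so 10.0 is 10000 and 9.2 is 9020). It does not report the structure size itself, but a caller could use it to choose a layout. The Python below only decodes such a value and picks between two layout labels. The label names are hypothetical, and the 10.0 threshold is taken from this thread.

```python
def decode_cuda_version(encoded):
    """Split a CUDA version integer (1000*major + 10*minor) into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


def pick_layout(encoded):
    """Return a hypothetical layout label for the given runtime version."""
    major, _minor = decode_cuda_version(encoded)
    return "layout_cuda10" if major >= 10 else "layout_pre_cuda10"


for version in (9020, 10000):
    print(version, decode_cuda_version(version), pick_layout(version))
```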

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Sunday, January 27, 2019 11:41 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Thanks Michael, that sounds like it could be the reason for my issues. I assume that I could also build condor_gpu_discovery myself, linking against CUDA 10 to mitigate the problem. Can you (or anyone) confirm this?

 

On a slightly related note: I expected that I could use HTCondor 8.8’s gpu load monitoring to not only observe the gpu load created by the job, but also the overall gpu load of the system (e.g. to modify my START expression such that gpu jobs are only scheduled when no interactive gpu jobs are running). However, I cannot find any variables in my machine ads (and neither in the docs) to use for this. Is this kind of thing supported at all?

 

Thanks and best regards,

 

Jens

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Pelletier
Sent: Friday, January 25, 2019 6:54 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

I think I ran into this same problem with CUDA 10.0 in a recent 8.6 release, and I think it had something to do with a change to the interface between the 9.x and 10.0 CUDA libraries. It was giving some really off-the-wall numbers to the collector. I believe there’s a ticket open for the issue as a result of my inquiry to support.

 

In the meantime, you can also install the 9.2 release, and then set up the library path for condor_gpu_discovery to refer to /usr/local/cuda-9.2 instead of /usr/local/cuda, and that should get you through until they come out with an update for CUDA 10 support.
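
A minimal sketch of that workaround on Linux, written in Python: it builds an environment whose library search path lists the CUDA 9.2 libraries first and then runs condor_gpu_discovery with it. The `lib64` subdirectory and the use of `LD_LIBRARY_PATH` are assumptions about a typical Linux CUDA install, not something stated in this thread, so adjust them to wherever the 9.2 runtime libraries actually live.

```python
import os
import subprocess


def env_with_cuda(cuda_root, base_env=None):
    """Return a copy of the environment with cuda_root's libraries searched first."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = os.path.join(cuda_root, "lib64")  # assumed library subdirectory
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def run_discovery(cuda_root="/usr/local/cuda-9.2"):
    """Run condor_gpu_discovery against the chosen CUDA install; return its output."""
    result = subprocess.run(
        ["condor_gpu_discovery", "-properties"],
        env=env_with_cuda(cuda_root),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_discovery()` on an affected machine and comparing the reported memory with and without the override is one way to confirm that the library version is the cause.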

 

 

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Friday, January 25, 2019 10:04 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Dear all,

 

we are using HTCondor 8.8 on Windows (Win 10 and Win 2016 specifically) with CUDA 10.0 installed. Some systems do have large GPUs, e.g. with 12 GB or even 32 GB of memory. Nevertheless, condor_gpu_discovery will only show a maximum of

CUDA0GlobalMemoryMb=4096

for these cards. I have tried running cudaGetDeviceProperties from my own code and the memory is returned correctly, so I am not sure what is going on here. Any ideas what the reason might be? By the way, I am using the 64-bit build of HTCondor.
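
A quick way to spot this across several machines is to parse the tool's name=value output and compare the reported memory with what each card should have. The Python sketch below assumes output lines of the form shown above. The sample text and the expected value of 12288 MB are illustrative, not taken from a real run.

```python
def parse_properties(text):
    """Parse name=value lines into a dict, skipping lines without '='."""
    props = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def memory_mismatches(props, expected_mb):
    """Return {attribute: (reported, expected)} where the reported MB differs."""
    bad = {}
    for attr, want in expected_mb.items():
        got = int(props.get(attr, -1))  # -1 marks a missing attribute
        if got != want:
            bad[attr] = (got, want)
    return bad


sample = "CUDA0GlobalMemoryMb=4096\n"
print(memory_mismatches(parse_properties(sample), {"CUDA0GlobalMemoryMb": 12288}))
```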

 

Besides that, I discovered that condor_gpu_discovery tries to access the registry key

 

"SOFTWARE\\NVIDIA Corporation\\GPU Computing Toolkit\\CUDA"

 

which does not seem to exist on any of our systems. Could you please tell me under which circumstances you would expect this key to exist?
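
For anyone who wants to check their own machines, here is a small Python sketch that tests whether the key is present. It assumes the key would live under HKEY_LOCAL_MACHINE in the 64-bit registry view, which is a guess and is not confirmed in this thread. The doubled backslashes in the quoted path are C string escaping, so the raw string below uses single ones.

```python
try:
    import winreg  # only available on Windows
except ImportError:
    winreg = None

CUDA_KEY = r"SOFTWARE\NVIDIA Corporation\GPU Computing Toolkit\CUDA"


def cuda_key_exists(subkey=CUDA_KEY):
    """Return True or False on Windows, or None where the registry is unavailable."""
    if winreg is None:
        return None
    try:
        # Assumption: HKEY_LOCAL_MACHINE, 64-bit view.
        access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey, 0, access):
            return True
    except FileNotFoundError:
        return False


print(cuda_key_exists())
```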

 

Thanks a lot,

 

Jens