
Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory



OK, I dug a bit deeper into the code of condor_gpu_discovery. I believe the issue is that it tries to load “cudart.dll” specifically, while the real name of that DLL on my system is “cudart64_100.dll” (and will be different for each CUDA version). Creating a symlink solved this issue, so condor_gpu_discovery now also reports the correct runtime version, 10.0. At least for me, the CUDA installer did not create this symlink.
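As a minimal sketch of a more tolerant lookup (in Python for brevity; this is not how condor_gpu_discovery is implemented), one could match the versioned file name instead of requiring the fixed one. The helper name `find_cudart_candidates` is hypothetical, and the `cudart64_*.dll` pattern is an assumption generalized from the single name “cudart64_100.dll” mentioned above.

```python
import fnmatch

def find_cudart_candidates(filenames):
    """Return names that look like a CUDA runtime DLL.

    The exact name "cudart.dll" comes first, followed by versioned
    names such as "cudart64_100.dll" in alphabetical order.
    The versioned pattern is an assumption, not a documented rule.
    """
    lowered = [(name, name.lower()) for name in filenames]
    exact = [name for name, low in lowered if low == "cudart.dll"]
    versioned = sorted(
        name for name, low in lowered
        if fnmatch.fnmatchcase(low, "cudart64_*.dll")
    )
    return exact + versioned
```

In practice the input would be something like `os.listdir(bin_dir)` for the CUDA `bin` directory, and the caller would try to load each candidate in turn.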

 

The reported memory is now totally off (CUDA0GlobalMemoryMb=4977051853851), in line with your finding that CUDA 10 seems to expect a different cudaDeviceProp structure from the one that you have used. Indeed, when comparing your code to the latest CUDA headers, the struct has changed quite a bit. I am not sure how to properly adapt your code so that it would work with any CUDA version. Ideas, anyone?
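To illustrate why a changed struct layout produces a nonsensical number rather than an error, here is a small Python sketch. Both layouts are hypothetical: the 256-byte name, the 64-bit size field, and the 32 inserted bytes are illustrative only and are not taken from the CUDA headers. The point is only that a reader using the old offsets picks up whatever bytes now sit where the size field used to be.

```python
import struct

# Hypothetical "old" layout: 256-byte name, then a 64-bit memory size.
OLD_LAYOUT = struct.Struct("<256sQ")
# Hypothetical "new" layout: 32 extra bytes inserted after the name.
NEW_LAYOUT = struct.Struct("<256s32sQ")

def read_total_mem_old(buf):
    """Read the memory size using the OLD offsets, whoever filled buf."""
    _name, total = OLD_LAYOUT.unpack_from(buf, 0)
    return total

total_bytes = 8 * 1024**3          # pretend the device has 8 GiB
extra = bytes(range(1, 33))        # stand-in for the newly inserted fields

old_buf = OLD_LAYOUT.pack(b"Some GPU", total_bytes)
new_buf = NEW_LAYOUT.pack(b"Some GPU", extra, total_bytes)

print(read_total_mem_old(old_buf))  # matches total_bytes
print(read_total_mem_old(new_buf))  # unrelated value read from the inserted bytes
```

With matching layouts the value round-trips; with the mismatched layout the old reader interprets the first eight inserted bytes as the memory size.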

 

Thanks and best regards,

 

Jens

 

 

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Tuesday, January 29, 2019 6:34 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

This is what the key in question looks like for me since I updated to CUDA 10:

 

> reg query "HKLM\software\NVIDIA Corporation\GPU Computing Toolkit\CUDA"

 

HKEY_LOCAL_MACHINE\software\NVIDIA Corporation\GPU Computing Toolkit\CUDA

    FirstVersionInstalled    REG_SZ    v10.0
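For anyone who wants to check this across several machines, the output above can be parsed with a few lines of Python. This is only a sketch: the function name `parse_reg_values` is hypothetical, it handles REG_SZ values only, and it assumes the whitespace-separated layout shown in the output above.

```python
import re

def parse_reg_values(reg_output):
    """Map value names to data for REG_SZ lines in `reg query` text output."""
    values = {}
    for line in reg_output.splitlines():
        # Value lines are indented; the key path line itself is not.
        match = re.match(r"^\s+(\S.*?)\s+REG_SZ\s+(.*\S)\s*$", line)
        if match:
            values[match.group(1)] = match.group(2)
    return values

sample = (
    "HKEY_LOCAL_MACHINE\\software\\NVIDIA Corporation\\GPU Computing Toolkit\\CUDA\n"
    "\n"
    "    FirstVersionInstalled    REG_SZ    v10.0\n"
)
print(parse_reg_values(sample))
```

An empty result would indicate that the installer did not write any string values under the key.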

 

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler
Sent: Tuesday, January 29, 2019 11:16 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

Thanks for clarifying! I have the DLL in the same location as you, but the installer does not seem to have set the registry key. Which version of Windows are you running? We have this issue on Win 10 and Win 2016. I will now try to set the key manually and check whether it works then.

 

Best,

 

Jens

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Tuesday, January 29, 2019 6:10 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_gpu_discovery: wrong amount of GPU memory

 

So I did some digging, and you were correct.  condor_gpu_discovery on Windows does try to access this registry key

 

"SOFTWARE\\NVIDIA Corporation\\GPU Computing Toolkit\\CUDA"

 

when it cannot find cudart.dll.  We don’t expect this code to execute most of the time, but it is there.

 

This key is created by the NVIDIA CUDA Toolkit installer for Windows.  I upgraded my workstation to v10 from here

https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10

 

And it updated the key

HKEY_LOCAL_MACHINE\software\NVIDIA Corporation\GPU Computing Toolkit\CUDA

so that it now says v10.0

 

On my workstation, the cuda runtime is installed in this directory:

 

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin

 

I’m curious as to where it is installed on your Windows machines, if not there.

 

-tj