
Re: [HTCondor-users] GPUsUsage



My fault: I sent some output from the discovery utility, but nothing from the second one.
The output of condor_gpu_utilization is:
Unable to load nvml.dll.
Hanging to prevent process churn.

Re-installing the NVIDIA driver seems to fix the issue.

BTW: There seems to be a bug in the configuration template. The template "use feature : GPUs" produces the following line:
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)/condor_gpu_utilization
But on Windows this line produces an error in StartLog: Create_Process(): Failed to extract the extension from file C:\condor\bin/condor_gpu_utilization.
Adding a slightly modified line to the config fixes this:
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)\condor_gpu_utilization.exe
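
The error message suggests that the configured path had no file extension to work with. A minimal Python sketch of that kind of check follows; it only mimics the idea with the standard ntpath module and is not HTCondor's actual code. The function name has_windows_extension is made up for illustration.

```python
import ntpath


def has_windows_extension(path: str) -> bool:
    """Return True if the last path component carries a file extension.

    Hypothetical helper: ntpath applies Windows path rules on any
    platform and treats both '/' and '\\' as separators.
    """
    _, ext = ntpath.splitext(path)
    return ext != ""


# The value produced by the template versus the corrected one.
template_value = r"C:\condor\bin/condor_gpu_utilization"
fixed_value = r"C:\condor\bin\condor_gpu_utilization.exe"

for value in (template_value, fixed_value):
    print(value, "->", has_windows_extension(value))
```

A check like this could be run over executable paths in a Windows configuration before restarting the startd.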


Masaj


On 30.11.2020 21:49, John M Knoeller wrote:

Oh. I meant condor_gpu_utilization

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Martin Sajdl
Sent: Monday, November 30, 2020 1:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUsUsage

Unfortunately, the mentioned utility (condor_gpu_monitor) is not a part of my installation. There are just the following two in the bin directory:

condor_gpu_discovery.exe

condor_gpu_utilization.exe

The output of condor_gpu_discovery -verbose is:

DetectedGPUs="CUDA0"

With the -extra parameter, it is:

DetectedGPUs="CUDA0"

CUDACapability=7.5

CUDAClockMhz=1845.00

CUDAComputeUnits=48

CUDADeviceName="GeForce RTX 2080 SUPER"

CUDADevicePciBusId="0000:05:00.0"

CUDADeviceUuid="132fe854-4afe-4e24-82ae-4eb1ef2dd963"

CUDADriverVersion=10.10

CUDAECCEnabled=false

CUDAGlobalMemoryMb=8192

CUDARuntimeVersion=10.10
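
To compare output like this across machines, the attribute lines can be read into a dictionary. The following is a minimal Python sketch, assuming the output consists of simple Name=Value lines as shown above; parse_discovery_output is a hypothetical helper, not part of HTCondor, and all values are kept as strings.

```python
def parse_discovery_output(text: str) -> dict:
    """Parse Name=Value lines into a dict of strings.

    Blank lines and lines without '=' are skipped; surrounding
    double quotes are removed from values.
    """
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        attrs[key.strip()] = value
    return attrs


sample = '''DetectedGPUs="CUDA0"

CUDACapability=7.5

CUDADeviceName="GeForce RTX 2080 SUPER"
'''

for name, value in parse_discovery_output(sample).items():
    print(name, "=", value)
```

Diffing the resulting dictionaries from a working and a non-working machine would show whether the two really report the same device and driver attributes.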

Any hint?

Masaj


---------- Original e-mail ----------
From: John M Knoeller <johnkn@xxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: 30. 11. 2020 18:03:04
Subject: Re: [HTCondor-users] GPUsUsage

Try running

c:\condor\bin\condor_gpu_monitor

It may print out a message telling you what is wrong. If all you see is

 Hanging to prevent process churn.

then neither nvcuda.dll nor cudart.dll is in the PATH. If that happens, try running

c:\condor\bin\condor_gpu_discovery -verbose

We would expect that to fail also, and for the same reason. That would mean that you don't actually have the NVIDIA drivers or runtime installed properly.
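
One quick way to test the PATH explanation is to look for the DLLs in each PATH directory. The following is a minimal Python sketch under that assumption; find_on_path is a hypothetical helper, and it only inspects PATH, whereas Windows also searches other locations (such as the system directory) when loading a DLL, so an empty result here is a hint rather than proof.

```python
import os


def find_on_path(filename, path=None):
    """Return every PATH directory entry that contains `filename`.

    `path` defaults to the PATH environment variable; pass a string
    of directories joined with os.pathsep to check something else.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


# DLL names taken from the messages in this thread.
for dll in ("nvml.dll", "nvcuda.dll", "cudart.dll"):
    found = find_on_path(dll)
    print(dll, "->", found if found else "not found on PATH")
```

Running it in the same environment as the startd matters, since a service account may see a different PATH than an interactive shell.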

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Martin Sajdl
Sent: Saturday, November 28, 2020 1:09 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] GPUsUsage

Hi,

We would like to monitor the GPU load on the machines in our pool while jobs are running (or even when no job is running). We found that there is a machine ClassAd that shows this, so we started to use it, but now it does not work on some machines. These machines have the same GPU cards, the same drivers, and the same HTCondor configuration (just "use feature:GPUs").

Could someone tell me under what conditions the ClassAd is provided, or whether there is another one we could use for GPU load monitoring? We are using the Windows version of HTCondor, 8.8.10. Unfortunately, the documentation hardly mentions this ClassAd.

Thank you in advance!

Masaj

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

