
Re: [HTCondor-users] GPUs not detected in 9.0.6 version



Well, that's not good.

Could you try running the 9.0.6 condor_gpu_discovery with

  condor_gpu_discovery -verbose -diag

and send me the results?

thanks
-tj
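
For anyone scripting this check across several nodes, here is a minimal Python sketch that runs the discovery tool with the flags above and reports whether it exited normally or was killed by a signal. The default binary path and the helper names `describe_exit` and `run_discovery` are assumptions for illustration, not part of HTCondor; adjust the path to your installation.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_discovery(binary="/usr/libexec/condor/condor_gpu_discovery"):
    """Run the discovery tool with -verbose -diag and capture its output.

    The default path is an assumption; pass the real location if it differs.
    """
    proc = subprocess.run(
        [binary, "-verbose", "-diag"],
        capture_output=True,
        text=True,
    )
    return proc.stdout, proc.stderr, describe_exit(proc.returncode)


# Example use (needs the binary to be present on the machine):
#   out, err, status = run_discovery()
#   print(status)
#   print(out)
#   print(err)
```

A crash then shows up as a "killed by ..." status next to whatever diagnostic output was produced before the process died.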


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
Sent: Tuesday, September 28, 2021 11:53 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version
 
Hi Stuart, TJ,

Thank you for your replies.

Regarding the CUDA version, we did not update it. We are still using the old CUDA 10.1.243 for these GPUs.

We did some testing as TJ suggested. We are currently running HTCondor 9.0.6, but with the condor_gpu_discovery binary from HTCondor 9.0.5, and the GPU is correctly discovered:

# condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
1 GPU-c659279d $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $

So the problem is in the new condor_gpu_discovery binary from version 9.0.6. In fact, it crashes when run directly:

[root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-c659279d"
[root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.6
Segmentation fault

Sep 29 06:47:09 gpu03 kernel: condor_gpu_disc[22684]: segfault at 0 ip           (null) sp 00007ffda4fe0088 error 14 in condor_gpu_discovery-9.0.6[400000+17000]
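
The kernel line above can be decoded a bit further. The following is a minimal Python sketch, assuming the usual x86 Linux "segfault at" log format, in which the error code is printed in hexadecimal and its bits follow the x86 page-fault error code layout. The function name `decode_segfault` and the dictionary keys are made up for this illustration.

```python
import re

# Matches the x86 Linux "segfault at ..." kernel log line (all numbers are hex).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip\s+(?P<ip>\(null\)|[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Extract the faulting address, instruction pointer and error-code bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault log line")
    err = int(m.group("err"), 16)  # the error code is logged in hex
    ip_text = m.group("ip")
    return {
        "address": int(m.group("addr"), 16),
        "ip": 0 if ip_text == "(null)" else int(ip_text, 16),
        "protection_violation": bool(err & 0x01),  # clear: page not present
        "write": bool(err & 0x02),                 # clear: read access
        "user_mode": bool(err & 0x04),
        "instruction_fetch": bool(err & 0x10),
    }


line = (
    "Sep 29 06:47:09 gpu03 kernel: condor_gpu_disc[22684]: segfault at 0 "
    "ip           (null) sp 00007ffda4fe0088 error 14 in "
    "condor_gpu_discovery-9.0.6[400000+17000]"
)
info = decode_segfault(line)
print(info)
```

For this line the decoded fault is a user-mode instruction fetch from address 0 on a page that is not mapped. That pattern is consistent with the program calling through a null function pointer, although the log line alone does not say which call it was.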

Thank you very much.

Cheers,

Carles

On Tue, 28 Sept 2021 at 23:24, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
The condor_gpu_discovery binary is completely portable, so could you try copying it from a machine that has 8.8.15 installed to one of the machines that is not detecting GPUs, and running it there interactively?

This will help us to know whether this is really a problem with the condor_gpu_discovery binary or something else.

thanks
-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
Sent: Tuesday, September 28, 2021 3:20 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] GPUs not detected in 9.0.6 version
 
Dear all,

We have recently migrated our whole pool from HTCondor 8.8.15 to 9.0.6 (keeping, for now, our old PASSWORD security configuration).

Everything is working fine, with the exception of two machines that have GeForce GTX 1050 Ti GPUs. We have noticed that the GPU is not detected with HTCondor 9.0.6, but it is detected again with version 9.0.5.

# condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
1 GPU-c659279d $CondorVersion: 9.0.5 Aug 18 2021 BuildID: 554415 PackageID: 9.0.5-1 $
# condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
0 0 $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $

We have other GPU machines (GeForce RTX 2080 Ti or Tesla V100) whose GPUs are correctly detected with version 9.0.6, so it seems that the problem only affects these older GPUs.

Do you know what is happening? Please let me know if you need further information.

Cheers,

Carles



--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice:  http://legal.ifae.es

