
Re: [HTCondor-users] GPUs not detected in 9.0.6 version



> On Sep 28, 2021, at 1:20 AM, Carles Acosta <cacosta@xxxxxx> wrote:
> 
> Dear all,
> 
> We have recently migrated our whole pool from HTCondor 8.8.15 to 9.0.6 (keeping, for now, our old PASSWORD security configuration).
> 
> Everything is working fine with the exception of two machines that have GeForce GTX 1050 Ti GPUs. We have realized that the GPU is not detected using HTCondor 9.0.6, while it is detected again with version 9.0.5.
> 
> # condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
> 1 GPU-c659279d $CondorVersion: 9.0.5 Aug 18 2021 BuildID: 554415 PackageID: 9.0.5-1 $
> # condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
> 0 0 $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $
> 
> We have other GPU machines (GeForce RTX 2080 Ti or Tesla V100) that are correctly detected with version 9.0.6, so it seems that this only affects these older GPUs.
> 
> Do you know what is happening? Please let me know if you need further information.

FWIW, we just upgraded a cluster today from 9.0.4 to 9.0.6 and are still able to see GTX 1050 Ti devices, e.g.,

# condor_status slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
1 GPU-f3daa19c $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $

Is there any chance you also updated the version of the NVIDIA driver you are using? For our 1050 Ti (and GTX 1650) devices I have found a more general problem updating from the driver bundled with CUDA 11.2 to anything newer.
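
If it helps to compare a working and a non-working node, something like the following should show the driver version and what HTCondor's own discovery tool reports. This is only a sketch: the function name gpu_report is made up for illustration, and I am assuming nvidia-smi is on the PATH and that condor_gpu_discovery lives in the directory reported by "condor_config_val LIBEXEC", which may not match your installation.

```shell
# gpu_report is a hypothetical helper name, not part of HTCondor.
gpu_report() {
    # Driver version as seen by the NVIDIA tools (assumes nvidia-smi is on PATH).
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi not found"
    fi

    # What HTCondor's discovery tool sees (assumes it is installed under LIBEXEC).
    libexec=""
    if command -v condor_config_val >/dev/null 2>&1; then
        libexec=$(condor_config_val LIBEXEC 2>/dev/null || true)
    fi
    if [ -n "$libexec" ] && [ -x "$libexec/condor_gpu_discovery" ]; then
        "$libexec/condor_gpu_discovery" -properties
    else
        echo "condor_gpu_discovery not found"
    fi
}

gpu_report
```

If the driver versions differ between the two nodes, or condor_gpu_discovery reports no devices while nvidia-smi does, that would point at the driver rather than at 9.0.6 itself.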

Thanks.

--
Stuart Anderson
sba@xxxxxxxxxxx