
Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7

Hi tj,

On 4/10/20 12:41 AM, John M Knoeller wrote:
> 04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line 1272 ..
> There was a known bug in this code when there were multiple GPUS that had the same device name.  
> (i.e. the device list was  CUDA0,CUDA0)  Is that the case here?
Nope, this box only has a single (old) GPU in it:

condor_status -l slot1@xxxxxxxxxxxxxxxxx |awk 'tolower($1)~/gpu/ {print}'
AssignedGPUs = "CUDA0"
ChildGPUs = { 0,0,0,0 }
DetectedGPUs = 1
GPUs = 1
TotalGPUs = 1
TotalSlotGPUs = 1

nvidia-smi -L
GPU 0: GeForce GT 640 (UUID: GPU-27ce3be5-06de-e8b2-419e-6edc9e05b2c7)

But maybe the startd thinks it has an invisible second one, as some
strings seem to be incomplete in its logs:

StartLog:04/10/20 02:21:07 unbind_DevIds for slot1.3 before : GPUs:{CUDA0, }{1_5, }
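
To make the "incomplete strings" point a bit more concrete, here is a rough
sketch that splits such a fragment into its brace groups and counts the blank
entries. The helper names (parse_devid_groups, empty_entries) are made up for
illustration, and the format is only assumed from the single log line above.
Whether a blank entry means a real second list element or just a trailing
separator in how the startd prints the list, I cannot tell from this output.

```python
import re

def parse_devid_groups(line):
    """Split a 'GPUs:{a, b}{c, d}' fragment into one list of entries per brace group.

    Assumption: groups are delimited by {...} and entries are comma-separated,
    as in the single StartLog line quoted above.
    """
    groups = re.findall(r"\{([^}]*)\}", line)
    return [[item.strip() for item in group.split(",")] for group in groups]

def empty_entries(line):
    """Return the number of blank entries in each brace group."""
    return [sum(1 for item in group if not item)
            for group in parse_devid_groups(line)]

sample = "GPUs:{CUDA0, }{1_5, }"
print(parse_devid_groups(sample))
print(empty_entries(sample))
```

For the line above, each of the two groups ends up with one named entry and
one blank one.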



Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
