
Re: [HTCondor-users] Startd crash w/ "Failed to bind local resource 'GPUs'"

On 11/15/2017 3:57 AM, bert.deknuydt@xxxxxxxxxxxxxxxx wrote:

I've seen this with 8.6.5 and 8.6.6 (Fedora 26 rpms). It's sporadic (a couple of times a week) on machines with just one GPU, but quite frequent (up to several times per hour) on machines with 8 GPUs.

Has anyone seen something similar? Any suggestions on how to figure out what is happening?

On a GPU-equipped machine that has lots of problems (e.g. one of your 8-GPU machines), what does

   condor_config_val -dump GPU

say? In other words, are you doing anything special to configure GPU management beyond just having

   use feature:gpus
in your config?
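
If it helps to compare the answer between a 1-GPU machine and an 8-GPU machine, here is a rough Python sketch that turns the dump into a dictionary and lists the knobs that differ. It assumes the dump consists of "NAME = value" lines plus "#" comment lines; the helper names (parse_dump, dump_gpu_config, diff_knobs) are made up for this illustration and are not part of HTCondor.

```python
import subprocess


def parse_dump(text):
    """Parse 'NAME = value' lines into a dict.

    Assumption: the dump is made of 'NAME = value' lines, with blank
    lines and '#' comment lines that can be ignored.
    """
    knobs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:  # skip anything that is not an assignment
            knobs[name.strip()] = value.strip()
    return knobs


def dump_gpu_config():
    """Run the command from this thread on the local machine and parse it."""
    result = subprocess.run(
        ["condor_config_val", "-dump", "GPU"],
        capture_output=True, text=True, check=True,
    )
    return parse_dump(result.stdout)


def diff_knobs(a, b):
    """Return {name: (value_in_a, value_in_b)} for knobs that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Calling dump_gpu_config() on each machine and passing the two results to diff_knobs() would show any GPU-related setting that is present or different on only one of them.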