Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs
- Date: Thu, 14 May 2020 16:13:14 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs
Are you saying that changing OFFLINE_MACHINE_RESOURCE_<name> in the config and then running condor_reconfig
does not take the GPU offline?
If it does not, I would consider that a bug.
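For reference, a minimal sketch of the change being asked about. It assumes the resource type is named GPUs and that the faulty device is identified as CUDA1; both are placeholders, and the real identifier should be taken from the AssignedGPUs attribute of the slot ads on that node. The file location is also an assumption.

```
# Local configuration on the affected execute node
# (e.g. a file under the node's config.d directory; path is an assumption).
# Mark one GPU as offline so it is not handed to new jobs.
# CUDA1 is a placeholder; use the identifier the startd actually reports.
OFFLINE_MACHINE_RESOURCE_GPUs = CUDA1
```

After saving the change, running condor_reconfig on that node should make the startd re-read its configuration without a restart. Whether the GPU is then actually withheld from new dynamic slots is exactly the behaviour in question here.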
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
Sent: Thursday, May 14, 2020 4:41 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Removing / disabling GPUs without stopping jobs
In a partitionable slot, is there any way to disable a given GPU without
restarting the startd, or in any case to prevent it from being assigned
to new jobs? In short, what `OFFLINE_MACHINE_RESOURCE_<name>` already
does, just without restarting the startd (i.e. without stopping already
running jobs).
We have some nodes with multiple GPUs, and from time to time one of the
GPUs crashes and ends up in some unstable state. The node then becomes a
"black hole" because it keeps accepting jobs that then just crash.
We can already detect it, but fixing it usually requires a reboot
(trying to reset the card doesn't always do the trick, we already tried).
What we ideally would want is to prevent the non-working card from being
assigned to new jobs until we can find the right spot to reboot the
node, because some jobs need a long time to run, and we want to make it
as unintrusive as possible.
With static slots it would be easy to just set START to false in the
slot, but would it be possible to do something equivalent with dynamic
slots?
I fear the system will be very confused if we make the START expression
conditional on the assigned GPU...
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750