Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs
- Date: Thu, 14 May 2020 18:58:20 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs
Any change to the configured *amount* of resources requires a restart. And you should not expect that taking
a GPU offline would have any effect on the resources assigned to any existing slots.
In particular, if a job is running on a slot, marking the GPU as offline should not have any effect on the job.
But it *should* prevent a new dynamic slot from being assigned that GPU. It was intended that this work.
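A minimal sketch of the setup discussed here, assuming the GPUs are advertised under the resource name GPUs and that the failing card's identifier is CUDA1 (a placeholder, not taken from this thread; use whatever identifier the startd actually reports for that card):

```
# Local configuration on the affected execute node.
# CUDA1 is a placeholder identifier for the broken card.
OFFLINE_MACHINE_RESOURCE_GPUs = CUDA1
```

After editing the configuration, run condor_reconfig on that node rather than restarting the startd. Jobs that are already running keep the GPUs they were assigned. The intended effect is only that new dynamic slots are not handed the offline card.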
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
Sent: Thursday, May 14, 2020 11:24 AM
Subject: Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs
I have to confess, I didn't try it. I found the following in the documentation:
> A restart of the condor_startd is required for changes to this configuration variable to take effect.
And I just directly thought that we'd need a full condor restart, so it
wouldn't do what we need.
Of course, if a condor_reconfig is enough, that would fit our needs!
On 14/5/20 18:13, John M Knoeller wrote:
> Are you saying that changing OFFLINE_MACHINE_RESOURCE_<name> in the config and then running condor_reconfig
> does not take the GPU offline?
> If it does not, I would consider that a bug.
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
> Sent: Thursday, May 14, 2020 4:41 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Removing / disabling GPUs without stopping jobs
> In a partitionable slot, is there any way to disable a given GPU without
> restarting the startd, or in any case to prevent it from being assigned
> to new jobs? In short, what `OFFLINE_MACHINE_RESOURCE_<name>` already
> does, just without restarting the startd (i.e. without stopping already
> running jobs).
> We have some nodes with multiple GPUs, and from time to time one of the
> GPUs crashes and ends up in some unstable state. The node then becomes a
> "black hole" because it keeps accepting jobs that then just crash.
> We can already detect it, but fixing it usually requires a reboot
> (trying to reset the card doesn't always do the trick, we already tried).
> What we ideally would want is to prevent the non-working card from being
> assigned to new jobs until we can find the right spot to reboot the
> node, because some jobs need a long time to run, and we want to make it
> as unintrusive as possible.
> With static slots it would be easy to just set START to false in the
> slot, but would it be possible to do something equivalent with dynamic
> slots?
> I fear the system will be very confused if we make the START expression
> conditional on the assigned GPU...
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750