
Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs



Any change to the configured *amount* of resources requires a restart.  And you should not expect that taking
a GPU offline would have any effect on the resources already assigned to existing slots.

In particular, if a job is running on a slot, marking the GPU as offline should not have any effect on the job.

But it *should* prevent a new dynamic slot from being assigned that GPU. It was intended that this work.
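
As a minimal sketch of what that would look like, assuming the custom resource is named GPUs and the faulty card shows up as CUDA1 (both names are assumptions here; use whatever identifier the slot's AssignedGPUs attribute reports on your node), the local configuration on the affected node would gain one line:

```
# Hypothetical fragment for the affected node's local HTCondor configuration.
# "GPUs" is the assumed resource name and "CUDA1" the assumed identifier of
# the broken card; neither value comes from this thread.
OFFLINE_MACHINE_RESOURCE_GPUs = CUDA1
```

After that, running condor_reconfig on the node should be enough for the startd to pick up the change. This has not been verified in this thread, so check that new dynamic slots really do stop receiving that GPU before relying on it.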

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
Sent: Thursday, May 14, 2020 11:24 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs

I have to confess, I didn't try it. I found the following in the
documentation:

> A restart of the condor_startd is required for changes to this configuration variable to take effect.

And I just assumed that we'd need a full condor restart, so it
wouldn't do what we need.

Of course, if a condor_reconfig is enough, that would fit our needs!

Joan

On 14/5/20 18:13, John M Knoeller wrote:
> Are you saying that changing OFFLINE_MACHINE_RESOURCE_<name> in the config and then running condor_reconfig
> does not take the GPU offline?
> 
> If it does not, I would consider that a bug.
> 
> -tj
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
> Sent: Thursday, May 14, 2020 4:41 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Removing / disabling GPUs without stopping jobs
> 
> Hi,
> 
> In a partitionable slot, is there any way to disable a given GPU without
> restarting the startd, or in any case to prevent it from being assigned
> to new jobs? In short, what `OFFLINE_MACHINE_RESOURCE_<name>` already
> does, just without restarting the startd (i.e. without stopping already
> running jobs).
> 
> We have some nodes with multiple GPUs, and from time to time one of the
> GPUs crashes and ends up in some unstable state. The node then becomes a
> "black hole" because it keeps accepting jobs that then just crash.
> 
> We can already detect it, but fixing it usually requires a reboot
> (trying to reset the card doesn't always do the trick, we already tried).
> 
> What we ideally would want is to prevent the non-working card from being
> assigned to new jobs until we can find the right spot to reboot the
> node, because some jobs need a long time to run, and we want to make it
> as unintrusive as possible.
> 
> With static slots it would be easy to just set START to false in the
> slot, but would it be possible to do something equivalent with dynamic
> slots?
> 
> I fear the system will be very confused if we make the START expression
> conditional on the assigned GPU...
> 
> Best,
> 
> Joan
> 

-- 
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750