
Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs

I had a look at the code, and I was wrong. The Startd does not pick up changes to the OFFLINE_MACHINE_RESOURCE_* knobs on a reconfig. The code to make this work on a reconfig is missing; it looks like we never got around to implementing this part.

Also, you aren't going to have any luck using STARTD_CRON to overwrite the GPU attributes. The attributes are a reflection of
the internal status of the Startd, but the Startd does not use ClassAds as *storage* internally for resources. Overwriting
the GPU attributes will change what the Negotiator sees, and that will be *somewhat* helpful, but the Startd isn't fooled and will continue to use internal C++ data members for the actual GPU bookkeeping. So overwriting the GPU attributes will very likely lead to contradictions that cause the Startd to abort.

Sorry for the confusion.  I will submit this as a ticket for the 8.9 series. 

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
Sent: Friday, May 15, 2020 3:09 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs

Thanks for the tip!

We did some tests and, in short, the current behavior matches the
documentation (a full restart of the startd *is* needed to prevent
certain GPUs from being assigned to new jobs) but not the intended
behavior (a reconfig should be enough).

This is what we found:

1) Using condor_config_val -start -rset 
OFFLINE_MACHINE_RESOURCE_gpus=<...> failed with the following message in 
the logs:

> 05/15/20 09:37:17 WARNING: Someone at xxx.xxx.xxx.xxx is trying to modify "OFFLINE_MACHINE_RESOURCE_gpus"
> 05/15/20 09:37:17 WARNING: Potential security problem, request refused

This user is an administrator and can set other knobs without issue.

2) Setting it in a configuration file and doing a
`condor_reconfig -startd` does work in the sense that the variable is
set, but it doesn't take effect, i.e. jobs keep being assigned.

We tested it by trying to disable all GPUs on a node and trying to
force a job with `request_gpus=1` to run there (via "requirements", so
the job couldn't run anywhere else); the job was started as usual.

3) We had to do a "full" `condor_restart -startd` for the change to
take effect, and of course that killed the running jobs on that node.
The job couldn't be matched in this case, so the configuration was OK.
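
For reference, the configuration line used in step 2 could be generated along these lines. This is only a minimal Python sketch, not something taken from the tests above: the helper name `offline_gpus_config` and the device IDs `CUDA0` / `CUDA1` are hypothetical, and it assumes the knob accepts a comma-separated list of resource identifiers.

```python
def offline_gpus_config(device_ids):
    """Return a config line marking the given GPU IDs as offline.

    Assumption: the knob value is a comma-separated list of resource
    identifiers such as "CUDA0"; the IDs used below are illustrative only.
    """
    # Drop empty entries and surrounding whitespace before joining.
    ids = [d.strip() for d in device_ids if d.strip()]
    return "OFFLINE_MACHINE_RESOURCE_gpus = " + ", ".join(ids)


if __name__ == "__main__":
    # Hypothetical device IDs; the output would go into a local config file.
    print(offline_gpus_config(["CUDA0", "CUDA1"]))
```

As described in steps 2 and 3, with the version we tested such a line is only honored after a `condor_restart -startd`, not after a reconfig.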

I've got the feeling that the problem lies in that the GPUs are already 
assigned to the partitionable slot, we verified it by checking the 
`AssignedGpus` of the partitionable slot.

Before the condor_restart, even after the reconfig, it was still showing 
all the devices, they only were gone after the restart. The 
configuration knob was set, but the already existing partitionable slot 
wasn't modified.
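
The check described above can be sketched as a small Python helper. It is only an illustration under stated assumptions: the function name `still_assigned` is hypothetical, the device IDs are made up, and it assumes the `AssignedGpus` value of the partitionable slot has already been read as a comma-separated string, however it was obtained.

```python
def still_assigned(assigned_gpus, offline_ids):
    """Return the offline GPU IDs that still appear in an AssignedGpus value.

    assigned_gpus: the attribute value as a string, e.g. "CUDA0,CUDA1"
                   (assumed comma-separated; the IDs are illustrative).
    offline_ids:   iterable of IDs listed in OFFLINE_MACHINE_RESOURCE_gpus.
    """
    assigned = {g.strip() for g in assigned_gpus.split(",") if g.strip()}
    offline = {o.strip() for o in offline_ids if o.strip()}
    # A non-empty result means the slot still "owns" a GPU marked offline.
    return sorted(assigned & offline)


if __name__ == "__main__":
    # After a reconfig only, we would expect the offline GPU to still show up.
    print(still_assigned("CUDA0,CUDA1", ["CUDA1"]))
```

A non-empty result corresponds to what we saw after the reconfig; an empty one corresponds to the state after the full restart.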

Then, since the partitionable slot still had them, they were assigned to 
the children, which also makes sense... I'm not sure whether a 
partitionable slot can be prevented from assigning resources it already "owns".



On 14/5/20 20:58, John M Knoeller wrote:
> Any configuration of the *amount* of resources requires a restart.  And you should not expect that taking
> a GPU offline would have any effect on the resources assigned to any existing slots.
> In particular if a job was running on a slot, marking the GPU as offline should not have any effect on the job.
> But it *should* prevent a new dynamic slot from being assigned that GPU. It was intended that this work.
> -tj
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
> Sent: Thursday, May 14, 2020 11:24 AM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs
> I have to confess, I didn't try it, I found the following in the
> documentation:
>> A restart of the condor_startd is required for changes to this configuration variable to take effect.
> And I just assumed that we'd need a full condor restart, so it
> wouldn't do what we need.
> Of course, if a condor_reconfig is enough, that would fit our needs!
> Joan
> On 14/5/20 18:13, John M Knoeller wrote:
>> Are you saying that changing OFFLINE_MACHINE_RESOURCE_<name> in the config and then running condor_reconfig
>> does not take the GPU offline?
>> If it does not, I would consider that a bug.
>> -tj
>> -----Original Message-----
>> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
>> Sent: Thursday, May 14, 2020 4:41 AM
>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>> Subject: [HTCondor-users] Removing / disabling GPUs without stopping jobs
>> Hi,
>> In a partitionable slot, is there any way to disable a given GPU without
>> restarting the startd, or in any case to prevent it from being assigned
>> to new jobs? In short, what `OFFLINE_MACHINE_RESOURCE_<name>` already
>> does, just without restarting the startd (i.e. without stopping already
>> running jobs).
>> We have some nodes with multiple GPUs, and from time to time one of the
>> GPUs crashes and ends up in some unstable state. The node then becomes a
>> "black hole" because it keeps accepting jobs that then just crash.
>> We can already detect it, but fixing it usually requires a reboot
>> (trying to reset the card doesn't always do the trick, we already tried).
>> What we ideally would want is to prevent the non-working card from being
>> assigned to new jobs until we can find the right spot to reboot the
>> node, because some jobs need a long time to run, and we want to make it
>> as unintrusive as possible.
>> With static slots it would be easy to just set START to false in the
>> slot, but would it be possible to do something equivalent with dynamic
>> slots?
>> I fear the system will be very confused if we make the START expression
>> conditional on the assigned GPU...
>> Best,
>> Joan

Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750