
Re: [HTCondor-users] condor_drain query



The technique you describe is old, but still very reasonable.  

The only way to get the STARTD to stop matching jobs and have that change persist across a restart is to change the configuration files.  

The condor_drain command does not change the configuration files, so it does not persist across a restart of the daemon. 

In the CHTC pool, we have a clause in the START expression that exists so that admins can disable matchmaking for the STARTD. It looks something like this:

    START = (PreventJobsReason =?= undefined) && $(START)
    STARTD_ATTRS = $(STARTD_ATTRS) PreventJobsReason
    PreventJobsReason =
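The key here is the "=?=" operator: it is ClassAd's meta-equals, which (unlike "==") evaluates to a definite True/False even when one side is undefined. A tiny pure-Python sketch of how the clause evaluates (this does not use any HTCondor API; "None" just stands in for the ClassAd undefined value):

```python
# Minimal model of: START = (PreventJobsReason =?= undefined) && $(START)
# The node matches jobs only while PreventJobsReason is undefined.

UNDEFINED = None  # stand-in for the ClassAd "undefined" value

def start_allows_jobs(prevent_jobs_reason, base_start=True):
    """Model of the START clause: meta-equals against undefined, AND'ed
    with whatever the rest of the START expression evaluates to."""
    return (prevent_jobs_reason is UNDEFINED) and base_start

print(start_allows_jobs(UNDEFINED))                  # node matches jobs
print(start_allows_jobs("down for kernel upgrade"))  # matching disabled
```

Once PreventJobsReason is set to any string, the first clause is False and the whole START expression is False, regardless of the rest.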

Then, when the admin wants to configure the node to not match any jobs, they will add a config file
that defines PreventJobsReason:

    PreventJobsReason = "johnkn 5/9/2023: Down for kernel upgrade"

And then reconfig.

You could combine this with a condor_drain -restart command so that the node shows up in the Draining state until it finishes draining and restarts.

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Steffen Grunewald
Sent: Tuesday, May 9, 2023 4:12 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: condor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] condor_drain query

On Tue, 2023-05-09 at 08:51:18 +0000, HTCondor Users Mailinglist wrote:
> Hi Condor community,
> 
> I've recently started experimenting with the `condor_drain` command and found that the drain status does not persist across startd restarts or node reboots. Previously we have achieved the desired behaviour by setting a ClassAd ( `StartJobs = false` ) as part of the `START` expression. Is there a way `condor_drain` or a similar command can achieve a drain and persist it even if the worker node is rebooted?


We're using "LOCAL_CONFIG_FILE = /etc/condor_config_local|" (the trailing "|"
tells HTCondor to execute the file and read its output as configuration) - the
script collects some snippets, and right at the end also includes
/etc/condor/local if that file exists.
The file contains "START = False" and (for historical reasons) another line
"IS_OWNER = True".
Draining a node (avoiding the "condor_drain" command, which didn't exist back
then) is achieved by creating the file and running condor_reconfig; undraining
is done by removing the file and running condor_reconfig again.
Draining/drained machines will be in the "Owner" state: they no longer accept
new workloads but leave running jobs unharmed. (Note that you'll have to add
some extra checks for dynamic slots.)
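Written out, the marker file described above (/etc/condor/local) would simply contain:

    # /etc/condor/local -- its presence marks the node as draining
    START = False
    # historical; forces the slot into the Owner state
    IS_OWNER = True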


This approach may be outdated now but has served us very well for more than 
ten years ;)


HTH,
 Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/