
Re: [HTCondor-users] node maintenance mode



Oh boy, I only needed to look at the next line of my config file; tunnel vision got me again. Sure enough, I added it to STARTD_ATTRS and it works as expected. Thanks for the help, Di!

 

-------------------------------------

Gianni Pezzarossi

Computational System Analyst

Research Services

Engineering IT Shared Services

University of Illinois @ Urbana-Champaign

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Di Qing
Sent: Friday, April 22, 2022 6:02 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] node maintenance mode

 

Hello Gianni,

We are using a similar approach to control the draining of the pools, since we were not satisfied with the command 'condor_off -startd -peaceful' before. Basically, we added the following lines to condor_config.local on pools:

ENABLE_PERSISTENT_CONFIG = TRUE
PERSISTENT_CONFIG_DIR = /etc/condor
SETTABLE_ATTRS_CONFIG = NodeOnline
STARTD_ATTRS = NodeOnline, $(STARTD_ATTRS)

NodeOnline = False

START = (NodeOnline =?= True)

With this setup, a configuration file /etc/condor/.config.STARTD.nodeonline is created, with 'NodeOnline' defined in it. We set the default value of NodeOnline to False to avoid accidentally putting a new node online. We also configured our HTCondor pool to allow the head node/admin node to use condor_config_val and condor_reconfig to change the value of NodeOnline and apply it. You can also modify the file /etc/condor/.config.STARTD.nodeonline directly. Of course there are security risks in doing it this way, but it makes a large cluster easier for us to manage.
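
A minimal sketch of the permissions such a setup relies on, in addition to the lines above. This is an illustration only, not the exact configuration described here: the host name admin.example.edu is hypothetical, and the commands in the comments are assumptions about how the value would typically be flipped from the admin node.

```
# Hypothetical admin host; replace with your own head/admin node.
# CONFIG-level access lets that host change settable attributes remotely.
ALLOW_CONFIG = admin.example.edu

# ADMINISTRATOR-level access lets that host run condor_reconfig against the node.
ALLOW_ADMINISTRATOR = $(CONDOR_HOST), admin.example.edu

# From the admin host, the value could then be changed along these lines
# (node name is a placeholder):
#   condor_config_val -name <node> -startd -set "NodeOnline = True"
#   condor_reconfig -name <node>
```

Keeping SETTABLE_ATTRS_CONFIG limited to the single attribute NodeOnline, as in the lines above, narrows what a remote host is able to change.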

In your case, I am not sure whether you need to add MAINTENANCE_MODE into the list of STARTD_ATTRS.
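
For reference, a sketch of what that change would look like with the names from the original question. In a START expression, a bare name such as MAINTENANCE_MODE is looked up as an attribute of the machine ClassAd. If the attribute is not published there it evaluates to undefined, so the whole expression is undefined rather than true. The classad_eval check passes because the -file option puts MAINTENANCE_MODE into the ad being evaluated.

```
# /etc/condor/config.d/maintenance.conf
MAINTENANCE_MODE = FALSE

# condor_config.local
# Publish MAINTENANCE_MODE in the machine ad so START can reference it.
STARTD_ATTRS = MAINTENANCE_MODE, $(STARTD_ATTRS)
START = TRUE && ! MAINTENANCE_MODE
```

An alternative, assuming the value only ever comes from the configuration files, is to use macro substitution, START = TRUE && ! $(MAINTENANCE_MODE), which expands the value when the configuration is read and does not need the attribute in the machine ad.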

Cheers,

Di

On 2022-04-22 2:31 p.m., Pezzarossi, Gianni wrote:

Hey all,

Was wondering if someone can help me trace an issue I'm having, but also maybe let me know if my approach in general is terrible/if there is a better way to accomplish what I am trying.

 

So first off, I want to find a way to put a condor node into a "maintenance mode", where the node stops taking new jobs but lets what is already running finish, for example if I need to reboot the nodes of a cluster and don't want to interrupt running jobs. My thought was that I just need to set START = FALSE in some manner, and for a time, we could do just that and push a config.local file with that change to the startd nodes. However, wanting to make this a bit more automated, the idea I had was to change START to something like

 

START = TRUE && ! MAINTENANCE_MODE

 

Where MAINTENANCE_MODE was a variable defined in /etc/condor/config.d/maintenance.conf like so:

MAINTENANCE_MODE = FALSE

 

That way I just need to have a script/config management just drop a new maintenance.conf file and not worry about blowing away any settings in the .local config file.

 

However, I cannot get jobs to run when MAINTENANCE_MODE = FALSE, almost as if the START statement is not getting evaluated correctly. I tried even putting the MAINTENANCE_MODE variable in the same .local file thinking maybe it had something to do with the external file. But nothing has allowed jobs to run when the node is out of maintenance mode. As soon as I set START = TRUE again and run condor_reconfig, jobs launch.

 

I confirmed the syntax should be correct with:

 

classad_eval -file /etc/condor/config.d/maintenance.conf 'TRUE && !MAINTENANCE_MODE'

 

output:

[ MAINTENANCE_MODE = false ]

true

 

So I must be missing something about how START gets evaluated, or...?

 

For the record, I do know that I can use something like `condor_off -startd -peaceful`. The only reason I don't want to depend on it is that if I am installing updates and need a few reboots, the service will restart after a reboot. If there is a better way to accomplish this, I'm happy to scrap what I am working on above.

 

This is all on condor version 8.8.17

 

Thanks in advance!



