
Re: [HTCondor-users] node maintenance mode

Hello Gianni,

We use a similar approach to control the draining of our pools, since we were not satisfied with the command 'condor_off -startd -peaceful'. Basically, we added the following lines to condor_config.local on the pools:


NodeOnline = False

START = (NodeOnline =?= True)

A configuration file, /etc/condor/.config.STARTD.nodeonline, is then created with 'NodeOnline' defined in it. We set the default value of NodeOnline to False so that a new node is not accidentally put online. We also configured our HTCondor pool to allow the head node/admin node to change the value of NodeOnline with condor_config_val and apply it with condor_reconfig. You can also modify the file /etc/condor/.config.STARTD.nodeonline directly. Of course, this approach carries security risks, but it makes a large cluster easier for us to manage.
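
For anyone reproducing this, the remote toggle relies on HTCondor's persistent configuration support. A minimal sketch of the node-side settings follows. The directory is only inferred from the file path above, and the permission level (ADMINISTRATOR) is an assumption, so check both against the manual for your version:

```
# Assumption: accept persistent settings written by condor_config_val -set
ENABLE_PERSISTENT_CONFIG = True
# Inferred from the path /etc/condor/.config.STARTD.nodeonline
PERSISTENT_CONFIG_DIR = /etc/condor
# Assumption: only NodeOnline may be changed remotely, at this access level
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = NodeOnline
```

With something like this in place, the admin node could run `condor_config_val -name <node> -startd -set "NodeOnline = True"` followed by `condor_reconfig -name <node>`, where <node> is a placeholder for the target host.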

In your case, I am not sure whether you need to add MAINTENANCE_MODE to the list of STARTD_ATTRS.
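
One possible explanation, offered as a guess and not a confirmed diagnosis: a bare name inside START is looked up as a ClassAd attribute when the expression is evaluated, not as a configuration macro. If MAINTENANCE_MODE is not in the machine ad, it evaluates to UNDEFINED, and so does `TRUE && !MAINTENANCE_MODE`, which means START never becomes True. A classad_eval run with -file would not reveal this, because there the file's contents are loaded as ad attributes. A sketch of two ways around it (the START lines are illustrations, not your original configuration):

```
# /etc/condor/config.d/maintenance.conf
MAINTENANCE_MODE = False

# Option 1: publish the value in the machine ad so START can see it
STARTD_ATTRS = $(STARTD_ATTRS) MAINTENANCE_MODE
START = (MAINTENANCE_MODE =?= False)

# Option 2 (use instead of option 1): substitute the macro's value
# into the expression when the configuration is read
# START = !$(MAINTENANCE_MODE)
```

Option 1 uses =?= so that a missing attribute leaves the node closed to new jobs instead of producing UNDEFINED.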



On 2022-04-22 2:31 p.m., Pezzarossi, Gianni wrote:

Hey all,

Was wondering if someone can help me trace an issue I'm having, but also maybe let me know if my approach in general is terrible/if there is a better way to accomplish what I am trying.


So first off, I am wanting to find a way to set a condor node into a "maintenance mode", basically where the node will stop taking new jobs but let what is already running finish, for example if I need to reboot the nodes of a cluster and don't want to interrupt running jobs. My thought was that I just need to set START = FALSE in some manner, and for a time, we could do just that and push a config.local file with that change to the startd nodes. However, wanting to make this a bit more automated, the idea I had was to change START to something like




Where MAINTENANCE_MODE was a variable defined in /etc/condor/config.d/maintenance.conf like so:



That way I just need to have a script/config management just drop a new maintenance.conf file and not worry about blowing away any settings in the .local config file.


However, I cannot get jobs to run when MAINTENANCE_MODE = FALSE, almost as if the START statement is not getting evaluated correctly. I tried even putting the MAINTENANCE_MODE variable in the same .local file thinking maybe it had something to do with the external file. But nothing has allowed jobs to run when the node is out of maintenance mode. As soon as I set START = TRUE again and run condor_reconfig, jobs launch.


I confirmed the syntax should be correct with:


classad_eval -file /etc/condor/config.d/maintenance.conf 'TRUE && !MAINTENANCE_MODE'






So I must be missing something about how START gets evaluated, or...?


For the record, I do know that I can use something like `condor_off -startd -peaceful`. The only reason I don't want to depend on this is that, if I am installing updates and need a few reboots, the service will restart after a reboot. If there is a better way I can accomplish this, I'm happy to scrap what I am working on above.


This is all on condor version 8.8.17


Thanks in advance!


Gianni Pezzarossi

Computational System Analyst

Research Services

Engineering IT Shared Services

University of Illinois @ Urbana-Champaign
