We use a similar approach to control the draining of our pools, since we were not satisfied with the command 'condor_off -startd -peaceful'. Basically, we added the following lines to condor_config.local on the pools:
ENABLE_PERSISTENT_CONFIG = TRUE
NodeOnline = False
START = (NodeOnline =?= True)
A configuration file, /etc/condor/.config.STARTD.nodeonline, will
then be created with 'NodeOnline' defined in it. We set the default
value of NodeOnline to False to avoid accidentally putting a new
node online. We also configured our HTCondor pool to allow the head
node/admin node to use condor_config_val and condor_reconfig to
change the value of NodeOnline and apply it. You can also modify the
file /etc/condor/.config.STARTD.nodeonline directly. Of course,
there are security risks in doing it this way, but it makes it
easier for us to manage a large cluster.
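A setup along these lines might look roughly like the fragment below. It is a sketch only, not the exact configuration described above: the host name head.example.org is a placeholder, and the PERSISTENT_CONFIG_DIR, SETTABLE_ATTRS_CONFIG, ALLOW_CONFIG and STARTD_ATTRS lines are assumptions about what the parts not shown above could contain, so check them against the manual for your HTCondor version.

```
# condor_config.local on the execute nodes (sketch; see assumptions above)
ENABLE_PERSISTENT_CONFIG = TRUE
# Assumed: directory where the persistent settings file is written.
PERSISTENT_CONFIG_DIR = /etc/condor
# Assumed: only NodeOnline may be changed remotely on the startd.
STARTD.SETTABLE_ATTRS_CONFIG = NodeOnline
# Assumed: placeholder host that is allowed to send configuration changes.
ALLOW_CONFIG = head.example.org
# Assumed: publish NodeOnline in the machine ClassAd so START can see it.
STARTD_ATTRS = $(STARTD_ATTRS) NodeOnline

# Default to offline so that a new node does not accept jobs by accident.
NodeOnline = False
START = (NodeOnline =?= True)
```

Keeping the list of settable attributes down to the single NodeOnline knob limits what a remote host can change, which is one way to reduce the security risk mentioned above.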
In your case, I am not sure whether you need to add MAINTENANCE_MODE to the list of STARTD_ATTRS.
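One possible explanation, offered tentatively: a bare name such as MAINTENANCE_MODE inside START is looked up as an attribute of the machine ClassAd, not as a configuration macro, so it may evaluate to UNDEFINED there even though the macro is set. The untested sketch below shows two ways this might be addressed (use one of them, not both); the file locations are the ones from the question, and everything else is an assumption to verify on a test node.

```
# /etc/condor/config.d/maintenance.conf
MAINTENANCE_MODE = FALSE

# Option A (in the .local file): expand the macro with $(...) so that
# START contains the literal value instead of an attribute reference.
START = TRUE && ! $(MAINTENANCE_MODE)

# Option B (in the .local file): keep the bare name in START, but publish
# MAINTENANCE_MODE in the machine ClassAd so the expression can find it.
# START = TRUE && ! MAINTENANCE_MODE
# STARTD_ATTRS = $(STARTD_ATTRS) MAINTENANCE_MODE
```

With either option, dropping a new maintenance.conf and running condor_reconfig should still be the only step needed to switch modes.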
I was wondering if someone can help me trace an issue I'm having, and maybe also let me know if my approach in general is terrible or if there is a better way to accomplish what I am trying to do.
So first off, I want to find a way to set a condor node into a "maintenance mode", basically where the node will stop taking new jobs but let what is already running finish, for example if I need to reboot the nodes of a cluster and don't want to interrupt running jobs. My thought was that I just need to set START = FALSE in some manner, and for a time we could do just that and push a config.local file with that change to the startd nodes. However, wanting to make this a bit more automated, the idea I had was to change START to something like
START = TRUE && ! MAINTENANCE_MODE
Where MAINTENANCE_MODE was a variable defined in /etc/condor/config.d/maintenance.conf like so:
MAINTENANCE_MODE = FALSE
That way I just need to have a script/config management just drop a new maintenance.conf file and not worry about blowing away any settings in the .local config file.
However, I cannot get jobs to run when MAINTENANCE_MODE = FALSE, almost as if the START statement is not getting evaluated correctly. I tried even putting the MAINTENANCE_MODE variable in the same .local file thinking maybe it had something to do with the external file. But nothing has allowed jobs to run when the node is out of maintenance mode. As soon as I set START = TRUE again and run condor_reconfig, jobs launch.
I confirmed the syntax should be correct with:
classad_eval -file /etc/condor/config.d/maintenance.conf 'TRUE && !MAINTENANCE_MODE'
[ MAINTENANCE_MODE = false ]
So I must be missing something about how START gets evaluated, or...?
For the record, I do know that I can use something like `condor_off -startd -peaceful`. The only reason I don't want to depend on this is that if I am installing updates and need a few reboots, the service will restart after a reboot. If there is a better way I can accomplish this, I'm happy to scrap what I am working on above.
This is all on condor version 8.8.17
Thanks in advance!
Computational System Analyst
Engineering IT Shared Services
University of Illinois @ Urbana-Champaign