
[HTCondor-users] node maintenance mode



Hey all,

I was wondering if someone could help me trace an issue I’m having, and also let me know whether my general approach is terrible or whether there is a better way to accomplish what I am trying to do.

 

So first off, I want a way to put a condor node into a “maintenance mode”, where the node stops taking new jobs but lets whatever is already running finish, for example if I need to reboot the nodes of a cluster and don’t want to interrupt running jobs. My thought was that I just need to set START = FALSE in some manner, and for a time we could do just that by pushing a config.local file with that change to the startd nodes. However, wanting to make this a bit more automated, my idea was to change START to something like

 

START = TRUE && ! MAINTENANCE_MODE

 

where MAINTENANCE_MODE is a variable defined in /etc/condor/config.d/maintenance.conf like so:

MAINTENANCE_MODE = FALSE

 

That way a script or config management only needs to drop a new maintenance.conf file, without worrying about blowing away any settings in the .local config file.
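To make the "drop a new maintenance.conf" step concrete, here is a minimal sketch of what such a helper could look like. It assumes Python is available on the node; the function name `set_maintenance_mode` and the temporary-file prefix are hypothetical, and only the file path and the MAINTENANCE_MODE variable come from the setup described above.

```python
import os
import tempfile


def set_maintenance_mode(enabled, conf_path="/etc/condor/config.d/maintenance.conf"):
    """Write a one-line drop-in config that defines MAINTENANCE_MODE.

    Hypothetical helper: only this one file is touched, so settings in the
    .local config file are left alone.
    """
    value = "TRUE" if enabled else "FALSE"
    conf_dir = os.path.dirname(conf_path) or "."
    # Write to a temporary file in the same directory, then rename it into
    # place, so a reader never sees a half-written maintenance.conf.
    fd, tmp_path = tempfile.mkstemp(dir=conf_dir, prefix=".maintenance.")
    with os.fdopen(fd, "w") as handle:
        handle.write(f"MAINTENANCE_MODE = {value}\n")
    # mkstemp creates the file readable by its owner only; widen it so the
    # condor daemons can read the config.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, conf_path)
    return conf_path
```

After something like `set_maintenance_mode(True)`, a condor_reconfig would still be needed for the daemons to pick up the new value.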

 

However, I cannot get jobs to run when MAINTENANCE_MODE = FALSE, almost as if the START expression is not being evaluated correctly. I even tried putting the MAINTENANCE_MODE variable in the same .local file, thinking it might have something to do with the external file, but nothing has allowed jobs to run when the node is out of maintenance mode. As soon as I set START = TRUE again and run condor_reconfig, jobs launch.

 

I confirmed the syntax should be correct with:

 

classad_eval -file /etc/condor/config.d/maintenance.conf 'TRUE && !MAINTENANCE_MODE'

 

output:

[ MAINTENANCE_MODE = false ]

true

 

So I must be missing something about how START gets evaluated, or…?

 

For the record, I do know that I can use something like `condor_off -startd -peaceful`. The only reason I don’t want to depend on it is that when I am installing updates and need a few reboots, the service will restart after each reboot. If there is a better way to accomplish this, I’m happy to scrap what I am working on above.

 

This is all on HTCondor version 8.8.17.

 

Thanks in advance!

-------------------------------------

Gianni Pezzarossi

Computational System Analyst

Research Services

Engineering IT Shared Services

University of Illinois @ Urbana-Champaign