
Re: [HTCondor-users] How to let jobs finish on a worker machine to be taken offline



Hi Douglas,

Some comments below inline...

On 7/16/2021 9:46 AM, Vechinski, Douglas wrote:

Recently, one of our worker/compute nodes needed to be taken offline for maintenance. Usually the IT guys just shut it down without checking whether or not it's busy running jobs. Since it was fully busy, and the pool was expected to stay fully busy for the next several weeks, I suggested there should be a way to prevent new jobs from being started on this machine and let the currently running jobs on it finish; then it would be free to be taken offline.

 

After a bit of searching I found the condor_drain command and thought this was the way to achieve it. So I had them issue condor_drain on this machine with no other arguments, so -graceful should have been the default. However, once the command was issued, all of the jobs currently running on this machine were immediately terminated. None of the jobs set MaxJobRetirementTime, which was probably defaulting to 0, and I wondered if that had something to do with it. Still, it seems like there ought to be a way to do this without requiring that setting.

 


Yes, draining a machine with -graceful (the default) will currently signal the job to shut down after MaxJobRetirementTime has expired.  You (the administrator) can set MaxJobRetirementTime in the config file of your worker nodes; it is not necessary for users to place it in their job submit files.  And yes, it defaults to 0, meaning jobs can be killed immediately whenever the startd is draining, or is preempting the job to make room for a higher-priority user.  If you want to guarantee a job X seconds to run unmolested, set MaxJobRetirementTime.  From the Manual:
MaxJobRetirementTime

When the condor_startd wants to kick the job off, a job which has run for less than this number of seconds will not be hard-killed. The condor_startd will wait for the job to finish or to exceed this amount of time, whichever comes sooner. If the job vacating policy grants the job X seconds of vacating time, a preempted job will be soft-killed X seconds before the end of its retirement time, so that hard-killing of the job will not happen until the end of the retirement time if the job does not finish shutting down before then. This is an expression evaluated in the context of the job ClassAd, so it may refer to job attributes as well as machine attributes.
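For example, to give every job up to 24 hours of retirement time before draining or preemption may hard-kill it, you could put something like this in the worker node's config (the 24-hour figure is just an illustration; pick whatever fits your longest jobs):

```
# Worker-node condor_config: jobs that have run less than 24 hours
# will not be hard-killed when draining or preempting.
MaxJobRetirementTime = 24 * 60 * 60
```

Run condor_reconfig on the node afterward so the startd picks up the change.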

Another option is to use "condor_off -peaceful -name <worker node name>".  This tells the startd to stop accepting jobs and then shut down once all running jobs complete; in this case the startd will wait indefinitely for this to happen, so there is no need for a MaxJobRetirementTime setting.
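As a concrete sketch of that workflow (the hostname below is a placeholder):

```
# Tell the startd to stop matching jobs and exit once running jobs finish
condor_off -peaceful -name worker01.example.com -startd

# Later, from the central manager, check whether the machine still shows up
condor_status worker01.example.com
```

Once the startd exits, the machine disappears from condor_status output and is safe to take down.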

It is on my wish list to add a "condor_drain -peaceful" option, which would tell the startd to drain with an effective MaxJobRetirementTime of infinity....

Also, once this machine was rebooted, it immediately began accepting jobs again. I was under the impression that it would be necessary to issue a condor_drain -cancel command in order for it to be free to start running jobs again. If you are doing maintenance and have to reboot a couple of times, you don't want it to keep accepting jobs during the process.


Both condor_drain and condor_off are "soft state", meaning everything is reset after a reboot.  As Dmitri already mentioned, to keep the node "off" after a reboot you will need to add a line to the config file like "START = False". 
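Concretely, something like the following in the node's local config keeps it from matching new jobs even across reboots; remove the line and reconfigure again when maintenance is done:

```
# Worker-node condor_config.local: refuse to start any new jobs
# until this line is removed and condor_reconfig is run.
START = False
```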

Making the drain state and/or the on/off state persistent across reboots is something we could do; I have to admit it never really occurred to me. My own preference is that when a configuration change persists across reboots, the change should be declarative (via config) instead of imperative (via a command), to facilitate configuration revision control. Here at UW-Madison we keep our HTCondor configs in git, and especially for persistent changes I would want to be able to look up who made the change, when, and why. But I understand this may be a personal preference.

 

This condor pool is currently using v. 8.2.8.


I honestly cannot recall the differences between v8.2.8 (which is over six years old!) and current releases of HTCondor, but hopefully not much has changed in these areas. 

Hope the above helps,
Todd