[HTCondor-users] How to let jobs finish on a worker machine to be taken offline



Recently, one of our worker/compute nodes needed to be taken offline for maintenance. Usually the IT guys just shut a machine down without checking whether it's busy running jobs. Since this node was fully busy, and the pool is expected to stay fully busy for the next several weeks, I suggested that there should be a way to stop new jobs from being matched to this machine while letting the jobs already running on it finish. Once they finished, the machine would be free to be taken offline.

After a bit of searching I found the condor_drain command and thought it was the way to achieve this. So I had them issue condor_drain on this machine with no other arguments, which means -graceful should have been the default mode. However, once it was issued, all of the jobs running on this machine were terminated immediately. None of the jobs set MaxJobRetirementTime, so it was probably defaulting to 0, and I wonder whether that explains the behavior. Still, it seems like there ought to be a way to do this without requiring that setting.

Also, once this machine was rebooted, it immediately began accepting jobs again. I was under the impression that it would be necessary to issue a condor_drain -cancel command before it would be free to start running jobs again. If you are doing maintenance and have to reboot a couple of times, you don't want the machine to keep accepting jobs during the process.
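
For reference, here is a rough sketch of the two invocations discussed above, as I understand them from the condor_drain manual page. The hostname is a placeholder, not our real machine name, and the comments describe what I expected to happen rather than what actually did.

```shell
# Ask the node to stop accepting new jobs and drain.
# -graceful is documented as the default mode; it is spelled out here only for clarity.
condor_drain -graceful worker01.example.com

# What I expected to be required before the node would accept jobs again.
condor_drain -cancel worker01.example.com
```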

This HTCondor pool is currently running version 8.2.8.