
Re: [HTCondor-users] How to set a worker node offline in HTCondor



On 3/31/2021 2:09 PM, gthain@xxxxxxxxxxx wrote:
On 3/31/21 1:55 PM, templon@xxxxxxxxx wrote:
 

What is the corresponding simplest way to achieve exactly this in HTCondor?

Note the word "exactly" :)

The answer was the condor_drain command, but on its own it does not achieve exactly this. condor_drain also evicts running jobs from slots, depending on the value of MaxJobRetirementTime. I did not know about this variable, so we did not have it set. Besides no longer accepting new jobs (the question), the nodes also stopped running the already-running jobs, which is not the desired behavior.


Yes, apologies for this, Jeff! I had forgotten that our pool sets MaxJobRetirementTime. As you discovered, I suggest you set MaxJobRetirementTime as documented, i.e. set it in your config to how long a job should be able to run without being interrupted by HTCondor. Note that this is a ClassAd expression that can reference attributes in the job itself if you desire. Alternatively, you can use condor_off -peaceful, as Greg suggested.
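
As a rough sketch of what such a setting might look like in the startd configuration of a worker node: the 24-hour value is only an illustrative assumption, and MaxRuntimeSecs is a hypothetical custom job attribute, not something HTCondor defines for you.

```
# Let a running job keep its slot for up to 24 hours (value in seconds)
# before the startd may evict it while draining or preempting.
# The 24-hour figure is an assumption; pick what suits your pool.
MAXJOBRETIREMENTTIME = 24 * 60 * 60

# Alternative sketch: honor a per-job limit when the job ad carries the
# hypothetical custom attribute MaxRuntimeSecs, else fall back to 24 hours.
# MAXJOBRETIREMENTTIME = ifThenElse(TARGET.MaxRuntimeSecs =!= UNDEFINED, \
#                                   TARGET.MaxRuntimeSecs, 24 * 60 * 60)
```

With something along these lines in place, draining a node should stop new jobs from starting while already-running jobs get up to the configured time to finish. Please check the result on one test node before relying on it.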

While you can configure all kinds of time-based policies today (e.g. maximum run time until killed, maximum run time until candidate for preemption, etc.) using the flexibility of HTCondor's ClassAds and the ability to insert customized attributes, we plan to look at how to make these sorts of policies more "first-class". Doing so would perhaps simplify their use, and at the very least standardize how users specify these time limits.

regards
Todd