
Re: [HTCondor-users] How to set a worker node offline in HTCondor



Hi,

This may be obvious to everybody else, but just in case (it was not to us), my question was:

In Torque, if I don't want a worker node to accept any new jobs, I issue this command:
pbsnodes -o wn-lot-099.nikhef.nl
-o is for "offline".
What is the corresponding simplest way to achieve exactly this in HTCondor?

Note the word "exactly" :)

The answer was the condor_drain command, but on its own it does not achieve exactly this. condor_drain also evicts running jobs from slots, depending on the value of MaxJobRetirementTime. I did not know about this variable, so we did not have it set. As a result, the nodes not only stopped accepting new jobs (the question) but also stopped running the jobs already on them, which was not the desired behavior.

Sure, after we "undrained" the nodes the jobs restarted, but we lost the time (in some cases weeks) that they had already been running.
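
A minimal sketch of the missing piece, for anyone hitting the same problem: giving the startd a retirement time before draining, so that a graceful drain lets running jobs finish. The one-week default and the helper name retirement_config are assumptions made for illustration, not something prescribed in this thread; pick a limit that covers your longest jobs.

```shell
#!/bin/sh
# Hypothetical helper: print a startd config fragment that gives running
# jobs up to N days to finish before a graceful drain evicts them.
# MAXJOBRETIREMENTTIME is expressed in seconds.
retirement_config() {
    days="${1:-7}"   # assumed default of one week; adjust to your workload
    printf 'MAXJOBRETIREMENTTIME = %s\n' "$((days * 24 * 3600))"
}

# Print the fragment for review before adding it to the node's local config.
retirement_config 7
```

The printed line would go into the worker node's local configuration, followed by a condor_reconfig on that node, before any condor_drain is issued.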

A tip for the next time somebody asks this question ;-)

JT

On 12 Mar 2021, at 17:00, Todd Tannenbaum wrote:

Hi Jeff,

I would suggest using "condor_drain", e.g.

    condor_drain wn-lot-099.nikhef.nl

This will tell the given worker node (condor_startd) to stop accepting new jobs by changing the START expression to False (as Vikrant suggested below), and change the state of the slots to Drain (so it is easy to see nodes in this state via condor_status).

Then when/if you want to resume accepting new jobs, you can do

    condor_drain -cancel wn-lot-099.nikhef.nl

which will return the START expression back to whatever it was before you issued the drain request.

See
  https://htcondor.readthedocs.io/en/latest/man-pages/condor_drain.html
for more info.

condor_drain can do nifty things, like setting the START expression to anything when in drained state. The default is START=False, but here we set it up so our draining machines will continue to accept preemptable jobs.
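
As a rough sketch of what such a setup could look like, the -start option of condor_drain takes the expression to use for START while the machine drains. The job attribute IsPreemptable and the helper name drain_keep_preemptable below are made up for illustration. They are not the names used at any particular site, so substitute whatever attribute marks preemptable jobs in your pool.

```shell
#!/bin/sh
# Hypothetical helper: build a condor_drain command that keeps START open
# for jobs flagged as preemptable while the node drains.
# "IsPreemptable" is an invented job attribute, used only as an example.
drain_keep_preemptable() {
    host="$1"
    start_expr='TARGET.IsPreemptable =?= True'
    # Printed rather than executed, so the command can be reviewed first.
    printf "condor_drain -start '%s' %s\n" "$start_expr" "$host"
}

drain_keep_preemptable wn-lot-099.nikhef.nl
```

Printing the command instead of running it is a deliberate choice here: the expression depends on site-specific attributes, so it is worth checking before it touches a production node.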

Hope this helps
Todd


On 3/12/2021 9:17 AM, ervikrant06@xxxxxxxxx wrote:
Worker nodes accept jobs if the START expression evaluates to True. For a node to stop accepting new jobs, START should evaluate to False; existing jobs will keep running.

START = False

condor_reconfig (to reflect the change) 

From the condor master node, you may use this for multiple worker nodes:

condor_config_val -set 'START = False' -startd -name $host
condor_reconfig -name $host
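
The two commands above can be wrapped in a loop over hosts. This is only a sketch: the function name offline_nodes, the DRY_RUN switch, and the second host name wn-lot-098.nikhef.nl are assumptions added for illustration. It also assumes the target startd permits runtime configuration changes via condor_config_val -set, which is worth checking against your version's documentation.

```shell
#!/bin/sh
# Apply the two commands above to every host given as an argument.
# DRY_RUN=1 (the default in this sketch) only prints the commands;
# set DRY_RUN=0 to actually run them.
offline_nodes() {
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "condor_config_val -set 'START = False' -startd -name $host"
            echo "condor_reconfig -name $host"
        else
            condor_config_val -set 'START = False' -startd -name "$host"
            condor_reconfig -name "$host"
        fi
    done
}

# wn-lot-098.nikhef.nl is a made-up second host for the example.
offline_nodes wn-lot-098.nikhef.nl wn-lot-099.nikhef.nl
```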

Thanks & Regards,
Vikrant Aggarwal


On Fri, Mar 12, 2021 at 8:34 PM Jeff Templon <templon@xxxxxxxxx> wrote:

Hi

In Torque, if I don't want a worker node to accept any new jobs, I issue this command:

pbsnodes -o wn-lot-099.nikhef.nl

-o is for "offline".

What is the corresponding simplest way to achieve exactly this in HTCondor?

Thanks, JT

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685