Re: [HTCondor-users] How to set a worker node offline in HTCondor




On 3/31/21 1:55 PM, templon@xxxxxxxxx wrote:

Hi,

May be obvious to everybody else, but just in case (because it was not to us): my question was:


Sorry about that, Jeff:

I'm afraid that we aren't Slurm experts, so we're not quite sure what "exactly" means in this case.

"condor_off -peaceful" may have been what you wanted. Our naming of these options isn't the best. My mnemonic is that "peaceful" really means "pacifistic": it won't ever kill a job. Compare this with "condor_off -graceful", which gives jobs a "grace period" before killing them. "condor_off -fast" is probably the only option that doesn't need an explanation.

Another option might have been to set "START" on that worker node to False. The node would then not start any new jobs, while all existing jobs continue as usual. This might be a bit more work on your end, though.
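
For the condor_off route, the three modes differ only in the flag passed. The following is a minimal dry-run sketch, not a tested procedure: the helper name startd_off_cmd is hypothetical, and it only prints the command so the modes can be compared without touching a node. It assumes the command is aimed at the startd with "-daemon startd" and that the host is selected with "-name".

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the condor_off command for a
# given shutdown mode instead of executing it.
startd_off_cmd() {
  mode="$1"   # peaceful | graceful | fast
  host="$2"   # worker node to act on
  case "$mode" in
    peaceful|graceful|fast)
      echo "condor_off -$mode -daemon startd -name $host" ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1 ;;
  esac
}

# Show the "never kill a job" variant for the node from this thread.
startd_off_cmd peaceful wn-lot-099.nikhef.nl
```

Unlike setting START to False, this shuts the startd down once its jobs are gone, so the daemon has to be started again (for example with condor_on) before the node accepts work.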

-greg

In Torque, if I don't want a worker node to accept any new jobs, I issue this command:
pbsnodes -o wn-lot-099.nikhef.nl
-o is for "offline".
What is the corresponding simplest way to achieve exactly this in HTCondor?

Note the word "exactly" :)

The answer was the condor_drain command, but on its own it does not achieve exactly this. condor_drain also evicts running jobs from slots, depending on the value of MaxJobRetirementTime. I did not know about this variable, so we did not have it set. As a result, the nodes not only stopped accepting new jobs (the question) but also stopped running the already-running jobs, which was not the desired behavior.

Sure, after we "undrained" it the jobs re-started, but we lost the time (in some cases weeks) that they had already been running.
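
As a rough sketch of the "bit more" that was missing: the startd's MAXJOBRETIREMENTTIME setting is a number of seconds, so a node that should let long jobs finish during a drain needs a large value in its configuration before condor_drain is issued. The 30-day window below is purely an assumed example value, not something from our setup, and the snippet only prints the configuration line rather than installing it.

```shell
#!/bin/sh
# Hypothetical retirement window: 30 days, expressed in seconds.
RETIRE_SECS=$((30 * 24 * 3600))

# Line to add to the worker node's local HTCondor configuration
# (the 30-day value is an assumption, not a recommendation).
echo "MAXJOBRETIREMENTTIME = $RETIRE_SECS"

# With that in place and the startd reconfigured, a drain would then be:
#   condor_drain wn-lot-099.nikhef.nl
```

The window should be at least as long as the longest job expected on the node; anything shorter can still lead to eviction once it runs out.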

A tip for the next time somebody asks this question ;-)

JT

On 12 Mar 2021, at 17:00, Todd Tannenbaum wrote:

Hi Jeff,

I would suggest using "condor_drain", e.g.

    condor_drain wn-lot-099.nikhef.nl

This will tell the given worker node (condor_startd) to stop accepting new jobs by changing the START expression to False (as Vikrant suggested below), and change the state of the slots to Drain (so it is easy to see nodes in this state via condor_status).

Then when/if you want to resume accepting new jobs, you can do

    condor_drain -cancel wn-lot-099.nikhef.nl

which will return the START expression back to whatever it was before you issued the drain request.

See
 https://htcondor.readthedocs.io/en/latest/man-pages/condor_drain.html
for more info.

condor_drain can do nifty things, like setting the START expression to anything when in the drained state. The default is START=False, but here we set it up so our draining machines will continue to accept preemptable jobs.
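
A hedged sketch of what such a drain request could look like, assuming a condor_drain version that has the "-start" option described in the man page linked above. The job attribute IsPreemptable and the helper name drain_cmd are made up for illustration; substitute whatever your pool uses to mark preemptable jobs. The helper only prints the command.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints a condor_drain command that
# supplies a custom START expression to use while the node is draining.
drain_cmd() {
  host="$1"         # worker node to drain
  start_expr="$2"   # START expression to apply during the drain
  printf "condor_drain -start '%s' %s\n" "$start_expr" "$host"
}

# IsPreemptable is an assumed job attribute, not a standard one.
drain_cmd wn-lot-099.nikhef.nl 'TARGET.IsPreemptable =?= True'
```

The single quotes in the printed command keep the shell from interpreting the expression, so it reaches condor_drain as one argument.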

Hope this helps
Todd


On 3/12/2021 9:17 AM, ervikrant06@xxxxxxxxx wrote:
Worker nodes accept jobs if the START expression evaluates to True. START should evaluate to False for node(s) to stop accepting new jobs; existing jobs will keep running.

START = False

condor_reconfig (to reflect the change)

From the condor master node, for multiple worker nodes, you may use this:

condor_config_val -set 'START = False' -startd -name $host
condor_reconfig -name $host

Thanks & Regards,
Vikrant Aggarwal


On Fri, Mar 12, 2021 at 8:34 PM Jeff Templon <templon@xxxxxxxxx> wrote:

Hi

In Torque, if I don't want a worker node to accept any new jobs, I issue this command:

pbsnodes -o wn-lot-099.nikhef.nl

-o is for "offline".

What is the corresponding simplest way to achieve exactly this in HTCondor?

Thanks, JT

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 
