
Re: [HTCondor-users] How to set a worker node offline in HTCondor

The "condor_off -peaceful" command is what I usually use to get this behavior. The drawback is that it shuts down the daemons and stops reporting to the collector once all the jobs finish, rather than leaving the startd active in the "Drained" state, but that's reasonably straightforward to work with. I keep looking for a condor_drain -peaceful.

Michael V Pelletier
Principal Engineer

Raytheon Technologies
Digital Technology
HPC Support Team

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of templon@xxxxxxxxx
Sent: Wednesday, March 31, 2021 2:55 PM
To: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] How to set a worker node offline in HTCondor

This may be obvious to everybody else, but just in case (because it was not to us), my question was:
In Torque, if I don't want a worker node to accept any new jobs, I issue this command:
    pbsnodes -o wn-lot-099.nikhef.nl
-o is for "offline".
What is the corresponding simplest way to achieve exactly this in HTCondor?
Note the word "exactly" :)
The answer was the condor_drain command, but on its own it does not achieve exactly this. condor_drain also evicts running jobs from slots, depending on the value of MaxJobRetirementTime. I did not know about this variable, so we did not have it set, and besides the nodes not accepting new jobs (the question), they stopped running the already-running jobs, which was not the desired behavior.
Sure, after we "undrained" the nodes the jobs restarted, but we lost the run time (in some cases weeks) that they had already accumulated.
A tip for the next time somebody asks this question ;-)
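
For anyone finding this thread later, the "bit more" mentioned above could look roughly like the sketch below. It assumes condor_drain is used in its default graceful mode, which honors the retirement time, and the 30-day value is purely an illustrative assumption; pick something longer than the longest job you expect to run.

```
# Worker-node (startd) configuration sketch; the value is an assumed example.
# MAXJOBRETIREMENTTIME is given in seconds and defaults to 0, which is
# consistent with running jobs being evicted when it is left unset.
MAXJOBRETIREMENTTIME = 30 * 24 * 60 * 60
```

With a setting like this in place, a plain "condor_drain <hostname>" should stop new jobs from starting while letting the running ones finish, and "condor_drain -cancel <hostname>" returns the node to service.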