
Re: [HTCondor-users] Cancel peaceful shutdown / cgroupsv2



Hi Carsten,

good to hear from you :) 

If your jobs have an estimated or enforced runtime, you can alter the START expression on the worker nodes so that it takes a shutdown time into account before starting a job, and make this remotely administrable:

On the worker:

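# Defaults: not draining, no shutdown time set
InStageDrain = False
ShutdownTime = 0
# While draining, only start jobs whose retirement time ends before the shutdown
Drain = ((InStageDrain =?= True && (time() + MaxJobRetirementTime < ShutdownTime)) || InStageDrain =?= False)
# Publish the attributes in the machine ad and allow administrators to set them remotely
STARTD_ATTRS = InStageDrain, ShutdownTime, StartJobs, $(STARTD_ATTRS)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs, InStageDrain, ShutdownTime
START = (your other start options) && $(Drain)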
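
To make the logic of the Drain expression easier to follow, here is a minimal Python sketch of how it evaluates. It is only a model, not HTCondor code: the function name and the sample numbers are made up for illustration, and it assumes that MaxJobRetirementTime resolves to a number of seconds.

```python
from typing import Optional


def drain_allows_start(in_stage_drain: Optional[bool],
                       shutdown_time: int,
                       max_job_retirement_time: int,
                       now: int) -> bool:
    """Model of the Drain expression (hypothetical helper, not an HTCondor API).

    in_stage_drain may be None to model an undefined attribute; both =?=
    comparisons in the original expression are then false.
    """
    # The job must be able to finish strictly before the shutdown time.
    fits_before_shutdown = now + max_job_retirement_time < shutdown_time
    return (in_stage_drain is True and fits_before_shutdown) or in_stage_drain is False


if __name__ == "__main__":
    shutdown = 1559221188          # epoch value used as the example in this mail
    now = shutdown - 6 * 3600      # assumption: six hours left until shutdown
    print(drain_allows_start(True, shutdown, 4 * 3600, now))   # 4 h job
    print(drain_allows_start(True, shutdown, 8 * 3600, now))   # 8 h job
    print(drain_allows_start(False, shutdown, 8 * 3600, now))  # node not draining
```

With InStageDrain = False the second half of the expression is always true, so the node behaves as if the extra clause were not there.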

Then get the shutdown time of the node as a Unix timestamp, for example:
zitpcx35701%  date -d "May 30 14:59:48 CEST 2019" +%s
1559221188


condor_config_val -name <workernode> -startd -set "ShutdownTime = 1559221188"
condor_config_val -name <workernode> -startd -set "InStageDrain = True"
condor_reconfig <workernode> -daemon startd

From then on, this node will only start jobs that fit into the remaining time frame.
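
If you want to script these steps, a rough Python sketch could look like the one below. It only builds and prints the command lines shown above and does not run them. The helper names and the host name wn001.example.org are placeholders, and CEST is modelled as a fixed UTC+2 offset. As far as I know, -set is only accepted when persistent configuration is enabled on the node (ENABLE_PERSISTENT_CONFIG), so please check that in your setup.

```python
import shlex
from datetime import datetime, timedelta, timezone
from typing import List

# Assumption: CEST is treated as a fixed UTC+2 offset for this example.
CEST = timezone(timedelta(hours=2), "CEST")


def shutdown_epoch(when: datetime) -> int:
    """Convert a timezone-aware datetime to Unix time in seconds."""
    if when.tzinfo is None:
        raise ValueError("shutdown time must be timezone-aware")
    return int(when.timestamp())


def drain_commands(node: str, epoch: int) -> List[List[str]]:
    """Build the three command lines from this mail (hypothetical helper)."""
    return [
        ["condor_config_val", "-name", node, "-startd", "-set",
         f"ShutdownTime = {epoch}"],
        ["condor_config_val", "-name", node, "-startd", "-set",
         "InStageDrain = True"],
        ["condor_reconfig", node, "-daemon", "startd"],
    ]


if __name__ == "__main__":
    epoch = shutdown_epoch(datetime(2019, 5, 30, 14, 59, 48, tzinfo=CEST))
    for cmd in drain_commands("wn001.example.org", epoch):
        # Print only; pass cmd to subprocess.run() once you trust the output.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids quoting problems with the "Attribute = value" strings if you later hand them to subprocess.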

Best
Christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Carsten Aulbert" <carsten.aulbert@xxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 24 July 2019 21:35:05
Subject: [HTCondor-users] Cancel peaceful shutdown / cgroupsv2

Hi,

two quick questions which possibly don't warrant a full email on their own:

(1) After running condor_off -peaceful -daemon startd on a node, is it
possible to cancel this?

Background: We have to shuffle servers around and want to be nice
admins. Therefore, we stop the nodes peacefully 12+ hours in advance
before powering them off - and killing jobs which are still running then.

However, sometimes plans change and we have nodes where say 10 cores are
idle due to the shutdown but 2 cores are busy with jobs running for
another day or so. Then we have the choice to let these jobs finish and
"waste" the idle cores or kill those for the greater benefit. Or is
there a third option?

(2) I have not found it in the docs and I will only be able to start
testing this with condor 8.8 on Debian Buster after your next point
release, but do you already support cgroups v2? Our preliminary testing
has shown it to be quite powerful in terms of preventing jobs going into
swap while allowing processes outside of condor to use swap if needed.

Cheers and thanks a lot in advance

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/