
Re: [HTCondor-users] scheduled downtime configuration



Hi Michael,

the START ClassAd expression only affects job starts, not running jobs, so you can use all kinds of checks and put them into the evaluation string, for example the uptime of the machine (should maybe be more than 10 minutes), the health of mounted filesystems, free space in /var, etc.

We put all of that into a so-called health-check script run through startd cron, which comes up with NODE_IS_HEALTHY true or false. In the second case the START expression evaluates to false and the machine does not start any more jobs until the next run of the health script (given that this time it comes up with 'true'). Running jobs are not affected by this in any way.
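
For illustration, a minimal sketch of what such a health-check script could look like. It is not the script we actually run: the thresholds, the helper names, and the choice of checks (uptime and free space in /var only) are assumptions made for this example. The only fixed part is that startd cron reads ClassAd attribute assignments from the script's standard output.

```python
#!/usr/bin/env python3
"""Sketch of a startd-cron health check (thresholds are assumed values)."""
import shutil

MIN_UPTIME_SECONDS = 600            # assumption: node must be up for 10 minutes
MIN_VAR_FREE_BYTES = 1 * 1024 ** 3  # assumption: at least 1 GiB free in /var


def is_healthy(uptime_seconds, var_free_bytes,
               min_uptime=MIN_UPTIME_SECONDS, min_free=MIN_VAR_FREE_BYTES):
    """Return True only if every check passes."""
    return uptime_seconds > min_uptime and var_free_bytes > min_free


def format_classad(healthy):
    """Format the result as a ClassAd attribute assignment."""
    return "NODE_IS_HEALTHY = {}".format("True" if healthy else "False")


def read_uptime_seconds(path="/proc/uptime"):
    """Read the uptime in seconds (Linux only)."""
    with open(path) as handle:
        return float(handle.read().split()[0])


def main():
    try:
        healthy = is_healthy(read_uptime_seconds(),
                             shutil.disk_usage("/var").free)
    except (OSError, ValueError, IndexError):
        # If a check cannot be performed, report the node as unhealthy.
        healthy = False
    print(format_classad(healthy))


if __name__ == "__main__":
    main()
```

Treating a failed check as unhealthy is a deliberate choice here: the node then stops accepting new jobs rather than advertising a state it could not verify.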

Hope this helps :) 

Best
Christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Michael Pelletier" <Michael.V.Pelletier@xxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, May 17, 2019 16:12:24
Subject: Re: [HTCondor-users] scheduled downtime configuration

Hey Christoph,

I seem to remember that if START goes false, a machine will start evicting jobs. Am I remembering incorrectly?

I like this draining-to-shutdown approach, though - it's been something I've been meaning to look into, and it was nice to see it effortlessly appear in my inbox. :D

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Beyer, Christoph
Sent: Friday, May 17, 2019 8:21 AM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] scheduled downtime configuration

Hi Ben,

always good to hear from you :) 

Thanks for the insight, that's more or less what I had in mind. I am skipping the file/startd-cron part for now and have just made it remote-controllable through condor_config_val.

It's not very sophisticated but I put it here for whoever might be looking for something similar anyway :) 

On the workernode: 

InStageDrain = False
ShutdownTime = 0
Drain = ((InStageDrain =?= True && (time() + MaxJobRetirementTime < ShutdownTime)) || InStageDrain =?= False)
STARTD_ATTRS = InStageDrain, ShutdownTime, StartJobs, $(STARTD_ATTRS)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs, InStageDrain, ShutdownTime
START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True) && $(Drain)
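
To make the logic of the Drain expression easier to follow, here is a small Python model of it. This is only a sketch for reasoning about the expression, not something HTCondor runs; the function name and the use of None to stand in for an undefined attribute are assumptions of this example.

```python
def drain_allows_start(in_stage_drain, shutdown_time,
                       max_job_retirement_time, now):
    """Model of the Drain expression.

    in_stage_drain: True, False, or None (None stands in for undefined).
    All times are in seconds since the epoch.
    """
    if in_stage_drain is True:
        # While draining, only accept a job if its full retirement time
        # would still end before the planned shutdown.
        return now + max_job_retirement_time < shutdown_time
    # Not draining: jobs may start. The =?= operator never yields undefined,
    # so an undefined InStageDrain makes both branches false.
    return in_stage_drain is False


# Example with the shutdown time used in this mail and a retirement
# time of one hour (the one-hour value is an assumption).
shutdown = 1559221188
print(drain_allows_start(True, shutdown, 3600, shutdown - 7200))  # True
print(drain_allows_start(True, shutdown, 3600, shutdown - 600))   # False
```

In other words, once InStageDrain is set to True the node keeps accepting jobs only while a job could still use its complete MaxJobRetirementTime before ShutdownTime is reached.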

Remote control: 

zitpcx35701%  date -d "May 30 14:59:48 CEST 2019" +%s
1559221188

condor_config_val -name <workernode> -startd -set "ShutdownTime = 1559221188"
condor_config_val -name <workernode> -startd -set "InStageDrain = True"
condor_reconfig <workernode> -daemon startd                      
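
If you would rather compute the ShutdownTime value in a script than with date, the same timestamp can be derived in Python. A minimal sketch, assuming a fixed UTC+2 offset for CEST; the function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; a fixed offset avoids depending on the local time zone.
CEST = timezone(timedelta(hours=2), "CEST")


def shutdown_epoch(year, month, day, hour, minute, second, tz=CEST):
    """Return the shutdown time as whole seconds since the epoch."""
    return int(datetime(year, month, day, hour, minute, second,
                        tzinfo=tz).timestamp())


# Same point in time as the date command above.
print(shutdown_epoch(2019, 5, 30, 14, 59, 48))  # 1559221188
```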

Cheers
Chris

--
Christoph Beyer
DESY Hamburg
IT-Department

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/