
Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool



Hi Greg,

thanks for looking into this!

We do indeed use the START expression for similar checks for bad mounts, running a regex against a 'groups_down' list. The idea of putting scheduled downtimes in a central place on a shared filesystem also occurred to me as an option :)
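
Roughly along these lines (a simplified sketch, the names in our real
config differ):

    # Regex of accounting groups currently in downtime, e.g. refreshed
    # from a file on the shared filesystem.
    GROUPS_DOWN = "^(group_atlas|group_cms)$"
    STARTD_ATTRS = $(STARTD_ATTRS) GROUPS_DOWN
    # Refuse jobs whose accounting group matches the list.
    START = $(START) && (TARGET.AcctGroup is undefined || \
            !regexp(MY.GROUPS_DOWN, TARGET.AcctGroup))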

In the meantime I found NEGOTIATOR_TRIM_SHUTDOWN_THRESHOLD, which seems to be a mechanism that could be used for something similar, and NEGOTIATOR_MATCH_EXPRS, which might serve as a vehicle to transfer a 'group.downtime' value from the negotiator into the job classad, where it could then be referenced by the startd.
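
If I read the manual right, the NEGOTIATOR_MATCH_EXPRS part would look
something like this (untested sketch, the value and names are made up):

    # Negotiator config: the value of this macro is inserted into every
    # matched job classad; HTCondor prepends "NegotiatorMatchExpr" to
    # the attribute name if it doesn't already start with that string.
    GroupDowntime = 1586170800
    NEGOTIATOR_MATCH_EXPRS = GroupDowntime

    # The startd could then reference
    # TARGET.NegotiatorMatchExprGroupDowntime in its START expression.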

Please do not put too much effort into this if there are more urgent things to do. I will find my way somehow and just wanted to make sure I am not reinventing the wheel!

Best
christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone: +49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "gthain" <gthain@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, April 6, 2020 05:17:22
Subject: Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool

On 4/3/20 3:06 AM, Beyer, Christoph wrote:
> Hi all,
>
> Our pool is used by different VOs that match accounting groups to get the quotas right. Every now and then we have scheduled downtimes for fileserver maintenance, dCache upgrades, etc. for one or more of these VOs.
>
> As all our jobs carry estimated runtimes, the most elegant way would be to handle these temporary interruptions automatically: the beginning of a downtime would be noted in a config file, and the jobs of the matching VO would then be checked to see whether they fit into the remaining time window.
>
Hi Christoph:

I don't think we have a good way to do this at the negotiator level.

The best practice that we recommend for worker nodes that have shared
filesystems is to write a STARTD_CRON job for each filesystem that
detects whether the filesystem is healthy, and advertise that in the
startd classad as a boolean. Jobs that need those shared filesystems
add this boolean attribute to their job requirements, so they don't
match machines with bad filesystem mounts.
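
Roughly, the config side of that pattern looks like this; the probe
script path and the attribute name are placeholders, not anything that
ships with HTCondor:

    # Run a probe every 5 minutes; whatever "Attr = value" lines it
    # prints to stdout are merged into the startd classad when it exits.
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DCACHE_CHECK
    STARTD_CRON_DCACHE_CHECK_EXECUTABLE = /usr/local/libexec/check_dcache_mount
    STARTD_CRON_DCACHE_CHECK_PERIOD = 5m
    STARTD_CRON_DCACHE_CHECK_MODE = Periodic

    # The probe prints, e.g.:  DcacheMountHealthy = true
    # Jobs that need the mount then add to their submit file:
    #   requirements = (DcacheMountHealthy =?= true)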

One idea is to extend this so that instead of a boolean you advertise
some kind of time that you suspect the mount is good until, and factor
this into the job's requirements expression. I realize this is not the
centralized solution that's best for you, but the startd cron job
could read this data from a centralized place before advertising it.
It would also add in local knowledge, testing whether the filesystem
mount is currently working on that node.
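
In expression terms, that variant might look something like this;
MountGoodUntil and EstimatedRuntime are attribute names I'm making up
for the example:

    # The startd cron probe advertises a unix timestamp read from the
    # central downtime file (and leaves it out if the mount is broken
    # right now on this node):
    #   MountGoodUntil = 1586170800

    # The job's submit file then requires that its estimated runtime
    # fits before that deadline:
    requirements = (MountGoodUntil isnt undefined) && \
                   (time() + MY.EstimatedRuntime < MountGoodUntil)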

Does this sound like the kind of hack you can live with?


-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/