Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool
- Date: Mon, 06 Apr 2020 07:48:16 +0200 (CEST)
- From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
- Subject: Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool
Thanks for looking into this!
We do indeed use the START expression for similar checks for bad mounts and run a regex against a 'groups_down' list. The idea of putting scheduled downtimes in a central place on a shared filesystem had also occurred to me as an option :)
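For the archives, a minimal sketch of that START-expression approach (the macro name, group names, and the exact pattern are assumptions for illustration, not our production config):

```
# Hypothetical startd config: refuse new jobs from accounting groups
# listed as down. GROUPS_DOWN would be maintained centrally, e.g. by
# a cron job that rewrites this config fragment and reconfigs.
GROUPS_DOWN = "group_atlas|group_cms"

# Only start a job whose accounting group does not match the down list.
# The =!= True guards against AccountingGroup being undefined.
START = $(START) && (regexp("^($(GROUPS_DOWN))", TARGET.AccountingGroup) =!= True)
```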
In the meantime I found NEGOTIATOR_TRIM_SHUTDOWN_THRESHOLD, which vaguely looks like a mechanism that could be used for something similar, and NEGOTIATOR_MATCH_EXPRS, which might serve as a vehicle to transfer a 'group.downtime' value from the negotiator into the job ClassAd, where it could then be referenced by the startd.
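If I read the manual right, NEGOTIATOR_MATCH_EXPRS takes a list of config macro names and inserts their values into the matched job ClassAd under a "NegotiatorMatchExpr" prefix. A sketch (the downtime macro and its value are made up for illustration):

```
# Negotiator config (sketch): publish a per-group downtime timestamp.
# GROUP_DOWNTIME is a hypothetical macro holding a unix timestamp.
GROUP_DOWNTIME = 1586728800
NEGOTIATOR_MATCH_EXPRS = GROUP_DOWNTIME

# The matched job ad should then carry an attribute named
# NegotiatorMatchExprGROUP_DOWNTIME, which startd policy
# expressions could reference via the job ad.
```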
Please do not put too much effort into this if there are more urgent things to do. I will find my way somehow and just wanted to make sure not to reinvent the wheel!
Building 02b, Room 009
----- Original Message -----
From: "gthain" <gthain@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, April 6, 2020 05:17:22
Subject: Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool
On 4/3/20 3:06 AM, Beyer, Christoph wrote:
> Hi all,
> Our pool is used by different VOs that match accountinggroups to get the quotas right. Every now and then we do have scheduled downtimes for fileserver maintenance, dcache upgrades etc. for one or more of these VOs.
> As we do have all jobs with estimated runtimes it would be the most elegant way to handle these temporary interruptions automated. The begin of downtime should be noted in a config file and then the jobs of the matching VO should be checked if they fit in to the remaining time window.
I don't think we have a good way to do this at the negotiator level.
The best practice that we recommend for worker nodes with shared
filesystems is to write a STARTD_CRON job for each filesystem that detects
whether the filesystem is healthy and advertises that in the startd ClassAd
as a boolean. Jobs that need those shared filesystems add this boolean
attribute to their job requirements, so they don't match machines with
bad filesystem mounts.
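A sketch of such a STARTD_CRON setup (the job name, script path, and attribute name are placeholders, not a tested recipe):

```
# Startd config (sketch): run a mount-health probe every few minutes.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) SHAREDFS
STARTD_CRON_SHAREDFS_EXECUTABLE = /usr/local/libexec/check_sharedfs.sh
STARTD_CRON_SHAREDFS_PERIOD = 300s

# The script prints ClassAd attributes on stdout, e.g.:
#   SharedFS_OK = true
# Jobs that need the mount then add to their submit file:
#   requirements = (TARGET.SharedFS_OK =?= true)
```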
One idea is to extend this: instead of advertising a boolean, advertise
some kind of time until which you suspect the mount is good, and factor
this into the job's requirements expression. I realize this is not the
centralized solution that's best for you, but the startd cron job could read
this data from a centralized place before advertising it. It would also
add in local knowledge, testing whether the filesystem mount is currently
working on that node.
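Extending the boolean into a "good until" time, the job-side check could be sketched roughly like this (both attribute names are invented for illustration):

```
# Advertised by the startd cron script (sketch): a unix timestamp until
# which the mount is believed healthy, read from a central downtime file:
#   SharedFS_GoodUntil = 1586728800

# Job submit file (sketch): only match machines where the mount should
# stay healthy for the job's whole estimated runtime.
+EstimatedRuntime = 7200
requirements = (time() + MY.EstimatedRuntime) < TARGET.SharedFS_GoodUntil
```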
Does this sound like the kind of hack you can live with?