Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool

On 4/3/20 3:06 AM, Beyer, Christoph wrote:
Hi all,

Our pool is used by different VOs that are mapped to accountinggroups to get the quotas right. Every now and then we have scheduled downtimes for fileserver maintenance, dcache upgrades etc. for one or more of these VOs.

As all our jobs have estimated runtimes, the most elegant way would be to handle these temporary interruptions automatically. The start of the downtime should be noted in a config file, and the jobs of the matching VO should then be checked to see whether they fit into the remaining time window.

Hi Christoph:

I don't think we have a good way to do this at the negotiator level.

The best practice that we recommend for worker nodes that have shared filesystems is to write a STARTD_CRON job for each filesystem that detects whether the filesystem is healthy and advertises the result in the startd ClassAd as a boolean. Jobs that need those shared filesystems add this boolean attribute to their job requirements, so they don't match machines with bad filesystem mounts.
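
As a rough illustration of such a probe, here is a minimal sketch in Python. The mount point /mnt/shared and the attribute name SharedFsHealthy are made up for the example, and the health check (is it a mount, can it be listed) is only one possible definition of "healthy". A startd cron job advertises attributes by printing "Attribute = value" lines on stdout.

```python
#!/usr/bin/env python3
"""Hypothetical STARTD_CRON probe: advertise whether a shared filesystem looks healthy."""
import os
import sys


def is_healthy(mount_point):
    """Return True if mount_point is a mount point and its top level can be listed."""
    try:
        if not os.path.ismount(mount_point):
            return False
        os.listdir(mount_point)  # raises OSError on a stale or broken mount
        return True
    except OSError:
        return False


def classad_line(attr, healthy):
    """Format a boolean ClassAd attribute assignment for the startd to pick up."""
    return f"{attr} = {'True' if healthy else 'False'}"


if __name__ == "__main__":
    # Assumed mount point and attribute name; adjust both for the site.
    mount = sys.argv[1] if len(sys.argv) > 1 else "/mnt/shared"
    print(classad_line("SharedFsHealthy", is_healthy(mount)))
```

A job that needs this filesystem would then reference the advertised attribute in its submit file, for example with requirements = (SharedFsHealthy == True). Note that a hung network mount can block the listing call, so a real probe would probably want a timeout around it.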

One idea is to extend this, so you don't advertise a boolean, but rather a time until which you expect the mount to stay good, and factor this into the job's requirement expression. I realize this is not the centralized solution that's best for you, but the startd cron could read this data from a centralized place before advertising it. It would also add in local knowledge, testing whether the filesystem mount is currently working on that node.
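
To make the idea concrete, here is a sketch of what such a probe might look like in Python. Everything site-specific is an assumption: the schedule file /etc/condor/downtime_start (imagined here as holding a single Unix epoch timestamp, distributed from a central place), the mount point /mnt/shared, and the attribute name SharedFsGoodUntil. It combines the central schedule with a local mount test and prints one integer attribute.

```python
#!/usr/bin/env python3
"""Hypothetical STARTD_CRON probe: advertise until when a shared mount is expected to be good."""
import os

NO_DOWNTIME = 2**31 - 1  # sentinel: far-future epoch meaning "no downtime known"


def read_downtime_start(path):
    """Return the scheduled downtime start as epoch seconds, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def mount_ok(mount_point):
    """Local knowledge: is the mount present and listable on this node right now?"""
    try:
        if not os.path.ismount(mount_point):
            return False
        os.listdir(mount_point)
        return True
    except OSError:
        return False


def good_until(downtime_start, healthy):
    """Combine the central schedule with the local mount check."""
    if not healthy:
        return 0  # already unusable on this node
    if downtime_start is None:
        return NO_DOWNTIME  # optimistic choice when no schedule is readable
    return downtime_start


if __name__ == "__main__":
    # Assumed paths and attribute name; adjust for the site.
    start = read_downtime_start("/etc/condor/downtime_start")
    value = good_until(start, mount_ok("/mnt/shared"))
    print(f"SharedFsGoodUntil = {value}")
```

The job side could then compare this against its own runtime estimate, along the lines of requirements = (TARGET.SharedFsGoodUntil > time() + MY.EstimatedRuntime), where EstimatedRuntime stands in for whatever job attribute holds the estimate in seconds. Treating an unreadable schedule file as "no downtime" is a design choice; a more cautious site might advertise 0 in that case instead.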

Does this sound like the kind of hack you can live with?