Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool

Date: Sun, 05 Apr 2020 22:17:22 -0500
From: Gregory Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool


On 4/3/20 3:06 AM, Beyer, Christoph wrote:

Hi all,



Our pool is used by different VOs that match accountinggroups to get the quotas right. Every now and then we do have scheduled downtimes for fileserver maintenance, dcache upgrades etc. for one or more of these VOs.

As all of our jobs have estimated runtimes, the most elegant approach would be to handle these temporary interruptions automatically. The start of the downtime would be noted in a config file, and the jobs of the matching VO would then be checked to see whether they fit into the remaining time window.

Hi Christoph:

I don't think we have a good way to do this at the negotiator level.

The best practice that we recommend for worker nodes that have shared filesystems is to write a STARTD_CRON for each filesystem that detects if the filesystem is healthy, and advertise that in the startd classad as a boolean. Jobs that need those shared filesystems add this boolean attribute to their job requirements, so they don't match machines with bad filesystem mounts.
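
As a rough illustration of the health-check idea, here is a minimal sketch of a script that a startd cron job could run. The mount point /mnt/shared and the attribute name HasSharedFs are made-up placeholders, not anything from this thread, and the check itself (is it a mount point, can it be listed) is only one possible definition of "healthy". The script prints one "Attribute = value" line on stdout, which is the form in which a startd cron job hands attributes to the startd.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron health check for one shared filesystem."""
import os

# Hypothetical values, chosen only for illustration.
MOUNT_POINT = "/mnt/shared"
ATTRIBUTE = "HasSharedFs"


def fs_healthy(path):
    """Return True if path is a mount point whose top level can be listed.

    Note: a hung network mount can block listdir(); a production check
    would likely want a timeout around this call.
    """
    try:
        if not os.path.ismount(path):
            return False
        os.listdir(path)
        return True
    except OSError:
        return False


def classad_line(name, value):
    """Format a boolean as a ClassAd attribute assignment."""
    return "{} = {}".format(name, "true" if value else "false")


if __name__ == "__main__":
    print(classad_line(ATTRIBUTE, fs_healthy(MOUNT_POINT)))
```

The script would be hooked in through the STARTD_CRON_JOBLIST, STARTD_CRON_<JobName>_EXECUTABLE and STARTD_CRON_<JobName>_PERIOD configuration knobs, and a job needing the filesystem would add the (hypothetical) HasSharedFs attribute to its requirements expression.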

One idea is to extend this, so you don't advertise a boolean, but rather some kind of time that you suspect the mount is good until, and factor this into the job's requirement expression. I realize this is not the centralized solution that's best for you, but the startd cron could read this data from a centralized place before advertising it. It would also add in local knowledge, testing whether the filesystem mount is currently working on that node.
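
The "good until" variant could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: downtimes are represented as (start, end) pairs of epoch seconds, the attribute is called SharedFsGoodUntil, and "no downtime known" is encoded as a far-future sentinel. How the downtime windows are fetched from a central place is left out, and the local mount check is passed in as a plain boolean.

```python
#!/usr/bin/env python3
"""Sketch: advertise the time until which a shared filesystem is expected to work."""
import time

# Hypothetical sentinel meaning "no downtime is currently scheduled".
FAR_FUTURE = 2**31 - 1


def good_until(windows, mount_ok, now):
    """Return the epoch time until which the mount is expected to be usable.

    windows  -- iterable of (start, end) downtime windows in epoch seconds
    mount_ok -- result of a local health check on this node
    now      -- current epoch time
    Returns 0 if the mount is broken or a downtime is in progress.
    """
    if not mount_ok:
        return 0
    upcoming = []
    for start, end in windows:
        if start <= now < end:
            return 0  # inside a downtime window right now
        if start > now:
            upcoming.append(start)
    return min(upcoming) if upcoming else FAR_FUTURE


def job_fits(good_until_time, now, estimated_runtime):
    """Mirror of the job-side check: does the job finish before the downtime?"""
    return good_until_time - now >= estimated_runtime


if __name__ == "__main__":
    now = int(time.time())
    # Example schedule: one downtime starting in two hours, lasting one hour.
    windows = [(now + 7200, now + 10800)]
    print("SharedFsGoodUntil = {}".format(good_until(windows, True, now)))
```

On the job side, the requirements expression would then compare the advertised time against the current time plus the job's estimated runtime, in the same way job_fits does here. The exact expression depends on how the estimated runtime is stored in the job ad, which this thread does not specify.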


Does this sound like the kind of hack you can live with?


-greg
