
Re: [HTCondor-users] Hardening against NFS failure

I wrote my check in such a way that it can also be called by Nagios as a check, so you get active reporting of problems.

Tom Downes
Senior Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee

On Wed, Mar 1, 2017 at 10:15 AM, Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
I can recommend this approach - we had the same kind of problem with automount maps in RHEL5 through about RHEL6.3, so I added a startd_cron check to ensure that automountd was running and that an exemplar automount point was reachable.

Another useful trick is that you can create nested ifThenElse() statements to report the reason that the START expression went false when such a condition occurs:

StartError = ifThenElse( DeadNFS, "NFS is dead", ifThenElse(DeadAutomount, "Automount is dead", "No error" ))

    -Michael Pelletier.

> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
> Behalf Of Ben Cotton
> Sent: Monday, February 27, 2017 12:46 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Hardening against NFS failure
> Justin,
> One option would be to write a check that verifies the status of the NFS
> mount and put that in a STARTD_CRON (see
> https://research.cs.wisc.edu/htcondor/manual/latest/4_4Hooks.html#SECTION00543000000000000000).
> Then your START expression could use that value. For example, if the
> attribute from the STARTD_CRON is nfsCheck_IsGood, then you can set
> START = $(START) && nfsCheck_IsGood
> That way, if the NFS check fails, those slots won't accept jobs until the check
> passes again.
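
A minimal sketch of what such a check might look like in Python, not the script anyone in this thread actually runs. The attribute name nfsCheck_IsGood comes from the example above; the default path /home, the 10-second timeout, the --nagios flag, and all function names are hypothetical. The stat runs in a child process because a stat on a dead hard-mounted NFS path can block indefinitely. By default the script prints the attribute and exits 0, which is the form intended for STARTD_CRON; with --nagios it prints a status line and uses the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```python
#!/usr/bin/env python3
"""Sketch of an NFS health check usable from STARTD_CRON and from Nagios."""
import argparse
import subprocess
import sys

# Hypothetical defaults; adjust to the site.
DEFAULT_MOUNT = "/home"
TIMEOUT_SECONDS = 10


def mount_is_good(path, timeout=TIMEOUT_SECONDS):
    """Return True if `path` can be stat'ed within `timeout` seconds."""
    # Do the stat in a child so that a hung mount cannot block this process.
    child = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    try:
        proc = subprocess.Popen(
            child, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort; do not wait on a possibly stuck child
        return False


def classad_line(is_good):
    """Format the result as an 'Attribute = value' line for the startd."""
    return "nfsCheck_IsGood = {}".format("true" if is_good else "false")


def main(argv):
    parser = argparse.ArgumentParser(description="NFS mount health check")
    parser.add_argument("path", nargs="?", default=DEFAULT_MOUNT)
    parser.add_argument("--nagios", action="store_true",
                        help="print a status line and use Nagios exit codes")
    args = parser.parse_args(argv[1:])

    good = mount_is_good(args.path)
    if args.nagios:
        state = "OK" if good else "CRITICAL"
        print("NFS {}: {}".format(state, args.path))
        return 0 if good else 2
    print(classad_line(good))
    return 0

# To run this file as a script, end it with:
#     raise SystemExit(main(sys.argv))
```

Under STARTD_CRON the script would be called without --nagios so that only the attribute line reaches the startd; a Nagios check would call the same script with --nagios and the mount point to test.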
