Re: [HTCondor-users] Hardening against NFS failure



I can recommend this approach. We had the same kind of problem with automount maps from RHEL5 through about RHEL6.3, so I added a startd_cron check to ensure that automountd was running and that an exemplar automount point was reachable.
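
For anyone who wants a starting point, here is a minimal sketch of the reachability half of such a check. It is not the script we actually ran. The default path /net/example and the function names are hypothetical, and it assumes the probe prints "Attribute = Value" lines on stdout for the startd to pick up. It reports DeadAutomount so the name lines up with the expression further down. The stat runs in a worker thread so the probe can give up after a timeout instead of blocking, although a hard-hung mount may still keep the process from exiting cleanly.

```python
#!/usr/bin/env python3
"""Hypothetical startd_cron probe: is an exemplar automount point reachable?"""
import os
import sys
import threading


def path_reachable(path, timeout=10.0):
    """Return True if os.stat(path) succeeds within `timeout` seconds."""
    result = []

    def probe():
        try:
            os.stat(path)
            result.append(True)
        except (OSError, ValueError):
            result.append(False)

    # A daemon thread lets the main thread move on if stat() never returns.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    # An empty result means the stat did not finish in time.
    return bool(result) and result[0]


def classad_line(attr, value):
    """Format a boolean as an 'Attr = True/False' line."""
    return "{} = {}".format(attr, "True" if value else "False")


def main(argv):
    # /net/example is a placeholder; pass the real mount point as argv[1].
    path = argv[1] if len(argv) > 1 else "/net/example"
    print(classad_line("DeadAutomount", not path_reachable(path)), flush=True)


if __name__ == "__main__":
    main(sys.argv)
```

A separate check that the automount daemon itself is running would be added alongside this; it is left out here because the process name differs between systems.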

Another useful trick is to nest ifThenElse() calls so that the slot reports why the START expression went false when such a condition occurs:

StartError = ifThenElse( DeadNFS, "NFS is dead", ifThenElse(DeadAutomount, "Automount is dead", "No error" ))

	-Michael Pelletier.

> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
> Behalf Of Ben Cotton
> Sent: Monday, February 27, 2017 12:46 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Hardening against NFS failure
> 
> Justin,
> 
> One option would be to write a check that verifies the status of the NFS
> mount and put that in a STARTD_CRON (see
> https://research.cs.wisc.edu/htcondor/manual/latest/4_4Hooks.html#SECTI
> ON00543000000000000000).
> Then your START expression could use that value. For example, if the
> attribute from the STARTD_CRON is nfsCheck_IsGood, then you can set
> 
> START = $(START) && nfsCheck_IsGood
> 
> That way, if the NFS check fails, those slots won't accept jobs until the check
> passes again.
>