Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Hardening against NFS failure

Date: Mon, 27 Feb 2017 17:42:27 +0000
From: Stephen Jones <sjones@xxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Hardening against NFS failure

Hi Justin,

On 02/27/2017 04:36 PM, Justin Fisher wrote:

The input file is on an NFS share and points to thousands of files on the same NFS share. The output is written to another directory on the same NFS share.
I find that one of my machines is flaky and the NFS keeps dropping out.

Let's see if I have this straight. You have a job running which reads data from file(s) on an NFS share. NFS is flaky and quits, so


a) the jobs that are running can't read or write and crash out.

b) the jobs that are queued get run, and they can't even start to read or write.


Is that it?

Is there a way to divert these files onto a machine that is alive?

It doesn't sound like HTCondor is doing anything wrong; just NFS. I don't know what you mean by "divert these files". Do you mean the files read by the job, the files written by the job or both? And do you mean that the job should look elsewhere for its data (or to write data) if it fails to find (or write) data on the original NFS share? If so, this is a "job level" action. For a job that is just starting to run, it could sense whether the file it expects is available. If it is not, the job could look in another place to see if it is there. That would mean some kind of change to the logic of the job, or perhaps a wrapper around the job. For jobs that are already running: that's harder. If the data is snatched from beneath a running job when NFS fails, then the results are rather unpredictable.
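As a rough illustration of the wrapper idea, here is a minimal Python sketch. It assumes the same input exists in more than one place and that the job takes its input path as its last argument; the function names `first_available` and `run_with_fallback` are made up for this example and are not part of HTCondor. Note that on a hard-mounted NFS share these checks may hang rather than fail, so this only helps when the mount returns an error.

```python
import os
import subprocess


def first_available(candidates):
    """Return the first path that is a readable regular file, or None."""
    for path in candidates:
        # isfile() returns False for a missing path instead of raising.
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    return None


def run_with_fallback(job_argv, candidates):
    """Run the job on the first reachable input copy.

    job_argv   -- the job command as a list (hypothetical, e.g. ["./myjob"])
    candidates -- input locations to try, in order of preference
    Returns the job's exit code, or 1 if no copy of the input is reachable.
    """
    chosen = first_available(candidates)
    if chosen is None:
        return 1
    # Assumption: the job accepts its input path as the final argument.
    return subprocess.call(list(job_argv) + [chosen])
```

The submit file would then name the wrapper as the executable instead of the job itself. A non-zero exit code at least makes the failure visible at start-up instead of part-way through a run.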

Obviously, I need to find out why this machine keeps crashing NFS, but I'm wondering if there is a workaround while I do this?

Many years ago, on another batch system, I saw that NFS "locked up" after running busy jobs for a long time. The answer then was to drain the nodes every few days, and reboot them. As long as we didn't exceed (say) three days uptime for a node, it would not break. We'd have to do that continuously for weeks to get the jobs through. Anything to keep the show on the road. There was a bloke called Trond Myklebust who did a lot to try to make NFS better, but I don't know if he ever made it absolutely bomb proof.


Cheers,

Ste



--
Steve Jones                             sjones@xxxxxxxxxxxxxxxx
Grid System Administrator               office: 220
High Energy Physics Division            tel (int): 43396
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 3396
University of Liverpool                 http://www.liv.ac.uk/physics/hep/
