Re: [HTCondor-users] Hardening against NFS failure
- Date: Mon, 27 Feb 2017 17:42:27 +0000
- From: Stephen Jones <sjones@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Hardening against NFS failure
On 02/27/2017 04:36 PM, Justin Fisher wrote:
> The input file is on an NFS share and points to thousands of files on
> the same NFS share. The output is written to another directory on the
> same NFS share.
> I find that one of my machines is flaky and the NFS keeps dropping out.
Let's see if I have this straight. You have a job running which reads
data from file(s) on an NFS share. NFS is flaky and quits, so
a) the jobs that are running can't read or write and crash out.
b) the jobs that are queued get run, and they can't even start to read.
Is that it?
> Is there a way to divert these files onto a machine that is alive?
It doesn't sound like HTCondor is doing anything wrong; just NFS. I
don't know what you mean by "divert these files". Do you mean the files
read by the job, the files written by the job or both? And do you mean
that the job should look elsewhere for its data (or to write data) if it
fails to find (or write) data on the original NFS share? If so, this is
a "job level" action. For a job that is just starting to run, it could
sense whether the file it expects is available. If it is not, the job
could look in another place to see if it is there. That would mean some
kind of change to the logic of the job, or perhaps a wrapper around the
job. For jobs that are already running: that's harder. If the data is
snatched from beneath a running job when NFS fails, then the results are
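
For the "job that is just starting" case, the wrapper idea could look
roughly like the sketch below. It is only an illustration under stated
assumptions: the function names, the exit status 75 and the idea of
passing the chosen input path as the last argument are all made up for
the example, and none of it is something HTCondor provides. Note also
that on a hard-mounted share that has hung, the availability check
itself may block, so this only covers the case where the file is
reported as missing or unreadable.

```python
import os
import subprocess
import sys

# Arbitrary exit status for "no usable input found" (an assumption for this sketch).
NO_INPUT_EXIT = 75


def first_readable(candidates):
    """Return the first candidate that is an existing, readable file, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    return None


def run_with_fallback(candidates, command):
    """Run `command` with the chosen input path appended as its last argument.

    Returns the command's exit status, or NO_INPUT_EXIT if no candidate
    is readable (in which case the command is not started at all).
    """
    chosen = first_readable(candidates)
    if chosen is None:
        print("no readable input among: " + ", ".join(candidates), file=sys.stderr)
        return NO_INPUT_EXIT
    return subprocess.call(list(command) + [chosen])
```

A wrapper script would then end with something like
sys.exit(run_with_fallback([primary_path, fallback_path], ["./my_job"])),
where primary_path is the file on the NFS share, fallback_path is a copy
kept somewhere else, and ./my_job stands in for the real executable.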
> Obviously, I need to find out why this machine keeps crashing NFS, but
> I'm wondering if there is a workaround while I do this?
Many years ago, on another batch system, I saw that NFS "locked up"
after running busy jobs for a long time. The answer then was to drain
the nodes every few days, and reboot them. As long as we didn't exceed
(say) three days uptime for a node, it would not break. We'd have to do
that continuously for weeks to get the jobs through. Anything to keep
the show on the road. There was a bloke called Trond Myklebust who did
a lot to try to make NFS better, but I don't know if he ever made it
absolutely bomb proof.
Steve Jones sjones@xxxxxxxxxxxxxxxx
Grid System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/