Re: [HTCondor-users] Hardening against NFS failure

Hi Justin,

On 02/27/2017 04:36 PM, Justin Fisher wrote:
The input file is on an NFS share and points to thousands of files on the same NFS share. The output is written to another directory on the same NFS share.

I find that one of my machines is flaky and the NFS keeps dropping out.

Let's see if I have this straight. You have a job running which reads data from file(s) on an NFS share. NFS is flaky and quits, so

a) the jobs that are running can't read or write and crash out.

b) the jobs that are queued get run, and they can't even start to read or write.

Is that it?

Is there a way to divert these files onto a machine that is alive?

It doesn't sound like HTCondor is doing anything wrong; just NFS. I don't know what you mean by "divert these files". Do you mean the files read by the job, the files written by the job, or both? And do you mean that the job should look elsewhere for its data (or to write data) if it fails to find (or write) data on the original NFS share?

If so, this is a "job level" action. For a job that is just starting to run, it could sense whether the file it expects is available. If it is not, the job could look in another place to see if it is there. That would mean some kind of change to the logic of the job, or perhaps a wrapper around the job.

For jobs that are already running: that's harder. If the data is snatched from beneath a running job when NFS fails, then the results are rather unpredictable.
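
A minimal sketch of such a wrapper, in Python, under some assumptions: the job takes its input path as its last argument, and a second copy of the input exists somewhere other than the flaky share. The names first_readable and run_with_fallback are made up for illustration; nothing here is an HTCondor feature. Note that it only catches failures that show up as errors. A hard-mounted NFS share that hangs will block the open() call rather than fail it.

```python
import subprocess
import sys
from typing import Iterable, Optional, Sequence


def first_readable(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that can be opened and read, else None."""
    for path in candidates:
        try:
            with open(path, "rb") as handle:
                handle.read(1)  # force a real read, not just a name lookup
            return path
        except OSError:
            continue  # missing file, stale handle, I/O error: try the next one
    return None


def run_with_fallback(candidates: Sequence[str], command: Sequence[str]) -> int:
    """Run command with the first readable candidate appended as its last argument.

    Returns the command's exit code, or 1 if no candidate was readable.
    """
    chosen = first_readable(candidates)
    if chosen is None:
        print("no readable input among: " + ", ".join(candidates), file=sys.stderr)
        return 1
    return subprocess.call(list(command) + [chosen])
```

The wrapper would be submitted as the job's executable, with the real program and the list of candidate locations passed in; for example run_with_fallback(["/nfs/data/input.list", "/scratch/input.list"], ["./myjob"]), where both paths and ./myjob are placeholders. Exiting non-zero when nothing is readable at least makes the failure visible at start-up instead of partway through the run.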

Obviously, I need to find out why this machine keeps crashing NFS, but I'm wondering if there is a workaround while I do this?

Many years ago, on another batch system, I saw that NFS "locked up" after running busy jobs for a long time. The answer then was to drain the nodes every few days and reboot them. As long as a node's uptime didn't exceed (say) three days, it would not break. We'd have to do that continuously for weeks to get the jobs through. Anything to keep the show on the road. There was a bloke called Trond Myklebust who did a lot to try to make NFS better, but I don't know if he ever made it absolutely bombproof.



Steve Jones                             sjones@xxxxxxxxxxxxxxxx
Grid System Administrator               office: 220
High Energy Physics Division            tel (int): 43396
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 3396
University of Liverpool                 http://www.liv.ac.uk/physics/hep/