
[HTCondor-users] Hardening against NFS failure



Hi.

Is there a way to keep a condor job running if an NFS mount goes down during that job?

I'm using v8.6 and my submit file looks like this:

Universe = vanilla
Requirements = Arch == "X86_64" && TARGET.OpSys == "LINUX"
Executable = /usr/share/ngspice_2016_08_05/bin/ngspice
transfer_input_files = $(filename)
Arguments = -o $Fdb(filename)_$Fn(filename).log $(filename)
Should_transfer_files = Yes
When_to_transfer_output = on_exit
Request_memory = 8 GB
Request_disk = 50 MB
Request_cpus = 4
accounting_group = group_ANALOG
accounting_group_user = jfisher
## Log = log
Queue filename from /some/file/location/condor.in

The input file is on an NFS share and points to thousands of files on the same NFS share. The output is written to another directory on the same NFS share.
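I did wonder whether keeping the execute node away from NFS entirely and letting HTCondor carry the output back would help. With should_transfer_files enabled (as above), the .log written in the job's scratch directory comes back via file transfer, and a remap could then place it on the NFS share from the submit side instead of the execute node. Something like the line below, where the output name and destination path are only placeholders:

## Output name and destination path are placeholders:
transfer_output_remaps = "output.log = /nfs/results/output.log"

Is that a sensible direction, or am I missing something?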

I find that one of my machines is flaky and its NFS mount keeps dropping out. When that happens, many of the submitted jobs fail with messages saying the input file can't be found. Is there a way to divert these jobs onto a machine whose NFS mount is still alive? Obviously I need to find out why this machine keeps crashing NFS, but I'm wondering if there is a workaround while I do that.
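For what it's worth, the workaround I was considering looks like the lines below. The hostname is made up and the retry numbers are guesses; the idea is to keep jobs off the flaky node, and to hold and later release jobs that fail so they rerun elsewhere rather than leaving the queue:

## Keep jobs off the flaky execute node (hostname is made up):
Requirements = Arch == "X86_64" && TARGET.OpSys == "LINUX" && TARGET.Machine != "flaky-node.example.com"
## If ngspice exits non-zero (e.g. because its input file was missing), hold the job instead of letting it leave the queue:
on_exit_hold = (ExitBySignal == False) && (ExitCode != 0)
## Release held jobs after 10 minutes so they can match another machine, up to 3 attempts:
periodic_release = (NumJobStarts < 3) && ((time() - EnteredCurrentStatus) > 600)

Would that work on 8.6, or is there a cleaner way to do this?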

--
Kind regards,

Justin Fisher.