
[HTCondor-users] Hardening against NFS failure



Hi.

Is there a way to keep a condor job running if an NFS mount goes down during that job?

I'm using v8.6 and my submit file looks like this:

Universe = vanilla
Requirements = Arch == "X86_64" && TARGET.OpSys == "LINUX"
Executable = /usr/share/ngspice_2016_08_05/bin/ngspice
transfer_input_files = $(filename)
Arguments = -o $Fdb(filename)_$Fn(filename).log $(filename)
Should_transfer_files = Yes
When_to_transfer_output = on_exit
Request_memory = 8 GB
Request_disk = 50 MB
Request_cpus = 4
accounting_group = group_ANALOG
accounting_group_user = jfisher
## Log = log
Queue filename from /some/file/location/condor.in

The input file is on an NFS share and points to thousands of files on the same NFS share. The output is written to another directory on the same NFS share.
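I did wonder whether keeping the execute node away from NFS entirely and letting HTCondor carry the output back would help. With should_transfer_files enabled (as above), the .log written in the job's scratch directory comes back via file transfer, and a remap could then place it on the NFS share from the submit side instead of the execute node. Something like the line below, where the output name and destination path are only placeholders:

## Output name and destination path are placeholders:
transfer_output_remaps = "output.log = /nfs/results/output.log"

Is that a sensible direction, or am I missing something?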

I find that one of my machines is flaky and its NFS mount keeps dropping out. When that happens, many of the submitted jobs fail with messages saying the input file can't be found. Is there a way to divert these jobs onto a machine whose NFS mount is still alive? Obviously I need to find out why this machine keeps crashing NFS, but I'm wondering if there is a workaround while I do that.
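For what it's worth, the workaround I was considering looks like the lines below. The hostname is made up and the retry numbers are guesses; the idea is to keep jobs off the flaky node, and to hold and later release jobs that fail so they rerun elsewhere rather than leaving the queue:

## Keep jobs off the flaky execute node (hostname is made up):
Requirements = Arch == "X86_64" && TARGET.OpSys == "LINUX" && TARGET.Machine != "flaky-node.example.com"
## If ngspice exits non-zero (e.g. because its input file was missing), hold the job instead of letting it leave the queue:
on_exit_hold = (ExitBySignal == False) && (ExitCode != 0)
## Release held jobs after 10 minutes so they can match another machine, up to 3 attempts:
periodic_release = (NumJobStarts < 3) && ((time() - EnteredCurrentStatus) > 600)

Would that work on 8.6, or is there a cleaner way to do this?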

--
Kind regards,

Justin Fisher.