Re: [HTCondor-users] Some jobs held on HTcondor 8.0
- Date: Wed, 10 Jul 2013 13:54:44 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Some jobs held on HTcondor 8.0
On 7/10/2013 12:29 PM, Russell Poyner wrote:
We are in the process of upgrading to HTCondor 8.0, and are getting
intermittent errors where condor can't open output files, causing the
jobs to be held.
1. Submit a group of jobs from a single submit file.
2. Some jobs run and others are immediately held.
3. The held jobs report errors like:
007 (602.000.000) 07/10 07:59:46 Shadow exception!
Error from slot1@xxxxxxxxxxxxxxxxxx:
Failed to open ‘/home/user/cond_result/newfast-MR2/output-243.txt’ as
standard output: No such file or directory (errno 2)
Code 7 Subcode 2
I've already checked the relevant permissions, which seem an unlikely
cause since most of the jobs in the batch run fine.
Perhaps a config issue?
Some type of race?
Hi Russell, regards from UW Comp Sci a couple blocks down the street!
Perhaps this shared file system mount on this particular machine
(iris-5.ece) is stale or failed for some reason. You mention that most
jobs in the batch run fine... do some jobs fail and some work *on
the same execute node*? Or is the set of machines that fail to run
jobs disjoint from the set of machines that successfully run jobs? If
so, that is a strong indicator that something is wrong with the mounts
of /home on a subset of your pool.
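One rough way to answer that question is to tabulate job outcomes per execute host. The following is only a minimal sketch in Python: the (host, ok) pair format, the function name classify_hosts, and the host names are all hypothetical, and you would need to fill in the pairs yourself from your job logs.

```python
from collections import defaultdict


def classify_hosts(results):
    """Group execute hosts by the job outcomes observed on them.

    `results` is an iterable of (host, ok) pairs, where `ok` is True if a
    job ran on that host and False if it was held there. This input format
    is an assumption for illustration, not something HTCondor emits.
    """
    outcomes = defaultdict(set)
    for host, ok in results:
        outcomes[host].add(ok)

    # Hosts where every observed job was held, every job ran, or both occurred.
    only_fail = sorted(h for h, seen in outcomes.items() if seen == {False})
    only_ok = sorted(h for h, seen in outcomes.items() if seen == {True})
    mixed = sorted(h for h, seen in outcomes.items() if seen == {True, False})
    return only_fail, only_ok, mixed


if __name__ == "__main__":
    # Hypothetical observations; replace with data from your own pool.
    sample = [
        ("node-a", True),
        ("node-b", False),
        ("node-b", False),
        ("node-c", True),
        ("node-c", False),
    ]
    only_fail, only_ok, mixed = classify_hosts(sample)
    print("always held:", only_fail)
    print("always ran:", only_ok)
    print("mixed:", mixed)
```

If the "mixed" list comes back empty while "always held" does not, the failing and working machines are disjoint, which points at the mounts on the failing hosts. A non-empty "mixed" list suggests something other than a permanently broken mount.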
Or perhaps not all the machines in your pool mount the same file servers
at the same place. If this is the case, then the admins should set
HTCondor's FILESYSTEM_DOMAIN knob correctly so HTCondor knows which
machines share which sets of file system mounts.
If none of the above pans out, then a couple other quick questions:
1. Is /home/user/... mounted via autofs or some other type of automounter?
2. Do you specify MOUNT_UNDER_SCRATCH in your condor_config file(s) on
your execute machine? Specifically, on iris-5.ece.wisc.edu, what does
the following command return:
I ask the above two questions to see if the issue is related to a
potential regression introduced in HTCondor v7.9.5 - see
Hope the above helps,