
Re: [HTCondor-users] Some jobs held on HTcondor 8.0

On 7/10/2013 12:29 PM, Russell Poyner wrote:
We are in the process of upgrading to HTCondor 8.0, and getting
intermittent errors where condor can't open output files, causing the
jobs to be held.

The scenario:

1. Submit a group of jobs from a single submit file.
2. Some jobs run and others are immediately held.
3. The held jobs report errors like:

007 (602.000.000) 07/10 07:59:46 Shadow exception!
Error from slot1@xxxxxxxxxxxxxxxxxx:
Failed to open ‘/home/user/cond_result/newfast-MR2/output-243.txt’ as
standard output: No such file or directory (errno 2)
Code 7 Subcode 2

I've already checked the relevant permissions, which seems an unlikely
source since most of the jobs in the batch run fine.

Perhaps a config issue?
Some type of race?


Hi Russell, regards from UW Comp Sci a couple blocks down the street!

Perhaps the shared file system mount on this particular machine (iris-5.ece) is stale or failed for some reason. You mention that most jobs in the batch run fine... do some jobs fail and some work *on the same execute node*? Or is the set of machines that fail to run jobs disjoint from the set of machines that successfully run jobs? If the two sets are disjoint, that is a strong indicator that something is wrong with the mounts of /home on a subset of your pool.
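
One rough way to check whether the failures cluster on particular execute nodes is to tally the "Error from slot@host:" lines in the job's user log. The following is only a sketch: the function name failing_hosts is made up for illustration, the hostnames in the sample are hypothetical, and the regular expression assumes the log layout matches the excerpt quoted above.

```python
import re
from collections import Counter

# Matches the "Error from slot1@host:" line that follows a shadow exception
# in the user log. The layout is an assumption based on the excerpt quoted
# in this thread; adjust the pattern if your log lines differ.
ERROR_FROM = re.compile(r"Error from (?:\S+@)?(\S+?):")


def failing_hosts(log_text):
    """Return a Counter mapping execute host -> number of 'Error from' lines."""
    return Counter(m.group(1) for m in ERROR_FROM.finditer(log_text))


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the file named by "log" in the
    # submit description instead.
    sample = (
        "007 (602.000.000) 07/10 07:59:46 Shadow exception!\n"
        "Error from slot1@iris-5.example.edu: Failed to open ...\n"
        "007 (603.000.000) 07/10 07:59:47 Shadow exception!\n"
        "Error from slot2@iris-5.example.edu: Failed to open ...\n"
    )
    for host, count in failing_hosts(sample).most_common():
        print(host, count)
```

If every error comes from hosts that never ran a job from the batch successfully, that points toward a mount problem on those hosts rather than a race.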

Or perhaps not all the machines in your pool mount the same file servers at the same place. If this is the case, then the admins should set HTCondor's FILESYSTEM_DOMAIN knob correctly so HTCondor knows which machines share which sets of file system mounts.

If none of the above pans out, then a couple other quick questions:

1. Is /home/user/... mounted via autofs or some other type of automounter?

2. Do you specify MOUNT_UNDER_SCRATCH in the condor_config file(s) on your execute machines? Specifically, on iris-5.ece.wisc.edu, what does the following command return:
  condor_config_val MOUNT_UNDER_SCRATCH

I ask the above two questions to see if the issue is related to a potential regression introduced in HTCondor v7.9.5 - see

Hope the above helps,