Re: [HTCondor-users] Some jobs held on HTcondor 8.0
- Date: Thu, 11 Jul 2013 11:38:20 -0500
- From: Russell Poyner <rpoyner@xxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Some jobs held on HTcondor 8.0
On 07/10/13 13:54, Todd Tannenbaum wrote:
On 7/10/2013 12:29 PM, Russell Poyner wrote:
We are in the process of upgrading to HTCondor 8.0, and are getting
intermittent errors where condor can't open output files, causing the
jobs to be held.
1. Submit a group of jobs from a single submit file.
2. Some jobs run and others are immediately held.
3. The held jobs report errors like:
007 (602.000.000) 07/10 07:59:46 Shadow exception!
Error from slot1@xxxxxxxxxxxxxxxxxx:
Failed to open ‘/home/user/cond_result/newfast-MR2/output-243.txt’ as
standard output: No such file or directory (errno 2)
Code 7 Subcode 2
I've already checked the relevant permissions, which seem an unlikely
source, since most of the jobs in the batch run fine.
Perhaps a config issue?
Some type of race?
Hi Russell, regards from UW Comp Sci a couple blocks down the street!
Perhaps this shared file system mount on this particular machine
(iris-5.ece) is stale or failed for some reason. You mention that
most jobs in the batch run fine... do some jobs fail and some work
*on the same execute node* ? Or is the set of machines that fail to
run jobs disjoint from the set of machines that successfully run
jobs? If so, that is a strong indicator that something is wrong with
the mounts of /home on a subset of your pool.
Or perhaps not all the machines in your pool mount the same file
servers at the same place. If this is the case, then the admins
should set HTCondor's FILESYSTEM_DOMAIN knob correctly so HTCondor
knows which machines share which sets of file system mounts.
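For illustration, FILESYSTEM_DOMAIN is just a matching label: every machine that mounts the same file servers at the same paths should carry the same string. A minimal sketch (the domain name below is a made-up placeholder, not anything from this pool's config):

```
# condor_config fragment on every machine sharing the same /home mounts.
# The value is an arbitrary tag; it only needs to be identical across
# machines with identical mounts (this name is an illustrative example).
FILESYSTEM_DOMAIN = ece-home.wisc.edu
```

Running `condor_config_val FILESYSTEM_DOMAIN` on each node should then return the same string on machines that really do share mounts.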
If none of the above pans out, then a couple other quick questions:
1. Is /home/user/... mounted via autofs or some other type of automounter?
2. Do you specify MOUNT_UNDER_SCRATCH in your condor_config file(s) on
your execute machine? Specifically, on iris-5.ece.wisc.edu, what does
the following command return: condor_config_val MOUNT_UNDER_SCRATCH
I ask the above two questions to see if the issue is related to a
potential regression introduced in HTCondor v7.9.5 - see
Hope the above helps,
From earlier testing I think some jobs would be held and others succeed
on the same execute node. The current collection of machines should be
nearly identical in that they all implement the same policy from the
CMS. Specifically, the hard mounts are the same, and the maps for autofs
are the same.
autofs mount latency might be an issue since /home/user resides on a SAN
that sometimes has latency issues. However, that didn't seem to be a
problem when this same group of machines was running Condor 7.4.4, which
was our previous version.
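One crude way to check whether autofs latency is actually in play would be to poll the path from an execute node and see how long it takes to appear. A minimal sketch; the function name, retry count, and sleep interval are all arbitrary assumptions, not anything HTCondor provides:

```shell
# Sketch: poll until a directory becomes visible, as a probe for
# autofs mount latency on an execute node. Retry count and sleep
# interval are illustrative choices.
wait_for_dir() {
    dir="$1"
    tries="${2:-5}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # the -d test triggers the automount lookup, and succeeds
        # once the mount is actually in place
        [ -d "$dir" ] && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example: probe a path that should already be mounted
wait_for_dir /tmp && echo "visible: /tmp"
```

If a run of this against /home/user/cond_result on iris-5 only succeeds after a retry or two, that would point at the SAN-backed automount rather than HTCondor itself.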
@iris-5:~$ condor_config_val MOUNT_UNDER_SCRATCH
Not defined: MOUNT_UNDER_SCRATCH
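For comparison, if MOUNT_UNDER_SCRATCH were set, it would look something like the following in condor_config (the directory list here is just the commonly cited example, not this pool's configuration):

```
# Remap these directories to job-private copies under the job's
# scratch directory. Since condor_config_val reports it as not
# defined on iris-5, this knob appears ruled out here.
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp
```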
Just FYI this is a small cluster I manage for a group inside ECE. It's
only CAE's problem to the extent that I whine to them about it ;-)