
Re: [HTCondor-users] Some jobs held on HTcondor 8.0



On 07/10/13 13:54, Todd Tannenbaum wrote:
On 7/10/2013 12:29 PM, Russell Poyner wrote:
We are in the process of upgrading to HTCondor 8.0 and are getting
intermittent errors where condor can't open output files, causing the
jobs to be held.

The scenario:

1. Submit a group of jobs from a single submit file (roughly sketched below).
2. Some jobs run and others are immediately held.
3. The held jobs report errors like:

007 (602.000.000) 07/10 07:59:46 Shadow exception!
Error from slot1@xxxxxxxxxxxxxxxxxx:
Failed to open '/home/user/cond_result/newfast-MR2/output-243.txt' as
standard output: No such file or directory (errno 2)
Code 7 Subcode 2
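
For reference, the submit file is shaped roughly like this (the executable
and file names below are placeholders rather than the real ones; only the
output path pattern matches the failing path above):

  universe   = vanilla
  # hypothetical executable name
  executable = run_sim
  arguments  = $(Process)
  # one numbered output file per queued job
  output = /home/user/cond_result/newfast-MR2/output-$(Process).txt
  error  = /home/user/cond_result/newfast-MR2/error-$(Process).txt
  log    = /home/user/cond_result/newfast-MR2/jobs.log
  queue 500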

I've already checked the relevant permissions, which seem an unlikely
source since most of the jobs in the batch run fine.

Perhaps a config issue?
Some type of race?

Thanks
RP


Hi Russell, regards from UW Comp Sci a couple blocks down the street!

Perhaps the shared file system mount on this particular machine (iris-5.ece) is stale or has failed for some reason. You mention that most jobs in the batch run fine... do some jobs fail and some work *on the same execute node*? Or is the set of machines that fail to run jobs disjoint from the set of machines that successfully run jobs? If the sets are disjoint, that is a strong indicator that something is wrong with the mounts of /home on a subset of your pool.

Or perhaps not all the machines in your pool mount the same file servers at the same place. If this is the case, then the admins should set HTCondor's FILESYSTEM_DOMAIN knob correctly so HTCondor knows which machines share which sets of file system mounts.
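
For example, every machine that really does mount the same file servers at the same paths could advertise an identical (and otherwise arbitrary) string in its condor_config:

  # any identical string works on machines sharing the same mounts;
  # this particular value is just an example
  FILESYSTEM_DOMAIN = ece.wisc.edu

Machines with a different set of mounts should get a different value. The default is each machine's own full hostname, which tells HTCondor to assume nothing is shared.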

If none of the above pans out, then a couple other quick questions:

1. Is /home/user/... mounted via autofs or some other type of automounter?

2. Do you specify MOUNT_UNDER_SCRATCH in your condor_config file(s) on your execute machine? Specifically, on iris-5.ece.wisc.edu, what does the following command return:
  condor_config_val MOUNT_UNDER_SCRATCH

I ask the above two questions to see if the issue is related to a potential regression introduced in HTCondor v7.9.5 - see
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3505

Hope the above helps,
Todd



Todd,

From earlier testing I think some jobs would be held and others succeed on the same execute node. The current collection of machines should be nearly identical in that they all implement the same policy from the CMS. Specifically, the hard mounts are the same, and the autofs maps are the same.

Autofs mount latency might be an issue, since /home/user resides on a SAN that sometimes has latency issues. However, that didn't seem to be a problem when this same group of machines was running Condor 7.4.4, our previous version.
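
A crude way to check would be timing the first access from an execute node after autofs has expired the mount, something like:

  time ls /home/user/cond_result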

No MOUNT_UNDER_SCRATCH:

@iris-5:~$ condor_config_val MOUNT_UNDER_SCRATCH
Not defined: MOUNT_UNDER_SCRATCH
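
(For comparison, where that knob is in use it is typically a comma-separated list of directories, e.g.:

  MOUNT_UNDER_SCRATCH = /tmp,/var/tmp

so that jobs see private copies of those paths under the scratch directory.)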

Just FYI, this is a small cluster I manage for a group inside ECE. It's only CAE's problem to the extent that I whine to them about it ;-)

Russ