
Re: [HTCondor-users] Some jobs held on HTcondor 8.0



On 07/10/13 13:54, Todd Tannenbaum wrote:
On 7/10/2013 12:29 PM, Russell Poyner wrote:
We are in the process of upgrading to HTCondor 8.0 and are getting
intermittent errors where condor can't open output files, causing the
jobs to be held.

The scenario:

1. Submit a group of jobs from a single submit file (roughly sketched below).
2. Some jobs run and others are immediately held.
3. The held jobs report errors like:

007 (602.000.000) 07/10 07:59:46 Shadow exception!
Error from slot1@xxxxxxxxxxxxxxxxxx:
Failed to open '/home/user/cond_result/newfast-MR2/output-243.txt' as
standard output: No such file or directory (errno 2)
Code 7 Subcode 2
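
For reference, the submit file is shaped roughly like this (the executable
and file names below are placeholders rather than the real ones; only the
output path pattern matches the failing path above):

  universe   = vanilla
  # hypothetical executable name
  executable = run_sim
  arguments  = $(Process)
  # one numbered output file per queued job
  output = /home/user/cond_result/newfast-MR2/output-$(Process).txt
  error  = /home/user/cond_result/newfast-MR2/error-$(Process).txt
  log    = /home/user/cond_result/newfast-MR2/jobs.log
  queue 500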

I've already checked the relevant permissions, which seem an unlikely
source since most of the jobs in the batch run fine.

Perhaps a config issue?
Some type of race?

Thanks
RP


Hi Russell, regards from UW Comp Sci a couple blocks down the street!

Perhaps the shared file system mount on this particular machine (iris-5.ece) is stale or has failed for some reason. You mention that most jobs in the batch run fine... do some jobs fail and some work *on the same execute node*? Or is the set of machines that fail to run jobs disjoint from the set of machines that successfully run jobs? If the sets are disjoint, that is a strong indicator that something is wrong with the mounts of /home on a subset of your pool.

Or perhaps not all the machines in your pool mount the same file servers at the same place. If this is the case, then the admins should set HTCondor's FILESYSTEM_DOMAIN knob correctly so HTCondor knows which machines share which sets of file system mounts.
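
For example, every machine that really does mount the same file servers at the same paths could advertise an identical (and otherwise arbitrary) string in its condor_config:

  # any identical string works on machines sharing the same mounts;
  # this particular value is just an example
  FILESYSTEM_DOMAIN = ece.wisc.edu

Machines with a different set of mounts should get a different value. The default is each machine's own full hostname, which tells HTCondor to assume nothing is shared.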

If none of the above pans out, then a couple other quick questions:

1. Is /home/user/... mounted via autofs or some other type of automounter?

2. Do you specify MOUNT_UNDER_SCRATCH in your condor_config file(s) on your execute machine? Specifically, on iris-5.ece.wisc.edu, what does the following command return:
  condor_config_val MOUNT_UNDER_SCRATCH

I ask the above two questions to see if the issue is related to a potential regression introduced in HTCondor v7.9.5 - see
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3505

Hope the above helps,
Todd



Todd,

From earlier testing I think some jobs would be held and others succeed on the same execute node. The current collection of machines should be nearly identical in that they all implement the same policy from the CMS. Specifically, the hard mounts are the same, and the autofs maps are the same.

Autofs mount latency might be an issue, since /home/user resides on a SAN that sometimes has latency issues. However, that didn't seem to be a problem when this same group of machines was running Condor 7.4.4, our previous version.
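
A crude way to check would be timing the first access from an execute node after autofs has expired the mount, something like:

  time ls /home/user/cond_result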

No MOUNT_UNDER_SCRATCH:

@iris-5:~$ condor_config_val MOUNT_UNDER_SCRATCH
Not defined: MOUNT_UNDER_SCRATCH
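
(For comparison, where that knob is in use it is typically a comma-separated list of directories, e.g.:

  MOUNT_UNDER_SCRATCH = /tmp,/var/tmp

so that jobs see private copies of those paths under the scratch directory.)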

Just FYI, this is a small cluster I manage for a group inside ECE. It's only CAE's problem to the extent that I whine to them about it ;-)

Russ