Re: [Condor-users] Mysterious Unclaimed resources in cluster
- Date: Wed, 11 Jan 2012 18:13:25 +0000
- From: Eric Abel <Eric.Abel@xxxxxxxxxx>
- Subject: Re: [Condor-users] Mysterious Unclaimed resources in cluster
Thanks for your response. In response to your first question, the execute machines are using UWCS_START, UWCS_SUSPEND, UWCS_PREEMPT, and UWCS_KILL.
The current run has an average job length of about 2 hours.
Since posting this morning, I have learned that I can utilize the entire pool by submitting from an additional machine. The first submit machine still has 85 jobs running, the second one has the rest, and there are no longer any unclaimed resources in the pool.
Thanks again for your response.
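
For anyone following the thread, the short-job question in the quoted message below can be put in rough numbers. This is only a back-of-the-envelope sketch, not anything Condor itself computes: it assumes the submit machine handles job turnover one job at a time at a fixed cost per job, and the 1-second overhead is a made-up illustrative figure. Under that assumption, Little's law gives the number of slots one submit machine can keep busy as roughly job runtime divided by per-job overhead.

```python
def max_busy_slots(job_runtime_s: float, per_job_overhead_s: float) -> float:
    """Rough upper bound on slots one submit machine can keep busy.

    Little's law: busy slots = job start rate * job runtime, where the
    start rate is assumed to be one job per `per_job_overhead_s` seconds.
    Both parameters are illustrative; neither is a Condor setting.
    """
    if job_runtime_s <= 0 or per_job_overhead_s <= 0:
        raise ValueError("runtime and overhead must be positive")
    start_rate = 1.0 / per_job_overhead_s  # jobs started per second
    return start_rate * job_runtime_s


if __name__ == "__main__":
    overhead_s = 1.0  # hypothetical per-job cost on the submit machine
    for label, runtime_s in [("5-second jobs", 5.0), ("2-hour jobs", 2 * 3600.0)]:
        bound = max_busy_slots(runtime_s, overhead_s)
        print(f"{label}: about {bound:.0f} slots")
```

With jobs averaging about 2 hours, this kind of turnover cost would have to be implausibly large to cap the pool near 85 slots, which suggests the limit on the first submit machine comes from somewhere else.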
From: Todd Tannenbaum [mailto:tannenba@xxxxxxxxxxx]
Sent: Wednesday, January 11, 2012 10:08 AM
To: Condor-Users Mail List
Cc: Eric Abel
Subject: Re: [Condor-users] Mysterious Unclaimed resources in cluster
On 1/11/2012 9:55 AM, Eric Abel wrote:
> I have a follow up to this issue. After additional troubleshooting, I've
> discovered that the unclaimed resources move from one machine to
> another, so I can rule out any Class Ad incompatibility. One thing I
> have noticed, is that the maximum number of Claimed resources seems to
> be about 85-88. Modifying the MaxJobsRunning variable doesn't help.
> Setting it to 50 limits the number of running jobs to 50, but no matter
> how high I set it, the number of running jobs is still about 85.
> Currently I am the only user on the pool, and I am also the
> administrator. If anyone has any log file, command, or utility I can use
> to try to identify the problem, I would much appreciate it.
Some quick thoughts -
What are your START, SUSPEND, PREEMPT, and KILL expressions on your
execute machines? (or just post your condor_config file) Perhaps your
policy expressions are limiting slot usage due to something like load
average or keyboard activity.
How long do your jobs typically run? If many jobs are very short (we
are talking on the order of a couple seconds), perhaps your submit
machine is unable to keep more than ~85 machines busy at a time.
Whenever a job completes, the submit machine needs to do some work -
pick the next job, spawn (or reuse) a new shadow process, update
classads on disk, etc. The shorter your jobs, the more work the submit
machine needs to do, and therefore the more likely it may be a
bottleneck. Another related thought - are you also running jobs on your
submit machine? If so, perhaps the submit machine is so busy running
jobs that the schedd is not getting enough CPU/disk/ram/whatever to
juggle more than ~80 machines at a time, esp if the jobs only run a