
Re: [Condor-users] Mysterious Unclaimed resources in cluster

Hi Todd, 

Thanks for your reply.  To answer your first question, the execute machines are using UWCS_START, UWCS_SUSPEND, UWCS_PREEMPT, and UWCS_KILL.

In the current run I have an average job length of about 2 hours.

Since posting this morning, I have learned that I can utilize the entire pool by submitting from an additional machine: the first submit machine still has about 85 jobs running, the second one has the rest, and there are no longer any unclaimed resources in the pool.

Thanks again for your response.


-----Original Message-----
From: Todd Tannenbaum [mailto:tannenba@xxxxxxxxxxx] 
Sent: Wednesday, January 11, 2012 10:08 AM
To: Condor-Users Mail List
Cc: Eric Abel
Subject: Re: [Condor-users] Mysterious Unclaimed resources in cluster

On 1/11/2012 9:55 AM, Eric Abel wrote:
> I have a follow up to this issue. After additional troubleshooting, I've
> discovered that the unclaimed resources move from one machine to
> another, so I can rule out any Class Ad incompatibility. One thing I
> have noticed, is that the maximum number of Claimed resources seems to
> be about 85-88. Modifying the MaxJobsRunning variable doesn't help.
> Setting it to 50 limits the number of running jobs to 50, but no matter
> how high I set it, the number of running jobs is still about 85.
> Currently I am the only user on the pool, and I am also the
> administrator. If anyone has any log file, command, or utility I can use
> to try to identify the problem, I would much appreciate it.
> Thanks,
> Eric

Some quick thoughts -

What are your START, SUSPEND, PREEMPT, and KILL expressions on your 
execute machines?  (Or just post your condor_config file.)  Perhaps your 
policy expressions are limiting slot usage due to something like load 
average or keyboard activity.
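To illustrate the kind of policy that can do this: the UWCS-style defaults key off the non-Condor load average and keyboard idle time. The sketch below is paraphrased from memory of a stock condor_config -- the exact macro names, thresholds, and values may differ in your version, so check your own file rather than copying this:

    # Sketch of UWCS-style policy macros (paraphrased; verify against your condor_config)
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    BackgroundLoad   = 0.3
    HighLoad         = 0.5
    CPUIdle          = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
    CPUBusy          = ($(NonCondorLoadAvg) >= $(HighLoad))
    KeyboardBusy     = (KeyboardIdle < 300)
    MachineBusy      = ($(CPUBusy) || $(KeyboardBusy))

    # Jobs only start on an idle machine, and get suspended when it becomes busy
    START   = $(CPUIdle) && KeyboardIdle > 300
    SUSPEND = $(MachineBusy)

With a policy of this shape, slots on machines with background load or console activity will sit Unclaimed even though jobs are idle in the queue.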

How long do your jobs typically run?  If many jobs are very short (we 
are talking on the order of a couple seconds), perhaps your submit 
machine is unable to keep more than ~85 machines busy at a time. 
Whenever a job completes, the submit machine needs to do some work - 
pick the next job, spawn (or reuse) a shadow process, update 
classads on disk, etc.  The shorter your jobs, the more work the submit 
machine needs to do, and therefore the more likely it may be a 
bottleneck.  Another related thought - are you also running jobs on your 
submit machine?  If so, perhaps the submit machine is so busy running 
jobs that the schedd is not getting enough CPU/disk/ram/whatever to 
juggle more than ~80 machines at a time, especially if the jobs only run 
for a short time...
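If the schedd does turn out to be the bottleneck, a few submit-side knobs in condor_config control how many jobs it runs and how fast it starts them. The names below are real configuration variables, but the values are only illustrative -- check the manual for your version's defaults:

    # Submit-machine knobs affecting job-start throughput (illustrative values)
    MAX_JOBS_RUNNING = 200    # upper bound on simultaneously running shadows
    JOB_START_COUNT  = 5      # jobs the schedd starts per JOB_START_DELAY interval
    JOB_START_DELAY  = 2      # seconds between batches of job starts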