Re: [Condor-users] Mysterious Unclaimed resources in cluster



On 1/11/2012 9:55 AM, Eric Abel wrote:
I have a follow-up to this issue. After additional troubleshooting, I’ve
discovered that the unclaimed resources move from one machine to
another, so I can rule out any ClassAd incompatibility. One thing I
have noticed is that the maximum number of Claimed resources seems to
be about 85-88. Modifying the MaxJobsRunning variable doesn’t help.
Setting it to 50 limits the number of running jobs to 50, but no matter
how high I set it, the number of running jobs is still about 85.
Currently I am the only user on the pool, and I am also the
administrator. If anyone has any log file, command, or utility I can use
to try to identify the problem, I would much appreciate it.

Thanks,

Eric

Some quick thoughts -

What are the START, SUSPEND, PREEMPT, and KILL expressions on your execute machines? (Or just post your condor_config file.) Perhaps your policy expressions are limiting slot usage because of something like load average or keyboard activity.
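
If it helps, here is a minimal sketch of one way to dump those four expressions on an execute machine. It assumes `condor_config_val` is on the PATH of the machine where the script runs; the helper names `build_command` and `show_policy` are hypothetical, made up for this illustration.

```python
import subprocess

# The four policy expressions asked about above.
POLICY_KNOBS = ("START", "SUSPEND", "PREEMPT", "KILL")


def build_command(knob):
    """Return the argv for looking up one configuration variable."""
    return ["condor_config_val", knob]


def show_policy(knobs=POLICY_KNOBS):
    """Return a dict mapping each knob to whatever condor_config_val reports.

    Assumption: the tool is installed locally. If it is missing, the value
    is a short note instead of raising an exception.
    """
    results = {}
    for knob in knobs:
        try:
            proc = subprocess.run(
                build_command(knob),
                capture_output=True,
                text=True,
                check=False,  # an undefined knob should not abort the loop
            )
            results[knob] = (proc.stdout or proc.stderr).strip()
        except FileNotFoundError:
            results[knob] = "condor_config_val not found on PATH"
    return results
```

Calling `show_policy()` on each execute machine and comparing the results should show whether any of them carries a different policy.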

How long do your jobs typically run? If many jobs are very short (on the order of a couple of seconds), perhaps your submit machine is unable to keep more than ~85 machines busy at a time. Whenever a job completes, the submit machine needs to do some work: pick the next job, spawn (or reuse) a shadow process, update ClassAds on disk, etc. The shorter your jobs, the more often the submit machine has to do this work, and the more likely it is to be the bottleneck. A related thought: are you also running jobs on your submit machine? If so, perhaps the submit machine is so busy running jobs that the schedd is not getting enough CPU, disk, or RAM to juggle more than ~80 machines at a time, especially if the jobs only run a short time.
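
To put rough numbers on the short-job argument, here is a back-of-the-envelope sketch using Little's law (busy slots = job start rate x job runtime). It assumes a steady state in which the submit machine's job start rate is the only bottleneck. The runtimes below are hypothetical examples, not measurements from your pool, and the function names are made up for this illustration.

```python
def max_busy_slots(starts_per_second, job_runtime_seconds):
    """Little's law: average number of jobs running in steady state."""
    return starts_per_second * job_runtime_seconds


def implied_start_rate(busy_slots, job_runtime_seconds):
    """Job starts per second needed to keep `busy_slots` slots occupied."""
    if job_runtime_seconds <= 0:
        raise ValueError("job runtime must be positive")
    return busy_slots / job_runtime_seconds


# Hypothetical runtimes: how fast must the submit machine start jobs
# to keep about 85 slots claimed?
for runtime in (5, 60, 3600):
    rate = implied_start_rate(85, runtime)
    print(f"runtime {runtime:>5} s -> {rate:.3f} job starts per second")
```

If the jobs really do finish in a few seconds, the required start rate is high enough that a ceiling near 85 claimed slots would be plausible. With jobs that run for minutes or hours, the required rate is tiny and the cause is more likely elsewhere.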

regards
Todd