Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Mysterious Unclaimed resources in cluster

Date: Wed, 11 Jan 2012 12:08:29 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] Mysterious Unclaimed resources in cluster

On 1/11/2012 9:55 AM, Eric Abel wrote:

I have a follow-up to this issue. After additional troubleshooting, I’ve
discovered that the unclaimed resources move from one machine to
another, so I can rule out any ClassAd incompatibility. One thing I
have noticed is that the maximum number of Claimed resources seems to
be about 85-88. Modifying the MaxJobsRunning variable doesn’t help.
Setting it to 50 limits the number of running jobs to 50, but no matter
how high I set it, the number of running jobs is still about 85.
Currently I am the only user on the pool, and I am also the
administrator. If anyone has any log file, command, or utility I can use
to try to identify the problem, I would much appreciate it.

Thanks,

Eric


Some quick thoughts -

What are your START, SUSPEND, PREEMPT, and KILL expressions on your execute machines? (Or just post your condor_config file.) Perhaps your policy expressions are limiting slot usage due to something like load average or keyboard activity.
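
One way to collect those four expressions is to ask condor_config_val for each in turn. The following is only a sketch: the helper names build_command and show_policy are made up for illustration, the host name in the comment is hypothetical, and it assumes condor_config_val is on the PATH of the machine where it runs. Passing a host adds -name and -startd so that the running startd on that machine is queried rather than the local configuration files.

```python
import shutil
import subprocess

# The four policy expressions asked about above.
POLICY_KNOBS = ("START", "SUSPEND", "PREEMPT", "KILL")


def build_command(knob, host=None):
    """Build the condor_config_val command line for one configuration knob.

    With host=None the local configuration is read. With a host name, the
    startd on that machine is queried (assumes remote queries are permitted).
    """
    cmd = ["condor_config_val"]
    if host is not None:
        cmd += ["-name", host, "-startd"]
    cmd.append(knob)
    return cmd


def show_policy(host=None):
    """Return {knob: output text} for each policy knob, or None if the tool is missing."""
    if shutil.which("condor_config_val") is None:
        return None
    values = {}
    for knob in POLICY_KNOBS:
        result = subprocess.run(build_command(knob, host),
                                capture_output=True, text=True)
        # Keep stdout if there is any, otherwise whatever was reported on stderr.
        values[knob] = (result.stdout or result.stderr).strip()
    return values


if __name__ == "__main__":
    policy = show_policy()  # e.g. show_policy("node01.example.org") for a remote startd
    if policy is None:
        print("condor_config_val not found on PATH")
    else:
        for knob, value in policy.items():
            print(f"{knob} = {value}")
```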

How long do your jobs typically run? If many jobs are very short (we are talking on the order of a couple of seconds), perhaps your submit machine is unable to keep more than ~85 machines busy at a time. Whenever a job completes, the submit machine needs to do some work - pick the next job, spawn (or reuse) a new shadow process, update classads on disk, etc. The shorter your jobs, the more work the submit machine needs to do, and therefore the more likely it may be a bottleneck.

Another related thought - are you also running jobs on your submit machine? If so, perhaps the submit machine is so busy running jobs that the schedd is not getting enough CPU/disk/RAM/whatever to juggle more than ~80 machines at a time, especially if the jobs only run a short time...
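
To make the short-job argument concrete, here is a back-of-the-envelope sketch using Little's law: the number of jobs in flight is roughly the rate at which the submit machine can start jobs multiplied by how long each job runs. The function names and every number below are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from any real pool.

```python
def max_busy_slots(starts_per_second, job_seconds, total_slots):
    """Estimate how many slots one submit machine can keep busy.

    Little's law: jobs in flight = start rate * job duration,
    capped by the number of slots that exist in the pool.
    """
    return min(float(total_slots), starts_per_second * job_seconds)


def implied_start_rate(busy_slots, job_seconds):
    """Invert the estimate: the start rate needed to sustain busy_slots."""
    return busy_slots / job_seconds


if __name__ == "__main__":
    total_slots = 200  # hypothetical pool size
    for job_seconds in (5, 40, 600):  # hypothetical job lengths
        busy = max_busy_slots(2.0, job_seconds, total_slots)  # assume 2 job starts/sec
        print(f"{job_seconds:>4} s jobs -> about {busy:.0f} of {total_slots} slots busy")
```

Under this model a plateau that does not move when the running-job limit is raised points at the start rate rather than at the limit. Dividing the observed plateau by the typical job length gives a rough figure for how many job starts per second the submit machine is sustaining.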


regards
Todd

