Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Setup Advice Needed

Date: Thu, 18 Nov 2010 10:47:07 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Setup Advice Needed



On 11/18/10 6:47 AM, Matthew Farrellee wrote:

On 11/12/2010 11:02 AM, Craig A. Struble, Ph.D. wrote:

At Marquette, our Condor pools have been growing and we seem to be at
a tipping point in terms of performance. We have recently configured
the job router on our primary cluster to route jobs to our other
pools across campus using Condor-C (flocking isn't really an option),
giving us over 1600 available slots.

Our current Condor 7.4.4 setup has the collector, negotiator, job
router and schedd all running on the head node (an 8 core machine
with 24 GB of RAM, 2 x 1 Gb/s networks, 1 x 20 Gb/s InfiniBand). When we
launch a few thousand jobs capable of being routed, the system is
fine for a while, but eventually the schedd becomes unresponsive and
the overall head node load skyrockets due to the number of running
shadow daemons.

Should we consider partitioning our Condor daemons onto different
nodes? What partitioning works best? Would a second schedd, to handle
the routed jobs, be helpful? What have others done and what seems to
work well?

Thanks.

Craig -- Craig A. Struble, Ph.D. | Marquette University Associate
Professor of Computer Science | 369 Cudahy Hall (414)288-3783 |
(414)288-5472 (fax) http://www.mscs.mu.edu/~cstruble |
craig.struble@xxxxxxxxxxxxx

That hardware should be able to handle the number of slots you're talking about. The only question may be how the job router is performing.

You could try running 7.5; it has many perf & scale improvements. If you can't do that, one of the simplest things you can do is "SHADOW_LOCK=". Do you really need to be super certain that the ShadowLog is consistent? It's growing/rotating fast and most important errors get reflected in the SchedLog.


Best,


matt


Hi Craig,

It would be good to confirm what is actually responsible for high system load. Is the run queue large, or are lots of processes blocked on i/o?
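One way to answer that question on a Linux head node is to read the kernel's own counters. The following is only a rough sketch, not anything Condor provides: it assumes a Linux /proc filesystem, and the function names (parse_proc_stat, parse_loadavg, report) are made up for illustration. It reads the load averages from /proc/loadavg and the procs_running and procs_blocked counters from /proc/stat.

```python
def parse_proc_stat(text):
    """Pull procs_running and procs_blocked out of /proc/stat contents."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("procs_running", "procs_blocked"):
            counts[fields[0]] = int(fields[1])
    return counts


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg contents."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def report(stat_path="/proc/stat", loadavg_path="/proc/loadavg"):
    """Print a one-shot snapshot; the paths are Linux-specific assumptions."""
    with open(stat_path) as f:
        counts = parse_proc_stat(f.read())
    with open(loadavg_path) as f:
        load = parse_loadavg(f.read())
    print("load averages (1/5/15 min):", load)
    print("runnable right now:", counts.get("procs_running"))
    print("blocked on i/o right now:", counts.get("procs_blocked"))
```

Calling report() a few times while the load is high should show which number dominates: a large procs_running points at the run queue, a large procs_blocked points at i/o.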

You mention shadow processes. Those would be for jobs that were _not_ routed. Are you sure the shadow processes are generating a lot of work for the system? I'd be surprised if 1600 vanilla universe shadows created lots of load on the cpu. As Matt mentioned, there have been cases of high system load caused by contention for the shadow log lock. However, I've only seen that at scales of over 30,000 shadows, so it seems unlikely to be the cause of trouble in your case unless you have a very slow disk for the log file or extremely verbose logging.
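To check whether the shadows themselves are the busy processes, one option is to count them by scheduler state. This is a hedged sketch rather than a Condor tool: it assumes Linux, assumes the shadow processes show up under the name condor_shadow, and the helper names are hypothetical. It parses the per-process /proc/<pid>/stat lines, where the third field is the state (R for running, D for uninterruptible sleep, which usually means waiting on i/o, S for sleeping).

```python
import os
from collections import Counter


def parse_pid_stat(line):
    """Return (comm, state) from one /proc/<pid>/stat line."""
    # comm sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return comm, state


def count_states(stat_lines, name="condor_shadow"):
    """Count processes called `name`, grouped by state letter."""
    states = Counter()
    for line in stat_lines:
        comm, state = parse_pid_stat(line)
        if comm == name:
            states[state] += 1
    return states


def read_all_pid_stats(proc="/proc"):
    """Collect the stat line of every process currently visible in /proc."""
    lines = []
    for entry in os.listdir(proc):
        if entry.isdigit():
            try:
                with open(os.path.join(proc, entry, "stat")) as f:
                    lines.append(f.read())
            except OSError:
                continue  # the process exited while we were scanning
    return lines
```

Running count_states(read_all_pid_stats()) on the head node gives a rough picture: mostly S means the shadows are idle, while many in R or D would support the idea that they are contributing to the load.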


--Dan
