
Re: [Condor-users] Setup Advice Needed



I appreciate the feedback; it's helping us diagnose the problems. The cause of the unresponsive system may be user ID lookup failures from the shadows. We're still trying to narrow down the specifics.
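
A basic first check for that (assuming the lookups go through the normal NSS stack, e.g. LDAP, and with "someuser" as a placeholder for a real job owner):

  # a slow or failing lookup here would stall every shadow that needs
  # to resolve the job owner's uid
  time getent passwd someuser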

It's at least good to know that our hardware should be capable of supporting the Condor pool. We can focus our efforts in the right places now.

    Craig

On Nov 18, 2010, at 11:18 AM, Matthew Farrellee wrote:

> On 11/18/2010 11:47 AM, Dan Bradley wrote:
>> 
>> 
>> On 11/18/10 6:47 AM, Matthew Farrellee wrote:
>>> On 11/12/2010 11:02 AM, Craig A. Struble, Ph.D. wrote:
>>>> At Marquette, our Condor pools have been growing and we seem to be at
>>>> a tipping point in terms of performance. We have recently configured
>>>> the job router on our primary cluster to route jobs to our other
>>>> pools across campus using Condor-C (flocking isn't really an option),
>>>> giving us over 1600 available slots.
>>>> 
>>>> Our current Condor 7.4.4 setup has the collector, negotiator, job
>>>> router and schedd all running on the head node (an 8 core machine
>>>> with 24 GB of RAM, 2 x 1 Gb/s networks, 1 x 20 Gb/s InfiniBand). When we
>>>> launch a few thousand jobs capable of being routed, the system is
>>>> fine for a while, but eventually the schedd becomes unresponsive and
>>>> the overall head node load skyrockets due to the number of running
>>>> shadow daemons.
>>>> 
>>>> Should we consider partitioning our Condor daemons onto different
>>>> nodes? What partitioning works best? Would a second schedd, to handle
>>>> the routed jobs, be helpful? What have others done and what seems to
>>>> work well?
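>>>> 
>>>> For concreteness, the kind of split I have in mind (a hypothetical
>>>> sketch, not a tested config):
>>>> 
>>>>   # central manager node
>>>>   DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
>>>>   # separate submit node running the schedd and job router
>>>>   DAEMON_LIST = MASTER, SCHEDD, JOB_ROUTER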
>>>> 
>>>> Thanks.
>>>> 
>>>> Craig
>>>> 
>>>> --
>>>> Craig A. Struble, Ph.D. | Marquette University
>>>> Associate Professor of Computer Science | 369 Cudahy Hall
>>>> (414)288-3783 | (414)288-5472 (fax)
>>>> http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx
>>> 
>>> That hardware should be able to handle the number of slots you're
>>> talking about. The only question may be how the job router is performing.
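>>> 
>>> A quick way to see what the router is actually doing (assuming a
>>> stock install that ships the condor_router_q tool; the router also
>>> writes its own JobRouterLog):
>>> 
>>>   # lists the jobs the job router has claimed and routed
>>>   condor_router_q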
>>> 
>>> You could try running 7.5; it has many performance and scalability
>>> improvements. If you can't do that, one of the simplest things you
>>> can do is set "SHADOW_LOCK =" (an empty value). Do you really need
>>> to be certain that the ShadowLog is consistent? It grows and rotates
>>> quickly, and the most important errors are reflected in the SchedLog
>>> anyway.
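>>> 
>>> For reference, that's literally just the macro set to an empty value
>>> in condor_config on the submit host:
>>> 
>>>   # no lock file => shadows don't serialize on the ShadowLog
>>>   SHADOW_LOCK =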
>>> 
>>> Best,
>>> 
>>> 
>>> matt
>> 
>> Hi Craig,
>> 
>> It would be good to confirm what is actually responsible for the high
>> system load. Is the run queue large, or are lots of processes blocked
>> on I/O?
>> 
>> You mention shadow processes. Those would be for jobs that were _not_
>> routed. Are you sure the shadow processes are generating a lot of work
>> for the system? I'd be surprised if 1600 vanilla universe shadows
>> created much load on the CPU. As Matt mentioned, there have been cases
>> of high system load caused by contention for the shadow log lock.
>> However, I've only seen that at scales of over 30,000 shadows, so it
>> seems unlikely to be the cause of trouble in your case unless you have
>> a very slow disk for the log file or extremely verbose logging.
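>> 
>> One quick way to tell those apart with standard Linux tools (nothing
>> Condor-specific here, just a sketch):
>> 
>>   # "r" = run queue length, "b" = processes blocked on I/O,
>>   # "wa" = percentage of CPU time spent waiting on I/O
>>   vmstat 2
>>   # count the shadows actually running right now
>>   ps -C condor_shadow --no-headers | wc -l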
>> 
>> --Dan
> 
> You're right that the lock issue is very much about the rate of
> logging. You may not hit a painful rate until 30,000 concurrent jobs
> when those jobs run for hours, but you may hit it with 1,600 jobs that
> run for minutes. Except in the case where you truly rely on the
> contents of the ShadowLog (which I'll argue is rare and/or misguided
> 8o), turning off locking is a good step.
> 
> That said, you're right that determining the actual source of the load
> is better than my weighted guess. 8o)
> 
> Best,
> 
> 
> matt

--
Craig A. Struble, Ph.D. | Marquette University
Associate Professor of Computer Science | 369 Cudahy Hall
(414)288-3783 | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx