
Re: [Condor-users] Schedd Overloaded??



On Mon, Jul 11, 2005 at 11:56:04AM -0700, Sean Looper wrote:
> Any idea why this would be the case?  I've used other queue managers in the past that have no trouble with jobs in the tens of thousands.  I will try reducing the debugging.  Any ideas on distributing the schedd load across multiple machines?  This will be a HUGE setback for us adopting Condor if I can't figure out a way to stably handle 10,000+ jobs.  
> 

10K jobs in the queue is certainly possible, but you need to watch out for
a couple of things.

The biggest concern is how long the jobs run for - the schedd has to do
a lot of expensive lock operations when a job completes, so if you've
got a job completing every second that's a lot of load on the schedd.
Job submission is nearly as expensive, so try to batch it up as much as
possible; i.e., submit clusters of 100 or 1000 jobs at a time, instead of
running condor_submit 10000 times. Every cluster shares a copy of the
executable, so Condor only has to spool the executable once. (Also, consider
using copy_to_spool = false.) 10,000 jobs in the queue where each one
runs for an hour is easy; a queue where a job is submitted and completed
every second can start falling behind at around 100 jobs.
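
For example, a single submit file along these lines (the file and job
names here are made up) queues 1000 jobs as one cluster with a single
condor_submit call:

    # submit with: condor_submit my_job.sub
    executable    = my_job
    arguments     = $(Process)
    output        = out.$(Process)
    error         = err.$(Process)
    log           = my_job.log
    copy_to_spool = false
    queue 1000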

As others have pointed out, other things that can be painful are long
negotiation cycles, frequent polling with condor_q (and, worse,
condor_history), and excessive debug levels.
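
On the debug front: if your config sets something like SCHEDD_DEBUG =
D_FULLDEBUG, dropping back to a minimal level and running condor_reconfig
cuts a lot of logging overhead. A sketch (check what your local config
actually sets before changing it):

    # condor_config - keep schedd logging to a minimum
    SCHEDD_DEBUG = D_ALWAYS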

-Erik

> Thanks for the heads up. 
> 
> Sean
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Rusch
> Sent: Monday, July 11, 2005 11:49 AM
> To: 'Condor-Users Mail List'
> Subject: RE: [Condor-users] Schedd Overloaded??
> 
> I can't give an official answer, but I can tell you that we had the same
> problem with 5136 jobs.  In our case, there were a couple of other things that
> contributed, so you could check these, too: high debug level on the schedd
> and a supervising process that used condor_q and condor_history to monitor
> jobs.  Condor_q talks to the schedd, so if you're doing anything like that
> you may want to parse log files instead.
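> 
> For example, if every job gets a user log via its submit file (the path
> here is made up):
> 
>     log = /tmp/myjobs.log
> 
> then the monitor can tail that file and watch for the "Job executing" /
> "Job terminated" event lines instead of hitting the schedd at all.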
> 
> However, even after taking down debug level and using log parsing, our
> schedd still struggled with 5000 jobs in the queue.
> 
> Michael.
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Sean Looper
> Sent: Monday, July 11, 2005 12:37 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] Schedd Overloaded??
> 
> I have a remote schedd with 9000+ jobs. The schedd is continually running
> at 100% CPU. I am hoping to gain some suggestions on how to improve the
> efficiency of the schedd.
> 
> Do I need to split the jobs between schedds on 2 or 3 more machines?
> 
> Would it help significantly to move the negotiator and collector to another
> machine? 
> 
> Are there ways to speed up the schedd so that it does not take as long to
> run through the job queue?
> 
> I am using Condor 6.7.7 with a nearly out-of-the-box config.
> 
> Thanks!
> 
> Sean
> 
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 