
RE: [Condor-users] Schedd Overloaded??



Here is an excerpt from my schedd log.  I'm not sure what I'm seeing, but
the schedd appears to transfer the executable "sub_Teapot_Diffuse_01.bat"
separately for every proc.  Maybe someone can spot something here that can
be optimized.


7/11 15:10:52 perm::init: Found Account Name slooper
7/11 15:10:52 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:52 perm::UserInAce: Checking \Everyone
7/11 15:10:52 entering FileTransfer::DownloadFiles
7/11 15:10:52 entering FileTransfer::Download
7/11 15:10:52 entering FileTransfer::DoDownload sync=1
7/11 15:10:52 get_file(): going to write to filename C:\Condor/spool\cluster219.proc18.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:52 get_file: Receiving 282 bytes
7/11 15:10:52 get_file: wrote 282 bytes to file
7/11 15:10:52 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:52 generalJobFilesWorkerThread(): transfer files for job 219.19
7/11 15:10:52 entering FileTransfer::SimpleInit
7/11 15:10:52 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:52 perm::init: Found Account Name slooper
7/11 15:10:52 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:52 perm::UserInAce: Checking \Everyone
7/11 15:10:52 entering FileTransfer::DownloadFiles
7/11 15:10:52 entering FileTransfer::Download
7/11 15:10:52 entering FileTransfer::DoDownload sync=1
7/11 15:10:52 get_file(): going to write to filename C:\Condor/spool\cluster219.proc19.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:52 get_file: Receiving 282 bytes
7/11 15:10:52 get_file: wrote 282 bytes to file
7/11 15:10:52 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:52 generalJobFilesWorkerThread(): transfer files for job 219.20
7/11 15:10:52 entering FileTransfer::SimpleInit
7/11 15:10:52 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:52 perm::init: Found Account Name slooper
7/11 15:10:52 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:52 perm::UserInAce: Checking \Everyone
7/11 15:10:52 entering FileTransfer::DownloadFiles
7/11 15:10:52 entering FileTransfer::Download
7/11 15:10:52 entering FileTransfer::DoDownload sync=1
7/11 15:10:53 get_file(): going to write to filename C:\Condor/spool\cluster219.proc20.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:53 get_file: Receiving 282 bytes
7/11 15:10:53 get_file: wrote 282 bytes to file
7/11 15:10:53 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:53 generalJobFilesWorkerThread(): transfer files for job 219.21
7/11 15:10:53 entering FileTransfer::SimpleInit
7/11 15:10:53 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:53 perm::init: Found Account Name slooper
7/11 15:10:53 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:53 perm::UserInAce: Checking \Everyone
7/11 15:10:53 entering FileTransfer::DownloadFiles
7/11 15:10:53 entering FileTransfer::Download
7/11 15:10:53 entering FileTransfer::DoDownload sync=1
7/11 15:10:53 get_file(): going to write to filename C:\Condor/spool\cluster219.proc21.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:53 get_file: Receiving 282 bytes
7/11 15:10:53 get_file: wrote 282 bytes to file
7/11 15:10:53 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:53 generalJobFilesWorkerThread(): transfer files for job 219.22
7/11 15:10:53 entering FileTransfer::SimpleInit
7/11 15:10:53 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:53 perm::init: Found Account Name slooper
7/11 15:10:53 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:53 perm::UserInAce: Checking \Everyone
7/11 15:10:53 entering FileTransfer::DownloadFiles
7/11 15:10:53 entering FileTransfer::Download
7/11 15:10:53 entering FileTransfer::DoDownload sync=1
7/11 15:10:53 get_file(): going to write to filename C:\Condor/spool\cluster219.proc22.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:53 get_file: Receiving 282 bytes
7/11 15:10:53 get_file: wrote 282 bytes to file
7/11 15:10:53 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:54 generalJobFilesWorkerThread(): transfer files for job 219.23
7/11 15:10:54 entering FileTransfer::SimpleInit
7/11 15:10:54 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:54 perm::init: Found Account Name slooper
7/11 15:10:54 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:54 perm::UserInAce: Checking \Everyone
7/11 15:10:54 entering FileTransfer::DownloadFiles
7/11 15:10:54 entering FileTransfer::Download
7/11 15:10:54 entering FileTransfer::DoDownload sync=1
7/11 15:10:54 get_file(): going to write to filename C:\Condor/spool\cluster219.proc23.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:54 get_file: Receiving 282 bytes
7/11 15:10:54 get_file: wrote 282 bytes to file
7/11 15:10:54 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:54 generalJobFilesWorkerThread(): transfer files for job 219.24
7/11 15:10:54 entering FileTransfer::SimpleInit
7/11 15:10:54 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:54 perm::init: Found Account Name slooper
7/11 15:10:54 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:54 perm::UserInAce: Checking \Everyone
7/11 15:10:54 entering FileTransfer::DownloadFiles
7/11 15:10:54 entering FileTransfer::Download
7/11 15:10:54 entering FileTransfer::DoDownload sync=1
7/11 15:10:54 get_file(): going to write to filename C:\Condor/spool\cluster219.proc24.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:54 get_file: Receiving 282 bytes
7/11 15:10:54 get_file: wrote 282 bytes to file
7/11 15:10:54 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:54 generalJobFilesWorkerThread(): transfer files for job 219.25
7/11 15:10:54 entering FileTransfer::SimpleInit
7/11 15:10:54 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:54 perm::init: Found Account Name slooper
7/11 15:10:54 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:54 perm::UserInAce: Checking \Everyone
7/11 15:10:54 entering FileTransfer::DownloadFiles
7/11 15:10:54 entering FileTransfer::Download
7/11 15:10:54 entering FileTransfer::DoDownload sync=1
7/11 15:10:55 get_file(): going to write to filename C:\Condor/spool\cluster219.proc25.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:55 get_file: Receiving 282 bytes
7/11 15:10:55 get_file: wrote 282 bytes to file
7/11 15:10:55 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
7/11 15:10:55 generalJobFilesWorkerThread(): transfer files for job 219.26
7/11 15:10:55 entering FileTransfer::SimpleInit
7/11 15:10:55 perm::init() starting up for account (slooper) domain (DOMAIN)
7/11 15:10:55 perm::init: Found Account Name slooper
7/11 15:10:55 Calling Perm::userInAce() for DOMAIN\slooper
7/11 15:10:55 perm::UserInAce: Checking \Everyone
7/11 15:10:55 entering FileTransfer::DownloadFiles
7/11 15:10:55 entering FileTransfer::Download
7/11 15:10:55 entering FileTransfer::DoDownload sync=1
7/11 15:10:55 get_file(): going to write to filename C:\Condor/spool\cluster219.proc26.subproc0.tmp\sub_Teapot_Diffuse_01.bat
7/11 15:10:55 get_file: Receiving 282 bytes
7/11 15:10:55 get_file: wrote 282 bytes to file
7/11 15:10:55 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Monday, July 11, 2005 12:57 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Schedd Overloaded??

On Mon, Jul 11, 2005 at 11:56:04AM -0700, Sean Looper wrote:
> Any idea why this would be the case?  I've used other queue managers
> in the past that have no trouble with jobs in the tens of thousands.  I
> will try reducing the debugging.  Any ideas on distributing the schedd
> load across multiple machines?  This will be a HUGE setback for us
> adopting Condor if I can't figure out a way to stably handle 10,000+
> jobs.
> 

10K jobs in the queue is certainly possible, but you need to watch out
for a couple of things.

The biggest concern is how long the jobs run for - the schedd has to do
a lot of expensive lock operations when a job completes, so if you've
got a job completing every second, that's a lot of load on the schedd.
Job submission is nearly as expensive, so try to batch it up as much as
possible; i.e., submit clusters of 100 or 1000 jobs at a time instead of
running condor_submit 10000 times. Every cluster shares a copy of the
executable, so Condor only has to spool the executable once. (Also,
consider using copy_to_spool = false.) 10,000 jobs in the queue where
each one runs for an hour is easy; a queue where a job is submitted and
completed every second can start falling behind at around 100 jobs.
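
For example (the file and log names here are just made up for
illustration), a submit description along these lines queues a single
cluster of 1000 procs, so per the above the executable only needs to be
spooled once:

    universe      = vanilla
    executable    = sub_Teapot_Diffuse_01.bat
    copy_to_spool = false
    log           = teapot.log
    output        = teapot.$(Process).out
    error         = teapot.$(Process).err
    queue 1000

With copy_to_spool = false, condor_submit also skips copying the
executable into the local spool directory at submit time.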

As others have pointed out, other things that can be painful are long
negotiation cycles, frequent polling with condor_q (and worse,
condor_history), and excessive debug levels.
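
On the debug levels: the excerpt earlier in this thread looks like
D_FULLDEBUG output, so if something like that was turned on while
troubleshooting, dropping the schedd back to the minimum in the local
config and running condor_reconfig should cut the logging overhead
noticeably:

    SCHEDD_DEBUG = D_ALWAYS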

-Erik

> Thanks for the heads up. 
> 
> Sean
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Rusch
> Sent: Monday, July 11, 2005 11:49 AM
> To: 'Condor-Users Mail List'
> Subject: RE: [Condor-users] Schedd Overloaded??
> 
> I can't give an official answer, but I can tell you that we had the same
> problem with 5136 jobs.  In our case, there were a couple of other things
> that contributed, so you could check these, too: a high debug level on the
> schedd, and a supervising process that used condor_q and condor_history to
> monitor jobs.  Condor_q talks to the schedd, so if you're doing anything
> like that you may want to parse log files instead.
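> For instance, if each job's submit file declares a user log (the file name
> below is just an example), completions can be counted from that log without
> querying the schedd at all; event 005 in the user log is "Job terminated":
> 
>     grep -c "^005 " teapot.log
> 
> and condor_wait teapot.log will block until every job writing to that log
> has finished.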
> 
> However, even after turning down the debug level and switching to log
> parsing, our schedd still struggled with 5000 jobs in the queue.
> 
> Michael.
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Sean Looper
> Sent: Monday, July 11, 2005 12:37 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] Schedd Overloaded??
> 
> I have a remote schedd with 9000+ jobs.  The schedd is continually running
> at 100% CPU.  I am hoping to gain some suggestions on how to improve the
> efficiency of the schedd.
> 
> Do I need to split the jobs between schedds on 2 or 3 more machines?
> 
> Would it help significantly to move the negotiator and collector to another
> machine?
> 
> Are there ways to speed up the schedd so that it does not take as long to
> run through the job queue?
> 
> I am using Condor 6.7.7 with a nearly out-of-the-box config.
> 
> Thanks!
> 
> Sean
> 
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users