
Re: [Condor-users] ERROR: Failed to commit job submission into the queue.



Many thanks to Simon and Dan for their responses. I was encountering the
problem with a single cluster of 300,000 jobs.
I guess the real implication is that we should impose an arbitrary limit on
our cluster size.

Thanks,
Dan
------------------------------------------------------
   Dan Scarborough
   Research IT
   Deutsche Bank
   +44 (0)20 754 55914
------------------------------------------------------



                                                                              
From: Dan Bradley <dan@xxxxxxxxxxxx>
Sent by: condor-users-bounces@xxxxxxxxxxx
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 06/08/2007 15:39
Subject: Re: [Condor-users] ERROR: Failed to commit job submission into the queue.
Please respond to: Condor-Users Mail List <condor-users@xxxxisc.edu>

Simon Hammond wrote:

> Have you tried using the latest development series download of Condor?
> It has some improvements for the size of the queue I think. I don't
> know whether it will take 300,000 jobs but we regularly have 10000
> jobs in the queue without much worry.


I have tested Condor 6.9.3 with up to 100,000 jobs in the queue, and it
performed well at that scale.  6.8 will certainly run into scaling
problems if you try to run with that many jobs: no matter how carefully
I tuned configuration options, I was not able to get the 6.8 schedd to
operate at this scale.  At smaller scales (e.g. ~20,000 jobs), you can
tune 6.8 to work a little better:

http://www.cs.wisc.edu/condor/CondorWeek2007/large_condor_pools.html

Since the entire job queue is held in memory, memory is probably the main
limiting factor on queue size in 6.9.3.  I have observed a minimum of
about 10 KB of memory per job in the Condor schedd.  That was with job
clusters of size 1.  With larger job clusters (e.g. queue 10000), you
should be able to decrease the schedd's memory usage per job.
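
As a back-of-the-envelope check on the figure above, the sketch below
multiplies queue size by per-job memory. It assumes "10k" means roughly
10 KiB and treats that as a lower bound (clusters of size 1); the function
names and the default value are illustrative and not part of Condor.

```python
KIB = 1024
GIB = 1024 ** 3

# Assumption: "about 10k/job" is read as 10 KiB per job, a rough lower bound.
ASSUMED_BYTES_PER_JOB = 10 * KIB


def estimate_schedd_memory_bytes(num_jobs, per_job_bytes=ASSUMED_BYTES_PER_JOB):
    """Return a rough lower bound on schedd memory for num_jobs queued jobs."""
    if num_jobs < 0 or per_job_bytes < 0:
        raise ValueError("num_jobs and per_job_bytes must be non-negative")
    return num_jobs * per_job_bytes


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


if __name__ == "__main__":
    # Queue sizes mentioned in this thread.
    for jobs in (10_000, 100_000, 300_000):
        estimate = estimate_schedd_memory_bytes(jobs)
        print(f"{jobs:>7} jobs: at least ~{to_gib(estimate):.2f} GiB")
```

Under these assumptions, 300,000 single-job clusters come to just under
3 GiB for the queue alone, before anything else the schedd needs.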

The schedd also tries to prevent starting up shadows (the processes that
manage running jobs) if it estimates that there is not enough virtual
memory.  If this is happening, you will see a message in the logs like this:

Swap space exhausted! No more jobs can be run!
Solution: get more swap space, or set RESERVED_SWAP = 0
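
To check whether this is happening, something like the following sketch
can scan a log for the message quoted above. Which log file to pass in is
an assumption left to the reader (it depends on your configuration), and
the function name is made up for illustration.

```python
import sys

# The message text is taken verbatim from the log excerpt above.
SWAP_MESSAGE = "Swap space exhausted! No more jobs can be run!"


def find_swap_exhausted(lines):
    """Return (line_number, text) pairs for lines containing the message."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SWAP_MESSAGE in line
    ]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python find_swap_exhausted.py <path-to-log-file>")
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(sys.argv[1], errors="replace") as log:
        hits = find_swap_exhausted(log)
    for number, text in hits:
        print(f"{number}: {text}")
    print(f"{len(hits)} matching line(s)")
```

If it reports matches, the schedd has been declining to start shadows for
the reason described above.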

--Dan Bradley

> On 06/08/07, Dan Scarborough <dan.scarborough@xxxxxx> wrote:
>
>
>     Hello,
>     I have set up a small (4-node) test grid using Condor: 4 Linux (2.6
>     kernel) machines using a shared file system, running Condor 6.8. On
>     Friday, I tested a job in the java universe, which ran a number (~20)
>     of times quite happily. I then ramped up the number of jobs, somewhat
>     optimistically, to 300,000 and left for the day. I've come back in to
>     find the following error and zero output:
>     ERROR: Failed to commit job submission into the queue.
>
>     1) Is there a limit on the job queue length in Condor?
>     2) If so, is this by design, or determined by an installation-specific
>     factor, such as the O/S or available memory?
>     3) Where is this documented? Sorry, but I cannot find it anywhere in
>     the manual or forum history.
>
>     Any help would be much appreciated.
>     Many thanks,
>     Dan
>
>     ---
>
>     This e-mail may contain confidential and/or privileged
>     information. If you are
>     not the intended recipient (or have received this e-mail in error)
>     please
>     notify the sender immediately and delete this e-mail. Any unauthorized
>     copying, disclosure or distribution of the material in this e-mail
>     is strictly
>     forbidden.
>
>     Please refer to http://www.db.com/en/content/eu_disclosures.htm
>     for additional
>     EU corporate and regulatory disclosures.
>
>     _______________________________________________
>     Condor-users mailing list
>     To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
>     <mailto:condor-users-request@xxxxxxxxxxx> with a
>     subject: Unsubscribe
>     You can also unsubscribe by visiting
>     https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>     The archives can be found at:
>     https://lists.cs.wisc.edu/archive/condor-users/
>
>