
Re: [Condor-users] ERROR: Failed to commit job submission into the queue.





Simon Hammond wrote:

Have you tried using the latest development series download of Condor? I think it has some improvements for queue size. I don't know whether it will take 300,000 jobs, but we regularly have 10,000 jobs in the queue without much worry.


I have tested Condor 6.9.3 with up to 100,000 jobs in the queue, and it performed well at that scale. Condor 6.8 will certainly run into scaling problems with that many jobs: no matter how carefully I tuned the configuration options, I was not able to get the 6.8 schedd to operate at this scale. At smaller scales (e.g. ~20,000 jobs), you can tune 6.8 to work a little better:

http://www.cs.wisc.edu/condor/CondorWeek2007/large_condor_pools.html

Since the entire job queue is held in memory, memory is probably the main limiting factor on queue size in 6.9.3. I have observed a minimum of about 10 KB of schedd memory usage per job. That was with job clusters of size 1. With larger job clusters (e.g. a single submission using "queue 10000"), you should be able to decrease the schedd's memory usage per job.
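As a rough illustration of what that per-job figure implies, here is a minimal Python sketch that multiplies it out for a few queue sizes. It assumes "10k" means 10 KiB and treats the result as a lower bound only; the function names are made up for this example and are not part of Condor.

```python
# Rough lower bound on schedd memory needed to hold a job queue.
# Assumption: the ~10k/job figure (seen with clusters of size 1) is 10 KiB.
BYTES_PER_JOB = 10 * 1024


def estimate_queue_memory_bytes(num_jobs, bytes_per_job=BYTES_PER_JOB):
    """Return the estimated minimum memory, in bytes, for num_jobs queued jobs."""
    if num_jobs < 0:
        raise ValueError("num_jobs must be non-negative")
    return num_jobs * bytes_per_job


def format_gib(num_bytes):
    """Format a byte count as GiB with two decimal places."""
    return f"{num_bytes / 2**30:.2f} GiB"


# Queue sizes mentioned in this thread.
for jobs in (20_000, 100_000, 300_000):
    print(jobs, "jobs: at least", format_gib(estimate_queue_memory_bytes(jobs)))
```

By this estimate, 300,000 single-job clusters would need somewhere near 3 GB for the queue alone, before anything else the schedd does.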

The schedd also tries to avoid starting shadows (the processes that manage running jobs) if it estimates that there is not enough virtual memory. If this is happening, you will see a message in the logs like this:

Swap space exhausted! No more jobs can be run!
Solution: get more swap space, or set RESERVED_SWAP = 0
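
If you want to check for this automatically, a small Python sketch like the following can scan log lines for the message text quoted above. The helper name and the sample lines are hypothetical (the timestamps and the unrelated entry are invented for illustration); only the two message strings come from the actual log output.

```python
# Message text as quoted above; everything else here is illustrative.
SWAP_MESSAGE = "Swap space exhausted! No more jobs can be run!"


def find_swap_warnings(lines):
    """Return (line_number, text) pairs for lines containing the swap warning."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SWAP_MESSAGE in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical sample input standing in for a schedd log.
sample = [
    "6/8 09:00:01 (unrelated log line)\n",
    "6/8 09:00:02 Swap space exhausted! No more jobs can be run!\n",
    "6/8 09:00:02     Solution: get more swap space, or set RESERVED_SWAP = 0\n",
]

for number, text in find_swap_warnings(sample):
    print(f"line {number}: {text}")
```

To scan a real log, open the schedd's log file (wherever your installation writes it) and pass the file object to find_swap_warnings in place of the sample list.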

--Dan Bradley

On 06/08/07, Dan Scarborough <dan.scarborough@xxxxxx> wrote:


    Hello,
    I have set up a small (4-node) test grid using Condor: 4 Linux (2.6 kernel)
    machines using a shared file system, running Condor 6.8. On Friday, I tested
    a job in the Java universe, which ran a number (~20) of times quite happily.
    I then ramped up the number of jobs, somewhat optimistically, to 300,000 and
    left for the day. I've come back in to find the following error and zero
    output:
    ERROR: Failed to commit job submission into the queue.

    1) Is there a limit on the job queue length in Condor?
    2) If so, is this by design, or determined by an installation-specific
    factor, such as the O/S or available memory?
    3) Where is this documented? Sorry, but I cannot find it anywhere in the
    manual or forum history.

    Any help would be much appreciated.
    Many thanks,
    Dan
    ------------------------------------------------------
       Dan Scarborough
       Research IT
       Deutsche Bank
       +44 (0)20 754 55914
    ------------------------------------------------------

    _______________________________________________
    Condor-users mailing list
    To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/condor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/condor-users/

