[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] submitting LOTS of jobs

Thanks Steve for bringing up Swift/Falkon. Outside of bringing down you job count, by repackaging your application logic somehow, and perhaps implementing some application specific framework, I think your best bet is something like Swift and/or Falkon.

Here is some info (mostly on Falkon) that will give you an idea of the scales and performance we can achieve with Falkon.

Here are the URLs to these 2 projects:

Here are the short descriptions of what they are:
Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.
Falkon aims to enable the rapid and efficient execution of many tasks on large compute clusters, and to improve application performance and scalability using novel data management techniques. Falkon combines three techniques to achieve these goals: (1) multi-level scheduling techniques to enable separate treatments of resource provisioning and the dispatch of user tasks to those resources; (2) a streamlined task dispatcher able to achieve order-of-magnitude higher task dispatch rates than conventional schedulers; and (3) performs data caching and uses a data-aware scheduler to leverage the co-located computational and storage resources to minimize the use of shared storage infrastructure. Falkon’s integration of multi-level scheduling, streamlined dispatchers, and data management delivers performance not provided by any other system.
Now, about the scales that these projects have been tested on. The largest real application runs were made on a IBM BlueGene/P with 128K processors, running 128K jobs concurrently, with a total of 1M jobs.

Synthetic benchmarks have been run up to 160K processors, with a variety of job execution times to measure efficiency:

The kinds of job throughputs you can expect from Falkon are:

We also have done stress testing running 1 billion jobs (sleep 0) on a small cluster of 128 processors in about 19 hours.

To also get an idea of the level of performance you can expect from Falkon, if you have a large number of processors available, see the run we made with 1 million 60 second jobs on 160K processors on the BlueGene/P, the run completed in 453 seconds (assuming provisioned resources):

Also, the overheads of starting up at such large scale can be high, due to various reasons; here is the startup time broken down by components:

Here are some papers that will give you further reading on both Swift and Falkon:
Michael Wilde, Ian Foster, Kamil Iskra, Pete Beckman, Zhao Zhang, Allan Espinosa, Mihael Hategan, Ben Clifford, Ioan Raicu. "Parallel Scripting for App lications at the Petascale and Beyond", IEEE Computer Nov. 2009 Special Issue on Extreme Scale Computing, 2009
Ioan Raicu, Zhao Zhang, Mike Wilde, Ian Foster, Pete Beckman, Kamil Iskra, Ben Clifford. "Toward Loosely Coupled Programming on Petascale Systems", to appear at IEEE/ACM Supercomputing 2008.
Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde.  "Falkon: a Fast and Light-weight tasK executiON framework", IEEE/ACM SuperComputing 2007.
Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor von Laszewski, Ioan Raicu, Tiberiu Stef-Praun, Mike Wilde.  "Swift: Fast, Reliable, Loosely Coupled Parallel Computation", IEEE Workshop on Scientific Workflows 2007.

If you have any questions about how Swift/Falkon can help run/manage your millions to billions of jobs, don't hesitate to contact me or others from these projects.


Steven Timm wrote:
Most of the schedd submit-side limits deal with how many jobs
are running at any given time, not how many are in the queue at all.
DAGman or one of the newer workflow managers should be used to control
the workflow and the job submissions.  You might want to look at 
Falkon/Swift from the U of Chicago, they have been doing a lot of work
like this recently although things seem to be working best for them
when they run on a BlueGene.


On Wed, 4 Nov 2009, Jonathan D. Proulx wrote:

Hi All,

I have a user looking to submit 15million jobs this is about two
orders of magnitude above what weve done previously.  I don't yet know
too much about the calculation other than it's genome based and
running against the human genome (a later goal) would result in
1.8 billion jobs.

leaving aside for a monent the question of how long this will take on
my resources, any advise on facilitating this number of jobs?  Are
there submit side limits I'm going to run into?

Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting

The archives can be found at:



Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384 
Evanston, IL 60208-3118
Cel:   1-847-722-0876
Tel:   1-847-491-8163
Email: iraicu@xxxxxxxxxxxxxxxxxxxxx
Web:   http://www.eecs.northwestern.edu/~iraicu/