Thanks, Steve, for bringing up Swift/Falkon. Short of bringing down your
job count by repackaging your application logic somehow, and perhaps
implementing an application-specific framework, I think your best bet
is something like Swift and/or Falkon.
Here is some info (mostly on Falkon) that will give you an idea of the scales and performance these systems can achieve.
Here are the URLs to these 2 projects:
Here are the short descriptions of what they are:
Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.
Falkon aims to enable the rapid and efficient execution of many tasks on large compute clusters, and to improve application performance and scalability using novel data management techniques. Falkon combines three techniques to achieve these goals: (1) multi-level scheduling, to enable separate treatment of resource provisioning and the dispatch of user tasks to those resources; (2) a streamlined task dispatcher able to achieve order-of-magnitude higher task dispatch rates than conventional schedulers; and (3) data caching and a data-aware scheduler that leverage the co-located computational and storage resources to minimize use of the shared storage infrastructure. Falkon's integration of multi-level scheduling, streamlined dispatching, and data management delivers performance not provided by any other system.

Now, about the scales these projects have been tested at. The largest real application runs were made on an IBM BlueGene/P with 128K processors, running 128K jobs concurrently, for a total of 1M jobs.
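To make the data-aware scheduling idea in (3) concrete, here is a toy sketch (this is NOT Falkon's actual code or API, and the worker/file names are made up): prefer dispatching a task to a worker that already has the task's input cached locally, falling back to any idle worker, which then has to hit shared storage.

```python
def pick_worker(task_input, idle_workers, cache):
    """Toy data-aware dispatch: cache maps worker id -> set of locally cached files."""
    for w in idle_workers:
        if task_input in cache.get(w, set()):
            return w, "local-cache"           # cache hit: avoid shared storage
    return idle_workers[0], "shared-storage"  # miss: any idle worker, reads shared FS

# Hypothetical example: two workers, each caching one genome file
cache = {"w1": {"chr1.fa"}, "w2": {"chr2.fa"}}
print(pick_worker("chr2.fa", ["w1", "w2"], cache))  # ('w2', 'local-cache')
print(pick_worker("chr3.fa", ["w1", "w2"], cache))  # ('w1', 'shared-storage')
```

The point of the real scheduler is the same as this sketch's: at 100K+ processors, sending a task where its data already sits is what keeps the shared filesystem from becoming the bottleneck.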
Synthetic benchmarks have been run up to 160K processors, with a variety of job execution times to measure efficiency:
The kinds of job throughputs you can expect from Falkon are:
We have also done stress testing, running 1 billion jobs (sleep 0) on a small cluster of 128 processors in about 19 hours.
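As a back-of-the-envelope check on what that stress test implies for sustained dispatch rate (using only the numbers above):

```python
jobs = 1_000_000_000        # 1 billion sleep-0 jobs
seconds = 19 * 3600         # ~19 hours of wall time
print(round(jobs / seconds))  # ~14620 jobs/sec sustained
```

That is on the order of 14-15K tasks dispatched, executed, and retired per second, sustained for 19 hours, which is the "order-of-magnitude higher dispatch rate" claim in practice.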
To get an idea of the level of performance you can expect from Falkon when you have a large number of processors available, see the run we made with 1 million 60-second jobs on 160K processors of the BlueGene/P; the run completed in 453 seconds (assuming already-provisioned resources):
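The efficiency of that run can be worked out from the numbers above: the ideal completion time with zero scheduling overhead is (jobs x task length) / processors, and the ratio of ideal to actual time is the efficiency.

```python
jobs, task_s, procs = 1_000_000, 60, 160_000
ideal = jobs * task_s / procs          # 375.0 s if dispatch were free
actual = 453                           # measured completion time (s)
print(ideal, round(100 * ideal / actual, 1))  # 375.0  82.8 (% efficiency)
```

So even with 60-second tasks at 160K-processor scale, the run achieved roughly 83% of ideal.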
Also, the overheads of starting up at such large scales can be high, for various reasons; here is the startup time broken down by component:
Here are some papers that will give you further reading on both Swift and Falkon:
Michael Wilde, Ian Foster, Kamil Iskra, Pete Beckman, Zhao Zhang, Allan Espinosa, Mihael Hategan, Ben Clifford, Ioan Raicu. "Parallel Scripting for Applications at the Petascale and Beyond", IEEE Computer, Nov. 2009, Special Issue on Extreme Scale Computing, 2009.
Ioan Raicu, Zhao Zhang, Mike Wilde, Ian Foster, Pete Beckman, Kamil Iskra, Ben Clifford. "Toward Loosely Coupled Programming on Petascale Systems", to appear at IEEE/ACM Supercomputing 2008.
Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde. "Falkon: a Fast and Light-weight tasK executiON framework", IEEE/ACM Supercomputing 2007.
Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor von Laszewski, Ioan Raicu, Tiberiu Stef-Praun, Mike Wilde. "Swift: Fast, Reliable, Loosely Coupled Parallel Computation", IEEE Workshop on Scientific Workflows 2007.
If you have any questions about how Swift/Falkon can help run/manage your millions to billions of jobs, don't hesitate to contact me or others from these projects.
Steven Timm wrote:
> Most of the schedd submit-side limits deal with how many jobs are running at any given time, not how many are in the queue at all. DAGMan or one of the newer workflow managers should be used to control the workflow and the job submissions. You might want to look at Falkon/Swift from the U of Chicago; they have been doing a lot of work like this recently, although things seem to work best for them when they run on a BlueGene.
>
> Steve
>
> On Wed, 4 Nov 2009, Jonathan D. Proulx wrote:
>> Hi All,
>> I have a user looking to submit 15 million jobs; this is about two orders of magnitude above what we've done previously. I don't yet know too much about the calculation other than that it's genome based, and running against the human genome (a later goal) would result in 1.8 billion jobs. Leaving aside for a moment the question of how long this will take on my resources, any advice on facilitating this number of jobs? Are there submit-side limits I'm going to run into?
>> Thanks,
>> -Jon
-- 
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384
Evanston, IL 60208-3118
=================================================================
Cel: 1-847-722-0876
Tel: 1-847-491-8163
Email: iraicu@xxxxxxxxxxxxxxxxxxxxx
Web: http://www.eecs.northwestern.edu/~iraicu/
     https://wiki.cucis.eecs.northwestern.edu/
=================================================================