Re: [Condor-users] Condor's DAG scalability?
- Date: Fri, 24 Feb 2006 12:17:47 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor's DAG scalability?
On Fri, 24 Feb 2006, Armen Babikyan wrote:
> I'm planning to use Condor on a cluster of ~50 CPUs to carry out a large
> set of experiments. Each experiment will have several different
> modules, which need to be executed in a sequential fashion. My block
> diagrams of each experiment are arranged such that both looping and
> nested looping need to occur. Fortunately, iterations of loops are
> completely independent of each other data-wise.
> I see that Condor's DAG functionality only allows inclusion of one job
> per submit file that is referenced with the "JOB" directive. Therefore,
> I see the most straightforward solution to condor-izing my experiment is
> to dynamically generate a DAG file with (potentially) hundreds or
> thousands of JOB entries, and PARENT/CHILD entries with hundreds or
> thousands of arguments.
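> A generated DAG of that shape might look something like this (node and
> file names here are made up for illustration; the JOB and PARENT/CHILD
> syntax is standard DAGMan):
>
>     # experiment.dag -- hypothetically generated by a script
>     JOB  Setup    setup.submit
>     JOB  Run0     run.submit
>     JOB  Run1     run.submit
>     JOB  Collect  collect.submit
>     # each loop iteration is independent, so the Run nodes are siblings
>     PARENT Setup CHILD Run0 Run1
>     PARENT Run0 Run1 CHILD Collect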
Actually, this will change with the upcoming 6.7.17 release (which should
be out within days). 6.7.17 will allow multiple Condor jobs per submit
file, as long as the jobs are all part of a single cluster (e.g., they
use the same executable).
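Under that new behavior, a single node's submit file could queue several
procs in one cluster, for example (file and executable names here are
hypothetical):

    # run.submit -- one cluster, several procs, one shared executable
    universe   = vanilla
    executable = run_module
    arguments  = $(Process)          # each proc gets its own index
    output     = run.$(Process).out
    error      = run.$(Process).err
    log        = run.log
    queue 4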
> May I solicit some words of wisdom with respect to the scalability of
> Condor's DAG functionality as I will be using it? :-) Have others used
> Condor's DAG tools for single experiments in which there are thousands
> (or even millions) of component processes? Of course, some of these
> components will be hidden under nested condor_dagman executions, but
> nevertheless, there will be a lot of schedule-processing going on...will
> Condor and/or condor_dagman be able to handle this?
We routinely run DAGs here that have hundreds of nodes, and I think
other people have run bigger ones than that. So far I don't think
anyone has run into limitations with the actual number of nodes.
There are throttles in DAGMan to control job submission if the sheer
number of jobs is a problem for your pool (e.g., if you have lots of
"sibling" jobs that might otherwise all get submitted at once).
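For example, you can cap how many node jobs DAGMan will have in the queue
at once with the -maxjobs option (the value 20 here is just illustrative;
there are similar -maxpre/-maxpost options for PRE and POST scripts):

    condor_submit_dag -maxjobs 20 experiment.dag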