I have ported a LIGO gravitational-wave application that uses DAGMan
to the Open Science Grid. I frequently test this application on
multiple OSG production and test-bed clusters with 14 to 300 processors.
My standard test generates between 700 and 900 DAG nodes.
In most cases, recovery DAGs have allowed these complex jobs to
continue when a critical resource became unavailable in the
middle of the workflow.
To test scalability, I have run a test with up to 8000 DAG nodes
that completed successfully with over 72 hours wall time on a small

I'm planning to use Condor on a cluster of ~50 CPUs to carry out a large
set of experiments.  Each experiment will have several different
modules, which need to be executed in a sequential fashion.  My block
diagrams of each experiment are arranged such that both looping and
nested looping need to occur.  Fortunately, iterations of loops are
completely independent of each other data-wise.

As far as I can tell, Condor's DAG functionality allows only one job
per submit file referenced by a "JOB" directive.  The most
straightforward way to condor-ize my experiment therefore seems to be
to dynamically generate a DAG file with (potentially) hundreds or
thousands of JOB entries, and PARENT/CHILD entries with hundreds or
thousands of arguments.
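
For concreteness, here is a minimal sketch in Python of the kind of
generator I have in mind. It is only an illustration: the module names,
the `.sub` submit-file names, the `outer`/`inner` macro names, and the
final `collect` node are all hypothetical. Only the JOB, VARS and
PARENT/CHILD keywords come from DAGMan itself.

```python
from itertools import product


def build_dag(modules, outer_iters, inner_iters):
    """Return DAG file text for a nested loop of sequential module chains.

    Each (outer, inner) iteration gets its own chain of nodes, one per
    module, since the iterations are independent of each other.
    A single hypothetical 'collect' node waits for every chain to finish.
    """
    if not modules:
        raise ValueError("at least one module is required")

    lines = []
    tails = []  # last node of each chain, used as parents of 'collect'
    for i, j in product(range(outer_iters), range(inner_iters)):
        prev = None
        for m in modules:
            node = f"{m}_{i}_{j}"
            # One JOB entry per node; every node of a module reuses
            # the same (hypothetical) submit file.
            lines.append(f"JOB {node} {m}.sub")
            # Pass the loop indices to the submit file as macros.
            lines.append(f'VARS {node} outer="{i}" inner="{j}"')
            if prev is not None:
                # Modules within one iteration run sequentially.
                lines.append(f"PARENT {prev} CHILD {node}")
            prev = node
        tails.append(prev)

    # One PARENT/CHILD entry whose argument list grows with the loop sizes.
    lines.append("JOB collect collect.sub")
    lines.append(f"PARENT {' '.join(tails)} CHILD collect")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dag(["prep", "run", "post"], outer_iters=2, inner_iters=3))
```

With three modules and 2 x 3 iterations this yields 19 JOB entries; my
real experiments would scale the loop bounds up until the file holds
thousands of entries.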

May I solicit some words of wisdom with respect to the scalability of
Condor's DAG functionality as I will be using it?  :-)  Have others used
Condor's DAG tools for single experiments in which there are thousands
(or even millions) of component processes?  Of course, some of these
components will be hidden under nested condor_dagman executions, but
nevertheless, there will be a lot of schedule-processing going on...will
Condor and/or condor_dagman be able to handle this?

Any advice is appreciated!  Thanks,

  - Armen

