[Condor-users] Data Locality in DAGs?
- Date: Tue, 29 Sep 2009 18:53:22 -0400
- From: "Babikyan, Armen" <armenb@xxxxxxxxxx>
- Subject: [Condor-users] Data Locality in DAGs?
I made heavy use of Condor in a system I helped create about 3 years
ago, and Condor continues to serve us very well in that project. Now,
in a completely different project, I am once again in need of a
batch-processing system, and I'm revisiting Condor.
The first thing I've been doing is looking through 3 years of Condor
version history and the relevant manual chapters. For my project,
however, I see the need for a feature I requested about 3 years ago for
my old project: a mechanism to exploit data locality.
I have a system that generates about 2GB of data per run, and I need 200
runs (sweeping sets of input parameters) to compute 200 numbers for a
graph that shows an experiment's results. Although this creates 400GB
of data, I don't necessarily need to keep the data, since it can be
regenerated (I tag input parameters and code in svn so experiments are
reproducible).
Each run of this experiment uses a combination of (A) proprietary
binaries, (B) programs I've written, and (C) Matlab to create and
process data. Within each run, I naturally see an A->B->C relationship,
and in between each of these stages lies the large temporary output.
The structure of execution suggests a DAG, but I don't see a Condor
feature that forces all stages of a given run to execute on the same
machine (or slot). Running different stages of the same run on
different machines is very inefficient: the stages deal with data
that is going to be thrown away anyway, and local disks are orders of
magnitude faster than the network and/or network
filesystems. Staging input/output data using Condor's File Transfer
mechanism is prohibitively expensive, because (as far as I know - please
correct me if I'm wrong) all this data goes through the submit node.
My solution to this problem was to create a specification language that
allowed me to express data locality, effectively by shoving A->B->C
into a single DAG node (e.g., wrapping them in a shell-script-like
entity). This specification language gets processed by a python
program, and the python program generates sets of Condor/DAG files
(according to the specification) for ordinary job submission by a user.
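As a rough sketch of what each generated DAG node ends up running (the
stage commands below are placeholders, not my real programs):

```shell
#!/bin/sh
# One DAG node = one full A->B->C run on a single slot.
# Stage commands are placeholders for the real A/B/C programs.
set -e

# Keep the large intermediates on the execute node's local disk
scratch=$(mktemp -d "${TMPDIR:-/tmp}/run.XXXXXX")

# Stage A: proprietary binary (placeholder)
printf 'raw data\n' > "$scratch/a.out"

# Stage B: my own post-processing (placeholder)
tr 'a-z' 'A-Z' < "$scratch/a.out" > "$scratch/b.out"

# Stage C: Matlab analysis (placeholder); only this small result is kept
cat "$scratch/b.out"

# Localized cleanup: the big temporaries never touch the network
rm -rf "$scratch"
```

The generated DAG then just has one JOB line per parameter set, each
pointing at a submit file that runs a wrapper like this, so all three
stages land on whatever slot Condor picks for that node.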
Does Condor already allow me to do something like this at the DAG level
in a way I have overlooked?
To other Condor users - have you used Condor in situations where data
locality has serious performance implications, and how have you handled
it? Would anyone be interested if I cleaned up this code and put it on
the web somewhere? My specification language/program also provides the
ability to do several other things:
* Completely specify the experiment configuration in a single file, for
ease of experiment management
* Sweep both individual and tuples of input values, e.g. (x1, x2, x3) as
well as ((a1, b1, c1), (a2, b2, c2), (a3, b3, c3))
* Nested DAG support
* The shell-script-like entity described above has mechanisms to specify
exception handlers, ignore vs. exit on non-zero return values, and a few
other conveniences
* Crude multi-user support, so users running experiments aren't writing
to the same local directories on each grid node
* Localized cleanup (i.e. leaving it on for normal operations or
shutting it off for debugging experiments)
* Supply additional job attributes to Condor on a per-job basis
* Provides special file-copying wrapper programs that use a turnstile to
reduce load on shared filesystems when reading/writing data to network
file systems is necessary.
* Run arbitrary python statements (possibly dangerous, but very useful)
* Probably other things I'm forgetting, since I wrote this 2 or 3 years ago
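To give a feel for the turnstile mentioned above, here is a
stripped-down sketch (the lock path and function name are illustrative,
and a real version would also want a timeout and stale-lock handling):

```shell
#!/bin/sh
# turnstile_cp: copy a file to/from a shared filesystem, letting only
# one copy proceed at a time so jobs don't stampede the file server.
turnstile_cp() {
    src=$1
    dst=$2
    lockdir=${TURNSTILE_LOCK:-/tmp/turnstile.lock.d}

    # mkdir is atomic, so it doubles as a portable lock;
    # spin until we hold the turnstile
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done

    cp "$src" "$dst"

    # Release the turnstile for the next job
    rmdir "$lockdir"
}
```

Each grid job calls this instead of a bare cp, so at most one transfer
hits the network filesystem at any moment.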
I'm interested to know your thoughts. Thanks!
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796