Re: [Condor-users] Data Locality in DAGs?
- Date: Wed, 30 Sep 2009 07:57:09 -0400
- From: "Burnash, James" <jburnash@xxxxxxxxxx>
- Subject: Re: [Condor-users] Data Locality in DAGs?
We actually have the same need for data locality. I had set myself the task of writing ... pretty much what you've apparently already done. So yes, I would be very interested to see the code, cleaned up or not, and would hopefully be able to serve as a friendly tester for another environment.
It looks like you've done quite a bit of work. I'd be happy (and honored) to stand on your shoulders :-)
Unix & Cluster SA
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Babikyan, Armen
Sent: Tuesday, September 29, 2009 6:53 PM
To: Condor-Users Mail List
Subject: [Condor-users] Data Locality in DAGs?
I made heavy use of Condor in a system I helped create about 3 years
ago, and Condor continues to serve us very well in that project. Now,
in a completely different project, I am once again in need of a
batch-processing system, and I'm revisiting Condor.
The first thing I've been doing is looking through three years of Condor version
history and the relevant manual chapters. For my project, however, I see
the need for a feature I requested about three years ago for my old
project: a mechanism to exploit data locality.
I have a system that generates about 2GB of data per run, and I need 200
runs (sweeping sets of input parameters) to compute 200 numbers for a
graph that shows an experiment's results. Although this creates 400GB
of data, I don't necessarily need to keep the data, since it can be
regenerated (I tag input parameters and code in svn so experiments are
reproducible).
Each run of this experiment uses a combination of (A) proprietary
binaries, (B) programs I've written, and (C) Matlab to create and
process data. Within each run, I naturally see an A->B->C relationship,
and in-between each of these stages lies the large temporary output.
The structure of execution suggests a DAG, but I don't see a Condor
feature that forces the execution of each of these runs on the same
machine (or slot). Running different stages of the same run on
different machines is very inefficient because the stages deal with data
that is going to be thrown away anyway, and local disks are orders of
magnitude faster than the network and/or network
filesystems. Staging input/output data using Condor's File Transfer
mechanism is prohibitively expensive, because (as far as I know - please
correct me if I'm wrong) all of this data goes through the submit node.
My solution to this problem was to create a specification language that
allowed me to express data locality, effectively by shoving A->B->C
into a single DAG node (e.g. wrapping them in a shell-script-like
entity). This specification language is processed by a Python
program, which generates sets of Condor/DAG files
(according to the specification) for ordinary job submission by a user.
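For readers who want to see the core of the trick, here is a minimal, hypothetical sketch of such a generator (the stage commands, file names, and submit-file contents are placeholders of my own, not the actual tool): it emits a wrapper script that runs the A->B->C stages sequentially on whatever slot Condor assigns, plus a one-node DAG and submit file for it, so all stages share the same local disk.

```python
import os
import stat
import textwrap

def generate_run(run_id, stages, outdir="."):
    """Emit a wrapper script, a submit file, and a one-node DAG for one run.

    `stages` is an ordered list of shell commands (A -> B -> C). Because
    they all execute inside a single wrapper, Condor schedules them as one
    job, and every stage sees the same local scratch disk.
    """
    wrapper = os.path.join(outdir, f"run_{run_id}.sh")
    submit = os.path.join(outdir, f"run_{run_id}.sub")
    dag = os.path.join(outdir, f"run_{run_id}.dag")

    # Wrapper: run each stage in order, aborting on the first failure so
    # the DAG node reports a non-zero exit status.
    with open(wrapper, "w") as f:
        f.write("#!/bin/sh\nset -e\n")
        for cmd in stages:
            f.write(cmd + "\n")
    os.chmod(wrapper, os.stat(wrapper).st_mode | stat.S_IEXEC)

    # Minimal vanilla-universe submit file for the wrapper.
    with open(submit, "w") as f:
        f.write(textwrap.dedent(f"""\
            universe = vanilla
            executable = run_{run_id}.sh
            output = run_{run_id}.out
            error = run_{run_id}.err
            log = run_{run_id}.log
            queue
        """))

    # One DAG node per run; ordering across runs could be expressed with
    # PARENT/CHILD lines if needed.
    with open(dag, "w") as f:
        f.write(f"JOB run_{run_id} run_{run_id}.sub\n")
    return wrapper, submit, dag

# Example with three placeholder stages standing in for A, B, and C:
generate_run(0, ["./stage_a --params p0.cfg",
                 "./stage_b tmp_a.dat",
                 "matlab -batch process_run"])
```

The point is only that the DAG sees a single node per run, so DAGMan never has the chance to schedule the stages on different machines.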
Does Condor already allow me to do something like this at the DAG level
in a way I have overlooked?
To other Condor users - have you used Condor in situations where data
locality has serious performance implications, and how have you handled
it? Would anyone be interested if I cleaned up this code and put it on
the web somewhere? My specification language/program also provides the
ability to do several other things:
* Completely specify the experiment configuration in a single file, for
ease of experiment management
* Sweep both individual and tuples of input values, e.g. (x1, x2, x3) as
well as ((a1, b1, c1), (a2, b2, c2), (a3, b3, c3))
* Nested DAG support
* The shell-script-like entity described above has mechanisms to specify
exception handlers, ignore vs. exit on non-zero return values, and a few
other behaviors
* Crude multi-user support, so users running experiments aren't writing
to the same local directories on each grid node
* Localized cleanup (i.e. leaving it on for normal operations or
shutting it off for debugging experiments)
* Supply additional job attributes to Condor on a per-job basis
* Provides special file-copying wrapper programs that use a turnstile to
reduce load on shared filesystems when reading/writing data to network
file systems is necessary.
* Run arbitrary python statements (possibly dangerous, but very useful)
* Probably other things I'm forgetting, since I wrote this 2 or 3 years ago
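To illustrate the sweep semantics in the second bullet above (this is my guess at the intended behavior, not the actual tool): individually swept values expand as a cross product, while tuples vary together, which in Python terms is itertools.product vs. zip.

```python
from itertools import product

def expand_sweeps(individual, grouped):
    """Expand sweep parameters into one dict of variable bindings per run.

    `individual` maps a name to values swept independently (cross product);
    `grouped` is a list of (names, value_tuples) pairs whose values vary
    together, like ((a1, b1, c1), (a2, b2, c2), ...).
    """
    axes = []  # each axis is a list of partial {name: value} bindings
    for name, values in individual.items():
        axes.append([{name: v} for v in values])
    for names, value_tuples in grouped:
        axes.append([dict(zip(names, vt)) for vt in value_tuples])

    runs = []
    for combo in product(*axes):  # cross product across axes
        binding = {}
        for part in combo:
            binding.update(part)
        runs.append(binding)
    return runs

# x is swept independently; (a, b) pairs vary together: 2 * 2 = 4 runs.
runs = expand_sweeps({"x": [1, 2]}, [(("a", "b"), [(10, 20), (30, 40)])])
```

Each resulting binding would then be substituted into one generated DAG node, so a 200-point sweep yields 200 independent single-machine runs.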
I'm interested to know your thoughts. Thanks!
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796
Condor-users mailing list