[Condor-users] Job with the same data on many machines, wanting a way to reduce file I/O
- Date: Sat, 14 Jan 2012 21:39:48 +1100
- From: Mark Assad <massad@xxxxxxxxx>
- Subject: [Condor-users] Job with the same data on many machines, wanting a way to reduce file I/O
I would like to be able to create a job that starts on as many
slots as possible, with all of them starting at the same time, and
repeat this until all jobs are done. Is this kind of task possible in
Condor? I've only been able to find ways to run tasks where I know I
need N tasks to run at once. I don't care about a specific value of N;
I just want it to be as high as it can be.
The situation is that I have an executable that calculates a
result by reading the same set of data, but the parameters that are
fed into the executable change regularly. The data file is about
300MB, the parameters are less than 1K, and the result is 2 numbers.
I am running this job on a dedicated cluster of about 64 machines,
each with a local disk and 8 cores. The local disk is large enough to
hold the data while it is being processed, but not large enough to
store the data permanently.
The same 300MB of data will be processed a few thousand times. The
300MB is a subset of a much larger set of data, so I can't simply
copy the whole thing to the nodes. The results of early runs are used
to tune the parameters for later runs, which means I can't just create
one massive batch of jobs up front.
What I would like to do is: start the job on as many slots as are
free, somehow multicast the data to those nodes, run the job across
those nodes, and then, as nodes start to free up, multicast the data
out to the newly free nodes as well.
At the moment, what I do is use the Condor file transfer mechanism to
send the data files at the start of each job. This means that the same
file is sent many times, and often many times to the same host (the
machines are multi-core, so several slots on one host each receive
their own copy).
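For reference, a minimal vanilla-universe submit file for this kind of per-job transfer might look like the sketch below (the file names, executable name, and queue count are illustrative, not from my actual setup). Because `transfer_input_files` is handled per job, every queued job gets its own copy of the data file, even when several jobs land on the same host:

```
# Hypothetical sketch of the current per-job transfer setup.
universe                = vanilla
executable              = calc
arguments               = params.txt
transfer_input_files    = data.bin, params.txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 512
```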
I was thinking about re-working the application to use something like
MPI, where the master reads the file and then sends it to all the
slaves as a broadcast. To do this, though, I'll need a way to
say "start as many jobs at the same time as possible". Is it possible
to configure Condor to do that?
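The master-reads-once idea can be sketched on a single machine with Python's multiprocessing module (a hypothetical stand-in for an MPI_Bcast-based design; the names and the toy computation are illustrative, not the real executable). Each worker receives the data once at startup, rather than once per task:

```python
# Sketch: the master process reads the data once and hands it to each
# worker at startup, instead of re-transferring it for every task.
from multiprocessing import Pool

DATA = None  # populated once per worker by the initializer


def init_worker(shared_bytes):
    """Runs once in each worker process; receives the shared data."""
    global DATA
    DATA = shared_bytes


def run_task(params):
    """Toy computation combining the shared data with per-task parameters."""
    return (params, sum(DATA) % 251 + params)


if __name__ == "__main__":
    data = bytes(range(256)) * 4  # stands in for the 300MB data file
    with Pool(processes=4, initializer=init_worker, initargs=(data,)) as pool:
        results = pool.map(run_task, [1, 2, 3])
    print(results)  # -> [(1, 41), (2, 42), (3, 43)]
```

In a real MPI version, the initializer step would be an MPI_Bcast of the buffer from the master rank, and the per-task parameters would be scattered or queued to the ranks.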
I am open to other suggestions for how I can improve the task. I don't
have direct access to the machines that run the jobs (no SSH, and no
shared file system), so I can't pre-load the nodes with the data that
the job will need.
Any hints would be appreciated, or just general areas to look.