
Re: [Condor-users] Bunching many (2-30 sec) calculations as one job



On 10/25/07, Atle Rudshaug <atle.rudshaug@xxxxxxxxx> wrote:

> Anyway, I have been looking at DAGMan for automatic post operations
> (finding the best result from all the DAG jobs after they all are
> finished. I have a child DAG node with a script as executable in the
> "local" universe). But each DAG node/job is submitted as individual
> jobs from the DAGman script and require their own scheduling (or so it
> seems). I have X number of condor submit scripts (DAG nodes) each with
> ONE job/calculation.
>
> Since one execution can be as short as 2 sec (and the executable is
> ~22MB+libs(~2MB)) I would like to bunch them up a bit so that one job
> submission would run 10 or 100++ calculations. I guess this could be
> done by sending a bash script as executable with 10 or 100 lines of
> the executable with their respective command line arguments. But what
> will happen to the output file from each condor job if there are
> multiple executions of the binary in one job? Will each execution
> overwrite each others data so I only get the result from the last
> execution? Is there some other way to do this?

Golden rule - exploit application-domain optimizations before
platform-domain ones.

In this instance each 'iteration' uses the same executable, so you
will benefit from repeating the calculation locally several times per
job (the transfer and startup overheads become considerable when the
calculation itself averages only a few seconds).

If the output from each iteration is directly comparable (for the
purposes of your 'best' function) to the other iterations within a
job, then have each job (as opposed to each iteration) work out the
best result as it goes and output only the winner. That way the final
reduction becomes smaller too, since you have already distributed much
of it.
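A minimal sketch of that bash-wrapper approach (everything here is
hypothetical: `calc` is a stub standing in for the real ~22MB binary, and
the convention that the score is the last line of the output is invented).
Each iteration writes to its own file, so runs cannot overwrite each
other's data, and only the winning result is kept for transfer back:

```shell
#!/bin/sh
# Stub standing in for the real executable (assumption):
# pretend the "score" of an iteration is its argument squared.
calc() { echo "$(( $1 * $1 ))"; }

best_score=""
best_file=""
for arg in 7 3 9 5; do              # per-iteration command line arguments
    out="result_${arg}.txt"         # distinct output file per iteration
    calc "$arg" > "$out"
    score=$(tail -n 1 "$out")       # assumed score convention
    # keep the lowest score seen so far
    if [ -z "$best_score" ] || [ "$score" -lt "$best_score" ]; then
        best_score="$score"
        best_file="$out"
    fi
done

# Emit only the winning iteration; this single file is what
# Condor would transfer back, shrinking the final reduction.
cp "$best_file" best_result.txt
echo "best: $best_score from $best_file"
```

If the per-iteration outputs are not reducible inside the job, the same
distinct-filename trick still answers the overwrite question: each run gets
its own file, and you list them all for transfer back.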

> And, by the way, is there any smooth way of sending larger
> jobs/bunches to more powerful nodes more or less automatically?

If you can identify the larger jobs at submit time, yes (though it is
something of a hassle).
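For the submit-time case, one sketch (the wrapper script and argument-file
names are hypothetical): machines advertise speed-related ClassAd
attributes such as Mips, so a large bundle can require a fast machine and
prefer the fastest one available.

```
# Hypothetical submit file for a large bundle
universe     = vanilla
executable   = wrapper.sh
arguments    = args_big_000.txt
requirements = (Mips > 2000)   # only run on reasonably fast machines
rank         = Mips            # among those, prefer the fastest
output       = big_000.out
error        = big_000.err
log          = bundle.log
queue
```

A small bundle would simply omit the requirements/rank lines, or invert
them to soak up the slower machines.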

There are other options if not, but those complicate things further and
start to impose restrictions on how your iterations are expressed
(essentially, a job could dynamically work out how many iterations to
take responsibility for based on its own view of how fast it was going).

> Say I
> have 10 jobs each with 100 computations and nodes ranging from old P3
> to new Core2Quad machines. It would be nice to be able run a bunch of
> maybe 200 calculations on the best machine and 20 on the slowest so
> that they use more or less the same time.

To get this truly efficient the jobs would need to be dynamic (which
is doable).
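A hedged sketch of the dynamic variant (assumptions: `calc` is again a
stub, and the iteration list is hard-coded here; a real deployment would
need the iterations claimed atomically from a shared list or a small
server). The job works against a wall-clock budget and stops claiming
iterations when the budget is spent, so a Core2Quad naturally completes
many more iterations than a P3 in the same time:

```shell
#!/bin/sh
# Stub for the real calculation (assumption)
calc() { echo "$(( $1 + 100 ))"; }

budget=2                            # seconds of wall clock this job may use
start=$(date +%s)
done_count=0
for arg in 1 2 3 4 5 6 7 8; do      # stand-in for a shared work list
    now=$(date +%s)
    # stop claiming work once the budget is exhausted
    [ $(( now - start )) -ge "$budget" ] && break
    calc "$arg" > "result_${arg}.txt"
    done_count=$(( done_count + 1 ))
done
echo "completed $done_count iterations"
```

The key design point is that the load balancing falls out of the timing
loop itself, rather than from trying to predict machine speeds at submit
time.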

I should point out that for what you are describing, Google's
MapReduce (http://en.wikipedia.org/wiki/MapReduce) model sounds a much
better fit for your tasks (it may not suit other areas and I cannot
vouch for it personally, but I would suggest considering it).

Matt