[Condor-users] Bunching many (2-30 sec) calculations as one job

From an earlier post on another topic:
> > I am also looking for some way to send a collection of small
> > jobs/executions as a single larger job to a node to enhance the
> > scheduling vs execution time ratio. Each single computation is <30
> > sec, but there can be many of them. Is there a way to batch up a
> > collection of these executions and send this batch as a single job so
> > the node will compute more than it is scheduling and transferring
> > files? Maybe just use a batch script as executable? How does that
> > work? Do I need to have the executable as an input file? Can the
> > executable be on an NFS mount along with the libraries?
> You could look at DAGMan, but a shell script (or batch script for Windows)
> could do the job nicely. Your shell script becomes your executable, and your
> real executable becomes another of your input files to be transferred, along
> with the libraries and the real input files. Your arguments become options to
> the shell script telling it how many jobs to run.
> Note: you will not be able to use the standard universe any more; vanilla
> becomes your universe.
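A minimal sketch of the vanilla-universe submit description this implies, generated from Python so the batch size and file lists are easy to vary. The file names (run_batch.sh, myapp, libmyapp.so, params.txt) and the meaning of the two wrapper arguments (batch index, batch size) are hypothetical; the submit commands themselves (universe, executable, arguments, should_transfer_files, when_to_transfer_output, transfer_input_files, output, error, log, queue and the $(Process) macro) are standard Condor submit commands, but check the manual for your version.

```python
def make_submit(wrapper, binary, libs, inputs, batch_size, n_batches):
    """Return a vanilla-universe submit description (all file names hypothetical).

    The wrapper script is the Condor executable; the real binary and its
    libraries are shipped to the execute node as ordinary input files.
    """
    transfer = ", ".join([binary, *libs, *inputs])
    lines = [
        "universe                = vanilla",
        f"executable              = {wrapper}",
        # $(Process) runs 0..n_batches-1, so each queued job gets its own
        # batch index; the wrapper (hypothetically) runs batch_size calculations.
        f"arguments               = $(Process) {batch_size}",
        "should_transfer_files   = YES",
        "when_to_transfer_output = ON_EXIT",
        f"transfer_input_files    = {transfer}",
        "output                  = batch_$(Process).out",
        "error                   = batch_$(Process).err",
        "log                     = batches.log",
        f"queue {n_batches}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 10 jobs of 100 calculations each, matching the numbers in the post
    print(make_submit("run_batch.sh", "myapp", ["libmyapp.so"], ["params.txt"], 100, 10), end="")
```

With this layout, 10 queued jobs replace 1000 individual submissions, so the ~24 MB binary and libraries are transferred 10 times instead of 1000.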

I am unable to condor_compile my application, so I have to use the
vanilla universe either way. When I run condor_compile, it says it can't
find my application's libraries, while a regular "make" works fine. Do I
need to link my application statically for this to work?

Anyway, I have been looking at DAGMan for automatic post-processing
(finding the best result from all the DAG jobs after they have all
finished; I have a child DAG node that runs a script as its executable
in the "local" universe). But each DAG node/job is submitted as an
individual job by DAGMan and requires its own scheduling (or so it
seems). I have X condor submit files (DAG nodes), each with ONE
job/calculation.
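The final "find the best result" node could run a script along these lines. This is a sketch under assumptions that are not in the post: each calculation leaves a file matching calc_*.out (a hypothetical naming scheme) whose last non-empty line is a single numeric score, and by default a lower score is better.

```python
import glob
import sys


def best_result(pattern="calc_*.out", lower_is_better=True):
    """Return (score, filename) of the best result among files matching pattern,
    or None if no file holds a usable score."""
    scored = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lines = [ln.strip() for ln in f if ln.strip()]
        if not lines:
            continue  # empty output, e.g. a failed calculation
        try:
            scored.append((float(lines[-1]), path))
        except ValueError:
            continue  # last line is not a number: ignore this file
    if not scored:
        return None
    # Tuples compare by score first, then file name, so ties are deterministic.
    return min(scored) if lower_is_better else max(scored)


if __name__ == "__main__":
    pattern = sys.argv[1] if len(sys.argv) > 1 else "calc_*.out"
    best = best_result(pattern)
    print("no usable results found" if best is None else f"best score {best[0]} in {best[1]}")
```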

Since one execution can be as short as 2 sec (and the executable is
~22 MB plus ~2 MB of libraries), I would like to bunch them up so that
one job submission runs 10, 100 or more calculations. I guess this
could be done by submitting a bash script as the executable, with 10 or
100 lines that each invoke the binary with its own command-line
arguments. But what will happen to the output file of each Condor job
if the binary is executed multiple times in one job? Will each
execution overwrite the others' data, so that I only get the result of
the last execution? Is there some other way to do this?
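On the output question: every run inherits the wrapper's stdout, so anything printed to stdout would end up concatenated in the job's single output file rather than overwritten, while a binary that always writes the same file name would normally overwrite it on each run. The sketch below is a wrapper written in Python rather than bash (assuming Python 3 is available on the execute nodes) that gives each calculation its own stdout file and renames a fixed-name output file after each run. The names run_batch, calc_NNN.out, FIXED_OUTPUT and the argument-file format are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical batch wrapper: run many short calculations inside ONE Condor job."""
import os
import subprocess
import sys


def run_batch(command, arg_sets, fixed_output=None):
    """Run command + args once per entry in arg_sets; return the exit codes.

    Each run's stdout and stderr go to calc_000.out, calc_001.out, ...
    If the program always writes a file with the bare name `fixed_output`,
    it is renamed to calc_NNN_<name> after each run so the next run
    cannot overwrite it.
    """
    codes = []
    for i, args in enumerate(arg_sets):
        with open(f"calc_{i:03d}.out", "w") as out:
            proc = subprocess.run(list(command) + list(args),
                                  stdout=out, stderr=subprocess.STDOUT)
        codes.append(proc.returncode)
        if fixed_output and os.path.exists(fixed_output):
            os.replace(fixed_output, f"calc_{i:03d}_{fixed_output}")
    return codes


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical usage: run_batch.py BINARY ARGFILE [FIXED_OUTPUT]
    # ARGFILE holds one calculation per line, e.g. "-seed 17 -n 1000".
    binary, argfile = os.path.abspath(sys.argv[1]), sys.argv[2]
    fixed = sys.argv[3] if len(sys.argv) > 3 else None
    # The execute bit may not survive file transfer, so set it defensively.
    os.chmod(binary, os.stat(binary).st_mode | 0o100)
    with open(argfile) as f:
        arg_sets = [line.split() for line in f if line.strip()]
    codes = run_batch([binary], arg_sets, fixed)
    sys.exit(0 if all(c == 0 for c in codes) else 1)
```

With should_transfer_files = YES and no transfer_output_files setting, Condor by default transfers back files newly created in the job's scratch directory, so the per-run files should come back with the job; verify this against the manual for your Condor version.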

And, by the way, is there any smooth way of sending larger
jobs/bunches to more powerful nodes more or less automatically? Say I
have 10 jobs, each with 100 computations, and nodes ranging from old P3
to new Core2Quad machines. It would be nice to be able to run a bunch of
maybe 200 calculations on the best machine and 20 on the slowest, so
that they take more or less the same time.

- Atle