
Re: [HTCondor-users] Determine when all jobs in a cluster have finished?



On Wed, Jan 30, 2013 at 2:03 PM, Michael Aschenbeck
<m.g.aschenbeck@xxxxxxxxx> wrote:
> DAGMan is certainly an awesome tool... but if you have a Java master process
> that is creating these submit files, I think it's overkill.  I guess I
> could be missing something, but it seems like your master process should be
> like this:
>
> In Java code:
> 1) create the initial submit file and submit it (already doing this)
> 2) make a system call to condor_wait to wait for the initial job to finish
> 3) have Java check which data files came back, create submit files for the
> post-processing, and submit them
> 4) use condor_wait to wait for those jobs (if necessary)
>
> I just think that if you are using Java to create these scripts
> automatically and fire all this stuff off, you might as well use that
> for the "dynamic" aspect.  I do this sort of thing in C++ quite a bit and,
> while I know DAGMan and have used it, when I don't have crazy job
> dependencies it's much easier to just do a condor_wait as a system call in
> C++ and then go on to setting up a post-processing script. Please let me know
> if I'm missing something, or if you are familiar with condor_wait and it's
> really just not what you want, but I think this should be considered.
>
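
(For reference, the condor_wait call in step 2 above boils down to something
like this from Java; a sketch only, with "workers.log" standing in for whatever
the submit file's log = line names:)

import java.io.IOException;

public class WaitForCluster {
    // Rough sketch: block until every job recorded in workers.log has finished.
    // condor_wait watches the user log named by the submit file's "log = ..." line
    // and exits 0 once all jobs written to it have completed.
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("condor_wait", "workers.log")
                .inheritIO()   // show condor_wait's own output on our console
                .start();
        int rc = p.waitFor();
        if (rc != 0) {
            throw new RuntimeException("condor_wait exited with code " + rc);
        }
    }
}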

Well, I sort of (kinda) came to that conclusion too, but for a
different reason... I realized I don't need to wait until all the
workers are done to start postprocessing. The only thing that holds
all the workers together is the fact that they were all created from
the same DB query; other than that, they aren't connected. So I can
actually start postprocessing as soon as each worker is finished.
(Note: postprocessing requires DB access, which is only available to
the machine running QueryDB anyway.) Here is the revised plan:

1) QueryDB job is submitted periodically (via cron, Jenkins, a thread, whatever)
2) QueryDB creates one initial submit file for all the workers and submits it
(sketched below)
3) after submitting, QueryDB starts a loop that looks for individual jobs
completing and runs postprocess() on each one; the QueryDB process that
submitted the worker jobs is also the one running postprocess() on them
4) workers run the jobs and write their results to files
5) as QueryDB postprocesses each worker job, it updates the DB with the results
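
For step 2, the submit-file creation could look something like this rough
sketch (the worker executable, file names, and arguments are placeholders,
not the real QueryDB code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SubmitWorkers {
    // Rough sketch: write one submit file that queues one proc per input file,
    // then hand it to condor_submit.  "worker.sh" and the paths are placeholders.
    static void submitAll(List<String> inputFiles) throws IOException, InterruptedException {
        StringBuilder sub = new StringBuilder();
        sub.append("executable = worker.sh\n");
        sub.append("log        = workers.log\n");   // one shared user log for the whole cluster
        sub.append("output     = worker_$(Process).out\n");
        sub.append("error      = worker_$(Process).err\n");
        for (String input : inputFiles) {
            sub.append("arguments  = ").append(input).append("\n");
            sub.append("queue\n");                   // one job per input file
        }
        Path subFile = Path.of("workers.sub");
        Files.writeString(subFile, sub.toString());

        Process p = new ProcessBuilder("condor_submit", subFile.toString())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("condor_submit failed");
        }
    }
}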

The only tricky part I can think of is QueryDB knowing when each job
is finished, and knowing when to stop looking for results (so it
doesn't wait forever and knows when it's done). I guess I'll just have
to periodically run condor_q (and condor_history for jobs that have
already left the queue) to figure out which worker jobs are done and,
once I know they are done, postprocess them.
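
A rough sketch of what that polling loop could look like (postprocess(), the
cluster-id bookkeeping, and the 30-second interval are placeholders; it simply
treats a proc that has left the queue as finished, so removed or failed jobs
would need extra handling):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

public class PollWorkers {
    // Rough sketch: poll condor_q until every proc of the worker cluster has
    // left the queue, postprocessing each one as soon as it disappears.
    static void waitAndPostprocess(int clusterId, int totalProcs)
            throws IOException, InterruptedException {
        Set<Integer> done = new HashSet<>();
        while (done.size() < totalProcs) {
            // Ask condor_q which procs of this cluster are still queued or running.
            Set<Integer> stillQueued = new HashSet<>();
            Process p = new ProcessBuilder("condor_q", String.valueOf(clusterId),
                    "-af", "ProcId").start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    if (!line.isBlank()) {
                        stillQueued.add(Integer.parseInt(line.trim()));
                    }
                }
            }
            p.waitFor();
            // Anything no longer in the queue is treated as done and postprocessed once.
            for (int proc = 0; proc < totalProcs; proc++) {
                if (!stillQueued.contains(proc) && done.add(proc)) {
                    postprocess(clusterId, proc);
                }
            }
            if (done.size() < totalProcs) {
                Thread.sleep(30_000);   // poll every 30 seconds
            }
        }
    }

    static void postprocess(int clusterId, int procId) {
        // placeholder: read this worker's result file and update the DB
    }
}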

I thought about using DAGMan to submit the workers plus a child DoneJob,
where all the DoneJob would do is write a marker file (like a file named
"done") that the QueryDB code would look for as an indicator that all the
workers were finished. But I imagine using one of the HTCondor command-line
tools will be easier (and easier to make reusable).

Thanks again for everyone's input. It's really helped having someone
to bounce ideas and questions off of. HTCondor has a great community.

Brian