
Re: [HTCondor-users] Determine when all jobs in a cluster have finished?



>Note 1: You'll find that there's a 12-second delay before dagman starts the
>first job, and a few seconds of delay between the first completing and the
>second starting.  Hopefully that's acceptable in the overall scheme of things.
>
Yep - that's fine. Small price to pay.

>Note 2: DAGman also has a great feature called "rescue DAG", which means
>that if some DAG nodes fail, you can restart the DAG and the
>successfully-completed nodes will not be re-run.
>
I need to account for failures (and record them) myself anyway, so
I'll probably handle all failures and retries myself. Some of the
files in a job are expected to fail, so if one file fails I can't have
the whole job fail, and retrying the whole job would retry all of its
files, which wouldn't be terribly helpful. Because of the per-job
overhead, I can't/don't want to make each file a separate job, so
grouping them into bundles makes the most sense.

I'd really like the whole thing to be self-contained in one DAG like:
###
Job QueryDB querydb.job
Job Workers workers.job
Job PostProcess postprocess.job
PARENT QueryDB CHILD Workers
PARENT Workers CHILD PostProcess
###

since that seems much simpler and more self-contained, but I don't
think that's doable, because the results of the QueryDB job determine
the data and the number of worker jobs I'll need. For example, one run
of QueryDB could return 2 million results, and I would create 2000
data files containing 1000 entries each, which would be consumed by
2000 worker jobs. Another run might create only 1 data file and 1
worker. I can't think of a way to get this all working within one DAG
file. Right now, I pass each worker an argument naming the datafile to
process.
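
Roughly, a workers.job along those lines might look like the sketch
below (the paths, the chunk naming, and the count of 2000 are all
placeholders; QueryDB would write the real "queue N"):
###
# workers.job - hypothetical sketch
universe    = vanilla
executable  = run_worker.sh
arguments   = data/chunk_$(Process).dat
output      = results/worker_$(Process).out
error       = results/worker_$(Process).err
log         = workers.log
queue 2000
###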

The only way I can imagine it working is if QueryDB is a separate
Condor job that, after it finishes and writes out the datafiles, kicks
off the workers-plus-post-processing DAG. I think this will work; the
only trick I can see is getting the QueryDB data over to the
PostProcess job, but I think I can do that with a simple data file (or
another DB query).
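
In other words, the tail end of the QueryDB job might do something
like this (a sketch only; the names workers.dag and query_summary.txt
are assumptions on my part):
###
# After QueryDB has written data/chunk_*.dat and query_summary.txt:
cat > workers.dag <<'EOF'
Job Workers workers.job
Job PostProcess postprocess.job
PARENT Workers CHILD PostProcess
EOF
condor_submit_dag workers.dag
# PostProcess reads query_summary.txt (on shared storage or staged in)
# to get at the QueryDB results it needs.
###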

Thanks again for the help,
Brian

On Wed, Jan 30, 2013 at 8:25 AM, Brian Pipa <brianpipa@xxxxxxxxx> wrote:
> 3 Answers so far - thanks - let me hit each one:
>>I believe you are looking for "condor_wait".  The following page has all the info you need.
>> http://research.cs.wisc.edu/htcondor/manual/current/condor_wait.html
>>
> Def seems better than the condor_q options but still doesn't seem ideal.
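> For reference, a minimal condor_wait call (assuming a shared job log
> file and the cluster id, both placeholders here) would be something like:
> condor_wait -wait 3600 workers.log 1234
> which blocks until every job from cluster 1234 that writes to
> workers.log has finished, or the timeout expires.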
>
>>Won't  NOTIFICATION=complete in your job submission do it?
>>Should email when the cluster is complete - though it may email you when each job completes which you probably don't want ...
>>--Russell Smithies
>>
> I don't want an email, I need programmatic notification, so I don't
> think that will work for me. I don't want to kick off a separate
> process that monitors email.
>
>>I ran into a similar issue recently.  One option is to use DAGMan with a single node representing your job.  DAGMan will monitor the job for you and report completion.
>>Mike
>>
> Ah - this looks like just what I need. I'll have to re-architect my
> code a bit, but it should do the job. Thanks! Wait -
> can you elaborate on "use DAGMan with a single node representing your
> job"? Is that what I described below?
>
> So, with DAGMan it looks like I will need to have my DBQueries job be
> completely separate; it will then create a DAGMan job and submit it
> so that it creates multiple jobs in a cluster (the workers), and
> those must all be done before we can post-process. I guess that would
> work... it connects the processing with the post-processing, but the
> pre-processing (the DB query) is essentially separate (not managed
> by DAGMan).
>
> So, my dag file would look like
> ###
> Job Workers workers.job
> Job PostProcess postprocess.job
> PARENT Workers CHILD PostProcess
> ###
> and the initial (non-DAGMan) QueryDB job would create workers.job,
> postprocess.job, and the DAG file, then submit the DAG job.
>
> Brian
>
> On Tue, Jan 29, 2013 at 5:18 PM, Brian Pipa <brianpipa@xxxxxxxxx> wrote:
>> Short: I'm trying to figure out when all jobs from a job cluster have
>> finished so that I can do some post-processing. I can think of lots of
>> ways for me to code this up, but it seems like there would be some
>> easy way in Condor to do this - does anyone know how?
>>
>> Long: I have a single Java master task (that is also a Condor job,
>> though that's not relevant) that does a large DB query then splits the
>> results into chunks and submits each chunk to Condor as a job via one
>> ClassAd so they all have the same Cluster id. These jobs are all Java
>> worker jobs that call various tools to process the data. I have all of
>> the output for each worker cluster going to a single directory so it's
>> easy to keep them together and know what output is from which run. As
>> I said above, I can think of a bunch of ways I could code up a
>> solution but it seems like Condor might have a way to tell if a
>> Cluster of jobs has finished or not.  Does anyone know if Condor does
>> have a way to do this?
>>
>> UPDATE: while typing this email up I found:
>> condor_q <cluster>
>> which might work. When I submit the one big worker job, I capture the
>> output from condor_submit and I can parse out the id from that "X
>> job(s) submitted to cluster Y".  Then, after I submit the job, I can
>> call
>> condor_q Y
>> periodically until it tells me no more jobs are in the queue,
>> or I could call
>> condor_q Y | grep Y
>> until I get nothing back.
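>> A crude polling loop along those lines (hypothetical; Y stands in
>> for the real cluster id) could be:
>> while [ -n "$(condor_q Y | grep Y)" ]; do sleep 60; done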
>>
>> Does this sound right/make sense? Is there an easier way to do this?
>> My way seems kind of hacky, though I think it should work.
>>
>> Thanks!
>> Brian