
Re: [HTCondor-users] Determine when all jobs in a cluster have finished?



Three answers so far, thanks! Let me address each one:
>I believe you are looking for "condor_wait".  The following page has all the info you need.
> http://research.cs.wisc.edu/htcondor/manual/current/condor_wait.html
>
This definitely seems better than the condor_q options, but it still doesn't seem ideal.
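Here is a minimal sketch of calling condor_wait from a program. It assumes the worker submit file sets `log = ...`, so that all of the cluster's events land in one user log. condor_wait reads that log rather than querying the queue. Passing a bare cluster number (no `.proc`) asks it to wait for every job in that cluster, and `-wait` limits how long it blocks. The helper names `condor_wait_command` and `wait_for_cluster` are hypothetical. Treating exit status 0 as "the jobs completed or were removed" is my reading of the condor_wait page linked above.

```python
import subprocess
from typing import List, Optional


def condor_wait_command(log_path: str, cluster: int,
                        timeout_s: Optional[int] = None) -> List[str]:
    """Build a condor_wait command line for every job in `cluster`.

    Assumes the submit file used `log = <log_path>`, because condor_wait
    watches that user log instead of the schedd queue.
    """
    cmd = ["condor_wait"]
    if timeout_s is not None:
        # Give up after this many seconds instead of blocking forever.
        cmd += ["-wait", str(timeout_s)]
    # A bare cluster number (no ".proc") means all jobs in that cluster.
    cmd += [log_path, str(cluster)]
    return cmd


def wait_for_cluster(log_path: str, cluster: int,
                     timeout_s: Optional[int] = None) -> bool:
    """Block in condor_wait; True if it reports all jobs finished (exit 0)."""
    result = subprocess.run(condor_wait_command(log_path, cluster, timeout_s))
    return result.returncode == 0
```

The main advantage over polling is that nothing has to parse condor_q output. Each call just blocks until the log shows that the cluster is finished or the timeout expires.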

>Won't  NOTIFICATION=complete in your job submission do it?
>Should email when the cluster is complete - though it may email you when each job completes which you probably don't want ...
>--Russell Smithies
>
I don't want an email; I need programmatic notification, so I don't
think that will work for me. I don't want to kick off a separate
process just to monitor email.

>I ran into a similar issue recently.  One option is to use DAGMan with a single node representing your job.  DAGMan will monitor the job for you and report completion.
>Mike
>
Ah, this looks like just what I need. I'll have to re-architect my
code a bit, but it should work. Thanks! Wait, can you elaborate on
"use DAGMan with a single node representing your job"? Is that what I
describe below?

So, with DAGMan it looks like my DB query job (QueryDB) would need to be
completely separate. It would create a DAG and submit it; the DAG's
first node submits the workers as multiple jobs in a single cluster,
and those must all finish before the post-processing node runs. I guess
that would work: DAGMan ties the processing to the post-processing, but
the pre-processing (the DB query) is essentially separate and not
managed by DAGMan.

So, my dag file would look like
###
Job Workers workers.job
Job PostProcess postprocess.job
PARENT Workers CHILD PostProcess
###
and the initial, non-DAGMan QueryDB job would write workers.job,
postprocess.job, and the DAG file, then submit the DAG.
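Here is a minimal sketch of the QueryDB side in Python. It assumes workers.job and postprocess.job have already been written into one run directory. The DAG file name `run.dag` and both helper names are hypothetical. As I understand DAGMan, the Workers node counts as done only when every job its submit file queued into that one cluster has finished, which is exactly the "whole cluster is done" signal I'm after. Keywords are written in uppercase here, as in the HTCondor manual's examples.

```python
import subprocess
from pathlib import Path


def dag_text(workers_submit: str, post_submit: str) -> str:
    """Two-node DAG: PostProcess starts only after every job queued by
    the Workers node's submit file has finished."""
    return (
        f"JOB Workers {workers_submit}\n"
        f"JOB PostProcess {post_submit}\n"
        "PARENT Workers CHILD PostProcess\n"
    )


def write_and_submit_dag(run_dir: Path) -> None:
    """Hypothetical helper: write run.dag next to the two submit files
    and hand it to condor_submit_dag from that directory, so the relative
    submit-file names resolve there."""
    dag_path = run_dir / "run.dag"
    dag_path.write_text(dag_text("workers.job", "postprocess.job"))
    # condor_submit_dag queues DAGMan itself as a job and returns;
    # DAGMan then submits the nodes in dependency order.
    subprocess.run(["condor_submit_dag", dag_path.name], cwd=run_dir, check=True)
```

With this approach, QueryDB no longer has to watch anything after submitting. The post-processing step simply becomes the PostProcess node.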

Brian

On Tue, Jan 29, 2013 at 5:18 PM, Brian Pipa <brianpipa@xxxxxxxxx> wrote:
> Short: I'm trying to figure out when all jobs from a job cluster have
> finished so that I can do some post-processing. I can think of lots of
> ways for me to code this up, but it seems like there would be some
> easy way in Condor to do this - does anyone know how?
>
> Long: I have a single Java master task (that is also a Condor job,
> though that's not relevant) that does a large DB query then splits the
> results into chunks and submits each chunk to Condor as a job via one
> submit description file so they all have the same Cluster id. These jobs are all Java
> worker jobs that call various tools to process the data. I have all of
> the output for each worker cluster going to a single directory so it's
> easy to keep them together and know what output is from which run. As
> I said above, I can think of a bunch of ways I could code up a
> solution but it seems like Condor might have a way to tell if a
> Cluster of jobs has finished or not.  Does anyone know if Condor does
> have a way to do this?
>
> UPDATE: while typing this email up I found:
> condor_q <cluster>
> which might work. When I submit the one big worker job, I capture the
> output from condor_submit and can parse the cluster id out of its "X
> job(s) submitted to cluster Y." line. Then, after I submit the job, I can
> call
> condor_q Y
> periodically until it tells me no more jobs are in the queue.
> or I could call
> condor_q Y | grep Y
> until I get nothing back.
>
> Does this sound right and make sense? Is there an easier way to do this?
> My way seems kind of hacky, though I think it should work.
>
> Thanks!
> Brian
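Here is a minimal sketch of the polling approach from the update above. It is written in Python rather than Java for brevity. It assumes condor_submit prints a summary line such as "10 job(s) submitted to cluster 123." It also assumes that `condor_q <cluster> -format "%d\n" ProcId` prints one line per queued job with no header or totals line, which avoids the fragile `grep`. The helper names and the 60-second interval are hypothetical choices. Held jobs stay in the queue, so this loop will keep waiting on them.

```python
import re
import subprocess
import time

# condor_submit's summary line, e.g. "10 job(s) submitted to cluster 123."
_SUBMIT_RE = re.compile(r"(\d+) job\(s\) submitted to cluster (\d+)\.")


def parse_cluster_id(submit_output: str) -> int:
    """Extract the cluster id Y from condor_submit's output."""
    m = _SUBMIT_RE.search(submit_output)
    if m is None:
        raise ValueError("no 'submitted to cluster' line in condor_submit output")
    return int(m.group(2))


def jobs_in_queue(cluster: int) -> int:
    """Count jobs of `cluster` still in the queue (any state, held included)."""
    out = subprocess.run(
        ["condor_q", str(cluster), "-format", "%d\n", "ProcId"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())


def wait_by_polling(cluster: int, interval_s: float = 60.0) -> None:
    """Sleep between condor_q calls until no jobs from `cluster` remain."""
    while jobs_in_queue(cluster) > 0:
        time.sleep(interval_s)
```

Compared with condor_wait or DAGMan, this is more code to maintain and relies on the exact text of condor_submit and condor_q output. That fragility is what makes it feel hacky.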