Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Determine when all jobs in a cluster have finished?

Date: Wed, 30 Jan 2013 10:10:35 -0600
From: Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Determine when all jobs in a cluster have finished?

On 1/30/2013 8:21 AM, Brian Pipa wrote:

I need to account for failures anyway myself (and record them) so I'll
probably handle all failures and retries myself. Some of the files in
a job will fail (which is expected), so if one fails, I can't have the
whole job fail, and retrying the whole job would then retry all the
files, which won't be terribly helpful. Due to the overhead, I
can't/don't want to make each file a separate job, so grouping them in
bundles makes the most sense.

What we do is have a post-script that returns 0: if a job fails, this
"fools" htcondor into continuing with the batch. Keep the failed job's
.stdout and .stderr files for post-mortem.
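
A minimal sketch of such a post-script, not the script we actually run. It assumes it is attached to the node with a DAG line like "SCRIPT POST Workers always_ok.py $JOB $RETURN" ($JOB and $RETURN are DAGMan's macros for the node name and the job's exit code). The file name always_ok.py, the function name, and the failed_nodes.log path are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: note a failure, then report success.

Assumed DAG line (node and file names are illustrative):
    SCRIPT POST Workers always_ok.py $JOB $RETURN
"""
import sys


def record_and_continue(node, return_code, log_path="failed_nodes.log"):
    """Append failed nodes to log_path; always return 0 so the DAG continues."""
    if return_code != 0:
        with open(log_path, "a") as log:
            log.write(f"{node} exited with {return_code}\n")
    return 0


# Only act when invoked as "always_ok.py <node> <return code>".
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(record_and_continue(sys.argv[1], int(sys.argv[2])))
```

Because the script's own exit status is always 0, DAGMan treats the node as successful and moves on; the log is only there so the failures can be found afterwards.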

I'd really like the whole thing to be self-contained in one DAG like:
###
Job QueryDB querydb.job
Job Workers workers.job
Job PostProcess postprocess.job
PARENT QueryDB CHILD Workers
PARENT Workers CHILD PostProcess
###

since that seems much simpler and self-contained but I don't think
that's doable since the results of the QueryDB job determine the data
and number of worker jobs I'll need. For example, one run of QueryDB
could get 2 million results and I would create 2000 data files
containing 1000 entries each and those would be consumed by 2000
worker jobs. Another run might create only 1 data file and 1 worker. I
can't think of a way to get this all working within one DAG file.
Right now, I pass in to each worker an argument of the datafile to
process.

(We've similar job structure.) We create one job per entry because that
makes more sense in our overall scheme of things -- that comes with time
penalty but we can live with that. I haven't made the "prepare db" and
"prepare worker inputs" dag nodes, I run them in a script outside of
htcondor; the script submits the "workers" dag in the end. Mostly
because those steps aren't really parallelizable, there isn't much to
gain by turning them into dag nodes.
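
A rough sketch of what such an outside script could look like, assuming the query results are already in a Python list of strings. It writes one data file per bundle and a "workers" DAG with one node per data file, plus a final PostProcess node that waits for all of them. The function names, the data_NNNN.txt naming, and the "datafile" macro are hypothetical; workers.job would need something like "arguments = $(datafile)" to pick the macro up, and both submit files are assumed to sit in the directory the DAG is submitted from.

```python
import subprocess
from pathlib import Path


def write_workers_dag(entries, out_dir, chunk_size=1000):
    """Split entries into data files and write a DAG with one worker node per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines, workers = [], []
    for start in range(0, len(entries), chunk_size):
        index = start // chunk_size
        node = f"Worker{index}"
        datafile = out_dir / f"data_{index:04d}.txt"
        datafile.write_text("\n".join(entries[start:start + chunk_size]) + "\n")
        lines.append(f"JOB {node} workers.job")
        # VARS defines a macro that the submit file can reference as $(datafile).
        lines.append(f'VARS {node} datafile="{datafile.name}"')
        workers.append(node)
    lines.append("JOB PostProcess postprocess.job")
    if workers:
        # PostProcess runs only after every worker node has finished.
        lines.append(f"PARENT {' '.join(workers)} CHILD PostProcess")
    dag_path = out_dir / "workers.dag"
    dag_path.write_text("\n".join(lines) + "\n")
    return dag_path


def submit(dag_path):
    """Hand the generated DAG to HTCondor (requires condor_submit_dag on PATH)."""
    subprocess.run(["condor_submit_dag", str(dag_path)], check=True)
```

With chunk_size=1000, 2 million results give 2000 worker nodes and a single result gives one; chunk_size=1 gives the one-job-per-entry layout instead.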


Dimitri
