
Re: [HTCondor-users] Determine when all jobs in a cluster have finished?

On 1/30/2013 8:21 AM, Brian Pipa wrote:

I need to account for failures anyway myself (and record them) so I'll
probably handle all failures and retries myself. Some of the files in
a job will fail (which is expected), so if one fails, I can't have the
whole job fail, and retrying the whole job would then retry all the
files, which won't be terribly helpful. Due to the overhead, I
can't/don't want to make each file a separate job, so grouping them in
bundles makes the most sense.

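What we do is attach a POST script that always returns 0. When a node has a POST script, DAGMan judges the node by the script's exit code rather than the job's, so a failed job no longer stops the batch: this "fools" HTCondor into continuing. Keep the failed job's .stdout and .stderr files for post-mortem.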
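A minimal sketch of such a POST script, in Python. The script name (always_ok.py), the log file name (failed_nodes.log) and the record_result helper are made up for illustration; it assumes the DAG file passes DAGMan's $JOB and $RETURN macros to it, e.g. "SCRIPT POST Worker0 always_ok.py $JOB $RETURN".

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: note a node's failure, then report success."""
import sys
from pathlib import Path

# Hypothetical log file; failed nodes are appended here for later inspection.
FAILED_LOG = Path("failed_nodes.log")


def record_result(node_name, return_code, log_path=FAILED_LOG):
    """Append the node to log_path if its job failed; always return 0."""
    if return_code != 0:
        with Path(log_path).open("a") as log:
            log.write(f"{node_name}\t{return_code}\n")
    return 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed invocation: always_ok.py $JOB $RETURN
    record_result(sys.argv[1], int(sys.argv[2]))
    # Falling off the end exits with status 0, so DAGMan marks the node
    # as successful and carries on with the rest of the DAG.
```

The failures still end up in one place (the log file), so whatever handles retries can read it after the DAG completes.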

I'd really like the whole thing to be self-contained in one DAG like:
Job QueryDB querydb.job
Job Workers workers.job
Job PostProcess postprocess.job
PARENT Workers CHILD PostProcess

since that seems much simpler and self-contained but I don't think
that's doable since the results of the QueryDB job determine the data
and number of worker jobs I'll need. For example, one run of QueryDB
could get 2 million results and I would create 2000 data files
containing 1000 entries each and those would be consumed by 2000
worker jobs. Another run might create only 1 data file and 1 worker. I
can't think of a way to get this all working within one DAG file.
Right now, I pass in to each worker an argument of the datafile to

(We have a similar job structure.) We create one job per entry because that makes more sense in our overall scheme of things; it comes with a time penalty, but we can live with that. I haven't made "prepare db" and "prepare worker inputs" into DAG nodes. Instead I run them in a script outside of HTCondor, and that script submits the "workers" DAG at the end. Those steps aren't really parallelizable, so there isn't much to gain by turning them into DAG nodes.
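For illustration, the last step of such an outside script could look roughly like the sketch below: once the data files exist, it writes a DAG with one worker node per data file plus a final PostProcess node, and optionally hands it to condor_submit_dag. This is not our actual script. The function names, the WorkerN node names, the "datafile" variable and workers.dag are assumptions; workers.job and postprocess.job are the submit files named earlier in the thread, and workers.job is assumed to use $(datafile) in its arguments.

```python
import subprocess
from pathlib import Path


def build_workers_dag(data_files, worker_submit="workers.job",
                      post_submit="postprocess.job"):
    """Return DAG text: one worker node per data file, then PostProcess."""
    lines = []
    names = []
    for index, data_file in enumerate(data_files):
        name = f"Worker{index}"  # hypothetical node naming scheme
        names.append(name)
        lines.append(f"JOB {name} {worker_submit}")
        # The submit file can refer to this value as $(datafile).
        lines.append(f'VARS {name} datafile="{data_file}"')
    lines.append(f"JOB PostProcess {post_submit}")
    if names:
        # PostProcess starts only after every worker node is done.
        lines.append(f"PARENT {' '.join(names)} CHILD PostProcess")
    return "\n".join(lines) + "\n"


def write_and_submit(data_files, dag_path="workers.dag", submit=False):
    """Write the DAG file and, if requested, submit it to HTCondor."""
    Path(dag_path).write_text(build_workers_dag(data_files))
    if submit:
        subprocess.run(["condor_submit_dag", str(dag_path)], check=True)
    return dag_path
```

Because the DAG is generated after the query has run, the number of worker nodes can be 1 or 2000 without changing anything else.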