Re: [Condor-users] DAGMan Job Problem
- Date: Tue, 28 Aug 2007 18:24:47 -0400
- From: "Natarajan, Senthil" <senthil@xxxxxxxx>
- Subject: Re: [Condor-users] DAGMan Job Problem
My submit file creates all of its jobs under the same cluster ID, something like this:
Also, I am using $CondorVersion: 6.8.4 Feb 1 2007
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
Sent: Tuesday, August 28, 2007 4:52 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] DAGMan Job Problem
On Tue, 28 Aug 2007, Natarajan, Senthil wrote:
> I am trying to submit a DAGMan job on Linux.
> I have sixteen batches of jobs. Each batch in turn has 41 jobs.
> My requirement is that batch2 jobs shouldn't start until all batch1 jobs
> are done; similarly, batch3 jobs shouldn't start until all batch2 jobs are
> done.
> I created a DAGMan job like the one below. The problem is that the DAGMan job
> fails randomly on batch3 or batch4, etc., and the reason is that some of the
> batch3 jobs need input that is the output of some of the batch2 jobs.
> Condor then complains that the file is not found.
If I'm understanding your setup correctly, the submit file for batch1,
for example, ends up submitting 41 Condor jobs. If that is correct,
that's probably what's causing your problem.
If your submit files are creating more than one cluster of jobs, this
will definitely break DAGMan. Even if your submit file creates a single
cluster with multiple jobs, this will break things unless your DAGMan
is 6.7.17 or newer.
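
To make the distinction concrete, here is a minimal sketch (in Python, only to generate the text) of one layout that keeps each batch in a single cluster: one DAG node per batch, one submit file per node, and exactly one "queue 41" statement per submit file. The file names (batchN.sub, batchN.log), the executable name run_batch.sh, and its arguments are all made up for illustration; they are not taken from the original submit files, which were not posted. Treat it as an assumption-laden outline, not a drop-in fix.

```python
# Hypothetical names throughout: batchN.sub, batchN.log, run_batch.sh.
N_BATCHES = 16        # sixteen batches, as described in the original post
JOBS_PER_BATCH = 41   # 41 jobs per batch, as described in the original post


def submit_text(batch: int) -> str:
    """Submit description for one batch.

    A single 'queue 41' statement is used so that the whole batch is
    submitted as one cluster (procs 0..40) rather than several clusters.
    """
    return "\n".join([
        "universe   = vanilla",
        "executable = run_batch.sh",            # assumed executable name
        f"arguments  = {batch} $(Process)",     # assumed argument layout
        f"output     = batch{batch}.$(Process).out",
        f"error      = batch{batch}.$(Process).err",
        f"log        = batch{batch}.log",
        f"queue {JOBS_PER_BATCH}",
    ]) + "\n"


def dag_text(n_batches: int = N_BATCHES) -> str:
    """DAG file: one node per batch, chained so batch N+1 waits for batch N."""
    lines = [f"JOB batch{i} batch{i}.sub" for i in range(1, n_batches + 1)]
    lines += [f"PARENT batch{i} CHILD batch{i + 1}"
              for i in range(1, n_batches)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(dag_text())
    print(submit_text(1))
```

The generated DAG text would be saved to a file and handed to condor_submit_dag, with each batchN.sub saved alongside it. Whether a node with 41 procs in one cluster is handled correctly still depends on the DAGMan version, as noted above.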
If you send your entire dagman.out file, I can tell for sure if this
is the problem.