
Re: [Condor-users] DAGMan Job Problem

Hi Kent,
My submit file creates all of its jobs under the same cluster ID, something like this

I am also using $CondorVersion: 6.8.4 Feb  1 2007


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
Sent: Tuesday, August 28, 2007 4:52 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] DAGMan Job Problem

On Tue, 28 Aug 2007, Natarajan, Senthil wrote:

> I am trying to submit a DAGMan job on Linux.
> I have sixteen batches of jobs. Each batch in turn has 41 jobs.
> My requirement is that batch2 jobs shouldn't start until all batch1 jobs
> are done; similarly, batch3 jobs shouldn't start until all batch2 jobs are
> done.
> I created a DAGMan job like the one below. The problem is that the DAGMan
> job fails randomly on batch3 or batch4, etc., and the reason is that some
> of the batch3 jobs need input that is the output of some of the batch2 jobs.
> Condor then complains that the file is not found.

If I'm understanding your setup correctly, the submit file for batch1,
for example, ends up submitting 41 Condor jobs.  If that is correct,
that's probably what's causing your problem.

If your submit files are creating more than one cluster of jobs, this
will definitely break DAGMan.  Even if your submit file creates a single
cluster with multiple jobs, this will break things unless your DAGMan
is 6.7.17 or newer.
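
One way to sidestep the multiple-jobs-per-node issue is to give every job its own DAG node, with each submit file queuing exactly one job. The following is only a rough sketch of generating such a DAG file, not a tested recipe for this particular setup. The node names (b1_j1, ...) and the per-job submit file names (b1_j1.sub, ...) are hypothetical placeholders; only the JOB and PARENT ... CHILD lines are DAGMan syntax.

```python
def build_dag(num_batches, jobs_per_batch):
    """Return DAG file text with one node per job and batch-to-batch ordering.

    Node names and submit file names are hypothetical; each submit file
    is assumed to queue exactly one job.
    """
    lines = []
    batches = []
    for b in range(1, num_batches + 1):
        batch = [f"b{b}_j{j}" for j in range(1, jobs_per_batch + 1)]
        for name in batch:
            # One DAG node per Condor job.
            lines.append(f"JOB {name} {name}.sub")
        batches.append(batch)
    # Every node in a batch waits for every node in the previous batch.
    for parents, children in zip(batches, batches[1:]):
        lines.append(f"PARENT {' '.join(parents)} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sixteen batches of 41 jobs, as described in the question.
    print(build_dag(16, 41), end="")
```

Redirecting the output to a file gives a DAG that could be passed to condor_submit_dag. Because each PARENT line lists a whole batch, no job in batch N+1 becomes eligible until every job in batch N has finished.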

If you send your entire dagman.out file, I can tell for sure if this
is the problem.

Kent Wenger
Condor Team