
Re: [Condor-users] Recursive DAGman



On Thu, Apr 19, 2012 at 03:11:22PM +0000, Ian Cottam wrote:
> Without going into too much detail at this stage, we have a user who is
> trying to get a recursive DAGman to work.
> Can anyone point me at examples or advice on this?
> 
> At the end of the DAG script he does a convergence test and if necessary
> re-submits the DAG with updated files.
> At first this was failing because DAGman thought it was the same job and
> the lock file stopped it running.
> He 'fixed' this by renaming the recursive DAGman script.

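It sounds like he does this convergence test in a POST script. He
should do it in a PRE script instead. If you are using a DAGman version
later than 7.7.2, you can use the PRE_SKIP keyword in the DAG file to
achieve this: when the PRE script exits with the value given to
PRE_SKIP, the node's job and POST script are skipped and the node is
treated as successful.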
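
As a rough illustration only, here is a minimal sketch of what such a
PRE script might look like. Everything specific in it is an assumption
of mine, not something from your user's setup: the node name ITER, the
file names iterate.sub, check_converged.py and residual.txt, the
tolerance, and the choice of 2 as the skip code. The DAG lines in the
docstring show how I would expect the script to be wired up.

```python
#!/usr/bin/env python3
"""Hypothetical PRE script: tell DAGMan to skip a node once converged.

Assumed DAG file wiring (all names are illustrative):

    JOB ITER iterate.sub
    SCRIPT PRE ITER check_converged.py residual.txt
    PRE_SKIP ITER 2

With PRE_SKIP set to 2, an exit code of 2 from this script makes DAGMan
skip the node's job and POST script and mark the node successful.
An exit code of 0 lets the node run as usual.
"""
import sys

SKIP_CODE = 2      # assumption: must match the PRE_SKIP value in the DAG file
TOLERANCE = 1e-6   # assumption: whatever threshold the real test uses


def exit_code_for(residual, tolerance=TOLERANCE):
    """Return SKIP_CODE if the residual is within tolerance, else 0."""
    return SKIP_CODE if abs(residual) < tolerance else 0


def main(argv):
    # argv[1] is the path to a file holding the latest residual as text
    # (a hypothetical convention for this sketch).
    with open(argv[1]) as handle:
        residual = float(handle.read().strip())
    return exit_code_for(residual)


# Only act as a script when a residual file is actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The skip code is deliberately not 1, so that an ordinary crash of the
script (which typically exits with 1) makes the node fail rather than
being mistaken for convergence.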

> 
> This is the comment I got from him
> "So we can see, it has some issue that the parent process (I'm not
> actually sure whether this is the parent dagman process or the parent
> script) exits, causing the newly launched dagman process to get signal 3
> and thus enter recovery mode. It does this infinitely, never escaping from
> this loop until I removed the dagman process from the queue using
> condor_rm."
> 
> I can supply more details if anyone can help; I've also asked him to
> create a bare bones example of the problem
> (the real one is quite hairy/messy).

That would be quite helpful.

> 
> Thanks all
> -Ian