Re: [Condor-users] Strange DAGMan behaviour



On Tue, 18 Oct 2005, Mark Fox wrote:

> I ran into some strange DAGMan behaviour in a software system I
> maintain.  The system submits a DAG that may recursively submit
> another DAG, and so on.  The problem is that the execution of one of
> the DAGs eventually fails without ever having run.  The first DAG
> always succeeds, but the following DAGs seem to have about a 50-50
> chance of success.  Sometimes it will iterate several times, but most
> of the time, it fails on the first or second iteration.  In the DAGMan
> log for the last DAG, I get this:

Several questions:

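- When you say that you are doing recursion, are you re-submitting the
  same DAG file or a different DAG file?  If you're re-submitting the same
  DAG file, it's not surprising that you're running into problems, since
  the nested runs would share the files that DAGMan names after the DAG
  file (the dagman.out file, for example).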

- Do you get a dagman.out file for the DAG that fails?  It would help
  a lot if we could see that.

- If you take one of the DAGs that fails as a subdag, and just run it on
  its own, does it still sometimes fail?
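
If it turns out that the same DAG file is being re-submitted, one way to
rule that out as the cause is to give each iteration its own copy of the
file.  What follows is only a minimal sketch in Python, not something
taken from your system: the function name copy_dag_for_iteration and the
".iterN" naming scheme are hypothetical, and it assumes that the files
DAGMan writes for a run are named after the DAG file, so that distinct
names keep the runs apart.

```python
import shutil
from pathlib import Path


def copy_dag_for_iteration(template, iteration, workdir=None):
    """Copy a DAG file to a name that is unique to this iteration.

    Hypothetical helper: "work.dag" becomes "work.iter3.dag" for
    iteration 3.  The copy is placed next to the template by default,
    so relative paths inside the DAG file keep working.
    """
    template = Path(template)
    target_dir = Path(workdir) if workdir is not None else template.parent
    target = target_dir / f"{template.stem}.iter{iteration}{template.suffix}"
    if target.exists():
        # Refuse to reuse a name: reuse is the situation being avoided.
        raise FileExistsError(f"{target} already exists")
    shutil.copyfile(template, target)
    return target
```

Each level of the recursion would then submit the copy returned for its
own iteration number rather than the original file.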

> 005 (518.000.000) 10/18 15:42:55 Job terminated.
>         (0) Abnormal termination (signal 9)
>         (0) No core file
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         0  -  Run Bytes Sent By Job
>         0  -  Run Bytes Received By Job
>         0  -  Total Bytes Sent By Job
>         0  -  Total Bytes Received By Job

You're saying that job 518.0 is one of the condor_dagman jobs, right?
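
To see which jobs in a log were killed by a signal, the event format
quoted above can be scanned mechanically.  This is a rough sketch in
Python that relies only on the two line shapes visible in the excerpt
(an event header such as "005 (518.000.000) ..." followed by an
"Abnormal termination (signal N)" line); the function name is
hypothetical, and other event layouts are not handled.

```python
import re

# Event header as it appears in the excerpt: "005 (518.000.000) ..."
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
# Detail line as it appears in the excerpt.
SIGNAL = re.compile(r"Abnormal termination \(signal (\d+)\)")


def find_signal_terminations(lines):
    """Return (job_id, signal) pairs for jobs that ended on a signal.

    Accepts raw log lines or lines quoted in mail with a leading ">".
    """
    results = []
    current_job = None
    for raw in lines:
        # Drop an optional mail quote marker, then surrounding whitespace.
        line = re.sub(r"^\s*>\s?", "", raw).strip()
        header = HEADER.match(line)
        if header:
            cluster, proc = int(header.group(2)), int(header.group(3))
            current_job = f"{cluster}.{proc}"
            continue
        hit = SIGNAL.search(line)
        if hit and current_job is not None:
            results.append((current_job, int(hit.group(1))))
    return results
```

The job IDs it reports can then be compared against the cluster IDs of
the condor_dagman jobs themselves.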

Kent Wenger
Condor Team