
Re: [Condor-users] Strange DAGMan behaviour



Kent,

Thanks for helping me dig into this.

On 10/19/05, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> - When you say that you are doing recursion, are you re-submitting the
>   same DAG file or a different DAG file?  If you're re-submitting the same
>   DAG file, it's not surprising that you're running into problems.

Nope. It's a different DAG each time. I've confirmed that the only
files in common between DAGs are the scripts (i.e. "Executable =
script.pl" in the job files).

> - Do you get a dagman.out file for the DAG that fails?  It would help
>   a lot if we could see that.

No problem.  I've attached two. The first (test.dag.1.dagman.out)
worked and the second (test.dag.2.dagman.out) failed.

> - If you take one of the DAGs that fails as a subdag, and just run it on
>   its own, does it still sometimes fail?

Nope. It works all the time.  Of course, one of *its* subdags will usually fail.

> > 005 (518.000.000) 10/18 15:42:55 Job terminated.
> >         (0) Abnormal termination (signal 9)
> >         (0) No core file
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
> >         0  -  Run Bytes Sent By Job
> >         0  -  Run Bytes Received By Job
> >         0  -  Total Bytes Sent By Job
> >         0  -  Total Bytes Received By Job
>
> You're saying that job 518.0 is one of the condor_dagman jobs, right?

Yes.  That came from a dagman.log file, and it seems like only
condor_dagman jobs get logged there.  Here's a full dagman.log from a
DAG that failed (it corresponds with test.dag.2.dagman.out above):

000 (562.000.000) 10/19 13:25:55 Job submitted from host: <136.159.220.105:48532>
...
001 (562.000.000) 10/19 13:25:55 Job executing on host: <136.159.220.105:48532>
...
005 (562.000.000) 10/19 13:25:55 Job terminated.
        (0) Abnormal termination (signal 9)
        (0) No core file
         Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
         Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
         Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
         Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
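
To check many dagman.log files for the same pattern, something like the following could help. It is only a minimal sketch in Python: it assumes the event layout seen in the excerpts above (a header line starting with a three-digit event code and a job ID in parentheses, indented detail lines, and "..." closing each event). The function name find_abnormal_terminations is made up for illustration and is not part of Condor.

```python
import re

# Event header as seen in the log above, e.g.
# "005 (562.000.000) 10/19 13:25:55 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")
# Detail line reported when a job is killed by a signal.
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def find_abnormal_terminations(log_text):
    """Return (job_id, timestamp, signal) for each 005 event killed by a signal.

    Hypothetical helper: the parsing rules are inferred from the pasted
    log excerpt, not taken from any Condor specification.
    """
    results = []
    current = None  # (job_id, timestamp) of the 005 event being read, if any
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            code = header.group(1)
            if code == "005":
                job_id = "%d.%d" % (int(header.group(2)), int(header.group(3)))
                timestamp = header.group(5) + " " + header.group(6)
                current = (job_id, timestamp)
            else:
                current = None
            continue
        if line.strip() == "...":
            # "..." closes the current event
            current = None
            continue
        if current:
            match = SIGNAL_RE.search(line)
            if match:
                results.append((current[0], current[1], int(match.group(1))))
    return results
```

Run against the log above, this should report job 562.0 terminating on signal 9 at 10/19 13:25:55.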


Mark

Attachment: test.dag.1.dagman.out
Description: Binary data

Attachment: test.dag.2.dagman.out
Description: Binary data