
[Condor-users] Dagman exits, restart and hangs....



I'm having a problem with dagman on an all-Windows XP pool. Occasionally, a dagman job exits before completing all of its nodes. It is then restarted and completes the remaining nodes, but then hangs, waiting, I think, for some "phantom" node to complete. There are three problems:

1 - dagman appears to exit for no reason, with no errors in any logs that I can find
2 - after recovering, dagman hangs after all the nodes have been submitted and completed
3 - the delay in dagman recovering is nearly 1 hour

I've attached logs, example dags and subs that will hopefully be enough for someone to understand what is going on. To fill in more of the details:

The main dag is called "master.dag". It consists of anywhere from one to a couple of thousand independent nodes, each of which is itself a dag. A few lines from master.dag look like:

JOB eali01nondemo.0 eali01nondemo.0/testcase.dag.condor.sub
JOB eana01non3rdord.0 eana01non3rdord.0/testcase.dag.condor.sub
JOB eana02nontracea.0 eana02nontracea.0/testcase.dag.condor.sub
JOB eana03nonssrayt.0 eana03nonssrayt.0/testcase.dag.condor.sub
    ...and so on...
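For anyone wanting to reproduce the layout, here is a minimal sketch of how a master.dag of this shape could be generated. The function name and the idea of generating the file from a list of node names are my own assumptions for illustration; only the JOB line format is taken from the lines above.

```python
def master_dag_lines(node_names, sub_file="testcase.dag.condor.sub"):
    """Return one JOB line per node, each pointing at that node's own sub-dag submit file.

    Assumes every node lives in a directory named after the node,
    as in the master.dag excerpt above.
    """
    return [f"JOB {name} {name}/{sub_file}" for name in node_names]


# Two of the node names quoted above, used only as sample input
for line in master_dag_lines(["eali01nondemo.0", "eana01non3rdord.0"]):
    print(line)
```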

The attached example has 160 nodes like this. Each testcase.dag.condor.sub is another dag (see testcase.dag for an example) that has a single node with a PRE and a POST script. These dags, including the node, PRE and POST scripts, only take about a minute to complete. The .sub files for the dags (see testcase.dag.condor.sub) are edited to send their dagman logs to a common log, orats.dagman.log, which is also included.

The output from dagman, master.dag.dagman.out, is included. You will note that the run starts at 08:03:50 and that at 08:10:09 node eenv01nonstackd.0 is submitted, but output then stops until 09:04:22, when a new STARTING UP header appears and dagman attempts to recover. Upon completing recovery, the first node submitted is the same eenv01nonstackd.0, as clusterID 30068.0. Shortly thereafter, the following appears in the log:
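The silent period described above can be located mechanically. This is a rough sketch, assuming each line of master.dag.dagman.out begins with an HH:MM:SS timestamp, optionally preceded by a date; the function name and the ten-minute threshold are arbitrary choices of mine, and runs that cross midnight are not handled.

```python
import re
from datetime import datetime, timedelta

# Assumption: an optional M/D or M/D/Y date, then HH:MM:SS at the start of the line
TIME_RE = re.compile(r"^(?:\d{1,2}/\d{1,2}(?:/\d{2,4})?\s+)?(\d{2}:\d{2}:\d{2})\b")


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (previous_time, next_time, gap) for each silence of at least `threshold`."""
    gaps = []
    prev = None
    for line in lines:
        m = TIME_RE.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        if prev is not None and t - prev >= threshold:
            gaps.append((prev.strftime("%H:%M:%S"), t.strftime("%H:%M:%S"), t - prev))
        prev = t
    return gaps
```

Calling find_gaps(open("master.dag.dagman.out")) should flag the stretch between the last line before the stall and the new STARTING UP header, if the timestamp assumption holds for the file.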

09:10:02 ERROR: node eenv01nonstackd.0: job ID in userlog submit event (30069.0) doesn't match ID reported earlier by submit command (30068.0)! Trusting the userlog for now, but this is scary!

In the common log, orats.dagman.log, you can see that the node eenv01nonstackd.0 is started twice, once with cluster ID 30068.0 and once with 30069.0. Both of these occur after 09:00.
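To check whether any other node was submitted more than once, something like the following could be run over the common log. It is only a sketch: it assumes the usual user log layout, in which a submit event starts with a line such as "000 (cluster.proc.subproc) ...", is followed by an indented "DAG Node: <name>" line, and ends with a line of "...". The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: "000 (30068.000.000) <date> <time> Job submitted from host: ..."
SUBMIT_RE = re.compile(r"^000 \((\d+)\.(\d+)\.\d+\)")
NODE_RE = re.compile(r"^\s+DAG Node:\s*(\S+)")


def submits_per_node(lines):
    """Map each DAG node name to the list of cluster.proc IDs submitted for it."""
    submits = defaultdict(list)
    pending = None  # job ID of the submit event currently being read
    for line in lines:
        m = SUBMIT_RE.match(line)
        if m:
            pending = f"{int(m.group(1))}.{int(m.group(2))}"
            continue
        if line.startswith("..."):
            pending = None  # end of the current event
            continue
        n = NODE_RE.match(line)
        if n and pending is not None:
            submits[n.group(1)].append(pending)
            pending = None
    return dict(submits)


def duplicated_nodes(lines):
    """Return only the nodes that have more than one submit event."""
    return {node: ids for node, ids in submits_per_node(lines).items() if len(ids) > 1}
```

Running duplicated_nodes(open("orats.dagman.log")) should list eenv01nonstackd.0 with both job IDs, plus any other node that was submitted twice.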

If other logs or info would be useful, I can supply them.

Thanks in advance for any help that anyone can offer.
Bob Mortensen


Attachment: dagman.zip
Description: Zip archive