Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] Dagman exits, restart and hangs....

Date: Tue, 30 Jan 2007 18:13:25 -0800
From: Robert Mortensen <bobm@xxxxxxxxxxxxxxxxxxxx>
Subject: [Condor-users] Dagman exits, restart and hangs....

I'm having a problem with dagman on an all-Windows XP pool. Occasionally, a dagman job exits before completing all of its nodes. It is then restarted and completes the remaining nodes, but then hangs, waiting (I think) for some "phantom" node to complete. There are three problems:

1 - dagman appears to exit for no reason, with no errors in any logs that I can find

2 - after recovering, dagman hangs after all the nodes have been submitted and completed

3 - the delay in dagman recovering is nearly 1 hour

I've attached logs, example dags and sub files that will hopefully be enough for someone to understand what is going on. To fill in more of the details:

The main dag is called "master.dag". It consists of anywhere from 1 to a couple of thousand independent nodes that are themselves dags. A couple of lines out of master.dag look like:


JOB eali01nondemo.0 eali01nondemo.0/testcase.dag.condor.sub
JOB eana01non3rdord.0 eana01non3rdord.0/testcase.dag.condor.sub
JOB eana02nontracea.0 eana02nontracea.0/testcase.dag.condor.sub
JOB eana03nonssrayt.0 eana03nonssrayt.0/testcase.dag.condor.sub
    ...and so on...

The attached example has 160 nodes like this. Each testcase.dag.condor.sub is another dag (see testcase.dag for an example) that has a single node with a PRE and a POST script. These dags, including the node, PRE and POST scripts, only take about a minute to complete. The .sub files for the dags (see testcase.dag.condor.sub) are edited to send their dagman logs to a common log, orats.dagman.log, which is also included.

The output from dagman, master.dag.dagman.out, is included. You will note that the run starts at 08:03:50 and that node eenv01nonstackd.0 is submitted at 08:10:09, but output then stops until 09:04:22, when a new STARTING UP header appears and dagman attempts to recover. Upon completing recovery, the first node submitted is the same eenv01nonstackd.0, as cluster ID 30068.0. Shortly thereafter, the following appears in the log:

09:10:02 ERROR: node eenv01nonstackd.0: job ID in userlog submit event (30069.0) doesn't match ID reported earlier by submit command (30068.0)! Trusting the userlog for now, but this is scary!
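
The ID mismatch in this message can be cross-checked against a user log by grouping submit events by DAG node and flagging any node that was submitted under more than one job ID. The following is only a rough sketch: it assumes that each submit event starts with a line of the form `000 (cluster.proc.subproc) ...`, is followed by an indented `DAG Node: <name>` note, and ends with a `...` line. That layout, and the function names `submits_by_node` and `duplicated_nodes`, are assumptions for illustration and are not taken from the attached logs.

```python
import re
from collections import defaultdict

# Assumed (not confirmed by the post) layout of a submit event in the user log:
#   000 (30068.000.000) <date> <time> Job submitted from host: <...>
#       DAG Node: eenv01nonstackd.0
#   ...
SUBMIT_RE = re.compile(r"^000 \((\d+)\.(\d+)\.\d+\)")
NODE_RE = re.compile(r"^\s*DAG Node:\s*(\S+)")


def submits_by_node(lines):
    """Map each DAG node name to the job IDs seen in its submit events."""
    ids = defaultdict(list)
    current_id = None  # job ID of the submit event currently being read
    for line in lines:
        m = SUBMIT_RE.match(line)
        if m:
            # Normalize "30068.000" to "30068.0"
            current_id = f"{int(m.group(1))}.{int(m.group(2))}"
            continue
        if line.startswith("..."):
            # End of the current event
            current_id = None
            continue
        m = NODE_RE.match(line)
        if m and current_id is not None:
            ids[m.group(1)].append(current_id)
            current_id = None
    return dict(ids)


def duplicated_nodes(lines):
    """Return only the nodes that were submitted under more than one job ID."""
    return {node: jobs for node, jobs in submits_by_node(lines).items()
            if len(jobs) > 1}
```

Passing the open log file (for example `duplicated_nodes(open("orats.dagman.log"))`) would list any node that has more than one submit event, if the log really follows the assumed layout.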

In the common log, orats.dagman.log, you can see that the node eenv01nonstackd.0 is started twice, once with cluster ID 30068.0 and once with 30069.0. Both of these are after 9:00.
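
Separately, the silent period in master.dag.dagman.out (output stopping at 08:10:09 and resuming at 09:04:22, a gap of about 54 minutes) can be located mechanically by comparing the timestamps of consecutive lines. The sketch below assumes only that each log line carries an `HH:MM:SS` time near its start, as in the error line quoted above. The helper name `find_gaps` and the 10-minute threshold are hypothetical choices for illustration.

```python
import re
from datetime import timedelta

# First HH:MM:SS found on a line is treated as that line's timestamp (assumption).
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (previous_time, next_time, gap) for each silence longer than threshold."""
    gaps = []
    prev = None       # time of the previous timestamped line, as a timedelta
    prev_text = None  # the same time as the original "HH:MM:SS" text
    for line in lines:
        m = TIME_RE.search(line)
        if not m:
            continue  # skip lines without a timestamp
        hours, minutes, seconds = map(int, m.groups())
        now = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev is not None:
            delta = now - prev
            if delta < timedelta(0):
                # Assume the log rolled past midnight
                delta += timedelta(days=1)
            if delta > threshold:
                gaps.append((prev_text, m.group(0), delta))
        prev, prev_text = now, m.group(0)
    return gaps
```

Run over the lines of master.dag.dagman.out, this would report each pair of adjacent timestamps that are more than the threshold apart.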


If other logs or info would be useful, I can supply them.

Thanks in advance for any help that anyone can offer.
Bob Mortensen

Attachment: dagman.zip
Description: Zip archive

Follow-Ups:
- Re: [Condor-users] Dagman exits, restart and hangs....
  - From: R. Kent Wenger
