
Re: [Condor-users] Dagman exits, restart and hangs....

Thanks, I have more info which I will forward directly to you...

On Jan 31, 2007, at 8:40 AM, R. Kent Wenger wrote:

On Tue, 30 Jan 2007, Robert Mortensen wrote:

I'm having a problem with dagman on an all Windows XP pool. Basically
what happens, occasionally, is that a dagman job exits before
completing all nodes. It is then restarted and completes the
remaining nodes, but then hangs, waiting, I think, for some "phantom"
node to complete. There are three problems:

1 - dagman appears to exit for no reason, with no errors in any logs
that I can find
2 - after recovering, dagman hangs after all the nodes have been
submitted and completed
3 - the delay in dagman recovering is nearly 1 hour

We're looking into this.

One thing that might help would be to also have the master.dag.dagman.log
and master.dag.lib.out files if you still have them.

Also, it would help if you increased the verbosity of the DAGMan output,
and sent the resulting dagman.out file when/if this happens again.

There are two separate verbosity controls (that control different output).
Please do the following:

- Add the setting '-debug 5' on your condor_submit_dag command line.

- Set the configuration macro DAGMAN_DEBUG to D_FULLDEBUG.  You can do
  this in either of two ways:
    - Put 'DAGMAN_DEBUG = D_FULLDEBUG' into an appropriate configuration
      file.
    - Set _CONDOR_DAGMAN_DEBUG to D_FULLDEBUG in your environment before
      running condor_submit_dag.
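
As a rough sketch of the steps above, assuming a Windows command prompt (the pool here is all Windows XP) and a DAG file named master.dag (a name inferred from the log file names mentioned earlier; substitute your own), the environment-variable route might look like this:

```
rem Raise the DAGMan debug level for this command prompt session only
set _CONDOR_DAGMAN_DEBUG=D_FULLDEBUG

rem Submit the DAG with more verbose dagman.out output
condor_submit_dag -debug 5 master.dag
```

Setting the variable this way affects only the current session, so it is easy to undo once the problem has been captured.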

- You can address number 3 by setting the DAGMAN_NOT_RESPONDING_TIMEOUT
  configuration macro to a value shorter than the default (which is 3600
  seconds).
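
For illustration only, a configuration file entry along these lines would shorten the timeout. The value of 600 seconds is an arbitrary assumption, not a recommendation:

```
# Hypothetical value, shorter than the default: 10 minutes, in seconds
DAGMAN_NOT_RESPONDING_TIMEOUT = 600
```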

Kent Wenger
Condor Team