
Re: [Condor-users] Dagman exits, restart and hangs....



On Tue, 30 Jan 2007, Robert Mortensen wrote:

> I'm having a problem with dagman on an all Windows XP pool. Basically
> what happens, occasionally, is that a dagman job exits before
> completing all nodes. It is then restarted and it completes the
> remaining nodes, but then hangs waiting, I think, for some "phantom"
> node to complete. There are three problems:
>
> 1 - dagman appears to exit for no reason, with no errors in any logs
> that I can find
> 2 - after recovering, dagman hangs after all the nodes have been
> submitted and completed
> 3 - the delay in dagman recovering is nearly 1 hour
> ...

We're looking into this.

One thing that might help would be to also have the master.dag.dagman.log
and master.dag.lib.out files if you still have them.

Also, it would help if you increased the verbosity of the DAGMan output,
and sent the resulting dagman.out file when/if this happens again.

There are two separate verbosity controls (that control different output).
Please do the following:

- Add the setting '-debug 5' on your condor_submit_dag command line.

- Set the configuration macro DAGMAN_DEBUG to D_FULLDEBUG.  You can do
  this in a couple of ways:
    - Put 'DAGMAN_DEBUG = D_FULLDEBUG' into an appropriate configuration
      file.
    - Set _CONDOR_DAGMAN_DEBUG to D_FULLDEBUG in your environment before
      running condor_submit_dag.

- Separately from the logging changes, you can address problem 3 (the
  nearly one-hour recovery delay) by setting the
  DAGMAN_NOT_RESPONDING_TIMEOUT configuration macro to a value shorter
  than the default (which is 3600 seconds).
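
As a rough sketch of the two verbosity settings above, the following
Python helper builds the condor_submit_dag command line with '-debug 5'
and an environment that carries _CONDOR_DAGMAN_DEBUG. The helper name
build_submit and the DAG file name master.dag (guessed from the
master.dag.dagman.log name) are assumptions for illustration only; it
does not submit anything by itself.

```python
import os


def build_submit(dag_file="master.dag", debug_level=5, base_env=None):
    """Return (argv, env) for a verbose condor_submit_dag run.

    dag_file is a hypothetical name; substitute your own DAG file.
    """
    # Start from the caller's environment unless one is supplied.
    env = dict(os.environ if base_env is None else base_env)
    # Environment override for the DAGMAN_DEBUG configuration macro.
    env["_CONDOR_DAGMAN_DEBUG"] = "D_FULLDEBUG"
    # '-debug N' is the command-line verbosity control.
    argv = ["condor_submit_dag", "-debug", str(debug_level), dag_file]
    return argv, env


if __name__ == "__main__":
    argv, env = build_submit()
    print(" ".join(argv))
```

The returned pair can be handed to subprocess.run(argv, env=env) on a
machine where condor_submit_dag is on the PATH.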

Kent Wenger
Condor Team