
Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed



On Tue, 14 Jun 2011, Ian Chesal wrote:

...
You could try running condor_rm against the condor_dagman job -- this should trigger it to write out a rescue dag if it thinks there's still work to be done. That might shed some light on what part of the DAG the manager thinks hasn't completed.

Actually, I'd recommend doing condor_hold followed by condor_release on the DAGMan job, rather than condor_rm. If you do condor_rm and a rescue DAG is written out, the rescue DAG will not reflect the fact that a bunch of the node jobs completed (the rescue DAG will only reflect whatever DAGMan has "seen" at the time the rescue DAG is created).

If the section of the dagman.out file in the original posting includes the end of the file, then something weird is going on with DAGMan trying to read the node job log files. Doing condor_hold and then condor_release will actually start a new DAGMan process, and hopefully the new one will be able to read the log files correctly.
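
As a rough sketch of that hold/release sequence (not an official procedure): the cluster ID 1234 below is a placeholder, so take the real one for the condor_dagman job from condor_q. The hold_and_release function and the CONDOR_RUN variable are made up for this example and are not Condor settings.

```shell
#!/bin/sh
# Sketch only: hold, then release, the condor_dagman job so that a new
# DAGMan process is started, which will hopefully read the node job
# log files correctly.
#
# CONDOR_RUN is a hypothetical helper variable for this example only.
# Set CONDOR_RUN=echo to print the commands instead of running them.

hold_and_release() {
    # $1 = cluster ID of the condor_dagman job, as listed by condor_q
    ${CONDOR_RUN:-} condor_hold "$1" &&
    ${CONDOR_RUN:-} condor_release "$1"
}

# Example (1234 is a placeholder cluster ID):
#   hold_and_release 1234
```

You may want to check with condor_q that the DAGMan job actually shows as held (status H) before releasing it, rather than running both commands back to back.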

Also, what does the log file for the DAGMan job itself (<whatever>.dagman.log) show? That might give some idea why the job is idle.

Kent Wenger
Condor Team