Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed
- Date: Tue, 14 Jun 2011 12:08:00 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed
On Tue, 14 Jun 2011, Ian Chesal wrote:
> You could try running condor_rm against the condor_dagman job -- this
> should trigger it to write out a rescue dag if it thinks there's still
> work to be done. That might shed some light on what part of the DAG the
> manager thinks hasn't completed.
Actually, I'd recommend doing condor_hold/condor_release of the DAGMan
job, rather than condor_rm. If you do condor_rm, and a rescue DAG is
written out, the rescue DAG will not reflect the fact that a bunch of the
node jobs completed (the rescue DAG will only reflect whatever DAGMan has
"seen" at the time the rescue DAG is created).
If the section of the dagman.out file in the original posting includes the
end of the file, then something weird is going on with DAGMan trying to
read the node job log files. Running condor_hold followed by
condor_release will actually start a new DAGMan process, and hopefully
the new one will be able to read the log files correctly.
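
As a rough illustration of the hold/release cycle described above, here is a minimal shell sketch. The cluster ID 1234, the DRY_RUN switch, and the run and cycle_dagman helpers are all hypothetical and exist only for this example; the real cluster ID of the condor_dagman job can be found with condor_q. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical cluster ID of the condor_dagman job (look it up with condor_q).
DAGMAN_CLUSTER="${DAGMAN_CLUSTER:-1234}"
# Assumption for this sketch: DRY_RUN=1 (the default) prints the commands
# instead of executing them. Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hold the DAGMan job, then release it so that a new DAGMan process starts.
cycle_dagman() {
    run condor_hold "$1"
    run condor_release "$1"
}

cycle_dagman "$DAGMAN_CLUSTER"
```

Run with DRY_RUN=0 only once the cluster ID is confirmed. It may be worth checking with condor_q that the job is actually in the held state before releasing it; the sketch does not wait or check for that.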
Also, what does the log file for the DAGMan job itself
(<whatever>.dagman.log) show? That might give some idea why the job is