Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed

Kent and Ian,

Thank you for your comments. It looks like I will have to spend a lot more time investigating this, because it is not evident what happened. Most of the jobs did complete, but something went wrong with the communication between the jobs and condor_dagman.exe. I do not yet know how that communication works, but I did not see any errors in the dagman log. Basically, DAGMan went into recovery mode and could never exit the recovery loop. When it went into recovery mode it generated this file: dprintf_failure.DAGMAN. If I deleted the file, it was generated again on the next recovery attempt.

When I released the condor_dagman job, a recovery file was not generated. I then tried to rerun the dag and the following happened:
dprintf_failure.DAGMAN was generated again
condor_dagman job went into idle
no dag jobs were submitted
condor_dagman.exe would not exit without forcing it

I had started a different DAG about 4 days ago (overlapped this DAG that I am having problems with) and it has been running fine, so maybe something happened on my submit machine before I submitted the second DAG.

If I can figure this out or if it happens again I will post my findings to the list.

thanks again for your help,

From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 06/14/2011 11:15 AM
Subject: Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed
Sent by: condor-users-bounces@xxxxxxxxxxx

On Tue, 14 Jun 2011, Ian Chesal wrote:

> ...
> You could try running condor_rm against the condor_dagman job -- this
> should trigger it to write out a rescue dag if it thinks there's still
> work to be done. That might shed some light on what part of the DAG the
> manager thinks hasn't completed.

Actually, I'd recommend doing condor_hold/condor_release of the DAGMan
job, rather than condor_rm.  If you do condor_rm, and a rescue DAG is
written out, the rescue DAG will not reflect the fact that a bunch of the
node jobs completed (the rescue DAG will only reflect whatever DAGMan has
"seen" at the time the rescue DAG is created).

If the section of the dagman.out file in the original posting includes the
end of the file, then something weird is going on with DAGMan trying to
read the node job log files.  Condor_hold/condor_release will actually
start a new DAGMan process, and hopefully the new one will be able to
read the log files correctly.
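
For anyone who wants to script the hold/release sequence described above, here is a minimal sketch in Python. It assumes condor_hold and condor_release are on the PATH of the submit machine and that you already know the job ID of the condor_dagman job (for example from condor_q). The function names and the "1234.0" job ID are made up for illustration; they are not part of Condor.

```python
import subprocess


def hold_release_commands(job_id):
    """Build the two command lines, in order: hold first, then release.

    job_id is the ID of the condor_dagman job itself (hypothetical
    placeholder here), not the ID of one of the node jobs.
    """
    job_id = str(job_id)
    return [["condor_hold", job_id], ["condor_release", job_id]]


def hold_then_release(job_id, run=subprocess.run):
    """Run condor_hold, then condor_release, stopping at the first failure.

    The `run` parameter is an assumption of this sketch: it lets a caller
    substitute a fake runner so the sequence can be checked without
    touching a real queue.
    """
    for cmd in hold_release_commands(job_id):
        # check=True raises CalledProcessError on a non-zero exit status
        run(cmd, check=True)


# Example call (commented out so nothing is sent to a real queue):
# hold_then_release("1234.0")
```

The release step only runs if the hold succeeded, so a failure to hold the job does not go unnoticed.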

Also, what does the log file for the DAGMan job itself
(<whatever>.dagman.log) show?  That might give some idea why the job is

Kent Wenger
Condor Team