Subject: Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed
Kent and Ian,
Thank you for your comments. It looks
like I am going to have to spend a lot more time investigating this because
it is not evident what has happened. Most of the jobs did complete, but
something happened to the communication between the jobs and the condor_dagman.exe.
I do not know the communication process yet, but I did not see any errors
in the dagman log or anything. Basically the dagman went into recovery
mode and could never exit this recovery loop. When it went into recovery
mode it generated this file: dprintf_failure.DAGMAN.
If I deleted the file, DAGMan would generate it again on the next recovery attempt.
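For anyone hitting the same symptom, a minimal sketch of checking for that file and reading whatever it contains might look like the following. The directory is an assumption (pass in wherever the file shows up on your submit machine), and the helper name read_dprintf_failure is hypothetical, not part of Condor.

```python
from pathlib import Path


def read_dprintf_failure(directory, name="dprintf_failure.DAGMAN"):
    """Return the text of the dprintf failure file, or None if it is absent.

    `directory` is an assumption: the folder where the file was seen.
    `name` defaults to the file name reported in this thread.
    """
    path = Path(directory) / name
    if not path.is_file():
        return None
    # Replace undecodable bytes rather than failing on an odd encoding.
    return path.read_text(errors="replace")


if __name__ == "__main__":
    text = read_dprintf_failure(".")
    if text is None:
        print("no dprintf_failure.DAGMAN in the current directory")
    else:
        print(text)
```

Run it from the directory where the file appears; any text in the file may hint at why DAGMan could not write its own log.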
When I released the condor_dagman job, a recovery file was not generated.
I then tried to rerun the DAG and the following happened:
- [...] was generated again
- the condor_dagman job went into idle
- no DAG jobs were submitted
- condor_dagman.exe would not exit without [...]
I had started a different DAG about
4 days ago (overlapped this DAG that I am having problems with) and it
has been running fine, so maybe something happened on my submit machine
before I submitted the second DAG.
If I can figure this out or if it happens
again I will post my findings to the list.
thanks again for your help,
From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 06/14/2011 11:15 AM
Subject: Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed
On Tue, 14 Jun 2011, Ian Chesal wrote:
> You could try running condor_rm against the condor_dagman job -- this
> should trigger it to write out a rescue dag if it thinks there's still
> work to be done. That might shed some light on what part of the DAG
> manager thinks hasn't completed.
Actually, I'd recommend doing condor_hold/condor_release of the DAGMan
job, rather than condor_rm. If you do condor_rm and a rescue DAG is
written out, the rescue DAG will not reflect the fact that a bunch of the
node jobs completed (the rescue DAG will only reflect whatever DAGMan has
"seen" at the time the rescue DAG is created).
If the section of the dagman.out file in the original posting includes
the end of the file, then something weird is going on with DAGMan trying to
read the node job log files. Condor_hold/condor_release will actually
start a new DAGMan process, and hopefully the new one will be able to
read the log files correctly.
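
The hold/release sequence suggested above can be sketched as a small wrapper. This is only an illustration: the helper names and the dry_run flag are hypothetical, and the job ID is whatever condor_q reports for the condor_dagman job. By default it only prints the commands it would run.

```python
import subprocess


def hold_release_commands(job_id):
    """Return the two commands, in order, for the given DAGMan job ID."""
    return [["condor_hold", job_id], ["condor_release", job_id]]


def hold_and_release(job_id, dry_run=True):
    """Hold and then release the DAGMan job.

    With dry_run=True (the default) the commands are only printed.
    With dry_run=False they are executed; it may be worth checking
    condor_q between the two steps to confirm the job shows as held.
    """
    commands = hold_release_commands(job_id)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "123.0" is a placeholder job ID, not taken from the original post.
    hold_and_release("123.0")
```

Unlike condor_rm, this keeps the DAG in the queue, so no rescue DAG is needed to pick up where it left off.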
Also, what does the log file for the DAGMan job itself
(<whatever>.dagman.log) show? That might give some idea why
the job is