Re: [Condor-users] dagman aborts without creating a rescue dag



On Fri, 5 Aug 2005, Alexander Dietz wrote:

> I was running a DAG from my submit machine (Red Hat Enterprise Linux
> AS release 3, Condor version 6.7.8), with all the jobs executing
> on a remote machine (Fedora Core release 3 (Heidelberg), Condor
> version 6.7.8). Almost the full DAG completed, but then DAGMan
> aborted. Here are the last few lines from the dagman.out file:
>
> ...
>
> 8/4 22:04:37 ERROR "Assertion ERROR on (job->GetStatus() ==
> Job::STATUS_POSTRUN || recovery)" at line 772 in file dag.C
>
> The user proxies on both machines were still valid for a long time, yet
> DAGMan aborted without creating a rescue DAG. Is there possibly
> a bug in the file dag.C, or what is going on?

Yes, you hit a known bug in DAGMan.  The fix is in 6.7.10, which should be
coming out within a few days.

Kent Wenger
Condor Team