
Re: [Condor-users] 'ERROR while bootstrapping' in subdag



On Wed, 12 May 2010, Alexander Dietz wrote:

On Mon, May 10, 2010 at 17:19, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
On Mon, 10 May 2010, Alexander Dietz wrote:

does anyone have new information on my reported problem? I need to
finish this DAG soon, so without any reply soon I have to restart the
DAG from scratch (and will not be able to make tests regarding my
reported problem).

I haven't figured out yet exactly what happened.  But here's one thing to
try that's better than starting over from scratch:  if there's a lock file
(t.lock), remove that and re-submit the DAG.  (I'm assuming that the DAGMan
job is no longer in the queue.)  That should run the rescue DAG, so you
won't be starting from scratch, but it won't go into recovery mode, so
you'll bypass the problems with events that are goofing things up.
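
The steps above could be scripted roughly as follows. This is only a minimal sketch: the helper name prepare_resubmit is made up for illustration, the DAG and lock file paths are whatever applies to your run, and it assumes (as above) that the DAGMan job has already left the queue.

```python
from pathlib import Path


def prepare_resubmit(dag_file, lock_file):
    """Remove a stale DAGMan lock file and return the re-submit command.

    Hypothetical helper, not part of Condor. It does not check the queue,
    so only call it once the DAGMan job is no longer running.
    """
    lock = Path(lock_file)
    if lock.exists():
        # Without the lock file, DAGMan should pick up the rescue DAG
        # instead of entering recovery mode.
        lock.unlink()
    # The command is returned rather than executed, so it can be
    # inspected first.
    return ["condor_submit_dag", str(dag_file)]
```

To actually re-submit, pass the returned list to subprocess.run(cmd, check=True) on a machine where condor_submit_dag is on the PATH.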

I guess this procedure kind of works. Maybe the DAG continued not
exactly where it was, but at least from the rescue-DAG level.

Yeah, you probably lost some work, but there's not really a good way to avoid that in this case.

Before you do that, if you have space, could you tar up all of the node job
log files and put them someplace I can grab them?  That would help in
figuring out what has gone wrong.

What log files exactly do you mean? Maybe I still can grab them...?

The log files specified in your submit files for each node in the DAG. Or else, if you don't specify any log files, the default log file, which would be <DagFile>.nodes.log, unless you changed it with the DAGMAN_DEFAULT_NODE_LOG config option.
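
One way to gather those files is sketched below. It is a rough illustration only: the function names are invented, it handles only plain "JOB <name> <submit file>" lines in the DAG file and literal "log = <path>" lines in the submit files (no DIR keyword, no macros, no nested DAGs), and it does not consult DAGMAN_DEFAULT_NODE_LOG, so a non-default location would have to be added by hand.

```python
import re
import tarfile
from pathlib import Path

# Simplified patterns; real DAG and submit files allow more syntax than this.
JOB_LINE = re.compile(r"^\s*JOB\s+(\S+)\s+(\S+)", re.IGNORECASE)
LOG_LINE = re.compile(r"^\s*log\s*=\s*(.+?)\s*$", re.IGNORECASE)


def find_node_logs(dag_file):
    """Return the existing node job log files referenced by a DAG."""
    dag_path = Path(dag_file)
    base = dag_path.parent
    logs = set()
    for line in dag_path.read_text().splitlines():
        job = JOB_LINE.match(line)
        if not job:
            continue
        submit = base / job.group(2)
        if not submit.is_file():
            continue
        for sub_line in submit.read_text().splitlines():
            log = LOG_LINE.match(sub_line)
            if log:
                # Relative log paths are resolved against the DAG directory.
                logs.add(base / log.group(1))
    # Default node log: <DagFile>.nodes.log next to the DAG file.
    logs.add(Path(str(dag_path) + ".nodes.log"))
    return sorted(p for p in logs if p.is_file())


def tar_node_logs(dag_file, archive):
    """Write the found log files into a gzip-compressed tar archive."""
    logs = find_node_logs(dag_file)
    with tarfile.open(archive, "w:gz") as tar:
        for path in logs:
            tar.add(str(path))
    return logs
```

For example, tar_node_logs("t.dag", "node_logs.tar.gz") would return the list of files it packed, which makes it easy to check that nothing expected is missing.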

Kent Wenger
Condor Team