
[Condor-users] 'ERROR while bootstrapping' in subdag



Hi,

On Tue, May 4, 2010 at 17:31, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> On Tue, 4 May 2010, Alexander Dietz wrote:
>
>> I have a problem with a DAG within my Uberdag, and I would appreciate any
>> help. I needed to stop the uberdag process because it seemed to have hung.
>> When I started the (rescue) Uberdag, all but one of the sub DAGs ran
>> fine, but one did not. The dagman.out file shows the following error,
>> which I have never seen before:
>>
>> ....
>> 05/03 17:16:51     ------------------------------
>> 05/03 17:16:51        Condor Recovery Complete
>> 05/03 17:16:51     ------------------------------
>> 05/03 17:16:51 Disabling log line cache.
>> 05/03 17:16:51 ERROR while bootstrapping
>> 05/03 17:16:51 **** condor_scheduniv_exec.9943595.0 (condor_DAGMAN)
>> pid 2502 EXITING WITH STATUS 1
>> 05/03 17:16:51 Warning: ReadMultipleUserLogs destructor called, but
>> still monitoring 1 log(s)!
>
> Is this from the uberdag or the low-level dag?  It's hard to tell what's
> going on from this snippet of the file -- the best thing would be if you can
> send the dagman.out file for both the uberdag and the subdag that's failing.
>

The bootstrapping error happens in the lower-level subdag. I doubt I
can send the dagman.out file: even gzipped, it is almost 10 MB. Any
other ideas for how I could help? Do you need specific parts of this
file?
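
If specific parts are enough, one option would be to cut out only the
lines around the error. The following is a minimal sketch in Python, not
a Condor tool: the function name extract_context, the window sizes, and
the file name in the usage comment are assumptions made for illustration.

```python
from collections import deque


def extract_context(lines, marker="ERROR while bootstrapping",
                    before=200, after=20):
    """Return each line containing `marker` plus its surrounding lines.

    `before` and `after` are illustrative defaults (assumptions), not
    values recommended by the Condor team.
    """
    recent = deque(maxlen=before)  # rolling window of lines not yet emitted
    selected = []
    remaining = 0                  # lines still to emit after the last match
    for line in lines:
        if marker in line:
            # Emit the preceding context, then the matching line itself.
            selected.extend(recent)
            recent.clear()
            selected.append(line)
            remaining = after
        elif remaining > 0:
            selected.append(line)
            remaining -= 1
        else:
            recent.append(line)
    return selected


# Hypothetical usage; "subdag.dag.dagman.out" is a placeholder file name:
#
#     with open("subdag.dag.dagman.out", errors="replace") as f:
#         excerpt = extract_context(f)
#     with open("excerpt.txt", "w") as out:
#         out.writelines(excerpt)
```

The file is read line by line, so the full log never has to fit in
memory, and the resulting excerpt should be small enough to attach.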


>> This also happens if I try to restart this DAG on its own. It seems
>> that no rescue DAG has been created, and the DAG tried to recover from
>> the information in the dagman.out file?
>> Anyway, what can I do to recover this DAG? Or do I need to rerun this
>> DAG from scratch? I also tried to find some help via Google, but I
>> found nothing helpful. Any help is appreciated!
>
> Hmm -- it sounds like DAGMan was killed (not condor_rm'ed) or held while the
> lower-level DAG was running.

It is entirely possible that the submit machine crashed while the
subdag kept on running. That is just a hypothesis, though; I have not
investigated it.

> (Just as a note, recovery mode means that
> DAGMan is trying to recover the DAG state from the node job log files, not
> the dagman.out file.)

Ah I see. Thanks for the info.



Cheers
 Alex

>
> Kent Wenger
> Condor Team