Re: [Condor-users] Dagman "BAD EVENT" problems on Windows
- Date: Tue, 17 Jan 2012 17:23:41 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Dagman "BAD EVENT" problems on Windows
On Tue, 17 Jan 2012, Rowe, Thomas wrote:
> I'm running Condor Stable on Windows. A couple times I've seen my big
> DAGs die with incomprehensible "BAD EVENT" stuff. The dagman.out log
> below seems to indicate 5886 exits successfully, but then an unexpected
> ULOG_EXECUTING event happens for no clear reason?
> There are a bunch of these "bad event" messages scattered throughout
> the log alongside "Continuing with DAG in spite of bad event". But then
> suddenly "Aborting DAG" happens and everything gets condor_rm'ed. I
> can't figure out what the proximate issue to the "Aborting DAG" message
> is.

In general DAGMan doesn't like to see any events for a job after the
TERMINATED event. I'll have to look at the code -- it may be that, even
after a bad event, DAGMan completes a "cycle" of reading events, so that
may be why the abort happens some time after the bad events.
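
To make the "no events after TERMINATED" rule concrete, here is a minimal Python sketch. It is not DAGMan's actual code. It assumes the log has already been parsed into (job_id, event_name) pairs. The function name and the event labels ("SUBMIT", "EXECUTE", "TERMINATED") are hypothetical and chosen only for illustration.

```python
def find_events_after_termination(events):
    """Return every (job_id, event) pair seen after that job's TERMINATED event.

    `events` is an iterable of (job_id, event_name) pairs in log order.
    The event names are illustrative labels, not Condor's real identifiers.
    """
    terminated = set()   # jobs for which TERMINATED has already been seen
    bad = []             # events that arrived after termination
    for job_id, event in events:
        if job_id in terminated:
            # Anything after TERMINATED counts as a "bad event" in this sketch.
            bad.append((job_id, event))
        elif event == "TERMINATED":
            terminated.add(job_id)
    return bad
```

Given a sequence in which job 5886 terminates and then reports another execute event, the function returns just that late event, which is the pattern described in the original report.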
At any rate, you should be able to avoid the DAG aborting by setting the
DAGMAN_ALLOW_EVENTS configuration parameter appropriately (see
If you set it to 1, I think that should avoid the DAG aborts in your case.
You can set DAGMAN_ALLOW_EVENTS with a DAG configuration file (see
or by setting the environment variable _CONDOR_DAGMAN_ALLOW_EVENTS in
the shell in which you run condor_submit_dag.
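
Both ways of applying the setting can be sketched in a small Python helper. This is only an illustration under stated assumptions: the file name dagman.config and the helper names are hypothetical, and the sketch assumes condor_submit_dag accepts a -config option naming the DAG configuration file (please check the manual for your Condor version).

```python
import os
from pathlib import Path


def write_dagman_config(directory, allow_events=1):
    """Write a DAG configuration file that sets DAGMAN_ALLOW_EVENTS."""
    config_path = Path(directory) / "dagman.config"  # hypothetical file name
    config_path.write_text(f"DAGMAN_ALLOW_EVENTS = {allow_events}\n")
    return config_path


def submit_environment(allow_events=1, base_env=None):
    """Return a copy of the environment with _CONDOR_DAGMAN_ALLOW_EVENTS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["_CONDOR_DAGMAN_ALLOW_EVENTS"] = str(allow_events)
    return env


def submit_command(dag_file, config_path=None):
    """Build the condor_submit_dag argument list (assumes a -config option)."""
    cmd = ["condor_submit_dag"]
    if config_path is not None:
        cmd += ["-config", str(config_path)]
    cmd.append(dag_file)
    return cmd
```

Either approach alone should be enough: pass the result of submit_command("my.dag", config_path) to subprocess.run, or run the plain command with env=submit_environment(1). Here my.dag is a placeholder for your own DAG file.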
I'm curious what's going on though, because at least some of your bad
events happened on DAG-level NOOP jobs, which seems really weird. Can you
send me a copy of your dag file and your dagman.out file? I'd like to
take a look at them to try to figure out what is going on, rather than
just working around the problem with the DAGMAN_ALLOW_EVENTS setting.