
Re: [Condor-users] Dagman "BAD EVENT" problems on Windows



On Tue, 17 Jan 2012, Rowe, Thomas wrote:

> I'm running Condor Stable on Windows. A couple of times I've seen my big DAGs die with incomprehensible "BAD EVENT" messages. The dagman.out log below seems to indicate that 5886 exits successfully, but then an unexpected ULOG_EXECUTING event happens for no clear reason?
>
> There are a bunch of these "bad event" messages scattered throughout the log alongside "Continuing with DAG in spite of bad event". But then suddenly "Aborting DAG" happens and everything gets condor_rm'ed. I can't figure out what the proximate cause of the "Aborting DAG" message is.

In general, DAGMan doesn't expect to see any events for a job after its TERMINATED event, which is why an EXECUTING event that follows a successful exit gets flagged as a bad event. I'll have to look at the code, but it may be that, even after a bad event, DAGMan completes a "cycle" of reading events, which would explain why the abort happens some time after the bad events.
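
In the meantime, one way to narrow down which bad events came before the abort is to scan dagman.out for the messages you quoted. Here is a minimal Python sketch of that idea. It assumes the strings "BAD EVENT" and "Aborting DAG" appear literally in the log, as in your description; the function name is made up for illustration, and the exact dagman.out line format is not something the script depends on.

```python
# Substrings taken from the messages quoted in this thread. That they appear
# verbatim (and in this capitalization) in dagman.out is an assumption.
BAD_EVENT = "BAD EVENT"
ABORT = "Aborting DAG"


def bad_events_before_abort(lines):
    """Collect bad-event lines that occur before the first abort message.

    `lines` is any iterable of log lines (for example, an open file).
    Returns (abort_line_no, bad), where bad is a list of (line_no, text)
    pairs. abort_line_no is None if no abort message is found, in which
    case every bad-event line in the log is returned.
    """
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if ABORT in line:
            # Stop at the first abort; anything later is cleanup.
            return line_no, bad
        if BAD_EVENT in line:
            bad.append((line_no, line.rstrip("\n")))
    return None, bad
```

To use it, open the log and pass the file object in, for example `bad_events_before_abort(open("dagman.out", errors="replace"))`. The last few entries in the returned list are the most likely candidates for what triggered the abort.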

At any rate, you should be able to avoid the DAG aborting by setting the DAGMAN_ALLOW_EVENTS configuration parameter appropriately (see
http://research.cs.wisc.edu/condor/manual/v7.7/3_3Configuration.html#sec:DAGMan-Config-File-Entries).
If you set it to 1, I think that should avoid the DAG aborts in your case.

You can set DAGMAN_ALLOW_EVENTS with a DAG configuration file (see
http://research.cs.wisc.edu/condor/manual/v7.7/2_10DAGMan_Applications.html#SECTION003106500000000000000)
or by setting the environment variable _CONDOR_DAGMAN_ALLOW_EVENTS in
the shell in which you run condor_submit_dag.
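
As a rough sketch, the two approaches look something like this. The file names dagman.config and my.dag are placeholders, not names from your setup, and the last part assumes the Windows cmd.exe prompt since you're on Windows.

```
# --- dagman.config (placeholder name): the DAG configuration file ---
DAGMAN_ALLOW_EVENTS = 1

# --- my.dag (placeholder name): add a line pointing at that file ---
CONFIG dagman.config

rem --- or, instead of a config file, in the cmd.exe window ---
rem --- where you run condor_submit_dag                      ---
set _CONDOR_DAGMAN_ALLOW_EVENTS=1
condor_submit_dag my.dag
```

Either one should be enough on its own; the environment variable only applies to DAGs submitted from that shell, while the configuration file stays with the DAG.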

I'm curious what's going on though, because at least some of your bad events happened on DAG-level NOOP jobs, which seems really weird. Can you send me a copy of your dag file and your dagman.out file? I'd like to take a look at them to try to figure out what is going on, rather than just working around the problem with the DAGMAN_ALLOW_EVENTS setting.

Kent Wenger
Condor Team