
[Condor-users] Dagman "BAD EVENT" problems on Windows



I'm running Condor Stable on Windows. A couple of times I've seen my big DAGs die with incomprehensible "BAD EVENT" errors. The dagman.out log below seems to indicate that job 5886 exits successfully, but then an unexpected ULOG_EXECUTE event shows up for the same job for no clear reason?

 

There are a bunch of these "BAD EVENT" messages scattered throughout the log alongside "Continuing with DAG in spite of bad event". But then suddenly "Aborting DAG..." appears and everything gets condor_rm'ed. I can't figure out what the proximate cause of the "Aborting DAG" message is.

 

01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)

01/14/12 21:14:25 Number of idle job procs: 1428

01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-632752657 (5889.0.0)

01/14/12 21:14:25 Number of idle job procs: 1427

01/14/12 21:14:25 Event: ULOG_ATTRIBUTE_UPDATE for Condor Node reprun_scen0.sim-23890347 (4004.0.0)

01/14/12 21:14:25 Event: ULOG_IMAGE_SIZE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)

01/14/12 21:14:25 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)

01/14/12 21:14:25 Number of idle job procs: 1428

01/14/12 21:14:25 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)

01/14/12 21:14:25 Node alertadd_scen0.sim-118624484 job proc (0.2147483647.1033) completed successfully.

01/14/12 21:14:25 Node alertadd_scen0.sim-118624484 job completed

01/14/12 21:14:25 Number of idle job procs: 1427

01/14/12 21:14:25 Event: ULOG_JOB_TERMINATED for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)

01/14/12 21:14:25 Node moeadd_scen0.sim-850087584 job proc (5886.0.0) completed successfully.

01/14/12 21:14:25 Node moeadd_scen0.sim-850087584 job completed

01/14/12 21:14:25 Number of idle job procs: 1427

01/14/12 21:14:25 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-22384084 (0.2147483647.1034)

01/14/12 21:14:25 Number of idle job procs: 1428

01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)

01/14/12 21:14:25 BAD EVENT: job (5886.0.0) executing, total end count != 0 (1)

01/14/12 21:14:25 ERROR: aborting DAG because of bad event (BAD EVENT: job (5886.0.0) executing, total end count != 0 (1))

01/14/12 21:14:25 Event: ULOG_EXECUTE for Condor Node moeadd_scen0.sim-632752657 (5889.0.0)

01/14/12 21:14:25 Number of idle job procs: 1428

01/14/12 21:14:25 Event: ULOG_ATTRIBUTE_UPDATE for Condor Node reprun_scen0.sim-23890347 (4004.0.0)

01/14/12 21:14:25 Event: ULOG_IMAGE_SIZE for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)

01/14/12 21:14:25 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)

01/14/12 21:14:25 BAD EVENT: job (0.2147483647.1033) submitted, total end count != 0 (1)

01/14/12 21:14:25 Continuing with DAG in spite of bad event (BAD EVENT: job (0.2147483647.1033) submitted, total end count != 0 (1)) because of allow_events setting

01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-118624484 (0.2147483647.1033)

01/14/12 21:14:26 BAD EVENT: job (0.2147483647.1033) ended, total end count != 1 (2)

01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (0.2147483647.1033) ended, total end count != 1 (2)) because of allow_events setting

01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node moeadd_scen0.sim-850087584 (5886.0.0)

01/14/12 21:14:26 BAD EVENT: job (5886.0.0) ended, total end count != 1 (2)

01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (5886.0.0) ended, total end count != 1 (2)) because of allow_events setting

01/14/12 21:14:26 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-22384084 (0.2147483647.1034)

01/14/12 21:14:26 BAD EVENT: job (0.2147483647.1034) submitted, submit count != 1 (2)

01/14/12 21:14:26 Continuing with DAG in spite of bad event (BAD EVENT: job (0.2147483647.1034) submitted, submit count != 1 (2)) because of allow_events setting

01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-22384084 (0.2147483647.1034)

01/14/12 21:14:26 Node alertadd_scen0.sim-22384084 job proc (0.2147483647.1034) completed successfully.

01/14/12 21:14:26 Node alertadd_scen0.sim-22384084 job completed

01/14/12 21:14:26 Number of idle job procs: 1427

01/14/12 21:14:26 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-763834499 (0.2147483647.1035)

01/14/12 21:14:26 Number of idle job procs: 1428

01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-763834499 (0.2147483647.1035)

01/14/12 21:14:26 Node alertadd_scen0.sim-763834499 job proc (0.2147483647.1035) completed successfully.

01/14/12 21:14:26 Node alertadd_scen0.sim-763834499 job completed

01/14/12 21:14:26 Number of idle job procs: 1427

01/14/12 21:14:26 Event: ULOG_SUBMIT for Condor Node alertadd_scen0.sim-140979186 (0.2147483647.1036)

01/14/12 21:14:26 Number of idle job procs: 1428

01/14/12 21:14:26 Event: ULOG_JOB_TERMINATED for Condor Node alertadd_scen0.sim-140979186 (0.2147483647.1036)

01/14/12 21:14:26 Node alertadd_scen0.sim-140979186 job proc (0.2147483647.1036) completed successfully.

01/14/12 21:14:26 Node alertadd_scen0.sim-140979186 job completed

01/14/12 21:14:26 Number of idle job procs: 1427

01/14/12 21:14:26 Event: ULOG_ATTRIBUTE_UPDATE for Condor Node reprun_scen0.sim-801761742 (4027.0.0)

01/14/12 21:14:26 Aborting DAG...
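
For anyone triaging a similar log, here is a minimal Python sketch that pairs each "BAD EVENT" line with the line that follows it, to separate the bad events DAGMan tolerated from the one it treated as fatal. The regular expressions are assumptions inferred only from the excerpt above, not from any documented dagman.out format, and the function name classify_bad_events is made up for illustration.

```python
import re

# Patterns inferred from the dagman.out excerpt above (an assumption,
# not a documented log format).
BAD_EVENT = re.compile(r"^(\S+ \S+) BAD EVENT: job \(([\d.]+)\) (.*)$")
FATAL = re.compile(r"^\S+ \S+ ERROR: aborting DAG because of bad event")
TOLERATED = re.compile(r"^\S+ \S+ Continuing with DAG in spite of bad event")


def classify_bad_events(lines):
    """Return (timestamp, job_id, detail, outcome) for each BAD EVENT line.

    outcome is "fatal", "tolerated", or "unknown", decided by the first
    non-blank line that follows the BAD EVENT line.
    """
    results = []
    pending = None  # BAD EVENT still waiting for its follow-up line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # the pasted log is double-spaced; skip blank lines
        if pending is not None:
            if FATAL.match(line):
                results.append(pending + ("fatal",))
                pending = None
                continue
            if TOLERATED.match(line):
                results.append(pending + ("tolerated",))
                pending = None
                continue
            # Follow-up line was something else; outcome cannot be determined.
            results.append(pending + ("unknown",))
            pending = None
        match = BAD_EVENT.match(line)
        if match:
            pending = match.groups()
    if pending is not None:
        results.append(pending + ("unknown",))
    return results
```

Calling it on an open log file, for example `classify_bad_events(open("mydag.dag.dagman.out"))` with a placeholder filename, and filtering for the "fatal" outcome would point at the entry that led to the abort. In the excerpt above that would be the repeated execute event for job 5886.0.0.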