
[HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"




Hello,

I'm using a Condor farm to run DAGs containing a dozen independent tasks, each task being made of a few processes that run sequentially following parent/child dependencies. Lately I have encountered errors like the one below:
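For context, the sequential structure inside each task looks roughly like this (a minimal sketch; the node and file names are hypothetical, not the actual ones from my DAGs):

```
# Two-step chain: step2 only starts after step1 completes successfully
JOB    step1  step1.sub
JOB    step2  step2.sub
PARENT step1  CHILD step2
```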

(...)
02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
02/08/19 00:30:10 Number of idle job procs: 0
02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
(...)
02/08/19 00:30:10 ProcessLogEvents() returned false
02/08/19 00:30:10 Aborting DAG...
(...)

Condor correctly assesses the job as successfully completed, but it then seems to start executing it again immediately. A "BAD EVENT" error follows and the DAG aborts, killing all the jobs that were still running.

So far this problem seems to occur randomly: some DAGs complete fine, and when the problem does occur, the affected job is different each time, as are the machine and the slot on which that job runs.

In the above example, the DAG snippet is fairly simple:

(...)
JOB test_20190208_narnaud_virgo_status virgo_status.sub
VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
RETRY test_20190208_narnaud_virgo_status 1
(...)

and the sub file reads:

universe = vanilla
executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
priority = 10
getenv = True
error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
notification = never
+Experiment = "DetChar"
+AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
queue 1

=> Would you know what could cause this error? And is this something to fix at my level (user) or at the level of the farm?

=> And, until the problem is fixed, is there a way to convince the DAG to continue instead of aborting? Possibly by modifying the default value of the macro

DAGMAN_ALLOW_EVENTS = 114

? However, changing this value (to 5, say) is said to "break the semantics of the DAG", so I'm not sure this is the right way to proceed.
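In case it helps the discussion: if overriding the macro turns out to be acceptable, my understanding is that it can be scoped to a single DAG via a per-DAG configuration file rather than changed farm-wide. A sketch (the file name is hypothetical, and the value shown is just the current default, not a recommendation):

```
# In the .dag file, point DAGMan at a per-DAG config file:
CONFIG dagman.config

# dagman.config — overrides apply only to this DAG:
DAGMAN_ALLOW_EVENTS = 114
```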

Thanks in advance for your help,

Nicolas