
[HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"




Hello,

I'm using a Condor farm to run DAGs containing a dozen independent tasks, each task being made of a few processes that run sequentially following parent/child dependencies. Lately I have encountered errors like the one below:
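For context, the sequential structure inside each task looks roughly like this (a minimal sketch; the node and file names are hypothetical, not the actual ones from my DAGs):

```
# Two-step chain: step2 only starts after step1 completes successfully
JOB    step1  step1.sub
JOB    step2  step2.sub
PARENT step1  CHILD step2
```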

(...)
02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
02/08/19 00:30:10 Number of idle job procs: 0
02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
(...)
02/08/19 00:30:10 ProcessLogEvents() returned false
02/08/19 00:30:10 Aborting DAG...
(...)

Condor correctly assesses the job as successfully completed, but it then seems to start executing it again immediately. A "BAD EVENT" error follows and the DAG aborts, killing all the jobs that were still running.

So far this problem seems to occur randomly: some DAGs complete fine, and when the problem does occur, the affected job is different each time, as are the machine and the slot on which that job runs.

In the above example, the DAG snippet is fairly simple:

(...)
JOB test_20190208_narnaud_virgo_status virgo_status.sub
VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
RETRY test_20190208_narnaud_virgo_status 1
(...)

and the sub file reads:

universe = vanilla
executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
priority = 10
getenv = True
error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
notification = never
+Experiment = "DetChar"
+AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
queue 1

=> Would you know what could cause this error? And is this something to fix at my level (user) or at the level of the farm?

=> And, until the problem is fixed, is there a way to convince the DAG to continue instead of aborting? Possibly by modifying the default value of the macro

DAGMAN_ALLOW_EVENTS = 114

? However, changing this value (to 5, say) is said to "break the semantics of the DAG", so I'm not sure this is the right way to proceed.
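In case it helps the discussion: if overriding the macro turns out to be acceptable, my understanding is that it can be scoped to a single DAG via a per-DAG configuration file rather than changed farm-wide. A sketch (the file name is hypothetical, and the value shown is just the current default, not a recommendation):

```
# In the .dag file, point DAGMan at a per-DAG config file:
CONFIG dagman.config

# dagman.config — overrides apply only to this DAG:
DAGMAN_ALLOW_EVENTS = 114
```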

Thanks in advance for your help,

Nicolas