Re: [HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On Tue, Feb 12, 2019 at 3:08 PM Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:

Hi Mark,
> I've been looking into this.

Thanks!

> (...)
> Are you running on Windows or Linux? It seems that all previous
> occurrences of this problem happened on Windows.

I'm running on Linux. Some information:

> condor_version
> $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
> $CondorPlatform: x86_64_RedHat7 $
> echo $UNAME
> Linux-x86_64-CL7

> These bugs were never resolved, although it seems like Kent spent some
> time on them and determined the problem was most likely in the
> log-reading code (so at the user level, not the farm). However it's hard
> to tell without seeing what events are actually showing up in the log.
> I'd like to try and reproduce this locally -- could you send your a)
> .nodes.log file, b) .dagman.out file, c) full .dag file? These should
> help me figure out where the bug is happening.

Please find in attachment two sets of these three files:

* those tagged "20190207_narnaud_2" correspond to a "BAD EVENT" case
followed by a dag abort (DAGMAN_ALLOW_EVENTS = 114, the default value)

* those tagged "20190212_narnaud_7" correspond to a "BAD EVENT" case,
mitigated by DAGMAN_ALLOW_EVENTS = 5: the dag goes on until completion.

As the dag file relies on independent sub files, I am also sending you
the template sub file we're using to generate all the individual task
sub files.

> For a short term workaround, you could try adjusting the value of
> DAGMAN_ALLOW_EVENTS to 5 like you suggested. It's true this could affect
> the semantics, but I think the worst case is that DAGMan could get stuck
> in a logical loop. If you're able to keep an eye on its progress and
> manually abort if necessary, I think this should work.

See above: indeed setting DAGMAN_ALLOW_EVENTS = 5 allows the dag to go on.

The point is that since I've noticed this issue I am always running the
"same" dag: the only thing that changes is its tag -- basically driving
the output directory and used for many filenames. In about 40% of the
cases, I get a "BAD EVENT" error but each time it affects a different
task and so happens at different times of the dag processing as the
tasks have very different durations. While in about 60% of the cases,
the dag completes fine w/o any "BAD EVENT".

Let me know if you need more information or if anything is unclear.

Cheers,

Nicolas

> Mark
>
>
> On Tue, Feb 12, 2019 at 2:42 AM Nicolas Arnaud <narnaud@xxxxxxxxxxxx
> <mailto:narnaud@xxxxxxxxxxxx>> wrote:
>
>
> Hello,
>
> I'm using a Condor farm to run dags containing a dozen of independent
> tasks, each task being made of a few processes running sequentially
> following the parent/child logic. Lately I have encountered errors like
> the one below:
>
> > (...)
> > 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node
> test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
> > 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node
> test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
> > 02/08/19 00:30:10 Number of idle job procs: 0
> > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job
> proc (281605.0.0) completed successfully.
> > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job
> completed
> > 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node
> test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
> > 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total
> end count != 0 (1)
> > 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD
> EVENT: job (281605.0.0) executing, total end count != 0 (1))
> > (...)
> > 02/08/19 00:30:10 ProcessLogEvents() returned false
> > 02/08/19 00:30:10 Aborting DAG...
> > (...)
>
> Condor correctly asseses one job as being successfully completed but it
> seems that it starts executing it again immediately. Then there is a
> "BAD EVENT" error and the DAG aborts, killing all the jobs that were
> running.
>
> So far this problem seems to occur randomly: some dags complete fine
> while, when the problem occurs, the job that suffers from it is
> different each time. So are the machine and the slot on which that
> particular job is running.
>
> In the above example, the dag snippet is fairly simple
>
> > (...)
> > JOB test_20190208_narnaud_virgo_status virgo_status.sub
> > VARS test_20190208_narnaud_virgo_status
> initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
> > RETRY test_20190208_narnaud_virgo_status 1
> > (...)
>
> and the sub file reads
>
> > universe = vanilla
> > executable =
> /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
> > arguments = "--event_gps 1233176418.54321 --event_id
> test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl
> --output_dir /data/procdata/web/dqr/test_20190208_narnaud
> --n_seconds_backward 10 --n_seconds_forward 10"
> > priority = 10
> > getenv = True
> > error =
> /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
> > output =
> /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
> > notification = never
> > +Experiment = "DetChar"
> > +AccountingGroup= "virgo.prod.o3.detchar.transient.dqr"
> > queue 1
>
> => Would you know what could cause this error? And whether this is
> at my
> level (user) or at the level of the farm?
>
> => And, until the problem is fixed, would there be a way to convince
> the
> dag to continue instead of aborting? Possibly by modifying the default
> value of the macro
>
> > DAGMAN_ALLOW_EVENTS = 114
>
> ? But changing this value to 5 [!?] is said to "break the semantics of
> the DAG" => I'm not sure this is the right way to proceed.
>
> Thanks in advance for your help,
>
> Nicolas
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> <mailto:htcondor-users-request@xxxxxxxxxxx> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
> --
> Mark Coatsworth
> Systems Programmer
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin-Madison
> +1 608 206 4703
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>

Mailing List Archives

Public Access

Re: [HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"