
Re: [HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"




Hi Mark,

Thanks for following up on this!

I'm actually just an end user, not a sysadmin, so I'll forward your message to the admins here, who will then contact you directly.

Cheers,

Nicolas

On 13/02/2019 at 21:15, Mark Coatsworth wrote:
Hi Nicolas, thanks for all this information!

I looked through your log files, and it turns out this was not the problem I expected. The ULOG_EXECUTE event actually does appear twice in the log, so it's not an issue with our log-reading code (which was the case with the bugs I mentioned). For some reason it looks like your schedd is actually executing the same job twice.

We're going to need a few more things to help debug this. Could you please send me the following (a command sketch follows the list):
* ScheddLog
* ShadowLog
* Your job classad (you can retrieve this by running "condor_history -l 281392")
* The output on your submit server from running "ps auxww | grep condor"
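
A sketch of shell commands that would gather these items on the submit machine; the condor_config_val calls are only a suggested way of locating the daemon logs, and the actual paths depend on the local configuration:

    # Locate the schedd and shadow daemon logs (paths vary per installation)
    condor_config_val SCHEDD_LOG
    condor_config_val SHADOW_LOG

    # Dump the full classad of the job in question to a file
    condor_history -l 281392 > 281392.classad

    # List the HTCondor processes running on the submit server
    ps auxww | grep condor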

Also, is there any way I can get a user account to log in to your submit server? We discussed this at our team meeting this morning and everybody thinks the problem is related to your environment. So it might be easier for us to debug if we can get access, rather than keep asking you to send us things over email.

Mark



On Tue, Feb 12, 2019 at 3:08 PM Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:


    Hi Mark,
     > I've been looking into this.

    Thanks!

     > (...)
     > Are you running on Windows or Linux? It seems that all previous
     > occurrences of this problem happened on Windows.

    I'm running on Linux. Some information:

     > condor_version
     > $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
     > $CondorPlatform: x86_64_RedHat7 $
     > echo $UNAME
     > Linux-x86_64-CL7

     > These bugs were never resolved, although it seems like Kent spent
     > some time on them and determined the problem was most likely in the
     > log-reading code (so at the user level, not the farm). However it's
     > hard to tell without seeing what events are actually showing up in
     > the log. I'd like to try and reproduce this locally -- could you
     > send your a) .nodes.log file, b) .dagman.out file, c) full .dag
     > file? These should help me figure out where the bug is happening.

    Please find attached two sets of these three files:

      * those tagged "20190207_narnaud_2" correspond to a "BAD EVENT" case
        followed by a dag abort (DAGMAN_ALLOW_EVENTS = 114, the default value);

      * those tagged "20190212_narnaud_7" correspond to a "BAD EVENT" case
        mitigated by DAGMAN_ALLOW_EVENTS = 5: the dag goes on until completion.

    As the dag file relies on independent sub files, I am also sending you
    the template sub file we're using to generate all the individual task
    sub files.
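
    For reference, a minimal and purely hypothetical sketch of such a dag
    layout, i.e. a few independent tasks, each being a short sequential
    parent/child chain (node and file names below are invented):

        # Hypothetical illustration: two independent tasks, each a two-step chain
        JOB taskA_step1 taskA_step1.sub
        JOB taskA_step2 taskA_step2.sub
        PARENT taskA_step1 CHILD taskA_step2
        RETRY taskA_step2 1

        JOB taskB_step1 taskB_step1.sub
        JOB taskB_step2 taskB_step2.sub
        PARENT taskB_step1 CHILD taskB_step2
        RETRY taskB_step2 1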

     > For a short term workaround, you could try adjusting the value of
     > DAGMAN_ALLOW_EVENTS to 5 like you suggested. It's true this could
     > affect the semantics, but I think the worst case is that DAGMan could
     > get stuck in a logical loop. If you're able to keep an eye on its
     > progress and manually abort if necessary, I think this should work.

    See above: indeed setting DAGMAN_ALLOW_EVENTS = 5 allows the dag to
    go on.
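
    For reference, a sketch of one way to scope this setting to a single dag
    rather than the whole pool, assuming a per-DAG configuration file (the
    file and dag names below are hypothetical):

        # dqr_dagman.config : per-DAG configuration file (hypothetical name)
        DAGMAN_ALLOW_EVENTS = 5

    which DAGMan can pick up either via a CONFIG line in the .dag file or on
    the command line:

        condor_submit_dag -config dqr_dagman.config my_workflow.dag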

    The point is that since I noticed this issue I have always been running
    the "same" dag: the only thing that changes is its tag, which basically
    drives the output directory and is used in many filenames. In about 40%
    of the cases I get a "BAD EVENT" error, but each time it affects a
    different task, and so it happens at a different point of the dag
    processing since the tasks have very different durations. In the other
    60% of the cases, the dag completes fine without any "BAD EVENT".

    Let me know if you need more information or if anything is unclear.

    Cheers,

    Nicolas

     > Mark
     >
     >
     > On Tue, Feb 12, 2019 at 2:42 AM Nicolas Arnaud
     > <narnaud@xxxxxxxxxxxx> wrote:
     >
     >
     >    Hello,
     >
     >    I'm using a Condor farm to run dags containing a dozen independent
     >    tasks, each task being made of a few processes running sequentially
     >    following the parent/child logic. Lately I have encountered errors
     >    like the one below:
     >
     >    > (...)
     >    > 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
     >    > 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
     >    > 02/08/19 00:30:10 Number of idle job procs: 0
     >    > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
     >    > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
     >    > 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
     >    > 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
     >    > 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
     >    > (...)
     >    > 02/08/19 00:30:10 ProcessLogEvents() returned false
     >    > 02/08/19 00:30:10 Aborting DAG...
     >    > (...)
     >
     >    Condor correctly assesses one job as being successfully completed,
     >    but it seems that it starts executing it again immediately. Then
     >    there is a "BAD EVENT" error and the DAG aborts, killing all the
     >    jobs that were running.
     >
     >    So far this problem seems to occur randomly: some dags complete
     >    fine while, when the problem occurs, the job that suffers from it
     >    is different each time. So are the machine and the slot on which
     >    that particular job is running.
     >
     >    In the above example, the dag snippet is fairly simple
     >
     >    > (...)
     >    > JOB test_20190208_narnaud_virgo_status virgo_status.sub
     >    > VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
     >    > RETRY test_20190208_narnaud_virgo_status 1
     >    > (...)
     >
     >    and the sub file reads
     >
     >    > universe = vanilla
     >    > executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
     >    > arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
     >    > priority = 10
     >    > getenv = True
     >    > error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
     >    > output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
     >    > notification = never
     >    > +Experiment = "DetChar"
     >    > +AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
     >    > queue 1
     >
     >    => Would you know what could cause this error? And whether this is
     >    at my level (user) or at the level of the farm?
     >
     >    => And, until the problem is fixed, would there be a way to convince
     >    the dag to continue instead of aborting? Possibly by modifying the
     >    default value of the macro
     >
     >    > DAGMAN_ALLOW_EVENTS = 114
     >
     >    ? But changing this value to 5 [!?] is said to "break the semantics
     >    of the DAG" => I'm not sure this is the right way to proceed.
     >
     >    Thanks in advance for your help,
     >
     >    Nicolas
     >



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
+1 608 206 4703