
Re: [HTCondor-users] Fwd: Re: DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"




Hi Mark,

Thanks for this great news!

When is your next stable release planned?

I'll test the workaround you're suggesting in the coming days.

Cheers,

Nicolas

On 19/02/2019 at 22:49, Mark Coatsworth wrote:
Hi everyone,

I got some help from Jaime on this, and he was finally able to figure out what's happening. There's a bug in condor_shadow which sometimes causes it to report old job IDs in the event log. We've written up a ticket with more details:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6919

Jaime is working on a fix for this which should be available in our next stable release (8.8.2).

In the meantime, there's a workaround you can use that is safer than the one we talked about previously. If you set a value for your EVENT_LOG configuration knob (which is empty by default), that should force the shadow to run the code that refreshes the job ID. The value should be the path to a file where events can get written. More info here:

http://research.cs.wisc.edu/htcondor/manual/v8.8/ConfigurationMacros.html#x33-1870003.5
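
For example (a minimal sketch; the path below is only illustrative, any file
the HTCondor daemons can write to should work):

    EVENT_LOG = /var/log/condor/EventLog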

Please give this a try, and we'll keep you posted on getting this problem fixed.

Mark



On Fri, Feb 15, 2019 at 10:28 AM giuseppe.dibiase <giuseppe.dibiase@xxxxxxxxx> wrote:

    Hi Mark,
    unfortunately those logs are gone, so Nicolas re-ran his DAGs and the
    BAD EVENT occurred on job ID 285383.0. I have sent the logs for that job,
    so you should be able to track down the bug in them.
    Cheers
    Giuseppe



-- 
    ===============================================
    Giuseppe Di Biase - giuseppe.dibiase@xxxxxxxxx

    European Gravitational Observatory - EGO
    Via E.Amaldi - 56021 Cascina (Pisa) - IT
    Phone: +39 050 752 577
    ===============================================

    -------- Original Message --------
    From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
    Date: 15/02/19 17:03 (GMT+01:00)
    To: Giuseppe Di Biase <giuseppe.dibiase@xxxxxxxxx>
    Cc: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>, EGO Helpdesk <service@xxxxxxxxx>,
    Franco Carbognani <franco.carbognani@xxxxxxxxx>,
    HTCondor-Users <htcondor-users@xxxxxxxxxxx>
    Subject: Re: Fwd: Re: [HTCondor-users] DAG error: "BAD EVENT: job
    (...) executing, total end count != 0 (1)"

    Hi Giuseppe, thanks for sending all this information.

    Unfortunately the ShadowLog and SchedLog files you sent do not have
    any information about Nicolas' job (281392.0.0). This job ran on Feb
    7, whereas the logs you sent only have information from Feb 14, so
    the information we need has already gone stale.

    It's possible you'll find some information in ShadowLog.old and
    SchedLog.old (look for the string "281392"); if so, please send those to me.
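
    For instance, a quick check on the submit host could look like this (a
    rough sketch, assuming the rotated logs live in the directory reported by
    condor_config_val LOG):

        LOG_DIR=$(condor_config_val LOG)
        grep -n "281392" "$LOG_DIR/SchedLog.old" "$LOG_DIR/ShadowLog.old"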

    If there's no information in there, I'll need the two of you to
    coordinate: Nicolas will need to run the jobs again until they produce the
    same error, and then Giuseppe will need to send me the log files right
    away, while they still contain the information we need.

    Sorry for the inconvenience. There's just no way for us to diagnose
    this problem without the log output.

    Mark


    On Thu, Feb 14, 2019 at 9:17 AM Giuseppe Di Biase
    <giuseppe.dibiase@xxxxxxxxx> wrote:

        Hi Mark,

        I would prefer to send you the logs rather than create a Linux account
        for now; hopefully the logs will reveal some configuration error.

        The 4 files you requested are attached.


        Thanks

        Giuseppe

        >> -------- Forwarded Message --------
        >> Subject: Re: [HTCondor-users] DAG error: "BAD EVENT: job (...)
        >> executing, total end count != 0 (1)"
        >> Date: Wed, 13 Feb 2019 20:15:24 +0000
        >> From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
        >> To: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>
        >> Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
        >>
        >> Hi Nicolas, thanks for all this information!
        >>
        >> I looked through your log files, and it turns out this was not the
        >> problem I expected. The ULOG_EXECUTE event actually does appear twice
        >> in the log -- so it's not an issue with our log-reading code (which
        >> was the case with those bugs I mentioned). For some reason it looks
        >> like your schedd is actually executing the same job twice.
        >>
        >> We're going to need a few more things to help debug this. Could you
        >> please send me the following:
        >> * SchedLog
        >> * ShadowLog
        >> * Your job classad (you can retrieve this by running
        >>   "condor_history -l 281392")
        >> * The output on your submit server from running "ps auxww | grep condor"
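        >>
        >> For instance, a rough sketch of how all of that could be collected
        >> on the submit host (the output file names are only examples):
        >>
        >>     # daemon log directory as configured on this host
        >>     LOG_DIR=$(condor_config_val LOG)
        >>     cp "$LOG_DIR/SchedLog" "$LOG_DIR/ShadowLog" .
        >>     # full classad of the job in question
        >>     condor_history -l 281392 > job_281392.classad
        >>     # currently running condor processes
        >>     ps auxww | grep condor > condor_processes.txt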
        >>
        >> Also, is there any way I can get a user account to log in to your
        >> submit server? We discussed this at our team meeting this morning and
        >> everybody thinks the problem is related to your environment. So it
        >> might be easier for us to debug if we can get access, rather than
        >> keep asking you to send us things over email.
        >>
        >> Mark
        >>
        >>
        >> On Tue, Feb 12, 2019 at 3:08 PM Nicolas Arnaud
        >> <narnaud@xxxxxxxxxxxx> wrote:
        >>
        >>
        >>     Hi Mark,
        >>     > I've been looking into this.
        >>
        >>     Thanks!
        >>
        >>     > (...)
        >>     > Are you running on Windows or Linux? It seems that all previous
        >>     > occurrences of this problem happened on Windows.
        >>
        >>     I'm running on Linux. Some information:
        >>
        >>     > condor_version
        >>     > $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
        >>     > $CondorPlatform: x86_64_RedHat7 $
        >>     > echo $UNAME
        >>     > Linux-x86_64-CL7
        >>
        >>     > These bugs were never resolved, although it seems like Kent
        >>     > spent some time on them and determined the problem was most
        >>     > likely in the log-reading code (so at the user level, not the
        >>     > farm). However it's hard to tell without seeing what events are
        >>     > actually showing up in the log.
        >>     > I'd like to try and reproduce this locally -- could you send
        >>     > your a) .nodes.log file, b) .dagman.out file, c) full .dag
        >>     > file? These should help me figure out where the bug is
        >>     > happening.
        >>
        >>     Please find in attachment two sets of these three files:
        >>
        >>     * those tagged "20190207_narnaud_2" correspond to a "BAD EVENT"
        >>       case followed by a dag abort (DAGMAN_ALLOW_EVENTS = 114, the
        >>       default value)
        >>
        >>     * those tagged "20190212_narnaud_7" correspond to a "BAD EVENT"
        >>       case, mitigated by DAGMAN_ALLOW_EVENTS = 5: the dag goes on
        >>       until completion.
        >>
        >>     As the dag file relies on independent sub files, I am also
        >>     sending you the template sub file we're using to generate all the
        >>     individual task sub files.
        >>
        >>     > For a short term workaround, you could try adjusting the value
        >>     > of DAGMAN_ALLOW_EVENTS to 5 like you suggested. It's true this
        >>     > could affect the semantics, but I think the worst case is that
        >>     > DAGMan could get stuck in a logical loop. If you're able to
        >>     > keep an eye on its progress and manually abort if necessary, I
        >>     > think this should work.
        >>
        >>     See above: indeed setting DAGMAN_ALLOW_EVENTS = 5 allows the dag
        >>     to go on.
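        >>
        >>     A minimal sketch of how this can be set for a single DAG rather
        >>     than farm-wide (the file names are only illustrative; 114 is the
        >>     default value):
        >>
        >>         # in the .dag file
        >>         CONFIG dagman.config
        >>
        >>         # dagman.config
        >>         DAGMAN_ALLOW_EVENTS = 5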
        >>
        >>     The point is that since I've noticed this issue I am always
        >>     running the "same" dag: the only thing that changes is its tag --
        >>     basically driving the output directory and used for many
        >>     filenames. In about 40% of the cases I get a "BAD EVENT" error,
        >>     but each time it affects a different task and so happens at a
        >>     different point of the dag processing, as the tasks have very
        >>     different durations; in about 60% of the cases the dag completes
        >>     fine without any "BAD EVENT".
        >>
        >>     Let me know if you need more information or if anything is
        >>     unclear.
        >>
        >>     Cheers,
        >>
        >>     Nicolas
        >>
        >>     > Mark
        >>     >
        >>     >
        >>     > On Tue, Feb 12, 2019 at 2:42 AM Nicolas Arnaud
        >>     > <narnaud@xxxxxxxxxxxx> wrote:
        >>     >
        >>     >
        >>     >     Hello,
        >>     >
        >>     >     I'm using a Condor farm to run dags containing a dozen
        >>     >     independent tasks, each task being made of a few processes
        >>     >     running sequentially following the parent/child logic.
        >>     >     Lately I have encountered errors like the one below:
        >>     >
        >>     >     > (...)
        >>     >     > 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
        >>     >     > 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
        >>     >     > 02/08/19 00:30:10 Number of idle job procs: 0
        >>     >     > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
        >>     >     > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
        >>     >     > 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
        >>     >     > 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
        >>     >     > 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
        >>     >     > (...)
        >>     >     > 02/08/19 00:30:10 ProcessLogEvents() returned false
        >>     >     > 02/08/19 00:30:10 Aborting DAG...
        >>     >     > (...)
        >>     >
        >>     >     Condor correctly assesses one job as being successfully
        >>     >     completed, but it seems that it starts executing it again
        >>     >     immediately. Then there is a "BAD EVENT" error and the DAG
        >>     >     aborts, killing all the jobs that were running.
        >>     >
        >>     >     So far this problem seems to occur randomly: some dags
        >>     >     complete fine while, when the problem occurs, the job that
        >>     >     suffers from it is different each time. So are the machine
        >>     >     and the slot on which that particular job is running.
        >>     >
        >>     >     In the above example, the dag snippet is fairly simple
        >>     >
        >>     >     > (...)
        >>     >     > JOB test_20190208_narnaud_virgo_status virgo_status.sub
        >>     >     > VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
        >>     >     > RETRY test_20190208_narnaud_virgo_status 1
        >>     >     > (...)
        >>     >
        >>     >     and the sub file reads
        >>     >
        >>     >     > universe = vanilla
        >>     >     > executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
        >>     >     > arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
        >>     >     > priority = 10
        >>     >     > getenv = True
        >>     >     > error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
        >>     >     > output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
        >>     >     > notification = never
        >>     >     > +Experiment = "DetChar"
        >>     >     > +AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
        >>     >     > queue 1
        >>     >
        >>     >     => Would you know what could cause this error? And whether
        >>     >     this is at my level (user) or at the level of the farm?
        >>     >
        >>     >     => And, until the problem is fixed, would there be a way to
        >>     >     convince the dag to continue instead of aborting? Possibly
        >>     >     by modifying the default value of the macro
        >>     >
        >>     >     > DAGMAN_ALLOW_EVENTS = 114
        >>     >
        >>     >     ? But changing this value to 5 [!?] is said to "break the
        >>     >     semantics of the DAG" => I'm not sure this is the right way
        >>     >     to proceed.
        >>     >
        >>     >     Thanks in advance for your help,
        >>     >
        >>     >     Nicolas
        >>     >
        >>     >     _______________________________________________
        >>     >     HTCondor-users mailing list
        >>     >     To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
        >>     >     subject: Unsubscribe
        >>     >     You can also unsubscribe by visiting
        >>     >     https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
        >>     >
        >>     >     The archives can be found at:
        >>     >     https://lists.cs.wisc.edu/archive/htcondor-users/
        >>     >
        >>     >
        >>     >
        >>     > --
        >>     > Mark Coatsworth
        >>     > Systems Programmer
        >>     > Center for High Throughput Computing
        >>     > Department of Computer Sciences
        >>     > University of Wisconsin-Madison
        >>     > +1 608 206 4703
        >>     >
        >>     > _______________________________________________
        >>     > HTCondor-users mailing list
        >>     > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
        >>     > subject: Unsubscribe
        >>     > You can also unsubscribe by visiting
        >>     > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
        >>     >
        >>     > The archives can be found at:
        >>     > https://lists.cs.wisc.edu/archive/htcondor-users/
        >>     >
        >>
        >>
        >>



-- 
    Mark Coatsworth
    Systems Programmer
    Center for High Throughput Computing
    Department of Computer Sciences
    University of Wisconsin-Madison



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison