
Re: [HTCondor-users] dagman job won't finish



Hi all,

We took this conversation off-list, so I wanted to follow up here in case the solution is useful to anyone else.

The problem ended up being a simple one. The scripts for this job were accidentally overwriting the .nodes.log file that DAGMan uses to monitor the progress of a DAG's execution.

Gentle reminder to be extra mindful of the .nodes.log file when your pre/post scripts are deleting, archiving, or otherwise moving files around. If that file gets accidentally erased or overwritten, the consequences are ugly!
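
For anyone writing a similar cleanup step, here is a minimal sketch of the
safe pattern in Python. The directory path and tarball glob below are
placeholders for illustration, not the actual script from this thread; the
point is to match only the files your workflow produced and explicitly skip
anything DAGMan writes:

#!/usr/bin/env python
# Hypothetical POST-script cleanup: remove only the tarballs this
# workflow produced, and never touch DAGMan's bookkeeping files.
import fnmatch
import os
import sys

work_dir = "/path/to/dag/directory"  # placeholder

# Suffixes of files DAGMan needs until it exits (among others):
# never delete, overwrite, or move these while the DAG is running.
protected = (".dag.nodes.log", ".dagman.out", ".dag.metrics")

for name in os.listdir(work_dir):
    if name.endswith(protected):
        continue  # belongs to DAGMan, leave it alone
    if fnmatch.fnmatch(name, "results_*.tgz"):  # placeholder pattern
        os.remove(os.path.join(work_dir, name))

sys.exit(0)  # a non-zero exit would mark the DAG node as failed

The same guard applies to archiving or moving: the DAG's own files have to
stay in place until the DAGMan job itself leaves the queue.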

Mark

On Thu, Feb 1, 2018 at 7:18 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
Hi Mark,

I submitted with condor_submit_dag -debug 7; I hope that's equivalent.
Attached are the .dagman.out and .nodes.log files. Thanks!

--Rami

On Thu, Feb 1, 2018 at 5:48 PM, Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
> Hi Rami,
>
> Thanks for sending this. Everything here makes sense, but it's not what I
> was expecting. I certainly didn't think removing the POST script would send
> the DAG into a loop!
>
> Something strange is happening with your log files, and since DAGMan uses
> these to control the flow of the DAG, that seems to be the culprit here. I've
> tried reproducing this problem with a similarly-structured DAG on my local
> system, but everything works correctly.
>
> I've tried cross-referencing your output messages against our codebase, but
> there's not enough information to find anything conclusive. Could you try
> adding the following setting to your condor config:
>
> DAGMAN_DEBUG = D_ALL
>
> Then try submitting your DAG again, and send me both the resulting
> .dagman.out and .nodes.log files in full. I'm hoping that debugging at max
> verbosity will reveal an error message that explains why log access is
> going wrong.
>
> Mark
>
>
>
> On Thu, Feb 1, 2018 at 2:15 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
>>
>> Sorry for the extra email, but the dagman job keeps running *and* the
>> B node jobs get resubmitted to run again. In other words, not only
>> does the dagman job never end, but it submits the B nodes, waits for
>> them to finish, then resubmits them in a loop.
>>
>> --Rami
>>
>> On Thu, Feb 1, 2018 at 3:04 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx>
>> wrote:
>> > I removed the POST script and the dag job still hangs around in condor.
>> > However, I noticed a different error in the dag.dagman.out:
>> >
>> > 02/01/18 12:01:32 Node C job proc (69659.0.0) completed successfully.
>> > 02/01/18 12:01:32 Node C job completed
>> > 02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
>> > 02/01/18 12:01:32 Of 4 nodes total:
>> > 02/01/18 12:01:32      Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>> > 02/01/18 12:01:32       ===     ===      ===     ===     ===        ===      ===
>> > 02/01/18 12:01:32         4       0        0       0       0          0        0
>> > 02/01/18 12:01:32 0 job proc(s) currently held
>> > 02/01/18 12:01:32 All jobs Completed!
>> > 02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxJobs limit (0)
>> > 02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxIdle limit (1000)
>> > 02/01/18 12:01:32 Note: 0 total job deferrals because of node category throttles
>> > 02/01/18 12:01:32 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER
>> > 02/01/18 12:01:32 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER
>> > 02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
>> > 02/01/18 12:01:32 Of 4 nodes total:
>> > 02/01/18 12:01:32      Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>> > 02/01/18 12:01:32       ===     ===      ===     ===     ===        ===      ===
>> > 02/01/18 12:01:32         4       0        0       0       0          0        0
>> > 02/01/18 12:01:32 0 job proc(s) currently held
>> > 02/01/18 12:01:32 Wrote metrics file workflow_localTEST.dag.metrics.
>> > 02/01/18 12:01:32 Metrics not sent because of PEGASUS_METRICS or CONDOR_DEVELOPERS setting.
>> > 02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find
>> > LogFileMonitor object for log file
>> > /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
>> > (2305:20709560)!
>> > 02/01/18 12:01:32 All log monitors:
>> > 02/01/18 12:01:32   File ID: 2305:20709395
>> > 02/01/18 12:01:32     Monitor: 0x19620a0
>> > 02/01/18 12:01:32     Log file:
>> > </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
>> > 02/01/18 12:01:32     refCount: 1
>> > 02/01/18 12:01:32     lastLogEvent: (nil)
>> > 02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file
>> > </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
>> > |ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log file
>> > /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
>> > (2305:20709560)!
>> > 02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line
>> > 3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
>> > 02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find
>> > LogFileMonitor object for log file
>> > /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
>> > (2305:20709560)!
>> > 02/01/18 12:01:32 All log monitors:
>> > 02/01/18 12:01:32   File ID: 2305:20709395
>> > 02/01/18 12:01:32     Monitor: 0x19620a0
>> > 02/01/18 12:01:32     Log file:
>> > </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
>> > 02/01/18 12:01:32     refCount: 1
>> > 02/01/18 12:01:32     lastLogEvent: (nil)
>> > 02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file
>> > </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
>> > |ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log file
>> > /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
>> > (2305:20709560)!
>> > 02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line
>> > 3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
>> >
>> >
>> > Does this make any sense?
>> >
>> > --Rami
>> >
>> > On Wed, Jan 31, 2018 at 4:15 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx>
>> > wrote:
>> >> This is the log file of the POST script:
>> >>
>> >> Copying 220735 bytes
>> >> file:///home/ramiv/nsides_IN/condor/localTEST/results_121_dnn.tgz =>
>> >> gsiftp://gftp.t2.ucsd.edu/hadoop/osg/ColumbiaTBI/ramiv/nsides_output_IN/results_121_dnn.tgz
>> >>
>> >> and the file is there. The copy is the only line in the POST script
>> >> apart from removing the tarballs and an "exit 0". I'll try removing the
>> >> POST script and get back to you.
>> >>
>> >> --Rami
>> >>
>> >> On Wed, Jan 31, 2018 at 3:32 PM, Mark Coatsworth
>> >> <coatsworth@xxxxxxxxxxx> wrote:
>> >>> Hi Rami,
>> >>>
>> >>> Something is definitely fishy here. For some reason DAGMan still
>> >>> thinks a
>> >>> job is queued, which could explain why it keeps running, but that
>> >>> shouldn't
>> >>> be possible based on the structure of your DAG.
>> >>>
>> >>> Can you confirm that the files your POST script is sending to Hadoop
>> >>> are arriving there properly? Alternatively, try removing the POST
>> >>> script and see if the problem goes away.
>> >>>
>> >>> If that doesn't work, could you send me your .dag file, your job
>> >>> submit
>> >>> files, and the post script? It'll be easier for me to debug if I can
>> >>> see
>> >>> them.
>> >>>
>> >>> Mark
>> >>>
>> >>> On Wed, Jan 31, 2018 at 1:49 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx>
>> >>> wrote:
>> >>>>
>> >>>> Ah okay, my POST script only deletes some tarballs. I have been using
>> >>>> a test job and the problem occurs 100% of the time. I've submitted
>> >>>> this job around 100 times so far. Since it hangs, I'm reluctant to
>> >>>> start production.
>> >>>>
>> >>>> --Rami
>> >>>>
>> >>>> On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum
>> >>>> <tannenba@xxxxxxxxxxx>
>> >>>> wrote:
>> >>>> > On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>> >>>> >> Hi Todd,
>> >>>> >>
>> >>>> >> Thanks for the reply!
>> >>>> >>
>> >>>> >> condor_version:
>> >>>> >> $CondorVersion: 8.6.8 Oct 31 2017 $
>> >>>> >> $CondorPlatform: X86_64-CentOS_6.9 $
>> >>>> >>
>> >>>> >> OS:
>> >>>> >> Description: CentOS release 6.9 (Final)
>> >>>> >>
>> >>>> >> So your question about editing/removing files might be the answer:
>> >>>> >> my POST script transfers the resulting tarballs from the preceding
>> >>>> >> steps to hadoop and then removes them. I do this because I will be
>> >>>> >> submitting hundreds of these and don't want to keep the output
>> >>>> >> around in the scratch directory. If that is indeed what's causing
>> >>>> >> the issue, how can I remove files safely?
>> >>>> >>
>> >>>> >
>> >>>> > Having your POST script move job output someplace should be fine. I
>> >>>> > asked because I was concerned that your POST script may actually
>> >>>> > move (or remove) files that DAGMan itself needs to reference, like
>> >>>> > your .dag.nodes.log and other files created by DAGMan. Files created
>> >>>> > by DAGMan should not be moved or removed until DAGMan exits.
>> >>>> >
>> >>>> > Another question: Does this problem always occur (i.e. DAGMan
>> >>>> > always gets stuck with this workflow), or only occasionally? If the
>> >>>> > latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000?
>> >>>> >
>> >>>> > Thanks,
>> >>>> > Todd
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >> --Rami
>> >>>> >>
>> >>>> >> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum
>> >>>> >> <tannenba@xxxxxxxxxxx>
>> >>>> >> wrote:
>> >>>> >>> Hi Rami,
>> >>>> >>>
>> >>>> >>> I see what you mean; the below certainly looks strange to me. I
>> >>>> >>> will ask one of our DAGMan experts here to take a look and report
>> >>>> >>> back to the list.
>> >>>> >>>
>> >>>> >>> In the meantime, could you tell us what version of HTCondor you
>> >>>> >>> are
>> >>>> >>> using
>> >>>> >>> (i.e. output of condor_version on your submit machine), and on
>> >>>> >>> what
>> >>>> >>> operating system?
>> >>>> >>>
>> >>>> >>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited
>> >>>> >>> or
>> >>>> >>> removed
>> >>>> >>> during the test? Is this subdirectory on a shared file system?
>> >>>> >>>
>> >>>> >>> Thanks
>> >>>> >>> Todd
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>> >>>> >>>>
>> >>>> >>>> Hi,
>> >>>> >>>>
>> >>>> >>>> I am running a DAG job with several components, all of which
>> >>>> >>>> seem to run fine, but then the dag job itself never stops running
>> >>>> >>>> even though all of the jobs were successful.
>> >>>> >>>>
>> >>>> >>>> Here is an excerpt from the .dag.nodes.log file:
>> >>>> >>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>> >>>> >>>>    (1) Normal termination (return value 0)
>> >>>> >>>>        Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>> >>>> >>>>        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>> >>>> >>>>        Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
>> >>>> >>>>        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>> >>>> >>>>    0  -  Run Bytes Sent By Job
>> >>>> >>>>    0  -  Run Bytes Received By Job
>> >>>> >>>>    0  -  Total Bytes Sent By Job
>> >>>> >>>>    0  -  Total Bytes Received By Job
>> >>>> >>>>    Partitionable Resources :    Usage  Request  Allocated
>> >>>> >>>>       Cpus                 :                  1          1
>> >>>> >>>>       Disk (KB)            :   250000        1    6936266
>> >>>> >>>>       Memory (MB)          :      583     2048       2048
>> >>>> >>>> ...
>> >>>> >>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>> >>>> >>>>    (1) Normal termination (return value 0)
>> >>>> >>>>    DAG Node: C
>> >>>> >>>>
>> >>>> >>>> ...and here is an excerpt from the .dagman.out file:
>> >>>> >>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>> >>>> >>>> 01/30/18 15:29:22 Node C job completed
>> >>>> >>>> 01/30/18 15:29:22 Running POST script of Node C...
>> >>>> >>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by
>> >>>> >>>> DAGMan; see gittrac #4987, #5031)
>> >>>> >>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>> >>>> >>>> 01/30/18 15:29:22 Of 4 nodes total:
>> >>>> >>>> 01/30/18 15:29:22      Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>> >>>> >>>> 01/30/18 15:29:22       ===     ===      ===     ===     ===        ===      ===
>> >>>> >>>> 01/30/18 15:29:22         3       0        0       1       0          0        0
>> >>>> >>>> 01/30/18 15:29:22 0 job proc(s) currently held
>> >>>> >>>> 01/30/18 15:29:24 Initializing user log writer for
>> >>>> >>>> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log, (69538.0.0)
>> >>>> >>>> 01/30/18 15:39:23 601 seconds since last log event
>> >>>> >>>> 01/30/18 15:39:23 Pending DAG nodes:
>> >>>> >>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>> >>>> >>>>
>> >>>> >>>> The dag control file only has 4 jobs structured like this:
>> >>>> >>>> PARENT B0 B1 B2 CHILD C
>> >>>> >>>>
>> >>>> >>>> What could cause my job to be stuck in POSTRUN even though it
>> >>>> >>>> runs
>> >>>> >>>> successfully with the proper exit code?
>> >>>> >>>>
>> >>>> >>>> Thanks for any help.
>> >>>> >>>>
>> >>>> >>>> --Rami
>> >>>> >
>> >>>> >
>> >>>> > --
>> >>>> > Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
>> >>>> > Center for High Throughput Computing    Department of Computer Sciences
>> >>>> > HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
>> >>>> > Phone: (608) 263-7132                   Madison, WI 53706-1685
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Mark Coatsworth
>> >>> Systems Programmer
>> >>> Center for High Throughput Computing
>> >>> Department of Computer Sciences
>> >>> University of Wisconsin-Madison
>> >>> +1 608 206 4703
>> >>>
>
>
>
>
> --
> Mark Coatsworth
> Systems Programmer
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin-Madison
> +1 608 206 4703
>



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
+1 608 206 4703