
Re: [HTCondor-users] dagman job won't finish



Hi Rami,

Thanks for sending this. Everything here makes sense, but it's not what I was expecting. I certainly didn't think removing the POST script would send the DAG into a loop!!

Something strange is happening with your log files, and since DAGMan uses these to control the flow of the DAG, that seems to be the culprit here. I've tried reproducing the problem with a similarly structured DAG on my local system, but everything works correctly there.

I've tried cross-referencing your output messages against our codebase, but there isn't enough information to find anything conclusive. Could you try adding the following setting to your condor config:

DAGMAN_DEBUG = D_ALL

Then could you try submitting your DAG again, and send me both the resulting .dagman.out and .nodes.log files in full? I'm hoping that debugging at maximum verbosity will reveal an error message that explains why the log file access is going wrong.
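
For reference, something like this should do it (just a sketch; the config file path below is an assumption, so use whichever config file your submit host actually reads):

    echo 'DAGMAN_DEBUG = D_ALL' >> /etc/condor/condor_config.local
    condor_config_val DAGMAN_DEBUG                    # should now print D_ALL
    condor_submit_dag -force workflow_localTEST.dag   # -force overwrites the old .dagman.out etc.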

Mark



On Thu, Feb 1, 2018 at 2:15 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
Sorry for the extra email, but the dagman job keeps running *and* the
B node jobs get resubmitted to run again. In other words, not only
does the dagman job never end, but it submits the B nodes, waits for
them to finish, then resubmits them in a loop.

--Rami

On Thu, Feb 1, 2018 at 3:04 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
> I removed the POST script and the dag job still hangs out in condor.
> However, I noticed a different error in the dag.dagman.out:
>
> 02/01/18 12:01:32 Node C job proc (69659.0.0) completed successfully.
> 02/01/18 12:01:32 Node C job completed
> 02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
> 02/01/18 12:01:32 Of 4 nodes total:
> 02/01/18 12:01:32   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 02/01/18 12:01:32    ===     ===      ===     ===     ===        ===      ===
> 02/01/18 12:01:32      4       0        0       0       0          0        0
> 02/01/18 12:01:32 0 job proc(s) currently held
> 02/01/18 12:01:32 All jobs Completed!
> 02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxJobs limit (0)
> 02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxIdle limit (1000)
> 02/01/18 12:01:32 Note: 0 total job deferrals because of node category throttles
> 02/01/18 12:01:32 Note: 0 total PRE script deferrals because of
> -MaxPre limit (20) or DEFER
> 02/01/18 12:01:32 Note: 0 total POST script deferrals because of
> -MaxPost limit (20) or DEFER
> 02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
> 02/01/18 12:01:32 Of 4 nodes total:
> 02/01/18 12:01:32   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 02/01/18 12:01:32    ===     ===      ===     ===     ===        ===      ===
> 02/01/18 12:01:32      4       0        0       0       0          0        0
> 02/01/18 12:01:32 0 job proc(s) currently held
> 02/01/18 12:01:32 Wrote metrics file workflow_localTEST.dag.metrics.
> 02/01/18 12:01:32 Metrics not sent because of PEGASUS_METRICS or
> CONDOR_DEVELOPERS setting.
> 02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find
> LogFileMonitor object for log file
> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
> (2305:20709560)!
> 02/01/18 12:01:32 All log monitors:
> 02/01/18 12:01:32   File ID: 2305:20709395
> 02/01/18 12:01:32     Monitor: 0x19620a0
> 02/01/18 12:01:32     Log file:
> </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
> 02/01/18 12:01:32     refCount: 1
> 02/01/18 12:01:32     lastLogEvent: (nil)
> 02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file
> </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
> |ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log
> file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
> (2305:20709560)!
> 02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line
> 3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
> 02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find
> LogFileMonitor object for log file
> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
> (2305:20709560)!
> 02/01/18 12:01:32 All log monitors:
> 02/01/18 12:01:32   File ID: 2305:20709395
> 02/01/18 12:01:32     Monitor: 0x19620a0
> 02/01/18 12:01:32     Log file:
> </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
> 02/01/18 12:01:32     refCount: 1
> 02/01/18 12:01:32     lastLogEvent: (nil)
> 02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file
> </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
> |ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log
> file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
> (2305:20709560)!
> 02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line
> 3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
>
>
> Does this make any sense?
>
> --Rami
>
> On Wed, Jan 31, 2018 at 4:15 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
>> This is the log file of the POST script:
>>
>> Copying 220735 bytes
>> file:///home/ramiv/nsides_IN/condor/localTEST/results_121_dnn.tgz =>
>> gsiftp://gftp.t2.ucsd.edu/hadoop/osg/ColumbiaTBI/ramiv/nsides_output_IN/results_121_dnn.tgz
>>
>> and the file is there. That is the only line in the POST script
>> except for removing the tarballs and "exit 0". I'll try removing the
>> POST script and get back to you.
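>>
>> (Roughly, the whole script is just a transfer followed by a cleanup, i.e.
>> something of this shape; globus-url-copy here is a stand-in for whatever
>> gsiftp client the site uses, only the paths come from the log above:)
>>
>>     #!/bin/sh
>>     # push the result tarball to the Hadoop gsiftp endpoint
>>     globus-url-copy \
>>         file:///home/ramiv/nsides_IN/condor/localTEST/results_121_dnn.tgz \
>>         gsiftp://gftp.t2.ucsd.edu/hadoop/osg/ColumbiaTBI/ramiv/nsides_output_IN/results_121_dnn.tgz
>>     # remove the local tarballs so they don't pile up in the scratch directory
>>     rm -f /home/ramiv/nsides_IN/condor/localTEST/results_*.tgz
>>     exit 0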
>>
>> --Rami
>>
>> On Wed, Jan 31, 2018 at 3:32 PM, Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
>>> Hi Rami,
>>>
>>> Something is definitely fishy here. For some reason DAGMan still thinks a
>>> job is queued, which could explain why it keeps running, but that shouldn't
>>> be possible based on the structure of your DAG.
>>>
>>> Can you confirm that the files your POST script is sending to Hadoop are
>>> arriving there properly? Alternatively, try removing the POST script and see
>>> if the problem goes away.
>>>
>>> If that doesn't work, could you send me your .dag file, your job submit
>>> files, and the post script? It'll be easier for me to debug if I can see
>>> them.
>>>
>>> Mark
>>>
>>> On Wed, Jan 31, 2018 at 1:49 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx>
>>> wrote:
>>>>
>>>> Ah okay, my POST script only deletes some tarballs. I have been using
>>>> a test job and the problem occurs 100% of the time. I've submitted
>>>> this job around 100 times so far. Since it hangs I'm reluctant to
>>>> start production.
>>>>
>>>> --Rami
>>>>
>>>> On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>>> wrote:
>>>> > On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>>>> >> Hi Todd,
>>>> >>
>>>> >> Thanks for the reply!
>>>> >>
>>>> >> condor_version:
>>>> >> $CondorVersion: 8.6.8 Oct 31 2017 $
>>>> >> $CondorPlatform: X86_64-CentOS_6.9 $
>>>> >>
>>>> >> OS:
>>>> >> Description: CentOS release 6.9 (Final)
>>>> >>
>>>> >> So your question about editing/removing files might be the answer: my
>>>> >> POST script transfers (to hadoop) and removes the resulting tarballs
>>>> >> from the preceding steps. I do this because I will be submitting
>>>> >> hundreds of these and don't want to keep the output around in the
>>>> >> scratch directory. If that is indeed what's causing the issue, how
>>>> >> can I remove files safely?
>>>> >>
>>>> >
>>>> > Having your POST script move job output someplace should be fine. I
>>>> > asked because I was concerned that your POST script may actually move (or
>>>> > remove) files that DAGMan itself needs to reference, like your
>>>> > .dag.nodes.log and the other files DAGMan creates. Those files should not
>>>> > be moved or removed until DAGMan exits.
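>>>> >
>>>> > For example (a hypothetical cleanup pattern, not necessarily your actual
>>>> > script), a POST script that only deletes the node's own output is safe:
>>>> >
>>>> >     # safe: remove only the tarballs the job produced
>>>> >     rm -f /home/ramiv/nsides_IN/condor/localTEST/results_*.tgz
>>>> >     # not safe while DAGMan is running: anything DAGMan itself writes, e.g.
>>>> >     # workflow_localTEST.dag.nodes.log, .dagman.out, .dag.lock, .dag.metrics
>>>> >     exit 0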
>>>> >
>>>> > Another question: Does this problem occur consistently (i.e. DAGMan always
>>>> > gets stuck with this workflow), or only occasionally? If the latter, are we
>>>> > talking 1 in every 5 workflow runs, or 1 in 50,000?
>>>> >
>>>> > Thanks,
>>>> > Todd
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >> --Rami
>>>> >>
>>>> >> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>>> >> wrote:
>>>> >>> Hi Rami,
>>>> >>>
>>>> >>> I see what you mean; the below certainly looks strange to me. I will
>>>> >>> ask
>>>> >>> one of our DAGMan experts here to take a look and report back to the
>>>> >>> list.
>>>> >>>
>>>> >>> In the meantime, could you tell us what version of HTCondor you are
>>>> >>> using
>>>> >>> (i.e. output of condor_version on your submit machine), and on what
>>>> >>> operating system?
>>>> >>>
>>>> >>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or
>>>> >>> removed
>>>> >>> during the test? Is this subdirectory on a shared file system?
>>>> >>>
>>>> >>> Thanks
>>>> >>> Todd
>>>> >>>
>>>> >>>
>>>> >>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I am running a DAG job that has several components which seem to all
>>>> >>>> run fine, but then the actual dag job never stops running even though
>>>> >>>> all of the jobs were successful.
>>>> >>>>
>>>> >>>> Here is an excerpt from the .dag.nodes.log file:
>>>> >>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>>> >>>>           (1) Normal termination (return value 0)
>>>> >>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>>>> >>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>> >>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
>>>> >>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>>>> >>>>           0  -  Run Bytes Sent By Job
>>>> >>>>           0  -  Run Bytes Received By Job
>>>> >>>>           0  -  Total Bytes Sent By Job
>>>> >>>>           0  -  Total Bytes Received By Job
>>>> >>>>           Partitionable Resources :    Usage  Request  Allocated
>>>> >>>>              Cpus                 :                 1          1
>>>> >>>>              Disk (KB)            :   250000        1    6936266
>>>> >>>>              Memory (MB)          :      583     2048       2048
>>>> >>>> ...
>>>> >>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>>> >>>>           (1) Normal termination (return value 0)
>>>> >>>>       DAG Node: C
>>>> >>>>
>>>> >>>> ..and here is an excerpt from the .dagman.out file:
>>>> >>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>>> >>>> 01/30/18 15:29:22 Node C job completed
>>>> >>>> 01/30/18 15:29:22 Running POST script of Node C...
>>>> >>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by
>>>> >>>> DAGMan; see gittrac #4987, #5031)
>>>> >>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>>> >>>> 01/30/18 15:29:22 Of 4 nodes total:
>>>> >>>> 01/30/18 15:29:22   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>>>> >>>> 01/30/18 15:29:22    ===     ===      ===     ===     ===        ===      ===
>>>> >>>> 01/30/18 15:29:22      3       0        0       1       0          0        0
>>>> >>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>>> >>>> 01/30/18 15:29:24 Initializing user log writer for
>>>> >>>>
>>>> >>>> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log,
>>>> >>>> (69538.0.0)
>>>> >>>> 01/30/18 15:39:23 601 seconds since last log event
>>>> >>>> 01/30/18 15:39:23 Pending DAG nodes:
>>>> >>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>>> >>>>
>>>> >>>> The dag control file only has 4 jobs structured like this:
>>>> >>>> PARENT B0 B1 B2 CHILD C
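>>>> >>>>
>>>> >>>> (For completeness, the whole file is presumably of this shape; the
>>>> >>>> submit-file and script names here are placeholders:)
>>>> >>>>
>>>> >>>>   JOB B0 B0.sub
>>>> >>>>   JOB B1 B1.sub
>>>> >>>>   JOB B2 B2.sub
>>>> >>>>   JOB C  C.sub
>>>> >>>>   SCRIPT POST C post.sh
>>>> >>>>   PARENT B0 B1 B2 CHILD C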
>>>> >>>>
>>>> >>>> What could cause my job to be stuck in POSTRUN even though it runs
>>>> >>>> successfully with the proper exit code?
>>>> >>>>
>>>> >>>> Thanks for any help.
>>>> >>>>
>>>> >>>> --Rami
>>>> >
>>>> >
>>>> > --
>>>> > Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>>>> > Center for High Throughput Computing   Department of Computer Sciences
>>>> > HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>>>> > Phone: (608) 263-7132                  Madison, WI 53706-1685
>>>
>>>
>>>
>>>
>>> --
>>> Mark Coatsworth
>>> Systems Programmer
>>> Center for High Throughput Computing
>>> Department of Computer Sciences
>>> University of Wisconsin-Madison
>>> +1 608 206 4703
>>>
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
+1 608 206 4703