
Re: [HTCondor-users] dagman job won't finish



Sorry for the extra email, but the dagman job keeps running *and* the
B node jobs get resubmitted.  In other words, not only does the dagman
job never end, it also submits the B nodes, waits for them to finish,
then resubmits them in a loop.
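
In case it helps anyone reproduce this, something like the following
shows the loop (the cluster id is just a placeholder):

  # node jobs show up grouped under the dagman job; the B nodes
  # reappear here after each resubmit cycle
  condor_q -dag -nobatch

  # removing the dagman job itself is what stops the loop
  condor_rm <dagman-cluster-id>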

--Rami

On Thu, Feb 1, 2018 at 3:04 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
> I removed the POST script and the dag job still stays in the condor
> queue.  However, I noticed a different error in the .dag.dagman.out:
>
> 02/01/18 12:01:32 Node C job proc (69659.0.0) completed successfully.
> 02/01/18 12:01:32 Node C job completed
> 02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
> 02/01/18 12:01:32 Of 4 nodes total:
> 02/01/18 12:01:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 02/01/18 12:01:32   ===     ===      ===     ===     ===        ===      ===
> 02/01/18 12:01:32     4       0        0       0       0          0        0
> 02/01/18 12:01:32 0 job proc(s) currently held
> 02/01/18 12:01:32 All jobs Completed!
> 02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxJobs limit (0)
> 02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxIdle limit (1000)
> 02/01/18 12:01:32 Note: 0 total job deferrals because of node category throttles
> 02/01/18 12:01:32 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER
> 02/01/18 12:01:32 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER
> 02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
> 02/01/18 12:01:32 Of 4 nodes total:
> 02/01/18 12:01:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 02/01/18 12:01:32   ===     ===      ===     ===     ===        ===      ===
> 02/01/18 12:01:32     4       0        0       0       0          0        0
> 02/01/18 12:01:32 0 job proc(s) currently held
> 02/01/18 12:01:32 Wrote metrics file workflow_localTEST.dag.metrics.
> 02/01/18 12:01:32 Metrics not sent because of PEGASUS_METRICS or CONDOR_DEVELOPERS setting.
> 02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find LogFileMonitor object for log file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log (2305:20709560)!
> 02/01/18 12:01:32 All log monitors:
> 02/01/18 12:01:32   File ID: 2305:20709395
> 02/01/18 12:01:32     Monitor: 0x19620a0
> 02/01/18 12:01:32     Log file: </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
> 02/01/18 12:01:32     refCount: 1
> 02/01/18 12:01:32     lastLogEvent: (nil)
> 02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>|ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log (2305:20709560)!
> 02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line 3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
> 02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find LogFileMonitor object for log file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log (2305:20709560)!
> 02/01/18 12:01:32 All log monitors:
> 02/01/18 12:01:32   File ID: 2305:20709395
> 02/01/18 12:01:32     Monitor: 0x19620a0
> 02/01/18 12:01:32     Log file: </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
> 02/01/18 12:01:32     refCount: 1
> 02/01/18 12:01:32     lastLogEvent: (nil)
> 02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file </home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>|ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log (2305:20709560)!
> 02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line 3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
>
>
> Does this make any sense?  I notice the file ID in the error
> (2305:20709560) doesn't match the file ID being monitored
> (2305:20709395), even though the path is the same.
>
> --Rami
>
> On Wed, Jan 31, 2018 at 4:15 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
>> This is the log file of the POST script:
>>
>> Copying 220735 bytes
>> file:///home/ramiv/nsides_IN/condor/localTEST/results_121_dnn.tgz => gsiftp://gftp.t2.ucsd.edu/hadoop/osg/ColumbiaTBI/ramiv/nsides_output_IN/results_121_dnn.tgz
>>
>> and the file is there.  That copy is the only thing the POST script
>> does, apart from removing the tarballs and an "exit 0".  I'll try
>> removing the POST script and get back to you.
>>
>> --Rami
>>
>> On Wed, Jan 31, 2018 at 3:32 PM, Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
>>> Hi Rami,
>>>
>>> Something is definitely fishy here. For some reason DAGMan still thinks a
>>> job is queued, which could explain why it keeps running, but that shouldn't
>>> be possible based on the structure of your DAG.
>>>
>>> Can you confirm that the files your POST script is sending to Hadoop are
>>> arriving there properly? Alternatively, try removing the POST script and see
>>> if the problem goes away.
>>>
>>> If that doesn't work, could you send me your .dag file, your job submit
>>> files, and the post script? It'll be easier for me to debug if I can see
>>> them.
>>>
>>> Mark
>>>
>>> On Wed, Jan 31, 2018 at 1:49 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx>
>>> wrote:
>>>>
>>>> Ah okay, my POST script only deletes some tarballs. I have been using
>>>> a test job and the problem occurs 100% of the time.  I've submitted
>>>> this job around 100 times so far.  Since it hangs, I'm reluctant to
>>>> start production.
>>>>
>>>> --Rami
>>>>
>>>> On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>>> wrote:
>>>> > On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>>>> >> Hi Todd,
>>>> >>
>>>> >> Thanks for the reply!
>>>> >>
>>>> >> condor_version:
>>>> >> $CondorVersion: 8.6.8 Oct 31 2017 $
>>>> >> $CondorPlatform: X86_64-CentOS_6.9 $
>>>> >>
>>>> >> OS:
>>>> >> Description: CentOS release 6.9 (Final)
>>>> >>
>>>> >> So your question about editing/removing files might be the answer:
>>>> >> my POST script transfers the resulting tarballs from the preceding
>>>> >> steps to Hadoop and then removes them.  I do this because I will be
>>>> >> submitting hundreds of these workflows and don't want to keep the
>>>> >> output around in the scratch directory.  If that is indeed what's
>>>> >> causing the issue, how can I remove files safely?
>>>> >>
>>>> >
>>>> > Having your POST script move job output someplace should be fine.  I
>>>> > asked because I was concerned that your POST script may actually move
>>>> > (or remove) files that DAGMan itself needs to reference, like your
>>>> > .dag.nodes.log and the other files DAGMan creates.  None of those
>>>> > files should be moved or removed until DAGMan exits.
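>>>> >
>>>> > For example (just a sketch of the distinction), in a POST script:
>>>> >
>>>> >   # fine: the job's own output artifacts
>>>> >   rm -f results_*.tgz
>>>> >
>>>> >   # NOT fine while DAGMan is still running: anything DAGMan created,
>>>> >   # e.g. workflow_localTEST.dag.nodes.log, *.dag.dagman.out,
>>>> >   # *.dag.dagman.log, *.dag.metrics
>>>> >   # rm -f *.dag.*        # <-- do not do this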
>>>> >
>>>> > Another question:  Does this problem occur consistently (i.e. DAGMan
>>>> > always gets stuck with this workflow), or only occasionally?  If the
>>>> > latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000?
>>>> >
>>>> > Thanks,
>>>> > Todd
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >> --Rami
>>>> >>
>>>> >> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>>> >> wrote:
>>>> >>> Hi Rami,
>>>> >>>
>>>> >>> I see what you mean; the below certainly looks strange to me.  I
>>>> >>> will ask one of our DAGMan experts here to take a look and report
>>>> >>> back to the list.
>>>> >>>
>>>> >>> In the meantime, could you tell us what version of HTCondor you are
>>>> >>> using (i.e. the output of condor_version on your submit machine),
>>>> >>> and on what operating system?
>>>> >>>
>>>> >>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or
>>>> >>> removed during the test?  Is this subdirectory on a shared file
>>>> >>> system?
>>>> >>>
>>>> >>> Thanks
>>>> >>> Todd
>>>> >>>
>>>> >>>
>>>> >>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I am running a DAG job with several component jobs, which all seem
>>>> >>>> to run fine, but the actual dag job never stops running even though
>>>> >>>> all of the node jobs completed successfully.
>>>> >>>>
>>>> >>>> Here is an excerpt from the .dag.nodes.log file:
>>>> >>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>>> >>>>           (1) Normal termination (return value 0)
>>>> >>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>>>> >>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>> >>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
>>>> >>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>>>> >>>>           0  -  Run Bytes Sent By Job
>>>> >>>>           0  -  Run Bytes Received By Job
>>>> >>>>           0  -  Total Bytes Sent By Job
>>>> >>>>           0  -  Total Bytes Received By Job
>>>> >>>>           Partitionable Resources :    Usage  Request Allocated
>>>> >>>>              Cpus                 :                 1         1
>>>> >>>>              Disk (KB)            :   250000        1   6936266
>>>> >>>>              Memory (MB)          :      583     2048      2048
>>>> >>>> ...
>>>> >>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>>> >>>>           (1) Normal termination (return value 0)
>>>> >>>>       DAG Node: C
>>>> >>>>
>>>> >>>> ...and here is an excerpt from the .dagman.out file:
>>>> >>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>>> >>>> 01/30/18 15:29:22 Node C job completed
>>>> >>>> 01/30/18 15:29:22 Running POST script of Node C...
>>>> >>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
>>>> >>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>>> >>>> 01/30/18 15:29:22 Of 4 nodes total:
>>>> >>>> 01/30/18 15:29:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>>>> >>>> 01/30/18 15:29:22   ===     ===      ===     ===     ===        ===      ===
>>>> >>>> 01/30/18 15:29:22     3       0        0       1       0          0        0
>>>> >>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>>> >>>> 01/30/18 15:29:24 Initializing user log writer for /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log, (69538.0.0)
>>>> >>>> 01/30/18 15:39:23 601 seconds since last log event
>>>> >>>> 01/30/18 15:39:23 Pending DAG nodes:
>>>> >>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>>> >>>>
>>>> >>>> The dag input file only has 4 nodes, structured like this:
>>>> >>>> PARENT B0 B1 B2 CHILD C
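>>>> >>>>
>>>> >>>> For reference, the whole dag file is essentially the sketch below
>>>> >>>> (the submit file and script names are made up):
>>>> >>>>
>>>> >>>>   # workflow_localTEST.dag (sketch)
>>>> >>>>   JOB B0 nodeB.sub
>>>> >>>>   JOB B1 nodeB.sub
>>>> >>>>   JOB B2 nodeB.sub
>>>> >>>>   JOB C  nodeC.sub
>>>> >>>>   SCRIPT POST C post.sh
>>>> >>>>   PARENT B0 B1 B2 CHILD C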
>>>> >>>>
>>>> >>>> What could cause the node to be stuck in POSTRUN even though the
>>>> >>>> POST script runs successfully with the proper exit code?
>>>> >>>>
>>>> >>>> Thanks for any help.
>>>> >>>>
>>>> >>>> --Rami
>>>> >
>>>> >
>>>> > --
>>>> > Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>>>> > Center for High Throughput Computing   Department of Computer Sciences
>>>> > HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>>>> > Phone: (608) 263-7132                  Madison, WI 53706-1685
>>>
>>> --
>>> Mark Coatsworth
>>> Systems Programmer
>>> Center for High Throughput Computing
>>> Department of Computer Sciences
>>> University of Wisconsin-Madison
>>> +1 608 206 4703
>>>