
Re: [HTCondor-users] dagman job won't finish



I removed the POST script and the DAG job still hangs around in the condor
queue. However, I noticed a different error in the dag.dagman.out:

02/01/18 12:01:32 Node C job proc (69659.0.0) completed successfully.
02/01/18 12:01:32 Node C job completed
02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
02/01/18 12:01:32 Of 4 nodes total:
02/01/18 12:01:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
02/01/18 12:01:32   ===     ===      ===     ===     ===        ===      ===
02/01/18 12:01:32     4       0        0       0       0          0        0
02/01/18 12:01:32 0 job proc(s) currently held
02/01/18 12:01:32 All jobs Completed!
02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxJobs limit (0)
02/01/18 12:01:32 Note: 0 total job deferrals because of -MaxIdle limit (1000)
02/01/18 12:01:32 Note: 0 total job deferrals because of node category throttles
02/01/18 12:01:32 Note: 0 total PRE script deferrals because of
-MaxPre limit (20) or DEFER
02/01/18 12:01:32 Note: 0 total POST script deferrals because of
-MaxPost limit (20) or DEFER
02/01/18 12:01:32 DAG status: 0 (DAG_STATUS_OK)
02/01/18 12:01:32 Of 4 nodes total:
02/01/18 12:01:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
02/01/18 12:01:32   ===     ===      ===     ===     ===        ===      ===
02/01/18 12:01:32     4       0        0       0       0          0        0
02/01/18 12:01:32 0 job proc(s) currently held
02/01/18 12:01:32 Wrote metrics file workflow_localTEST.dag.metrics.
02/01/18 12:01:32 Metrics not sent because of PEGASUS_METRICS or
CONDOR_DEVELOPERS setting.
02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find
LogFileMonitor object for log file
/home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
(2305:20709560)!
02/01/18 12:01:32 All log monitors:
02/01/18 12:01:32   File ID: 2305:20709395
02/01/18 12:01:32     Monitor: 0x19620a0
02/01/18 12:01:32     Log file:
</home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
02/01/18 12:01:32     refCount: 1
02/01/18 12:01:32     lastLogEvent: (nil)
02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file
</home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
|ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log
file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
(2305:20709560)!
02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line
3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp
02/01/18 12:01:32 ReadMultipleUserLogs error: Didn't find
LogFileMonitor object for log file
/home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
(2305:20709560)!
02/01/18 12:01:32 All log monitors:
02/01/18 12:01:32   File ID: 2305:20709395
02/01/18 12:01:32     Monitor: 0x19620a0
02/01/18 12:01:32     Log file:
</home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
02/01/18 12:01:32     refCount: 1
02/01/18 12:01:32     lastLogEvent: (nil)
02/01/18 12:01:32 DAGMan::Job:8001:ERROR: Unable to unmonitor log file
</home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log>
|ReadMultipleUserLogs:9004:Didn't find LogFileMonitor object for log
file /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log
(2305:20709560)!
02/01/18 12:01:32 ERROR "Fatal log file monitoring error!" at line
3143 in file /builddir/build/BUILD/condor-8.6.8/src/condor_dagman/dag.cpp


Does this make any sense?

--Rami
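[Editorial note: the workflow discussed below is a four-node DAG, PARENT B0 B1 B2 CHILD C, with a POST script on node C. A minimal DAG control file with that shape might look like the following sketch; the submit file and script names are hypothetical, not taken from the actual workflow.]

```
# Hypothetical DAG file matching the structure described in the thread;
# submit file and script names are made up for illustration.
JOB B0 b0.sub
JOB B1 b1.sub
JOB B2 b2.sub
JOB C  c.sub
SCRIPT POST C post.sh
PARENT B0 B1 B2 CHILD C
```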

On Wed, Jan 31, 2018 at 4:15 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx> wrote:
> This is the log file of the POST script:
>
> Copying 220735 bytes
> file:///home/ramiv/nsides_IN/condor/localTEST/results_121_dnn.tgz =>
> gsiftp://gftp.t2.ucsd.edu/hadoop/osg/ColumbiaTBI/ramiv/nsides_output_IN/results_121_dnn.tgz
>
> and the file is there.  That is the only line in the POST script
> except for removing the tarballs and "exit 0".  I'll try removing the
> POST script and get back to you.
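[Editorial note: a POST script of the shape described above might look like the following sketch. This is a hypothetical reconstruction for illustration: the transfer command and destination URL are assumptions, not the poster's actual script. The important property, as discussed later in the thread, is that it removes only the job's own tarballs and never the files DAGMan itself still needs.]

```shell
#!/bin/sh
# Hypothetical reconstruction of the POST script described above.
# Transfer tool and destination URL are assumptions for illustration.
set -e

TARBALL="results_121_dnn.tgz"
DEST="gsiftp://gftp.t2.ucsd.edu/hadoop/osg/ColumbiaTBI/ramiv/nsides_output_IN/$TARBALL"

# Transfer step. The real script would run a gsiftp client here,
# something like:
#   globus-url-copy "file://$PWD/$TARBALL" "$DEST"
# This sketch only reports the copy so it is safe to dry-run.
echo "Copying $TARBALL => $DEST"

# Clean up only the job's own tarballs. Never remove or move files
# DAGMan itself reads (*.dag, *.dag.nodes.log, *.dagman.out) -- those
# must stay in place until DAGMan has exited.
rm -f ./*.tgz
```

The script described in the thread ends with "exit 0"; with `set -e`, falling off the end of this sketch exits 0 as well.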
>
> --Rami
>
> On Wed, Jan 31, 2018 at 3:32 PM, Mark Coatsworth <coatsworth@xxxxxxxxxxx> wrote:
>> Hi Rami,
>>
>> Something is definitely fishy here. For some reason DAGMan still thinks a
>> job is queued, which could explain why it keeps running, but that shouldn't
>> be possible based on the structure of your DAG.
>>
>> Can you confirm that the files your POST script is sending to Hadoop are
>> arriving there properly? Alternately, try removing the POST script and see
>> if the problem goes away.
>>
>> If that doesn't work, could you send me your .dag file, your job submit
>> files, and the post script? It'll be easier for me to debug if I can see
>> them.
>>
>> Mark
>>
>> On Wed, Jan 31, 2018 at 1:49 PM, Rami Vanguri <rami.vanguri@xxxxxxxxx>
>> wrote:
>>>
>>> Ah okay, my POST script only deletes some tarballs. I have been using
>>> a test job and the problem occurs 100% of the time.  I've submitted
>>> this job around 100 times so far.  Since it hangs I'm reluctant to
>>> start production.
>>>
>>> --Rami
>>>
>>> On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>> wrote:
>>> > On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>>> >> Hi Todd,
>>> >>
>>> >> Thanks for the reply!
>>> >>
>>> >> condor_version:
>>> >> $CondorVersion: 8.6.8 Oct 31 2017 $
>>> >> $CondorPlatform: X86_64-CentOS_6.9 $
>>> >>
>>> >> OS:
>>> >> Description: CentOS release 6.9 (Final)
>>> >>
>>> >> So your question about editing/removing files might be the answer: my
>>> >> POST script transfers the resulting tarballs from the preceding steps
>>> >> (to hadoop) and then removes them.  I do this because I will be
>>> >> submitting hundreds of these and don't want to keep the output around
>>> >> in the scratch directory.  If that is indeed what's causing the issue,
>>> >> how can I remove files safely?
>>> >>
>>> >
>>> > Having your POST script move job output someplace should be fine.  I
>>> > asked because I was concerned that your POST script may actually move (or
>>> > remove) files that DAGMan itself needs to reference, like your
>>> > .dag.nodes.log and other files created by DAGMan.  These files created
>>> > by DAGMan should not be moved/removed until DAGMan exits.
>>> >
>>> > Another question:  Does this problem consistently always occur (i.e.
>>> > DAGMan always gets stuck with this workflow), or only occasionally?  If the
>>> > latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000 ?
>>> >
>>> > Thanks,
>>> > Todd
>>> >
>>> >
>>> >
>>> >
>>> >> --Rami
>>> >>
>>> >> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>> >> wrote:
>>> >>> Hi Rami,
>>> >>>
>>> >>> I see what you mean, the below certainly looks strange to me.  I will
>>> >>> ask
>>> >>> one of our DAGMan experts here to take a look and report back to the
>>> >>> list.
>>> >>>
>>> >>> In the meantime, could you tell us what version of HTCondor you are
>>> >>> using
>>> >>> (i.e. output of condor_version on your submit machine), and on what
>>> >>> operating system?
>>> >>>
>>> >>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or
>>> >>> removed
>>> >>> during the test?  Is this subdirectory on a shared file system?
>>> >>>
>>> >>> Thanks
>>> >>> Todd
>>> >>>
>>> >>>
>>> >>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I am running a DAG job that has several components which seem to all
>>> >>>> run fine, but then the actual dag job never stops running even though
>>> >>>> all of the jobs were successful.
>>> >>>>
>>> >>>> Here is an excerpt from the .dag.nodes.log file:
>>> >>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>> >>>>           (1) Normal termination (return value 0)
>>> >>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>>> >>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>> >>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote
>>> >>>> Usage
>>> >>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local
>>> >>>> Usage
>>> >>>>           0  -  Run Bytes Sent By Job
>>> >>>>           0  -  Run Bytes Received By Job
>>> >>>>           0  -  Total Bytes Sent By Job
>>> >>>>           0  -  Total Bytes Received By Job
>>> >>>>           Partitionable Resources :    Usage  Request Allocated
>>> >>>>              Cpus                 :                 1         1
>>> >>>>              Disk (KB)            :   250000        1   6936266
>>> >>>>              Memory (MB)          :      583     2048      2048
>>> >>>> ...
>>> >>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>> >>>>           (1) Normal termination (return value 0)
>>> >>>>       DAG Node: C
>>> >>>>
>>> >>>> ..and here is an excerpt from the .dagman.out file:
>>> >>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>> >>>> 01/30/18 15:29:22 Node C job completed
>>> >>>> 01/30/18 15:29:22 Running POST script of Node C...
>>> >>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by
>>> >>>> DAGMan; see gittrac #4987, #5031)
>>> >>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>> >>>> 01/30/18 15:29:22 Of 4 nodes total:
>>> >>>> 01/30/18 15:29:22  Done     Pre   Queued    Post   Ready   Un-Ready
>>> >>>> Failed
>>> >>>> 01/30/18 15:29:22   ===     ===      ===     ===     ===        ===
>>> >>>> ===
>>> >>>> 01/30/18 15:29:22     3       0        0       1       0          0
>>> >>>> 0
>>> >>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>> >>>> 01/30/18 15:29:24 Initializing user log writer for
>>> >>>>
>>> >>>> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log,
>>> >>>> (69538.0.0)
>>> >>>> 01/30/18 15:39:23 601 seconds since last log event
>>> >>>> 01/30/18 15:39:23 Pending DAG nodes:
>>> >>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>> >>>>
>>> >>>> The dag control file only has 4 jobs structured like this:
>>> >>>> PARENT B0 B1 B2 CHILD C
>>> >>>>
>>> >>>> What could cause my job to be stuck in POSTRUN even though it runs
>>> >>>> successfully with the proper exit code?
>>> >>>>
>>> >>>> Thanks for any help.
>>> >>>>
>>> >>>> --Rami
>>> >
>>> >
>>> > --
>>> > Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>>> > Center for High Throughput Computing   Department of Computer Sciences
>>> > HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>>> > Phone: (608) 263-7132                  Madison, WI 53706-1685
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
>>> a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>
>>
>>
>>
>> --
>> Mark Coatsworth
>> Systems Programmer
>> Center for High Throughput Computing
>> Department of Computer Sciences
>> University of Wisconsin-Madison
>> +1 608 206 4703
>>