Re: [HTCondor-users] dagman job won't finish



On 1/31/2018 1:10 PM, Rami Vanguri wrote:
> Hi Todd,
> 
> Thanks for the reply!
> 
> condor_version:
> $CondorVersion: 8.6.8 Oct 31 2017 $
> $CondorPlatform: X86_64-CentOS_6.9 $
> 
> OS:
> Description: CentOS release 6.9 (Final)
> 
> Your question about editing/removing files might be the answer: my
> POST script transfers the resulting tarballs from the preceding steps
> to Hadoop and then removes them.  I do this because I will be submitting
> hundreds of these and don't want to keep the output around in the
> scratch directory.  If that is indeed what's causing the issue, how
> can I remove files safely?
> 

Having your POST script move job output someplace else should be fine.  I asked because I was concerned that your POST script might also be moving (or removing) files that DAGMan itself needs to reference, such as your .dag.nodes.log and the other files DAGMan creates.  Those files should not be moved or removed until DAGMan exits.
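
As an illustration only, here is a minimal Python sketch of a POST-script cleanup step that deletes job tarballs while leaving DAGMan's files alone. It assumes the tarballs end in .tar.gz and that every file DAGMan creates is named after the DAG file (as with workflow_localTEST.dag.nodes.log in this thread). The function names are hypothetical, and the transfer to Hadoop is left out entirely.

```python
#!/usr/bin/env python3
"""Sketch of a POST-script cleanup step that leaves DAGMan's own files alone."""
import fnmatch
import os


def is_dagman_file(name, dag_file):
    # Assumption: DAGMan names the files it creates after the DAG file
    # (e.g. workflow.dag.nodes.log, workflow.dag.dagman.out), so anything
    # starting with the DAG file name is treated as off limits.
    return name.startswith(dag_file)


def remove_job_tarballs(directory, dag_file, pattern="*.tar.gz"):
    """Delete plain files matching `pattern`, skipping DAGMan files.

    Returns the sorted list of file names that were removed.
    """
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # leave subdirectories untouched
        if is_dagman_file(name, dag_file):
            continue  # never touch the DAG file or DAGMan's logs
        if fnmatch.fnmatch(name, pattern):
            os.remove(path)
            removed.append(name)
    return removed
```

A POST script could call remove_job_tarballs(".", "workflow_localTEST.dag") once the transfer has succeeded. Matching an explicit pattern, rather than clearing the whole directory, is what keeps the .dag.nodes.log and related files in place while DAGMan is still running.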

Another question:  Does this problem occur consistently (i.e. DAGMan always gets stuck with this workflow), or only occasionally?  If the latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000?

Thanks,
Todd




> --Rami
> 
> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>> Hi Rami,
>>
>> I see what you mean, the below certainly looks strange to me.  I will ask
>> one of our DAGMan experts here to take a look and report back to the list.
>>
>> In the meantime, could you tell us what version of HTCondor you are using
>> (i.e. output of condor_version on your submit machine), and on what
>> operating system?
>>
>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or removed
>> during the test?  Is this subdirectory on a shared file system?
>>
>> Thanks
>> Todd
>>
>>
>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>>
>>> Hi,
>>>
>>> I am running a DAG job that has several components which seem to all
>>> run fine, but then the actual dag job never stops running even though
>>> all of the jobs were successful.
>>>
>>> Here is an excerpt from the .dag.nodes.log file:
>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>>           (1) Normal termination (return value 0)
>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>>>           0  -  Run Bytes Sent By Job
>>>           0  -  Run Bytes Received By Job
>>>           0  -  Total Bytes Sent By Job
>>>           0  -  Total Bytes Received By Job
>>>           Partitionable Resources :    Usage  Request Allocated
>>>              Cpus                 :                 1         1
>>>              Disk (KB)            :   250000        1   6936266
>>>              Memory (MB)          :      583     2048      2048
>>> ...
>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>>           (1) Normal termination (return value 0)
>>>       DAG Node: C
>>>
>>> ..and here is an excerpt from the .dagman.out file:
>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>> 01/30/18 15:29:22 Node C job completed
>>> 01/30/18 15:29:22 Running POST script of Node C...
>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>> 01/30/18 15:29:22 Of 4 nodes total:
>>> 01/30/18 15:29:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>>> 01/30/18 15:29:22   ===     ===      ===     ===     ===        ===      ===
>>> 01/30/18 15:29:22     3       0        0       1       0          0        0
>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>> 01/30/18 15:29:24 Initializing user log writer for /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log, (69538.0.0)
>>> 01/30/18 15:39:23 601 seconds since last log event
>>> 01/30/18 15:39:23 Pending DAG nodes:
>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>>
>>> The dag control file only has 4 jobs structured like this:
>>> PARENT B0 B1 B2 CHILD C
>>>
>>> What could cause my job to be stuck in POSTRUN even though it runs
>>> successfully with the proper exit code?
>>>
>>> Thanks for any help.
>>>
>>> --Rami


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685