
Re: [HTCondor-users] dagman job won't finish



Ah okay, my POST script only deletes some tarballs. I have been using
a test job, and the problem occurs 100% of the time.  I've submitted
this job around 100 times so far.  Since it hangs, I'm reluctant to
start production.

--Rami

On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>> Hi Todd,
>>
>> Thanks for the reply!
>>
>> condor_version:
>> $CondorVersion: 8.6.8 Oct 31 2017 $
>> $CondorPlatform: X86_64-CentOS_6.9 $
>>
>> OS:
>> Description: CentOS release 6.9 (Final)
>>
>> So your question about editing/removing files might be the answer, my
>> POST script transfers (to hadoop) and removes the resulting tarballs
>> from the preceding steps.  I do this because I will be submitting
>> hundreds of these and don't want to keep the output around in the
>> scratch directory.  If that is indeed what's causing the issue, how
>> can I remove files safely?
>>
>
> Having your POST script move job output someplace should be fine.  I asked because I was concerned that your POST script might move (or remove) files that DAGMan itself needs to reference, such as your
> .dag.nodes.log and other files DAGMan creates.  Those files should not be moved or removed until DAGMan exits.
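
A minimal sketch of a cleanup POST script along those lines, in Python; the
tarball naming pattern, the HDFS destination path, and the script name are
hypothetical, and the node name is assumed to be passed in via DAGMan's
$JOB macro (e.g. SCRIPT POST C post.py $JOB).  The key property is that it
removes only the node's own tarballs and never touches DAGMan's bookkeeping
files:

#!/usr/bin/env python
# post.py (hypothetical): transfer this node's tarballs to HDFS, then
# delete the local copies.  It deliberately touches only files the node
# itself produced -- never DAGMan's own files (*.dag.nodes.log,
# *.dagman.out, etc.), which must stay in place until DAGMan exits.
import glob
import os
import subprocess
import sys

node = sys.argv[1]                    # e.g. "C", passed as $JOB by DAGMan
hdfs_dest = "/user/ramiv/nsides_IN"   # assumed HDFS destination

for tarball in glob.glob("output_%s_*.tar.gz" % node):  # assumed naming
    # Copy to HDFS first; remove the local file only if the transfer worked.
    rc = subprocess.call(["hdfs", "dfs", "-put", "-f", tarball, hdfs_dest])
    if rc != 0:
        sys.exit(rc)                  # a nonzero exit marks the node failed
    os.remove(tarball)

sys.exit(0)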
>
> Another question:  Does this problem always occur (i.e., DAGMan always gets stuck with this workflow), or only occasionally?  If the latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000?
>
> Thanks,
> Todd
>
>
>
>
>> --Rami
>>
>> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>> Hi Rami,
>>>
>>> I see what you mean; the output below certainly looks strange to me.  I will ask
>>> one of our DAGMan experts here to take a look and report back to the list.
>>>
>>> In the meantime, could you tell us what version of HTCondor you are using
>>> (i.e. output of condor_version on your submit machine), and on what
>>> operating system?
>>>
>>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or removed
>>> during the test?  Is this subdirectory on a shared file system?
>>>
>>> Thanks
>>> Todd
>>>
>>>
>>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am running a DAG job with several components that all seem to
>>>> run fine, but the DAG job itself never stops running even though
>>>> all of the node jobs completed successfully.
>>>>
>>>> Here is an excerpt from the .dag.nodes.log file:
>>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>>>           (1) Normal termination (return value 0)
>>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>>                   Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
>>>>                   Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>>>>           0  -  Run Bytes Sent By Job
>>>>           0  -  Run Bytes Received By Job
>>>>           0  -  Total Bytes Sent By Job
>>>>           0  -  Total Bytes Received By Job
>>>>           Partitionable Resources :    Usage  Request Allocated
>>>>              Cpus                 :                 1         1
>>>>              Disk (KB)            :   250000        1   6936266
>>>>              Memory (MB)          :      583     2048      2048
>>>> ...
>>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>>>           (1) Normal termination (return value 0)
>>>>       DAG Node: C
>>>>
>>>> ..and here is an excerpt from the .dagman.out file:
>>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>>> 01/30/18 15:29:22 Node C job completed
>>>> 01/30/18 15:29:22 Running POST script of Node C...
>>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by
>>>> DAGMan; see gittrac #4987, #5031)
>>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>>> 01/30/18 15:29:22 Of 4 nodes total:
>>>> 01/30/18 15:29:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>>>> 01/30/18 15:29:22   ===     ===      ===     ===     ===        ===      ===
>>>> 01/30/18 15:29:22     3       0        0       1       0          0        0
>>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>>> 01/30/18 15:29:24 Initializing user log writer for /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log, (69538.0.0)
>>>> 01/30/18 15:39:23 601 seconds since last log event
>>>> 01/30/18 15:39:23 Pending DAG nodes:
>>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>>>
>>>> The DAG control file has only 4 jobs, structured like this (a fuller
>>>> sketch of such a file appears after this message):
>>>> PARENT B0 B1 B2 CHILD C
>>>>
>>>> What could cause the node to be stuck in POSTRUN even though the POST
>>>> script completes successfully with the proper exit code?
>>>>
>>>> Thanks for any help.
>>>>
>>>> --Rami
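
For reference, a minimal sketch of a 4-node DAG input file matching the
structure described above; the node names come from the excerpts, while the
submit-file names and the POST script line are assumptions (the logs show a
POST script running on node C):

JOB B0 stepB0.sub
JOB B1 stepB1.sub
JOB B2 stepB2.sub
JOB C  stepC.sub
SCRIPT POST C post.py $JOB
PARENT B0 B1 B2 CHILD C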
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685