Ah okay, my POST script only deletes some tarballs. I have been using
a test job and the problem occurs 100% of the time. I've submitted
this job around 100 times so far. Since it hangs I'm reluctant to
start production.
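Concretely, the cleanup step looks roughly like this (a simplified, self-contained sketch: the filenames are made up, and a local archive directory stands in for the real hadoop transfer):

```shell
# Sketch of the POST script's cleanup logic. Filenames and the archive
# destination are illustrative; the real script pushes tarballs to hadoop.
WORK=$(mktemp -d)            # stand-in for the job's scratch directory
cd "$WORK"
touch C_batch0.tar.gz C_batch1.tar.gz workflow.dag.nodes.log  # sample files

NODE="C"                     # DAG node name (passed in as $1 in the real script)
DEST="$WORK/archive"         # stand-in for the hadoop target
mkdir -p "$DEST"
for tb in "${NODE}"_*.tar.gz; do
    [ -e "$tb" ] || continue # no tarballs produced: nothing to do
    cp "$tb" "$DEST/"        # stand-in for the real transfer (e.g. hdfs dfs -put)
    rm "$tb"                 # remove only this node's tarballs; DAGMan's own
done                         # files (*.dag.nodes.log, *.dagman.out) are untouched
```

The point is that only the node's own output tarballs are matched; the DAGMan bookkeeping files sitting in the same directory are left alone.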
--Rami
On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>> Hi Todd,
>>
>> Thanks for the reply!
>>
>> condor_version:
>> $CondorVersion: 8.6.8 Oct 31 2017 $
>> $CondorPlatform: X86_64-CentOS_6.9 $
>>
>> OS:
>> Description: CentOS release 6.9 (Final)
>>
>> So your question about editing/removing files might be the answer, my
>> POST script transfers (to hadoop) and removes the resulting tarballs
>> from the preceding steps. I do this because I will be submitting
>> hundreds of these and don't want to keep the output around in the
>> scratch directory. If that is indeed what's causing the issue, how
>> can I remove files safely?
>>
>
> Having your POST script move job output someplace should be fine. I asked because I was concerned that your POST script might move (or remove) files that DAGMan itself needs to reference, such as your
> .dag.nodes.log and the other files DAGMan creates. Those files should not be moved or removed until DAGMan exits.
>
> Another question: Does this problem occur consistently (i.e. DAGMan always gets stuck with this workflow), or only occasionally? If the latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000?
>
> Thanks,
> Todd
>
>> --Rami
>>
>> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>> Hi Rami,
>>>
>>> I see what you mean; the output below certainly looks strange to me. I will ask
>>> one of our DAGMan experts here to take a look and report back to the list.
>>>
>>> In the meantime, could you tell us what version of HTCondor you are using
>>> (i.e. output of condor_version on your submit machine), and on what
>>> operating system?
>>>
>>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or removed
>>> during the test? Is this subdirectory on a shared file system?
>>>
>>> Thanks
>>> Todd
>>>
>>>
>>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am running a DAG job that has several components which seem to all
>>>> run fine, but then the actual dag job never stops running even though
>>>> all of the jobs were successful.
>>>>
>>>> Here is an excerpt from the .dag.nodes.log file:
>>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>>>     (1) Normal termination (return value 0)
>>>>         Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
>>>>         Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>>         Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
>>>>         Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>>>>     0  -  Run Bytes Sent By Job
>>>>     0  -  Run Bytes Received By Job
>>>>     0  -  Total Bytes Sent By Job
>>>>     0  -  Total Bytes Received By Job
>>>>     Partitionable Resources :    Usage  Request  Allocated
>>>>        Cpus                 :                 1          1
>>>>        Disk (KB)            :   250000        1    6936266
>>>>        Memory (MB)          :      583     2048       2048
>>>> ...
>>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>>>     (1) Normal termination (return value 0)
>>>>     DAG Node: C
>>>>
>>>> ..and here is an excerpt from the .dagman.out file:
>>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>>> 01/30/18 15:29:22 Node C job completed
>>>> 01/30/18 15:29:22 Running POST script of Node C...
>>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by
>>>> DAGMan; see gittrac #4987, #5031)
>>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>>> 01/30/18 15:29:22 Of 4 nodes total:
>>>> 01/30/18 15:29:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>>>> 01/30/18 15:29:22   ===     ===      ===     ===     ===        ===      ===
>>>> 01/30/18 15:29:22     3       0        0       1       0          0        0
>>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>>> 01/30/18 15:29:24 Initializing user log writer for
>>>> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log,
>>>> (69538.0.0)
>>>> 01/30/18 15:39:23 601 seconds since last log event
>>>> 01/30/18 15:39:23 Pending DAG nodes:
>>>> 01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>>>
>>>> The DAG file has only 4 nodes, structured like this:
>>>> PARENT B0 B1 B2 CHILD C
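>>>>
>>>> Spelled out, the whole DAG file is roughly this (the submit-file and
>>>> script names are illustrative):
>>>>
>>>>   JOB B0 stepB0.sub
>>>>   JOB B1 stepB1.sub
>>>>   JOB B2 stepB2.sub
>>>>   JOB C  stepC.sub
>>>>   SCRIPT POST C post_cleanup.sh
>>>>   PARENT B0 B1 B2 CHILD C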
>>>>
>>>> What could cause Node C to be stuck in STATUS_POSTRUN even though its
>>>> POST script terminated successfully with exit code 0?
>>>>
>>>> Thanks for any help.
>>>>
>>>> --Rami
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
> Center for High Throughput Computing    Department of Computer Sciences
> HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                   Madison, WI 53706-1685
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/