Re: [HTCondor-users] dagman job won't finish




Hi Rami,

I see what you mean; the log excerpts below certainly look strange to me. I will ask one of our DAGMan experts here to take a look and report back to the list.

In the meantime, could you tell us which version of HTCondor you are using (i.e., the output of condor_version on your submit machine) and which operating system it runs on?

Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or removed during the test? Is this subdirectory on a shared file system?

Thanks
Todd

On 1/30/2018 5:47 PM, Rami Vanguri wrote:
Hi,

I am running a DAG whose node jobs all seem to run fine, but the
DAGMan job itself never stops running even though all of the node
jobs were successful.

Here is an excerpt from the .dag.nodes.log file:
005 (69535.000.000) 01/30 14:47:07 Job terminated.
         (1) Normal termination (return value 0)
                 Usr 0 00:11:47, Sys 0 00:02:08  -  Run Remote Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                 Usr 0 00:11:47, Sys 0 00:02:08  -  Total Remote Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
         0  -  Run Bytes Sent By Job
         0  -  Run Bytes Received By Job
         0  -  Total Bytes Sent By Job
         0  -  Total Bytes Received By Job
         Partitionable Resources :    Usage  Request Allocated
            Cpus                 :                 1         1
            Disk (KB)            :   250000        1   6936266
            Memory (MB)          :      583     2048      2048
...
016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
         (1) Normal termination (return value 0)
     DAG Node: C
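
To check programmatically whether the POST script event for a given cluster was actually written to the nodes log, a minimal sketch along these lines may help. It assumes only the event header layout visible in the excerpt above (three-digit event code, cluster.proc.subproc in parentheses, date, time, description); the function names are hypothetical and this is not an HTCondor API.

```python
import re

# Event header as it appears in the excerpt above, e.g.
# "016 (69538.000.000) 01/30 15:29:24 POST Script terminated."
# The layout is inferred from that excerpt only (assumption).
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) "
    r"\((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<text>.*)$"
)


def parse_event_headers(log_text):
    """Return one dict per event header line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line.strip())
        if match:
            events.append({
                "code": match.group("code"),
                "cluster": int(match.group("cluster")),
                "proc": int(match.group("proc")),
                "time": match.group("time"),
                "text": match.group("text"),
            })
    return events


def post_script_events(log_text, cluster):
    """Return the 016 (POST Script terminated) events for one cluster."""
    return [
        event for event in parse_event_headers(log_text)
        if event["cluster"] == cluster and event["code"] == "016"
    ]
```

Calling post_script_events() on the contents of the .dag.nodes.log file with cluster 69538 would show whether, and at what time, the POST script termination was logged.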

...and here is an excerpt from the .dagman.out file:
01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
01/30/18 15:29:22 Node C job completed
01/30/18 15:29:22 Running POST script of Node C...
01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by DAGMan; see gittrac #4987, #5031)
01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
01/30/18 15:29:22 Of 4 nodes total:
01/30/18 15:29:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
01/30/18 15:29:22   ===     ===      ===     ===     ===        ===      ===
01/30/18 15:29:22     3       0        0       1       0          0        0
01/30/18 15:29:22 0 job proc(s) currently held
01/30/18 15:29:24 Initializing user log writer for /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log, (69538.0.0)
01/30/18 15:39:23 601 seconds since last log event
01/30/18 15:39:23 Pending DAG nodes:
01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN
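
To pull the stuck nodes out of a long .dagman.out file, a short sketch like the following could be used. It assumes the "Pending DAG nodes:" report looks exactly like the lines above; the function name pending_nodes is hypothetical and nothing here is part of HTCondor itself.

```python
import re

# Matches a report line such as
# "01/30/18 15:39:23   Node C, HTCondor ID 69538, status STATUS_POSTRUN"
# (format taken from the excerpt above; an assumption, not a spec).
PENDING_NODE = re.compile(
    r"Node (?P<node>[^,\s]+), HTCondor ID (?P<cluster>\d+), "
    r"status (?P<status>\S+)"
)


def pending_nodes(dagman_out_text):
    """Return (node, cluster, status) tuples listed under each
    'Pending DAG nodes:' header, in the order they appear."""
    found = []
    in_report = False
    for line in dagman_out_text.splitlines():
        if "Pending DAG nodes:" in line:
            in_report = True
            continue
        if in_report:
            match = PENDING_NODE.search(line)
            if match:
                found.append((
                    match.group("node"),
                    int(match.group("cluster")),
                    match.group("status"),
                ))
            else:
                # First non-matching line ends the current report.
                in_report = False
    return found
```

If the report is printed several times, the same node appears once per report, so the number of repeats gives a rough idea of how long it has been pending.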

The DAG control file has only 4 jobs, structured like this:
PARENT B0 B1 B2 CHILD C

What could cause my job to be stuck in POSTRUN even though it runs
successfully with the proper exit code?

Thanks for any help.

--Rami