
Re: [Condor-users] DAGMan Hangs Near End



Dear Nathan,

This worked. The job in question was B_chr21. I am attaching a tarball of the requested log files. The Condor version is old (7.0.5), and I do not have control over its administration.

Thank you and please let me know what you find.
Oren

On 9/28/2012 5:02 PM, Nathan Panike wrote:
Oren:

DAGMan thinks a job was submitted but never saw it terminate, so it
must have missed the termination event in the log. Here is what to do
in this case to complete the job:

1. Figure out which node is still pending.

2. condor_rm the DAG.

3. Edit the rescue DAG file to mark the pending node as "DONE".

4. Resubmit the DAG with condor_submit_dag.

5. Also, we need to figure out why DAGMan itself never recognized that
the node was done. To this end, could you send me the .dagman.out file,
along with the user log files?
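
Step 3 above can be sketched in Python. This is only an illustration, not
a Condor tool: it assumes the rescue DAG file is a complete DAG file whose
node lines have the form "JOB <node> <submit file> ...", and that a node is
marked finished by a trailing DONE keyword on that line. The function name
mark_node_done is hypothetical.

```python
def mark_node_done(dag_text, node_name):
    """Return dag_text with DONE appended to the JOB line of node_name.

    Assumes node lines look like "JOB <node> <submit file> [options]".
    Lines for other nodes, and all non-JOB lines, are left untouched.
    """
    out = []
    for line in dag_text.splitlines():
        fields = line.split()
        is_target = (
            len(fields) >= 3
            and fields[0].upper() == "JOB"
            and fields[1] == node_name
        )
        # Append DONE only once, so running the edit twice is harmless.
        if is_target and fields[-1].upper() != "DONE":
            line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"
```

One would read the rescue DAG file into a string, call the function with
the pending node's name, write the result back, and then resubmit with
condor_submit_dag as in step 4.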

Nathan Panike

On Fri, Sep 28, 2012 at 01:27:10PM -0500, Oren Livne wrote:
Dear All,

I have a DAGMan pipeline that starts fine, but never completes,
because the last few jobs are queued but never run. A down-scaled
version of it works, so I doubt that it's a programming error. There
are many available nodes; why won't those jobs run? How can I
analyze the individual job within the DAGMan that says "Queued"?

Thank you so much,
Oren

-- Submitter: ibicluster.uchicago.cc : <172.16.0.149:42470> : ibicluster.uchicago.cc
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  904.0   livne           9/28 13:09   0+00:15:40 R  0   7.3  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held
===================================================================================

                      Total Owner Claimed Unclaimed Matched Preempting Backfill

         X86_64/LINUX   728   108       0       620       0          0        0

                Total   728   108       0       620       0          0        0

===================================================================================
9/28 13:23:33 Event: ULOG_EXECUTE for Condor Node D_chr10 (1009.0)
9/28 13:23:33 Number of idle job procs: 1
9/28 13:23:43 Event: ULOG_JOB_TERMINATED for Condor Node D_chr10 (1009.0)
9/28 13:23:43 Node D_chr10 job proc (1009.0) completed successfully.
9/28 13:23:43 Node D_chr10 job completed
9/28 13:23:43 Number of idle job procs: 1
9/28 13:23:43 Of 107 nodes total:
9/28 13:23:43  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
9/28 13:23:43   ===     ===      ===     ===     ===        ===      ===
9/28 13:23:43   104       0        1       0       0          2        0
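
One way to find the node that is still "Queued" is to scan the
.dagman.out file for nodes that have logged events but no termination
event. The following is a minimal Python sketch, assuming event lines have
the same form as the "Event: ULOG_... for Condor Node ..." lines in the
excerpt above; the function name pending_nodes is hypothetical, and other
ways a node can end (for example being removed) are not handled.

```python
import re

# Matches event lines of the form shown in the excerpt, e.g.
# "9/28 13:23:33 Event: ULOG_EXECUTE for Condor Node D_chr10 (1009.0)"
EVENT_RE = re.compile(r"Event: (ULOG_\w+) for Condor Node (\S+) \(")


def pending_nodes(dagman_out_text):
    """Return names of nodes that logged events but never terminated."""
    seen = []           # node names in first-seen order
    terminated = set()  # nodes with a ULOG_JOB_TERMINATED event
    for line in dagman_out_text.splitlines():
        match = EVENT_RE.search(line)
        if not match:
            continue
        event, node = match.groups()
        if node not in seen:
            seen.append(node)
        if event == "ULOG_JOB_TERMINATED":
            terminated.add(node)
    return [node for node in seen if node not in terminated]
```

Passing the full text of the .dagman.out file to pending_nodes returns a
list of candidate node names to inspect further.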


--
A person is just about as big as the things that make him angry.

Attachment: pipeline.tgz
Description: Binary data