
[Condor-users] dagman.out vs condor_q



This is my first week of using DAGMan.  I am trying to understand the output.  foo.dagman.out contains the nice summary:

12/17 16:49:14 Number of idle job procs: 6
12/17 16:49:14 Of 40 nodes total:
12/17 16:49:14  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/17 16:49:14   ===     ===      ===     ===     ===        ===      ===
12/17 16:49:14    28       0       12       0       0          0        0

However, condor_q seems to paint a different picture:

[ijstokes@abitibi dag6]$ condor_q -constraint DAGManJobId==`cat .lastjobid` | grep -c " R "
7
[ijstokes@abitibi dag6]$ condor_q -constraint DAGManJobId==`cat .lastjobid` | grep -c " I "
4
[ijstokes@abitibi dag6]$ condor_q -constraint DAGManJobId==`cat .lastjobid` | grep -c " H "
3
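
The three counts above come from three separate queries, so the queue can change between them, and grepping for " R " could in principle match text in another column. A minimal sketch of collecting the same counts from a single query follows. It assumes condor_q's -format option and the JobStatus job attribute, with the usual status codes 1 = Idle, 2 = Running, 5 = Held; the function name tally_job_status is only illustrative.

```shell
# Tally Condor JobStatus codes read from stdin, one integer per line.
# Assumed codes: 1 = Idle, 2 = Running, 5 = Held. Other codes are ignored.
tally_job_status() {
    awk '
        $1 == 1 { idle++ }
        $1 == 2 { running++ }
        $1 == 5 { held++ }
        END { printf "running=%d idle=%d held=%d\n", running, idle, held }
    '
}

# Intended use against a live pool (needs a running schedd):
#   condor_q -constraint "DAGManJobId == $(cat .lastjobid)" \
#       -format "%d\n" JobStatus | tally_job_status
```

Because all three numbers come from one snapshot of the queue, they should at least be consistent with each other, even if they still lag behind or run ahead of what DAGMan last wrote to foo.dagman.out.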

From the first, I'm told 6 are idle (condor_q indicates 4, but this could be asynchrony in the updates). But how do I distinguish between jobs in the RUNNING state and jobs in the HELD state?  The summary doesn't (directly) seem to distinguish between RUNNING, HELD, and QUEUED, which seems odd.  The condor_q output shows that 3 are HELD, which in a Condor-G world effectively means they've failed and need to be retried.

Thanks in advance for help understanding this.

Ian
-- 
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 432-5608 x75
SBGrid, Harvard Medical School             F: +1 617 432-5600