Re: [Condor-users] Condor 6.8.n: job running delays: RUN TIMES stay at Zero



On Wed, Dec 09, 2009 at 04:24:26PM +1300, Kevin.Buckley@xxxxxxxxxxxxx wrote:
> 
> Rough calculations on the completed jobs suggest that only around 100
> jobs are ever run concurrently at any one time, whereas we know that
> there are many more free machines and hence queues than that.

The key number here is the rate at which jobs are finishing.  For many
of the same reasons that the 6.8.n series can't handle too many jobs in
the job queue, it also can't handle more than a dozen or so jobs
finishing per minute (IIRC, from when we had this same problem).  This
limit actually depends on how many jobs are in the queue: the more jobs
there are, the longer it takes to remove one from the queue when it
completes.
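
To see why the completion rate caps how many jobs can run at once, here
is a rough sketch using Little's law (jobs running = completion rate x
mean run time).  The 12 per minute figure is just my "dozen or so"
above, the run times are made-up examples, and the function name is
only for illustration; none of these are measured values.

```python
def max_concurrent_jobs(completions_per_minute: float,
                        mean_runtime_minutes: float) -> float:
    """Steady-state number of running jobs, by Little's law.

    If the scheduler can only retire `completions_per_minute` jobs,
    then on average no more than rate * mean runtime jobs can be
    running, no matter how many machines are free.
    """
    return completions_per_minute * mean_runtime_minutes


if __name__ == "__main__":
    rate = 12.0  # assumed: "a dozen or so" completions per minute
    # Hypothetical mean run times in minutes, for illustration only.
    for runtime in (5.0, 8.0, 60.0):
        jobs = max_concurrent_jobs(rate, runtime)
        print(f"mean runtime {runtime:5.1f} min -> about {jobs:.0f} jobs running")
```

With short jobs (a few minutes each), a ceiling of roughly 100
concurrent jobs is about what this estimate predicts; longer jobs
would raise the ceiling proportionally.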

> One "condor_q" showed some 1400 jobs listed as being in a Running
> state, on queues displaying a Claimed status, but displaying a RUN TIME
> of 00:00:00 for an inordinately long time.

This is a sign that the scheduler is too busy to get shadows started
quickly enough.

> Is this likely to be a 6.8.n series issue - I am still waiting
> for an opportunity to upgrade VUW's grid to something less than
> two years old - or have I missed something really basic within the
> Condor /config file/operational methodology/?

As one other poster mentioned, the other solution is to upgrade only
the central manager to a newer Condor version.  It should have no
problem interoperating with compute nodes that are running 6.8.n.

-- 
Daniel K. Forrest		Space Science and
dan.forrest@xxxxxxxxxxxxx	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison