
[Condor-users] Condor 6.8.n: job running delays: RUN TIMES stay at Zero



Hi again

I guess this follows on from my posting back in October, to which
Daniel Forrest at WISC suggested:

> Back to your original question, this is entirely a scalability
> issue.  Prior to the 6.9.3 release the schedd simply couldn't handle
> more than a few thousand jobs in the job queue without a severe
> degradation in performance.  I believe your previous message stated
> you had around 17,500 jobs in the queue - this simply won't work
> with Condor 6.8.
>
> The easiest temporary solution is to only have a few thousand jobs in
> the queue.

The user previously queuing the 20,000 or so jobs has "backed off" a
little but still isn't seeing as many jobs going through the system
as we'd expect.

Rough calculations on the completed jobs suggest that only around 100
jobs are ever running concurrently, whereas we know that there are
many more free machines, and hence queues, than that.
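
For anyone wanting to repeat that kind of rough calculation, here is a
minimal sketch of one way to do it, and not necessarily how the figure
above was arrived at: divide the total job-seconds by the length of the
observed window. The function name and the sample start/end times are
hypothetical; real values would have to come from the job history.

```python
from datetime import datetime


def average_concurrency(jobs):
    """Estimate the average number of jobs running at once.

    jobs: list of (start, end) datetime pairs for completed jobs.
    The estimate is total job-seconds divided by the wall-clock span
    from the earliest start to the latest end.
    """
    if not jobs:
        return 0.0
    window_start = min(start for start, _ in jobs)
    window_end = max(end for _, end in jobs)
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        return 0.0
    busy = sum((end - start).total_seconds() for start, end in jobs)
    return busy / window


# Hypothetical sample: one job spans the full hour, two others
# each cover half of it, so on average two jobs run at once.
sample = [
    (datetime(2007, 1, 1, 10, 0), datetime(2007, 1, 1, 11, 0)),
    (datetime(2007, 1, 1, 10, 0), datetime(2007, 1, 1, 10, 30)),
    (datetime(2007, 1, 1, 10, 30), datetime(2007, 1, 1, 11, 0)),
]
print(average_concurrency(sample))
```

This is an average over the whole window, so short bursts of higher
concurrency would be smoothed out.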

One "condor_q" showed some 1400 jobs listed as being in a Running
state, on queues displaying a Claimed status, but displaying a RUN TIME
of 00:00:00 for an inordinately long time.

There are also some 115 jobs Running with non-zero run times
(which undermines the rough calculation of 100 slightly, but
 you get the idea).
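
The two counts above can be pulled out of saved "condor_q" output with
something like the sketch below. It assumes a column layout of
ID, OWNER, SUBMITTED (date and time), RUN_TIME, ST, and that a Running
job shows "R" in the ST column; that layout and the sample lines are
assumptions for illustration and should be checked against the real
output before trusting the numbers.

```python
def split_running_jobs(condor_q_text):
    """Count Running jobs with a zero and a non-zero RUN_TIME.

    Assumes whitespace-separated columns in the order:
    ID OWNER DATE TIME RUN_TIME ST ...
    Returns a (zero_run_time, nonzero_run_time) tuple.
    """
    zero = 0
    nonzero = 0
    for line in condor_q_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and jobs not in the Running state.
        if len(fields) < 6 or fields[5] != "R":
            continue
        run_time = fields[4]
        # Treat the run time as zero if every digit in it is 0,
        # which covers both "00:00:00" and "0+00:00:00".
        digits = {ch for ch in run_time if ch.isdigit()}
        if digits <= {"0"}:
            zero += 1
        else:
            nonzero += 1
    return zero, nonzero


# Hypothetical sample in the assumed layout.
sample_output = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 101.0   userA          11/20 10:15   0+00:00:00 R  0   0.0  job.bat
 101.1   userA          11/20 10:15   0+00:12:34 R  0   0.0  job.bat
 101.2   userA          11/20 10:15   0+00:00:00 I  0   0.0  job.bat
 101.3   userA          11/20 10:15   0+00:00:00 R  0   0.0  job.bat
"""
print(split_running_jobs(sample_output))
```

Running this at intervals and comparing the zero-run-time count would
show whether those jobs ever start accumulating time.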

I thought that there might be a config setting that limits the
number of concurrently running jobs, but the closest I could come
to the apparent restriction of around 100 was the default setting of

#START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 100

We bumped that up to 200 in the server config file and restarted
the system but have not seen any difference in concurrent execution.

One other thing, whilst I have full access to the GNU/Linux master for
this grid, I am slightly hobbled as regards editing the winxp compute
node config files, although the two seem to be in sync, other than
a few ".exe" extensions and differing path names here and there.

Is this likely to be a 6.8.n series issue - I am still waiting
for an opportunity to upgrade VUW's grid to something less than
two years old - or have I missed something really basic within the
Condor /config file/operational methodology/?

Kevin

-- 
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand