
Re: [Condor-users] Jobs not running even though servers available



Check what condor_q -ana tells you about the waiting jobs,
and look at NegotiatorLog and MatchLog to see what the negotiator is doing.
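
The checks above could be scripted roughly as follows. This is only a sketch under a few assumptions: "-ana" is taken to be shorthand for condor_q's -analyze option, and the log locations are looked up with condor_config_val using the configuration names NEGOTIATOR_LOG and NEGOTIATOR_MATCH_LOG, which should be verified against your own configuration. The helper names (diagnostic_commands, run_diagnostics) are made up for illustration.

```python
import shutil
import subprocess


def diagnostic_commands(job_id=None):
    """Build the argv lists for the suggested checks.

    job_id is optional; pass a cluster.proc string to analyze one job.
    """
    analyze = ["condor_q", "-analyze"]
    if job_id is not None:
        analyze.append(str(job_id))
    return [
        analyze,
        # Assumed config names for the NegotiatorLog and MatchLog paths.
        ["condor_config_val", "NEGOTIATOR_LOG"],
        ["condor_config_val", "NEGOTIATOR_MATCH_LOG"],
    ]


def run_diagnostics(job_id=None):
    """Run each command if its tool is on PATH; map command -> stdout or None."""
    results = {}
    for argv in diagnostic_commands(job_id):
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            # Tool not installed on this host, so record that and move on.
            results[key] = None
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[key] = proc.stdout
    return results


if __name__ == "__main__":
    for cmd, out in run_diagnostics().items():
        print(f"$ {cmd}")
        print(out if out is not None else "(command not found on this host)")
```

Run it on the machine that hosts the schedd and negotiator, then read the two log files it reports around the time the job count stops growing.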

Steve Timm


On Thu, 10 Nov 2011, Rob Stevenson wrote:

Hi All,
A quick summary: I've hit a seemingly arbitrary limit in my Condor grid. At most 76 jobs run at any given time, even though suitable idle servers are available. I think it might be because my master is underpowered and regularly hits 100% CPU, but so far this is only a hunch. More details below.


I've just added around 150 cores to our condor grid, now at a total of 190 cores.

In testing (throwing ~4000 30-minute jobs at it), I'm noticing that it seems to cap at exactly 76 running jobs.

I'm pretty sure this is not a requirements issue because, as recently as yesterday, I ran exactly the same set of jobs and they ran fine on servers that are now idle. (This was when there were only 40 cores.)

My best guess so far is a resource issue on the master (which is also the scheduler and everything else, really), which I'm now regularly seeing at 100% CPU. However, I don't really understand why this would cause the problem, or why it's always exactly 76 jobs running even though they are all (slightly) different sizes.

Does this hunch sound believable? I intend to investigate further, but thought it might be good to run it by the experts to see if it sounds like a good starting point.

I know my master is underpowered (1 virtual core with 1.5GB RAM), so I fully intend to give it a boost anyway. I'm just wondering whether this is likely to cure the issue (in which case I'll expedite the upgrade) or whether there is probably a different issue too.

Thanks for any ideas!

Rob Stevenson
Systems Administrator, Support Services

E: r.stevenson@xxxxxxxxxxxxxxxxx
T: +44 (0)1491 822270


HR Wallingford
Howbery Park, Wallingford, Oxfordshire OX10 8BA, United Kingdom
T: +44 (0) 1491 835381     F: +44 (0)1491 832233
www.hrwallingford.com


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.