
Re: [Condor-users] Jobs not running even though servers available



Thanks for the quick response. I'm not sure how to interpret the logs (below), but the condor_q -ana output is something that I found strange before (I should have mentioned that in my first email, really!).

Any thoughts?


condor_q -ana says:
9020.000:  Request has not yet been considered by the matchmaker.




The last negotiator cycle:
11/10/11 20:13:24 Getting state information from the accountant
11/10/11 20:14:20 ---------- Started Negotiation Cycle ----------
11/10/11 20:14:20 Phase 1:  Obtaining ads from collector ...
11/10/11 20:14:21   Getting all public ads ...
11/10/11 20:14:21   Sorting 325 ads ...
11/10/11 20:14:21   Getting startd private ads ...
11/10/11 20:14:21 Got ads: 325 public and 188 private
11/10/11 20:14:21 Public ads include 1 submitter, 188 startd
11/10/11 20:14:21 Phase 2:  Performing accounting ...
11/10/11 20:14:21 Phase 3:  Sorting submitter ads by priority ...
11/10/11 20:14:21 Phase 4.1:  Negotiating with schedds ...
11/10/11 20:14:21   Negotiating with user@xxxxxxx at <192.xxx.xxx.xxx:2568>
11/10/11 20:14:21 0 seconds so far
11/10/11 20:14:22     Got NO_MORE_JOBS;  done negotiating
11/10/11 20:14:22  negotiateWithGroup resources used scheddAds length 0
11/10/11 20:14:22 ---------- Finished Negotiation Cycle ----------
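One thing visible in the cycle above is the 56 seconds between the first two lines (20:13:24 to 20:14:20). A minimal sketch for spotting pauses like that in a log, assuming every entry starts with a timestamp in the MM/DD/YY HH:MM:SS form shown above. The function name and the 30-second threshold are illustrative only and are not part of Condor:

```python
from datetime import datetime

def find_gaps(lines, threshold_seconds=30):
    """Return (gap_seconds, previous_line, current_line) for consecutive
    timestamped lines that are more than threshold_seconds apart."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        line = line.rstrip("\n")
        try:
            # The timestamp is the first 17 characters: MM/DD/YY HH:MM:SS
            stamp = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # skip blank or non-timestamped lines
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if gap > threshold_seconds:
                gaps.append((gap, prev_line, line))
        prev_time, prev_line = stamp, line
    return gaps

# A few lines copied from the negotiator cycle above.
sample_lines = [
    "11/10/11 20:13:24 Getting state information from the accountant",
    "11/10/11 20:14:20 ---------- Started Negotiation Cycle ----------",
    "11/10/11 20:14:21 Phase 1:  Obtaining ads from collector ...",
    "11/10/11 20:14:22     Got NO_MORE_JOBS;  done negotiating",
    "11/10/11 20:14:22 ---------- Finished Negotiation Cycle ----------",
]

for gap, before, after in find_gaps(sample_lines):
    print(f"{gap:.0f}s between:\n  {before}\n  {after}")
```

To run it over a whole log, pass an open file instead of the list, for example find_gaps(open(path)) where path is wherever the negotiator log lives on the master. A long pause does not by itself explain the 76-job cap; it only shows where the negotiator spent its time.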




Match log (this is a sample; the last entry in the log was a few hours ago, at 11/10/11 16:48:45).
I checked the jobs that are listed as rejected, and they have since completed successfully.



11/10/11 16:43:39       Rejected 8493.0 user@xxxxxxx <192.xxx.xxx.xxx:2568>: no match found
11/10/11 16:44:40       Rejected 8494.0 user@xxxxxxx <192.xxx.xxx.xxx:2568>: no match found
11/10/11 16:45:41       Rejected 8494.0 user@xxxxxxx <192.xxx.xxx.xxx:2568>: no match found
11/10/11 16:46:41       Rejected 8494.0 user@xxxxxxx <192.xxx.xxx.xxx:2568>: no match found
11/10/11 16:47:42       Matched 8496.0  user@xxxxxxx <192.xxx.xxx.xxx:2568> preempting none <192.xxx.xxx.yyy:1169> slot1@xxxxxxxxxxxxxxxxxxxxx
11/10/11 16:47:42       Matched 8497.0 user@xxxxxxx <192.xxx.xxx.xxx:2568> preempting none <192.xxx.xxx.yyy:1169> slot2@xxxxxxxxxxxxxxxxxxxxx
11/10/11 16:47:42       Matched 8498.0 user@xxxxxxx <192.xxx.xxx.xxx:2568> preempting none <192.xxx.xxx.yyy:1169> slot3@xxxxxxxxxxxxxxxxxxxxx
11/10/11 16:47:43       Matched 8499.0 user@xxxxxxx <192.xxx.xxx.xxx:2568> preempting none <192.xxx.xxx.yyy:1169> slot4@xxxxxxxxxxxxxxxxxxxxx
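A match log of this shape can be tallied with a short script to see how many entries were matched or rejected, and which jobs were last seen as rejected. This is a minimal sketch that assumes each entry looks like the sample above (timestamp, then "Matched" or "Rejected", then the job id). The names ENTRY and summarise_match_log are made up for illustration:

```python
import re
from collections import Counter

# Date, time, then "Matched" or "Rejected", then the job id (cluster.proc).
ENTRY = re.compile(r"^\S+ \S+\s+(Matched|Rejected)\s+(\d+\.\d+)\s")

def summarise_match_log(lines):
    """Return (counts per outcome, last outcome seen for each job id)."""
    outcomes = Counter()
    last_outcome = {}
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue  # ignore anything that is not a match/reject entry
        outcome, job_id = m.groups()
        outcomes[outcome] += 1
        last_outcome[job_id] = outcome
    return outcomes, last_outcome

# Entries shortened from the sample above (addresses and slot names trimmed).
sample_entries = [
    "11/10/11 16:43:39       Rejected 8493.0 user@xxxxxxx: no match found",
    "11/10/11 16:44:40       Rejected 8494.0 user@xxxxxxx: no match found",
    "11/10/11 16:45:41       Rejected 8494.0 user@xxxxxxx: no match found",
    "11/10/11 16:46:41       Rejected 8494.0 user@xxxxxxx: no match found",
    "11/10/11 16:47:42       Matched 8496.0  user@xxxxxxx preempting none",
    "11/10/11 16:47:42       Matched 8497.0 user@xxxxxxx preempting none",
    "11/10/11 16:47:42       Matched 8498.0 user@xxxxxxx preempting none",
    "11/10/11 16:47:43       Matched 8499.0 user@xxxxxxx preempting none",
]

counts, last = summarise_match_log(sample_entries)
print(dict(counts))
print(sorted(job for job, outcome in last.items() if outcome == "Rejected"))
```

A job whose last entry is "Rejected" in a sample may still have been matched later, as with the jobs mentioned above that have since completed, so the summary is only as complete as the portion of the log it is given.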


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
Sent: 10 November 2011 20:00
To: Condor-Users Mail List
Subject: Re: [Condor-users] Jobs not running even though servers available

Check what condor_q -ana tells you about the waiting jobs; also look at NegotiatorLog and MatchLog to see what they are doing.

Steve Timm


On Thu, 10 Nov 2011, Rob Stevenson wrote:

> Hi All,
> A quick summary: I've hit a seemingly arbitrary limit in my condor grid: a maximum of 76 jobs run at any given time, even though there are suitable idle servers available. I think it might be because my master is underpowered and regularly hitting 100% CPU, but so far this is based on nothing more than a hunch. More details below.
>
>
> I've just added around 150 cores to our condor grid, now at a total of 190 cores.
>
> In testing (throwing ~4000 x 30-minute jobs at it) I'm noticing it seems to cap at exactly 76 running jobs.
>
> I'm pretty sure this is not a requirements issue because (as recently as yesterday) I've run exactly the same set of jobs and they have run fine on servers which are now Idle. (This was when there were only 40 cores).
>
> My best guess so far is a resource issue on the master (which is also the scheduler and everything else, really), which I'm now regularly seeing at 100% CPU. However, I don't really understand why this would cause the problem, or why it's always exactly 76 jobs running even though they are all (slightly) different sizes.
>
> Does this hunch sound believable? I intend to investigate further, but thought it might be good to run it by the experts to see if it sounds like a good starting point.
>
> I know my master is underpowered (1 virtual core with 1.5GB RAM), so I fully intend to give this a boost anyway. I'm just wondering whether this is likely to cure the issue (in which case I'll expedite the upgrade) or whether there is probably a different issue too.
>
> Thanks for any ideas!
>
> Rob Stevenson
> Systems Administrator, Support Services
>
> E: r.stevenson@xxxxxxxxxxxxxxxxx
> T: +44 (0)1491 822270
>
> ________________________________
> HR Wallingford
> Howbery Park, Wallingford, Oxfordshire OX10 8BA, United Kingdom
> T: +44 (0) 1491 835381     F: +44 (0)1491 832233
> www.hrwallingford.com
>
>
> ________________________________
>
> HR Wallingford uses faxes and emails for confidential and legally privileged business communications. They do not of themselves create legal commitments. Disclosure to parties other than addressees requires our specific consent. We are not liable for unauthorised disclosures nor reliance upon them.
> If you have received this message in error please advise us immediately and destroy all copies of it.
>
> HR Wallingford Limited
> Howbery Park, Wallingford, Oxfordshire, OX10 8BA, United Kingdom
> Registered in England No. 02562099
>
> ________________________________
>

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.

