
Re: [Condor-users] jobs stop running when lots of people submit jobs




Hrm. I guess you have to add D_FULLDEBUG to your NEGOTIATOR_DEBUG setting in order to see the message about "Over submitter resource limit".
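
As a rough sketch, the change would look something like the line below in the negotiator host's configuration. Which file it goes in (for example a local config file) is an assumption about your setup, and if NEGOTIATOR_DEBUG already lists other flags, those should probably stay alongside D_FULLDEBUG rather than be replaced.

```
# Assumed placement: the local config file on the machine running condor_negotiator.
# Keep any flags already set here and add D_FULLDEBUG to them.
NEGOTIATOR_DEBUG = D_FULLDEBUG
```

Running condor_reconfig afterwards should make the negotiator pick up the new setting; the extra messages would then show up in NegotiatorLog from the next negotiation cycle on.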

--Dan

Ben Clifford wrote:

On Tue, 20 May 2008, Dan Bradley wrote:

The problem you describe sounds like a problem that was fixed in 7.0.0. Here's the entry in the 7.0.0 version history:

[...]

assigned to anybody. The message in the condor_negotiator log in this case was this:

Over submitter resource limit (0) ... only consider startd ranks

I don't see anything in NegotiatorLog that looks like that.

A negotiation cycle from the log is pasted at the end of this message.

The first job mentioned, 2255, shows this in better-analyze:

root@workshop2:/sw/condor-6.8.4/local/log# condor_q -better-analyze 2255


-- Submitter: workshop2.ci.uchicago.edu : <127.0.1.1:34935> : workshop2.ci.uchicago.edu
---
2255.000:  Run analysis summary.  Of 2 machines,
     0 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 match but are serving users with a better priority in the pool
     2 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     0 are available to run your job
       No successful match recorded.
       Last failed match: Tue May 20 08:55:42 2008
       Reason for last match failure: no match found

That timestamp in the 'Last failed match' line is a few minutes ago.



5/20 08:54:01 ---------- Started Negotiation Cycle ----------
5/20 08:54:01 Phase 1:  Obtaining ads from collector ...
5/20 08:54:01   Getting all public ads ...
5/20 08:54:01   Sorting 20 ads ...
5/20 08:54:01   Getting startd private ads ...
5/20 08:54:01 Got ads: 20 public and 2 private
5/20 08:54:01 Public ads include 15 submitter, 2 startd
5/20 08:54:01 Phase 2:  Performing accounting ...
5/20 08:54:01 Phase 3:  Sorting submitter ads by priority ...
5/20 08:54:01 Phase 4.1:  Negotiating with schedds ...
5/20 08:54:01 Negotiating with train07@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02255.00000:
5/20 08:54:01 Rejected 2255.0 train07@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 Negotiating with train08@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02128.00000:
5/20 08:54:01 Rejected 2128.0 train08@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 Negotiating with train15@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02134.00000:
5/20 08:54:01 Rejected 2134.0 train15@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 Negotiating with train19@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02149.00000:
5/20 08:54:01 Rejected 2149.0 train19@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 Negotiating with train21@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02145.00000:
5/20 08:54:01 Rejected 2145.0 train21@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 Negotiating with train39@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02169.00000:
5/20 08:54:01 Rejected 2169.0 train39@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 ---------- Finished Negotiation Cycle ----------
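
When a cycle contains many submitters, it can help to tally the "Rejected" lines rather than read them one by one. The following is only a minimal sketch: the regular expression is inferred from the log lines pasted above, not from any documented log format, and the names REJECTED and summarize_rejections are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from lines like:
#   5/20 08:54:01 Rejected 2255.0 user@host <127.0.1.1:34935>: no match found
# This is an assumption based on the pasted log, not a documented format.
REJECTED = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+)\s+Rejected\s+"
    r"(?P<job>\d+\.\d+)\s+(?P<submitter>\S+)\s+<(?P<addr>[^>]+)>:\s*(?P<reason>.*)$"
)


def summarize_rejections(lines):
    """Count rejections per (submitter, reason) pair.

    `lines` is any iterable of log lines, e.g. an open NegotiatorLog file.
    Lines that are not "Rejected" entries are ignored.
    """
    counts = Counter()
    for line in lines:
        match = REJECTED.match(line.strip())
        if match:
            counts[(match.group("submitter"), match.group("reason"))] += 1
    return counts
```

It could be used by passing an open log file, for example `summarize_rejections(open("NegotiatorLog"))`, and printing the resulting counts. For the cycle above it would report one "no match found" rejection for each of the six submitters.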