
Re: [Condor-users] jobs stop running when lots of people submit jobs



On Tue, 20 May 2008, Dan Bradley wrote:

> The problem you describe sounds like a problem that was fixed in 7.0.0.  
> Here's the entry in the 7.0.0 version history:

[...]

> assigned to anybody. The message in the condor_negotiator log in this 
> case was this:
> 
> Over submitter resource limit (0) ... only consider startd ranks

I don't see anything in NegotiatorLog that looks like that.

A negotiation cycle from the log is pasted at the end of this message.

The first job mentioned, 2255, shows this in better-analyze:

root@workshop2:/sw/condor-6.8.4/local/log# condor_q -better-analyze 2255


-- Submitter: workshop2.ci.uchicago.edu : <127.0.1.1:34935> : workshop2.ci.uchicago.edu
---
2255.000:  Run analysis summary.  Of 2 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        No successful match recorded.
        Last failed match: Tue May 20 08:55:42 2008
        Reason for last match failure: no match found

The timestamp on the 'Last failed match' line is from just a few minutes ago.
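
To compare this breakdown across several jobs, the summary lines can be pulled into a dictionary. This is only a rough sketch: it assumes the output has the layout shown above (a count followed by a description on each line), and the function name parse_analysis is made up for illustration, not part of Condor.

```python
import re

# A summary line is an integer count followed by a description, e.g.
#   "      2 match but reject the job for unknown reasons"
# Assumption: the layout matches the condor_q -better-analyze output above.
COUNT_LINE = re.compile(r"^\s*(\d+)\s+(\S.*)$")


def parse_analysis(text):
    """Map each summary description to its machine count."""
    counts = {}
    for line in text.splitlines():
        m = COUNT_LINE.match(line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return counts
```

Given the output for job 2255, the only non-zero entry would be "match but reject the job for unknown reasons", which covers both machines in the pool.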



5/20 08:54:01 ---------- Started Negotiation Cycle ----------
5/20 08:54:01 Phase 1:  Obtaining ads from collector ...
5/20 08:54:01   Getting all public ads ...
5/20 08:54:01   Sorting 20 ads ...
5/20 08:54:01   Getting startd private ads ...
5/20 08:54:01 Got ads: 20 public and 2 private
5/20 08:54:01 Public ads include 15 submitter, 2 startd
5/20 08:54:01 Phase 2:  Performing accounting ...
5/20 08:54:01 Phase 3:  Sorting submitter ads by priority ...
5/20 08:54:01 Phase 4.1:  Negotiating with schedds ...
5/20 08:54:01   Negotiating with train07@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02255.00000:
5/20 08:54:01       Rejected 2255.0 train07@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01   Negotiating with train08@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02128.00000:
5/20 08:54:01       Rejected 2128.0 train08@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01   Negotiating with train15@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02134.00000:
5/20 08:54:01       Rejected 2134.0 train15@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01   Negotiating with train19@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02149.00000:
5/20 08:54:01       Rejected 2149.0 train19@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01   Negotiating with train21@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02145.00000:
5/20 08:54:01       Rejected 2145.0 train21@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01   Negotiating with train39@xxxxxxxxxxxxxxxxxxxxxxxxx at <127.0.1.1:34935>
5/20 08:54:01 0 seconds so far
5/20 08:54:01     Request 02169.00000:
5/20 08:54:01       Rejected 2169.0 train39@xxxxxxxxxxxxxxxxxxxxxxxxx <127.0.1.1:34935>: no match found
5/20 08:54:01     Got NO_MORE_JOBS;  done negotiating
5/20 08:54:01 ---------- Finished Negotiation Cycle ----------
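
For a longer NegotiatorLog, the rejections can be tallied with a short script rather than read by eye. The following is a minimal sketch, assuming the log lines look like the cycle above (a "month/day hh:mm:ss" prefix and a "Rejected <job> <submitter> <address>: <reason>" message). The helper names unwrap and summarize_rejections are illustrative and not part of Condor. Because mail clients may wrap long lines, any line without a timestamp prefix is treated as a continuation of the previous one.

```python
import re
from collections import Counter

# Assumed log prefix, as in the cycle above: "5/20 08:54:01 "
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")

# "Rejected 2255.0 user@host <127.0.1.1:34935>: no match found"
REJECTED = re.compile(r"Rejected\s+(\d+\.\d+)\s+(\S+)\s+<([^>]+)>:\s*(.+)$")


def unwrap(lines):
    """Rejoin lines that were wrapped: a line without a timestamp
    prefix is appended to the line before it."""
    merged = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not merged:
            merged.append(line)
        else:
            merged[-1] = merged[-1] + " " + line.strip()
    return merged


def summarize_rejections(log_text):
    """Return the list of (job, submitter, reason) tuples and a
    Counter of how often each rejection reason occurs."""
    rejections = []
    for line in unwrap(log_text.splitlines()):
        m = REJECTED.search(line)
        if m:
            job, submitter, _addr, reason = m.groups()
            rejections.append((job, submitter, reason.strip()))
    reasons = Counter(reason for _, _, reason in rejections)
    return rejections, reasons
```

Passing the contents of the log file to summarize_rejections and printing the returned Counter gives a per-reason count. For the cycle above, every request would fall under "no match found".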
