Re: [Condor-users] condor_negotiator/condor_collector scheduling problem



Hi Erik,

Upon further examination, I don't think that my condor_negotiator is checking whether all of the currently idle jobs can be matched. Rather, it seems to stop at the first "no match found". Below is a snippet of two iterations of the negotiation cycle from NegotiatorLog.

I'm not sure why the Negotiator "gets" a NO_MORE_JOBS message, since there are 6 more jobs that it is not considering. Is there a flag I need to pass to the Negotiator to force it to consider all idle jobs in every iteration of the negotiation cycle, instead of just stopping at the first one?

I've been deleting the spool directory every time I try this. Jobs 1-4 require MY_RESOURCE_1, and Jobs 5-8 require MY_RESOURCE_2. I also changed the submit files so the dummy programs sleep for 90 seconds instead of 600. I should also mention that although my negotiation cycles are set to run more frequently than the default, I get exactly the same problem when I use the default timings.

In case the inline snippet isn't helpful enough, I've uploaded all the log files to:

http://www.static.net/~armenb/condor-negotiator-problem/

q1.txt and s1.txt are dumps of condor_q -l and condor_status -l, respectively, when all jobs are running. q2.txt and s2.txt are dumps of the same commands when Condor runs the first MY_RESOURCE_2-needing program concurrently with the last MY_RESOURCE_1-needing program.

Please let me know if you have any questions.  Thanks!

 - Armen

5/5 15:56:05 ---------- Started Negotiation Cycle ----------
5/5 15:56:05 Phase 1:  Obtaining ads from collector ...
5/5 15:56:05   Getting all public ads ...
5/5 15:56:05   Sorting 8 ads ...
5/5 15:56:05   Getting startd private ads ...
5/5 15:56:05 Got ads: 8 public and 4 private
5/5 15:56:05 Public ads include 1 submitter, 4 startd
5/5 15:56:05 Phase 2:  Performing accounting ...
5/5 15:56:05 Phase 3:  Sorting submitter ads by priority ...
5/5 15:56:05 Phase 4.1:  Negotiating with schedds ...
5/5 15:56:05 Negotiating with armenb@xxxxxxxxxxxxxxxxxxxxxxxxx at <155.34.66.121:50431>
5/5 15:56:05 0 seconds so far
5/5 15:56:05     Request 00001.00000:
5/5 15:56:05 Matched 1.0 armenb@xxxxxxxxxxxxxxxxxxxxxxxxx <155.34.66.121:50431> preempting none <155.34.66.121:50432> vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
5/5 15:56:05       Successfully matched with vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
5/5 15:56:05     Got NO_MORE_JOBS;  done negotiating
5/5 15:56:05 ---------- Finished Negotiation Cycle ----------
5/5 15:56:25 ---------- Started Negotiation Cycle ----------
5/5 15:56:25 Phase 1:  Obtaining ads from collector ...
5/5 15:56:25   Getting all public ads ...
5/5 15:56:25   Sorting 8 ads ...
5/5 15:56:25   Getting startd private ads ...
5/5 15:56:25 Got ads: 8 public and 4 private
5/5 15:56:25 Public ads include 1 submitter, 4 startd
5/5 15:56:25 Phase 2:  Performing accounting ...
5/5 15:56:25 Phase 3:  Sorting submitter ads by priority ...
5/5 15:56:25 Phase 4.1:  Negotiating with schedds ...
5/5 15:56:25 Negotiating with armenb@xxxxxxxxxxxxxxxxxxxxxxxxx at <155.34.66.121:50431>
5/5 15:56:25 0 seconds so far
5/5 15:56:25     Request 00002.00000:
5/5 15:56:25 Rejected 2.0 armenb@xxxxxxxxxxxxxxxxxxxxxxxxx <155.34.66.121:50431>: no match found
5/5 15:56:25     Got NO_MORE_JOBS;  done negotiating
5/5 15:56:25 ---------- Finished Negotiation Cycle ----------
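
To compare cycles across the full NegotiatorLog rather than just the two shown above, a short script along these lines may help. This is only a sketch and not part of Condor: the function name summarize_cycles is made up for illustration, and it assumes every line looks like the ones in the snippet above ("Started/Finished Negotiation Cycle", "Matched N.M ...", "Rejected N.M ...: reason", "Got NO_MORE_JOBS"). Other Condor versions may log differently.

```python
import re

# Patterns are based only on the line shapes visible in the log excerpt
# above; they are assumptions, not a documented Condor log format.
MATCHED_RE = re.compile(r"\bMatched (\d+\.\d+) ")
REJECTED_RE = re.compile(r"\bRejected (\d+\.\d+) .*: (.+)$")


def summarize_cycles(lines):
    """Return one dict per completed negotiation cycle.

    Each dict records the cycle's start timestamp, the job ids that were
    matched, the (job id, reason) pairs that were rejected, and whether
    the negotiator logged NO_MORE_JOBS during that cycle.
    """
    cycles = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if "Started Negotiation Cycle" in line:
            # The first two whitespace-separated fields are date and time.
            current = {
                "start": " ".join(line.split()[:2]),
                "matched": [],
                "rejected": [],
                "no_more_jobs": False,
            }
            continue
        if current is None:
            continue  # ignore anything outside a cycle
        if "Finished Negotiation Cycle" in line:
            cycles.append(current)
            current = None
            continue
        matched = MATCHED_RE.search(line)
        if matched:
            current["matched"].append(matched.group(1))
            continue
        rejected = REJECTED_RE.search(line)
        if rejected:
            current["rejected"].append((rejected.group(1), rejected.group(2)))
            continue
        if "Got NO_MORE_JOBS" in line:
            current["no_more_jobs"] = True
    return cycles


if __name__ == "__main__":
    import sys

    # Usage: python summarize_cycles.py /path/to/NegotiatorLog
    with open(sys.argv[1]) as log_file:
        for cycle in summarize_cycles(log_file):
            print(cycle["start"],
                  "matched:", cycle["matched"],
                  "rejected:", cycle["rejected"],
                  "NO_MORE_JOBS:", cycle["no_more_jobs"])
```

A cycle that shows a single rejection followed by NO_MORE_JOBS, while condor_q still lists idle jobs, is the pattern described in this message.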

Erik Paulson wrote:

On Thu, May 04, 2006 at 02:34:11PM -0400, Armen Babikyan wrote:
Hi Condor Team,

A few weeks ago I described a problem I was having with Condor not scheduling jobs on available resources. I've recreated the problem in a simpler way, without the need for a DAG. It seems like condor_negotiator and/or condor_collector are somehow misbehaving and not matching jobs when there are resources and jobs that match.


It'd be more useful to see the output of condor_status -l and condor_q -l
when the situation you're describing is happening, along with
the NegotiatorLog, and possibly the ScheddLog.

-Erik
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users



--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796