[HTCondor-users] Unnecessary flocking



Hi all,

We have two Condor pools in our institute. Pool A consists of machines dedicated to calculations only; Pool B consists of staff desktops. All machines are configured with partitionable slots.

After we configured flocking from Pool A to Pool B, I see some strange (or at least undesired) behavior from Condor.

Pool A has no tasks running. A user submits 40 jobs. The pool has ten 8-core computers, and the slots are dynamic. Each job requests 2 cores, so all of these jobs fit exactly into the pool. However, only the first 10 jobs start running on Pool A; then a portion of the jobs immediately flocks to Pool B, and about a minute later the remaining jobs also start on Pool A.
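
As a quick sanity check of the claim that the jobs fit exactly, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted above; the function names are made up for illustration.

```python
# Capacity check for Pool A, using only the numbers quoted in the post.
MACHINES = 10           # computers in Pool A
CORES_PER_MACHINE = 8   # each computer has 8 cores
JOBS = 40               # jobs submitted by the user
CORES_PER_JOB = 2       # each job requests 2 cores


def pool_capacity(machines: int, cores_per_machine: int) -> int:
    """Total number of cores in the pool."""
    return machines * cores_per_machine


def total_demand(jobs: int, cores_per_job: int) -> int:
    """Total number of cores requested by all jobs."""
    return jobs * cores_per_job


def jobs_per_machine(cores_per_machine: int, cores_per_job: int) -> int:
    """How many jobs one machine can hold if its cores are split evenly."""
    return cores_per_machine // cores_per_job


capacity = pool_capacity(MACHINES, CORES_PER_MACHINE)
demand = total_demand(JOBS, CORES_PER_JOB)
per_machine = jobs_per_machine(CORES_PER_MACHINE, CORES_PER_JOB)

print(f"capacity={capacity} cores, demand={demand} cores")
print(f"jobs per machine={per_machine}, all jobs fit: {demand <= capacity}")
```

So the pool can hold all 40 jobs only if each machine runs 4 of them at once; with just one job per machine, only 10 jobs run.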

The question is: why do jobs start to flock when the local pool still has plenty of available resources?

I would expect flocking to start only after all local resources are exhausted (especially if they are dedicated!).

Thanks a lot for any help on how to prevent such early flocking!

Alexey

=====
Negotiator log for Pool A is below

12/13/12 09:30:54 ---------- Started Negotiation Cycle ----------
12/13/12 09:30:54 Phase 1:  Obtaining ads from collector ...
12/13/12 09:30:54   Getting Scheduler, Submitter and Machine ads ...
12/13/12 09:30:54 Trying to query collector [...]
12/13/12 09:30:54   Sorting 13 ads ...
12/13/12 09:30:54   Getting startd private ads ...
12/13/12 09:30:54 Got ads: 13 public and 10 private
12/13/12 09:30:54 Public ads include 1 submitter, 10 startd
12/13/12 09:30:54 Phase 2:  Performing accounting ...
12/13/12 09:30:54 Phase 3:  Sorting submitter ads by priority ...
12/13/12 09:30:54 Phase 4.1:  Negotiating with schedds ...
12/13/12 09:30:54     numSlots = 10
12/13/12 09:30:54     slotWeightTotal = 80.000000
12/13/12 09:30:54     pieLeft = 80.000
12/13/12 09:30:54     NormalFactor = 1.000000
12/13/12 09:30:54     MaxPrioValue = 1.782045
12/13/12 09:30:54     NumSubmitterAds = 1
12/13/12 09:30:54   Negotiating with [...]
12/13/12 09:30:54   Calculating submitter limit with the following parameters
12/13/12 09:30:54     SubmitterPrio       = 1.782045
12/13/12 09:30:54     SubmitterPrioFactor = 1.000000
12/13/12 09:30:54     submitterShare      = 1.000000
12/13/12 09:30:54     submitterAbsShare   = 1.000000
12/13/12 09:30:54     submitterLimit    = 80.000000
12/13/12 09:30:54     submitterUsage    = 0.000000
12/13/12 09:30:54 Socket to [...] already in cache, reusing

12/13/12 09:30:54     Got JOB_INFO command; getting classad/eom
12/13/12 09:30:54     Request 00895.00000:
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 0.000000 pieLeft 80.000000
12/13/12 09:30:54 Start of sorting MatchList (len=10)
12/13/12 09:30:54 Finished sorting MatchList
12/13/12 09:30:54       Matched 895.0 [...] preempting none slot1@v160.[...]
12/13/12 09:30:54       Successfully matched with slot1@v160.[...]
12/13/12 09:30:54     Sending SEND_JOB_INFO/eom

[ 8 more jobs matched here]

12/13/12 09:30:54     Got JOB_INFO command; getting classad/eom
12/13/12 09:30:54     Request 00895.00009:
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 72.000000 pieLeft 8.000000
12/13/12 09:30:54 Attempting to use cached MatchList: Succeeded. [...]
12/13/12 09:30:54       Matched 895.9 [...] preempting none slot1@v163.[...]
12/13/12 09:30:54       Notifying the accountant
12/13/12 09:30:54       Successfully matched with slot1@v163.[...]
12/13/12 09:30:54     Over submitter resource limit (80.000000, used 80.000000) ... only consider startd ranks
12/13/12 09:30:54     Sending SEND_JOB_INFO/eom

[so condor believes that we don't have any resources left...]

12/13/12 09:30:54     Request 00895.00010:
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 80.000000 pieLeft 0.000000
12/13/12 09:30:54       Rejected 895.10 [...]: no match found
12/13/12 09:30:54     Sending SEND_JOB_INFO/eom
12/13/12 09:30:54     Getting reply from schedd ...
12/13/12 09:30:54     Got NO_MORE_JOBS;  done negotiating
12/13/12 09:30:54   This submitter hit its submitterLimit.
12/13/12 09:30:54  resources used scheddUsed= 80.000000
12/13/12 09:30:54  negotiateWithGroup resources used scheddAds length 1
12/13/12 09:30:54 ---------- Finished Negotiation Cycle ----------
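
The "used" figures in this cycle (0 for the first request, 72 for 895.9, 80 for 895.10) are consistent with every match being charged 8, i.e. a whole 8-core slot rather than the 2 cores the job requested. The Python sketch below replays the cycle under that reading. The charging rule and the one-match-per-slot rule are assumptions inferred from the log values, not taken from the Condor sources, and `simulate_cycle` is a made-up name.

```python
def simulate_cycle(limit, slot_weights, n_jobs):
    """Replay one negotiation cycle under an ASSUMED accounting rule:
    each match is charged the full weight of the slot it lands on, and
    each slot is matched at most once per cycle.

    Returns (matched job indices, 'used' value seen by each request,
    final 'used' value).
    """
    used = 0.0
    free_slots = list(slot_weights)
    matched = []
    used_seen = []
    for job in range(n_jobs):
        used_seen.append(used)          # value in the "limit ... used ..." line
        if used >= limit or not free_slots:
            break                       # this request is rejected
        used += free_slots.pop(0)       # charge the whole slot weight
        matched.append(job)
    return matched, used_seen, used


# Figures from the log: submitterLimit = 80, ten slots of weight 8, 40 jobs.
matched, used_seen, final_used = simulate_cycle(80.0, [8.0] * 10, 40)

print("matched jobs:", matched)
print("used seen by request 9:", used_seen[9])
print("used seen by request 10:", used_seen[10])
print("final used:", final_used)
```

Under these assumptions the submitter limit is reached after 10 matches, so jobs 10 to 39 get no local match in this cycle even though only 20 of the 80 cores are actually claimed.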

[at this time 39 jobs flock into a different pool]

12/13/12 09:31:54 ---------- Started Negotiation Cycle ----------
12/13/12 09:31:54 Phase 1:  Obtaining ads from collector ...
12/13/12 09:31:54   Getting Scheduler, Submitter and Machine ads ...
12/13/12 09:31:55   Sorting 23 ads ...
12/13/12 09:31:55   Getting startd private ads ...
12/13/12 09:31:55 Got ads: 23 public and 20 private
12/13/12 09:31:55 Public ads include 1 submitter, 20 startd
12/13/12 09:31:55 Phase 2:  Performing accounting ...
12/13/12 09:31:55 Phase 3:  Sorting submitter ads by priority ...
12/13/12 09:31:55 Phase 4.1:  Negotiating with schedds ...
12/13/12 09:31:55     numSlots = 20
12/13/12 09:31:55     slotWeightTotal = 80.000000
12/13/12 09:31:55     pieLeft = 60.000
12/13/12 09:31:55     NormalFactor = 1.000000
12/13/12 09:31:55     MaxPrioValue = 1.820312
12/13/12 09:31:55     NumSubmitterAds = 1
12/13/12 09:31:55   Negotiating with [...]
12/13/12 09:31:55 0 seconds so far
12/13/12 09:31:55   Calculating submitter limit with the following parameters
12/13/12 09:31:55     SubmitterPrio       = 1.820312
12/13/12 09:31:55     SubmitterPrioFactor = 1.000000
12/13/12 09:31:55     submitterShare      = 1.000000
12/13/12 09:31:55     submitterAbsShare   = 1.000000
12/13/12 09:31:55     submitterLimit    = 60.000000
12/13/12 09:31:55     submitterUsage    = 20.000000
12/13/12 09:31:55 Socket to [...] already in cache, reusing

[Well, I don't understand condor! A minute ago there were no resources left but now we have!]
[Is it because of the partitionable slots?]

12/13/12 09:31:55     Got JOB_INFO command; getting classad/eom
12/13/12 09:31:55     Request 00895.00039:
12/13/12 09:31:55 matchmakingAlgorithm: limit 60.000000 used 0.000000 pieLeft 60.000000
12/13/12 09:31:55 Start of sorting MatchList (len=10)
12/13/12 09:31:55 Finished sorting MatchList
12/13/12 09:31:55       Matched 895.39 [...] preempting none slot1@v160.[...]
12/13/12 09:31:55       Notifying the accountant
12/13/12 09:31:55       Successfully matched with slot1@v160.[...]
12/13/12 09:31:55     Sending SEND_JOB_INFO/eom
12/13/12 09:31:55     Got NO_MORE_JOBS;  done negotiating
12/13/12 09:31:55   Submitter [...] got all it wants; removing it.
12/13/12 09:31:55  resources used by [...] are 26.000000
12/13/12 09:31:55  resources used scheddUsed= 26.000000
12/13/12 09:31:55  negotiateWithGroup resources used scheddAds length 0
12/13/12 09:31:55 ---------- Finished Negotiation Cycle ----------
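
The figures in the second cycle can be cross-checked the same way. The short Python sketch below reproduces submitterUsage = 20, pieLeft = 60, numSlots = 20 and "resources used ... 26" from the job and machine sizes. It assumes that running jobs are counted at the 2 cores they requested, and that the new match is charged the 6 cores left over in its partitionable slot; both are my reading of the log, not confirmed behavior.

```python
# Cross-check of the second negotiation cycle against the logged values.
SLOT_WEIGHT_TOTAL = 80.0   # slotWeightTotal from the log
MACHINES = 10              # partitionable slots, one per machine
CORES_PER_MACHINE = 8
CORES_PER_JOB = 2
RUNNING_JOBS = 10          # jobs started by the first cycle


def second_cycle_figures():
    """Return (num_slots, submitter_usage, pie_left, used_after_match)."""
    # One dynamic slot per running job, next to the partitionable parents.
    num_slots = MACHINES + RUNNING_JOBS
    # ASSUMPTION: running jobs are accounted at their requested cores.
    submitter_usage = float(RUNNING_JOBS * CORES_PER_JOB)
    pie_left = SLOT_WEIGHT_TOTAL - submitter_usage
    # ASSUMPTION: the new match is charged the leftover of its parent slot.
    leftover = float(CORES_PER_MACHINE - CORES_PER_JOB)
    used_after_match = submitter_usage + leftover
    return num_slots, submitter_usage, pie_left, used_after_match


num_slots, submitter_usage, pie_left, used_after_match = second_cycle_figures()
print(f"numSlots={num_slots}, submitterUsage={submitter_usage}, "
      f"pieLeft={pie_left}, used after match={used_after_match}")
```

If this reading is right, the resources "reappear" in the second cycle because the running jobs are now charged at 2 cores each instead of 8.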