
Re: [HTCondor-users] Execute last DAGMan job as soon as possible



Here is another suspicious part from the NegotiatorLog. It seems that the post-processing group always gets zero slots allocated, even when there are idle group jobs to start and idle slots to run them. In Phase 4 the group user does get matches (presumably because of the group priority), but by that point it has to compete with the rest of the submitters for the available slots.

---
12/06/16 13:31:34 ---------- Started Negotiation Cycle ----------
12/06/16 13:31:34 Phase 1:  Obtaining ads from collector ...
12/06/16 13:31:34   Getting startd private ads ...
12/06/16 13:31:34   Getting Scheduler, Submitter and Machine ads ...
12/06/16 13:31:34   Sorting 484 ads ...
12/06/16 13:31:34 Got ads: 484 public and 391 private
12/06/16 13:31:34 Public ads include 26 submitter, 391 startd
12/06/16 13:31:34 Phase 2:  Performing accounting ...
12/06/16 13:31:34 group quotas: assigning 26 submitters to accounting groups
12/06/16 13:31:34 group quotas: assigning group quotas from 391 available weighted slots
12/06/16 13:31:34 group quotas: allocation round 1
12/06/16 13:31:34 group quotas: groups= 2  requesting= 1  served= 1  unserved= 0  slots= 391  requested= 1566  allocated= 1566  surplus= 998825  maxdelta= 1e+06
12/06/16 13:31:34 group quotas: entering RR iteration n= 1e+06
12/06/16 13:31:34 Group group_jobendGrp - skipping, zero slots allocated
12/06/16 13:31:34 Group <none> - BEGIN NEGOTIATION
12/06/16 13:31:34 subtree_usage at group_jobendGrp is 0
12/06/16 13:31:34 subtree_usage at <none> is 374
12/06/16 13:31:34 Phase 3:  Sorting submitter ads by priority ...
12/06/16 13:31:34 Phase 4.1:  Negotiating with schedds ...
12/06/16 13:31:34   Negotiating with group_jobendGrp@xxxxxxxxxxxxxxxxxxxxxxxxxx at <192.168.0.41:51386?addrs=192.168.0.41-51386>
12/06/16 13:31:34 0 seconds so far for this submitter
12/06/16 13:31:34 0 seconds so far for this schedd
12/06/16 13:31:34     Got NO_MORE_JOBS;  schedd has no more requests
12/06/16 13:31:34     Request 47968356.00000:  (request count 1 of 3)
12/06/16 13:31:34       Matched 47968356.0 group_jobendGrp@xxxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.41:51386?addrs=192.168.0.41-51386> preempting none <192.168.14.96:52841?addrs=192.168.14.96-52841> slot1@xxxxxxxxxxxxxx
12/06/16 13:31:34       Successfully matched with slot1@xxxxxxxxxxxxxx
12/06/16 13:31:34     Request 47968356.00000:  (request count 2 of 3)
12/06/16 13:31:34       Matched 47968356.0 group_jobendGrp@xxxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.41:51386?addrs=192.168.0.41-51386> preempting none <192.168.14.88:35103?addrs=192.168.14.88-35103> slot1@xxxxxxxxxxxxxx
12/06/16 13:31:34       Successfully matched with slot1@xxxxxxxxxxxxxx
12/06/16 13:31:34     Request 47968356.00000:  (request count 3 of 3)
12/06/16 13:31:34       Matched 47968356.0 group_jobendGrp@xxxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.41:51386?addrs=192.168.0.41-51386> preempting none <192.168.14.55:46344?addrs=192.168.14.55-46344> slot2@xxxxxxxxxxxxxx
12/06/16 13:31:34       Successfully matched with slot2@xxxxxxxxxxxxxx
12/06/16 13:31:34     Got NO_MORE_JOBS;  schedd has no more requests
12/06/16 13:31:34   Negotiating with someuser@xxxxxxxxxxxxxxxxxxxxxxxxxx at <192.168.0.41:51386?addrs=192.168.0.41-51386>
12/06/16 13:31:34 0 seconds so far for this submitter
12/06/16 13:31:34 0 seconds so far for this schedd
12/06/16 13:31:34     Got NO_MORE_JOBS;  schedd has no more requests
12/06/16 13:31:34     Request 47961012.00000:  (request count 1 of 1)
12/06/16 13:31:34       Matched 47961012.0 someuser@xxxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.41:51386?addrs=192.168.0.41-51386> preempting none <192.168.20.87:41946?addrs=192.168.20.87-41946> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx
12/06/16 13:31:34       Successfully matched with slot1@xxxxxxxxxxxxxx
---

Cheers,
Szabolcs

On Sat, Dec 3, 2016 at 2:22 PM, Szabolcs Horváth <szabolcs.horvatth@xxxxxxxxx> wrote:

Hi Todd,

> In other words, I do not understand why you enabled autoregroup, surplus, etc. Just makes things unnecessarily complicated. With the above, group_jobendGrp jobs should get first crack at slots until that group has 1000000 cpu cores (unless you also edited SLOT_WEIGHT).

I was trying the attributes suggested by Michael in an earlier reply. Using accounting groups was a great way to prioritize one type of job regardless of user and machine priority, but the slow matchmaking was an issue we were trying to solve.
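For context, a minimal sketch of the plain group setup Todd describes (the group name and the 1000000 quota come from his reply; the exact knob values are my assumption, not our production config):

```
# condor_config fragment -- a sketch, not a verified config
GROUP_NAMES = group_jobendGrp
GROUP_QUOTA_group_jobendGrp = 1000000
# Leave surplus sharing and autoregroup off so the group
# gets first crack at slots up to its (huge) quota.
GROUP_ACCEPT_SURPLUS = False
GROUP_AUTOREGROUP = False
```

With a quota that large and no SLOT_WEIGHT changes, the group should effectively never be quota-limited, which is the behavior Todd expected.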

> Are there always idle jobs waiting in the queue submitting to group_jobendGrp? If not, what will happen is at times when there are no idle group_jobendGrp jobs idle, your regular non-jobendGrp jobs will claim the slots, and you will need to wait for the claim on those slots to be relinquished (unless you setup preemption of your non-jobendGrp jobs, which has its own drawbacks).

There are always idle jobs in the group. They accumulate for a while (the delay seems to correlate with the number of submitting users), and it can take more than ten minutes for them to match. In the meantime, lots of ungrouped jobs are matched and started.

Cheers,
Szabolcs