
Re: [HTCondor-users] Execute last DAGMan job as soon as possible



Checking the negotiator logs after these modifications I see the following:
--
12/02/16 17:49:31 ---------- Started Negotiation Cycle ----------
12/02/16 17:49:31 Phase 1:  Obtaining ads from collector ...
12/02/16 17:49:31   Getting startd private ads ...
12/02/16 17:49:31   Getting Scheduler, Submitter and Machine ads ...
12/02/16 17:49:31   Sorting 516 ads ...
12/02/16 17:49:31 Got ads: 516 public and 416 private
12/02/16 17:49:31 Public ads include 36 submitter, 416 startd
12/02/16 17:49:31 Phase 2:  Performing accounting ...
12/02/16 17:49:31 group quotas: assigning 36 submitters to accounting groups
12/02/16 17:49:31 group quotas: autoregroup mode: appended 1 submitters to group <none> negotiation
12/02/16 17:49:31 group quotas: assigning group quotas from 416 available weighted slots
12/02/16 17:49:31 group quotas: allocation round 1
12/02/16 17:49:31 group quotas: autoregroup mode: allocating 416 to group <none>
12/02/16 17:49:31 group quotas: groups= 2  requesting= 1  served= 1  unserved= 0  slots= 416  requested= 416  allocated= 416  surplus= 999663  maxdelta= 158
12/02/16 17:49:31 group quotas: autoregroup mode: forcing group <none> to negotiate last
12/02/16 17:49:31 group quotas: entering RR iteration n= 158
12/02/16 17:49:31 Group group_jobendGrp - skipping, zero slots allocated
12/02/16 17:49:31 Group <none> - BEGIN NEGOTIATION
12/02/16 17:49:31 subtree_usage at group_jobendGrp is 0
12/02/16 17:49:31 subtree_usage at <none> is 258
12/02/16 17:49:31 group quotas: autoregroup mode: negotiating with autoregroup for <none>
12/02/16 17:49:31 Phase 3:  Sorting submitter ads by priority ...
12/02/16 17:49:31 Phase 4.1:  Negotiating with schedds ...
--
And I have these settings:
GROUP_NAMES = group_jobendGrp
GROUP_PRIO_FACTOR_group_jobendGrp = 1.0
GROUP_QUOTA_group_jobendGrp = 1000000.0
GROUP_AUTOREGROUP = True
GROUP_ACCEPT_SURPLUS = True
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True

It seems that 90% of the time all slots are given to the <none> group by default, even though group_jobendGrp should have 1000 times higher priority than the rest of the users.
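
The "group quotas: groups= ..." summary line in the log above can be pulled apart with a few lines of Python, which makes it easier to compare cycles. This is only a minimal sketch: the function name and the regular expression are illustrative assumptions, not part of HTCondor, and the sample line is the one from the excerpt above.

```python
import re

# Matches "key= value" pairs as they appear in the summary line.
PAIR_RE = re.compile(r"(\w+)=\s*(\d+)")

def parse_quota_summary(line):
    """Return the key/value pairs of a 'group quotas: groups= ...' line as ints.

    Hypothetical helper; returns an empty dict if the marker is missing.
    """
    _, _, tail = line.partition("group quotas:")
    return {key: int(value) for key, value in PAIR_RE.findall(tail)}

line = ("12/02/16 17:49:31 group quotas: groups= 2  requesting= 1  served= 1  "
        "unserved= 0  slots= 416  requested= 416  allocated= 416  "
        "surplus= 999663  maxdelta= 158")
summary = parse_quota_summary(line)
print(summary["groups"], summary["requesting"])
```

In this cycle only one of the two groups was requesting slots, which fits the "Group group_jobendGrp - skipping, zero slots allocated" line and may mean that no submitter asked for slots under group_jobendGrp at that moment.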

Cheers,
Szabolcs

On Fri, Dec 2, 2016 at 4:35 PM, Szabolcs Horvátth <szabolcs.horvatth@xxxxxxxxx> wrote:
Hi Michael,

I tried setting NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True (it's probably the longest Condor attr I ever set! :)) and set the group quota to a huge number, but it did not really affect the speed of matching empty slots to high priority post process jobs. I still suspect that there are some claims and timeouts that delay the matchmaking.

Cheers,
Szabolcs

On Thu, Dec 1, 2016 at 7:27 PM, Michael Pelletier <Michael.V.Pelletier@raytheon.com> wrote:

While pondering this question, I found what looks like the information you need on page 334 of the 8.4.9 manual. In effect you want a "strict priority" policy for the post-processing DAG nodes:


One possible group quota policy is strict priority. For example, a site prefers physics users to match as many slots as they can, and only when all the physics jobs are running, and idle slots remain, are chemistry jobs allowed to run. The default "starvation group order" can be used to implement this. By setting configuration variable NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION to True, and setting the physics quota to a number so large that it cannot ever be met, such as one million, the physics group will always be the "most starving" group, will always negotiate first, and will always be unable to meet the quota. Only when all the physics jobs are running will the chemistry jobs then run.
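
The "most starving group negotiates first" idea in the manual excerpt can be illustrated with a toy model that orders groups by the ratio of usage to quota. This is only a sketch of the concept, not the negotiator's actual code: the function name, the data layout, and the usage and chemistry quota numbers are made up for illustration.

```python
def negotiation_order(groups):
    """Sort group names by usage/quota, lowest ratio (most starving) first.

    Toy model only; `groups` maps a name to hypothetical usage and quota.
    """
    return sorted(groups, key=lambda name: groups[name]["usage"] / groups[name]["quota"])

# Hypothetical numbers: physics has an unreachable quota of one million.
groups = {
    "chemistry": {"usage": 10, "quota": 100},
    "physics": {"usage": 300, "quota": 1_000_000},
}
order = negotiation_order(groups)
print(order)
```

Even with far more slots in use, physics keeps the smaller ratio because its quota can never be met, so in this model it is always placed ahead of chemistry.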


Your post-job is equivalent to "physics" and everything else is equivalent to "chemistry," I think.


        -Michael Pelletier.


From: HTCondor-users [mailto:htcondor-users-bounces@cs.wisc.edu] On Behalf Of Szabolcs Horvátth
Sent: Thursday, December 01, 2016 12:07 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Execute last DAGMan job as soon as possible


It turned out that we had modified the default prio factor to 10 (before the Condor default switched to 1000), so I changed all users' priority factor to 1000 and set the urgent group's priority to 1. It did shorten the time the jobs take to grab free slots, but it still takes between 10 and 15 minutes. What's interesting is that after these ten minutes lots of slots are allocated to the group, so the group priority obviously affects something. There might be some unintentional claim / timeout setting behind all this, but I don't know what to look for.

My main gripe is: why do the jobs wait for minutes when the jobs' machine rank is the highest in the pool, the group priority factor is the lowest, the job priority is also high, PRIORITY_HALFLIFE = 1 so the amount of resources used should not matter, and there *are* free slots that get matched to other users?

Cheers,

Szabolcs

