
Re: [HTCondor-users] Execute last DAGMan job as soon as possible

Hi Michael,

I tried setting NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True (it's probably the longest Condor attribute I've ever set! :)) and set the group quota to a huge number, but it did not really
affect how quickly empty slots get matched to the high-priority post-processing jobs. I still suspect that some claims or timeouts are delaying the matchmaking.


On Thu, Dec 1, 2016 at 7:27 PM, Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

While pondering this question, I found what looks like the information you need on page 334 of the 8.4.9 manual. In effect you want a "strict priority" policy for the post-processing DAG nodes:


One possible group quota policy is strict priority. For example, a site prefers physics users to match as many
slots as they can, and only when all the physics jobs are running, and idle slots remain, are chemistry jobs allowed
to run. The default "starvation group order" can be used to implement this. By setting configuration variable
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION to True, and setting the physics quota to a number so
large that it cannot ever be met, such as one million, the physics group will always be the "most starving" group, will
always negotiate first, and will always be unable to meet the quota. Only when all the physics jobs are running will
the chemistry jobs then run.


Your post-job is equivalent to "physics" and everything else is equivalent to "chemistry," I think.
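
As a rough illustration of the manual excerpt above, the negotiator configuration might look something like the fragment below. This is only a sketch: the group names group_urgent and group_default and the quota of 100 are placeholders made up for the example, not values taken from this pool, and the post-processing jobs would also need to be submitted under the urgent accounting group for the quota to apply to them.

```
# Hypothetical accounting groups; substitute the names used at your site.
GROUP_NAMES = group_urgent, group_default

# Allow the sum of the group quotas to exceed the number of slots in the pool.
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True

# A quota so large it can never be met, so this group is always the
# "most starving" one and negotiates first.
GROUP_QUOTA_group_urgent = 1000000

# Placeholder quota for everything else.
GROUP_QUOTA_group_default = 100
```

A change like this has to be picked up by the negotiator (for example with condor_reconfig) before it affects the next negotiation cycle.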


-Michael Pelletier.


From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Szabolcs Horvátth
Sent: Thursday, December 01, 2016 12:07 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Execute last DAGMan job as soon as possible


It turned out that we had changed the default priority factor to 10 (before the Condor default switched to 1000), so I set every user's priority factor to 1000 and set the urgent group's priority factor to 1. That did shorten the time it takes for the jobs to grab free slots, but it still takes between 10 and 15 minutes. What's interesting is that after those ten minutes lots of slots are allocated to the group, so the group priority is obviously affecting something. There might be some unintentional claim or timeout setting behind all this, but I don't know what to look for.

My main gripe is this: why do the jobs wait for minutes when their machine rank is the highest in the pool, the group priority factor is the lowest, the job priority is also high, PRIORITY_HALFLIFE = 1 (so the amount of resources used should not matter), and there *are* free slots that get matched to other users?


