
Re: [HTCondor-users] Execute last DAGMan job as soon as possible



On 12/2/2016 10:57 AM, Szabolcs Horvátth wrote:
And I have these settings:
GROUP_NAMES = group_jobendGrp
GROUP_PRIO_FACTOR_group_jobendGrp = 1.0
GROUP_QUOTA_group_jobendGrp = 1000000.0
GROUP_AUTOREGROUP = True
GROUP_ACCEPT_SURPLUS = True
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True


Seems to me that if you just want jobs submitted with
  accounting_group = group_jobendGrp
to be offered resources ahead of any other jobs, with fair-share across all users who submit into group_jobendGrp, you just need the following in your negotiator condor_config:

GROUP_NAMES = group_jobendGrp
GROUP_QUOTA_group_jobendGrp = 1000000.0
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True

In other words, I do not understand why you enabled autoregroup, surplus, etc. That just makes things unnecessarily complicated. With the above, group_jobendGrp jobs should get first crack at slots until that group is using 1,000,000 CPU cores (unless you also edited SLOT_WEIGHT).
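
For completeness, a job opts into that group via its submit description file; a minimal sketch (the executable name is a placeholder, the accounting group is the one from your config):

  # submit description file for the final/post-processing job
  universe         = vanilla
  executable       = postprocess.sh
  accounting_group = group_jobendGrp
  queue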

It seems that 90% of the time all slots are given to the <none> group by
default, even though group_jobendGrp should have 1000 times more
priority than the rest of the users.


Are there always idle jobs submitting into group_jobendGrp waiting in the queue? If not, then whenever there are no idle group_jobendGrp jobs, your regular non-jobendGrp jobs will claim the slots, and you will need to wait for the claims on those slots to be relinquished (unless you set up preemption of your non-jobendGrp jobs, which has its own drawbacks).
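
If that is what is happening, one knob you could experiment with (an assumption on my part, so test it first) is CLAIM_WORKLIFE on the execute nodes, which bounds how long a claim may keep accepting new jobs before the slot is returned to the negotiator:

  # execute-node condor_config sketch: stop claims from accepting
  # new jobs after 5 minutes, so slots get re-matched sooner.
  # Shorter worklife trades some throughput for scheduling latency.
  CLAIM_WORKLIFE = 300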

regards,
Todd

Cheers,
Szabolcs

On Fri, Dec 2, 2016 at 4:35 PM, Szabolcs Horvátth
<szabolcs.horvatth@xxxxxxxxx> wrote:

    Hi Michael,

    I tried setting NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True (it's
    probably the longest Condor attribute I have ever set! :)) and set
    the group quota to a huge number, but it did not really affect the
    speed of matching empty slots to high-priority post-processing jobs.
    I still suspect that there are some claims and timeouts that delay
    the matchmaking.
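
    One thing I still want to rule out is the negotiation cycle itself;
    a sketch of the two intervals involved (values shown are, I believe,
    the stock defaults, not recommendations):

        # central-manager / submit-node condor_config:
        # how often a new negotiation cycle may start, and how often
        # the schedd refreshes its ad in the collector.
        NEGOTIATOR_INTERVAL = 60
        SCHEDD_INTERVAL     = 300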

    Cheers,
    Szabolcs

    On Thu, Dec 1, 2016 at 7:27 PM, Michael Pelletier
    <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

        While pondering this question, I found what looks like the
        information you need on page 334 of the 8.4.9 manual – in effect
        you want a “strict priority” policy for the post-processing DAG
        nodes:

        One possible group quota policy is strict priority. For example,
        a site prefers physics users to match as many slots as they can,
        and only when all the physics jobs are running, and idle slots
        remain, are chemistry jobs allowed to run. The default
        "starvation group order" can be used to implement this. By
        setting configuration variable
        NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION to True, and setting the
        physics quota to a number so large that it cannot ever be met,
        such as one million, the physics group will always be the "most
        starving" group, will always negotiate first, and will always be
        unable to meet the quota. Only when all the physics jobs are
        running will the chemistry jobs then run.

        Your post-job is equivalent to “physics” and everything else is
        equivalent to “chemistry,” I think.
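
        In config terms, the manual's example boils down to something
        like this (group names are taken from the passage above; treat
        the values as a sketch rather than tested settings):

            # negotiator condor_config implementing the manual's
            # "strict priority" example: physics always negotiates
            # first; chemistry runs only when physics has no idle jobs.
            GROUP_NAMES = group_physics, group_chemistry
            GROUP_QUOTA_group_physics = 1000000
            NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True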

                        -Michael Pelletier

        *From:* HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
        *On Behalf Of* Szabolcs Horvátth
        *Sent:* Thursday, December 01, 2016 12:07 PM
        *To:* HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
        *Subject:* Re: [HTCondor-users] Execute last DAGMan job as soon
        as possible

        It turned out that we had modified the default priority factor
        to 10 (before the Condor default switched to 1000), so I changed
        all users' priority factors to 1000 and set the urgent group's
        priority factor to 1. It did help shorten the time it takes for
        the jobs to grab free slots, but it still takes 10-15 minutes.
        What's interesting is that after those ten minutes lots of slots
        are allocated to the group, so something is clearly responding
        to the group priority. There might be some unintentional claim /
        timeout setting behind all this, but I don't know what to look
        for.

        My main gripe is: why do the jobs wait for minutes when the
        jobs' machine rank is the highest in the pool, the group's
        priority factor is the lowest, the job priority is also high,
        PRIORITY_HALFLIFE = 1 so the amount of resources already used
        should not matter, and there *are* free slots that get matched
        to other users?
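
        For reference, the factor changes were made with condor_userprio,
        roughly like this (the user name is an example; the group factor
        can also be pinned in config via GROUP_PRIO_FACTOR_group_jobendGrp,
        as in the settings quoted earlier):

            # inspect current priorities and priority factors
            condor_userprio -all

            # raise a user's factor, lower the urgent group's factor
            condor_userprio -setfactor some.user@example.com 1000
            condor_userprio -setfactor group_jobendGrp 1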

        Cheers,

        Szabolcs



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685