
Re: [Condor-users] Groups with weighted slots in 7.6.9



Hi,

Sure: we aren't using auto-regroup for any groups. Some groups in our
tree have accept_surplus set, but the prod.mp group in question
doesn't have it set, and neither does its parent or any of its direct
siblings.

I haven't replicated the environment entirely in 7.8, but running a
simple test in 7.8.3 does not trigger the problem (things behave as
expected).

-Will


On 09/11/2012 05:04 PM, Erik Erlandson wrote:
> Hi William,
> 
> I agree that it looks like #2958, and that fix went in at 7.6.7. 
> 
> Can you describe any configuration related to GROUP_AUTOREGROUP[_*]
> and/or GROUP_ACCEPT_SURPLUS[_*]?
> 
> 
> On Tue, 2012-09-11 at 16:37 -0400, William Strecker-Kellogg wrote:
>> Hi all,
>>
>> There is an interesting problem I'm having, related to the use of group
>> quotas and weighted slots. While experimenting I came across something
>> that looks like a bug similar to the one that was supposed to be
>> addressed by ticket #2958.
>>
>> The setup involves a large number of machines with three 8-core slots
>> each (about 2000 cores total). When using group quotas I see the
>> following behavior:
>>
>> First, I submit 20 jobs matching only those slots (no other contention,
>> plenty of free slots), each with "request_cpus = 8" and belonging to an
>> AccountingGroup with a quota of >2000. I see the following (grep for
>> "group_atlas.prod.mp" in the attached logs for the full story): the
>> first two jobs match, then the rest are rejected with "group quota
>> exceeded" warnings. It appears that the groupQuota it sees is 20 (the
>> number of idle jobs); the first match uses 8, the second brings usage
>> to 16, and the next fails because "pieLeft" is 4.0. It is as
>> if the weights are being applied only after it matches and are not
>> accounted for in its matchmaking limit (pieLeft is 20.0 at the
>> start; should it be 160.0?).
>>
>> It is reproducible with numbers other than 20 jobs and 8 cores: with N
>> k-core jobs in a queue, up to floor(N/k) jobs will match before exceeding
>> the quota.
>>
>> The workaround I found is to set "SlotWeight=1" on the 8-core slots,
>> which makes things work great except for the accounting (which doesn't
>> matter for what we are doing right now).
>>
>> We may be going to 7.8 soon so it may not be an issue if it is fixed
>> then, but in case it isn't I figured I'd report on my findings anyway.
>>
>> Thanks,
>> Will Strecker-Kellogg
>> RACF/BNL
>>
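
The arithmetic in the original report can be reproduced with a small toy
model. This is only a sketch of the reported symptoms, not the negotiator's
actual code: the function name and the quota_in_weighted_units flag are
hypothetical, and the model assumes the group's limit is capped by demand
(the number of idle jobs) rather than by the configured quota of >2000.

```python
def count_matches(num_idle_jobs, slot_weight, quota_in_weighted_units):
    """Toy model of one negotiation pass for a single group.

    num_idle_jobs: idle jobs in the group (20 in the report).
    slot_weight: weight charged per match (8 for an 8-core slot).
    quota_in_weighted_units: hypothetical switch. False models the
        reported behaviour, where the limit equals the raw job count.
        True models the expected behaviour, where the limit is scaled
        by the slot weight.
    """
    if quota_in_weighted_units:
        pie_left = float(num_idle_jobs * slot_weight)
    else:
        pie_left = float(num_idle_jobs)

    matched = 0
    # Each match consumes slot_weight units of the remaining pie.
    while matched < num_idle_jobs and pie_left >= slot_weight:
        pie_left -= slot_weight
        matched += 1
    return matched, pie_left


if __name__ == "__main__":
    print(count_matches(20, 8, quota_in_weighted_units=False))
    print(count_matches(20, 8, quota_in_weighted_units=True))
```

With 20 jobs and a weight of 8, the unscaled limit allows floor(20/8) = 2
matches and leaves pieLeft at 4.0, as seen in the logs; with the limit scaled
to 160.0, all 20 jobs match.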