[Condor-users] Groups with weighted slots in 7.6.9

Hi all,

I'm having an interesting problem with group quotas and weighted slots.
While experimenting I came across something that looks like a bug
similar to the one ticket #2958 was supposed to address.

The setup involves a large number of machines with three 8-core slots
each (about 2000 cores total). When using group quotas I see the
following behavior:

First, I submit 20 jobs that match only those slots (no other contention,
plenty of free slots), each with "request_cpus = 8" and belonging to an
AccountingGroup with a quota of >2000. Here is what I see (grep for
"group_atlas.prod.mp" in the attached logs for the full story): the
first two jobs match, then the rest are rejected with "group quota
exceeded" warnings. It appears that the groupQuota it sees is 20 (the
number of idle jobs). The first match uses 8 of that and the second
brings the usage to 16, then the next fails because "pieLeft" is 4.0. It
is as if the weights are applied only after a match and are not
accounted for in the matchmaking limit itself (pieLeft is 20.0 at the
start; should it be 160.0?).

It is reproducible with numbers other than 20 jobs and 8 cores: with <N>
k-core jobs in the queue, up to floor(N/k) jobs will match before
exceeding the quota.
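
To make the arithmetic concrete, here is a rough toy model of what I think
I am seeing. This is only a sketch of the observed numbers, not the
negotiator's actual code; the function name, its parameters, and the
"reject when pieLeft is smaller than the slot weight" rule are my own
assumptions for illustration.

```python
def matches_before_quota_exceeded(idle_jobs, slot_weight, pie_left=None):
    """Toy model of the observed behavior (hypothetical, not HTCondor code).

    idle_jobs   -- number of idle jobs in the group (N)
    slot_weight -- weight charged per match (k, here the core count)
    pie_left    -- starting allowance; defaults to the idle job count,
                   which is what the logs appear to show
    """
    if pie_left is None:
        pie_left = float(idle_jobs)  # observed: pieLeft starts at N, not N*k
    matched = 0
    for _ in range(idle_jobs):
        if pie_left < slot_weight:
            break  # assumed point of the "group quota exceeded" rejection
        pie_left -= slot_weight  # each match is charged the full slot weight
        matched += 1
    return matched, pie_left


# Observed case: 20 jobs, 8-core slots, pieLeft starting at 20.0
print(matches_before_quota_exceeded(20, 8))                  # (2, 4.0)
# What I would expect if the weight were included up front (20 * 8)
print(matches_before_quota_exceeded(20, 8, pie_left=160.0))  # (20, 0.0)
# Slot weight forced to 1
print(matches_before_quota_exceeded(20, 1))                  # (20, 0.0)
```

The first call reproduces the two matches and the leftover 4.0 from the
logs, and in general the model gives floor(N/k) matches.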

The workaround I found is to set "SlotWeight=1" on the 8-core slots,
which makes things work great except for the accounting (which doesn't
matter for what we are doing right now).

We may be going to 7.8 soon, so it may not be an issue if it is fixed
there, but in case it isn't, I figured I'd report my findings anyway.

Will Strecker-Kellogg

Attachment: group_log.gz
Description: GNU Zip compressed data