
[HTCondor-users] Subgroups and GROUP_SORT_EXPR



Hi all,

I am currently trying to pinpoint an imbalance in the way HTCondor assigns resources. In our grid cluster, one group often seems to be preferred for resources even when other groups are below their share.

In our case, groups are deeply nested but users are almost unique per group. For example, we have groups `A.Prod.MC`, `A.Prod.SC`, `A.User.MC`, ... and there are only 1-2 users in each of those groups (e.g. `aprod1` and `aprod2`). `GROUP_ACCEPT_SURPLUS` is enabled since only the top-level share is actually guaranteed, and the subgroups are fine-tuning that is partially out of our control.

We basically use the default `GROUP_SORT_EXPR`, i.e. the groups are ranked by "relative satisfied quota" of `GroupResourcesInUse/GroupQuota`. Our assumption was that this should even out things to converge towards the share, but this seems to be wrong, and I'd like to understand why.

Knobs, knobs, knobs question time first!
- Do subgroups get sorted by `GROUP_SORT_EXPR` based on their parent groups in some way? Can we somehow use parent group information for sorting subgroups?
- Do user priorities, `PRIORITY_HALFLIFE`, ... take effect only *inside* or also *across* groups? Do all users get ranked by priority if they end up in the `<none>` group via autoregrouping or surplus?

------------

Now, for the thing we are observing... and does my theory make sense there?

Simplified, we have one user group with a lot of internal structure, and it is also the largest group overall. For example, `A` has a 30% share, `A.Prod` has 50% relative / 15% total share, and `A.Prod.MC` and `A.Prod.SC` each have 50% relative / 7.5% total share. The runner-up `B` has a 20% share, with flat subgroups: `B.Prod` at 100% relative / 20% total share and `B.Prod.MC` at 100% relative / 20% total share.
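
To make the share arithmetic explicit, here is a minimal Python sketch that derives the total share of each group by multiplying the relative shares along its path. The group names and percentages are the ones from the example; the dictionary layout and the `total_share` helper are made up for illustration and are not anything HTCondor provides.

```python
# Relative shares from the example above. Top-level entries are relative
# to the whole pool, subgroup entries are relative to their parent.
# (Hypothetical data layout, for illustration only.)
RELATIVE_SHARE = {
    "A": 0.30,
    "A.Prod": 0.50,
    "A.Prod.MC": 0.50,
    "A.Prod.SC": 0.50,
    "B": 0.20,
    "B.Prod": 1.00,
    "B.Prod.MC": 1.00,
}


def total_share(group):
    """Multiply the relative shares of `group` and all of its ancestors."""
    parts = group.split(".")
    share = 1.0
    for depth in range(1, len(parts) + 1):
        share *= RELATIVE_SHARE[".".join(parts[:depth])]
    return share


for name in RELATIVE_SHARE:
    print(f"{name:10s} {total_share(name) * 100:5.1f}% of the pool")
```

With these inputs, `A.Prod.MC` and `A.Prod.SC` each come out at 7.5% of the pool, while every group in the `B` branch stays at 20%.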

Now, say `B` is using half its resources while `A` is using all its resources for `A.Prod.MC`. That means `GROUP_SORT_EXPR` ranks group `B` and all its subgroups at 0.5, `A` at 1, `A.Prod` at 2 and `A.Prod.MC` at 4. So `B` has the lowest rank and should get resources first, as we expect.

What happens if `A` now adds jobs for the unused subgroup `A.Prod.SC`?

As far as I can tell, groups are sorted without regard to their position in the hierarchy. The new requests for `A.Prod.SC` have no resources at all, so `GROUP_SORT_EXPR` ranks it at 0, much better than group `B` at 0.5! Even though both parent groups `A` and `A.Prod` rank worse than `B` and all its subgroups, `A.Prod.SC` can "jump the queue" and get resources first.
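
The suspected ordering can be reproduced with a small Python sketch. It only models the theory described here, namely that every group is ranked on its own by `GroupResourcesInUse/GroupQuota` and that usage in a leaf group also counts towards its parents. Both points are assumptions, not confirmed negotiator behaviour, and all names in the code are hypothetical.

```python
# Quotas as a percentage of the pool, taken from the example above.
QUOTA = {
    "A": 30.0, "A.Prod": 15.0, "A.Prod.MC": 7.5, "A.Prod.SC": 7.5,
    "B": 20.0, "B.Prod": 20.0, "B.Prod.MC": 20.0,
}

# Resources in use per leaf group, also as a percentage of the pool:
# A spends its whole share on A.Prod.MC, B uses half of its share.
LEAF_USAGE = {"A.Prod.MC": 30.0, "A.Prod.SC": 0.0, "B.Prod.MC": 10.0}


def usage(group):
    """Usage of a group, assuming leaf usage rolls up into all ancestors."""
    return sum(
        used
        for leaf, used in LEAF_USAGE.items()
        if leaf == group or leaf.startswith(group + ".")
    )


def sort_key(group):
    """Stand-in for the ratio GroupResourcesInUse / GroupQuota."""
    return usage(group) / QUOTA[group]


# Lowest ratio is served first; parents play no role in a group's own key.
order = sorted(QUOTA, key=sort_key)
for name in order:
    print(f"{name:10s} {sort_key(name):.1f}")
```

Under these assumptions `A.Prod.SC` sorts ahead of every `B` group, although `A` and `A.Prod` sort behind them.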

In reverse, when `A` does not use all their resources, then the top-level group has an extremely good rank simply because their share is the largest. In effect, it looks like in most situations group `A` gets resources first unless things are *extremely* unbalanced.

Is that assumption correct? Is there any obvious way to restrict subgroup rankings when their parent groups are already getting their share?

Cheers,
Max
