
Re: [HTCondor-users] Quota changes for one group without(?) effect



Hi Thomas,

We had a similar problem because FOO was running very short-lived 1-core jobs. This may be especially noticeable if BAZ was not submitting jobs before.

Since FOO's demand can be met with either 1-core or 8-core jobs, a 1-core job will take over a freed 8-core slot every now and then.
If FOO's 1-core jobs are short-lived, they rapidly acquire and release the resulting 1-core slots.
At some point, a freed 1-core slot is given to BAZ, which then holds it for a regular job duration.
Since Condor cannot give the remaining fragments to 8-core jobs, it hands them to 1-core jobs of FOO and BAZ again.
Because the FOO jobs finish so quickly, they are gradually replaced by BAZ jobs.
As a result, 8-core slots rapidly degrade into long-lived 1-core slots for BAZ.
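
As a rough illustration of this drift, here is a toy model. It is a simplification for illustration only, not how the HTCondor negotiator actually matches jobs. It assumes that every short FOO job ends within one negotiation cycle, that BAZ jobs outlive the whole run, and that a fixed fraction of the freed cores (the made-up value 0.25) goes to BAZ in each cycle. The function name and all parameters are hypothetical.

```python
def baz_share_over_time(cores=8, cycles=10, baz_fraction=0.25):
    """Toy model: fraction of one node held by BAZ after each cycle."""
    foo_cores = float(cores)  # start: all cores run short 1-core FOO jobs
    baz_cores = 0.0           # BAZ jobs are assumed to outlive the whole run
    history = []
    for _ in range(cycles):
        freed = foo_cores               # every short FOO job ends within one cycle
        to_baz = freed * baz_fraction   # assumed split of the freed cores
        foo_cores = freed - to_baz      # the rest is re-claimed by new FOO jobs
        baz_cores += to_baz             # BAZ never gives its cores back here
        history.append(baz_cores / cores)
    return history


shares = baz_share_over_time()
for cycle, share in enumerate(shares, start=1):
    print(f"cycle {cycle:2d}: BAZ holds {share:.0%} of the node")
```

Under these assumptions the BAZ share only ever grows, because FOO has to win its cores again every cycle while BAZ wins each core once.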

In our case, FOO blew through the 1-core slots at 100 Hz, but once BAZ acquired one, it kept it for hours. So even though FOO *got* considerably more resources, it did not *keep* them.
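
The gap between *getting* and *keeping* follows from Little's law: the average number of slots a group holds is its job start rate times its mean job runtime. A minimal sketch, in which only the 100 Hz comes from our case and the other numbers (1 s FOO runtime, one BAZ start every 10 s, 4 h BAZ runtime) are assumptions picked purely for illustration:

```python
def mean_slots_held(start_rate_hz: float, mean_runtime_s: float) -> float:
    """Little's law: average occupancy = arrival rate * mean time in system."""
    return start_rate_hz * mean_runtime_s


# FOO: 100 job starts per second (from the text), assumed 1 s runtime each.
foo_slots = mean_slots_held(start_rate_hz=100.0, mean_runtime_s=1.0)

# BAZ: assumed one job start every 10 s, assumed 4 h runtime each.
baz_slots = mean_slots_held(start_rate_hz=0.1, mean_runtime_s=4 * 3600)

print(f"FOO holds about {foo_slots:.0f} slots on average")
print(f"BAZ holds about {baz_slots:.0f} slots on average")
```

With these made-up numbers FOO starts a thousand times more jobs than BAZ, yet BAZ ends up holding the larger number of slots.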

We ended up adding a resource limit for the different types of jobs. Basically, we limit the number of 1-core jobs in the cluster (regardless of user) with a pseudo-resource. That works a lot better than draining entire nodes. It does not fix the underlying problem, but it protects BAR and does not penalise BAZ.
If FOO really is sending short-lived jobs, it is probably due to empty pilots. Just ask them to switch to ACT.

Cheers,
Max

> On 23.11.2017 at 10:54, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> 
> Hi all,
> 
> we currently have an issue with our dynamic group quotas (for grid jobs
> via ARC CEs), which I do not understand.
> Of our three main users, user BAZ suddenly got/occupied about half of
> our slots while having a significantly smaller nominal share than users
> FOO and BAR.
> To curb BAZ's enthusiasm, we changed, for testing, the dynamic quotas on
> the negotiator to
>  GROUP_QUOTA_DYNAMIC_group_FOO = 0.933
>  GROUP_QUOTA_DYNAMIC_group_BAR = 0.938
>  GROUP_QUOTA_DYNAMIC_group_BAZ = 0.01
>  ...
>  GROUP_QUOTA_DYNAMIC_group_OTHER = 0.01
> 
> without much of an effect.
> 
> One difference between the users is that BAZ runs only single-slot
> jobs, FOO roughly 1:1 single- and 8-slot jobs (so slot rate 1:8), and BAR
> only 8-slot jobs. So one idea might be that the single- vs. multi-core
> use pattern somehow distorts the share distribution (but then it worked
> fine before for the ratio between FOO and BAR).
> 
> As far as I can see, the ifThenElse nesting of the X509 DNs into the group
> names should be OK [1] (at least the brackets match). Also, BAZ's DN should
> match and, if not, should be covered by the OTHER group as well.
> 
> So I wonder why BAZ still gets ~50% of the slots. Maybe somebody has
> an idea for me?
> 
> Cheers and thanks,
>  Thomas
> 
> 
> [1]
> DESYAcctGroup = ifThenElse(x509UserProxyVOName =?= "FOO","group_FOO", \
>                ifThenElse(x509UserProxyVOName =?= "BAR","group_BAR", \
>                ifThenElse(x509UserProxyVOName =?= "BAZ","group_BAZ", \
>                ifThenElse(x509UserProxyVOName =?= "MINOR1","group_MINOR1", \
> ..., \
> "group_OTHER" ))))))))))))
> 
> [2]
> DESYAcctSubGroup = ifThenElse(regexp("desyplt",Owner), "desyplt", \
>                   ifThenElse(regexp("desyprd",Owner), "desyprd", \
>                   ifThenElse(regexp("desysgm",Owner), "desysgm", \
>                   ifThenElse(regexp("desyusr",Owner), "desyusr", \
>                   ifThenElse(regexp("FOO",Owner) && RequestCpus > 1, "FOO_multicore", \
>                   ifThenElse(regexp("BAR",Owner) && RequestCpus > 1, "BAR_multicore", \
>                   ifThenElse(regexp("BAZ",Owner) && RequestCpus > 1, "BAZ_multicore", \
>                                                       "other" )))))))
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/


Karlsruhe Institute of Technology (KIT)
SCC/SDM

Dr. Max Fischer
ALICE Collaboration Representative at GridKa

Hermann-von-Helmholtz-Platz 1
Gebäude 449
76344 Eggenstein-Leopoldshafen, Germany

Phone: +49 721 608-28328
E-mail: max.fischer@xxxxxxx
Web: www.scc.kit.edu

Registered office:
Kaiserstraße 12, 76131 Karlsruhe, Germany

KIT - The Research University in the Helmholtz Association
