
Re: [HTCondor-users] [HTCcondor-users] Possibly a bug in subgroup working without surplus and autoregroup



Hi Vikrant,

I think Todd's right about the partitionable slots being the cause, but I wanted to add a workaround in case you're stuck on v8.6. Enabling CONSUMPTION_POLICY on any Startds/execution hosts with partitionable slots will change jobs' match costs to reflect the number of cores (with the default weight) they would actually use, which will allow jobs to match in the situations above.
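
To make the effect concrete, here is a toy Python model of the comparison described above. It is only a sketch under assumptions, not the negotiator's actual code: the function names, the 3-core slot size, and the limit of 1 are all made up for illustration, and the real accounting has more inputs than this.

```python
# Toy model only: hypothetical names and numbers, not HTCondor source code.

def match_cost(slot_cpus, request_cpus, consumption_policy):
    """Cost charged for matching a job to a partitionable slot (default CPU weight).

    Assumption: without a consumption policy the whole slot's weight is charged;
    with one, only the cores the job would consume are charged.
    """
    return request_cpus if consumption_policy else slot_cpus


def can_match(limit, slot_cpus, request_cpus, consumption_policy):
    """True if the match cost fits within the remaining limit."""
    return match_cost(slot_cpus, request_cpus, consumption_policy) <= limit


# A 1-core job, a 3-core partitionable slot, and a limit of 1 weighted slot.
print(can_match(1, 3, 1, consumption_policy=False))  # False
print(can_match(1, 3, 1, consumption_policy=True))   # True
```

In this model the only thing that changes between the two calls is how the cost is computed; the limit and the job stay the same.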

There's also another, less likely, scenario that we ran into which causes the same issue. If any of your jobs' Request<Resource> attributes for resources used in SLOT_WEIGHT (just Cpus by default) are expressions that reference the target Startd's attributes, then you need to define SCHEDD_SLOT_WEIGHT on your Schedd hosts as an expression that references only job attributes and evaluates to the maximum possible match cost/weight of the job. Otherwise the Schedd will default the job's weight to 1, and the Negotiator won't allocate enough resources to the job's accounting group for it to match. For example:

JobAd:
MinCpus = 16
MaxCpus = 32
RequestCpus = (Cpus > MaxCpus) ? MaxCpus : ((Cpus < MinCpus) ? MinCpus : Cpus)

Schedd config file macro:
SCHEDD_SLOT_WEIGHT = MaxCpus ?: RequestCpus
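
As a quick sanity check on why MaxCpus is the right upper bound here, the sketch below mirrors the two expressions in plain Python. The function names and the sample slot sizes are hypothetical, and a dict stands in for the job ad; it only illustrates the arithmetic, not how the Schedd evaluates ClassAds.

```python
# Sketch: mirrors the job ad expressions above in plain Python (hypothetical helpers).
MIN_CPUS = 16
MAX_CPUS = 32


def request_cpus(slot_cpus, min_cpus=MIN_CPUS, max_cpus=MAX_CPUS):
    """(Cpus > MaxCpus) ? MaxCpus : ((Cpus < MinCpus) ? MinCpus : Cpus)"""
    if slot_cpus > max_cpus:
        return max_cpus
    if slot_cpus < min_cpus:
        return min_cpus
    return slot_cpus


def schedd_slot_weight(job):
    """MaxCpus ?: RequestCpus, using only job attributes.

    Uses MaxCpus when the job defines it, otherwise falls back to RequestCpus
    (assumed here to be a plain number for jobs without MaxCpus).
    """
    max_cpus = job.get("MaxCpus")
    return max_cpus if max_cpus is not None else job["RequestCpus"]


# Evaluate the request against a few hypothetical slot sizes.
costs = {cpus: request_cpus(cpus) for cpus in (4, 16, 24, 32, 64)}
print(costs)  # {4: 16, 16: 16, 24: 24, 32: 32, 64: 32}

# No slot size yields a request above MaxCpus, so MaxCpus is a safe maximum weight.
assert max(request_cpus(c) for c in range(1, 129)) == MAX_CPUS
print(schedd_slot_weight({"MinCpus": 16, "MaxCpus": 32}))  # 32
print(schedd_slot_weight({"RequestCpus": 4}))              # 4
```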

Best,
Collin

On Thu, Aug 1, 2019 at 10:32 AM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

Hi Vikrant,

Are you able to reproduce the issue below using a current release of HTCondor?
Also, are you using static or partitionable slots?

Several patches have gone into HTCondor since v8.6.x (which is no
longer officially supported [1]) that look like they might mitigate the
issue below, especially if you are using partitionable slots. For instance:

  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6750
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6714

If I get some time next week and you do not get to it first, I will try
your config below on v8.8 and see if I can reproduce the problem.

Hope the above helps,
Todd

[1] From https://tinyurl.com/y34m5ymx : "After beginning a new stable
series, the HTCondor Project will continue to support the previous
stable series for six months."  Since HTCondor v8.8.x first appeared in
Jan 2019, folks should plan to upgrade v8.6 to v8.8 sooner rather than
later....

On 8/1/2019 3:27 AM, Vikrant Aggarwal wrote:
> Hello Experts,
>
> I am exploring the usage of accounting groups and sub accounting groups.
> I saw weird behavior while using subgroups: if I don't specify
> GROUP_ACCEPT_SURPLUS or GROUP_AUTOREGROUP, then jobs submitted with a
> subgroup never run. If I submit the job with parent group "cdp" it runs
> without any issue. Is this expected behavior? I tried setting
> GROUP_ACCEPT_SURPLUS to false but no luck. If this is expected behavior, does
> this mean we can't use subgroups without over-commitment?
>
> I added this in my configuration file:
>
> GROUP_NAMES = cdp, cdp.cdp1, cdp.cdp2, cdp.cdp3
> GROUP_QUOTA_DYNAMIC_cdp = .5
> GROUP_QUOTA_DYNAMIC_cdp.cdp1 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp2 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp3 = .3
>
> After a reconfig, I submitted a job with the following line in the submit file.
>
> Accounting_group = cdp.cdp2
>
> The submitted jobs never ran. The negotiator was not able to do the matchmaking.
>
> 08/01/19 04:14:29 ---------- Started Negotiation Cycle ----------
> 08/01/19 04:14:29 Phase 1:  Obtaining ads from collector ...
> 08/01/19 04:14:29 Not considering preemption, therefore constraining
> idle machines with ifThenElse(State == "Claimed","Name State Activity
> StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight
> ConcurrencyLimits","")
> 08/01/19 04:14:29   Getting startd private ads ...
> 08/01/19 04:14:29   Getting Scheduler, Submitter and Machine ads ...
> 08/01/19 04:14:29   Sorting 12 ads ...
> 08/01/19 04:14:29 Got ads: 12 public and 6 private
> 08/01/19 04:14:29 Public ads include 1 submitter, 6 startd
> 08/01/19 04:14:29 Phase 2:  Performing accounting ...
> 08/01/19 04:14:29 group quotas: assigning 1 submitters to accounting groups
> 08/01/19 04:14:29 group quotas: assigning group quotas from 18 available
> weighted slots
> 08/01/19 04:14:29 group quotas: allocation round 1
> 08/01/19 04:14:29 group quotas: groups= 5  requesting= 1  served= 1
>   unserved= 0  slots= 18  requested= 1  allocated= 1  surplus= 25.1
>   maxdelta= 9
> 08/01/19 04:14:29 group quotas: entering RR iteration n= 9
> 08/01/19 04:14:29 Group cdp - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp1 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp2 - BEGIN NEGOTIATION
> 08/01/19 04:14:29 Phase 3:  Sorting submitter ads by priority ...
> 08/01/19 04:14:29 Phase 4.1:  Negotiating with schedds ...
> 08/01/19 04:14:29   Negotiating with cdp.cdp2.vaggarwal@xxxxxxxx
> <mailto:cdp.cdp2.vaggarwal@xxxxxxxx> at
> <xx.xx.xx.57:9618?addrs=xx.xx.xx.57-9618&noUDP&sock=9516_13b9_3>
> 08/01/19 04:14:29 0 seconds so far for this submitter
> 08/01/19 04:14:29 0 seconds so far for this schedd
> 08/01/19 04:14:29     Got NO_MORE_JOBS;  schedd has no more requests
> 08/01/19 04:14:29     Request 00149.00000: autocluster 34 (request count
> 1 of 1)
> 08/01/19 04:14:29       Rejected 149.0 cdp.cdp2.vaggarwal@xxxxxxxx
> <mailto:cdp.cdp2.vaggarwal@xxxxxxxx>
> <xx.xx.xx.57:9618?addrs=xx.xx.xx.57-9618&noUDP&sock=9516_13b9_3>:
> submitter limit exceeded
> 08/01/19 04:14:29     Got NO_MORE_JOBS;  schedd has no more requests
> 08/01/19 04:14:29  negotiateWithGroup resources used scheddAds length 1
> 08/01/19 04:14:29 Group cdp.cdp3 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group <none> - skipping, zero slots allocated
> 08/01/19 04:14:29 Round 1 totals: allocated= 1  usage= 0
> 08/01/19 04:14:29 group quotas: allocation round 2
> 08/01/19 04:14:29 group quotas: groups= 5  requesting= 0  served= 0
>   unserved= 0  slots= 18  requested= 0  allocated= 0  surplus= 26.1
>   maxdelta= 9
> 08/01/19 04:14:29 group quotas: entering RR iteration n= 9
> 08/01/19 04:14:29 Group cdp - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp1 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp2 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp3 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group <none> - skipping, zero slots allocated
> 08/01/19 04:14:29 Round 2 totals: allocated= 0  usage= 0
> 08/01/19 04:14:29 ---------- Finished Negotiation Cycle ----------
>
>
> Working conf:
>
> GROUP_NAMES = cdp, cdp.cdp1, cdp.cdp2, cdp.cdp3
> GROUP_QUOTA_DYNAMIC_cdp = .5
> GROUP_QUOTA_DYNAMIC_cdp.cdp1 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp2 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp3 = .3
> GROUP_ACCEPT_SURPLUS_cdp.cdp1 = true
> GROUP_ACCEPT_SURPLUS_cdp.cdp2 = true
> GROUP_ACCEPT_SURPLUS_cdp.cdp3 = true
>
>
> # condor_version
> $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
> $CondorPlatform: x86_64_RedHat6 $
>
> Regards,
> Vikrant
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685



--
Collin Mehring | PE-JoSE - Software Engineer