
Re: [HTCondor-users] Possibly a bug in subgroup working without surplus and autoregroup



Hi Vikrant,

Are you able to reproduce the behavior below using a current release of 
HTCondor?  Also, are you using static or partitionable slots?

Several patches have gone into HTCondor since the v8.6.x series (which 
is no longer officially supported [1]) that look like they might 
mitigate your issue below, especially if you are using partitionable 
slots.  For instance:

   https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6750
   https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6714

If I get some time next week and you do not get to it first, I will try 
your config below on v8.8 and see if I can reproduce the problem.

Hope the above helps,
Todd

[1] From https://tinyurl.com/y34m5ymx : "After beginning a new stable 
series, the HTCondor Project will continue to support the previous 
stable series for six months."  Since HTCondor v8.8.x first appeared in 
Jan 2019, folks should plan to upgrade from v8.6 to v8.8 sooner rather 
than later....
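
For what it's worth, the starvation you describe is consistent with how 
dynamic quotas compose: as I understand it, a subgroup's 
GROUP_QUOTA_DYNAMIC is a fraction of its *parent's* quota, and the part 
of the pool not covered by any group's quota is surplus that subgroups 
can only claim when GROUP_ACCEPT_SURPLUS (or GROUP_AUTOREGROUP) is in 
effect.  Here is a rough sketch of the arithmetic -- a simplification of 
the real negotiator logic (which also handles slot weights, rounding, 
and surplus redistribution), using the 18 weighted slots from your log; 
the helper function is illustrative, not actual HTCondor code:

```python
# Sketch: how GROUP_QUOTA_DYNAMIC fractions compose down the group tree.
# This is a simplification of the negotiator's quota logic, for intuition only.

def effective_quota(total_slots, fractions, group):
    """Walk down the dotted group name, multiplying each level's
    GROUP_QUOTA_DYNAMIC fraction to get the group's effective slot quota."""
    quota = float(total_slots)
    prefix = ""
    for part in group.split("."):
        prefix = part if prefix == "" else prefix + "." + part
        quota *= fractions[prefix]
    return quota

# Fractions from the config in question; 18 weighted slots as in the log.
fractions = {"cdp": 0.5, "cdp.cdp1": 0.3, "cdp.cdp2": 0.3, "cdp.cdp3": 0.3}
for g in ("cdp", "cdp.cdp1", "cdp.cdp2", "cdp.cdp3"):
    print(g, effective_quota(18, fractions, g))  # cdp gets 9, each child ~2.7
```

So on an 18-slot pool, cdp.cdp2's quota comes out to roughly 0.3 * (0.5 
* 18) = 2.7 slots; the leftover ~0.9 slots of cdp's quota, plus the half 
of the pool outside cdp entirely, are surplus that the subgroups cannot 
touch unless they accept surplus or autoregroup is enabled.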

On 8/1/2019 3:27 AM, Vikrant Aggarwal wrote:
> Hello Experts,
> 
> I am exploring the use of accounting groups and sub-accounting groups. 
> I saw weird behavior while using subgroups: if I don't specify 
> GROUP_ACCEPT_SURPLUS or GROUP_AUTOREGROUP, then jobs submitted with a 
> subgroup never run. If I submit the job with the parent group "cdp", it 
> runs without any issue. Is that expected behavior? I tried to use a 
> false value of GROUP_ACCEPT_SURPLUS, but no luck. If this is expected 
> behavior, does it mean we can't use subgroups without over-commitment?
> 
> I added this to my configuration file:
> 
> GROUP_NAMES = cdp, cdp.cdp1, cdp.cdp2, cdp.cdp3
> GROUP_QUOTA_DYNAMIC_cdp = .5
> GROUP_QUOTA_DYNAMIC_cdp.cdp1 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp2 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp3 = .3
> 
> After a reconfig, I submitted a job with the following line in the 
> submit file:
> 
> Accounting_group = cdp.cdp2
> 
> The submitted jobs never ran; the negotiator was not able to do the 
> matchmaking.
> 
> 08/01/19 04:14:29 ---------- Started Negotiation Cycle ----------
> 08/01/19 04:14:29 Phase 1:  Obtaining ads from collector ...
> 08/01/19 04:14:29 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
> 08/01/19 04:14:29   Getting startd private ads ...
> 08/01/19 04:14:29   Getting Scheduler, Submitter and Machine ads ...
> 08/01/19 04:14:29   Sorting 12 ads ...
> 08/01/19 04:14:29 Got ads: 12 public and 6 private
> 08/01/19 04:14:29 Public ads include 1 submitter, 6 startd
> 08/01/19 04:14:29 Phase 2:  Performing accounting ...
> 08/01/19 04:14:29 group quotas: assigning 1 submitters to accounting groups
> 08/01/19 04:14:29 group quotas: assigning group quotas from 18 available weighted slots
> 08/01/19 04:14:29 group quotas: allocation round 1
> 08/01/19 04:14:29 group quotas: groups= 5  requesting= 1  served= 1  unserved= 0  slots= 18  requested= 1  allocated= 1  surplus= 25.1  maxdelta= 9
> 08/01/19 04:14:29 group quotas: entering RR iteration n= 9
> 08/01/19 04:14:29 Group cdp - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp1 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp2 - BEGIN NEGOTIATION
> 08/01/19 04:14:29 Phase 3:  Sorting submitter ads by priority ...
> 08/01/19 04:14:29 Phase 4.1:  Negotiating with schedds ...
> 08/01/19 04:14:29   Negotiating with cdp.cdp2.vaggarwal@xxxxxxxx at <xx.xx.xx.57:9618?addrs=xx.xx.xx.57-9618&noUDP&sock=9516_13b9_3>
> 08/01/19 04:14:29 0 seconds so far for this submitter
> 08/01/19 04:14:29 0 seconds so far for this schedd
> 08/01/19 04:14:29     Request 00149.00000: autocluster 34 (request count 1 of 1)
> 08/01/19 04:14:29       Rejected 149.0 cdp.cdp2.vaggarwal@xxxxxxxx <xx.xx.xx.57:9618?addrs=xx.xx.xx.57-9618&noUDP&sock=9516_13b9_3>: submitter limit exceeded
> 08/01/19 04:14:29     Got NO_MORE_JOBS;  schedd has no more requests
> 08/01/19 04:14:29  negotiateWithGroup resources used scheddAds length 1
> 08/01/19 04:14:29 Group cdp.cdp3 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group <none> - skipping, zero slots allocated
> 08/01/19 04:14:29 Round 1 totals: allocated= 1  usage= 0
> 08/01/19 04:14:29 group quotas: allocation round 2
> 08/01/19 04:14:29 group quotas: groups= 5  requesting= 0  served= 0  unserved= 0  slots= 18  requested= 0  allocated= 0  surplus= 26.1  maxdelta= 9
> 08/01/19 04:14:29 group quotas: entering RR iteration n= 9
> 08/01/19 04:14:29 Group cdp - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp1 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp2 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group cdp.cdp3 - skipping, zero slots allocated
> 08/01/19 04:14:29 Group <none> - skipping, zero slots allocated
> 08/01/19 04:14:29 Round 2 totals: allocated= 0  usage= 0
> 08/01/19 04:14:29 ---------- Finished Negotiation Cycle ----------
> 
> 
> Working config:
> 
> GROUP_NAMES = cdp, cdp.cdp1, cdp.cdp2, cdp.cdp3
> GROUP_QUOTA_DYNAMIC_cdp = .5
> GROUP_QUOTA_DYNAMIC_cdp.cdp1 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp2 = .3
> GROUP_QUOTA_DYNAMIC_cdp.cdp3 = .3
> GROUP_ACCEPT_SURPLUS_cdp.cdp1 = true
> GROUP_ACCEPT_SURPLUS_cdp.cdp2 = true
> GROUP_ACCEPT_SURPLUS_cdp.cdp3 = true
> 
> 
> # condor_version
> $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
> $CondorPlatform: x86_64_RedHat6 $
> 
> Regards,
> Vikrant
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685