
Re: [HTCondor-users] Effective Priority lower but job stays idle



Hi Sophie,

See a bunch of comments inline below....


On 7/21/2021 6:12 AM, FERRY Sophie wrote:

Hello,

 

I have these (among others):

 

[1]

$ condor_userprio -priority

Last Priority Update:  7/21 12:39

                                               Effective     Real   Priority

User Name                                       Priority   Priority  Factor

--------------------------------------------- ------------ -------- ---------

group_atlas.mcore.atlpilot001@xxxxxxxxxxxxxxx   1065005.62    10.65 100000.00

group_cms.mcore.cmspilot003@xxxxxxxxxxxxxxx     3259168.25    32.59 100000.00

group_alice.sgmalice@xxxxxxxxxxxxxxx          657536192.00  6575.36 100000.00

--------------------------------------------- ------------ -------- ---------

 

[2]

$ condor_userprio -usage

Group                                   Res   Total Usage       Usage             Last

  User Name                            In Use (wghted-hrs)    Start Time       Usage Time

-------------------------------------- ------ ------------ ---------------- ----------------

group_cms                                  40   5201044.50  4/17/2020 10:55  7/21/2021 12:46

  mcore.cmspilot003@xxxxxxxxxxxxxxx        40   4117231.75  9/17/2020 14:12  7/21/2021 12:46

group_atlas                              1685  20845206.00  4/17/2020 10:55  7/21/2021 12:46

  sgmatlas@xxxxxxxxxxxxxxx                  1       712.06  4/17/2020 10:55  7/21/2021 12:39

  mcore.atlpilot001@xxxxxxxxxxxxxxx         8    406800.62  9/17/2020 19:30  7/21/2021 12:46

  atlpilot001@xxxxxxxxxxxxxxx             260   2988551.00  6/30/2020 09:56  7/21/2021 12:46

  mcore.prdatl008@xxxxxxxxxxxxxxx         701   3553249.75  9/17/2020 14:50  7/21/2021 12:46

  prdatl008@xxxxxxxxxxxxxxx               716  11126445.00  6/30/2020 09:35  7/21/2021 12:46

group_alice                              6762  45051064.00  4/17/2020 10:55  7/21/2021 12:46

Number of users: 11                      8492  67240840.00                   7/20/2021 12:46

 

[3]

$ condor_userprio -quotas

Group                                  Effective  Config     Use    Subtree  Requested

Name                                     Quota     Quota   Surplus   Quota   Resources

-------------------------------------- --------- --------- ------- --------- ----------

group_alice                              1552.07      0.18 Regroup   1552.07       6796

group_atlas                              3657.77      0.42 Regroup   3657.77       3266

group_cms                                2220.79      0.32 Regroup   2741.38       1640

 

from [2] I get:

alice=>  45051064/67240840=0.67

cms=>   5201044/67240840=0.08

atlas=>20845206/67240840=0.31


Your calculation above uses Total Usage in the numerator, which I believe is accumulated since April 2020 (the Usage Start Time). If at any point in the roughly 15 months since then any of those groups had demand below its allocation, or the quota allocations changed, that would skew your numbers. Perhaps you want to use the "Weighted In Use" values in the numerator instead? Also be aware that you can reset the Total Usage back to zero (and reset the Usage Start Time) via condor_userprio -resetall.
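
To make the difference concrete, here is a small sketch that computes each group's share both ways from the figures quoted in [2] and compares them with the Config Quota column in [3]. It assumes the "Res In Use" column is the current-usage figure meant above, and it divides by the pool-wide totals from the "Number of users" line; the function and variable names are only illustrative.

```python
# Figures copied from the condor_userprio output quoted in this thread.
res_in_use = {"group_alice": 6762, "group_atlas": 1685, "group_cms": 40}
total_usage = {
    "group_alice": 45051064.00,
    "group_atlas": 20845206.00,
    "group_cms": 5201044.50,
}
config_quota = {"group_alice": 0.18, "group_atlas": 0.42, "group_cms": 0.32}

# Pool-wide totals from the "Number of users: 11" line in [2].
POOL_RES_IN_USE = 8492
POOL_TOTAL_USAGE = 67240840.00


def shares(values, pool_total):
    """Return each group's fraction of the given pool-wide total."""
    return {group: value / pool_total for group, value in values.items()}


# Assumption: "Res In Use" reflects what each group holds right now.
current = shares(res_in_use, POOL_RES_IN_USE)
# Historical share, accumulated since the Usage Start Time.
historical = shares(total_usage, POOL_TOTAL_USAGE)

for group in sorted(config_quota):
    print(
        f"{group:12s} quota={config_quota[group]:.2f} "
        f"current={current[group]:.2f} historical={historical[group]:.2f}"
    )
```

With these numbers alice holds roughly 80% of the resources currently in use and cms well under 1%, so the imbalance right now is larger than the historical totals suggest.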

 

 

My problem is that from [1] mcore.atlas is first served, then cms, then alice.

BUT

1)      alice uses only single core, and is always getting slots to run, even though its usage is far over its quota (0.67 instead of 0.18)

2)      mcore atlas always gets in, and mcore.cms NEVER (I had to reserve one worker node so that they could have jobs running)

analyse says there are slots suitable but busy (and I've seen some lines in NegotiatorLog saying it is over quota, which is not the case, but I can't find those lines anymore)

 

Anyone might know

1)      how to use defrag to leave space for 8cores ?

2)      how come cms never enters … ?

 


Some thoughts on your two questions above:

Re how to defrag to leave space for 8 cores:

Please read this section in the Manual for the majority of what you want to know, including how to set up and activate the condor_defrag daemon:

  https://htcondor.readthedocs.io/en/latest/admin-manual/policy-configuration.html?#defragmenting-dynamic-slots

Also, it sounds like your job mix never contains jobs that need more than 8 cores. If so, there is no need to lose utilization waiting for a large server to drain fully, since you only need room for 8-core jobs. I therefore suggest adding this config knob, which tells the condor_defrag daemon to stop draining a server as soon as 8 cores are available:

    DEFRAG_CANCEL_REQUIREMENTS = Cpus >= 8
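
As a rough illustration of what that expression selects, here is a toy model in Python. It is not how condor_defrag is implemented; the Machine class, the host names, and the free-core counts are all made up, and the function simply mirrors the condition Cpus >= 8 for machines that are currently draining.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str        # hypothetical host name
    free_cpus: int   # unclaimed Cpus on the machine's partitionable slot
    draining: bool   # whether the machine is currently being drained


def should_cancel_drain(machine: Machine, needed_cpus: int = 8) -> bool:
    """Toy stand-in for the expression `Cpus >= 8` applied to a draining machine."""
    return machine.draining and machine.free_cpus >= needed_cpus


machines = [
    Machine("wn01", free_cpus=3, draining=True),    # keep draining
    Machine("wn02", free_cpus=8, draining=True),    # enough room: stop draining
    Machine("wn03", free_cpus=12, draining=False),  # not draining, nothing to cancel
]

cancel = [m.name for m in machines if should_cancel_drain(m)]
print(cancel)
```

In this sketch only the draining machine that already has 8 free cores is selected, which is the point of the knob: the rest of that server keeps running jobs instead of emptying completely.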


Since you have a mix of groups where some groups have single core jobs, I also suggest for your pool the configuration outlined in this HOWTO recipe:

   https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMatchMulticoreAfterDrain

The recipe in the above link tells HTCondor to allow only multicore jobs to run on servers that have recently been defragged; single core jobs are allowed back only if no multicore job matches the slot after a few minutes. This way a single core job, perhaps from a group that is behind quota, will not "sneak" onto a freshly created 8core slot. The idea is that since you took the utilization hit of draining to make room for an 8core job, you should make certain you run an 8core job!
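
The rule described above can be sketched as a toy decision function. This is only a model of the idea, not the configuration from the recipe (see the link for the real START expression); the function name and the 600-second window are assumptions chosen for illustration, since "a few minutes" is not specified here.

```python
def slot_accepts(request_cpus: int, seconds_since_drain: float,
                 multicore_only_window: float = 600.0) -> bool:
    """Toy rule: right after a drain, only multicore jobs may start.

    multicore_only_window is a hypothetical value standing in for
    "a few minutes"; once it has elapsed, any job may start again.
    """
    if seconds_since_drain < multicore_only_window:
        return request_cpus > 1
    return True


# A single core job two minutes after the drain is turned away...
print(slot_accepts(request_cpus=1, seconds_since_drain=120))
# ...an 8 core job at the same moment is accepted...
print(slot_accepts(request_cpus=8, seconds_since_drain=120))
# ...and single core jobs are let back in once the window has passed.
print(slot_accepts(request_cpus=1, seconds_since_drain=900))
```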

Re how come CMS never gets allocated jobs:

This is hard to say from the information provided in your email. Hopefully things will improve once you have a healthy ongoing supply of 8core slots. Another thought: if your supply of jobs from these groups is pretty constant, setting the config knob GROUP_AUTOREGROUP back to its default of False might make it easier to check that the groups stay close to your configured quotas (without the additional complexity of backfilling).

Hope the above helps,
Todd
 


 

Thanks for any help

SF.

 

 

[ANALYSE]

]#  condor_q -better-analyse 5914078.0

 

 

-- Schedd: node16.datagrid.cea.fr : <192.54.206.43:28348>

The Requirements expression for job 5914078.000 is

 

    ((NumJobStarts == 0) && ((IfThenElse(RequestCpus isnt undefined,(RequestCpus == 8 || RequestCpus == 1),true)))) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&

    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)

 

Job 5914078.000 defines the following attributes:

 

    DiskUsage = 150

    NumJobStarts = 0

    RequestCpus = 8

    RequestDisk = DiskUsage

    RequestMemory = 24000

 

The Requirements expression for job 5914078.000 reduces to these conditions:

 

         Slots

Step    Matched  Condition

-----  --------  ---------

[3]        8030  TARGET.Arch == "X86_64"

[5]        8030  TARGET.OpSys == "LINUX"

[7]        8030  TARGET.Disk >= RequestDisk

[9]         289  TARGET.Memory >= RequestMemory

[11]         93  TARGET.Cpus >= RequestCpus

 

No successful match recorded.

Last failed match: Wed Jul 21 13:04:22 2021

 

Reason for last match failure: no match found

 

5914078.000:  Run analysis summary ignoring user priority.  Of 212 machines,

      1 are rejected by your job's requirements

     41 reject your job because of their own requirements

      0 match and are already running your jobs

      0 match but are serving other users

    170 are able to run your job

 

 

 

 

---------------------

        Sophie Ferry |

          CEA Saclay |

91190 Gif-sur-Yvette |

  DRF/IRFU/DEDIP/LIS |

           GRIF-IRFU |

       Bat 141 p023B |

+33(0)1 69 08 76 45 |

---------------------

 




-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685