Re: [HTCondor-users] Troubleshooting job eviction by machine RANK
- Date: Thu, 04 Feb 2016 15:46:13 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Troubleshooting job eviction by machine RANK
On 2/3/2016 6:28 PM, Graham Allan wrote:
Now I have seen allusions to issues between partitionable slots and
preemption, but not exactly what they are; I had the impression it was
something to do with evicted jobs leaving the slots fragmented, rather
than preemption simply not happening.
Perhaps it is something as simple as this: you put your startd Rank
expression into your condor_config file(s) but never ran condor_reconfig?
The startd Rank expression should appear in the slot ClassAds; you can
check by running something like

    condor_status -af:r name rank
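If the expression was added to the config files but never picked up, a
reconfig (rather than a full restart) is usually enough. A sketch,
assuming the config change is already on disk on each execute node:

    condor_reconfig              # re-read condor_config on this host
    condor_reconfig <hostname>   # or target a particular startd host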
As for the issue with startd rank and partitionable slots: your
impression above is correct; the only issue is fragmentation. Your
partitionable slot ("slot1@machine") always represents the unclaimed
resources. When a job is matched to that machine, a "dynamic" slot
(slot1_1@machine, slot1_2@machine, etc.) is created that typically
contains just enough CPU and memory to handle the matched job (although
there are config knobs available to the admin to round up the
resources). These dynamic slots will honor your startd Rank expression.
So, for instance, if

    Rank = CondorGroup =?= "nova"
then a nova job will preempt a non-nova job running on a dynamic slot,
but ONLY IF the nova job "fits" in the dynamic slot. So imagine you
have an infinite supply of non-nova jobs, all with request_cpus=1, and
then you submit a nova job with request_cpus=4. The nova job could
starve forever: even though it will take over any dynamic slot running
a non-nova job, there may never be a dynamic slot in the pool with 4
CPUs allocated, so the nova job will not match any slots.
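For concreteness, a minimal submit-file sketch of such a starving job.
CondorGroup here is a custom job attribute that the Rank expression
above is assumed to test; the +Attr syntax inserts it into the job
ClassAd:

    # hypothetical 4-core "nova" job; CondorGroup is a custom attribute
    universe       = vanilla
    executable     = /bin/sleep
    arguments      = 600
    request_cpus   = 4
    +CondorGroup   = "nova"
    queue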
There are two solutions to this problem. The first is to use knobs like
MODIFY_REQUEST_EXPR_REQUESTCPUS (and the analogous REQUESTDISK and
REQUESTMEMORY knobs) in your condor_config to always round up the
number of allocated CPUs and amount of memory to something usable by
nova jobs, i.e. don't allow non-nova jobs to fragment your machines
into slots so small that nova jobs won't match.
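As a sketch of this rounding approach (the quantize values here are
assumptions; pick sizes that match the shape of your nova jobs):

    # round CPU requests up to multiples of 4, memory to multiples of 8 GB
    MODIFY_REQUEST_EXPR_REQUESTCPUS   = quantize(RequestCpus, {4})
    MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {8192})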
A second solution exists if you have HTCondor v8.4 running on your
schedd, startds, and central manager. With v8.4, you can set

    ALLOW_PSLOT_PREEMPTION = True
    PREEMPTION_REQUIREMENTS = True

on your central manager. This avoids the fragmentation issue above as
it relates to startd rank, because HTCondor will now preempt multiple
dynamic slots as needed in order to fit the higher-ranked job. E.g.,
HTCondor will preempt four 1-core non-nova jobs on a machine in order
to fit your 4-core nova job, as preferred by your startd Rank.
Hope the above helps,