
Re: [HTCondor-users] CPU Affinity in condor v8.9.1



Greg,

Thanks for the response. Here is the issue with "NUM_CPUS".

I have attached the partitionable slot configuration that Tom put
together for the cluster. I haven't touched it since he moved on. You
can see that at the top he put:

num_cpus = 2 * $(DETECTED_CPUS)

I have no clue as to why this was done, but I suspect it has to do with
the partitionable slot configuration in the rest of this file, which
appears to split each node into two partitionable slots: one seems to be
dedicated to the 'online_cbc_gstlal_inspiral' analysis and the other is
for all other jobs.
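
A rough sketch of the arithmetic I think is going on, written as a small
Python helper rather than condor config. The function name and the
integer rounding are my own assumptions, not anything taken from
HTCondor; the 24 and 20 core figures come from my original question
quoted below.

```python
def pslot_cpus(detected_cpus, multiplier=2, percent=50):
    """CPUs one partitionable slot would see if num_cpus is set to
    multiplier * detected_cpus and the slot type claims `percent` of it.
    Hypothetical helper; integer rounding here is an assumption."""
    advertised = multiplier * detected_cpus
    return advertised * percent // 100


# Attached config on a 24-core node: each of the two pslots sees 24 cores.
print(pslot_cpus(24))

# Hypothetical variant advertising 2 * 20: each pslot would see 20 cores.
print(pslot_cpus(20))
```

If this reading is right, the doubling lets each of the two 50% pslots
advertise the whole machine, so simply setting NUM_CPUS = 20 would leave
each pslot with only 10 cores.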

Thus I don't know if I should be changing this setting, which is one
reason I looked into cgroups and the other CPU affinity settings.

Tom also set the RAM in this file, which is why I am investigating
cgroups for limiting condor's memory as well as its CPU use.
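
For reference, here is how I read the memory line in the attached file,
again as a hedged Python sketch. The helper names and the example
figures (128500 MB detected, 16384 MB reserved) are made up for
illustration; only the formula itself comes from the config.

```python
import math


def advertised_memory_mb(detected_mb, reserved_mb=0, multiplier=2):
    """Mirror of: memory = 2 * (floor(DETECTED_MEMORY/1024) * 1024 - RESERVED_MEMORY).
    Rounds detected memory down to a whole number of 1024 MB units first."""
    rounded = math.floor(detected_mb / 1024) * 1024
    return multiplier * (rounded - reserved_mb)


def pslot_memory_mb(detected_mb, reserved_mb=0, percent=50):
    """Memory one 50% pslot would see (hypothetical helper)."""
    return advertised_memory_mb(detected_mb, reserved_mb) * percent // 100


# Hypothetical node with 128500 MB detected and nothing reserved.
print(pslot_memory_mb(128500))

# Same node with a hypothetical 16384 MB held back for ceph-osd.
print(pslot_memory_mb(128500, reserved_mb=16384))
```

Under these assumptions, a value set for RESERVED_MEMORY would come off
each pslot in full, since the doubling and the 50% split cancel out.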

Sincerely,
Shawn

On 10/17/19 3:39 PM, Greg Thain wrote:
> On 10/17/19 11:39 AM, Shawn A Kwang wrote:
>> In Condor (v8.9.1) how do I assign CPU affinity to jobs on the compute
>> nodes with 24 cores? Let's say I want to limit condor to using 20 cores:
>> 0-19, for user jobs. It should be noted: the cluster is using
>> partitionable slots.
>>
>> Bigger picture: I wish to limit condor's resources because the compute
>> nodes run alongside the ceph-osd daemons which I want to 'reserve' a
>> certain amount of RAM and CPU.
> 
> 
> Shawn:
> 
> What I would do on this machine is set
> 
> 
> NUM_CPUS = 20
> 
> in the htcondor config.
> 
> This will tell htcondor that it only has 20 cores to work with (but not
> which physical ones), and condor will only dole out 20 cores worth of
> work. With cgroups, if there is contention for all the cores on the
> system, the sum of the condor jobs shouldn't exceed 20 cores worth, but
> the kernel is free to pick which physical cores to use, leaving the rest
> for ceph or other system daemons.
> 
> 
> -greg


-- 
Associate Scientist
Center for Gravitation, Cosmology, and Astrophysics
University of Wisconsin-Milwaukee
office: +1 414 229 4960
kwangs@xxxxxxx
#
# First tell the startd we have double the CPU and memory that we really have,
# and advertise some additional information into the slots (such as the number
# of Cpus and the amount of Memory left over in the pslot).
#
num_cpus = 2 * $(DETECTED_CPUS)
memory = 2 * (floor($(DETECTED_MEMORY)/1024) * 1024 - $(RESERVED_MEMORY:0))
startd_attrs = $(startd_attrs) RealtimeSlot preempt want_vacate Realtime_Resources_Inuse
startd_slot_attrs = $(startd_slot_attrs) Cpus TotalSlotCpus
# Decrease the startd polling interval so regular jobs are killed quickly when
# realtime jobs arrive.
polling_interval = 2

#
# Set up a pslot for the realtime jobs, adding a START requirement to prohibit accepting
# regular jobs. Give this slot a custom name of "realtimeX@foo", and a custom attribute of
# RealtimeSlot=True.
# We purposefully use the "==" operator in the Start expression here instead of
# the "=?=" operator when testing RealtimeJob so that the realtime1 pslot stays
# in unclaimed state instead of owner state.
#
slot_type_1_partitionable = true
slot_type_1 = cpus=50% memory=50% gpus=0% disk=50% swap=0%
num_slots_type_1 = 1
slot_type_1_RealtimeSlot = True
slot_type_1_name_prefix = realtime
slot_type_1_start = ( $(START) ) && online_cbc_gstlal_inspiral

#
# Set up a pslot for regular jobs.  Set the START expression on this slot
# to disallow realtime jobs, and only start regular jobs if no realtime jobs are running.
# Preempt all regular slots if a claim occurs on the realtime slot.
# Disable vacate time on these slots, so that jobs are immediately killed
# upon preemption (we want the resources freed up asap for the realtime jobs).
#
slot_type_2_partitionable = true
slot_type_2 = cpus=50% memory=50% gpus=50% disk=50% swap=0%
num_slots_type_2 = 1
# Was slot_type_1_RealtimeSlot, which would override the True set on the realtime pslot above.
slot_type_2_RealtimeSlot = False
Realtime_Resources_Inuse = ( realtime1_Cpus < realtime1_TotalSlotCpus )
slot_type_2_start = ( $(START) ) && !online_cbc_gstlal_inspiral && !$(Realtime_Resources_Inuse)
slot_type_2_preempt = $(Realtime_Resources_Inuse)
slot_type_2_want_vacate = False
