
[Condor-users] Condor configuration for Multi-CPU machines



Dear all,
 
I have just put together a small cluster of dual-core, dual-CPU machines, all running WinXP (x64), and wanted to share the configuration I have for them and get feedback on it. I have tried to set them up so that each machine can service jobs that require 1, 2 or 4 CPUs. This should allow a job that requires 2 CPUs (half the machine's resources) to co-exist alongside two 1 CPU jobs. There is also provision for a job that requires 4 CPUs (or 2 CPUs) to start while at least 1 CPU is still free, after which no further jobs are placed on the machine. This stops 4 CPU jobs from being blocked: otherwise the scheduler keeps filling the machines with 1 CPU jobs, and a 4 CPU job can never claim a whole machine because it is never completely free. So here it is, followed by a few questions for the more experienced:
 
(These are just the bits I've changed because of the multi-CPU nature of the computers.)
--------------------------------------------
NUMBER_OF_CLAIMED_CPUS = \
( \
   (1*(VM1_State =?= "Claimed")) + \
   (1*(VM2_State =?= "Claimed")) + \
   (1*(VM3_State =?= "Claimed")) + \
   (1*(VM4_State =?= "Claimed")) + \
   (2*(VM5_State =?= "Claimed")) + \
   (2*(VM6_State =?= "Claimed")) + \
   (4*(VM7_State =?= "Claimed")) \
)
 
MAINTAIN_CLAIM = \
( \
   ((VirtualMachineID == 1)&&(VM1_State =?= "Claimed")) || \
   ((VirtualMachineID == 2)&&(VM2_State =?= "Claimed")) || \
   ((VirtualMachineID == 3)&&(VM3_State =?= "Claimed")) || \
   ((VirtualMachineID == 4)&&(VM4_State =?= "Claimed")) || \
   ((VirtualMachineID == 5)&&(VM5_State =?= "Claimed")) || \
   ((VirtualMachineID == 6)&&(VM6_State =?= "Claimed")) || \
   ((VirtualMachineID == 7)&&(VM7_State =?= "Claimed")) \
)
 
# To claim a multi-CPU VM you must specify CPUS in the job description.
# This macro matches the 1 CPU VMs if the job does not define CPUS.
# It also prevents jobs that don't specify their CPU requirement from claiming the
# 4 CPU VM and restricting the machine to just one job.
JOB_CPUS_MATCHES_VM_CPUS = \
( \
  ((CPUS =?= TARGET.CPUS) == TRUE) \
  || ((CPUS == 1) && (TARGET.CPUS =?= UNDEFINED)) \
)
 
# the START expression is evaluated by each VM
# 4 is the total number of CPUs on each machine
START = \
( \
    (4 > $(NUMBER_OF_CLAIMED_CPUS)) \
    && $(JOB_CPUS_MATCHES_VM_CPUS) \
) || $(MAINTAIN_CLAIM)
 
# These are dedicated machines
IsOwner = False
 
STARTD_VM_EXPRS = State, Activity, ImageSize, EnteredCurrentActivity
# the machine has 4 CPUs and 2 GB of RAM, so these values are 3 times as much
# because we advertise the machine in three different ways
MEMORY = 6144
NUM_CPUS = 12

# the VMs are defined in this order, so VMs 1-4 have 1 CPU, VMs 5-6 have 2 CPUs and VM 7 has 4 CPUs
VIRTUAL_MACHINE_TYPE_1 = cpus=1, ram=512
VIRTUAL_MACHINE_TYPE_2 = cpus=2, ram=1024
VIRTUAL_MACHINE_TYPE_3 = cpus=4, ram=2048
NUM_VIRTUAL_MACHINES_TYPE_1 = 4
NUM_VIRTUAL_MACHINES_TYPE_2 = 2
NUM_VIRTUAL_MACHINES_TYPE_3 = 1
# These are dedicated machines
VIRTUAL_MACHINES_CONNECTED_TO_KEYBOARD = 0
VIRTUAL_MACHINES_CONNECTED_TO_CONSOLE = 0
--------------------------------------------
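
As a rough illustration of how the pieces above fit together, here is a small Python model of the VM layout, NUMBER_OF_CLAIMED_CPUS and JOB_CPUS_MATCHES_VM_CPUS. It is only a sketch: the function names are made up for this example and are not part of Condor, `None` stands in for a job that leaves CPUS undefined, and MAINTAIN_CLAIM is deliberately left out so that only the "accept a new job" half of START is modelled.

```python
# Hypothetical model of the config above; none of these names are Condor APIs.
# VM ids and CPU counts mirror VIRTUAL_MACHINE_TYPE_1..3 and their counts.
VM_CPUS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 4}
TOTAL_CPUS = 4  # physical CPUs per machine


def claimed_cpus(claimed):
    """NUMBER_OF_CLAIMED_CPUS: sum the CPU weights of all claimed VMs."""
    return sum(VM_CPUS[vm] for vm in claimed)


def job_matches_vm(vm, job_cpus):
    """JOB_CPUS_MATCHES_VM_CPUS: job_cpus=None models an undefined CPUS."""
    if job_cpus is None:
        return VM_CPUS[vm] == 1
    return VM_CPUS[vm] == job_cpus


def willing_vms(claimed, job_cpus):
    """Unclaimed VMs whose START (ignoring MAINTAIN_CLAIM) is true for the job."""
    if claimed_cpus(claimed) >= TOTAL_CPUS:
        return []
    return [vm for vm in VM_CPUS
            if vm not in claimed and job_matches_vm(vm, job_cpus)]


print(willing_vms(set(), None))      # job that does not define CPUS
print(willing_vms({1, 2, 3}, 4))     # 3 CPUs claimed, a 4 CPU job arrives
print(willing_vms({1, 2, 3, 7}, 1))  # 7 CPUs counted as claimed
```

In this model a job without CPUS is only offered the four 1 CPU VMs, a 4 CPU job can still land on VM 7 while one CPU is free, and once that happens no VM accepts further work until jobs finish.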
 
The above configuration has just gone into use on our cluster, and is working reasonably well. My only concern is with the MAINTAIN_CLAIM macro, which appears to be necessary. When a machine accepts a job that takes the number of claimed CPUs to 4 or more, the first part of the START expression becomes false. This caused a job to be dropped, something I could only prevent with the MAINTAIN_CLAIM macro. If anyone can enlighten me as to why this is, that would be great.
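
To make the MAINTAIN_CLAIM observation concrete, the following Python sketch evaluates a simplified START for each claimed VM with and without the extra clause. It is an assumption-laden model, not Condor code: the `start` function and its `maintain_claim` flag are invented for illustration, and it says nothing about why the startd acts on a false START for a VM that is already running a job.

```python
# Hypothetical model: evaluate START for one VM, with and without MAINTAIN_CLAIM.
VM_CPUS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 4}
TOTAL_CPUS = 4  # physical CPUs per machine


def start(vm, claimed, job_cpus, maintain_claim=True):
    """Value of START as seen by `vm` for a job requesting `job_cpus` CPUs."""
    used = sum(VM_CPUS[v] for v in claimed)
    accepts_new = used < TOTAL_CPUS and VM_CPUS[vm] == job_cpus
    keeps_claim = maintain_claim and vm in claimed
    return accepts_new or keeps_claim


claimed = {1, 2, 3}          # three 1 CPU jobs are running
assert start(7, claimed, 4)  # one CPU is free, so the 4 CPU VM may start a job
claimed.add(7)               # 7 CPUs are now counted as claimed

# Print: VM id, START without MAINTAIN_CLAIM, START with MAINTAIN_CLAIM.
for vm in sorted(claimed):
    cpus = VM_CPUS[vm]
    print(vm, start(vm, claimed, cpus, maintain_claim=False),
          start(vm, claimed, cpus, maintain_claim=True))
```

Once VM 7 is claimed, the model's START is false for every claimed VM unless the MAINTAIN_CLAIM term is present, which is at least consistent with the dropped job described above.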
 
I have also looked at getting the different VMs to run jobs at different priorities, particularly the 4 CPU VM, where every job by definition requires the whole machine. My preference would be for those jobs to run at a lower priority, allowing any jobs on the other VMs to finish more rapidly and then leave the 4 CPU VM in full control of the machine. Suspending the 4 CPU job is unhelpful, though, as there are still CPU cycles it could use whilst it waits for the whole machine to be freed.
 
So what do you all think? Are there any mistakes I have not spotted that will cause problems? Or is there a better way of doing this type of thing?
 
I hope this helps someone,
 
Peter
 
P.S. Thank you, Condor team; this software is very helpful.
 

Dr Peter Myerscough-Jackopson  -  Engineer
MULTIPLE ACCESS COMMUNICATIONS LIMITED
Delta House, The University of Southampton Science Park, Southampton, SO16 7NS,
United Kingdom.
Tel: +44 (0)23 8076 7808 Fax: +44 (0)23 8076 0602
Web: http://www.macltd.com/  Email: peter.myerscough-jackopson@xxxxxxxxxx