
Re: [Condor-users] RequestCpus > 1 and Dynamic (Partitionable) Slots



On 01/15/2011 08:52 PM, Erik Aronesty wrote:
I've confirmed on many tests that jobs with RequestCpus > 1 don't seem
to be compatible with dynamic slots.

Is this a condor version issue?  I'm running 7.4.4 on x86_64

Our system has many, many jobs that consume between 1 and 8 CPUs, and many
SMP machines with 4 and 32 cores.

(I can use condor_qedit and get a job to run on a dynamic slot just by
switching its Cpus to 1.   It will not run otherwise ... even if Start=TRUE)

The message from condor_q -analyze is "2 reject your job because of their
own requirements" ... (or however many slots are partitionable).

It would be nice to be able to take a job id and a node, and then ask
for an explanation of why it's not running on that node.

If I run a bunch of jobs with 1 cpu... the dynamic slot works as
advertised...  forking off new slots and reclaiming them later... quite
nicely.   I even like to leave some slots this way - since they are so
much better about resource utilization... in every other respect.

I've noticed one other thread posting about this, but have never seen a
final solution.

https://www-auth.cs.wisc.edu/lists/condor-users/2009-June/msg00065.shtml

Has anyone gotten dynamic slots to work with RequestCpus > 1... where it
actually decrements the number of cpus from those remaining?

 > condor -version
$CondorVersion: 7.2.4 Apr 11 2010 $
$CondorPlatform: X86_64-LINUX_DEBIAN_UNKNOWN $

 >condor_status ea-morpheus -l | grep Cpu
CpuIsBusy = false
Cpus = 1
CpuBusyTime = 0
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
TotalCpus = 4

Machine doing nothing:

 >cat /srv/condor/ea-morpheus
DAEMON_LIST = MASTER, STARTD
NUM_SLOTS=1
SLOT_TYPE_1=Cpu=4,auto
SLOT_TYPE_1_PARTITIONABLE=TRUE
NUM_SLOTS_TYPE_1=1
START=TRUE

JOB not running:

 > condor_q 6490.0 -l | grep Req
AutoClusterAttrs =
"JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory,FileSystemDomain,DiskUsage,ImageSize,Requirements,NiceUser,ConcurrencyLimits"
RequestDisk = DiskUsage
RequestMemory = 500
RequestCpus = 2
Requirements = ( Memory >= 500 ) && ( TARGET.Arch == "X86_64" ) && (
TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( (
RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.FileSystemDomain ==
MY.FileSystemDomain )

- Erik

Seems to work fine here...

$ rpm -q condor
condor-7.4.2-1.fc13.x86_64

$ condor_version
$CondorVersion: 7.4.2 May 20 2010 BuildID: Fedora-7.4.2-1.fc13 $
$CondorPlatform: X86_64-LINUX_F13 $

$ tail -n5 ~condor/condor_config.local
NUM_SLOTS=1
SLOT_TYPE_1=Cpu=4,auto
SLOT_TYPE_1_PARTITIONABLE=TRUE
NUM_SLOTS_TYPE_1=1
START=TRUE

08:29:45am$ condor_status
Name               OpSys Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle     0.320  3760 0+00:00:04

             Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX     1     0       0         1       0          0        0
       Total     1     0       0         1       0          0        0

08:29:47am$ echo 'cmd=/bin/sleep\nargs=1d\nrequestcpus=2\nqueue 2' | condor_submit
Submitting job(s)..
2 job(s) submitted to cluster 10377.
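
For readability, the one-line submission above can also be written out as a submit description file. This is only a sketch: the file name sleep.sub is a hypothetical placeholder, and the condor_submit call is left as a comment so the snippet runs without a pool.

```shell
# Sketch: the same four submit commands as the echo one-liner above,
# written to a file. "sleep.sub" is a hypothetical file name.
cat > sleep.sub <<'EOF'
cmd = /bin/sleep
args = 1d
requestcpus = 2
queue 2
EOF

# With a working pool you would then run:
#   condor_submit sleep.sub
```

Each of the two queued jobs asks for 2 CPUs, so together they should use up the 4 CPUs of the partitionable slot.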

08:30:27am$ condor_q
-- Submitter: eeyore.local : <192.168.1.100:37322> : eeyore.local
 ID       OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD
10377.0   matt    1/16 08:30   0+00:00:02  R  0   0.0  sleep 1d
10377.1   matt    1/16 08:30   0+00:00:00  I  0   0.0  sleep 1d
2 jobs; 1 idle, 1 running, 0 held

08:30:30am$ condor_status
Name               OpSys Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle     0.140  3759 0+00:00:56
slot1_1@xxxxxxxxxx LINUX X86_64 Claimed   Busy     0.000  1    0+00:00:04

             Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX     2     0       1         1       0          0        0
       Total     2     0       1         1       0          0        0

08:30:34am$ condor_status -format "%s\t" Name -format "%d\n" Cpus
slot1@xxxxxxxxxxxx	0
slot1_1@xxxxxxxxxxxx	2
slot1_2@xxxxxxxxxxxx	2
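
A quick way to check the accounting is to sum the Cpus column: the partitionable slot's remainder plus the dynamic slots should add up to the 4 CPUs configured in SLOT_TYPE_1. The sketch below uses the three lines above as canned sample data, so it does not need a pool; the variable names are made up for illustration.

```shell
# Sample data copied from the output above (tab-separated: Name, Cpus).
slots=$(printf 'slot1@xxxxxxxxxxxx\t0\nslot1_1@xxxxxxxxxxxx\t2\nslot1_2@xxxxxxxxxxxx\t2\n')

# Sum the second column across all slots.
total=$(printf '%s\n' "$slots" | awk -F'\t' '{ sum += $2 } END { print sum }')
echo "Cpus accounted for: $total"
```

Against a live pool, the output of the condor_status -format command above could be piped into the same awk expression instead of the sample data.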

08:31:00am$ recho You read to the end
dne eht ot daer uoY