
Re: [HTCondor-users] Trying to overcommit CPUs



On 12/04/2013 04:39 PM, Brian Candler wrote:
[Using HTCondor 8.0.4 from the Debian Wheezy package, running under
Ubuntu 12.04]

The types of jobs I run tend to wait on I/O quite a lot, and therefore
the CPUs are idle part of the time. So I'd like to allow more concurrent
jobs than there are CPUs (or threads), in proportion to the CPUs available.

I am using dynamic slots. If I set "cpus=150%", like this:

COUNT_HYPERTHREAD_CPUS = True
SLOT_TYPE_1 = cpus=150%, ram=90%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1

then startd repeatedly crashes. MasterLog shows:

12/04/13 21:10:32 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1382218101)
12/04/13 21:10:33 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2924
12/04/13 21:10:33 DefaultReaper unexpectedly called on pid 2924, status 1024.
12/04/13 21:10:33 The STARTD (pid 2924) exited with status 4
12/04/13 21:10:33 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:33 restarting /usr/sbin/condor_startd in 10 seconds
12/04/13 21:10:43 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2951
12/04/13 21:10:43 DefaultReaper unexpectedly called on pid 2951, status 1024.
12/04/13 21:10:43 The STARTD (pid 2951) exited with status 4
12/04/13 21:10:43 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:43 restarting /usr/sbin/condor_startd in 11 seconds
^C

And StartLog shows:

12/04/13 21:10:33 ERROR: Can't allocate 1st slot of type 1
         Requesting: slot type 1: Cpus: 48, Memory: 58083, Swap: 100.00%, Disk: 100.00%
         Available:  Slot #1: Cpus: 32, Memory: 64537, Swap: 100.00%, Disk: 100.00%
12/04/13 21:10:33 ERROR "Ran out of system resources" at line 122 in file /slots/04/dir_478/userdir/src/condor_startd.V6/slot_builder.cpp

So then I tried setting
request_cpus = 0.66
in the job submit file. But this doesn't work: each dynamic slot still
gets 1 CPU. Even "request_cpus = 0" gives the same result.

After some more digging, setting
NUM_CPUS = $(DETECTED_CPUS)*1.5
does seem to work.

Is there a better way, or something I've overlooked?

Thanks,

Brian.

You're on the right track.

It would be nice if jobs could request fractional CPUs. It'd also be nice if the SLOT_TYPE definition let you overcommit and still use % notation.
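
To spell out the arithmetic behind the StartLog error, here is a small Python sketch. It is only an illustration, not how the startd actually computes things: the helper slot_share is made up for this example, and truncating to an integer is an assumption that happens to reproduce the Memory figure in the log (90% of 64537 is 58083).

```python
def slot_share(total, percent):
    """Amount a slot type asks for when written as <percent>% of a resource.

    Hypothetical helper; truncation to an integer is an assumption.
    """
    return int(total * percent / 100)


detected_cpus = 32      # "Available: ... Cpus: 32" in the StartLog
detected_mem = 64537    # "Available: ... Memory: 64537"

# cpus=150%, ram=90% as in the original SLOT_TYPE_1 line
requested_cpus = slot_share(detected_cpus, 150)
requested_mem = slot_share(detected_mem, 90)
print(requested_cpus, requested_mem)
print("fits:", requested_cpus <= detected_cpus)

# Workaround from the thread: NUM_CPUS = $(DETECTED_CPUS)*1.5
# raises the total first, so a 100% slot no longer exceeds it.
advertised_cpus = int(detected_cpus * 1.5)
workaround_cpus = slot_share(advertised_cpus, 100)
print("fits with NUM_CPUS override:", workaround_cpus <= advertised_cpus)
```

The first check fails because the slot asks for 48 CPUs out of 32, which is the "Ran out of system resources" case. With the total raised to 48 via NUM_CPUS, the slot definition would presumably need cpus=100% (or no cpus= entry) rather than cpus=150%.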

Best,


matt