[HTCondor-users] Trying to overcommit CPUs



[Using HTCondor 8.0.4 from the Debian Wheezy package, running under Ubuntu 12.04]

The types of jobs I run tend to wait on I/O quite a lot, and therefore the CPUs are idle part of the time. So I'd like to allow more concurrent jobs than there are CPUs (or threads), in proportion to the CPUs available.

I am using dynamic slots. If I set "cpus=150%", like this:

COUNT_HYPERTHREAD_CPUS = True
SLOT_TYPE_1 = cpus=150%, ram=90%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1

then condor_startd crashes repeatedly. MasterLog shows:

12/04/13 21:10:32 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1382218101)
12/04/13 21:10:33 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2924
12/04/13 21:10:33 DefaultReaper unexpectedly called on pid 2924, status 1024.
12/04/13 21:10:33 The STARTD (pid 2924) exited with status 4
12/04/13 21:10:33 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:33 restarting /usr/sbin/condor_startd in 10 seconds
12/04/13 21:10:43 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2951
12/04/13 21:10:43 DefaultReaper unexpectedly called on pid 2951, status 1024.
12/04/13 21:10:43 The STARTD (pid 2951) exited with status 4
12/04/13 21:10:43 Sending obituary for "/usr/sbin/condor_startd"
12/04/13 21:10:43 restarting /usr/sbin/condor_startd in 11 seconds
^C

And StartLog shows:

12/04/13 21:10:33 ERROR: Can't allocate 1st slot of type 1
Requesting: slot type 1: Cpus: 48, Memory: 58083, Swap: 100.00%, Disk: 100.00%
Available: Slot #1: Cpus: 32, Memory: 64537, Swap: 100.00%, Disk: 100.00%
12/04/13 21:10:33 ERROR "Ran out of system resources" at line 122 in file /slots/04/dir_478/userdir/src/condor_startd.V6/slot_builder.cpp

So then I tried setting
request_cpus = 0.66
in the job submit file. But this doesn't work; each dynamic slot still gets 1 CPU. Indeed, even "request_cpus = 0" gives the same result.

After some more digging, setting
NUM_CPUS = $(DETECTED_CPUS)*1.5
does seem to work.
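
A sketch of how this workaround might sit alongside the partitionable slot settings shown earlier. Only the NUM_CPUS line is confirmed to work here; the cpus=100% value is an assumption, chosen so that the single partitionable slot claims the whole inflated CPU count instead of asking for more than the startd thinks it has:

# Advertise 1.5x the detected CPU count (with 32 detected CPUs, as in
# the StartLog above, this would come to 48).
NUM_CPUS = $(DETECTED_CPUS)*1.5
COUNT_HYPERTHREAD_CPUS = True
# Assumption: cpus=100% (not 150%), so the slot requests exactly NUM_CPUS.
SLOT_TYPE_1 = cpus=100%, ram=90%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1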

Is there a better way, or something I've overlooked?

Thanks,

Brian.