Re: [Condor-users] Problem setting up more slots than cpus



Hi Evan,

From the Condor manual:

NUM_SLOTS
    An integer value representing the number of slots reported when the SMP
    machine is being evenly divided, and the slot type settings described above
    are not being used. The default is one slot for each CPU. This setting can
    be used to reserve some CPUs on an SMP which would not be reported to the
    Condor pool. This value cannot be used to make Condor advertise more slots
    than there are CPUs on the machine. To do that, use NUM_CPUS.

The last two sentences describe exactly what you already found out: NUM_SLOTS
cannot raise the slot count above the number of CPUs, but NUM_CPUS can.
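
A minimal sketch of the evenly divided case that the manual text describes,
assuming no SLOT_TYPE_<N> or NUM_SLOTS_TYPE_<N> settings are defined (the value
2 is only an illustration, taken from your own config):

    # Report 2 CPUs to the startd, regardless of the real hardware
    NUM_CPUS = 2

    # NUM_SLOTS is left unset, so the default of one slot per reported
    # CPU should apply

As far as I recall from the manual, a change to NUM_CPUS is not picked up by a
plain reconfig; the condor_startd has to be restarted before it takes effect.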

Lukas Slebodnik

On Tue, Mar 08, 2011 at 05:44:58PM -0600, Niessen-Derry, Evan wrote:
> 
> While attempting to set up a machine to report more slots than it has
> CPUs, I can't seem to get it to report any slots on the machine at all.
> 
> `condor_version` gives me:
> <----
> $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
> $CondorPlatform: I386-LINUX_RHEL3 $
> ---->
> 
> With D_FULLDEBUG on for this machine, I get the following output from
> the SchedLog file and the StartLog file:
> <----
> 03/08 17:22:58 Getting monitoring info for pid 6410
> 03/08 17:22:58 DaemonCore: in SendAliveToParent()
> 03/08 17:23:16 condor_read(): timeout reading 5 bytes from
> <172.16.75.110:33505>.
> 03/08 17:23:16 IO: Failed to read packet header
> 03/08 17:23:16 Failed to read ClassAd size.
> 03/08 17:23:16 SECMAN: no classad from server, failing
> 03/08 17:23:16 ERROR: SECMAN:2004:Failed to create security session to
> <172.16.75.110:33505> with TCP.
> |SECMAN:2007:Failed to end classad message.
> 03/08 17:23:16 DaemonCore: startCommand() to <172.16.75.110:33505>
> failed. SendAliveToParent() failed.
> 03/08 17:23:16 Failed to send alive to <172.16.75.110:33505>, will try
> again...
> ---->
> 
> What's odd is that the CollectorLog on $(CONDOR_HOST) says that it gets
> an INVALIDATE_STARTD_ADS for the two slots I'm trying to set up.
> 
> The local config for this machine is so:
> <----
> # Where are the binaries
> RELEASE_DIR = /opt/condor-7.4.4
> 
> # How do we send mail?
> MAIL = /bin/mail
> 
> # What devices do we care about? (none, but I'm not sure if it works if
> # we don't define this)
> CONSOLE_DEVICES = mouse, console
> 
> # What daemons should we start?
> DAEMON_LIST = MASTER, STARTD, SCHEDD
> 
> # Where is our execute directory?
> LOCAL_DIR = /opt/condor-7.4.4/local
> 
> # TODO Define the default user to act as
> 
> # Define more cpus
> NUM_CPUS = 2
> 
> # Define more slots
> #NUM_SLOTS = 2
> 
> # Define types of slots
> SLOT_TYPE_1 = cpus=1, ram=%50, swap=1/2, disk=1/2
> SLOT_TYPE_2 = cpus=1, ram=%50, swap=1/2, disk=1/2
> 
> NUM_SLOTS_TYPE_1 = 1
> NUM_SLOTS_TYPE_2 = 1
> 
> # Debugging
> ALL_DEBUG = D_FULLDEBUG
> ---->
> 
> This problem doesn't happen when I'm not trying to lie to Condor about
> how many cpus I have. Is condor trying to teach me to stop lying, or am
> I missing something?
> 
> Thank you,
> Evan Niessen-Derry