
[HTCondor-users] setting up dedicated pool for parallel universe



Hi,

I am new to HTCondor set-up and administration, though I was a user in the past.
We are trying to set up a pool of 2 machines, each with dual CPUs with 20 cores each. I set up an i5 6-core machine as the master, and the other 2 HPC workstations as the nodes. All are running Ubuntu 16.04. HTCondor was installed from the default Ubuntu repositories using apt-get.

We have to run MPI jobs, so I also made the master the DedicatedScheduler. When I submit this simple submit file (below), the job stays Idle forever.
universe = parallel
executable = /bin/sleep
arguments = 3000
machine_count = 4
log = log
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue

I have written the configuration following the manual. Below are the config files on master and node01. The configuration files of node02 are identical to those of node01.

I cannot work out where the problem is, and forums and Google have not shown me how to get this working so far.
Any help is highly appreciated, as we have been stuck here for a couple of weeks.

master
/etc/condor/condor_config (package manager file UNCHANGED)
-------------------------------------Begin file ----------------------------------
RELEASE_DIR = /usr
LOCAL_DIR = /var
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
REQUIRE_LOCAL_CONFIG_FILE = false
LOCAL_CONFIG_DIR = /etc/condor/config.d
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
use SECURITY : HOST_BASED
#ALLOW_WRITE = *.cs.wisc.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
#SEC_PASSWORD_FILE = $(LOCAL_DIR)/lib/condor/pool_password
##  Pathnames
RUN     = $(LOCAL_DIR)/run/condor
LOG     = $(LOCAL_DIR)/log/condor
LOCK    = $(LOCAL_DIR)/lock/condor
SPOOL   = $(LOCAL_DIR)/spool/condor
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN     = $(RELEASE_DIR)/bin
LIB     = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN    = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE   = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe
CONDOR_HOST = $(FULL_HOSTNAME)
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
----------------------------------------------- end of file ----------------------------------------------------

master
File: /etc/condor/config.d/00debconf (Edited for configuration)
-------------------------------------Begin file ----------------------------------
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
CONDOR_ADMIN = root@localhost
RESERVED_MEMORY =
FILESYSTEM_DOMAIN =
UID_DOMAIN = XXXXXXXXXX
CONDOR_HOST = 10.10.48.81
ALLOW_WRITE = 10.10.48.81, 10.10.48.90, 10.10.48.86
ALLOW_NEGOTIATOR = 10.10.48.81
-----------------------------------end of file ----------------------------------------------
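
Since DAEMON_LIST and CONDOR_HOST are set in both files on the master, the value that takes effect depends on the order in which the files are read. The sketch below is a simplified model, not HTCondor's actual parser: it assumes the files under LOCAL_CONFIG_DIR are read after /etc/condor/condor_config and that a later assignment replaces an earlier one, and it ignores macro expansion, `use` lines, and anything else that is not a plain `NAME = value` line. The function name `merge_config` is made up for illustration.

```python
def merge_config(*texts):
    """Merge config fragments in read order; a later assignment wins.

    Simplified model: only plain `NAME = value` lines are handled,
    and names are compared case-insensitively.
    """
    merged = {}
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            merged[name.strip().upper()] = value.strip()
    return merged


# Relevant lines from the two files on the master, in assumed read order.
main_config = """
CONDOR_HOST = $(FULL_HOSTNAME)
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
"""
debconf = """
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
CONDOR_HOST = 10.10.48.81
"""

effective = merge_config(main_config, debconf)
print(effective["DAEMON_LIST"])
print(effective["CONDOR_HOST"])
```

Under these assumptions the master ends up without a STARTD, so it would schedule jobs but not run them. Running `condor_config_val DAEMON_LIST` on the machine reports the value actually in use.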

node01
File: /etc/condor/condor_config (Package manager's copy; identical to that of master node)
/etc/condor/config.d/00debconf (Edited for configuration)
-------------------------------------Begin file ----------------------------------
DAEMON_LIST = STARTD, SCHEDD, MASTER
CONDOR_ADMIN = root@localhost
RESERVED_MEMORY =
FILESYSTEM_DOMAIN =
UID_DOMAIN = XXXXXXXXX
CONDOR_HOST = 10.10.48.81
ALLOW_WRITE = 10.10.48.90, 10.10.48.81, 10.10.48.86
ALLOW_NEGOTIATOR = 10.10.48.81
# Added: by system admin: For Dedicated scheduler for parallel universe
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# 2) Always run jobs, but prefer dedicated ones
START        = True
SUSPEND      = False
CONTINUE     = True
PREEMPT      = False
KILL         = False
WANT_SUSPEND = False
WANT_VACATE  = False
RANK         = Scheduler =?= $(DedicatedScheduler)

MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
--------------------end of file ------------------------------------------------------------
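
The RANK expression above compares the job's Scheduler attribute with the node's DedicatedScheduler string using `=?=`, which is an exact, case-sensitive comparison for strings. A minimal sketch of that check follows, assuming the string has the form "DedicatedScheduler@<full hostname of the machine running the schedd>" and that no custom schedd name is configured. The hostnames and the function names are hypothetical placeholders, not the real values from this pool.

```python
def expected_dedicated_scheduler(schedd_host):
    """Build the quoted config value naming the schedd on `schedd_host`.

    Assumption: the value has the form "DedicatedScheduler@<host>".
    """
    return f'"DedicatedScheduler@{schedd_host}"'


def matches(configured_value, schedd_host):
    """Exact, case-sensitive comparison, mirroring the =?= operator."""
    return configured_value.strip() == expected_dedicated_scheduler(schedd_host)


# Hypothetical hostname, for illustration only.
submit_host = "master.example.org"

print(matches('"DedicatedScheduler@master.example.org"', submit_host))
print(matches('"DedicatedScheduler@Master.example.org"', submit_host))
print(matches('"DedicatedScheduler@master"', submit_host))
```

Only the first value matches; a difference in case or a short hostname instead of the full one is enough to make the comparison fail.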

Best regards,

Ram