
[Condor-users] Problem with condor_master in compute nodes



Hi,
I have been trying to get Condor working on a Rocks cluster. Condor works on the front end and on one of the compute nodes, but the rest of the compute nodes have a problem. There are 8 compute nodes in total.
----------------
The condor_config.local on the front end looks like this:
#
#  Condor local configuration file for frontend node.
#
COLLECTOR_NAME = Collector at protos
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = protos.cs.bgsu.edu
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = cs.bgsu.edu
HOSTALLOW_WRITE = protos.cs.bgsu.edu, *.local
JAVA = /usr/java/jdk1.5.0_07/bin/java
LOCAL_DIR = /home/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 129.1.64.210
RELEASE_DIR = /opt/condor
UID_DOMAIN = cs.bgsu.edu
-----------------------------------
and the one on the client nodes looks like this:

CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = protos.cs.bgsu.edu
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = cs.bgsu.edu
HOSTALLOW_WRITE = protos.cs.bgsu.edu, *.local
JAVA = /usr/java/jdk1.5.0_07/bin/java
LOCAL_DIR = /home/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 10.255.255.254
RELEASE_DIR = /opt/condor
UID_DOMAIN = cs.bgsu.edu
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
# Now set the argument with the Sun-specific maximum allowable value
JAVA_EXTRA_ARGUMENTS = -Xmx1906m
-----------------------------------
I was able to get the schedd, startd and master processes running on one of the nodes.
But when I try to do the same on the others, there is a problem.
I get the following messages in the Condor MasterLog:
11/14 11:18:05 Using config source: /opt/condor/etc/condor_config
11/14 11:18:05 Using local config sources:
11/14 11:18:05    /opt/condor/etc/condor_config.local
11/14 11:18:05 Failed to bind to command ReliSock
11/14 11:18:05 (Make sure your IP address is correct in /etc/hosts.)
11/14 11:18:05 ERROR "BindAnyCommandPort() failed" at line 8386 in file daemon_core.C
The IP address is correct in the /etc/hosts file.
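Since the client configuration above sets NETWORK_INTERFACE = 10.255.255.254 and the master fails while binding its command socket, one thing that may be worth checking on each failing node is whether that address can be bound there at all. A minimal sketch, assuming Python is available on the compute nodes (can_bind is a hypothetical helper name, not part of Condor):

```python
import socket


def can_bind(ip, port=0):
    """Return True if a TCP socket can be bound to `ip` on this host.

    Hypothetical helper for illustration only. Port 0 asks the OS for
    any free port, so only the address itself is being tested.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))
        return True
    except OSError:
        # Typically raised when the address is not assigned to any
        # local interface on this machine.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # 10.255.255.254 is the NETWORK_INTERFACE value from the client config.
    for ip in ("127.0.0.1", "10.255.255.254"):
        status = "bindable" if can_bind(ip) else "NOT bindable on this host"
        print(ip, status)
```

If the configured address turns out not to be bindable on a node, that would at least be consistent with the bind failure in the MasterLog, although it does not by itself prove that this is the cause.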
Also, when I run condor_q on these nodes I get:
-------------------
 Failed to fetch ads from: <10.255.255.254:45932> : compute-0-1.local
CEDAR:6001:Failed to connect to
---------------------
and condor_status gives:
----
[condor@compute-0-1 log]$ condor_status
CEDAR:6001:Failed to connect to <129.1.64.210:9618>
Error: Couldn't contact the condor_collector on protos.cs.bgsu.edu.
---
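To separate a Condor configuration problem from a plain network problem, a basic TCP connection test from the compute node to the collector address shown in the error may help. A rough sketch, again assuming Python on the node (can_connect is a hypothetical helper; the address and port are the ones from the condor_status output above):

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration only; it tests reachability
    at the TCP level and knows nothing about the Condor protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Collector address and port as reported by condor_status.
    host, port = "129.1.64.210", 9618
    if can_connect(host, port):
        print("TCP connection to the collector succeeded")
    else:
        print("could not open a TCP connection to the collector")
```

A failure here would point towards routing or firewall settings between the compute node and the front end, while a success would suggest looking at the Condor settings on the node instead.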
On the head node,
ps -aux | grep condor shows:
condor    2978  0.0  0.2   7536  2380 ?        Ss   Nov10   2:37 /opt/condor/sbin/condor_master
condor    2995  0.0  0.2   7508  3036 ?        Ss   Nov10   0:13 condor_collector -f
condor    3096  0.0  0.3   8964  4036 ?        Ss   Nov10   0:01 condor_schedd -f
condor    3097  0.0  0.3   7412  3080 ?        Ss   Nov10   0:12 condor_negotiator -f
I can see that the collector is running on the head node.

I just cannot figure out what I am missing, or where.
Please help.

-----------------------------------
Samir Khanal
CS Grad Student
Hayes 226
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx