[HTCondor-users] CEDAR:6001:Failed to connect to ...



Hi,
I have been using condor for half a year, deployed on virtual machines.
However, I ran 'yum update' a few times in recent weeks.
For the last few days I have been trying to redeploy the condor master, and it will not work.
The error I see is:

———
[root@oswrk117 ~]# condor_q
-- Failed to fetch ads from: <10.60.0.12:10574> : oswrk117.lns.mit.edu
CEDAR:6001:Failed to connect to <10.60.0.12:10574>
——

Could you advise me on how to fix my condor setup?
Below are the gory details of my current system.
Thanks,
Jan


THE DETAILS:

I’m now using this version of condor:
[root@oswrk117 ~]# rpm -qa | grep -i condor
condor-8.3.2-288596.x86_64


These are the condor processes that are running:
[root@oswrk117 ~]# ps -ef |grep condor
condor    3047     1  0 13:23 ?        00:00:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
root      3048  3047  0 13:23 ?        00:00:01 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 496
root      3431  2733  0 13:29 pts/0    00:00:00 grep condor


These are the daemons I want to run on this VM:
[root@oswrk117 ~]# condor_config_val -v DAEMON_LIST
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
 # at: /etc/condor/condor_config.local, line 51
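
One way to read the two listings above together is to compare DAEMON_LIST against the process table. The sketch below is only an illustration: missing_daemons is a hypothetical helper (not part of condor), and it assumes each daemon named in DAEMON_LIST runs as a process called condor_<lowercase name>.

```shell
#!/bin/sh
# missing_daemons "<DAEMON_LIST value>" "<process listing>"
# Hypothetical helper: prints every daemon from the list that has no
# matching condor_<name> entry in the given process listing.
missing_daemons() {
    list=$1
    procs=$2
    # Turn commas into spaces and lowercase the names, then test each one.
    for name in $(printf '%s' "$list" | tr ',' ' ' | tr 'A-Z' 'a-z'); do
        case "$procs" in
            *"condor_$name"*) ;;            # a matching process exists
            *) printf '%s\n' "$name" ;;     # nothing matching was found
        esac
    done
}

# On a live host, using the same two commands shown in this message:
#   missing_daemons "$(condor_config_val DAEMON_LIST)" "$(ps -ef)"
```

Fed the DAEMON_LIST value and the ps output shown above, it prints collector, negotiator, schedd and startd, i.e. only the master (plus condor_procd) is running.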


The OS on the VM is:
[root@oswrk117 ~]# uname -a
Linux oswrk117.lns.mit.edu 2.6.32-504.3.3.el6.x86_64 #1 SMP Tue Dec 16 14:29:22 CST 2014 x86_64 x86_64 x86_64 GNU/Linux

This VM has 2 network interfaces:
a) local
[root@oswrk117 ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr FA:16:3E:D6:F7:45  
          inet addr:10.60.0.12  Bcast:10.60.0.255  Mask:255.255.255.0
          inet6 addr: fe80::f816:3eff:fed6:f745/64 Scope:Link
b) public

The VM reports:
[root@oswrk117 ~]# hostname -f

I have set up the condor master to use the local IP for worker communication by setting these 2 variables:
# below use IP of the this node
TCP_FORWARDING_HOST = 198.125.163.117
PRIVATE_NETWORK_INTERFACE = 198.125.163.117

The firewall on the VM is deactivated:
[root@oswrk117 ~]# service iptables status
iptables: Firewall is not running.

Also, there is no port blocking on the OpenStack controller owning this VM:


Below is a full dump of my condor config file.
——————
[root@oswrk117 ~]# cat /etc/condor/condor_config.local
# modified by Jan Balewski, MIT
CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = "VM condor master on $(FULL_HOSTNAME)"
###############################################################################
# Pool settings
###############################################################################
# EC2 workers don't have shared filesystems or authentication
UID_DOMAIN = lns.mit.edu
TRUST_UID_DOMAIN = $(UID_DOMAIN)
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
USE_NFS = False
USE_AFS = False
USE_CKPT_SERVER = False
# The same for all machines with the same condor user
CONDOR_IDS = 496.492
###############################################################################
#  trick to force condor to use public IP
###############################################################################
# to check what IP condor uses execute:
#    condor_status -format "%s, " Name -format "%s\n" MyAddress
# to check what public IP VM uses execute:
# see more details in this post:
# below use IP of the this node
TCP_FORWARDING_HOST = 198.125.163.117
PRIVATE_NETWORK_INTERFACE = 198.125.163.117
###############################################################################
# Security settings
###############################################################################
# Allow local host and the central manager to manage the node
ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME), $(CONDOR_HOST)
# master needs this two particular versions
ALLOW_READ = *.lns.mit.edu,10.60.0.*
ALLOW_WRITE = *.lns.mit.edu,10.60.0.*
###############################################################################
# CPU usage settings
###############################################################################
# Don't count a hyperthreaded CPU as multiple CPUs
COUNT_HYPERTHREAD_CPUS = False
# Leave this commented out. If your instance has more than one CPU (i.e. if
# you use a large instance or something) then condor will advertise one
# slot for each CPU.
# for master reduce # of jobs to N-1
NUM_CPUS = 4
###############################################################################
# Daemon settings
###############################################################################
# Full list on the host node
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
# Don't run java
JAVA = 
###############################################################################
# Classads
###############################################################################
# Run everything, all the time
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
WANT_VACATE = False
WANT_SUSPEND = True
SUSPEND_VANILLA = False
WANT_SUSPEND_VANILLA = True
KILL = False
STARTD_EXPRS = START
###############################################################################
# Network settings
###############################################################################
# Use random numbers here so the workers don't all hit the collector at 
# the same time. If there are many workers the collector can get overwhelmed.
UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
# Port range for Jan's VM-condor cluster at LNS 
LOWPORT=9600
HIGHPORT=10600
[root@oswrk117 ~]#
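
As a sanity check on the address in the error message against the LOWPORT/HIGHPORT settings above, here is a minimal sketch. check_sinful is a hypothetical helper, and it assumes the plain <host:port> form seen in the condor_q error, with no extra parameters after the port.

```shell
#!/bin/sh
# check_sinful "<host:port>" LOWPORT HIGHPORT
# Hypothetical helper: splits an address of the form <host:port> and
# reports whether the port lies inside the given range (inclusive).
check_sinful() {
    addr=${1#"<"}          # drop the leading angle bracket
    addr=${addr%">"}       # drop the trailing angle bracket
    host=${addr%:*}        # everything before the last colon
    port=${addr##*:}       # everything after the last colon
    if [ "$port" -ge "$2" ] && [ "$port" -le "$3" ]; then
        echo "$host port $port is inside $2-$3"
    else
        echo "$host port $port is outside $2-$3"
    fi
}

# Address from the error message, range from the config dump:
check_sinful "<10.60.0.12:10574>" 9600 10600
```

For the values in this message it reports that port 10574 is inside 9600-10600. The host part, 10.60.0.12, is the eth0 address from the ifconfig output, not the 198.125.163.117 set in TCP_FORWARDING_HOST.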