
[Condor-users] Condor 6.6.10 on FC4 Configuration Woes



Hi,

I'm trying to configure Condor 6.6.10 on Fedora Core 4. I've read through the user manual (including the troubleshooting section) and the FAQ, searched the mailing lists and googled. A description of my problem follows.

The Central Manager is:

uname -a
Linux adac.anu.edu.au 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux



The first time I run condor_status on the central manager, I get the following:


7/19 15:15:34 (Sent 0 ads in response to query)
7/19 15:15:36 DC_AUTHENTICATE: attempt to open invalid session adac:3856:1121748123:13, failing.



I cannot reproduce this error after the first time. Every subsequent time, condor_status returns nothing, and CollectorLog displays:


7/19 15:39:04 Got QUERY_STARTD_ADS
7/19 15:39:04 (Sent 0 ads in response to query)


I have a firewall configured on the central manager. The open incoming ports are 9614, 9618 and an arbitrary port range (65000-65255), as described in Kewley J, "Using Condor effectively in the presence of Personal Firewalls", Oct 7 2004.


My condor_config file looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR               = $(TILDE)
UID_DOMAIN              = adac.anu.edu.au
FILESYSTEM_DOMAIN       = $(FULL_HOSTNAME)

HOSTALLOW_READ = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108

HOSTALLOW_WRITE = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108

condor_config.local:

HAS_FIREWALL = TRUE
STARTD_EXPRS = HAS_FIREWALL
COLLECTOR_NAME =
FILESYSTEM_DOMAIN = anu.edu.au
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.203045580144138
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = anu.edu.au
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local


My execute clients are also running FC4:
uname -a
Linux node1 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 i686 i386 GNU/Linux


When I run condor_status on a client, I get the following:

[root@node2 ~]# condor_status
CEDAR:6001:Failed to connect to <150.203.48.125:9618>
Error: Couldn't contact the condor_collector on adac.anu.edu.au.

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.


If you are the system administrator, check that the condor_collector is
running on adac.anu.edu.au, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
[root@node2 ~]#

StartLog looks as follows:
7/19 16:35:16 ******************************************************
7/19 16:35:16 ** condor_startd (CONDOR_STARTD) STARTING UP
7/19 16:35:16 ** /condor/sbin/condor_startd
7/19 16:35:16 ** $CondorVersion: 6.6.10 Jun 13 2005 $
7/19 16:35:16 ** $CondorPlatform: I386-LINUX_RH9 $
7/19 16:35:16 ** PID = 2930
7/19 16:35:16 ******************************************************
7/19 16:35:16 Using config file: /condor/etc/condor_config
7/19 16:35:16 Using local config files: /condor-local/condor_config.local
7/19 16:35:16 DaemonCore: Command Socket at <127.0.0.1:33534>
7/19 16:35:16 WARNING: Condor is running on the loopback address (127.0.0.1)
7/19 16:35:16 of this machine, and is not visible to other hosts!
7/19 16:35:16 This may be due to a misconfigured /etc/hosts file.
7/19 16:35:16 Please make sure your hostname is not listed on the
7/19 16:35:16 same line as localhost in /etc/hosts.
7/19 16:35:16 Error computing physical memory with calc_phys_mem().
MEMORY parameter not defined in config file.
Try setting MEMORY to the number of megabytes of RAM.
7/19 16:35:16 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C
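
The loopback warning in the StartLog above points at /etc/hosts. A minimal Python sketch of that check follows, assuming the usual hosts-file layout (an address followed by one or more names, with "#" starting a comment). The function name hostnames_on_loopback and the sample file contents are hypothetical and are not taken from the nodes described here.

```python
def hostnames_on_loopback(hosts_text):
    """Return names that share a line with a loopback address but are not
    themselves a 'localhost' alias (e.g. localhost.localdomain)."""
    flagged = []
    for raw_line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        # Only loopback entries are of interest (IPv4 127.x.x.x or IPv6 ::1).
        if not (address.startswith("127.") or address == "::1"):
            continue
        for name in names:
            # Treat localhost, localhost.localdomain, localhost4, ... as aliases.
            if not name.split(".")[0].startswith("localhost"):
                flagged.append(name)
    return flagged


# Hypothetical file contents, for illustration only.
SAMPLE_HOSTS = (
    "127.0.0.1   node2 localhost.localdomain localhost\n"
    "192.168.1.102   node2\n"
)
flagged_names = hostnames_on_loopback(SAMPLE_HOSTS)
```

Any name the function returns is listed on a loopback line and is a candidate for the situation the warning describes. On a real node the input would be the contents of /etc/hosts.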


And MasterLog repeats:

7/19 16:35:26 Can't send UPDATE_MASTER_AD to collector adac.anu.edu.au <150.203.48.125:9618>: Failed to send UDP update command to collector
7/19 16:35:26 The STARTD (pid 2930) exited with status 4
7/19 16:35:26 restarting /condor/sbin/condor_startd in 2057 seconds
7/19 16:35:26 Can't connect to <150.203.48.125:9618>:0, errno = 22
7/19 16:35:26 Will keep trying for 10 seconds...
7/19 16:35:36 Connect failed for 10 seconds; returning FALSE
7/19 16:35:36 ERROR:
SECMAN:2003:TCP connection to <150.203.48.125:9618> failed


I can ping and telnet to 9618 on the central manager from all client nodes.
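
The telnet test can also be scripted. Below is a minimal Python sketch that attempts a TCP connection and reports whether it succeeded. The helper name tcp_port_reachable and the 5 second timeout are assumptions made for illustration. Like telnet, it exercises TCP only, so it says nothing about the UDP update that the MasterLog reports as failing.

```python
import socket


def tcp_port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened
    within `timeout` seconds, False on any socket-level failure."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False
```

From a client node this would be called as tcp_port_reachable("adac.anu.edu.au", 9618), using the collector host and port shown in the logs above.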


Condor_config on the clients looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR = $(TILDE)
UID_DOMAIN = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
HOSTALLOW_READ = *.anu.edu.au
HOSTALLOW_WRITE = *.anu.edu.au

and condor_config.local:
COLLECTOR_NAME = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.0746330738338692
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxx localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,STARTD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = localdomain localhost
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local

I am running Condor as root. The IP addresses of my clients are included in HOSTALLOW on the central manager. I am logging all rejected packets on the firewall, and nothing appears to be rejected. This could be a network issue, but that seems unlikely since I can ping the central manager and telnet to it without problems.

I suspect I have misconfigured something, but I don't have enough experience with Condor to identify the problem. Any help would be greatly appreciated.

Cheers

Tom
--
Tom Kobialka
HPC Programmer (Data Grids)
APAC National Facility
Australian National University, Canberra, ACT 2600
Tom.Kobialka@xxxxxxxxxx