
Re: [Condor-users] Condor 6.6.10 on FC4 Configuration Woes



Hi Greg

 Thanks for the response.

Greg.Hitchen@xxxxxxxx wrote:

Hi Tom

I'm pretty sure the 6.6 production series has a problem with memory
on FC. I had this problem with FC3 and uncommented the:

MEMORY = whatever
RESERVED_SWAP = 5

 I did the same on FC4 and it works fine.

lines in the config file. I'm currently running with the 6.7 development
series and these problems don't occur.
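
For anyone working out what value to give MEMORY: the StartLog error quoted further down asks for the number of megabytes of RAM. Here is a minimal Python sketch of one way to get that number, assuming a Linux-style /proc/meminfo containing a "MemTotal: N kB" line. The function name mem_total_mb is made up for illustration and is not part of Condor.

```python
import re

def mem_total_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in whole megabytes."""
    # Assumption: the text has a line of the form "MemTotal:   <number> kB".
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1)) // 1024  # kB -> MB, rounded down
```

On an execute node it could be called as mem_total_mb(open("/proc/meminfo").read()), and the result written into the config file as the MEMORY value.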

Not sure about the rest of the problems but the reference to condor
running on the loopback address (127.0.0.1) can't be good! Was that
extract from the startlog file on the CM or an execute client?

 Solved by editing /etc/hosts.
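
The StartLog warning quoted below asks that the hostname not be listed on the same line as localhost in /etc/hosts. Here is a minimal Python sketch of that check, assuming the usual "address name [alias ...]" hosts-file layout. The function name and the host name used in the usage note are hypothetical, not taken from this pool.

```python
def loopback_lines_naming(hosts_text, hostname):
    """Return the hosts-file lines that map `hostname` to a loopback address."""
    bad = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line, comment-only line, or address with no names
        address, names = fields[0], fields[1:]
        is_loopback = address.startswith("127.") or address == "::1"
        if is_loopback and hostname in names:
            bad.append(line)
    return bad
```

Calling loopback_lines_naming(open("/etc/hosts").read(), "examplehost") returns an empty list when the (hypothetical) host name examplehost appears only on lines with a non-loopback address.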

Cheers

Tom

Cheers

Greg

P.S. Searching the mail-list archives has sometimes been useful for me.


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Tom Kobialka
Sent: Tuesday, 19 July 2005 2:06 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Condor 6.6.10 on FC4 Configuration Woes



Hi,

I'm trying to configure Condor 6.6.10 on Fedora Core 4. I've read through the user manual (including the troubleshooting section) and the FAQ, searched the mailing lists, and googled. A description of my problem follows.

The Central Manager is:

uname -a
Linux adac.anu.edu.au 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux



The first time I do a condor_status on the central manager I get the following:


7/19 15:15:34 (Sent 0 ads in response to query)
7/19 15:15:36 DC_AUTHENTICATE: attempt to open invalid session adac:3856:1121748123:13, failing.



I cannot reproduce this error after the first time. On every later run condor_status prints nothing, and CollectorLog displays:


7/19 15:39:04 Got QUERY_STARTD_ADS
7/19 15:39:04 (Sent 0 ads in response to query)


I have a firewall configured on the central manager. The open incoming ports are 9614, 9618 and an arbitrary port range (65000-65255), as described in "Kewley J, Using Condor effectively in the presence of Personal Firewalls, Oct 7 2004,


My condor_config file looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR               = $(TILDE)
UID_DOMAIN              = adac.anu.edu.au
FILESYSTEM_DOMAIN       = $(FULL_HOSTNAME)

HOSTALLOW_READ = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108

HOSTALLOW_WRITE = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108

condor_config.local:

HAS_FIREWALL = TRUE
STARTD_EXPRS = HAS_FIREWALL
COLLECTOR_NAME =
FILESYSTEM_DOMAIN = anu.edu.au
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.203045580144138
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = anu.edu.au
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local


My Execute clients are also running FC4.
uname -a
Linux node1 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 i686 i386 GNU/Linux


 When I do a condor_status I get the following:

[root@node2 ~]# condor_status
CEDAR:6001:Failed to connect to <150.203.48.125:9618>
Error: Couldn't contact the condor_collector on adac.anu.edu.au.

Extra Info: the condor_collector is a process that runs on the central manager of your Condor pool and collects the status of all the machines and jobs in the Condor pool. The condor_collector might not be running, it might be refusing to communicate with you, there might be a network problem, or there may be some other problem. Check with your system administrator to fix this problem.

If you are the system administrator, check that the condor_collector is running on adac.anu.edu.au, check the HOSTALLOW configuration in your condor_config, and check the MasterLog and CollectorLog files in your log directory for possible clues as to why the condor_collector is not responding. Also see the Troubleshooting section of the manual.
[root@node2 ~]#

StartLog looks as follows:
7/19 16:35:16 ******************************************************
7/19 16:35:16 ** condor_startd (CONDOR_STARTD) STARTING UP
7/19 16:35:16 ** /condor/sbin/condor_startd
7/19 16:35:16 ** $CondorVersion: 6.6.10 Jun 13 2005 $
7/19 16:35:16 ** $CondorPlatform: I386-LINUX_RH9 $
7/19 16:35:16 ** PID = 2930
7/19 16:35:16 ******************************************************
7/19 16:35:16 Using config file: /condor/etc/condor_config
7/19 16:35:16 Using local config files: /condor-local/condor_config.local
7/19 16:35:16 DaemonCore: Command Socket at <127.0.0.1:33534>
7/19 16:35:16 WARNING: Condor is running on the loopback address (127.0.0.1)
7/19 16:35:16 of this machine, and is not visible to other hosts!
7/19 16:35:16 This may be due to a misconfigured /etc/hosts file.
7/19 16:35:16 Please make sure your hostname is not listed on the
7/19 16:35:16 same line as localhost in /etc/hosts.
7/19 16:35:16 Error computing physical memory with calc_phys_mem().
MEMORY parameter not defined in config file.
Try setting MEMORY to the number of megabytes of RAM.
7/19 16:35:16 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C


And MasterLog repeats:

7/19 16:35:26 Can't send UPDATE_MASTER_AD to collector adac.anu.edu.au <150.203.48.125:9618>: Failed to send UDP update command to collector
7/19 16:35:26 The STARTD (pid 2930) exited with status 4
7/19 16:35:26 restarting /condor/sbin/condor_startd in 2057 seconds
7/19 16:35:26 Can't connect to <150.203.48.125:9618>:0, errno = 22
7/19 16:35:26 Will keep trying for 10 seconds...
7/19 16:35:36 Connect failed for 10 seconds; returning FALSE
7/19 16:35:36 ERROR: SECMAN:2003:TCP connection to <150.203.48.125:9618> failed

I can ping and telnet to 9618 on the central manager from all client nodes.


The condor_config on the clients looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR = $(TILDE)
UID_DOMAIN = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
HOSTALLOW_READ = *.anu.edu.au
HOSTALLOW_WRITE = *.anu.edu.au

and condor_config.local:
COLLECTOR_NAME = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.0746330738338692
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxx localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,STARTD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = localdomain localhost
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local

I am running Condor as root. The IP addresses of my clients are included in HOSTALLOW on the central manager. I am logging all rejected packets on the firewall and nothing seems to be rejected. This could be a network issue, but that seems unlikely since I can ping the central manager and telnet to it without trouble.

I suspect I have misconfigured something, but I don't have enough experience with Condor to identify the problem. Any help would be greatly appreciated.

Cheers

Tom
--
Tom Kobialka
HPC Programmer (Data Grids)
APAC National Facility
Australian National University, Canberra, ACT 2600
Tom.Kobialka@xxxxxxxxxx


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users




