RE: [Condor-users] Condor 6.6.10 on FC4 Configuration Woes



Hi Tom

I'm pretty sure the 6.6 production series has trouble detecting physical
memory on Fedora Core. I had the same problem on FC3 and worked around it
by uncommenting these lines in the config file and setting them explicitly:

MEMORY = whatever
RESERVED_SWAP = 5

MEMORY should be the machine's RAM in megabytes. I'm currently running the
6.7 development series, where these problems don't occur.
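
If it helps to work out a value for MEMORY, here is a minimal Python
sketch that reads the MemTotal line from /proc/meminfo-style text and
converts it to whole megabytes. The function name and the sample figure
are only illustrative, and I am assuming the usual "MemTotal: <n> kB"
layout that Linux kernels report.

```python
def mem_total_mb(meminfo_text: str) -> int:
    """Return MemTotal from /proc/meminfo-style text, in whole megabytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Assumed layout: "MemTotal:    1035112 kB" (value in kilobytes).
            kilobytes = int(line.split()[1])
            return kilobytes // 1024
    raise ValueError("MemTotal line not found")


# Hypothetical sample input; on a real node, pass the contents of
# /proc/meminfo instead, e.g. open("/proc/meminfo").read().
sample = "MemTotal:      1035112 kB\nMemFree:        512000 kB\n"
print(f"MEMORY = {mem_total_mb(sample)}")
```

The printed line can be pasted into the local config file in place of
"whatever". Rounding down is a deliberate choice here so that the
configured value never exceeds what the machine actually has.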

I'm not sure about the rest of the problems, but the warning that Condor is
running on the loopback address (127.0.0.1) can't be good! Was that
extract from the StartLog file on the CM or on an execute client?
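
The StartLog warning says to make sure the hostname is not listed on the
same line as localhost in /etc/hosts. A rough Python sketch of that
check follows; the function names and the sample hosts text are made up
for illustration, and it only looks at 127.x.x.x and ::1 entries.

```python
def loopback_names(hosts_text: str) -> set:
    """Collect every name listed on a loopback line of hosts-file text."""
    names = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, aliases = fields[0], fields[1:]
        if address.startswith("127.") or address == "::1":
            names.update(alias.lower() for alias in aliases)
    return names


def hostname_on_loopback(hosts_text: str, hostname: str) -> bool:
    """True if the full or short hostname appears on a loopback line."""
    short = hostname.split(".", 1)[0].lower()
    names = loopback_names(hosts_text)
    return hostname.lower() in names or short in names


# Hypothetical hosts file contents; on a real node, read /etc/hosts and
# pass socket.gethostname() as the hostname.
sample = (
    "127.0.0.1 localhost.localdomain localhost node2\n"
    "150.203.48.125 adac.anu.edu.au adac\n"
)
print(hostname_on_loopback(sample, "node2"))
```

If this reports True for a node, moving the hostname off the 127.0.0.1
line and onto a line with the machine's real address is the change the
warning is asking for.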

Cheers

Greg

P.S. Searching the mailing-list archives has sometimes been useful for me.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Tom Kobialka
> Sent: Tuesday, 19 July 2005 2:06 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] Condor 6.6.10 on FC4 Configuration Woes
> 
> 
> Hi,
> 
>   I'm trying to configure Condor 6.6.10 on Fedora Core 4. I've read
> through the user manual (including troubleshooting) and the FAQ, searched
> through the mailing lists, and googled. A description of my
> problem follows.
> 
> The Central Manager is:
> 
> uname -a
> Linux adac.anu.edu.au 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
> 
> 
>   The first time I do a condor_status on the central manager I get the
> following:
> 
> 7/19 15:15:34 (Sent 0 ads in response to query)
> 7/19 15:15:36 DC_AUTHENTICATE: attempt to open invalid session adac:3856:1121748123:13, failing.
> 
> 
>   I cannot reproduce this error after the first time. Every other time,
> condor_status comes up blank, and CollectorLog displays:
> 
> 7/19 15:39:04 Got QUERY_STARTD_ADS
> 7/19 15:39:04 (Sent 0 ads in response to query)
> 
> 
>   I have a firewall configured on the central manager. The open incoming
> ports are 9614, 9618 and an arbitrary port range (65000-65255), as
> described in "Kewley J, Using Condor effectively in the presence of 
> Personal Firewalls, Oct 7 2004,
> 
> My condor_config file looks as follows:
> CONDOR_HOST = adac.anu.edu.au
> LOCAL_DIR               = $(TILDE)
> UID_DOMAIN              = adac.anu.edu.au
> FILESYSTEM_DOMAIN       = $(FULL_HOSTNAME)
> 
> HOSTALLOW_READ = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 
> 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 
> 192.168.1.106, 192.168.1.107, 192.168.1.108
> 
> HOSTALLOW_WRITE = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 
> 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 
> 192.168.1.106, 192.168.1.107, 192.168.1.108
> 
> condor_config.local:
> 
> HAS_FIREWALL = TRUE
> STARTD_EXPRS = HAS_FIREWALL
> COLLECTOR_NAME =
> FILESYSTEM_DOMAIN = anu.edu.au
> SUSPEND =
> LOCK = /tmp/condor-lock.$(HOSTNAME)0.203045580144138
> JAVA_MAXHEAP_ARGUMENT =
> CONDOR_ADMIN = root@localhost
> START =
> MAIL = /bin/mail
> RELEASE_DIR = /condor
> DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
> COLLECTOR = $(SBIN)/condor_collector
> PREEMPT =
> UID_DOMAIN = anu.edu.au
> NEGOTIATOR = $(SBIN)/condor_negotiator
> JAVA = /usr/bin/java
> VACATE =
> CONDOR_HOST = adac.anu.edu.au
> CONDOR_IDS = 503.503
> LOCAL_DIR = /condor-local
> 
> 
>   My execute clients are also running FC4.
> uname -a
> Linux node1 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 i686 i386 GNU/Linux
> 
>   When I do a condor_status I get the following:
> 
> [root@node2 ~]# condor_status
> CEDAR:6001:Failed to connect to <150.203.48.125:9618>
> Error: Couldn't contact the condor_collector on adac.anu.edu.au.
> 
> Extra Info: the condor_collector is a process that runs on the central
> manager of your Condor pool and collects the status of all the machines
> and jobs in the Condor pool. The condor_collector might not be running,
> it might be refusing to communicate with you, there might be a network
> problem, or there may be some other problem. Check with your system
> administrator to fix this problem.
> 
> If you are the system administrator, check that the condor_collector is
> running on adac.anu.edu.au, check the HOSTALLOW configuration in your
> condor_config, and check the MasterLog and CollectorLog files in your
> log directory for possible clues as to why the condor_collector is not
> responding. Also see the Troubleshooting section of the manual.
> [root@node2 ~]#
> 
> StartLog looks as follows:
> 7/19 16:35:16 ******************************************************
> 7/19 16:35:16 ** condor_startd (CONDOR_STARTD) STARTING UP
> 7/19 16:35:16 ** /condor/sbin/condor_startd
> 7/19 16:35:16 ** $CondorVersion: 6.6.10 Jun 13 2005 $
> 7/19 16:35:16 ** $CondorPlatform: I386-LINUX_RH9 $
> 7/19 16:35:16 ** PID = 2930
> 7/19 16:35:16 ******************************************************
> 7/19 16:35:16 Using config file: /condor/etc/condor_config
> 7/19 16:35:16 Using local config files: /condor-local/condor_config.local
> 7/19 16:35:16 DaemonCore: Command Socket at <127.0.0.1:33534>
> 7/19 16:35:16 WARNING: Condor is running on the loopback address (127.0.0.1)
> 7/19 16:35:16          of this machine, and is not visible to other hosts!
> 7/19 16:35:16          This may be due to a misconfigured /etc/hosts file.
> 7/19 16:35:16          Please make sure your hostname is not listed on the
> 7/19 16:35:16          same line as localhost in /etc/hosts.
> 7/19 16:35:16 Error computing physical memory with calc_phys_mem().
>                  MEMORY parameter not defined in config file.
>                  Try setting MEMORY to the number of megabytes of RAM.
> 7/19 16:35:16 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C
> 
> And MasterLog repeats:
> 
> 7/19 16:35:26 Can't send UPDATE_MASTER_AD to collector adac.anu.edu.au <150.203.48.125:9618>: Failed to send UDP update command to collector
> 7/19 16:35:26 The STARTD (pid 2930) exited with status 4
> 7/19 16:35:26 restarting /condor/sbin/condor_startd in 2057 seconds
> 7/19 16:35:26 Can't connect to <150.203.48.125:9618>:0, errno = 22
> 7/19 16:35:26 Will keep trying for 10 seconds...
> 7/19 16:35:36 Connect failed for 10 seconds; returning FALSE
> 7/19 16:35:36 ERROR: SECMAN:2003:TCP connection to <150.203.48.125:9618> failed
> 
>   I can ping the central manager and telnet to port 9618 on it from all
> client nodes.
> 
> 
>   Condor_config on the clients looks as follows:
> CONDOR_HOST     = adac.anu.edu.au
> LOCAL_DIR               = $(TILDE)
> UID_DOMAIN              = adac.anu.edu.au
> FILESYSTEM_DOMAIN       = $(FULL_HOSTNAME)
> HOSTALLOW_READ = *.anu.edu.au
> HOSTALLOW_WRITE = *.anu.edu.au
> 
> and condor_config.local:
> COLLECTOR_NAME = adac.anu.edu.au
> FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
> SUSPEND =
> LOCK = /tmp/condor-lock.$(HOSTNAME)0.0746330738338692
> JAVA_MAXHEAP_ARGUMENT =
> CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxx localhost
> START =
> MAIL = /bin/mail
> RELEASE_DIR = /condor
> DAEMON_LIST = MASTER,STARTD
> COLLECTOR = $(SBIN)/condor_collector
> PREEMPT =
> UID_DOMAIN = localdomain localhost
> NEGOTIATOR = $(SBIN)/condor_negotiator
> JAVA = /usr/bin/java
> VACATE =
> CONDOR_HOST = adac.anu.edu.au
> CONDOR_IDS = 503.503
> LOCAL_DIR = /condor-local
> 
>   I am running Condor as root. The IP addresses of my clients are
> included in HOSTALLOW on the central manager. I am logging all rejected
> packets on the firewall and nothing seems to be rejected. This could be
> a network issue, but that seems unlikely since I can telnet to and ping
> the central manager fine.
> 
>   I suspect I have misconfigured something, but I don't have enough
> experience with Condor to identify the problem. Any help would be
> greatly appreciated.
> 
> Cheers
> 
> Tom
> -- 
> Tom Kobialka
> HPC Programmer (Data Grids)
> APAC National Facility
> Australian National University, Canberra, ACT 2600 
> Tom.Kobialka@xxxxxxxxxx
> 