[Condor-users] Condor,Fedora core installation problem

I am a Linux newbie trying to install Condor for an academic project.

I have been stuck on this problem for quite a while, and after trying to find the cause I have given up.

I installed Condor 6.6.10 on Fedora Core 4, which is running under VMware Workstation 5.0 on my laptop.

I have two copies of Fedora Core 4 (a central manager and a worker node) running as guests on a Windows host, and I installed Condor on both.

I can ping and ssh between the central manager and the worker node in both directions, so they seem to be communicating well.

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

These are the steps I followed to install Condor on the master node (192.168.60.128):

cd /usr/local/condor-6.6.10

./condor_configure --install --type=manager --owner=condor

Then I set the CONDOR_CONFIG environment variable to /usr/local/condor-6.6.10/etc/condor_config

I made the following changes to the condor_config.local file:

The START, PREEMPT, SUSPEND and VACATE variables are set to true

NETWORK_INTERFACE = 192.168.60.128

I also made changes to the condor_config file.

Then I start the master daemon and run ps aux | egrep condor_. I can see the required daemons running.

I checked the log files for errors and everything is fine; there are no errors.

./condor_status shows me the central manager as available.

To install Condor on the worker node, which is the other Fedora Core 4 guest OS (192.168.60.129), I follow the same steps,

except that when configuring the worker node I use the command

./condor_configure --install --type=submit,execute --central-manager=192.168.60.128 --owner=condor

The remaining steps are the same. The startd, schedd and master daemons start properly.

But when I run ./condor_status, I get the following error message:

CEDAR:6001:Failed to connect to <192.168.60.128:9618>
Error: Couldn't contact the condor_collector on 192.168.60.128

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines
and jobs in the Condor pool. The condor_collector might not be running, it
might be refusing to communicate with you, there might be a network problem,
or there may be some other problem. Check with your system administrator to
fix this problem.
If you are the system administrator, check that the condor_collector is
running on 192.168.60.128, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your
log directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
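
Since ping and ssh succeed while the connection to port 9618 fails, it may help to test that one TCP port directly from the worker node. The following is a minimal sketch in Python, not part of Condor; the helper name `can_connect` and the timeout value are assumptions chosen for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Try to open a TCP connection to host:port.

    Returns (True, None) on success, or (False, exc) where exc is the
    OSError raised by the failed attempt (refused, unreachable, timeout).
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, exc
```

For example, calling `can_connect("192.168.60.128", 9618)` on the worker node attempts the same connection that condor_status makes. If it fails while ssh to the same address works, the problem is likely specific to that port (for instance a packet filter or the collector not listening) rather than general connectivity.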

The log files on the worker node are shown below.

MASTER LOG

2/11 10:22:37 ******************************************************
2/11 10:22:37 ** condor_master (CONDOR_MASTER) STARTING UP
2/11 10:22:37 ** /usr/local/condor-6.6.10/sbin/condor_master
2/11 10:22:37 ** $CondorVersion: 6.6.10 Jun 13 2005 $
2/11 10:22:37 ** $CondorPlatform: I386-LINUX_RH9 $
2/11 10:22:37 ** PID = 2738
2/11 10:22:37 ******************************************************
2/11 10:22:37 Using config file:
/usr/local/condor-6.6.10/etc/condor_config
2/11 10:22:37 Using local config files:
/usr/local/condor-6.6.10/local.slave/condor_config.local
2/11 10:22:37 DaemonCore: Command Socket at <192.168.60.129:32770>
2/11 10:22:37 Started DaemonCore process
"/usr/local/condor-6.6.10/sbin/condor_schedd", pid and pgroup = 2739
2/11 10:22:37 Started DaemonCore process
"/usr/local/condor-6.6.10/sbin/condor_startd", pid and pgroup = 2740
2/11 10:22:43 Can't connect to <192.168.60.128:9618>:0, errno = 113
2/11 10:22:43 Will keep trying for 10 seconds...
2/11 10:23:01 Connect failed for 10 seconds; returning FALSE
2/11 10:23:01 ERROR:
SECMAN:2003:TCP connection to <192.168.60.128:9618> failed

2/11 10:23:01 Can't send UPDATE_MASTER_AD to collector
<192.168.60.128:9618>:Failed to send UDP update command to collector
2/11 10:28:01 Can't connect to <192.168.60.128:9618>:0, errno = 113
2/11 10:28:01 Will keep trying for 10 seconds...
2/11 10:28:13 Connect failed for 10 seconds; returning FALSE
2/11 10:28:13 ERROR:
SECMAN:2003:TCP connection to <192.168.60.128:9618> failed
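
The "errno = 113" in the log can be translated into a symbolic name and message. Errno numbers are platform-specific; on Linux, 113 normally corresponds to EHOSTUNREACH ("No route to host"), while a daemon that is simply not listening would usually produce ECONNREFUSED (111) instead. The sketch below uses only the Python standard library; the helper name `describe_errno` is hypothetical, and its output should be read on the machine that wrote the log.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'ENOENT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 111 and 113 are the values of interest for the log above (Linux numbering).
    for code in (111, 113):
        print(code, *describe_errno(code))
```

If 113 resolves to "No route to host" even though ping works, one common explanation (an assumption here, not confirmed by the log) is a firewall on the central manager rejecting traffic to port 9618.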

Any advice on this would be really helpful.
