
[Condor-users] Condor, Fedora Core installation problem



Hi,
I am a Linux newbie trying to install Condor for an academic project.
I have been stuck on this problem for quite a while, and after trying to find the cause I have given up.
I installed Condor 6.6.10 on Fedora Core 4, which runs under VMware Workstation 5.0 on my laptop.
I have two Fedora Core 4 guests (a central manager and a worker node) running on a Windows host, and I installed Condor on both.
Each machine can ping and ssh to the other, so they seem to be communicating well.
 
These are the steps I followed to install Condor on the master node (192.168.60.128):
cd /usr/local/condor-6.6.10
./condor_configure --install --type=manager --owner=condor
 
Then I set the CONDOR_CONFIG environment variable to /usr/local/condor-6.6.10/etc/condor_config.
I made the following changes to the condor_config.local file:
START, PREEMPT, SUSPEND and VACATE are set to true
NETWORK_INTERFACE = 192.168.60.128
 
I also made these changes to the condor_config file:
MEMORY = 512
RESERVED_MEMORY = 40
RESERVED_SWAP = 5
 
Then I start the master daemon and run ps aux | egrep condor_. I can see the required daemons running.
I checked the log files and there are no errors.
./condor_status shows the central manager as available.
 
When installing Condor on the worker node, the other Fedora Core 4 guest (192.168.60.129), I followed the same steps,
except that I configured the worker node with the command
 
./condor_configure --install --type=submit,execute --central-manager=192.168.60.128 --owner=condor
 
The remaining steps are the same. The startd, schedd and master daemons start properly.
But when I run ./condor_status, I get the following error message:
 
 
CEDAR:6001:Failed to connect to <192.168.60.128:9618>
Error: Couldn't contact the condor_collector on 192.168.60.128

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines
and jobs in the Condor pool. The condor_collector might not be running, it
might be refusing to communicate with you, there might be a network problem,
or there may be some other problem. Check with your system administrator to
fix this problem.
If you are the system administrator, check that the condor_collector is
running on 192.168.60.128, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your
log directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
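
The connection that condor_status attempts can also be tried outside Condor, which may help separate a network problem from a Condor configuration problem. The following is only a minimal sketch, assuming Python 3 is available on the worker node; `probe_tcp` is a hypothetical helper name, not part of Condor.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a plain TCP connection to host:port.

    Returns (True, None) if the connection succeeds, otherwise
    (False, errno), where errno may be None for a timeout.
    """
    try:
        # create_connection opens the socket and connects in one step;
        # the "with" block closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers refused connections, unreachable hosts and timeouts.
        return False, exc.errno
```

Calling `probe_tcp("192.168.60.128", 9618)` from the worker node would attempt the same host and port that appear in the error above. If it also fails, the problem is probably below Condor (routing or packet filtering) rather than in the HOSTALLOW settings.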
 
 
The log files on the worker node are shown below.
 
MASTER LOG

2/11 10:22:37 ******************************************************
2/11 10:22:37 ** condor_master (CONDOR_MASTER) STARTING UP
2/11 10:22:37 ** /usr/local/condor-6.6.10/sbin/condor_master
2/11 10:22:37 ** $CondorVersion: 6.6.10 Jun 13 2005 $
2/11 10:22:37 ** $CondorPlatform: I386-LINUX_RH9 $
2/11 10:22:37 ** PID = 2738
2/11 10:22:37 ******************************************************
2/11 10:22:37 Using config file: /usr/local/condor-6.6.10/etc/condor_config
2/11 10:22:37 Using local config files: /usr/local/condor-6.6.10/local.slave/condor_config.local
2/11 10:22:37 DaemonCore: Command Socket at <192.168.60.129:32770>
2/11 10:22:37 Started DaemonCore process "/usr/local/condor-6.6.10/sbin/condor_schedd", pid and pgroup = 2739
2/11 10:22:37 Started DaemonCore process "/usr/local/condor-6.6.10/sbin/condor_startd", pid and pgroup = 2740
2/11 10:22:43 Can't connect to <192.168.60.128:9618>:0, errno = 113
2/11 10:22:43 Will keep trying for 10 seconds...
2/11 10:23:01 Connect failed for 10 seconds; returning FALSE
2/11 10:23:01 ERROR: SECMAN:2003:TCP connection to <192.168.60.128:9618> failed

2/11 10:23:01 Can't send UPDATE_MASTER_AD to collector <192.168.60.128:9618>:Failed to send UDP update command to collector
2/11 10:28:01 Can't connect to <192.168.60.128:9618>:0, errno = 113
2/11 10:28:01 Will keep trying for 10 seconds...
2/11 10:28:13 Connect failed for 10 seconds; returning FALSE
2/11 10:28:13 ERROR: SECMAN:2003:TCP connection to <192.168.60.128:9618> failed
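
The "errno = 113" values in the log above are raw operating system error numbers. On Linux, 113 is EHOSTUNREACH ("No route to host"). The sketch below pulls such numbers out of a log and looks them up; it assumes Python 3 and should be run on the machine that produced the log, because errno numbering differs between platforms. `explain_errnos` is a hypothetical helper name, not a Condor tool.

```python
import errno
import os
import re

# Matches the "errno = N" fragment that appears in the MasterLog lines.
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def explain_errnos(log_text):
    """Return a list of (number, symbolic name, message) tuples,
    one for each log line that contains an 'errno = N' fragment."""
    results = []
    for line in log_text.splitlines():
        match = ERRNO_RE.search(line)
        if match:
            code = int(match.group(1))
            # errorcode maps numbers to names for the local platform.
            name = errno.errorcode.get(code, "UNKNOWN")
            results.append((code, name, os.strerror(code)))
    return results
```

Passing the contents of the MasterLog file to `explain_errnos` gives one tuple per failed connection attempt.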
 
Any advice on this would be really helpful.
 
Thanks
Prashanth