
Re: [Condor-users] Fedora 3 collector problem



Check the file:
/etc/condor/condor_config
for:
COLLECTOR_HOST = $(CONDOR_HOST)
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
and check the file:
/opt/condor-6.6.9/local.phy-condor/condor_config.local
for:
COLLECTOR_NAME = Collector at <hostname of your master here>
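
If it helps, the two checks above can be scripted. The following is only a rough sketch in Python, not part of Condor: it assumes the config file contains plain NAME = VALUE lines (no line continuations or other advanced syntax), and the function names and the host name phy-condor.example.edu are made up for illustration.

```python
import re


def parse_condor_config(text):
    """Parse simple NAME = VALUE lines; later definitions override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep:
            # Names are stored upper-cased so lookups are case-insensitive.
            settings[name.strip().upper()] = value.strip()
    return settings


def expand(value, settings, depth=10):
    """Substitute $(NAME) references, with a depth limit to avoid endless loops."""
    pattern = re.compile(r"\$\(([^)]+)\)")
    for _ in range(depth):
        new = pattern.sub(
            lambda m: settings.get(m.group(1).strip().upper(), ""), value
        )
        if new == value:
            break
        value = new
    return value


def check_collector_settings(settings):
    """Return the resolved collector host and a list of problems found."""
    problems = []
    host = expand(settings.get("COLLECTOR_HOST", ""), settings)
    if not host:
        problems.append("COLLECTOR_HOST is empty or undefined")
    daemons = [d.strip().upper() for d in settings.get("DAEMON_LIST", "").split(",")]
    if "COLLECTOR" not in daemons:
        problems.append("COLLECTOR is missing from DAEMON_LIST")
    return host, problems


# Hypothetical sample; on a real machine, read /etc/condor/condor_config instead.
sample = """\
CONDOR_HOST = phy-condor.example.edu
COLLECTOR_HOST = $(CONDOR_HOST)
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
"""
host, problems = check_collector_settings(parse_condor_config(sample))
print("collector host:", host)
print("problems:", problems)
```

On a live install, running condor_config_val COLLECTOR_HOST should show the value the daemons actually resolve, which is a more reliable check than parsing the file by hand.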


Hope this helps

On Tue, 31 May 2005 09:21:18 -0500, Joshua Juen <jj9867@xxxxxxxxx> wrote:

I have set up Condor as master on a Fedora 3 system. The installation
seems to be working except that the master cannot find the collector.

condor_status works from the client machines, but none of the
machines can submit jobs. The submitting machine's jobs just sit
in the queue.

The message "Error sending update to the collector : Failed to connect
to collector" appears in the master log, the negotiator log, and the start log.

The port that the collector should be on is open, and I can telnet into it
(I assume the clients can too), but the master can't seem to find it.


I think the problem is probably a simple configuration error, but
I cannot seem to track it down.

Any help would be greatly appreciated,
Thanks
Josh


MasterLog

5/31 08:24:19 ******************************************************
5/31 08:24:19 ** condor_master (CONDOR_MASTER) STARTING UP
5/31 08:24:19 ** /opt/condor-6.6.9/sbin/condor_master
5/31 08:24:19 ** $CondorVersion: 6.6.9 Mar 10 2005 $
5/31 08:24:19 ** $CondorPlatform: I386-LINUX_RH9 $
5/31 08:24:19 ** PID = 2354
5/31 08:24:19 ******************************************************
5/31 08:24:19 Using config file: /etc/condor/condor_config
5/31 08:24:19 Using local config files:
/opt/condor-6.6.9/local.phy-condor/condor_config.local
5/31 08:24:19 Attempting to lock
/tmp/condor-lock.phy-condor0.606384916537539/InstanceLock.
5/31 08:24:19 Obtained lock on
/tmp/condor-lock.phy-condor0.606384916537539/InstanceLock.
5/31 08:24:19 DaemonCore: Command Socket at <xxx.xxx.xxx.50:32769>
5/31 08:24:19 SEC_DEFAULT_SESSION_DURATION is undefined, using default
value of 3600
5/31 08:24:19 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:24:19 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:24:19 Will use UDP to update collector
5/31 08:24:19 Started DaemonCore process
"/opt/condor-6.6.9/sbin/condor_collector", pid and pgroup = 2355
5/31 08:24:19 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:24:19 Started DaemonCore process
"/opt/condor-6.6.9/sbin/condor_negotiator", pid and pgroup = 2356
5/31 08:24:19 Started DaemonCore process
"/opt/condor-6.6.9/sbin/condor_startd", pid and pgroup = 2357
5/31 08:24:19 Started DaemonCore process
"/opt/condor-6.6.9/sbin/condor_schedd", pid and pgroup = 2358
5/31 08:24:21 DaemonCore: Command received via UDP from host
<xxx.xxx.xxx.50:32773>
5/31 08:24:21 DaemonCore: received command 60008 (DC_CHILDALIVE),
calling handler (HandleChildAliveCommand)
5/31 08:24:21 DaemonCore: Command received via UDP from host
<xxx.xxx.xxx.50:32773>
5/31 08:24:21 DaemonCore: received command 60008 (DC_CHILDALIVE),
calling handler (HandleChildAliveCommand)
5/31 08:24:22 DaemonCore: Command received via UDP from host
<xxx.xxx.xxx.50:32773>
5/31 08:24:22 DaemonCore: received command 60008 (DC_CHILDALIVE),
calling handler (HandleChildAliveCommand)
5/31 08:24:24 enter Daemons::CheckForNewExecutable
5/31 08:24:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_master: 1110456335
5/31 08:24:24 GetTimeStamp returned: 1110456335
5/31 08:24:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_collector: 1110456335
5/31 08:24:24 GetTimeStamp returned: 1110456335
5/31 08:24:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_negotiator: 1110456334
5/31 08:24:24 GetTimeStamp returned: 1110456334
5/31 08:24:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_startd: 1110456334
5/31 08:24:24 GetTimeStamp returned: 1110456334
5/31 08:24:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_schedd: 1110456334
5/31 08:24:24 GetTimeStamp returned: 1110456334
5/31 08:24:24 exit Daemons::CheckForNewExecutable
5/31 08:24:24 enter Daemons::UpdateCollector
5/31 08:24:24 Attempting to send update via UDP to collector
5/31 08:24:24 Can't send UPDATE_MASTER_AD to collector : Failed to
connect to collector
5/31 08:24:33 DaemonCore: Command received via UDP from host
<xxx.xxx.xxx.50:32773>
5/31 08:24:33 DaemonCore: received command 60008 (DC_CHILDALIVE),
calling handler (HandleChildAliveCommand)
5/31 08:29:24 enter Daemons::UpdateCollector
5/31 08:29:24 Attempting to send update via UDP to collector
5/31 08:29:24 Can't send UPDATE_MASTER_AD to collector : Failed to
connect to collector
5/31 08:29:24 enter Daemons::CheckForNewExecutable
5/31 08:29:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_master: 1110456335
5/31 08:29:24 GetTimeStamp returned: 1110456335
5/31 08:29:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_collector: 1110456335
5/31 08:29:24 GetTimeStamp returned: 1110456335
5/31 08:29:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_negotiator: 1110456334
5/31 08:29:24 GetTimeStamp returned: 1110456334
5/31 08:29:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_startd: 1110456334
5/31 08:29:24 GetTimeStamp returned: 1110456334
5/31 08:29:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_schedd: 1110456334
5/31 08:29:24 GetTimeStamp returned: 1110456334
5/31 08:29:24 exit Daemons::CheckForNewExecutable
5/31 08:34:24 enter Daemons::CheckForNewExecutable
5/31 08:34:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_master: 1110456335
5/31 08:34:24 GetTimeStamp returned: 1110456335
5/31 08:34:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_collector: 1110456335
5/31 08:34:24 GetTimeStamp returned: 1110456335
5/31 08:34:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_negotiator: 1110456334
5/31 08:34:24 GetTimeStamp returned: 1110456334
5/31 08:34:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_startd: 1110456334
5/31 08:34:24 GetTimeStamp returned: 1110456334
5/31 08:34:24 Time stamp of running
/opt/condor-6.6.9/sbin/condor_schedd: 1110456334
5/31 08:34:24 GetTimeStamp returned: 1110456334
5/31 08:34:24 exit Daemons::CheckForNewExecutable
5/31 08:34:24 enter Daemons::UpdateCollector
5/31 08:34:24 Attempting to send update via UDP to collector
5/31 08:34:24 Can't send UPDATE_MASTER_AD to collector : Failed to
connect to collector
5/31 08:35:07 DaemonCore: Command received via TCP from host
<xxx.xxx.xxx.50:32777>
5/31 08:35:07 DaemonCore: received command 453 (RESTART), calling
handler (admin_command_handler)
5/31 08:35:07 Got admin command (453) and allowing it.
5/31 08:35:07 NumberOfChildren() returning 4
5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:35:07 Sent SIGTERM to COLLECTOR (pid 2355)
5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:35:07 Sent SIGTERM to NEGOTIATOR (pid 2356)
5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:35:07 Sent SIGTERM to STARTD (pid 2357)
5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 08:35:07 Sent SIGTERM to SCHEDD (pid 2358)
5/31 08:35:07 DaemonCore: No more children processes to reap.
5/31 08:35:07 The COLLECTOR (pid 2355) exited with status 0
5/31 08:35:07 ProcAPI::buildFamily failed: parent 2355 not found on system.
5/31 08:35:07 ProcAPI: pid 2355 does not exist.
5/31 08:35:07 NumberOfChildren() returning 3
5/31 08:35:07 The NEGOTIATOR (pid 2356) exited with status 0
5/31 08:35:07 ProcAPI::buildFamily failed: parent 2356 not found on system.
5/31 08:35:07 ProcAPI: pid 2356 does not exist.
5/31 08:35:07 NumberOfChildren() returning 2
5/31 08:35:07 DaemonCore: No more children processes to reap.
5/31 08:35:07 The STARTD (pid 2357) exited with status 0
5/31 08:35:07 ProcAPI::buildFamily failed: parent 2357 not found on system.
5/31 08:35:07 ProcAPI: pid 2357 does not exist.
5/31 08:35:07 NumberOfChildren() returning 1
5/31 08:35:07 The SCHEDD (pid 2358) exited with status 0
5/31 08:35:07 ProcAPI: pid 2418 does not exist.
5/31 08:35:07 ProcAPI::buildFamily failed: parent 2358 not found on system.
5/31 08:35:07 ProcAPI: pid 2358 does not exist.
5/31 08:35:07 NumberOfChildren() returning 0
5/31 08:35:07 All daemons are gone. Restarting.
5/31 08:35:07 Restarting master right away.
5/31 08:35:07 Doing exec( "/opt/condor-6.6.9/sbin/condor_master" )
5/31 08:35:07 getExecPath: readlink("/proc/self/exe") failed: errno 13
(Permission denied)


5/31 08:35:07 PASSWD_CACHE_REFRESH is undefined, using default value of 300

StartLog error:

5/31 09:05:37 Attempting to send update via UDP to collector
5/31 09:05:37 Error sending update to the collector : Failed to
connect to collector
5/31 09:05:37 Error sending update to collector(s)

Negotiator Sample:

5/31 09:05:07 ---------- Started Negotiation Cycle ----------
5/31 09:05:07 Phase 1:  Obtaining ads from collector ...
5/31 09:05:07   Getting all public ads ...
5/31 09:05:07 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
5/31 09:05:07 Couldn't fetch ads: can't find collector
5/31 09:05:07 Aborting negotiation cycle
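
To see how often each daemon fails to reach the collector, the log excerpts above can be scanned with a short script. This is only a sketch in Python: the function names are made up, and it assumes every log entry starts with a "M/D HH:MM:SS" timestamp, so any line without one is treated as a wrapped continuation of the previous entry.

```python
import re

# A log entry is assumed to begin with a timestamp such as "5/31 08:24:24 ".
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")

# Substrings taken from the MasterLog, StartLog and negotiator excerpts.
FAILURE_MARKERS = ("Failed to connect to collector", "can't find collector")


def unwrap(lines):
    """Join wrapped continuation lines back onto the entry they belong to."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def collector_failures(log_text):
    """Return the log entries that report a collector connection failure."""
    return [
        entry
        for entry in unwrap(log_text.splitlines())
        if any(marker in entry for marker in FAILURE_MARKERS)
    ]


# Sample lines copied from the excerpts above, including a wrapped entry.
sample_log = """\
5/31 08:24:24 Can't send UPDATE_MASTER_AD to collector : Failed to
connect to collector
5/31 08:24:33 DaemonCore: Command received via UDP from host
<xxx.xxx.xxx.50:32773>
5/31 09:05:07 Couldn't fetch ads: can't find collector
"""
failures = collector_failures(sample_log)
for entry in failures:
    print(entry)
```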
