
Re: [Condor-users] Fedora 3 collector problem



Those files checked out OK. I am still not sure what is happening.
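
For anyone repeating the check suggested in the quoted message, here is a rough Python sketch of it. It is not part of Condor. The function names (parse_config, expand, check) are made up for illustration, and it only handles plain "NAME = value" lines with $(NAME) references, not backslash continuations or the other syntax a real Condor configuration file may contain.

```python
import re

# Matches macro references of the form $(NAME).
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

def parse_config(text):
    """Parse 'NAME = value' lines; later definitions override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Names are upper-cased so lookups here are case-insensitive.
        settings[name.strip().upper()] = value.strip()
    return settings

def expand(value, settings, depth=10):
    """Substitute $(NAME) references, with a depth limit to avoid loops."""
    for _ in range(depth):
        expanded = MACRO.sub(
            lambda m: settings.get(m.group(1).upper(), ""), value
        )
        if expanded == value:
            break
        value = expanded
    return value

def check(settings):
    """Return a list of problems with the settings named in the thread."""
    problems = []
    host = expand(settings.get("COLLECTOR_HOST", ""), settings)
    if not host:
        problems.append("COLLECTOR_HOST is empty or undefined")
    daemons = [
        d.strip().upper()
        for d in settings.get("DAEMON_LIST", "").split(",")
    ]
    for required in ("MASTER", "COLLECTOR", "NEGOTIATOR"):
        if required not in daemons:
            problems.append(f"{required} missing from DAEMON_LIST")
    return problems
```

Feeding it the contents of /etc/condor/condor_config followed by the local config file (so that local settings override global ones) and printing check(parse_config(text)) should give an empty list if both settings look sane. An empty list only means the two values are present. It does not show that the host name they expand to is the right one.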

Also, if I try to run condor_status from the Fedora machine (my master),
it says that it cannot connect to the collector, but condor_status does
work from my clients.
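
Since the same command behaves differently on the master and on the clients, one way to compare the two is to run a small probe on each machine and see what the collector's host name resolves to and whether a TCP connection succeeds. This is only a sketch. The function name probe_collector is made up, the host name has to be filled in, and the port is an assumption: 9618 is Condor's default collector port, but the thread does not say which port is in use here.

```python
import socket

def probe_collector(host, port, timeout=3.0):
    """Resolve host and try a TCP connection.

    Returns (addresses, reachable): the sorted IPv4 addresses the name
    resolves to on this machine, and whether a TCP connect succeeded.
    """
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry's sockaddr is (ip, port); collect the distinct IPs.
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        reachable = False
    return addresses, reachable
```

For example, print(probe_collector("your-master-hostname", 9618)) on the master and on a client. If the two machines report different addresses for the same name (a loopback address on the master, say), that would be worth looking into, though it is only one possible explanation. Note that this tests TCP only, just as telnet does. The MasterLog says the daemons use UDP to update the collector, so a successful TCP connect does not show that the UDP updates get through.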

Thanks
Josh

On 5/31/05, Jose D. Zamora <jzamora@xxxxxxxxxxxx> wrote:
> Check file :
> /etc/condor/condor_config
> for :
> COLLECTOR_HOST  = $(CONDOR_HOST)
> DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
> and
> Check file:
> /opt/condor-6.6.9/local.phy-condor/condor_config.local
> for :
> COLLECTOR_NAME = Collector at <hostname of your master here>
> 
> Hope this helps
> 
> On Tue, 31 May 2005 09:21:18 -0500, Joshua Juen <jj9867@xxxxxxxxx> wrote:
> 
> > I have set up condor as master on a Fedora 3 system. The installation
> > seems to be working except that the master cannot find the collector.
> >
> > The condor_status works from the client machines but none of the
> > machines can submit jobs. The submitting machine's jobs will just sit
> > in the queue.
> >
> > Error sending update to the collector : Failed to connect to collector
> > appears in the master log, the negotiator log and the start log.
> >
> > The port that the collector should be on is open and I can telnet into
> > it (I am assuming that the clients can also), but the master can't seem
> > to find it.
> >
> > I think that the problem is probably a simple configuration error, but
> > I cannot seem to track it down.
> >
> > Any help would be greatly appreciated,
> > Thanks
> > Josh
> >
> >
> > MasterLog
> >
> > 5/31 08:24:19 ******************************************************
> > 5/31 08:24:19 ** condor_master (CONDOR_MASTER) STARTING UP
> > 5/31 08:24:19 ** /opt/condor-6.6.9/sbin/condor_master
> > 5/31 08:24:19 ** $CondorVersion: 6.6.9 Mar 10 2005 $
> > 5/31 08:24:19 ** $CondorPlatform: I386-LINUX_RH9 $
> > 5/31 08:24:19 ** PID = 2354
> > 5/31 08:24:19 ******************************************************
> > 5/31 08:24:19 Using config file: /etc/condor/condor_config
> > 5/31 08:24:19 Using local config files:
> > /opt/condor-6.6.9/local.phy-condor/condor_config.local
> > 5/31 08:24:19 Attempting to lock
> > /tmp/condor-lock.phy-condor0.606384916537539/InstanceLock.
> > 5/31 08:24:19 Obtained lock on
> > /tmp/condor-lock.phy-condor0.606384916537539/InstanceLock.
> > 5/31 08:24:19 DaemonCore: Command Socket at <xxx.xxx.xxx.50:32769>
> > 5/31 08:24:19 SEC_DEFAULT_SESSION_DURATION is undefined, using default
> > value of 3600
> > 5/31 08:24:19 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:24:19 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:24:19 Will use UDP to update collector
> > 5/31 08:24:19 Started DaemonCore process
> > "/opt/condor-6.6.9/sbin/condor_collector", pid and pgroup = 2355
> > 5/31 08:24:19 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:24:19 Started DaemonCore process
> > "/opt/condor-6.6.9/sbin/condor_negotiator", pid and pgroup = 2356
> > 5/31 08:24:19 Started DaemonCore process
> > "/opt/condor-6.6.9/sbin/condor_startd", pid and pgroup = 2357
> > 5/31 08:24:19 Started DaemonCore process
> > "/opt/condor-6.6.9/sbin/condor_schedd", pid and pgroup = 2358
> > 5/31 08:24:21 DaemonCore: Command received via UDP from host
> > <xxx.xxx.xxx.50:32773>
> > 5/31 08:24:21 DaemonCore: received command 60008 (DC_CHILDALIVE),
> > calling handler (HandleChildAliveCommand)
> > 5/31 08:24:21 DaemonCore: Command received via UDP from host
> > <xxx.xxx.xxx.50:32773>
> > 5/31 08:24:21 DaemonCore: received command 60008 (DC_CHILDALIVE),
> > calling handler (HandleChildAliveCommand)
> > 5/31 08:24:22 DaemonCore: Command received via UDP from host
> > <xxx.xxx.xxx.50:32773>
> > 5/31 08:24:22 DaemonCore: received command 60008 (DC_CHILDALIVE),
> > calling handler (HandleChildAliveCommand)
> > 5/31 08:24:24 enter Daemons::CheckForNewExecutable
> > 5/31 08:24:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_master: 1110456335
> > 5/31 08:24:24 GetTimeStamp returned: 1110456335
> > 5/31 08:24:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_collector: 1110456335
> > 5/31 08:24:24 GetTimeStamp returned: 1110456335
> > 5/31 08:24:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_negotiator: 1110456334
> > 5/31 08:24:24 GetTimeStamp returned: 1110456334
> > 5/31 08:24:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_startd: 1110456334
> > 5/31 08:24:24 GetTimeStamp returned: 1110456334
> > 5/31 08:24:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_schedd: 1110456334
> > 5/31 08:24:24 GetTimeStamp returned: 1110456334
> > 5/31 08:24:24 exit Daemons::CheckForNewExecutable
> > 5/31 08:24:24 enter Daemons::UpdateCollector
> > 5/31 08:24:24 Attempting to send update via UDP to collector
> > 5/31 08:24:24 Can't send UPDATE_MASTER_AD to collector : Failed to
> > connect to collector
> > 5/31 08:24:33 DaemonCore: Command received via UDP from host
> > <xxx.xxx.xxx.50:32773>
> > 5/31 08:24:33 DaemonCore: received command 60008 (DC_CHILDALIVE),
> > calling handler (HandleChildAliveCommand)
> > 5/31 08:29:24 enter Daemons::UpdateCollector
> > 5/31 08:29:24 Attempting to send update via UDP to collector
> > 5/31 08:29:24 Can't send UPDATE_MASTER_AD to collector : Failed to
> > connect to collector
> > 5/31 08:29:24 enter Daemons::CheckForNewExecutable
> > 5/31 08:29:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_master: 1110456335
> > 5/31 08:29:24 GetTimeStamp returned: 1110456335
> > 5/31 08:29:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_collector: 1110456335
> > 5/31 08:29:24 GetTimeStamp returned: 1110456335
> > 5/31 08:29:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_negotiator: 1110456334
> > 5/31 08:29:24 GetTimeStamp returned: 1110456334
> > 5/31 08:29:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_startd: 1110456334
> > 5/31 08:29:24 GetTimeStamp returned: 1110456334
> > 5/31 08:29:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_schedd: 1110456334
> > 5/31 08:29:24 GetTimeStamp returned: 1110456334
> > 5/31 08:29:24 exit Daemons::CheckForNewExecutable
> > 5/31 08:34:24 enter Daemons::CheckForNewExecutable
> > 5/31 08:34:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_master: 1110456335
> > 5/31 08:34:24 GetTimeStamp returned: 1110456335
> > 5/31 08:34:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_collector: 1110456335
> > 5/31 08:34:24 GetTimeStamp returned: 1110456335
> > 5/31 08:34:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_negotiator: 1110456334
> > 5/31 08:34:24 GetTimeStamp returned: 1110456334
> > 5/31 08:34:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_startd: 1110456334
> > 5/31 08:34:24 GetTimeStamp returned: 1110456334
> > 5/31 08:34:24 Time stamp of running
> > /opt/condor-6.6.9/sbin/condor_schedd: 1110456334
> > 5/31 08:34:24 GetTimeStamp returned: 1110456334
> > 5/31 08:34:24 exit Daemons::CheckForNewExecutable
> > 5/31 08:34:24 enter Daemons::UpdateCollector
> > 5/31 08:34:24 Attempting to send update via UDP to collector
> > 5/31 08:34:24 Can't send UPDATE_MASTER_AD to collector : Failed to
> > connect to collector
> > 5/31 08:35:07 DaemonCore: Command received via TCP from host
> > <xxx.xxx.xxx.50:32777>
> > 5/31 08:35:07 DaemonCore: received command 453 (RESTART), calling
> > handler (admin_command_handler)
> > 5/31 08:35:07 Got admin command (453) and allowing it.
> > 5/31 08:35:07 NumberOfChildren() returning 4
> > 5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:35:07 Sent SIGTERM to COLLECTOR (pid 2355)
> > 5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:35:07 Sent SIGTERM to NEGOTIATOR (pid 2356)
> > 5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:35:07 Sent SIGTERM to STARTD (pid 2357)
> > 5/31 08:35:07 MASTER_TIMEOUT_MULTIPLIER is undefined, using default
> > value of 0
> > 5/31 08:35:07 Sent SIGTERM to SCHEDD (pid 2358)
> > 5/31 08:35:07 DaemonCore: No more children processes to reap.
> > 5/31 08:35:07 The COLLECTOR (pid 2355) exited with status 0
> > 5/31 08:35:07 ProcAPI::buildFamily failed: parent 2355 not found on
> > system.
> > 5/31 08:35:07 ProcAPI: pid 2355 does not exist.
> > 5/31 08:35:07 NumberOfChildren() returning 3
> > 5/31 08:35:07 The NEGOTIATOR (pid 2356) exited with status 0
> > 5/31 08:35:07 ProcAPI::buildFamily failed: parent 2356 not found on
> > system.
> > 5/31 08:35:07 ProcAPI: pid 2356 does not exist.
> > 5/31 08:35:07 NumberOfChildren() returning 2
> > 5/31 08:35:07 DaemonCore: No more children processes to reap.
> > 5/31 08:35:07 The STARTD (pid 2357) exited with status 0
> > 5/31 08:35:07 ProcAPI::buildFamily failed: parent 2357 not found on
> > system.
> > 5/31 08:35:07 ProcAPI: pid 2357 does not exist.
> > 5/31 08:35:07 NumberOfChildren() returning 1
> > 5/31 08:35:07 The SCHEDD (pid 2358) exited with status 0
> > 5/31 08:35:07 ProcAPI: pid 2418 does not exist.
> > 5/31 08:35:07 ProcAPI::buildFamily failed: parent 2358 not found on
> > system.
> > 5/31 08:35:07 ProcAPI: pid 2358 does not exist.
> > 5/31 08:35:07 NumberOfChildren() returning 0
> > 5/31 08:35:07 All daemons are gone.  Restarting.
> > 5/31 08:35:07 Restarting master right away.
> > 5/31 08:35:07 Doing exec( "/opt/condor-6.6.9/sbin/condor_master" )
> > 5/31 08:35:07 getExecPath: readlink("/proc/self/exe") failed: errno 13
> > (Permission denied)
> >
> > 5/31 08:35:07 PASSWD_CACHE_REFRESH is undefined, using default value of
> > 300
> >
> > StartLog error:
> >
> > 5/31 09:05:37 Attempting to send update via UDP to collector
> > 5/31 09:05:37 Error sending update to the collector : Failed to
> > connect to collector
> > 5/31 09:05:37 Error sending update to collector(s)
> >
> > Negotiator Sample:
> >
> > 5/31 09:05:07 ---------- Started Negotiation Cycle ----------
> > 5/31 09:05:07 Phase 1:  Obtaining ads from collector ...
> > 5/31 09:05:07   Getting all public ads ...
> > 5/31 09:05:07 NEGOTIATOR_TIMEOUT_MULTIPLIER is undefined, using
> > default value of 0
> > 5/31 09:05:07 Couldn't fetch ads: can't find collector
> > 5/31 09:05:07 Aborting negotiation cycle