
[Condor-users] CEDAR:6001: Failed to connect to ...



Hello,

I am running a Condor pool of 3 servers, all on Windows Server 2003 and
Condor 6.8.5.

Every once in a while the Condor master seems to "fail". By this I mean
that the submitting nodes and the processing nodes no longer see the
available resources. If I then look at the Condor master (which is, by
the way, on a separate server) and run condor_status, I get a message
saying that the condor_collector is not running:

CEDAR:6001: Failed to connect to <xxx.x.x.xx:9618>, Error: Couldn't
contact the condor_collector on xxx.xxxx.xx.xx.
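
The message above only says that the connection to port 9618 failed. To check independently of condor_status whether anything is listening on that port, a plain TCP probe is enough. The following is a minimal sketch in Python, assuming the collector listens on port 9618 as in the message; the function name and the default host are placeholders for illustration, not part of Condor.

```python
import socket
import sys


def collector_port_open(host, port=9618, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (WinSock error 10061), timeouts,
        # and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical usage: pass the central manager's host name or IP address.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    if collector_port_open(host):
        print("something is listening on port 9618")
    else:
        print("nothing is accepting connections on port 9618")
```

A refused connection here would only confirm that no process is accepting connections on that port; it does not say why the collector is absent.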

I know I can fix this by rebooting the machine, but since it is also the
floating license server for 5 different software packages, I can't
reboot it without stopping all processing.

Can you tell me if there is another way of resetting the Condor master
machine without a reboot?
I have already tried condor_restart, and also stopping and restarting
the condor_master service (via Services in the Microsoft Management
Console). condor_off doesn't work either, because the condor_collector
is not running at this point (I think).

The MasterLog file contains the following messages:

3/30 03:04:51 C:\condor/bin/condor_master.exe was modified, restarting
C:\condor/bin/condor_master.exe.
3/30 03:04:51 Sent signal 15 to COLLECTOR (pid 1228)
3/30 03:04:51 Sent signal 15 to NEGOTIATOR (pid 8068)
3/30 03:04:51 Sent signal 15 to SCHEDD (pid 7324)
3/30 03:04:52 DaemonCore: Command received via UDP from host
<10.40.10.150:3638>
3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
3/30 03:04:52 The NEGOTIATOR (pid 8068) exited with status 0
3/30 03:04:52 DaemonCore: Command received via UDP from host
<10.40.10.150:3639>
3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
3/30 03:04:52 The COLLECTOR (pid 1228) exited with status 0
3/30 03:04:52 DaemonCore: Command received via UDP from host
<10.40.10.150:3640>
3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
3/30 03:04:52 The SCHEDD (pid 7324) exited with status 0
3/30 03:04:52 All daemons are gone.  Restarting.
3/30 03:04:52 Restarting master in 120 seconds.
3/30 03:06:52 Running as NT Service = 1
3/30 03:06:52 Doing exec( "C:\WINDOWS\system32\cmd.exe /Q /C net stop
Condor & net start Condor" )
3/30 03:06:53 ******************************************************
3/30 03:06:53 ** Condor (CONDOR_MASTER) STARTING UP
3/30 03:06:53 ** C:\condor\bin\condor_master.exe
3/30 03:06:53 ** $CondorVersion: 6.8.5 May 17 2007 $
3/30 03:06:53 ** $CondorPlatform: INTEL-WINNT50 $
3/30 03:06:53 ** PID = 7700
3/30 03:06:53 ** Log last touched 3/30 03:06:52
3/30 03:06:53 ******************************************************
3/30 03:06:53 Using config source: C:\condor\condor_config
3/30 03:06:53 Using local config sources: 
3/30 03:06:53    C:\condor/condor_config.local
3/30 03:06:53 DaemonCore: Command Socket at < xxx.x.x.xx:xxxx >
3/30 03:16:58 WinFirewall: get_CurrentProfile failed: 0x800706d9
3/30 03:16:58 Started DaemonCore process
"C:\condor/bin/condor_schedd.exe", pid and pgroup = 7912
3/30 03:17:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect
errno = 10061 connection refused.
3/30 03:17:04 ERROR: SECMAN:2003:TCP connection to < xxx.x.x.xx:xxxx >
failed

3/30 03:17:04 Failed to start non-blocking update to < xxx.x.x.xx:xxxx
>.
3/30 03:22:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect
errno = 10061 connection refused.
3/30 03:22:04 ERROR: SECMAN:2003:TCP connection to < xxx.x.x.xx:xxxx >
failed
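
One detail that is easy to miss in the log above is the pause of roughly ten minutes between the "Command Socket" line at 03:06:53 and the WinFirewall failure at 03:16:58. For longer logs, such pauses can be found mechanically. The following is a minimal sketch in Python, assuming the "M/D HH:MM:SS" timestamp format shown above; the function name, the five-minute threshold, and the year are arbitrary choices for illustration (the log lines carry no year).

```python
import re
from datetime import datetime, timedelta

# Matches the "M/D HH:MM:SS " prefix used by the log lines above.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def find_gaps(lines, threshold=timedelta(minutes=5), year=2000):
    """Yield (previous_time, current_time, line) for each pause longer
    than `threshold` between consecutive timestamped lines."""
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # blank line or continuation of a wrapped entry
        month, day, hour, minute, second = map(int, match.groups())
        # `year` is a placeholder, needed only to build a datetime.
        current = datetime(year, month, day, hour, minute, second)
        if previous is not None and current - previous > threshold:
            yield previous, current, line.rstrip()
        previous = current


# Four lines copied from the log excerpt above.
sample = [
    "3/30 03:06:53 DaemonCore: Command Socket at < xxx.x.x.xx:xxxx >",
    "3/30 03:16:58 WinFirewall: get_CurrentProfile failed: 0x800706d9",
    "3/30 03:16:58 Started DaemonCore process",
    "3/30 03:17:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect",
]

for before, after, text in find_gaps(sample):
    print(after - before, "elapsed before:", text)
```

To scan a whole log, the same function can be given an open file object instead of the sample list.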

Could anyone tell me what might be causing this problem and how to fix
it? Thank you very much in advance!

Thomas