I am trying to set up Condor with the central manager running on a Linux box
and the rest of the pool on Windows machines. The central manager seems to be
up and running correctly, but when I install 6.8.2 for Windows on the
submit/execute machines (e.g. 147.66.10.206), the condor_schedd process
repeatedly dies, dumps core, and is then restarted by the master.
condor_status shows these machines:

Name          OpSys    Arch   State      Activity  LoadAv  Mem  ActvtyTime

vm1@ERM-43880 LINUX    INTEL  Owner      Idle      0.000   247  0+00:29:05
vm2@ERM-43880 LINUX    INTEL  Owner      Idle      0.000   247  0+00:29:06
SCI-46433.AAD WINNT51  INTEL  Unclaimed  Idle      0.080   767  0+00:49:57

               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill

  INTEL/LINUX      2      2        0          0        0           0         0
INTEL/WINNT51      1      0        0          1        0           0         0

        Total      3      2        0          1        0           0         0

I have FULL_DEBUG turned on. The logs and the core dump follow.

MasterLog
---------------
2/2 15:44:54 ******************************************************
2/2 15:44:54 ** Condor (CONDOR_MASTER) STARTING UP
2/2 15:44:54 ** C:\condor\bin\condor_master.exe
2/2 15:44:54 ** $CondorVersion: 6.8.2 Oct 12 2006 $
2/2 15:44:54 ** $CondorPlatform: INTEL-WINNT50 $
2/2 15:44:54 ** PID = 440
2/2 15:44:54 ** Log last touched 2/2 15:43:45
2/2 15:44:54 ******************************************************
2/2 15:44:54 Using config source: C:\condor\condor_config
2/2 15:44:54 Using local config sources:
2/2 15:44:54    C:\condor/condor_config.local
2/2 15:44:54 DaemonCore: Command Socket at <147.66.10.206:1034>
2/2 15:45:14 Collector port not defined, will use default: 9618
2/2 15:45:15 Started DaemonCore process "C:\condor/bin/condor_collector.exe", pid and pgroup = 948
2/2 15:45:15 Started DaemonCore process "C:\condor/bin/condor_negotiator.exe", pid and pgroup = 972
2/2 15:45:15 Started DaemonCore process "C:\condor/bin/condor_schedd.exe", pid and pgroup = 1004
2/2 15:45:16 Started DaemonCore process "C:\condor/bin/condor_startd.exe", pid and pgroup = 1036
2/2 15:45:20 DaemonCore: Command received via UDP from host <147.66.10.206:1083>
2/2 15:45:20 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
2/2 15:45:20 The SCHEDD (pid 1004) died due to exception ACCESS_VIOLATION
2/2 15:45:20 Sending obituary for "C:\condor/bin/condor_schedd.exe"
2/2 15:45:23 restarting C:\condor/bin/condor_schedd.exe in 10 seconds
2/2 15:45:33 Started DaemonCore process "C:\condor/bin/condor_schedd.exe", pid and pgroup = 2108
2/2 15:45:34 DaemonCore: Command received via UDP from host <147.66.10.206:1099>
2/2 15:45:34 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
2/2 15:45:34 The SCHEDD (pid 2108) died due to exception ACCESS_VIOLATION
2/2 15:45:34 Sending obituary for "C:\condor/bin/condor_schedd.exe"
2/2 15:45:37 restarting C:\condor/bin/condor_schedd.exe in 11 seconds
2/2 15:45:48 Started DaemonCore process "C:\condor/bin/condor_schedd.exe", pid and pgroup = 2240
2/2 15:45:50 DaemonCore: Command received via UDP from host <147.66.10.206:1109>
2/2 15:45:50 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
2/2 15:45:50 The SCHEDD (pid 2240) died due to exception ACCESS_VIOLATION
....etc....etc
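
A rough way to see how often a daemon is dying is to scan the MasterLog for the "died due to exception" entries. The sketch below is a hypothetical helper, not a Condor tool: the function names `find_daemon_deaths` and `summarize` are made up for illustration, and the sample text is copied from the log above. The pattern allows any whitespace between tokens, so it should also match entries that a mail client has wrapped across lines.

```python
import re
from collections import Counter

# Matches master log entries such as:
#   2/2 15:45:20 The SCHEDD (pid 1004) died due to exception ACCESS_VIOLATION
# \s+ between tokens lets the pattern span line breaks in wrapped logs.
DEATH_RE = re.compile(
    r"(\d+/\d+)\s+(\d+:\d+:\d+)\s+The\s+(\w+)\s+\(pid\s+(\d+)\)\s+"
    r"died\s+due\s+to\s+exception\s+(\w+)"
)


def find_daemon_deaths(log_text):
    """Return (date, time, daemon, pid, exception) tuples found in log_text."""
    return [
        (date, time, daemon, int(pid), exc)
        for date, time, daemon, pid, exc in DEATH_RE.findall(log_text)
    ]


def summarize(deaths):
    """Count deaths per (daemon, exception) pair."""
    return Counter((daemon, exc) for _, _, daemon, _, exc in deaths)


# Sample taken from the MasterLog above; the first entry is deliberately
# left wrapped to show that line breaks inside an entry are tolerated.
sample = """\
2/2 15:45:20 The SCHEDD (pid 1004) died due to exception
ACCESS_VIOLATION
2/2 15:45:34 The SCHEDD (pid 2108) died due to exception ACCESS_VIOLATION
"""

deaths = find_daemon_deaths(sample)
for (daemon, exc), count in summarize(deaths).items():
    print(f"{daemon}: {count} death(s) with {exc}")
```

To run it against a real log, read the file contents into a string and pass that to `find_daemon_deaths` in place of `sample`.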

ScheddLog
----------------
2/2 16:04:25 (pid:3812) ******************************************************
2/2 16:04:25 (pid:3812) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
2/2 16:04:25 (pid:3812) ** C:\condor\bin\condor_schedd.exe
2/2 16:04:25 (pid:3812) ** $CondorVersion: 6.8.2 Oct 12 2006 $
2/2 16:04:25 (pid:3812) ** $CondorPlatform: INTEL-WINNT50 $
2/2 16:04:25 (pid:3812) ** PID = 3812
2/2 16:04:25 (pid:3812) ** Log last touched 2/2 15:55:44
2/2 16:04:25 (pid:3812) ******************************************************
2/2 16:04:25 (pid:3812) Using config source: C:\condor\condor_config
2/2 16:04:25 (pid:3812) Using local config sources:
2/2 16:04:25 (pid:3812)    C:\condor/condor_config.local
2/2 16:04:25 (pid:3812) DaemonCore: Command Socket at <147.66.10.206:1328>
2/2 16:04:25 (pid:3812) History file rotation is enabled.
2/2 16:04:25 (pid:3812) Maximum history file size is: 20971520 bytes
2/2 16:04:25 (pid:3812) Number of rotated history files is: 2
2/2 16:04:25 (pid:3812) my_popen: CreateProcess failed
2/2 16:04:25 (pid:3812) Failed to execute C:\condor/bin/condor_shadow.pvm.exe, ignoring
2/2 16:04:25 (pid:3812) my_popen: CreateProcess failed
2/2 16:04:25 (pid:3812) Failed to execute C:\condor/bin/condor_shadow.std.exe, ignoring
2/2 16:21:39 (pid:1004) ******************************************************
2/2 16:21:39 (pid:1004) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
2/2 16:21:39 (pid:1004) ** C:\condor\bin\condor_schedd.exe
2/2 16:21:39 (pid:1004) ** $CondorVersion: 6.8.2 Oct 12 2006 $
2/2 16:21:39 (pid:1004) ** $CondorPlatform: INTEL-WINNT50 $
2/2 16:21:39 (pid:1004) ** PID = 1004
2/2 16:21:39 (pid:1004) ** Log last touched 2/2 16:04:25
2/2 16:21:39 (pid:1004) ******************************************************
2/2 16:21:39 (pid:1004) Using config source: C:\condor\condor_config
2/2 16:21:39 (pid:1004) Using local config sources:
2/2 16:21:39 (pid:1004)    C:\condor/condor_config.local
2/2 16:21:39 (pid:1004) DaemonCore: Command Socket at <147.66.10.206:1405>
2/2 16:21:39 (pid:1004) History file rotation is enabled.
2/2 16:21:39 (pid:1004) Maximum history file size is: 20971520 bytes
2/2 16:21:39 (pid:1004) Number of rotated history files is: 2
2/2 16:21:39 (pid:1004) my_popen: CreateProcess failed
2/2 16:21:39 (pid:1004) Failed to execute C:\condor/bin/condor_shadow.pvm.exe, ignoring
2/2 16:21:39 (pid:1004) my_popen: CreateProcess failed
2/2 16:21:39 (pid:1004) Failed to execute C:\condor/bin/condor_shadow.std.exe, ignoring

Core Dump
----------------
//=====================================================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  004B00AD 01:000AF0AD C:\condor\bin\condor_schedd.exe

Registers:
EAX:33037A69
EBX:00D54A38
ECX:00D54A38
EDX:00BA0F10
ESI:0012F91C
EDI:00000400
CS:EIP:001B:004B00AD
SS:ESP:0023:0012F8C4  EBP:0012F900
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010206

Call stack:
Address   Frame
004B00AD  0012F8C8  ProcAPI::grabOffsets+1B
004AF400  0012F900  ProcAPI::getProcInfoRaw+FD
004AF21A  0012F984  ProcAPI::getProcInfo+1F
0048697E  0012F9A4  SelfMonitorData::CollectData+40
00495E43  0012F9E4  TimerManager::Timeout+154
0047DB9C  0012FE30  DaemonCore::Driver+212
00486127  0012FF68  dc_main+AF9
00486236  0012FF80  main+CE

Does anyone have any ideas? I cannot find a solution, and this is starting to
wear me down.