
[Condor-users] Windows condor_quill access violation exception errors



Hi All
 
We have recently added the Quill database system/setup to our pools of Condor machines.
 
Quick summary:
 
5 pools, each with ~600 Windows PCs (mainly XP) running Condor version 7.2.4
 
5 central managers, 1 CondorView server and 1 Quill database server, each a VM on ESX servers
and running x86_64 SLES 10 with Condor version 7.2.3
 
The setup and install went mostly OK, and we have been running for only a week with ~16 Windows
submit nodes, all running condor_quill. So far 3 of these machines have started to get
access violation errors from condor_quill. We get emails with the header:
 
[Condor] Problem PI-SCHAP2-SL.nexus.csiro.au: condor_quill.exe died (-1073741819)
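 
For reference, the number in that header is the signed 32-bit form of the Windows exception code C0000005 (ACCESS_VIOLATION) that also appears in the core file further down. A minimal sketch of the conversion, in Python purely for illustration (the helper name to_ntstatus is made up here and is not part of Condor):

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status value."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # 0xC0000005
```

The same helper makes it easy to look up any other negative exit status the master reports.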
 
See below for excerpts from the MasterLog, QuillLog and the core file core.WIN32.QUILL.
Deleting and recopying just the condor_quill.exe file made no difference. In each case an uninstall
and reinstall seemed to fix things. Has anyone else come across this? I'm worried that it
seems to have started happening at random, for no apparent reason.
 
On top of this, on some machines Condor stops altogether, and we get emails such as:
 
[Condor] Problem ELEMENT-KB.arrc.csiro.au: condor_quill.exe exited (44)
 
The MasterLog shows the quill and schedd daemons exiting, condor_mail.exe failing to launch,
and condor_schedd.exe being reported as not a valid Windows executable!
 
MasterLog for exit code 44, with Condor stopped and no daemons running at all:
 
10/29 04:56:08 DaemonCore: pid 3144 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
10/29 04:56:08 The QUILL (pid 3144) exited with status 44
10/29 04:56:08 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 3600 seconds
10/29 04:56:08 DaemonCore: return from reaper for pid 3144
10/29 04:56:14 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.144.59:9738>, access level DAEMON
10/29 04:56:14 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 04:56:14 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.016s)
10/29 05:15:35 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.144.59:9743>, access level DAEMON
10/29 05:15:35 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 05:15:35 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
10/29 05:15:44 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.144.59:9340>, access level DAEMON
10/29 05:15:44 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 05:15:44 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
10/29 05:17:09 Received UDP command 60011 (DC_NOP) from  <130.116.144.59:9221>, access level READ
10/29 05:17:09 Calling HandleReq <handle_nop()> (0)
10/29 05:17:09 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.016s)
10/29 05:17:09 DaemonCore: pid 3632 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
10/29 05:17:09 The SCHEDD (pid 3632) exited with status 44
10/29 05:17:09 cannot send softkill since WINDOWS_SOFTKILL is undefined
10/29 05:17:09 Sending obituary for "C:\PROGRA~1\condor/bin/condor_schedd.exe"
10/29 05:17:09 my_popen: CreateProcess failed
10/29 05:17:09 Failed to access email program "C:\PROGRA~1\condor/bin/condor_mail.exe"
10/29 05:17:09 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 10 seconds
10/29 05:17:09 DaemonCore: return from reaper for pid 3632
10/29 05:17:19 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:19 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:19 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 11 seconds
10/29 05:17:30 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:30 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:30 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 13 seconds
10/29 05:17:43 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:43 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:43 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 17 seconds
10/29 05:18:00 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:18:00 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:18:00 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 25 seconds
10/29 05:18:25 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:18:25 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:18:25 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 41 seconds
10/29 05:19:06 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:19:06 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:19:06 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 73 seconds
10/29 05:20:19 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:20:19 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:20:19 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 137 seconds
10/29 09:53:30 UnsetEnv(NET_REMAP_ENABLE): SetEnvironmentVariable failed, errno=203
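 
As an aside, the schedd restart delays in the log above (10, 11, 13, 17, 25, 41, 73, 137 seconds) fit a simple exponential backoff of the form constant + factor^n. The sketch below reproduces them, assuming the master is using what I understand to be the default settings MASTER_BACKOFF_CONSTANT = 9, MASTER_BACKOFF_FACTOR = 2.0 and MASTER_BACKOFF_CEILING = 3600. Those values are my assumption, not something read from our config, and the function name is made up for illustration:

```python
def restart_delays(attempts, constant=9, factor=2.0, ceiling=3600):
    """Return the delay in seconds before each of the first `attempts` restarts.

    Assumed formula: delay = constant + factor ** n, capped at `ceiling`,
    where n counts the restarts attempted so far (starting at 0).
    """
    return [int(min(constant + factor ** n, ceiling)) for n in range(attempts)]


print(restart_delays(8))  # [10, 11, 13, 17, 25, 41, 73, 137]
```

If that reading is right, the growing delays are just the master backing off and say nothing by themselves about why the executable was rejected.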
 
Thanks for any info/help.
 
Cheers
 
Greg
 
 
Logs for the access violation problem (exit code -1073741819)
 
MasterLog
 
10/27 09:17:00 DaemonCore: pid 3588 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>
10/27 09:17:00 The QUILL (pid 3588) died due to exception ACCESS_VIOLATION
10/27 09:17:00 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
10/27 09:17:03 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 10 seconds
10/27 09:17:03 DaemonCore: return from reaper for pid 3588
10/27 09:17:13 Started DaemonCore process "C:\PROGRA~1\condor/bin/condor_quill.exe", pid and pgroup = 4052
10/27 09:17:13 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.67.243:9263>, access level DAEMON
10/27 09:17:13 Calling HandleReq <HandleChildAliveCommand> (0)
10/27 09:17:13 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.015s)
10/27 09:19:03 Received UDP command 60011 (DC_NOP) from  <130.116.67.243:9836>, access level READ
10/27 09:19:03 Calling HandleReq <handle_nop()> (0)
10/27 09:19:03 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.000s)
10/27 09:19:03 DaemonCore: pid 4052 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>
10/27 09:19:03 The QUILL (pid 4052) died due to exception ACCESS_VIOLATION
10/27 09:19:03 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
10/27 09:19:05 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 11 seconds
10/27 09:19:05 DaemonCore: return from reaper for pid 4052
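 
To see how often each daemon exit status turns up across the submit nodes, the reaper lines in a MasterLog like the one above can be tallied with a short script. This is only a sketch: the regular expression is written against the line format shown in these excerpts, and the function name exit_statuses is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 10/27 09:17:00 DaemonCore: pid 3588 exited with status -1073741819, invoking reaper 1 ...
REAPER_LINE = re.compile(
    r"^(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) DaemonCore: pid (?P<pid>\d+) "
    r"exited with status (?P<status>-?\d+)"
)


def exit_statuses(lines):
    """Yield (timestamp, pid, status) for each daemon exit the master reports."""
    for line in lines:
        match = REAPER_LINE.match(line)
        if match:
            yield match["stamp"], int(match["pid"]), int(match["status"])


sample = [
    "10/27 09:17:00 DaemonCore: pid 3588 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>",
    "10/27 09:17:00 The QUILL (pid 3588) died due to exception ACCESS_VIOLATION",
    "10/27 09:19:03 DaemonCore: pid 4052 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>",
]

# Count how many times each exit status occurs in the sample lines.
counts = Counter(status for _, _, status in exit_statuses(sample))
print(counts)
```

To run it over a real log, pass an open file object in place of the sample list.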
 
QuillLog
 
10/27 09:15:11 ******************************************************
10/27 09:15:11 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:15:11 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:15:11 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:15:11 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:15:11 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:15:11 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:15:11 ** PID = 3588
10/27 09:15:11 ** Log last touched 10/27 09:01:21
10/27 09:15:11 ******************************************************
10/27 09:15:11 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:15:11 Using local config sources:
10/27 09:15:11    C:\PROGRA~1\condor/condor_config.local
10/27 09:15:11 DaemonCore: Command Socket at <130.116.67.243:9494>
10/27 09:15:11 main_init() called
10/27 09:15:11 configuring tt options from config file
10/27 09:15:11 Using Polling Period = 10
10/27 09:15:11 Using logs 10/27 09:15:11 C:\PROGRA~1\condor/log/sql.log 10/27 09:15:11
10/27 09:15:11 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:15:11 Using Database Type = Postgres
10/27 09:15:11 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:15:11 Using Database Name = quilldatabase
10/27 09:15:11 Using Database User = quillwriter
10/27 09:15:12 ******** Start of Polling Job Queue Log ********
10/27 09:15:12 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:15:12 ********* End of Polling Job Queue Log *********
10/27 09:15:12 ******** Start of Polling Event Log ********
10/27 09:17:13 ******************************************************
10/27 09:17:13 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:17:13 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:17:13 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:17:13 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:17:13 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:17:13 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:17:13 ** PID = 4052
10/27 09:17:13 ** Log last touched 10/27 09:15:12
10/27 09:17:13 ******************************************************
10/27 09:17:13 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:17:13 Using local config sources:
10/27 09:17:13    C:\PROGRA~1\condor/condor_config.local
10/27 09:17:13 DaemonCore: Command Socket at <130.116.67.243:9459>
10/27 09:17:13 main_init() called
10/27 09:17:13 configuring tt options from config file
10/27 09:17:13 Using Polling Period = 10
10/27 09:17:13 Using logs 10/27 09:17:13 C:\PROGRA~1\condor/log/sql.log 10/27 09:17:13
10/27 09:17:13 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:17:13 Using Database Type = Postgres
10/27 09:17:13 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:17:13 Using Database Name = quilldatabase
10/27 09:17:13 Using Database User = quillwriter
10/27 09:17:13 ******** Start of Polling Job Queue Log ********
10/27 09:17:13 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:17:14 ********* End of Polling Job Queue Log *********
10/27 09:17:14 ******** Start of Polling Event Log ********
10/27 09:19:17 ******************************************************
10/27 09:19:17 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:19:17 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:19:17 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:19:17 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:19:17 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:19:17 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:19:17 ** PID = 2900
10/27 09:19:17 ** Log last touched 10/27 09:17:14
10/27 09:19:17 ******************************************************
10/27 09:19:17 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:19:17 Using local config sources:
10/27 09:19:17    C:\PROGRA~1\condor/condor_config.local
10/27 09:19:17 DaemonCore: Command Socket at <130.116.67.243:9496>
10/27 09:19:17 main_init() called
10/27 09:19:17 configuring tt options from config file
10/27 09:19:17 Using Polling Period = 10
10/27 09:19:17 Using logs 10/27 09:19:17 C:\PROGRA~1\condor/log/sql.log 10/27 09:19:17
10/27 09:19:17 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:19:17 Using Database Type = Postgres
10/27 09:19:17 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:19:17 Using Database Name = quilldatabase
10/27 09:19:17 Using Database User = quillwriter
10/27 09:19:17 ******** Start of Polling Job Queue Log ********
10/27 09:19:17 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:19:17 ********* End of Polling Job Queue Log *********
10/27 09:19:17 ******** Start of Polling Event Log ********
 
core.WIN32.QUILL
 
//=====================================================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  00401895 01:00000895 C:\PROGRA~1\condor\bin\condor_quill.exe
 
Registers:
EAX:00000000
EBX:00C8BFFF
ECX:0012F6AC
EDX:00000000
ESI:00000000
EDI:0012F6D8
CS:EIP:001B:00401895
SS:ESP:0023:0012F5DC  EBP:0012F66C
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010256
 
Call stack:
Address   Frame
00401895  0012F66C  condor_ttdb_buildts (c:\condor\execute\dir_5692\userdir\src\condor_tt\condor_ttdb.cpp:64)
0040C620  0012F85C  TTManager::insertScheddAd (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:1579)
0040F6F7  0012F910  TTManager::event_maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:599)
0040FDD5  0012F9B4  TTManager::maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:250)
0041018A  0012F9C0  TTManager::pollingTime (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:199)
004222AF  0012FA64  TimerManager::Timeout (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\timer_manager.cpp:493)
0041F38B  0012FEEC  DaemonCore::Driver (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core.cpp:2622)
00414C62  0012FF60  dc_main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2106)
00414D62  0012FF78  main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2169)
00487810  0012FFC0  __tmainCRTStartup (f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:266)
7C817077  0012FFF0  RegisterWaitForInputIdle+49
 
 
 
------------------------------------------------------------------------------------------------------
Greg Hitchen                                                                         greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                                         phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)             fax:       +61 8 6436 8555
Postal address:                                                                     mob:          0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-------------------------------------------------------------------------------------------------------