
Re: [Condor-users] Quill++ assistance



Perhaps not much help, Michael, but we've had similar problems with 7.2.4 on Windows
(see first attached email). 7.4.1 behaved somewhat better (see second attached email)
and at least ran, even though condor_quill restarted every 1 hr 25 mins, but a number of other
problems with the 7.4 series have kept us from upgrading to that version yet.
 
Cheers
 
Greg
 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
Sent: Thursday, 12 August 2010 3:56 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance


I have these specified already and I do not see any issues. The QuillLog file shows SQL statements, and the tables are being populated successfully.

However, I am finding a file with an access violation error on all machines other than the central manager. I am not sure if the condor_quill.exe daemon is supposed to run continuously, but I do not see it running on any machine other than the central manager.

The file showing up in the log directory on each machine is called core.QUILL.WIN32. Its contents are below (does this mean anything to anyone else?):

//=====================================================
PID: 3248
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  004025FE 01:000015FE C:\Condor\bin\condor_quill.exe

Registers:
EAX:00000000
EBX:00D04EB4
ECX:0012F714
EDX:00000000
ESI:00000000
EDI:0012F740
CS:EIP:001B:004025FE
SS:ESP:0023:0012F644  EBP:0012F6D4
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010246

Call stack:
Address   Frame
004025FE  0012F6D4  condor_ttdb_buildts (c:\condor\execute\dir_2116\userdir\src\condor_tt\condor_ttdb.cpp:64)
00415C35  0012F858  TTManager::insertScheddAd (c:\condor\execute\dir_2116\userdir\src\condor_tt\ttmanager.cpp:1579)
00B54898  00D0AEE8  0000:00000000
654E6369  6C627550  

Thank you,
mike




From: Steven Timm <timm@xxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 08/11/2010 10:12 AM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx






Set QUILL_DEBUG to include D_SECURITY, maybe even D_FULLDEBUG,
and look at what the logs are telling you. They should give a better
error message that says what is going on.
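
With the extra debug levels on, the log gets long quickly. A minimal sketch for pulling out just the security-related lines, assuming the log has already been read into a list of strings. The function name `find_security_errors` and the pattern list are only illustrative; the patterns are copied from the QuillLog excerpt quoted later in this thread.

```python
import re

# Patterns copied from messages in the QuillLog excerpt in this thread.
PATTERNS = [
    re.compile(r"SECMAN"),
    re.compile(r"condor_read\(\): timeout"),
    re.compile(r"FAILED TO SEND INITIAL KEEP ALIVE"),
]

def find_security_errors(lines):
    """Return the log lines that match any of the patterns above."""
    return [line.rstrip("\n") for line in lines
            if any(p.search(line) for p in PATTERNS)]

# Three lines from the quoted QuillLog, used here as sample input.
sample = [
    "08/10 20:01:48 condor_read(): timeout reading 5 bytes from <159.189.162.50:1046>.",
    "08/10 20:01:49 IO: Failed to read packet header",
    "08/10 20:01:52 SECMAN: no classad from server, failing",
]
hits = find_security_errors(sample)
for line in hits:
    print(line)
```

To run it against a real file, pass `open(path)` (with whatever path your QuillLog lives at) instead of `sample`.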

Steve

On Wed, 11 Aug 2010, Michael O'Donnell wrote:

> These settings are:
> SEC_DEFAULT_AUTHENTICATION = REQUIRED
> SEC_DEFAULT_AUTHENTICATION_METHODS = NTSSPI, SSL, PASSWORD
>
>
> thanks
> mike
>
>
>
>
>
> From:
> Steven Timm <timm@xxxxxxxx>
> To:
> Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> Date:
> 08/11/2010 09:13 AM
> Subject:
> Re: [Condor-users] Quill++ assistance
> Sent by:
> condor-users-bounces@xxxxxxxxxxx
>
>
>
>
> What are your SEC_DEFAULT_AUTHENTICATION and
> SEC_DEFAULT_AUTHENTICATION_METHODS set to?
> This error says that the quill daemons on the worker
> nodes can't contact the master. A bad security configuration of
> some sort is to blame; Windows gurus can help more.
>
>
> Steve
>
> On Wed, 11 Aug 2010, Michael O'Donnell wrote:
>
>> I have been trying to set up Quill for our pool so we can track HTC use. I
>> have followed the Condor manual for configuring both the Condor
>> configuration files and Postgres. Quill works for several hours,
>> but then most of the machines are dropped from the pool according to
>> Quill. For example, if I enable Quill, everything seems to work for at
>> least several hours. But usually by the next morning Quill is not tracking
>> any of the machines and all machines are dropped from the pool (as seen
>> via condor_status). The Condor daemons are still running on each machine,
>> however.
>>
>> This seems to be related to password/security settings, based on the errors I am
>> receiving below, but the database tables are populated, all the SQL log
>> files have information and everything looks OK.
>>
>> I have a homogeneous pool with Windows worker nodes, and our central
>> manager is running on Windows Server 2008. Postgres is also running on
>> this same server. Our bandwidth is 1 Gb/s and our pool is small (50 machines
>> right now).
>>
>> Can anyone help me understand what I may be doing wrong or what the
>> problem might be related to?
>>
>> Thank you for the help,
>> Mike
>>
>>
>> The administrator is receiving email saying that condor_quill.exe has
>> exited (exit 4):
>>
>> *** Last 20 line(s) of file C:/Condor/log/QuillLog:
>> SessionDuration = "86400"
>> NewSession = "YES"
>> RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
>> ServerCommandSock = "<IP:4555>"
>> Command = 60010
>> AuthCommand = 60008
>> 08/10 20:00:41 condor_write(fd=1704 <IP:1046>,,size=514,timeout=20,flags=0)
>> 08/10 20:00:47 condor_read(fd=1704 <IP:1046>,,size=5,timeout=20,flags=0)
>> 08/10 20:01:03 condor_read(): fd=1704
>> 08/10 20:01:24 condor_read(): select returned 0
>> 08/10 20:01:48 condor_read(): timeout reading 5 bytes from <159.189.162.50:1046>.
>> 08/10 20:01:49 IO: Failed to read packet header
>> 08/10 20:01:50 Stream::get(int) failed to read padding
>> 08/10 20:01:51 Failed to read ClassAd size.
>> 08/10 20:01:52 SECMAN: no classad from server, failing
>> 08/10 20:01:53 CLOSE <IP:4610> fd=1704
>> 08/10 20:01:54 SECMAN: unable to create security session to <159.189.162.50:1046> via TCP, failing.
>> 08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to <159.189.162.50:1046> with TCP.|SECMAN:2007:Failed to end classad message.
>> 08/10 20:01:56 DaemonCore: startCommand() to <159.189.162.50:1046> failed.
>> SendAliveToParent() failed.
>> 08/10 20:02:17 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <IP:1046>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
>> *** End of file QuillLog
>>
>>
>
>

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  
http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/


--- Begin Message ---
Hi All
 
We have recently added the Quill database system/setup to our pools of Condor machines.
 
Quick summary:
 
5 pools each with ~ 600 windows PCs (mainly XP) running Condor version 7.2.4
 
5 central managers, 1 Condorview server and 1 Quill database server, each a VM on ESX servers
and running x86_64 bit SLES10 with Condor version 7.2.3
 
The setup and install went mostly OK, and we have been running for only a week with ~16 Windows
submit nodes, all running condor_quill. So far 3 of these machines have started having
access violation errors with condor_quill. We get emails with the header:
 
[Condor] Problem PI-SCHAP2-SL.nexus.csiro.au: condor_quill.exe died (-1073741819)
 
See below for excerpts from MasterLog, QuillLog and the core file core.WIN32.QUILL
Deleting and recopying just the condor_quill.exe file made no difference. In each case an uninstall
and reinstall seemed to fix things up. Has anyone else come across this? I'm worried that it just
seems to have randomly started happening for no apparent reason.
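
For what it's worth, the exit status -1073741819 in the email header is the same value as the exception code C0000005 (ACCESS_VIOLATION) in the core file, just read as a signed 32-bit integer. A small sketch of the conversion (the function name is only for illustration):

```python
def exit_code_to_ntstatus(code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex status."""
    return "0x{:08X}".format(code & 0xFFFFFFFF)

print(exit_code_to_ntstatus(-1073741819))  # prints 0xC0000005
```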
 
On top of this, on some machines Condor stops altogether, giving emails like:
 
[Condor] Problem ELEMENT-KB.arrc.csiro.au: condor_quill.exe exited (44)
 
The MasterLog shows quilld and schedd exiting, a failure to run condor_mail, and
condor_schedd.exe reported as not a valid Windows executable:
 
MasterLog for exit code 44 and condor stopping, no daemons running at all.
 
10/29 04:56:08 DaemonCore: pid 3144 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
10/29 04:56:08 The QUILL (pid 3144) exited with status 44
10/29 04:56:08 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 3600 seconds
10/29 04:56:08 DaemonCore: return from reaper for pid 3144
10/29 04:56:14 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.144.59:9738>, access level DAEMON
10/29 04:56:14 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 04:56:14 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.016s)
10/29 05:15:35 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.144.59:9743>, access level DAEMON
10/29 05:15:35 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 05:15:35 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
10/29 05:15:44 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.144.59:9340>, access level DAEMON
10/29 05:15:44 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 05:15:44 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
10/29 05:17:09 Received UDP command 60011 (DC_NOP) from  <130.116.144.59:9221>, access level READ
10/29 05:17:09 Calling HandleReq <handle_nop()> (0)
10/29 05:17:09 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.016s)
10/29 05:17:09 DaemonCore: pid 3632 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
10/29 05:17:09 The SCHEDD (pid 3632) exited with status 44
10/29 05:17:09 cannot send softkill since WINDOWS_SOFTKILL is undefined
10/29 05:17:09 Sending obituary for "C:\PROGRA~1\condor/bin/condor_schedd.exe"
10/29 05:17:09 my_popen: CreateProcess failed
10/29 05:17:09 Failed to access email program "C:\PROGRA~1\condor/bin/condor_mail.exe"
10/29 05:17:09 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 10 seconds
10/29 05:17:09 DaemonCore: return from reaper for pid 3632
10/29 05:17:19 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:19 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:19 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 11 seconds
10/29 05:17:30 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:30 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:30 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 13 seconds
10/29 05:17:43 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:43 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:43 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 17 seconds
10/29 05:18:00 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:18:00 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:18:00 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 25 seconds
10/29 05:18:25 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:18:25 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:18:25 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 41 seconds
10/29 05:19:06 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:19:06 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:19:06 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 73 seconds
10/29 05:20:19 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:20:19 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:20:19 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 137 seconds
10/29 09:53:30 UnsetEnv(NET_REMAP_ENABLE): SetEnvironmentVariable failed, errno=203
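
The schedd restart delays in the MasterLog above (10, 11, 13, 17, 25, 41, 73, 137 seconds) fit a constant of 9 plus a doubling term. This is only a pattern read off this log, not something taken from the Condor source, and the names `restart_delay`, `constant` and `factor` are made up for the sketch:

```python
# Delays copied from the "restarting ... condor_schedd.exe in N seconds" lines.
observed = [10, 11, 13, 17, 25, 41, 73, 137]

def restart_delay(attempt, constant=9, factor=2):
    """Assumed backoff shape: a fixed constant plus factor**attempt."""
    return constant + factor ** attempt

predicted = [restart_delay(n) for n in range(len(observed))]
print(predicted == observed)
```

If the pattern holds, the master is backing off as designed and the thing to chase is why condor_schedd.exe is reported as not a valid Windows executable.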
 
Thanks for any info/help.
 
Cheers
 
Greg
 
 
Logs for access violation problems (exit code -1073741819)
 
MasterLog
 
10/27 09:17:00 DaemonCore: pid 3588 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>
10/27 09:17:00 The QUILL (pid 3588) died due to exception ACCESS_VIOLATION
10/27 09:17:00 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
10/27 09:17:03 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 10 seconds
10/27 09:17:03 DaemonCore: return from reaper for pid 3588
10/27 09:17:13 Started DaemonCore process "C:\PROGRA~1\condor/bin/condor_quill.exe", pid and pgroup = 4052
10/27 09:17:13 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.67.243:9263>, access level DAEMON
10/27 09:17:13 Calling HandleReq <HandleChildAliveCommand> (0)
10/27 09:17:13 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.015s)
10/27 09:19:03 Received UDP command 60011 (DC_NOP) from  <130.116.67.243:9836>, access level READ
10/27 09:19:03 Calling HandleReq <handle_nop()> (0)
10/27 09:19:03 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.000s)
10/27 09:19:03 DaemonCore: pid 4052 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>
10/27 09:19:03 The QUILL (pid 4052) died due to exception ACCESS_VIOLATION
10/27 09:19:03 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
10/27 09:19:05 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 11 seconds
10/27 09:19:05 DaemonCore: return from reaper for pid 4052
 
QuillLog
 
10/27 09:15:11 ******************************************************
10/27 09:15:11 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:15:11 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:15:11 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:15:11 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:15:11 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:15:11 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:15:11 ** PID = 3588
10/27 09:15:11 ** Log last touched 10/27 09:01:21
10/27 09:15:11 ******************************************************
10/27 09:15:11 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:15:11 Using local config sources:
10/27 09:15:11    C:\PROGRA~1\condor/condor_config.local
10/27 09:15:11 DaemonCore: Command Socket at <130.116.67.243:9494>
10/27 09:15:11 main_init() called
10/27 09:15:11 configuring tt options from config file
10/27 09:15:11 Using Polling Period = 10
10/27 09:15:11 Using logs 10/27 09:15:11 C:\PROGRA~1\condor/log/sql.log 10/27 09:15:11
10/27 09:15:11 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:15:11 Using Database Type = Postgres
10/27 09:15:11 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:15:11 Using Database Name = quilldatabase
10/27 09:15:11 Using Database User = quillwriter
10/27 09:15:12 ******** Start of Polling Job Queue Log ********
10/27 09:15:12 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:15:12 ********* End of Polling Job Queue Log *********
10/27 09:15:12 ******** Start of Polling Event Log ********
10/27 09:17:13 ******************************************************
10/27 09:17:13 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:17:13 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:17:13 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:17:13 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:17:13 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:17:13 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:17:13 ** PID = 4052
10/27 09:17:13 ** Log last touched 10/27 09:15:12
10/27 09:17:13 ******************************************************
10/27 09:17:13 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:17:13 Using local config sources:
10/27 09:17:13    C:\PROGRA~1\condor/condor_config.local
10/27 09:17:13 DaemonCore: Command Socket at <130.116.67.243:9459>
10/27 09:17:13 main_init() called
10/27 09:17:13 configuring tt options from config file
10/27 09:17:13 Using Polling Period = 10
10/27 09:17:13 Using logs 10/27 09:17:13 C:\PROGRA~1\condor/log/sql.log 10/27 09:17:13
10/27 09:17:13 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:17:13 Using Database Type = Postgres
10/27 09:17:13 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:17:13 Using Database Name = quilldatabase
10/27 09:17:13 Using Database User = quillwriter
10/27 09:17:13 ******** Start of Polling Job Queue Log ********
10/27 09:17:13 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:17:14 ********* End of Polling Job Queue Log *********
10/27 09:17:14 ******** Start of Polling Event Log ********
10/27 09:19:17 ******************************************************
10/27 09:19:17 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:19:17 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:19:17 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:19:17 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:19:17 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:19:17 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:19:17 ** PID = 2900
10/27 09:19:17 ** Log last touched 10/27 09:17:14
10/27 09:19:17 ******************************************************
10/27 09:19:17 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:19:17 Using local config sources:
10/27 09:19:17    C:\PROGRA~1\condor/condor_config.local
10/27 09:19:17 DaemonCore: Command Socket at <130.116.67.243:9496>
10/27 09:19:17 main_init() called
10/27 09:19:17 configuring tt options from config file
10/27 09:19:17 Using Polling Period = 10
10/27 09:19:17 Using logs 10/27 09:19:17 C:\PROGRA~1\condor/log/sql.log 10/27 09:19:17
10/27 09:19:17 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:19:17 Using Database Type = Postgres
10/27 09:19:17 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:19:17 Using Database Name = quilldatabase
10/27 09:19:17 Using Database User = quillwriter
10/27 09:19:17 ******** Start of Polling Job Queue Log ********
10/27 09:19:17 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:19:17 ********* End of Polling Job Queue Log *********
10/27 09:19:17 ******** Start of Polling Event Log ********
 
core.WIN32.QUILL
 
//=====================================================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  00401895 01:00000895 C:\PROGRA~1\condor\bin\condor_quill.exe
 
Registers:
EAX:00000000
EBX:00C8BFFF
ECX:0012F6AC
EDX:00000000
ESI:00000000
EDI:0012F6D8
CS:EIP:001B:00401895
SS:ESP:0023:0012F5DC  EBP:0012F66C
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010256
 
Call stack:
Address   Frame
00401895  0012F66C  condor_ttdb_buildts (c:\condor\execute\dir_5692\userdir\src\condor_tt\condor_ttdb.cpp:64)
0040C620  0012F85C  TTManager::insertScheddAd (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:1579)
0040F6F7  0012F910  TTManager::event_maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:599)
0040FDD5  0012F9B4  TTManager::maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:250)
0041018A  0012F9C0  TTManager::pollingTime (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:199)
004222AF  0012FA64  TimerManager::Timeout (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\timer_manager.cpp:493)
0041F38B  0012FEEC  DaemonCore::Driver (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core.cpp:2622)
00414C62  0012FF60  dc_main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2106)
00414D62  0012FF78  main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2169)
00487810  0012FFC0  __tmainCRTStartup (f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:266)
7C817077  0012FFF0  RegisterWaitForInputIdle+49
 
 
 
------------------------------------------------------------------------------------------------------
Greg Hitchen                                                                         greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                                         phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)             fax:       +61 8 6436 8555
Postal address:                                                                     mob:          0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-------------------------------------------------------------------------------------------------------
 

--- End Message ---
--- Begin Message ---
Having previously reported Quill issues with Condor 7.2.4 on Windows XP (see attached email; still waiting for a response),
we thought we'd try 7.4.1. Same setup as before, i.e. the Quill database is on a SLES10 machine running Condor 7.2.3 and the latest Postgres.
 
This time, rather than access violation errors, core dumps and "not a valid Windows executable" errors, we get
the following in the MasterLog file:
 
03/03 09:37:02 DaemonCore: pid 7488 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
03/03 09:37:02 The QUILL (pid 7488) exited with status 44
03/03 09:37:02 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
03/03 09:37:06 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 10 seconds
03/03 09:37:06 DaemonCore: return from reaper for pid 7488
03/03 09:37:16 Started DaemonCore process "C:\PROGRA~1\condor/bin/condor_quill.exe", pid and pgroup = 16704
 
This time condor_quill ALWAYS starts up again OK and the machine appears in the Postgres db OK. Nothing else appears
in any log files, not even the QuillLog file, even with FULLDEBUG turned on. The interesting thing is that condor_quill
exits with status 44 regularly, EXACTLY every 1 hour and 25 mins. We have confirmed that this is also true on
another PC.
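
For anyone wanting to check the interval on their own machines, here is a rough sketch that pulls the QUILL exit times out of MasterLog lines and prints the gaps in seconds (1 hour 25 mins is 5100). The function names are made up, the year is an assumption needed only because the log timestamps carry none, and the second sample line is hypothetical, not copied from a real log:

```python
from datetime import datetime

def quill_exit_times(lines, year=2010):
    """Collect timestamps of 'The QUILL ... exited with status 44' lines."""
    times = []
    for line in lines:
        if "The QUILL" in line and "exited with status 44" in line:
            stamp = " ".join(line.split()[:2])  # e.g. "03/03 09:37:02"
            times.append(datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S"))
    return times

def intervals_in_seconds(times):
    """Seconds between each consecutive pair of timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

sample = [
    "03/03 09:37:02 The QUILL (pid 7488) exited with status 44",
    # Hypothetical later entry, 1 hour 25 minutes after the first.
    "03/03 11:02:02 The QUILL (pid 16704) exited with status 44",
]
gaps = intervals_in_seconds(quill_exit_times(sample))
print(gaps)
```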
 
At least this is not a show stopper for us, unlike some other 7.4.1 Windows issues, but we would still be interested
to hear whether anyone else has seen this or has any ideas.
 
Thanks.
 
Cheers
 
Greg
--- Begin Message ---
 
core.QUILL.WIN32
 
//=====================================================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  00401895 01:00000895 C:\PROGRA~1\condor\bin\condor_quill.exe
 
Registers:
EAX:00000000
EBX:00C8BFFF
ECX:0012F6AC
EDX:00000000
ESI:00000000
EDI:0012F6D8
CS:EIP:001B:00401895
SS:ESP:0023:0012F5DC  EBP:0012F66C
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010256
 
Call stack:
Address   Frame
00401895  0012F66C  condor_ttdb_buildts (c:\condor\execute\dir_5692\userdir\src\condor_tt\condor_ttdb.cpp:64)
0040C620  0012F85C  TTManager::insertScheddAd (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:1579)
0040F6F7  0012F910  TTManager::event_maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:599)
0040FDD5  0012F9B4  TTManager::maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:250)
0041018A  0012F9C0  TTManager::pollingTime (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:199)
004222AF  0012FA64  TimerManager::Timeout (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\timer_manager.cpp:493)
0041F38B  0012FEEC  DaemonCore::Driver (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core.cpp:2622)
00414C62  0012FF60  dc_main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2106)
00414D62  0012FF78  main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2169)
00487810  0012FFC0  __tmainCRTStartup (f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:266)
7C817077  0012FFF0  RegisterWaitForInputIdle+49
 
 
 
------------------------------------------------------------------------------------------------------
Greg Hitchen                                 greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                 phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)  fax:   +61 8 6436 8555
Postal address:                              mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-------------------------------------------------------------------------------------------------------
 
--- End Message ---

--- End Message ---