
Re: [Condor-users] Quill++ assistance




I have these specified already and I do not see any issues. The QuillLog file shows SQL statements and success at populating the tables.

However, I am finding a file on every machine other than the central manager that reports an access violation error. I am not sure whether the condor_quill.exe daemon is supposed to run continuously, but I do not see it running on any machine other than the central manager.

The file that shows up in the log directory on each machine is called core.QUILL.WIN32. Its contents are below. Does this mean anything to anyone else?

//=====================================================
PID: 3248
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  004025FE 01:000015FE C:\Condor\bin\condor_quill.exe

Registers:
EAX:00000000
EBX:00D04EB4
ECX:0012F714
EDX:00000000
ESI:00000000
EDI:0012F740
CS:EIP:001B:004025FE
SS:ESP:0023:0012F644  EBP:0012F6D4
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010246

Call stack:
Address   Frame
004025FE  0012F6D4  condor_ttdb_buildts (c:\condor\execute\dir_2116\userdir\src\condor_tt\condor_ttdb.cpp:64)
00415C35  0012F858  TTManager::insertScheddAd (c:\condor\execute\dir_2116\userdir\src\condor_tt\ttmanager.cpp:1579)
00B54898  00D0AEE8  0000:00000000
654E6369  6C627550  

Thank you,
mike




From: Steven Timm <timm@xxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 08/11/2010 10:12 AM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx






Set QUILL_DEBUG to include D_SECURITY, maybe even D_FULLDEBUG,
and look at what the logs are telling you. They should give a more
detailed error message that says what is going on.
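
As a rough sketch of that suggestion, the fragment below shows what the debug settings might look like in the local Condor configuration file on an affected node. The MAX_QUILL_LOG value is an illustrative assumption, not something recommended in this thread; a larger log simply makes it less likely that the verbose output is rotated away before the failure recurs.

```
# Sketch only: add to the local configuration file on an affected node.
# Verbose Quill logging, including security negotiation details.
QUILL_DEBUG = D_SECURITY D_FULLDEBUG

# Assumed value (bytes): allow a larger QuillLog so the verbose output
# is not rotated away before the failure recurs.
MAX_QUILL_LOG = 10000000
```

After editing, running condor_reconfig on the node should make the daemons pick up the new debug level; then check QuillLog around the time of the next failure.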

Steve

On Wed, 11 Aug 2010, Michael O'Donnell wrote:

> These settings are:
> SEC_DEFAULT_AUTHENTICATION = REQUIRED
> SEC_DEFAULT_AUTHENTICATION_METHODS = NTSSPI, SSL, PASSWORD
>
>
> thanks
> mike
>
>
>
>
>
> From: Steven Timm <timm@xxxxxxxx>
> To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> Date: 08/11/2010 09:13 AM
> Subject: Re: [Condor-users] Quill++ assistance
> Sent by: condor-users-bounces@xxxxxxxxxxx
>
>
>
>
> What are your SEC_DEFAULT_AUTHENTICATION and
> SEC_DEFAULT_AUTHENTICATION_METHODS set to?
> This error is saying that the Quill daemons on the worker
> nodes can't contact the master. A bad security configuration of
> some sort is to blame; Windows gurus can help more.
>
>
> Steve
>
> On Wed, 11 Aug 2010, Michael O'Donnell wrote:
>
>> I have been trying to set up Quill for our pool so we can track HTC use. I
>> have followed the Condor manual for configuring both the Condor
>> configuration files and Postgres. Quill works for several hours, but
>> then most of the machines are dropped from the pool according to
>> Quill. For example, if I enable Quill, everything seems to work for at
>> least several hours. But usually by the next morning Quill is not tracking
>> any of the machines and all machines are dropped from the pool (as seen
>> via condor_status). The Condor daemons are still running on each machine,
>> however.
>>
>> This seems to be related to the password/security settings, based on the
>> errors I am receiving below, but the database tables are populated, all
>> the SQL log files have information, and everything looks OK.
>>
>> I have a homogeneous pool of Windows worker nodes, and our central
>> manager runs on Windows Server 2008. Postgres is also running on
>> this same server. Our bandwidth is 1 Gb/s and our pool is small (50
>> machines right now).
>>
>> Can anyone help me understand what I may be doing wrong or what the
>> problem might be related to?
>>
>> Thank you for the help,
>> Mike
>>
>>
>> I am getting an email to the administrator reporting that condor_quill.exe
>> has exited (exit 4):
>>
>> *** Last 20 line(s) of file C:/Condor/log/QuillLog:
>> SessionDuration = "86400"
>> NewSession = "YES"
>> RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
>> ServerCommandSock = "<IP:4555>"
>> Command = 60010
>> AuthCommand = 60008
>> 08/10 20:00:41 condor_write(fd=1704 <IP:1046>,,size=514,timeout=20,flags=0)
>> 08/10 20:00:47 condor_read(fd=1704 <IP:1046>,,size=5,timeout=20,flags=0)
>> 08/10 20:01:03 condor_read(): fd=1704
>> 08/10 20:01:24 condor_read(): select returned 0
>> 08/10 20:01:48 condor_read(): timeout reading 5 bytes from <159.189.162.50:1046>.
>> 08/10 20:01:49 IO: Failed to read packet header
>> 08/10 20:01:50 Stream::get(int) failed to read padding
>> 08/10 20:01:51 Failed to read ClassAd size.
>> 08/10 20:01:52 SECMAN: no classad from server, failing
>> 08/10 20:01:53 CLOSE <IP:4610> fd=1704
>> 08/10 20:01:54 SECMAN: unable to create security session to <159.189.162.50:1046> via TCP, failing.
>> 08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to <159.189.162.50:1046> with TCP.|SECMAN:2007:Failed to end classad message.
>> 08/10 20:01:56 DaemonCore: startCommand() to <159.189.162.50:1046> failed.
>> SendAliveToParent() failed.
>> 08/10 20:02:17 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <IP:1046>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
>> *** End of file QuillLog
>>
>>
>
>

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  
http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.