
Re: [Condor-users] Quill++ assistance




Set QUILL_DEBUG to include D_SECURITY, maybe even D_FULLDEBUG,
and look at what the logs are telling you. They should give a more
detailed error message that says what is going on.

Steve
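
For reference, the suggestion above might look like the following in the Condor configuration. This is only a minimal sketch: it assumes the setting goes in the local configuration file on the machine running Quill, and that the debug levels are listed space-separated as with other <SUBSYS>_DEBUG settings.

```
# Assumed placement: local config file on the host running condor_quill.
# D_SECURITY adds security-session detail; D_FULLDEBUG adds verbose output.
QUILL_DEBUG = D_SECURITY D_FULLDEBUG
```

After changing it, running condor_reconfig (or restarting the daemons) should make the new level take effect, and the extra detail should then appear in the QuillLog.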

On Wed, 11 Aug 2010, Michael O'Donnell wrote:

These settings are:
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = NTSSPI, SSL, PASSWORD


thanks
mike





From: Steven Timm <timm@xxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 08/11/2010 09:13 AM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx




What are your SEC_DEFAULT_AUTHENTICATION and
SEC_DEFAULT_AUTHENTICATION_METHODS set to?
This error is saying that the various quilld's on the worker
nodes can't contact the master. A bad security configuration of
some sort is to blame. Windows gurus can help more.


Steve

On Wed, 11 Aug 2010, Michael O'Donnell wrote:

I have been trying to set up Quill for our pool so we can track HTC use. I
have followed the Condor manual for configuring both the configuration
files and Postgres. Quill works for several hours, but then most of the
machines are dropped from the pool according to Quill. For example, if I
enable Quill, everything seems to work for at least several hours. But
usually by the next morning Quill is not tracking any of the machines and
all machines are dropped from the pool (as seen via condor_status). The
Condor daemons are still running on each machine, however.

This seems to be related to the password/security settings, based on the
errors I am receiving below, but the database tables are populated, all the
SQL log files have information, and everything looks OK.

I have a homogeneous pool of Windows worker nodes, and our central
manager is running on Windows Server 2008. Postgres is also running on
this same server. Our bandwidth is 1 Gb/s and our pool is small (50 machines
right now).

Can anyone help me understand what I may be doing wrong or what the
problem might be related to?

Thank you for the help,
Mike


I am getting an error, via email to the administrator, that
condor_quill.exe has exited (exit 4):

*** Last 20 line(s) of file C:/Condor/log/QuillLog:
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
ServerCommandSock = "<IP:4555>"
Command = 60010
AuthCommand = 60008
08/10 20:00:41 condor_write(fd=1704 <IP:1046>,,size=514,timeout=20,flags=0)
08/10 20:00:47 condor_read(fd=1704 <IP:1046>,,size=5,timeout=20,flags=0)
08/10 20:01:03 condor_read(): fd=1704
08/10 20:01:24 condor_read(): select returned 0
08/10 20:01:48 condor_read(): timeout reading 5 bytes from <159.189.162.50:1046>.
08/10 20:01:49 IO: Failed to read packet header
08/10 20:01:50 Stream::get(int) failed to read padding
08/10 20:01:51 Failed to read ClassAd size.
08/10 20:01:52 SECMAN: no classad from server, failing
08/10 20:01:53 CLOSE <IP:4610> fd=1704
08/10 20:01:54 SECMAN: unable to create security session to <159.189.162.50:1046> via TCP, failing.
08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to <159.189.162.50:1046> with TCP.|SECMAN:2007:Failed to end classad message.
08/10 20:01:56 DaemonCore: startCommand() to <159.189.162.50:1046> failed.
SendAliveToParent() failed.
08/10 20:02:17 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <IP:1046>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
*** End of file QuillLog





--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.