
Re: [Condor-users] Quill++ assistance

Do I need to add the QUILL_NAME (in my case, Quill@FORT) to the ALLOW entries in the global configuration file? I am testing this now.

When I run the following:
condor_q -direct quilld -long

I get an error message:
SECMAN:2007:Failed to end classad message
--Quill daemon at quill@FORT(IP) associated with schedd (central manager) is not reachable or can't talk to rdbms.


From: Steven Timm <timm@xxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 08/11/2010 09:13 AM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx

This error says that the quilld daemons on the worker
nodes can't contact the master. A bad security configuration of
some sort is to blame. Windows gurus can help more.


On Wed, 11 Aug 2010, Michael O'Donnell wrote:

> I have been trying to set up Quill for our pool so we can track HTC use. I
> have followed the Condor manual for both the Condor configuration files
> and Postgres. When I enable Quill, everything seems to work for at least
> several hours, but usually by the next morning Quill is no longer tracking
> any of the machines and all machines have been dropped from the pool (as
> seen via condor_status). The Condor daemons are still running on each
> machine, however.
> Based on the errors below, this seems to be related to the
> password/security configuration, but the database tables are populated,
> all the SQL log files contain information, and everything looks OK.
> I have a homogeneous pool of Windows worker nodes, and our central
> manager is running on Windows Server 2008. Postgres is also running on
> this same server. Our bandwidth is 1 Gb/s and our pool is small (50
> machines right now).
> Can anyone help me understand what I may be doing wrong or what the
> problem might be related to?
> Thank you for the help,
> Mike
> The administrator is receiving an email reporting that
> condor_quill.exe has exited (exit 4):
> *** Last 20 line(s) of file C:/Condor/log/QuillLog:
> SessionDuration = "86400"
> NewSession = "YES"
> RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
> ServerCommandSock = "<IP:4555>"
> Command = 60010
> AuthCommand = 60008
> 08/10 20:00:41 condor_write(fd=1704
> <IP:1046>,,size=514,timeout=20,flags=0)
> 08/10 20:00:47 condor_read(fd=1704 <IP:1046>,,size=5,timeout=20,flags=0)
> 08/10 20:01:03 condor_read(): fd=1704
> 08/10 20:01:24 condor_read(): select returned 0
> 08/10 20:01:48 condor_read(): timeout reading 5 bytes from
> <>.
> 08/10 20:01:49 IO: Failed to read packet header
> 08/10 20:01:50 Stream::get(int) failed to read padding
> 08/10 20:01:51 Failed to read ClassAd size.
> 08/10 20:01:52 SECMAN: no classad from server, failing
> 08/10 20:01:53 CLOSE <IP:4610> fd=1704
> 08/10 20:01:54 SECMAN: unable to create security session to
> <> via TCP, failing.
> 08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to
> <> with TCP.|SECMAN:2007:Failed to end classad message.
> 08/10 20:01:56 DaemonCore: startCommand() to <> failed.
> SendAliveToParent() failed.
> <IP:1046>" at line 9310 in file
> ..\src\condor_daemon_core.V6\daemon_core.cpp
> *** End of file QuillLog
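One way to see which SECMAN error codes recur in a log like the one quoted above is to tally them with a short script. This is only a sketch: the function name `secman_errors` and the regular expression are illustrative assumptions, not part of Condor, and the pattern is inferred solely from the two SECMAN fragments visible in this QuillLog excerpt.

```python
import re
from collections import Counter

# Matches fragments such as "SECMAN:2007:Failed to end classad message."
# Assumption: errors look like "SECMAN:<code>:<message>", with "|" separating
# chained errors, as in the excerpt quoted in this thread.
SECMAN_RE = re.compile(r"SECMAN:(\d+):([^|\n]+)")


def secman_errors(lines):
    """Count (code, message) pairs for SECMAN errors found in log lines."""
    counts = Counter()
    for line in lines:
        for code, message in SECMAN_RE.findall(line):
            # Normalize the message: trim whitespace and a trailing period.
            counts[(int(code), message.strip().rstrip("."))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the quoted QuillLog. The log wraps long entries,
    # so the 2004 message is cut off at the line break.
    sample = [
        "08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to",
        "<> with TCP.|SECMAN:2007:Failed to end classad message.",
    ]
    for (code, message), n in sorted(secman_errors(sample).items()):
        print(code, message, n)
```

To run it against a real log, pass an open file instead of the sample list, for example `secman_errors(open("C:/Condor/log/QuillLog"))`. A count that rises steadily overnight would be consistent with the daemons losing their security sessions over time, though the script itself says nothing about the cause.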

Steven C. Timm, Ph.D  (630) 840-8525
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.