[Condor-users] Quill++ assistance
- Date: Wed, 11 Aug 2010 07:49:57 -0600
- From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
- Subject: [Condor-users] Quill++ assistance
I have been trying to set up Quill for our pool so we can track HTC use.
I followed the Condor manual to configure both the Condor configuration
files and PostgreSQL. Quill works for several hours, but then most of the
machines are dropped from the pool according to Quill. Typically, after I
enable Quill, everything seems to work for at least several hours, but by
the next morning Quill is not tracking any of the machines and all machines
have been dropped from the pool (as seen via condor_status). The Condor
daemons are still running on each machine, however.
This seems to be related to the password/security settings, based on the
errors I am receiving (below). However, the database tables are populated,
all the SQL log files contain information, and everything else looks OK.
I have a homogeneous pool of Windows worker nodes, and our central manager
runs on Windows Server 2008. Postgres runs on this same server. Our network
bandwidth is 1 Gb/s and our pool is small (50 machines right now).
Can anyone help me understand what I may be doing wrong, or what the
problem might be related to?
Thank you for the help,
The administrator account is receiving an email reporting that
condor_quill.exe has exited (exit 4):
*** Last 20 line(s) of file C:/Condor/log/QuillLog:
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173
ServerCommandSock = "<IP:4555>"
Command = 60010
AuthCommand = 60008
08/10 20:00:41 condor_write(fd=1704 <IP:1046>,,size=514,timeout=20,flags=0)
08/10 20:00:47 condor_read(fd=1704 <IP:1046>,,size=5,timeout=20,flags=0)
08/10 20:01:03 condor_read(): fd=1704
08/10 20:01:24 condor_read(): select returned 0
08/10 20:01:48 condor_read(): timeout reading 5 bytes from <22.214.171.124:1046>.
08/10 20:01:49 IO: Failed to read packet header
08/10 20:01:50 Stream::get(int) failed to read padding
08/10 20:01:51 Failed to read ClassAd size.
08/10 20:01:52 SECMAN: no classad from server, failing
08/10 20:01:53 CLOSE <IP:4610> fd=1704
08/10 20:01:54 SECMAN: unable to create security session to <126.96.36.199:1046>
via TCP, failing.
08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to
<188.8.131.52:1046> with TCP.|SECMAN:2007:Failed to end classad
08/10 20:01:56 DaemonCore: startCommand() to <184.108.40.206:1046>
failed. SendAliveToParent() failed.
08/10 20:02:17 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
<IP:1046>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
*** End of file QuillLog
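
For anyone who wants to pull only the failure lines out of a longer QuillLog, here is a minimal sketch of how that filtering could be done. It is not part of Condor: the function name failure_lines and the marker list are my own assumptions, chosen from the strings that appear in the excerpt above, and the timestamp pattern assumes the "MM/DD HH:MM:SS" prefix seen there.

```python
import re

# Hypothetical marker list, taken from strings in the log excerpt above.
FAILURE_MARKERS = ("timeout reading", "Failed to", "SECMAN", "ERROR", "failed")

# Assumes each log entry starts with "MM/DD HH:MM:SS " followed by the message.
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def failure_lines(lines):
    """Return (timestamp, message) pairs for entries that contain a failure marker."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match is None:
            # ClassAd attributes and wrapped continuation lines have no timestamp.
            continue
        timestamp, message = match.groups()
        if any(marker in message for marker in FAILURE_MARKERS):
            hits.append((timestamp, message))
    return hits


def scan_log(path):
    """Print the failure entries found in the log file at the given path."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for timestamp, message in failure_lines(log):
            print(timestamp, message)
```

Calling scan_log("C:/Condor/log/QuillLog") would use the path from the administrator email; other installs will need a different path.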