[Condor-users] Windows XP Condor 7.4.0 Quill Issues
- Date: Fri, 9 Jul 2010 08:34:09 -0600
- From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
- Subject: [Condor-users] Windows XP Condor 7.4.0 Quill Issues
We have set up Quill and postgres
with Condor. Everything appeared to work initially, but there were a couple
of problems.
First, submitted jobs went from spending
no time in the Idle state to waiting over an hour before they executed.
Second, after a day or so we started
getting errors such as this one emailed to the condor administrator:
This is an automated email from the Condor system
on machine "IGSKBACBLT106.domain". Do not reply.
"C:\Condor/bin/condor_quill.exe" on "IGSKBACBLT106.domain"
exited with status 4.
Condor will automatically restart this process in 11 seconds.
*** Last 20 line(s) of file C:\Condor/log/QuillLog:
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173
ServerCommandSock = "<IP:1662>"
Command = 60010
AuthCommand = 60008
07/02 20:37:33 condor_write(fd=1716 <IP:3596>,,size=505,timeout=20,flags=0)
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
07/02 20:37:57 IO: Failed to read packet header
07/02 20:37:58 Stream::get(int) failed to read padding
07/02 20:37:59 Failed to read ClassAd size.
07/02 20:37:59 SECMAN: no classad from server, failing
07/02 20:38:00 CLOSE <IP:1688> fd=1716
07/02 20:38:01 SECMAN: unable to create security session to <IP:3596>
via TCP, failing.
07/02 20:38:02 ERROR: SECMAN:2004:Failed to create security session to
<IP:3596> with TCP.|SECMAN:2007:Failed to end classad message.
07/02 20:38:05 DaemonCore: startCommand() to <IP:3596> failed. SendAliveToParent()
07/02 20:38:06 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
<IP:3596>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
*** End of file QuillLog
Third, machines in our pool started
to drop off until our pool was no longer functioning. As a result we disabled
Quill and everything went back to normal.
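As a rough aid for reading excerpts like the QuillLog above, here is a minimal Python sketch that measures how long each condor_read call waited before its timeout was logged. It is not part of Condor: `parse_log` and `read_timeouts` are hypothetical helper names, the sample lines are copied from the excerpt above, and the year 2010 is an assumption because the log timestamps omit it.

```python
import re
from datetime import datetime

# A few lines copied from the QuillLog excerpt (IP placeholders kept as-is).
SAMPLE_LOG = """\
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
07/02 20:37:59 SECMAN: no classad from server, failing
07/02 20:38:06 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
"""

# "MM/DD HH:MM:SS message" as it appears in the excerpt.
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_log(text, year=2010):
    """Return (timestamp, message) pairs; the year is assumed, not logged."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip wrapped continuation lines without a timestamp
        stamp = datetime.strptime(f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S")
        entries.append((stamp, match.group(2)))
    return entries


def read_timeouts(entries):
    """Seconds between each condor_read(fd=...) call and its timeout message."""
    waits = []
    started = None
    for stamp, message in entries:
        if message.startswith("condor_read(fd="):
            started = stamp
        elif "timeout reading" in message and started is not None:
            waits.append((stamp - started).total_seconds())
            started = None
    return waits


entries = parse_log(SAMPLE_LOG)
waits = read_timeouts(entries)
print(waits)
```

Run against a full QuillLog, the same two helpers would show whether every failure follows a read that waited roughly the configured timeout=20, which would point to the parent daemon not answering in time.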
I came across several related posts
but we have had no luck figuring out the culprit:
Also, when Quill was initially enabled,
the postgres tables were populated as expected and everything looked good.
Quill, postgres, and the central manager (CM) all run on the
same server, but because our pool is small (~50 cores) we did not
think this would be a problem. Our server and nodes all run Windows XP.
We are using NTSSPI and SSL for security.
Does anyone have any thoughts?
Thank you for your comments,