
[Condor-users] Windows XP Condor 7.4.0 Quill Issues




We have set up Quill and postgres with Condor. Everything appeared to work at first, but we then ran into several problems.

First, submitted jobs used to start with essentially no time spent Idle; now they sit Idle for over an hour before executing.

Second, after a day or so the Condor administrator started receiving emailed errors such as this one:
This is an automated email from the Condor system
on machine "IGSKBACBLT106.domain".  Do not reply.

"C:\Condor/bin/condor_quill.exe" on "IGSKBACBLT106.domain" exited with status 4.
Condor will automatically restart this process in 11 seconds.

*** Last 20 line(s) of file C:\Condor/log/QuillLog:
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
ServerCommandSock = "<IP:1662>"
Command = 60010
AuthCommand = 60008
07/02 20:37:33 condor_write(fd=1716 <IP:3596>,,size=505,timeout=20,flags=0)
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
07/02 20:37:57 IO: Failed to read packet header
07/02 20:37:58 Stream::get(int) failed to read padding
07/02 20:37:59 Failed to read ClassAd size.
07/02 20:37:59 SECMAN: no classad from server, failing
07/02 20:38:00 CLOSE <IP:1688> fd=1716
07/02 20:38:01 SECMAN: unable to create security session to <IP:3596> via TCP, failing.
07/02 20:38:02 ERROR: SECMAN:2004:Failed to create security session to <IP:3596> with TCP.|SECMAN:2007:Failed to end classad message.
07/02 20:38:05 DaemonCore: startCommand() to <IP:3596> failed. SendAliveToParent() failed.
07/02 20:38:06 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <IP:3596>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
*** End of file QuillLog
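
One thing that can be read off the excerpt above is that the wait between condor_read() starting and select returning 0 appears to match the timeout=20 shown on the condor_read call. Below is a minimal Python sketch that checks this from the log lines. The function names are hypothetical, and the year is a placeholder because QuillLog timestamps do not include one. It only measures the gap; it does not say why the parent never answered.

```python
import re
from datetime import datetime

# Lines copied verbatim from the QuillLog excerpt above.
LOG = """\
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
"""

LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
PLACEHOLDER_YEAR = "2000"  # assumption: the log carries no year, any fixed year works


def parse(log_text):
    """Return a list of (timestamp, message) tuples for timestamped lines."""
    entries = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(
                PLACEHOLDER_YEAR + "/" + m.group(1), "%Y/%m/%d %H:%M:%S"
            )
            entries.append((ts, m.group(2)))
    return entries


def configured_timeout(entries):
    """Timeout (seconds) passed to the first logged condor_read call, if any."""
    for _, msg in entries:
        m = re.search(r"condor_read\(fd=\d+ .*timeout=(\d+)", msg)
        if m:
            return int(m.group(1))
    return None


def select_wait_seconds(entries):
    """Seconds between 'condor_read(): fd=' and 'select returned 0'."""
    start = None
    for ts, msg in entries:
        if msg.startswith("condor_read(): fd="):
            start = ts
        elif msg.startswith("condor_read(): select returned 0") and start is not None:
            return (ts - start).total_seconds()
    return None


if __name__ == "__main__":
    entries = parse(LOG)
    print("configured timeout:", configured_timeout(entries))
    print("observed wait:", select_wait_seconds(entries))
```

On the four lines above, both values come out to 20 seconds, which suggests the read simply ran to its full timeout without receiving any bytes from <IP:3596>.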


Third, machines in our pool started to drop off until our pool was no longer functioning. As a result we disabled Quill and everything went back to normal.


I came across several related posts but we have had no luck figuring out the culprit:
https://lists.cs.wisc.edu/archive/condor-users/2010-March/msg00015.shtml
https://www-auth.cs.wisc.edu/lists/condor-users/2005-October/msg00402.shtml

Also, when Quill was first enabled, the postgres tables were populated as expected and everything looked good.

We run Quill, postgres and the central manager (CM) on the same server, but because our pool is small (~50 cores) we did not think this would be a problem. Our server and nodes all run Windows XP. We are using NTSSPI and SSL for security.

Does anyone have any thoughts?

Thank you for your comments,
Mike