
Re: [Condor-users] Quill++ assistance



The log file is from a different machine. I do have the Postgres database on the same server as the central manager, and that may be what is causing the problem. My load is about 50% with the Condor central manager and the database running together. I will try moving the database onto a different server and let the list know what I find out.

Thank you for the suggestions,
Mike


-----condor-users-bounces@xxxxxxxxxxx wrote: -----

To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
From: Erik Paulson <epaulson@xxxxxxxxxxxx>
Sent by: condor-users-bounces@xxxxxxxxxxx
Date: 08/23/2010 01:08PM
Subject: Re: [Condor-users] Quill++ assistance

On Wed, Aug 11, 2010 at 8:49 AM, Michael O'Donnell <odonnellm@xxxxxxxx> wrote:
>
> I have been trying to set up Quill for our pool so we can track HTC use. I
> have followed the Condor manual for both the configuration files and
> Postgres. Quill works for several hours, but then most of the machines are
> dropped from the pool according to Quill. For example, if I enable Quill,
> everything seems to work for at least several hours, but usually by the
> next morning Quill is not tracking any of the machines and all machines
> are dropped from the pool (as seen via condor_status). The Condor daemons
> are still running on each machine, however.

Machines should not ever drop out of the pool when Quill is enabled -
on the execute/worker nodes, the Quill load is negligible - they write
out a few lines to a local file, and the Quill daemon on that machine
reads it and sends it over the network to a database. The startd never
blocks or even really knows if Quill exists.

Now, if your Postgres database is running on the same machine as your
central manager, then perhaps the load on the machine from both the
database and the collector/central manager is causing updates to be
dropped.

>
> This seems to be related to the password/security based on the errors I am
> receiving below, but the database tables are populated, all the sql log
> files have information and everything looks ok.
>

It could be a security setting problem - exiting with status 4 means
the daemon decided that something was wrong. However, if load on the
machine is sky-high at the time it could explain why the parent
process of the quill daemon didn't answer for the initial keep-alive.
Is the log below from the same machine that hosts the central manager?
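
One way to check the load theory is to look for long pauses between
consecutive log entries. The following is only a rough Python sketch, not
part of Condor: the function names, the 15-second threshold, and the
default year are my own assumptions. It relies only on the "MM/DD HH:MM:SS"
timestamp prefix visible in the QuillLog excerpt in this thread, and it
skips wrapped continuation lines that carry no timestamp. Pauses close to
the timeout=20 shown in the condor_read lines would suggest that the other
end was not answering in time.

```python
import re
from datetime import datetime

# Entries in the quoted QuillLog start with "MM/DD HH:MM:SS"; an optional
# leading "> " (mail quoting) is tolerated so pasted excerpts also work.
STAMP = re.compile(r"^(?:>\s*)*(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

def parse_entries(lines, year=2010):
    """Yield (datetime, message) for every line that begins with a timestamp.

    The log does not record a year, so one is supplied by the caller
    (the default here is an assumption).
    """
    for line in lines:
        m = STAMP.match(line.strip())
        if m:
            when = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
            yield when, m.group(2)

def find_stalls(lines, threshold_s=15, year=2010):
    """Return (gap_seconds, previous_message, next_message) for each pair of
    consecutive entries separated by at least threshold_s seconds."""
    stalls = []
    prev = None
    for when, msg in parse_entries(lines, year):
        if prev is not None:
            gap = (when - prev[0]).total_seconds()
            if gap >= threshold_s:
                stalls.append((gap, prev[1], msg))
        prev = (when, msg)
    return stalls
```

To try it on a real log, pass the lines of the file, for example
find_stalls(open(path)), where path is wherever your QuillLog lives.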

-Erik

> I have a homogeneous pool of Windows worker nodes, and our central
> manager runs on Windows Server 2008. Postgres is also running on this
> same server. Our bandwidth is 1 Gb/s and our pool is small (50 machines
> right now).
>
> Can anyone help me understand what I may be doing wrong or what the problem
> might be related to?
>
> Thank you for the help,
> Mike
>
>
> The administrator is receiving an email reporting that condor_quill.exe
> has exited (exit 4):
>
> *** Last 20 line(s) of file C:/Condor/log/QuillLog:
> SessionDuration = "86400"
> NewSession = "YES"
> RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
> ServerCommandSock = "<IP:4555>"
> Command = 60010
> AuthCommand = 60008
> 08/10 20:00:41 condor_write(fd=1704 <IP:1046>,,size=514,timeout=20,flags=0)
> 08/10 20:00:47 condor_read(fd=1704 <IP:1046>,,size=5,timeout=20,flags=0)
> 08/10 20:01:03 condor_read(): fd=1704
> 08/10 20:01:24 condor_read(): select returned 0
> 08/10 20:01:48 condor_read(): timeout reading 5 bytes from
> <159.189.162.50:1046>.
> 08/10 20:01:49 IO: Failed to read packet header
> 08/10 20:01:50 Stream::get(int) failed to read padding
> 08/10 20:01:51 Failed to read ClassAd size.
> 08/10 20:01:52 SECMAN: no classad from server, failing
> 08/10 20:01:53 CLOSE <IP:4610> fd=1704
> 08/10 20:01:54 SECMAN: unable to create security session to
> <159.189.162.50:1046> via TCP, failing.
> 08/10 20:01:55 ERROR: SECMAN:2004:Failed to create security session to
> <159.189.162.50:1046> with TCP.|SECMAN:2007:Failed to end classad message.
> 08/10 20:01:56 DaemonCore: startCommand() to <159.189.162.50:1046> failed.
> SendAliveToParent() failed.
> 08/10 20:02:17 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
> <IP:1046>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
> *** End of file QuillLog
>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>