
Re: [Condor-users] Quill errors again





--On 27 July 2006 09:17 -0500 Steven Timm <timm@xxxxxxxx> wrote:

On Thu, 27 Jul 2006, Dr Ian C. Smith wrote:

Tim,

I've set the autovacuum on and set the parameters to their
defaults. Presumably Condor purges the DB periodically as well?

Only if QUILL_MANAGE_VACUUM is set to true.  Manual recommends
setting it false and letting postgres do its own vacuuming.


I'm still getting the same problem with condor_q though and the
same messages are appearing in the postgresql and condor_quill
logs. I appear to have multiple condor_quill instances despite
having only one condor_schedd. The condor_master seems to
spawn three instances from the outset and more come and go as
time goes on. Could this be the source of the problem, and is
there any way of preventing these multiple quill daemons?

cheers,

-ian.

What's the value of DAEMON_LIST?  In fact, what are the values
of all the various QUILL variables in your condor_config?
Something sounds really fishy.  Also, you didn't say which version you are
running.

Using 6.7.20 on Solaris 9. The QUILL variables are below and are pretty
much the defaults.

-ian.

DAEMON_LIST                     = MASTER, SCHEDD, QUILL
QUILL = $(SBIN)/condor_quill
#QUILL_ARGS =
QUILL_LOG = $(LOG)/QuillLog
QUILL_ADDRESS_FILE = $(LOG)/.quill_address
# If this is set to true, then the rest of the QUILL arguments must be defined
QUILL_ENABLED = TRUE
QUILL_NAME = quill@xxxxxxxxxxxxxxx
QUILL_DB_NAME = quill_db
QUILL_DB_QUERY_PASSWORD = xxx
QUILL_DB_IP_ADDR = ulgp2.liv.ac.uk:5432
QUILL_POLLING_PERIOD = 10
QUILL_HISTORY_DURATION  = 30
# Number of hours between scans of QUILL_HISTORY_DURATION.
QUILL_HISTORY_CLEANING_INTERVAL = 24
QUILL_IS_REMOTELY_QUERYABLE = TRUE
#QUILL_DEBUG = D_FULLDEBU




Steve



--On 26 July 2006 09:18 -0500 Steven Timm <timm@xxxxxxxx> wrote:

On Wed, 26 Jul 2006, Dr Ian C. Smith wrote:

Hi,

I'm getting exactly the same errors with Quill as
reported in:

https://lists.cs.wisc.edu/archive/condor-users/2005-December/msg00005.shtml

condor-admin is working a ticket with me right now and just sent me
a pre-release debug condor_quill that they claim fixes the duplicate
key problem, and so far has.


Namely, that the condor_q with quill stops reporting
after a while. It also starts to print out records like:

--- ???? ---
--- ???? ---
--- ???? ---

before this. Looking at the postgresql log there are
a whole load of errors of the form:

ERROR:  duplicate key violates unique constraint "procads_str_pkey"
ERROR:  duplicate key violates unique constraint "procads_num_pkey"
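
To get a feel for how often each constraint is being hit, a short Python
sketch along these lines could tally the errors in the PostgreSQL log. It
only assumes the message format shown above; the function name
count_duplicate_key_errors is made up for illustration and is not part of
Condor or PostgreSQL.

```python
import re
from collections import Counter

# Matches lines such as:
#   ERROR:  duplicate key violates unique constraint "procads_str_pkey"
DUPLICATE_KEY = re.compile(
    r'ERROR:\s+duplicate key violates unique constraint "([^"]+)"'
)


def count_duplicate_key_errors(lines):
    """Return a Counter mapping constraint name -> number of errors seen.

    `lines` is any iterable of log lines (a list, or an open log file).
    Lines that do not contain a duplicate-key error are ignored.
    """
    counts = Counter()
    for line in lines:
        match = DUPLICATE_KEY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to the function (the log location depends on how
your server is set up) gives per-constraint totals, which makes it easier
to see whether the errors stop after a configuration change.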

You need to make sure the autovacuum flag is turned on in your postgres,
otherwise what happens is that the quill database gets full
of deleted rows.  Our postgres database grew to 14GB of disk space
but once we cleaned it out there was only some 900MB of useful
information.  In such a state condor_q takes near forever (several
minutes) to finish.  Sometimes you see the --- ???? ---
lines too, but this is a transient thing reflecting a job that's only
partially loaded into (or out of) the database by quill.

In particular you have to have in postgresql.conf
autovacuum = on
and set autovacuum_naptime, the autovacuum thresholds
(autovacuum_vacuum_threshold, autovacuum_analyze_threshold), etc.
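
A quick way to confirm what postgresql.conf actually sets is to parse it,
as in the Python sketch below. This is a simplified reader written for
illustration (parse_postgresql_conf and explicitly_on are hypothetical
helper names): it does not handle include directives or a '#' inside a
quoted value, and a parameter that is absent falls back to a server
default that depends on the PostgreSQL version.

```python
def parse_postgresql_conf(text):
    """Parse 'name = value' lines from postgresql.conf text into a dict.

    '#' starts a comment, the '=' is optional, and later entries override
    earlier ones. Names are lower-cased; single quotes around values are
    stripped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if "=" in line:
            name, value = line.split("=", 1)
        else:
            parts = line.split(None, 1)  # 'name value' form without '='
            if len(parts) != 2:
                continue
            name, value = parts
        settings[name.strip().lower()] = value.strip().strip("'")
    return settings


def explicitly_on(settings, name):
    """True only if `name` is present and set to a true-like value."""
    return settings.get(name, "").lower() in {"on", "true", "yes", "1"}
```

For example, explicitly_on(parse_postgresql_conf(text), "autovacuum")
returns False both when the line is commented out and when it is set to
off, so either case is worth a look in the file itself.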



I've noticed that there are several instances of condor_quill running.
Is it the case that these are trying to write to the DB at the same
time, causing a contention problem?

There should only be one condor_quill running per schedd.  Do you have
multiple schedds running on the same machine?
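
One way to see how many condor_quill processes exist, and which process
started each of them, is to look at the parent PID column of ps. The
Python sketch below parses the text output of `ps -eo pid,ppid,comm`
(an assumption about the ps options available on your system; the helper
name quill_processes is hypothetical).

```python
import os


def quill_processes(ps_output, name="condor_quill"):
    """Return (pid, ppid) pairs for processes whose command is `name`.

    `ps_output` is the text printed by `ps -eo pid,ppid,comm`. The header
    row and any malformed rows are skipped. The command column may be a
    bare name or a full path, so only its basename is compared.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, ppid, comm = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header row or unexpected content
        if os.path.basename(comm.strip()) == name:
            found.append((int(pid), int(ppid)))
    return found
```

Comparing the parent PIDs against the PID of condor_master shows which
of the quill processes the master started directly and which were
started by some other process.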



The condor_config file contains a comment that seems to pertain to this:

# The PostgreSQL server requires usernames that can manipulate tables.
# This will be the username associated with this instance of the quill
# daemon mirroring a schedd's job queue. Each quill daemon must have a
# unique username associated with it otherwise multiple quill daemons
# will corrupt the data held under an identical user name.
QUILL_DB_NAME = quill_db

The statement is sort of misleading, because what QUILL_DB_NAME
really sets is the name of the database within postgres, not the
postgres user names, which as far as I can tell are quillwriter and
quillreader for all.  I believe what has to happen is that
each separate QUILL instance should have a unique QUILL_NAME
and a unique QUILL_DB_NAME, which will translate to several different
databases but all of which can be served by the same instance of
postgres.
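
If several quill instances are configured, the uniqueness suggested
above can be checked mechanically. The Python sketch below is only an
illustration: it reads plain NAME = value lines, does not expand $(...)
macros or follow included files, treats names case-insensitively, and
the helper names parse_condor_config and find_shared_values are made up
for this example.

```python
def parse_condor_config(text):
    """Parse simple 'NAME = value' lines; lines starting with '#' are
    comments. Later definitions override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def find_shared_values(configs, keys=("QUILL_NAME", "QUILL_DB_NAME")):
    """Report settings that more than one quill instance has in common.

    `configs` maps a label (e.g. a host name) to that instance's config
    text. Returns a list of (key, value, [labels]) for every value of
    `keys` that appears under more than one label.
    """
    clashes = []
    for key in keys:
        seen = {}
        for label, text in configs.items():
            value = parse_condor_config(text).get(key)
            if value is not None:
                seen.setdefault(value, []).append(label)
        for value, labels in seen.items():
            if len(labels) > 1:
                clashes.append((key, value, labels))
    return clashes
```

An empty result means every instance has its own QUILL_NAME and
QUILL_DB_NAME; any entry in the list names the setting and the
instances that share it.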

although I can't quite see what it means. Should each condor_quill
write to a separate postgresql DB or use a separate username? If only
a single quill daemon runs, would this solve the problem (and how is
this configured)?

There is supposed to be one quill per schedd, running on the same machine
that the schedd is running on.

Presumably this would just move the bottleneck to the RDBMS though?

Your bottleneck is most likely the RDBMS anyway right now and
getting it duly vacuumed of all the deleted records is the path to
performance.


Any thoughts,

-ian.


Also any quill < 6.8.0 is likely to frequently crash with segmentation
faults due to a bunch of buffer overflow conditions that were
just recently fixed.  Even the 6.7.20 quill as released has
some problems in that regard.

Steve Timm




-----------------------------------
Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department



_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx
http://home.fnal.gov/~timm/ Fermilab Computing Div/Core Support Services
Dept./Scientific Computing Section Assistant Group Leader, Farms and
Clustered Systems Group
Lead of Computing Farms Team