
Re: [Condor-users] Quill errors again

On Wed, 26 Jul 2006, Dr Ian C. Smith wrote:


I'm getting exactly the same errors with Quill as
reported in:


condor-admin is working a ticket with me right now and just sent me
a pre-release debug condor_quill that they claim fixes the duplicate
key problem, and so far has.

Namely, that the condor_q with quill stops reporting
after a while. It also starts to print out records like:

--- ???? ---
--- ???? ---
--- ???? ---

before this. Looking at the PostgreSQL log there are
a whole load of errors of the form:

ERROR:  duplicate key violates unique constraint "procads_str_pkey"
ERROR:  duplicate key violates unique constraint "procads_num_pkey"
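
One rough way to see which constraints are being hit, and how often, is to
tally those lines from the PostgreSQL log. The following is only a sketch:
it assumes the log lines have exactly the form quoted above, and the
function name and sample data are made up for illustration.

```python
import re
from collections import Counter

# Matches error lines of the form quoted above and captures the constraint name.
DUPLICATE_KEY = re.compile(
    r'ERROR:\s+duplicate key violates unique constraint "([^"]+)"'
)


def count_duplicate_key_errors(log_lines):
    """Return a Counter mapping constraint name -> number of violations seen."""
    counts = Counter()
    for line in log_lines:
        match = DUPLICATE_KEY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; in practice, pass an open log file instead.
    sample = [
        'ERROR:  duplicate key violates unique constraint "procads_str_pkey"',
        'ERROR:  duplicate key violates unique constraint "procads_num_pkey"',
        'LOG:  some unrelated message',
        'ERROR:  duplicate key violates unique constraint "procads_str_pkey"',
    ]
    for constraint, n in count_duplicate_key_errors(sample).most_common():
        print(constraint, n)
```

Because an open file iterates line by line, the same function can be handed
a file object for the real log.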

You need to make sure the autovacuum flag is turned on in your postgres;
otherwise the quill database fills up with deleted rows.  Our postgres
database grew to 14GB of disk space, but once we cleaned it out there was
only some 900MB of useful information.  In such a state it takes near
forever (several minutes) for condor_q to finish.  Sometimes you see the
--- ???? --- lines too, but this is a transient thing reflecting a job
that's only partially loaded into (or out of) the database by quill.

In particular you have to have in postgresql.conf
autovacuum = on
and set autovacuum_naptime, autovacuum_vacuum_threshold, etc.
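
A quick sanity check of postgresql.conf can be sketched as below. It rests
on assumptions: that the server is an 8.1-era PostgreSQL, where autovacuum
only starts if stats_start_collector and stats_row_level are also on, and
that the file uses plain "name = value" lines. The parser is deliberately
simplified (it ignores quoting rules and the "name value" form without an
equals sign), and the function names are hypothetical.

```python
# Settings assumed necessary for autovacuum on an 8.1-era server.
REQUIRED_ON = ("autovacuum", "stats_start_collector", "stats_row_level")


def parse_conf(text):
    """Parse 'name = value' lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # simplified: lines without '=' are ignored
        settings[name.strip().lower()] = value.strip().strip("'").lower()
    return settings


def missing_autovacuum_settings(text, required=REQUIRED_ON):
    """Return the required settings that are absent or not switched on."""
    settings = parse_conf(text)
    enabled = ("on", "true", "yes", "1")
    return [name for name in required if settings.get(name) not in enabled]


if __name__ == "__main__":
    # Hypothetical file contents, with one setting still commented out.
    sample = """
autovacuum = on              # as recommended above
stats_start_collector = on
#stats_row_level = off
autovacuum_naptime = 60
"""
    print(missing_autovacuum_settings(sample))
```

An empty list means every setting in the assumed list is switched on; the
list itself should be adjusted to match the PostgreSQL release in use.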

I've noticed that there are several instances of condor_quill running,
so is it the case that these are trying to write to the DB at the same
time, causing a contention problem?

There should only be one condor_quill running per schedd.  Do you have
multiple schedds running on the same machine?

The condor_config file contains a comment that seems to pertain to this:

# The Postgreql server requires usernames that can manipulate tables. This
# be the username associated with this instance of the quill daemon
# a schedd's job queue. Each quill daemon must have a unique username
# associated with it otherwise multiple quill daemons will corrupt the data
# held under an indentical user name.
QUILL_DB_NAME = quill_db

The statement is sort of misleading, because what QUILL_DB_NAME
really sets is the name of the database within postgres, not the
postgres user names, which as far as I can tell are quillwriter and
quillreader for all instances.  I believe what has to happen is that
each separate quill instance should have a unique QUILL_NAME
and a unique QUILL_DB_NAME, which will translate to several different
databases, all of which can be served by the same instance of postgres.
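
If several quill instances are configured, a small check along these lines
can flag any that share a QUILL_NAME or QUILL_DB_NAME. This is a sketch
only: the instance labels and values are invented, and the settings are
assumed to have already been read into one dictionary per instance.

```python
from collections import defaultdict


def find_shared_values(instances, keys=("QUILL_NAME", "QUILL_DB_NAME")):
    """Return {key: {value: [instance labels]}} for values used more than once."""
    clashes = {}
    for key in keys:
        users = defaultdict(list)
        for label, config in instances.items():
            if key in config:
                users[config[key]].append(label)
        shared = {value: labels for value, labels in users.items()
                  if len(labels) > 1}
        if shared:
            clashes[key] = shared
    return clashes


if __name__ == "__main__":
    # Hypothetical settings for two quill instances on one host.
    instances = {
        "schedd_a": {"QUILL_NAME": "quill_a@host.example",
                     "QUILL_DB_NAME": "quill_db"},
        "schedd_b": {"QUILL_NAME": "quill_b@host.example",
                     "QUILL_DB_NAME": "quill_db"},
    }
    # Both instances point at the same database name, so it is reported.
    print(find_shared_values(instances))
```

An empty result means every instance has its own QUILL_NAME and its own
QUILL_DB_NAME.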

although I can't quite see what it means. Should each condor_quill write to
a separate PostgreSQL DB or use a separate username? If only a single
quill daemon runs would this solve the problem (how is this configured?).

There is supposed to be one quill per schedd, running on the same machine
that the schedd is running on.

Presumably this would just move the bottleneck to the RDBMS though?

Your bottleneck is most likely the RDBMS anyway right now, and
getting it duly vacuumed of all the deleted records is the path to performance.

Any thoughts,


Also, any quill < 6.8.0 is likely to crash frequently with segmentation
faults due to a bunch of buffer overflow conditions that were
only recently fixed.  Even the 6.7.20 quill as released has
some problems in that regard.

Steve Timm

Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department

Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team