Re: [Condor-users] Quill++ assistance



What are the sizes of the *sql.log files on each machine?
It could be that the quill daemon is so far behind that
it is trying to upload lots of back data to the postgres
server and never catches up within the 1 hour window before the
startd kills it off.  You can get around that by
launching condor_quill from the command line; it will then
crank until it catches up.  If you're in that state,
you want to do that operation in a staggered fashion, just one
or two machines at a time, so you don't have
every quill daemon pounding on the postgres server at once.
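
As a rough way to act on this, the sketch below lists the size of each *sql.log file in a directory and then groups hosts into batches of two for a staggered catch-up run. It is only a sketch: the helper names, the host names, the byte counts and the idea of handling the largest backlog first are all assumptions made for illustration, not part of Condor, and the script deliberately stops short of invoking condor_quill itself.

```python
from pathlib import Path


def sql_log_sizes(log_dir):
    """Return {file name: size in bytes} for every *sql.log file in log_dir."""
    return {p.name: p.stat().st_size for p in sorted(Path(log_dir).glob("*sql.log"))}


def staggered_batches(backlog_by_host, batch_size=2):
    """Group hosts into batches of batch_size, largest backlog first.

    backlog_by_host maps a host name to its total sql.log size in bytes.
    """
    hosts = sorted(backlog_by_host, key=backlog_by_host.get, reverse=True)
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


if __name__ == "__main__":
    # Hypothetical figures, as if collected by running sql_log_sizes on each node.
    backlog = {"node01": 52_000_000, "node02": 1_200, "node03": 730_000_000}
    for batch in staggered_batches(backlog):
        print("run condor_quill by hand on:", ", ".join(batch))
```

Each batch would be allowed to finish catching up before the next one is started, so that no more than two quill daemons hit the postgres server at the same time.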

It could also be that you've got some kind of garbage in
your *sql.log files that quill can't process properly. I recently ran
into a nasty bug of that flavor in condor 7.4.1 (and previous
versions): if there is even one malformed
statement in the sql.log file, quill attempts to load
the whole contents of the sql.log file into the errors table,
often filling up all the disk on the postgres server, failing,
and then rolling back the transaction so nothing is accomplished.
That's supposed to be fixed in condor 7.4.3 but I haven't checked it yet.

D_FULLDEBUG in the quill logs would tell the story on either
of these two possibilities. The third possibility is that
you've got some kind of security setting missing so the
daemons either can't connect to the collector or can't stay
connected to the collector.

Finally, look at your postgres logs (data/pg_log/pgstart.log).
If your postgres database isn't tuned properly, and many are not,
they will give you tuning suggestions to make it better. Also be
sure to get a good VACUUM FULL ANALYZE VERBOSE done.
If you are dominated by schedd quill data, it turns out that in
condor 7.4.1 and before, one of the tables (the biggest one)
doesn't get reindexed properly by the condor_dbmsd
quill_reindextables() function, and you have to reindex it manually.

There are lots of reasons they are proposing to move condor_quill back to
a "contributed" status from the main 7.6 release; the above is just part of it.

Steve

On Thu, 26 Aug 2010, Michael O'Donnell wrote:

I have not examined the time intervals of the Quill daemons dying for our
pool, but I get hundreds of emails stating the quill daemon died and has
restarted on each machine. I have been trying to get Quill to work with
Windows as well, and I have been posting on this topic to this list. I
mentioned earlier that I have the postgres database on the same server as our
CM. I was going to try installing postgres on a different server, but I
have not gotten around to this yet. I am pretty sure this is not the
problem, but it is something for me to try. I have also noticed that the
Quill daemon on our CM does not seem to die, but the Quill daemons on all
working nodes die on a regular basis. I have not determined why this is
the case; the only difference is the OS. Our server is running Windows Server
2008 and our working nodes are 32/64-bit Windows XP and Windows 7.

Mike





From: <Greg.Hitchen@xxxxxxxx>
To: <condor-users@xxxxxxxxxxx>
Date: 08/25/2010 08:07 PM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx




That's correct, no other daemons are restarting, just condor_quill.

Interestingly, now that I have installed this version onto another
few PCs, the 1hr 25min is not EXACT. Two PCs that I "synched" yesterday
by restarting condor at the same time are now 2-3 minutes apart on
their condor_quill restarts. Maybe the condor_master restarting
condor_quill after 10secs isn't exact and the time diff gradually builds
up? I'll keep an eye on it.
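
One way to sanity-check that guess is plain arithmetic: if every cycle is the 1hr 25min run plus a restart delay, a small per-cycle difference between two PCs adds up linearly. The sketch below assumes, purely for illustration, that about one day has passed since the PCs were synched and that they are now 2.5 minutes apart; neither figure is a measurement, and the function names are made up for this example.

```python
CYCLE_SECONDS = 85 * 60  # the observed 1hr 25min between condor_quill restarts


def cycles_in(elapsed_seconds, cycle_seconds=CYCLE_SECONDS):
    """Number of complete restart cycles that fit in elapsed_seconds."""
    return elapsed_seconds // cycle_seconds


def per_cycle_drift(gap_seconds, elapsed_seconds, cycle_seconds=CYCLE_SECONDS):
    """Average per-cycle timing difference needed to build up gap_seconds."""
    return gap_seconds / cycles_in(elapsed_seconds, cycle_seconds)


if __name__ == "__main__":
    one_day = 24 * 60 * 60  # assumption: the PCs were synched about a day ago
    gap = 150               # assumption: the restarts are now 2.5 minutes apart
    print("complete cycles:", cycles_in(one_day))
    print("seconds of drift per cycle:", round(per_cycle_drift(gap, one_day), 1))
```

Under those assumptions the two PCs would have to differ by roughly nine seconds on every cycle, which is about as large as the 10 sec restart delay itself, so logging the actual restart times over a few days would show whether the gap really grows steadily.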

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Thursday, 26 August 2010 4:16 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance

And just to confirm, it's only Quill - none of the other daemons show
the same restart every hour and twenty-five minutes?

-Erik


On Wed, Aug 25, 2010 at 1:12 AM,  <Greg.Hitchen@xxxxxxxx> wrote:
Hi Erik

The 1hr 25 mins is definitely not related (as far as I can tell) to virus
scans/server activity/etc. I've checked all the scheduled activities that our
PCs get installed with and nothing "fits".

In addition I have installed 7.4.3 onto several PCs now and they all exhibit
the 1hr 25min restart of condor_quill, and it always starts exactly 1hr 25mins
after condor is started, i.e. any time I do a condor net stop, condor net start
on them, the first of the 1hr 25min restarts begins 1hr 25mins after this.

There is a dprintf_failure.QUILL file created but it is empty and 0 bytes in
size. No core file is created and condor_quill quite happily gets restarted by
condor_master after 10 secs until the MasterLog again says it exits with
error 44 after the next 1hr 25 mins. Nothing gets logged in the QuillLog.

Cheers

Greg
________________________________
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Tuesday, 24 August 2010 3:46 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance


Greg: The "exit 44" issue is odd - status 44 means that Condor couldn't log
some piece of information (which is why you don't see anything in the logs :).
While I wouldn't rule anything in Condor out, 1:25:00 is not a number that
strikes me as special in any of the Condor code, so I'm not sure what would
happen on the Condor side with that periodicity. Are there any file
server/virus scans/etc sort of activity that might interfere with writes to
files that happen at your site?

Greg/Michael: the ACCESS_VIOLATION is happening in a strange spot. To answer
your question, the Quill daemon should run continuously - however, if it is
consistently crashing, the master will exponentially back off trying to run
it until it only tries once an hour - so it may be likely that you'll see a
core file with no Quill daemon running.

If that's the case and it is consistently crashing, I would love to see your
full QuillLog, along with your sql.log file. We should be able to play it
back and see exactly why it's crashing.
Thanks,
-Erik
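
The back-off described above can be sketched as follows. The formula and the values used here (a constant of 9 seconds, a factor of 2 and a ceiling of 3600 seconds, in the spirit of the MASTER_BACKOFF_CONSTANT, MASTER_BACKOFF_FACTOR and MASTER_BACKOFF_CEILING settings) are assumptions to be checked against the manual for your Condor version; this is an illustration of the idea, not the master's actual code.

```python
def restart_delay(restarts, constant=9.0, factor=2.0, ceiling=3600.0):
    """Seconds to wait before the next restart attempt.

    restarts is the number of restarts already made (0 for the first one).
    The default values are assumed, not read from any Condor configuration.
    """
    return min(constant + factor ** restarts, ceiling)


if __name__ == "__main__":
    # Print the delay for the first few consecutive restarts.
    for n in range(14):
        print(f"restart {n:2d}: wait {restart_delay(n):7.1f} s")
```

With these assumed values the first restart comes after 10 seconds, and a daemon that keeps crashing reaches the one-hour ceiling after about a dozen consecutive restarts.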

On Wed, Aug 11, 2010 at 8:48 PM, <Greg.Hitchen@xxxxxxxx> wrote:

Perhaps not much help Michael but we've had similar problems with 7.2.4 on
windows (see first attached email). It behaved somewhat better for 7.4.1 (see
second attached email) and at least ran, even though restarting condor_quill
every 1hr 25mins, but a number of other problems/issues with the 7.4 series
have not allowed us to upgrade to that version yet.

Cheers

Greg

________________________________
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
Sent: Thursday, 12 August 2010 3:56 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance


I have these specified already and I do not see any issues. The QuillLog
file shows SQL statements and success at populating the tables.

However, I am finding a file on all machines other than the central manager
that has an access violation error. I am not sure if the condor_quill.exe
daemon is supposed to run continuously, but I do not see it running on any
machines other than the central manager.

The file that is showing up in the log directory on each machine is called
core.QUILL.WIN32. Its contents are (does this mean anything to anyone else?):

<...>

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/






--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.