Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Quill++ assistance

Date: Thu, 26 Aug 2010 11:30:50 -0500 (CDT)
From: Steven Timm <timm@xxxxxxxx>
Subject: Re: [Condor-users] Quill++ assistance

What are the sizes of the *sql.log files on each machine?
It could be that the quill daemon is so far behind that
it is trying to upload lots of back data to the postgres
server and never catches up within the 1 hour window before the
startd kills it off.  You can get around that by
launching condor_quill from the command line; it will then
crank until it catches up.  If you're in that state,
you want to do that operation in a staggered fashion, just one
or two machines at a time, so you don't have every quill daemon
pounding on the postgres server at once.
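
As a rough illustration of the two checks above, here is a minimal Python
sketch. It assumes the *sql.log files sit in a single log directory that you
pass in, and that you already have a list of host names; the function names
and the batch size of two are hypothetical, chosen only for this example. It
does not launch condor_quill itself; it only reports file sizes and groups
hosts so they can be handled one small batch at a time.

```python
import glob
import os


def sql_log_sizes(log_dir):
    """Return (path, size_in_bytes) for every *sql.log in log_dir, largest first."""
    paths = glob.glob(os.path.join(log_dir, "*sql.log"))
    sizes = [(path, os.path.getsize(path)) for path in paths]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def staggered_batches(hosts, batch_size=2):
    """Split hosts into consecutive groups of at most batch_size hosts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]
```

A very large sql.log on a machine is a hint that its quill daemon is far
behind; those machines are the ones to catch up by hand, one batch after
another.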

It could also be that you've got some kind of garbage in
your *sql.log files that quill can't process properly. I recently ran
into a nasty bug of that flavor in condor 7.4.1 (and previous
versions): if there is even one malformed
statement in the sql.log file, quill attempts to load
the whole contents of the sql.log file into the errors table,
oftentimes filling up all the disk on the postgres server, failing,
and then rolling back the transaction so nothing is accomplished.
That's supposed to be fixed in condor 7.4.3 but I haven't checked it yet.

D_FULLDEBUG in the quill logs would tell the story on either
of these two possibilities. The third possibility is that
you've got some kind of security setting missing so the
daemons either can't connect to the collector or can't stay
connected to the collector.

Finally, look at your postgres logs (data/pg_log/pgstart.log).
If your postgres database isn't tuned properly, and many are not,
the log will give you tuning suggestions to make it better. Also be
sure to get a good VACUUM FULL VERBOSE ANALYZE done.
If you are dominated by schedd quill data, it turns out that in
condor 7.4.1 and before, one of the tables (the biggest one)
doesn't get reindexed properly by the condor_dbmsdquill_reindextables()
function and you have to reindex it manually.
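
A minimal sketch of the manual maintenance step, written as a Python helper
that only builds the psql command lines and does not run them. The database
name and the table name are placeholders supplied by the caller (this message
does not name the table), and the helper names are hypothetical. Note that
VACUUM FULL locks each table while it works, so it is best run when quill
traffic is low.

```python
import re
import shlex


def maintenance_commands(database, big_table):
    """Build psql invocations for the vacuum and the manual reindex.

    Nothing is executed here. database and big_table are placeholders
    chosen by the caller, not names taken from Condor.
    """
    # Accept only a plain SQL identifier so the table name can be
    # embedded in the statement without quoting.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", big_table):
        raise ValueError("big_table must be a plain SQL identifier")
    statements = [
        "VACUUM FULL VERBOSE ANALYZE;",
        "REINDEX TABLE {};".format(big_table),
    ]
    return [["psql", "-d", database, "-c", sql] for sql in statements]


def as_shell(command):
    """Render one command list as a shell-safe string for copy and paste."""
    return " ".join(shlex.quote(part) for part in command)
```

Printing as_shell(cmd) for each returned command gives lines that can be
reviewed and then pasted into a shell on the postgres server.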


There are lots of reasons they are proposing to move condor_quill back to
a "contributed" status from the main 7.6 release; the above is just part of it.


Steve

On Thu, 26 Aug 2010, Michael O'Donnell wrote:

I have not examined the time intervals of the Quill daemons dying for our
pool, but I get hundreds of emails stating that the quill daemon died and was
restarted on each machine. I have been trying to get Quill to work with
Windows as well, and I have been posting on this topic to this list. I
mentioned earlier that I have the postgres database on the same server as our
CM. I was going to try installing postgres on a different server, but I
have not gotten around to this yet. I am pretty sure this is not the
problem, but it is something for me to try. I have also noticed that the
Quill daemon on our CM does not seem to die, but the Quill daemons on all
worker nodes die on a regular basis. I have not determined why this is
the case; the only difference is the OS. Our server is running Windows Server
2008 and our worker nodes are 32/64-bit Windows XP and Windows 7.

Mike





From: <Greg.Hitchen@xxxxxxxx>
To: <condor-users@xxxxxxxxxxx>
Date: 08/25/2010 08:07 PM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx




That's correct, no other daemons are restarting, just condor_quill.

Interestingly, now that I have installed this version onto another
few PCs, the 1hr 25min is not EXACT. Two PCs that I "synched" yesterday
by restarting condor at the same time are now 2-3 minutes apart on
their condor_quill restarts. Maybe the condor_master restarting
condor_quill after 10secs isn't exact and the time diff gradually builds
up? I'll keep an eye on it.

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Thursday, 26 August 2010 4:16 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance

And just to confirm, it's only Quill - none of the other daemons show
the same restart every hour and twenty-five minutes?

-Erik


On Wed, Aug 25, 2010 at 1:12 AM,  <Greg.Hitchen@xxxxxxxx> wrote:

Hi Erik

The 1hr 25mins is definitely not related (as far as I can tell) to virus
scans/server activity/etc.
I've checked all the scheduled types of activity that our PCs get installed
with and nothing "fits".

In addition I have installed 7.4.3 onto several PCs now and they all exhibit
the 1hr 25mins restart of condor_quill, and it always starts exactly
1hr 25mins after condor is started, i.e. any time I do a condor net stop,
condor net start on them, the first of the 1hr 25mins restarts
begins 1hr 25mins after this.

There is a dprintf_failure.QUILL file created but it is empty and 0 bytes in
size.
No core file is created and condor_quill quite happily gets restarted by
condor_master after 10 secs until the MasterLog again says it exits with
error 44 after the next 1hr 25mins.
Nothing gets logged in the QuillLog.

Cheers

Greg
________________________________
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Tuesday, 24 August 2010 3:46 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance


Greg: The "exit 44" issue is odd - status 44 means that Condor couldn't log
some piece of information (which is why you don't see anything in the logs
:). While I wouldn't rule anything in Condor out, 1:25:00 is not a number
that strikes me as special in any of the Condor code, so I'm not sure what
would happen on the Condor side with that periodicity. Are there any file
server/virus scans/etc sort of activity that might interfere with writes to
files that happen at your site?

Greg/Michael: the ACCESS_VIOLATION is happening in a strange spot. To answer
your question, the Quill daemon should run continuously - however, if it is
consistently crashing, the master will exponentially back off trying to run
it until it only tries once an hour - so it may be likely that you'll see a
core file with no Quill daemon running.

If that's the case and it is consistently crashing, I would love to see your
full QuillLog, along with your sql.log file. We should be able to play it
back and see exactly why it's crashing.
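
To make the back-off behaviour described above concrete, here is a minimal
Python sketch of a capped exponential back-off. The three parameters are
modeled on the MASTER_BACKOFF_CONSTANT, MASTER_BACKOFF_FACTOR and
MASTER_BACKOFF_CEILING configuration settings, but the default values (9, 2.0
and 3600 seconds) and the exact formula used here are assumptions for
illustration and should be checked against the manual for your Condor
version. The function name is hypothetical.

```python
def restart_delays(crashes, constant=9.0, factor=2.0, ceiling=3600.0):
    """Return the assumed delay in seconds before each of the first
    `crashes` restart attempts: constant + factor**n, capped at ceiling."""
    delays = []
    growth = 1.0  # holds factor ** n, built up one step per crash
    for _ in range(crashes):
        delay = min(ceiling, constant + growth)
        delays.append(delay)
        if delay < ceiling:
            # Stop growing once the cap is reached to avoid float overflow.
            growth *= factor
    return delays
```

Under these assumed values the first restart comes after 10 seconds, and
after a dozen or so consecutive crashes the delay settles at the one hour
ceiling, which is the "only tries once an hour" state.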
Thanks,
-Erik

On Wed, Aug 11, 2010 at 8:48 PM, <Greg.Hitchen@xxxxxxxx> wrote:


Perhaps not much help Michael but we've had similar problems with 7.2.4 on
windows (see first attached email). It behaved somewhat better for 7.4.1 (see
second attached email) and at least ran, even though restarting condor_quill
every 1hr 25mins, but a number of other problems/issues with the 7.4 series
have not allowed us to upgrade to that version yet.

Cheers

Greg

________________________________
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
Sent: Thursday, 12 August 2010 3:56 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance


I have these specified already and I do not see any issues. The QuillLog
file shows SQL statements and success at populating the tables.

However, I am finding a file on all machines other than the central manager
that has an access violation error. I am not sure if the condor_quill.exe
daemon is supposed to run continuously, but I do not see it running on any
machines other than the central manager.

The file that is showing up in the log directory on each machine is called
core.QUILL.WIN32. Its contents are (does this mean anything to anyone else?):

<...>

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
