
Re: [Condor-users] Quill++ assistance

Hi Erik
The 1hr 25 mins is definitely not related (as far as I can tell) to virus scans/server activity/etc.
I've checked all the scheduled activities that our PCs are installed with and nothing "fits".
In addition, I have now installed 7.4.3 on several PCs and they all exhibit the 1 hr 25 mins restart
of condor_quill. It always starts exactly 1 hr 25 mins after condor is started, i.e. any time
I do a condor net stop, condor net start on them, the first of the 1 hr 25 mins restarts
occurs 1 hr 25 mins after that.
There is a dprintf_failure.QUILL file created but it is empty and 0 bytes in size.
No core file is created and condor_quill quite happily gets restarted by condor_master after
10 secs until the MasterLog again says it exits with error 44 after the next 1hr 25 mins.
Nothing gets logged in the QuillLog.

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Tuesday, 24 August 2010 3:46 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance

Greg: The "exit 44" issue is odd - status 44 means that Condor couldn't log some piece of information (which is why you don't see anything in the logs :). While I wouldn't rule anything in Condor out, 1:25:00 is not a number that strikes me as special in any of the Condor code, so I'm not sure what would happen on the Condor side with that periodicity. Is there any file server activity, virus scanning, or similar at your site that might interfere with writes to files?

Greg/Michael: the ACCESS_VIOLATION is happening in a strange spot. To answer your question, the Quill daemon should run continuously - however, if it is consistently crashing, the master will exponentially back off trying to run it until it only tries once an hour - so it is quite possible that you'll see a core file but no Quill daemon running.
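
A minimal sketch of that back-off behaviour, assuming the delay before each restart is a constant plus a factor raised to the number of consecutive restarts, capped at one hour. The function name and the default values (9 s constant, factor 2, 3600 s ceiling) are assumptions chosen to match the 10-second first restart and the once-an-hour limit mentioned in this thread; they are not taken from the Condor source.

```python
def restart_delays(constant=9.0, factor=2.0, ceiling=3600.0, restarts=15):
    """Return the assumed wait (in seconds) before each successive restart.

    All parameter names and defaults are hypothetical, for illustration only.
    """
    delays = []
    for n in range(restarts):
        # Exponential growth in the number of consecutive restarts,
        # never exceeding the ceiling (one hour by default).
        delay = min(constant + factor ** n, ceiling)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(restart_delays(), start=1):
        print(f"restart {attempt}: wait {delay:.0f} s")
```

Under these assumptions a daemon that keeps crashing is retried quickly at first and then only once an hour, which would explain finding a core file on a machine where no Quill daemon is running at the moment you look.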

If that's the case and it is consistently crashing, I would love to see your full QuillLog, along with your sql.log file. We should be able to play it back and see exactly why it's crashing. 



On Wed, Aug 11, 2010 at 8:48 PM, <Greg.Hitchen@xxxxxxxx> wrote:
Perhaps not much help, Michael, but we've had similar problems with 7.2.4 on Windows
(see first attached email). It behaved somewhat better with 7.4.1 (see second attached email)
and at least ran, even though it restarted condor_quill every 1 hr 25 mins, but a number of other
problems/issues with the 7.4 series have not allowed us to upgrade to that version yet.

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
Sent: Thursday, 12 August 2010 3:56 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance

I have these specified already and I do not see any issues. The QuillLog file shows SQL statements and success in populating the tables.

However, I am finding a file on all machines other than the central manager that has an access violation error. I am not sure if the condor_quill.exe daemon is supposed to run continuously, but I do not see it running on any machines other than the central manager.

The file that is showing up in the log directory on each machine is called core.QUILL.WIN32. Its contents are (does this mean anything to anyone else?):