[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Quill++ assistance



Hi Erik
 
I have attached the QuillLog file with D_FULLDEBUG turned on, and the MasterLog file.
The dprintf_failure.QUILL file is 0 bytes and empty. There is 120Gb of free disk space.
The sql.log file is empty as well because condor-quill restarts after 10 secs and continues
on as if nothing has happened.
 
The action happens at 10:35:15. I've included some excerpts here.
 
This following section is repeated every 10 secs in QuillLog
 
09/14 10:34:55 ******** Start of Polling Job Queue Log ********
09/14 10:34:55 Get JobQueue Polling Information
09/14 10:34:55 SQL Query = SELECT last_file_mtime, last_file_size, last_next_cmd_offset, last_cmd_offset, last_cmd_type, last_cmd_key, last_cmd_mytype, last_cmd_targettype, last_cmd_name, last_cmd_value from JobQueuePollingInfo where scheddname = 'MONAL-KB.nexus.csiro.au'
09/14 10:34:55 === Current Probing Information ===
09/14 10:34:55 fsize: 85  mtime: 1284080765
09/14 10:34:55 first log entry: 26 CreationTimestamp 1263526089
09/14 10:34:55 JOB QUEUE POLLING RESULT: NO CHANGE
09/14 10:34:55 ********* End of Polling Job Queue Log *********
 
Then the condor_quill exits with status 44 (as shown in MasterLog further down)
with NO messages in QuillLog BUT the last entry is part way through the
Polling Job Queue section when I presume it's reading job_queue.log?
 

09/14 10:35:15 ******** Start of Polling Job Queue Log ********
09/14 10:35:15 Get JobQueue Polling Information
09/14 10:35:15 SQL Query = SELECT last_file_mtime, last_file_size, last_next_cmd_offset, last_cmd_offset, last_cmd_type, last_cmd_key, last_cmd_mytype, last_cmd_targettype, last_cmd_name, last_cmd_value from JobQueuePollingInfo where scheddname = 'MONAL-KB.nexus.csiro.au'
09/14 10:35:15 === Current Probing Information ===
09/14 10:35:15 fsize: 85  mtime: 1284080765
09/14 10:35:27 Locale: English_United States.1252
09/14 10:35:27 ******************************************************
09/14 10:35:27 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
09/14 10:35:27 ** C:\PROGRA~1\condor\bin\condor_quill.exe
09/14 10:35:27 ** SubsystemInfo: name=QUILL type=DAEMON(11) class=DAEMON(1)
09/14 10:35:27 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
09/14 10:35:27 ** $CondorVersion: 7.4.3 Aug  4 2010 BuildID: 261829 $
09/14 10:35:27 ** $CondorPlatform: INTEL-WINNT50 $
09/14 10:35:27 ** PID = 7680
09/14 10:35:27 ** Log last touched 9/14 10:35:15
09/14 10:35:27 ******************************************************
09/14 10:35:27 Using config source: c:\PROGRA~1\condor\condor_config
09/14 10:35:27 Using local config sources:
09/14 10:35:27    C:\PROGRA~1\condor/condor_config.local
09/14 10:35:27 DaemonCore: Command Socket at <130.116.146.156:9983>
09/14 10:35:27 Will use UDP to update collector condor-wa.csiro.au <130.116.24.145:9618>
09/14 10:35:27 main_init() called
09/14 10:35:27 configuring tt options from config file
09/14 10:35:27 Using Polling Period = 10
09/14 10:35:27 Using logs 09/14 10:35:27 C:\PROGRA~1\condor/log/sql.log 09/14 10:35:27
09/14 10:35:27 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
09/14 10:35:27 Using Database Type = Postgres
09/14 10:35:27 Using Database IpAddress = condorquill.csiro.au:5432
09/14 10:35:27 Using Database Name = quilldatabase
09/14 10:35:27 Using Database User = quillwriter
MasterLog showing condor_quill exiting with status 44 and restarting 10 secs later.
 
09/14 10:35:15 Received UDP command 60011 (DC_NOP) from  <130.116.146.156:9380>, access level ALLOW
09/14 10:35:15 Calling HandleReq <handle_nop()> (0)
09/14 10:35:15 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.000s)
09/14 10:35:15 DaemonCore: pid 6556 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
09/14 10:35:15 The QUILL (pid 6556) exited with status 44
09/14 10:35:15 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
09/14 10:35:17 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 10 seconds
09/14 10:35:17 DaemonCore: return from reaper for pid 6556
09/14 10:35:27 Started DaemonCore process "C:\PROGRA~1\condor/bin/condor_quill.exe", pid and pgroup = 7680
09/14 10:35:27 Received UDP command 60008 (DC_CHILDALIVE) from  <130.116.146.156:9380>, access level DAEMON
09/14 10:35:27 Calling HandleReq <HandleChildAliveCommand> (0)
09/14 10:35:27 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
As I've mentioned before this is NOT a show stopper as condor_quill restarts and continues on OK.
 
The whole condor_quill thing seems to chug along mostly OK, SO LONG AS NOT MANY JOBS ARE SUBMITTED.
Then we strike problems. I think it's associated with many 1000's of jobs being submitted (we have max_jobs_running = 200)
resulting in large job_queue.log files that seem to take condor_quill a while to process? In any case once condor_quill
is "busy" the sql.log (and sql.log.copy) files get "largish" and cannot get processed into the database within the
1 hr timeout limit (or the 1hr 25 restart issue). Once that happens then it can never catch up and the only way
around is to delete the sql.log files, restart, and lose all the job info. I thought I might look into manually importing
the info in from the history file using condor_load_history?
 
We have been trying the condor_quill way in the hope that we could report our monthly usage (no. jobs, no hrs, etc)
easily. In theory this works fine and a 1 or 2 line sql query will do it. But it's not much use if all the job info doesn't
get into the db and we need to resort to our old manual method of collecting all the history files off our 15+ submit
nodes, running them through condor_history, importing into XL, calculating jobs, hrs, etc.
 
Thanks.
 
Cheers
 
Greg


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Tuesday, 14 September 2010 5:26 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance

Right, the exponential backoff restart process can include some random jitter which will build up over time.

If you've got a chance, could you turn up QUILL_DEBUG to D_FULLDEBUG, and send me the QuillLog from a crashing machine, along with your sql.log file? 

Also, how much free disk space do you have? The dprintf_failure files should have some data inside of them explaining why the dprint failed, and being out of disk is one reason the dprintf can fail.

-Erik

On Wed, Aug 25, 2010 at 9:04 PM, <Greg.Hitchen@xxxxxxxx> wrote:

That's correct, no other daemons are restarting, just condor_quill.

Interestingly, now that I have installed this version onto another
few PCs, the 1hr 25min is not EXACT. Two PCs that I "synched" yesterday
by restarting condor at the same time are now 2-3 minutes apart on
their condor_quill restarts. Maybe the condor_master restarting
condor_quill after 10secs isn't exact and the time diff gradually builds
up? I'll keep an eye on it.

Cheers

Greg


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Thursday, 26 August 2010 4:16 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance

And just to confirm, it's only Quill - none of the other daemons show
the same restart every hour and twenty-five minutes?

-Erik


On Wed, Aug 25, 2010 at 1:12 AM,  <Greg.Hitchen@xxxxxxxx> wrote:
> Hi Erik
>
> The 1hr 25 mins is definitely not related (as far as I can tell) to virus
> scans/server activity/etc.
> I've checked all the scheduled type of activities that our PCs get installed
> with and nothng "fits".
>
> In addition I have installed 7.4.3 onto several PCs now and they all exhibit
> the 1hr 25 restart
> of condor_quill and it always starts exactly 1 hr 25 mins after condor is
> started, i.e. anytime
> I do a condor net stop, condor net start on them then the first of the 1hr
> 25mins restarts
> begins 1 hr 25mins after this.
>
> There is a dprintf_failure.QUILL file created but it is empty and 0 bytes in
> size.
> No core file is created and condor_quill quite happily gets restarted by
> condor_master after
> 10 secs until the MasterLog again says it exits with error 44 after the next
> 1hr 25 mins.
> Nothing gets logged in the QuillLog.
>
> Cheers
>
> Greg
> ________________________________

Attachment: MasterLog.zip
Description: MasterLog.zip

Attachment: QuillLog.zip
Description: QuillLog.zip