Mailing List Archives
Public Access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Condor-users] Quill++ assistance
- Date: Tue, 14 Sep 2010 11:44:45 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: Re: [Condor-users] Quill++ assistance
Hi Erik
I
have attached the QuillLog file with D_FULLDEBUG turned on, and the MasterLog
file.
The dprintf_failure.QUILL file is 0 bytes and empty. There is 120Gb of
free disk space.
The sql.log file is empty as well because condor-quill restarts after 10
secs and continues
on as if nothing has happened.
The action happens at 10:35:15. I've included some excerpts
here.
This following section is repeated every 10 secs in
QuillLog
09/14 10:34:55 ******** Start of Polling Job Queue Log ********
09/14
10:34:55 Get JobQueue Polling Information
09/14 10:34:55 SQL Query = SELECT
last_file_mtime, last_file_size, last_next_cmd_offset, last_cmd_offset,
last_cmd_type, last_cmd_key, last_cmd_mytype, last_cmd_targettype,
last_cmd_name, last_cmd_value from JobQueuePollingInfo where scheddname =
'MONAL-KB.nexus.csiro.au'
09/14 10:34:55 === Current Probing Information
===
09/14 10:34:55 fsize: 85 mtime: 1284080765
09/14 10:34:55
first log entry: 26 CreationTimestamp 1263526089
09/14 10:34:55 JOB QUEUE
POLLING RESULT: NO CHANGE
09/14 10:34:55
********* End of Polling Job Queue Log *********
Then the
condor_quill exits with status 44 (as shown in MasterLog further
down)
with NO messages in
QuillLog BUT the last entry is part way through the
Polling Job Queue
section when I presume it's reading job_queue.log?
09/14 10:35:15 ******** Start of Polling Job Queue Log ********
09/14
10:35:15 Get JobQueue Polling Information
09/14 10:35:15 SQL Query = SELECT
last_file_mtime, last_file_size, last_next_cmd_offset, last_cmd_offset,
last_cmd_type, last_cmd_key, last_cmd_mytype, last_cmd_targettype,
last_cmd_name, last_cmd_value from JobQueuePollingInfo where scheddname =
'MONAL-KB.nexus.csiro.au'
09/14 10:35:15 === Current Probing Information
===
09/14 10:35:15 fsize: 85 mtime: 1284080765
09/14 10:35:27
Locale: English_United States.1252
09/14 10:35:27
******************************************************
09/14 10:35:27 **
condor_quill.exe (CONDOR_QUILL) STARTING UP
09/14 10:35:27 **
C:\PROGRA~1\condor\bin\condor_quill.exe
09/14 10:35:27 ** SubsystemInfo:
name=QUILL type=DAEMON(11) class=DAEMON(1)
09/14 10:35:27 ** Configuration:
subsystem:QUILL local:<NONE> class:DAEMON
09/14 10:35:27 **
$CondorVersion: 7.4.3 Aug 4 2010 BuildID: 261829 $
09/14 10:35:27 **
$CondorPlatform: INTEL-WINNT50 $
09/14 10:35:27 ** PID = 7680
09/14
10:35:27 ** Log last touched 9/14 10:35:15
09/14 10:35:27
******************************************************
09/14 10:35:27 Using
config source: c:\PROGRA~1\condor\condor_config
09/14 10:35:27 Using local
config sources:
09/14 10:35:27
C:\PROGRA~1\condor/condor_config.local
09/14 10:35:27 DaemonCore: Command
Socket at <130.116.146.156:9983>
09/14 10:35:27 Will use UDP to update
collector condor-wa.csiro.au <130.116.24.145:9618>
09/14 10:35:27
main_init() called
09/14 10:35:27 configuring tt options from config
file
09/14 10:35:27 Using Polling Period = 10
09/14 10:35:27 Using logs
09/14 10:35:27 C:\PROGRA~1\condor/log/sql.log 09/14 10:35:27
09/14 10:35:27
Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
09/14 10:35:27
Using Database Type = Postgres
09/14 10:35:27 Using Database IpAddress =
condorquill.csiro.au:5432
09/14 10:35:27 Using Database Name =
quilldatabase
09/14 10:35:27 Using Database User =
quillwriter
MasterLog showing condor_quill exiting with status 44 and restarting 10
secs later.
09/14 10:35:15 Received UDP command 60011 (DC_NOP) from
<130.116.146.156:9380>, access level ALLOW
09/14 10:35:15 Calling
HandleReq <handle_nop()> (0)
09/14 10:35:15 Return from HandleReq
<handle_nop()> (handler: 0.000s, sec: 0.000s)
09/14 10:35:15
DaemonCore: pid 6556 exited with status 44, invoking reaper 1
<Daemons::DefaultReaper()>
09/14 10:35:15 The QUILL (pid 6556) exited
with status 44
09/14 10:35:15 Sending obituary for
"C:\PROGRA~1\condor/bin/condor_quill.exe"
09/14 10:35:17 restarting
C:\PROGRA~1\condor/bin/condor_quill.exe in 10 seconds
09/14 10:35:17
DaemonCore: return from reaper for pid 6556
09/14 10:35:27 Started DaemonCore
process "C:\PROGRA~1\condor/bin/condor_quill.exe", pid and pgroup =
7680
09/14 10:35:27 Received UDP command 60008 (DC_CHILDALIVE) from
<130.116.146.156:9380>, access level DAEMON
09/14 10:35:27 Calling
HandleReq <HandleChildAliveCommand> (0)
09/14 10:35:27 Return from
HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec:
0.000s)
As I've mentioned before this is NOT a show stopper as condor_quill
restarts and continues on OK.
The whole condor_quill thing seems to chug along mostly OK, SO LONG AS
NOT MANY JOBS ARE SUBMITTED.
Then we strike problems. I think it's associated with many 1000's of jobs
being submitted (we have max_jobs_running = 200)
resulting in large job_queue.log files that seem to take condor_quill a
while to process? In any case once condor_quill
is "busy" the sql.log (and sql.log.copy) files get "largish" and cannot
get processed into the database within the
1
hr timeout limit (or the 1hr 25 restart issue). Once that happens then it can
never catch up and the only way
around is to delete the sql.log files, restart, and lose all the job
info. I thought I might look into manually importing
the info in from the history file using
condor_load_history?
We have been trying the condor_quill way in the hope that we could
report our monthly usage (no. jobs, no hrs, etc)
easily. In theory this works fine and a 1 or 2 line sql query will do it.
But it's not much use if all the job info doesn't
get into the db and we need to resort to our old manual method of
collecting all the history files off our 15+ submit
nodes, running them through condor_history, importing into XL,
calculating jobs, hrs, etc.
Thanks.
Greg
Right, the exponential backoff restart process can include some
random jitter which will build up over time.
If you've got a chance, could you turn up QUILL_DEBUG to D_FULLDEBUG, and
send me the QuillLog from a crashing machine, along with your sql.log
file?
Also, how much free disk space do you have? The dprintf_failure files
should have some data inside of them explaining why the dprint failed, and being
out of disk is one reason the dprintf can fail.
-Erik
On Wed, Aug 25, 2010 at 9:04 PM,
<Greg.Hitchen@xxxxxxxx> wrote:
That's
correct, no other daemons are restarting, just
condor_quill.
Interestingly, now that I have installed this version
onto another
few PCs, the 1hr 25min is not EXACT. Two PCs that I "synched"
yesterday
by restarting condor at the same time are now 2-3 minutes apart
on
their condor_quill restarts. Maybe the condor_master
restarting
condor_quill after 10secs isn't exact and the time diff
gradually builds
up? I'll keep an eye on it.
Cheers
Greg
Sent: Thursday, 26 August 2010 4:16 AM
To: Condor-Users Mail
List
Subject: Re: [Condor-users] Quill++ assistance
And just to
confirm, it's only Quill - none of the other daemons show
the same restart
every hour and twenty-five minutes?
-Erik
On Wed, Aug 25,
2010 at 1:12 AM, <Greg.Hitchen@xxxxxxxx> wrote:
>
Hi Erik
>
> The 1hr 25 mins is definitely not related (as far
as I can tell) to virus
> scans/server activity/etc.
> I've
checked all the scheduled type of activities that our PCs get
installed
> with and nothng "fits".
>
> In addition I have
installed 7.4.3 onto several PCs now and they all exhibit
> the 1hr 25
restart
> of condor_quill and it always starts exactly 1 hr 25 mins
after condor is
> started, i.e. anytime
> I do a condor net stop,
condor net start on them then the first of the 1hr
> 25mins
restarts
> begins 1 hr 25mins after this.
>
> There is a
dprintf_failure.QUILL file created but it is empty and 0 bytes in
>
size.
> No core file is created and condor_quill quite happily gets
restarted by
> condor_master after
> 10 secs until the
MasterLog again says it exits with error 44 after the next
> 1hr 25
mins.
> Nothing gets logged in the QuillLog.
>
>
Cheers
>
> Greg
>
________________________________
Attachment:
MasterLog.zip
Description: MasterLog.zip
Attachment:
QuillLog.zip
Description: QuillLog.zip