
[Condor-users] Condor Quill Performance




This was originally sent to another list, but it looks like it could also
be useful here.

Ameet

---------- Forwarded message ----------
Date: Sun, 30 Oct 2005 19:33:10 -0600 (CST)
From: Ameet Kini <akini@xxxxxxxxxxx>
To: condor-devel@xxxxxxxxxxx
Subject: Condor Quill Performance



I just ran a bunch of tests to compare the performance of condor_rm
with and without quill, since it has raised quite a stir lately as an
expensive operation.  The results are reported here.  Please contact me
directly if you have any questions on the particulars of the tests.

I tested both cases: when the postgres server is on the same machine
as the daemons (local) as well as on another machine (remote).  In the
local case, postgres steals CPU cycles away from the schedd and quill but
also saves time otherwise spent over the wire.  The difference, as we'll
see below, is not much.

With quill (local postgres):
My personal condor pool consisted of a single schedd and a single quill
server.

I submitted 20000 jobs within a single cluster, and waited for them to
reach the disk (as was verified by the results of condor_q).  I also set
their REQUIREMENTS=FALSE so that they would forever sit IDLE in the queue.
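
For anyone wanting to set up a similar queue, here is a minimal sketch
that generates a submit description with a requirements expression that
can never match, so the jobs stay IDLE.  The helper name, the executable
and its arguments are placeholders of mine; the original test did not
say what the jobs actually were.

```python
# Hypothetical helper: builds a submit description whose jobs can never match.
# The executable and arguments are placeholders, not taken from the test above.
def make_submit_description(n_jobs, executable="/bin/sleep", arguments="60"):
    lines = [
        "universe     = vanilla",
        f"executable   = {executable}",
        f"arguments    = {arguments}",
        # A requirements expression of False never matches any machine,
        # so every job in the cluster sits IDLE in the queue.
        "requirements = False",
        # A single "queue N" statement puts all N jobs in one cluster.
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_submit_description(20000), end="")
```

The output can be saved to a file and handed to condor_submit.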

Then I issued a condor_rm -all.  This command took about 15 seconds to
return.  During that time, the schedd wrote records like these to the
job_queue.log file:

103 2.17790 RemoveReason "via condor_rm (by user akini)"
103 2.17790 EnteredCurrentStatus 1130363247
103 2.17790 JobStatus 3

It did this for all 20000 jobs.  The write is synchronous, meaning that
the condor_rm command does not return until all such records are
written to disk.  It took quill about 40 seconds more to send those
records to the database.  So within a minute, the schedd and quill
were consistent.  This could be verified by running a condor_q against the
quill database and seeing 20000 jobs with status 'X'.

Once the schedd has recorded those records safely, it starts actually
purging jobs out of its job queue.  While it's doing that, it syncs
records like these to the job_queue.log file:

103 2.11915 JobFinishedHookDone 1130363884
102 2.11915

Quill follows suit by sending those records to the database and also
updating its history tables.  If you do a condor_q at this time, you will
see the queue shrinking in size.  This entire operation of purging jobs
out of the queue takes quite a long time.  It took approximately 14
minutes for the schedd, and quill took about 10 more minutes after that.
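
To follow the two stages from the log itself, the job_queue.log records
quoted above can be parsed with something like the sketch below.  The
opcode names are my own labels, inferred from the excerpts (103 sets an
attribute on a job, 102 drops the job from the queue); the function names
are hypothetical and this is not how quill itself reads the log.

```python
# Labels inferred from the excerpts above; they are assumptions, not official names.
OPCODES = {"102": "destroy_ad", "103": "set_attribute"}


def parse_record(line):
    """Split one log line into opcode label, job key, attribute name and value."""
    parts = line.strip().split(None, 3)
    if not parts:
        return None
    return {
        "op": OPCODES.get(parts[0], "other"),
        "key": parts[1] if len(parts) > 1 else None,
        "name": parts[2] if len(parts) > 2 else None,
        "value": parts[3] if len(parts) > 3 else None,
    }


def summarize(lines):
    """Return (jobs marked removed, jobs purged from the queue) as sets of job keys."""
    marked_removed, purged = set(), set()
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        # Stage one: condor_rm sets JobStatus to 3 (removed, shown as 'X').
        if rec["op"] == "set_attribute" and rec["name"] == "JobStatus" and rec["value"] == "3":
            marked_removed.add(rec["key"])
        # Stage two: the job is actually purged from the queue.
        elif rec["op"] == "destroy_ad":
            purged.add(rec["key"])
    return marked_removed, purged


SAMPLE = [
    '103 2.17790 RemoveReason "via condor_rm (by user akini)"',
    "103 2.17790 EnteredCurrentStatus 1130363247",
    "103 2.17790 JobStatus 3",
    "103 2.11915 JobFinishedHookDone 1130363884",
    "102 2.11915",
]

if __name__ == "__main__":
    removed, purged = summarize(SAMPLE)
    print(len(removed), "marked removed;", len(purged), "purged")
```

Counting the two sets over time gives a rough view of how far the schedd
has progressed through each stage.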

Without quill:
The configuration is the same except now quill and the postgres server are
not running on the system.

condor_rm now returned in 11 seconds (compared to the 15 seconds above).

The second stage now took approximately 10 minutes to complete (as
compared to the 14 minutes above).

Finally, the case when postgres is remote.

With quill (remote postgres):
condor_rm returned within 8 seconds.  It took the schedd approximately 8
minutes to sync everything to disk (that's 2 minutes faster than the above
case, possibly due to reduced CPU contention from the database server).
Quill took a total of 15 more minutes to write everything to the database
server.  That's almost comparable to the case when the database is local.
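
To put the three runs side by side, here is a small sketch that adds up
the approximate figures reported above.  The dictionary layout and names
are mine; the numbers are the rough timings from this message, not new
measurements.

```python
N_JOBS = 20000

# Approximate figures from the runs described above.
RUNS = {
    "quill, local postgres": {"rm_s": 15, "schedd_purge_min": 14, "quill_extra_min": 10},
    "no quill": {"rm_s": 11, "schedd_purge_min": 10, "quill_extra_min": 0},
    "quill, remote postgres": {"rm_s": 8, "schedd_purge_min": 8, "quill_extra_min": 15},
}


def summarize_run(run):
    """Total purge time until quill has caught up, and condor_rm rate in jobs/s."""
    return {
        "purge_total_min": run["schedd_purge_min"] + run["quill_extra_min"],
        "rm_jobs_per_s": N_JOBS / run["rm_s"],
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        s = summarize_run(run)
        print(f"{name:24s} purge total {s['purge_total_min']:3d} min, "
              f"condor_rm {s['rm_jobs_per_s']:.0f} jobs/s")
```

By this count the purge stage ends about 24 minutes after it starts with
a local database and about 23 minutes with a remote one.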

Ameet