
Re: [Condor-users] [Condor-devel] Condor Quill Performance



Note: This discussion began on another mailing list (condor-devel), but I
feel this tangent is more useful on condor-users.

On 10/31/05, Ameet Kini <akini@xxxxxxxxxxx> wrote:
>
>
> I just ran a bunch of tests to compare the performance of condor_rm
> with/without quill, since its raised quite a stir lately as an
> expensive operation.

I decided to run roughly equivalent tests on Windows without Quill.

This is very rough, but it highlights some significant differences and I
think it has isolated the slowness on our pool. I think it's likely to be
useful to anyone else running a corporate pool on Windows.

Test machine: dual Xeon 3.2 GHz, 2 GB RAM, Windows XP Pro SP2
$ condor_version
$CondorVersion: 6.6.8 Jan 31 2005 $
$CondorPlatform: INTEL-WINNT40 $

With a nearly empty queue (only 7 previous jobs), I submitted the following:

universe = vanilla
executable = printname.bat
Requirements = False
output = submitlots.$(Cluster).$(Process).out
error = submitlots.$(Cluster).$(Process).err
log = submitlots.$(Cluster).$(Process).log
nice_user = true
queue 20000

and waited until all CPU activity on the schedd and disk activity on
the job_queue.log had finished. This was actually pretty quick, taking
about 90 seconds.

> Then I issued a condor_rm -all.  This command took about 15 seconds to
> return.  During that time, the schedd wrote a bunch of such records to the
> job_queue.log file:

$ time condor_rm -all
All jobs marked for removal.

real    0m10.791s
user    0m0.031s
sys     0m0.000s

Nice and fast.

At this point the schedd's CPU usage became minimal (less than 5% of 1
CPU), but there was still a significant load from 'System', which took
most of one CPU. (I noted that the history log file grew steadily in
size over this period.)

I ran condor_q at infrequent intervals (about every 5 minutes; sorry,
not very scientific). If it took much longer than 1 minute, I killed it.

Here is a sample:

$ time condor_q
<snip>
   7.19990 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19991 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19992 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19993 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19994 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19995 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19996 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19997 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19998 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat
   7.19999 nice-user.matt 10/31 10:16   0+00:00:00 X  0   0.0  printname.bat

0 jobs; 0 idle, 0 running, 0 held

real    1m10.625s
user    0m0.015s
sys     0m0.015s

As an aside, for the purposes of my wrapper code I treat a delay of more
than 60 seconds in response to a condor_q as a timeout. Most users would
probably quit at 30 :)
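
For anyone wanting a similar guard, a timeout of this kind could be
sketched roughly as below. This is only an illustration in Python, not
the wrapper code mentioned above: the function name run_with_timeout and
the 60 second default are assumptions, and it relies only on the standard
subprocess module.

```python
import subprocess

def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd and return (finished, stdout).

    finished is False when the command did not return within timeout_s.
    The name and the default value are illustrative assumptions.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False, ""
    return True, result.stdout
```

Calling run_with_timeout(["condor_q"], 60.0) would then give up on a
slow schedd instead of hanging; a caller could retry later or report the
queue as temporarily unavailable.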

> Now, once the schedd records those records safely, it then starts actually
> purging records out of its job queue.

After 20 minutes I stopped bothering with condor_q, since it clearly
slowed down the schedd's attempts to clean out its queue, and just kept
an eye on the CPU usage and the history file...
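
Watching the history file by hand can be replaced with a small polling
script. The following is a rough sketch in Python under stated
assumptions: the function names are made up for illustration, and the
path to the history file has to be supplied by whoever runs it, since it
depends on the local Condor configuration.

```python
import os
import time

def sample_sizes(path, interval_s, count):
    """Return [(seconds_since_start, size_in_bytes), ...] for count samples."""
    start = time.monotonic()
    samples = []
    for i in range(count):
        samples.append((time.monotonic() - start, os.path.getsize(path)))
        if i < count - 1:
            time.sleep(interval_s)
    return samples

def growth_rates(samples):
    """Bytes per second between consecutive samples."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        if t1 > t0:  # skip zero-length gaps to avoid dividing by zero
            rates.append((s1 - s0) / (t1 - t0))
    return rates
```

A steady positive rate that drops to zero would suggest the schedd has
finished writing the removed jobs to the history file.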

After doing some digging with ProcEXP I spotted a lot of activity
within an antivirus DLL. The activity was not on the job_queue.log but
on the history log. I think I've found my culprit. I am going to get my
systems guys to disable scanning on the spool directory and retest.

If anyone else on Windows has noticed serious slowdowns after
running a sizable condor_rm, I would suggest they take a look at their
virus scanner...

Matt