Re: [Condor-users] Benchmarking condor
- Date: Thu, 30 Jun 2005 11:48:26 +0100
- From: Matt Hope <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] Benchmarking condor
On 6/29/05, Juan Ignacio Sánchez Lara <juanignaciosl@xxxxxxxxx> wrote:
> what do you do when you want to measure the throughput and speed-up of your
> Condor cluster? I'm looking for standard benchmarks (instead of running
> multiple instances of a home-made software), but almost everything is only
> MPI-based (and I'd like to measure not only MPI performance but also
> Thank you very much:
> PS: Matthew (maybe you're interested), I'm finally going to implement a web
> interface to the SOAP Condor API, so I promise that in the coming days/weeks
> I can report back about NuSOAP
Condor does its own benchmarking, which is the quickest data for you to get.
Take a look at the sum of:
condor_status -format "vm%d@" VirtualMachineId -format "%s " Machine \
  -format "%d " KFlops -format "%d\n" Mips
Note that this may well report machines which run a startd but do not
run jobs; those should be excluded.
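As a sketch, the per-slot figures can be totalled with awk. The constraint
shown is only one way to skip slots that never run jobs (it assumes your
non-running startds advertise START as permanently false); adjust it to
match how those machines are actually configured.

```shell
#!/bin/sh
# Sum the pool's benchmark figures, one line per VM/slot.
# 'START =!= FALSE' is a hypothetical filter for slots that never
# run jobs -- substitute whatever distinguishes them in your pool.
condor_status -constraint 'START =!= FALSE' \
    -format "vm%d@" VirtualMachineId -format "%s " Machine \
    -format "%d " KFlops -format "%d\n" Mips |
awk '{ kflops += $2; mips += $3 }
     END { printf "total KFlops: %d  total MIPS: %d\n", kflops, mips }'
```

The awk stage just adds up fields 2 and 3 (KFlops and Mips) across every
slot line, so it works the same on any subset you care to constrain to.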
Also note, in general (and specifically for Condor), that benchmarking
something as complex as a cluster is neither an exact science nor an
easy task.
The KFlops/MIPS values are about as useful as BogoMIPS et al., i.e.
useful only as a vague indicator of relative raw CPU performance.
The best way to benchmark your pool is to take the suite of
applications you run on it, run each one in some well-defined,
repeatable and close-to-reality mode, and then see how long the jobs
take to finish.
**Note the 'take to finish' bit.**
This is inherently subjective. For some jobs the returned data becomes
useful as it trickles in; in that case the total wall clock time, plus
the overhead of negotiation, transmission etc. for each job, can be
summed to form a reasonably sound basis for the throughput of the
farm in isolation.
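A hedged sketch of that summation, assuming the jobs ran under a single
cluster id (123 here is made up -- substitute your own) and that the
RemoteWallClockTime attribute is populated for your finished jobs:

```shell
#!/bin/sh
# Sum the wall clock consumed by the finished jobs of cluster 123.
# RemoteWallClockTime covers time on the execute machine including
# restarts, which is roughly the per-job cost we want to total.
condor_history 123 -format "%f\n" RemoteWallClockTime |
awk '{ total += $1 }
     END { printf "total wall clock: %.0f s over %d jobs\n", total, NR }'
```

Dividing that total by the number of jobs gives a mean per-job cost you
can compare between runs of the same suite.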
To make it a better test you should have no other jobs on the farm at
the time, though this may be completely unrealistic.
If the data is useful only when the last job has finished, then timing
until the end of the last job is more meaningful but less useful for
comparisons, since it will be *extremely* variable with respect to some
key limits. With n jobs and m machines, if n is not significantly
bigger or smaller than m then the times will change in big steps:
essentially, a change in the value of n mod m will have a significant
effect on the reported value even if the throughput itself doesn't
change significantly.
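The step effect is easy to see with a toy calculation: with identical
tasks of length t, the last job finishes at roughly ceil(n/m) * t, so
adding a single job can jump the measured time by a whole t. The numbers
below (600 s tasks, 20 machines) are arbitrary:

```shell
#!/bin/sh
# Toy model: n identical jobs of t seconds each on m machines.
# Makespan ~= ceil(n/m) * t; note the jump as n crosses a multiple of m.
t=600   # seconds per job (arbitrary)
m=20    # machines (arbitrary)
for n in 39 40 41; do
    waves=$(( (n + m - 1) / m ))     # integer ceil(n/m)
    echo "n=$n jobs -> makespan $(( waves * t )) s"
done
```

Going from 40 to 41 jobs jumps the makespan from 1200 s to 1800 s, a 50%
increase in the reported number for a 2.5% increase in actual work.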
Another key factor: if the machines (and indeed jobs) in the pool are
very non-uniform then this will affect comparability, unless you tune
things extremely finely (which may work for the benchmark but not too
well in real usage).
Essentially you may think you are asking a reasonable question, but
such simple questions normally spawn 10 more tricky ones; repeat ad
infinitum.
The quick way to get a feel for a stable pool's power is to evaluate
the performance of each machine on a particular set of tasks (where
each task is representative of something you do regularly).
Evaluate the number of such tasks each machine could pump through in
some significant time period (hour/day/week etc.).
Work out the rough split in terms of tasks in your current/projected usage.
Do the maths and you have a rough guide to the throughput your farm
can achieve. If you seem to get significantly lower throughput than
this number suggests, then either your assumptions regarding the task
splits or their closeness to the real load were invalid, or some
aspect(s) of the farm such as scheduling/checkpointing/farm errors are
sucking away useful time.
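A minimal sketch of that arithmetic, with entirely made-up per-machine
rates and task split:

```shell
#!/bin/sh
# Hypothetical pool: 10 fast machines at 6 tasks/hour each,
# 30 slow machines at 2 tasks/hour each.
predicted=$(( 10 * 6 + 30 * 2 ))   # tasks/hour the farm should manage
measured=96                        # tasks/hour actually observed (made up)
echo "predicted: $predicted tasks/hour, measured: $measured"
# A big shortfall in this ratio points at scheduling/checkpointing/farm
# errors -- or at invalid assumptions about the task split.
echo "efficiency: $(( measured * 100 / predicted ))%"
```

Here the farm delivers 96 of a predicted 120 tasks/hour, i.e. 80%; the
missing 20% is what you would go looking for.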
This is a useful thing to spot, since it means you can target your
investigations there and measure whether any changes (removing a
destructive machine, for example) actually lead to a quantitative
improvement.
The short answer (after a very long one) is: not easily, if you want
anything other than trivial evaluations on trivial tasks.