
Re: [Condor-users] Distributed computing performance



Thank you very much, JK, for the quick response.
I'm sorry for my slow reply; we had a proxy problem yesterday.

> I am not certain where your speedup is going to come from.
> HTC is about getting more jobs finished in a certain amount of time
> rather than getting a single job to finish quicker.
>
> So if I have 1,000 jobs that take an hour each, I could get a fair
> amount of speedup by spreading them out over 4 machines rather than
> running them on a single node.
> To get speedup for your single job you will have to "cut it down"
> somehow, say by reducing the data it works over, or reducing its
> iterations; the remaining data being analysed across the other 3 jobs.

Maybe I was vague about what I'm trying to do. My applications/programs
are Monte Carlo simulations, so the work can be split independently at
any granularity I want. Say a single job/simulation has 10,000 events
and I'm going to use 4 machines; then the optimal granularity would be
to execute 2,500 events on each machine. These machines all have the
same OS and architecture.
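
For concreteness, this is roughly how I fan the events out -- a minimal
sketch of my submit description, where mc_sim is a stand-in name for my
simulation binary and I assume it takes the event count and a random
seed on its command line (each chunk needs a different seed, otherwise
the 4 jobs just repeat the same events):

    # one 10,000-event simulation split into 4 independent jobs
    universe   = vanilla
    executable = mc_sim                   # stand-in for my MC binary
    arguments  = --events 2500 --seed $(Process)
    output     = mc_$(Process).out
    error      = mc_$(Process).err
    log        = mc.log
    queue 4

$(Process) runs from 0 to 3 here, so it doubles as a cheap per-chunk
seed.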

So for my speedup test I had exclusive use of all the machines in the
pool for a month to simulate all the data I needed. I take the
execution time using 1 machine, 2 machines, 3 machines, and 4 machines.
Then I vary the number of events (9,000 events, 80,000 events, and so
on) and record the execution times again. These are CPU-intensive
programs, and CPU usage is 99.9% as far as I can see while the program
executes... BUT I am not sure how to average the CPU usage, because it
fluctuates whenever I open a web browser (Mozilla Firefox), a new
terminal, or a new folder. It fluctuates even more when I use SSH with
X forwarding from another machine to watch the progress of the job.
(This is another thing I want to ask the Condor team: how can I get the
CPU usage for each job in real time, or at least the average CPU
usage?)
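
In the meantime I've been pulling the accumulated CPU time out of the
job's ClassAd; as I understand it, RemoteUserCpu and RemoteSysCpu are
standard job attributes, so something like this shows them while the
job runs (42.0 is just a placeholder job id):

    condor_q -long 42.0 | grep -Ei 'remote(user|sys)cpu'

The job's user log also records the run's remote usage when the job
terminates, but I'd still like a real-time view.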

> HTC (and Condor) works best for parameter sweep, monte carlo and
> similar so called "serial" jobs (I dislike that term since the jobs
> can typically be run in any order - I prefer the term "independent").
> Jobs that will need to regularly swap information and synchronise
> ("parallel" jobs such as MPI) are a better fit for the HPC, single
> cluster model. Condor can be configured to do MPI work, but it is
> best to get an idea of how it works using the HTC model first.

There's no need for me to use MPI; the job parallelizes easily because
the chunks are completely independent.

So basically what I did was try to use the above-mentioned capability
and flexibility of Condor, namely that it can be configured to do this
kind of parallel computing.
But I'm still confused about the difference between parallel and
distributed computing; the two are not clearly differentiated in most
books. Is speedup a valid performance test for distributed computing
too, or is it purely a parallel-computing measure? (I compute speedup
as S(n) = T(1)/T(n), the wall-clock time on one machine divided by the
wall-clock time on n machines, and efficiency as S(n)/n.) In my case
Condor sets up distributed HTC, am I right? Is throughput the only
performance metric I can use to evaluate a distributed computing
environment, or are there others? And how should I perform these
evaluations? Are there any standards?
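
(For throughput, the only method I've come up with is counting
completed jobs per unit time out of condor_history, e.g.

    condor_history leo

which lists my completed jobs with their submission and completion
dates -- but I don't know whether that is a standard methodology.)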
One more problem: sometimes a computer won't execute any job even when
it is idle and there are jobs in the queue, although that computer is
configured to ALWAYS run jobs, even when its owner resumes using the
machine.
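
(When that happens I ask Condor itself why the job isn't matching; as I
understand it,

    condor_q -analyze 42.0

prints the requirements keeping a queued job from running -- 42.0 again
being a placeholder job id -- but I haven't found the cause yet.)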

So perhaps, if anyone has done a study like this before, I would have
something to compare against.


Thank you once again,

Leo

>
> I hope this is of some help
>
> Cheers
>
> JK
>
>
>> -----Original Message-----
>> From: condor-users-bounces@xxxxxxxxxxx
>> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Leo Cristobal
>> C. Ambolode II
>> Sent: Wednesday, September 19, 2007 7:02 AM
>> To: condor-users@xxxxxxxxxxx
>> Subject: [Condor-users] Distributed computing performance
>> Hi condor-users and developers,
>>
>> I have a simple Condor pool consisting of 4 Linux machines, and I am
>> about to evaluate this cluster of computers. So far I've been able
>> to test its speedup: I simulate a single long-running job (it takes
>> about a day to a week to finish) and increase the number of machines
>> used to simulate the program (say, from a single machine up to 4
>> machines). So far so good. I am using programs/applications from our
>> field, which is High Energy Physics; we use SimTools, which in turn
>> uses ROOT and GEANT4 (URLs are www.root.cern.ch and
>> www.geant4.cern.ch, respectively).
>>
>> I've read the book "Distributed and Parallel Computing" by
>> Al-Rashini? -- I'm sorry if I did not get the correct title or the
>> correct spelling of the author. It talks about response time,
>> throughput, network, etc. Has anyone tried the evaluation I am going
>> to make? What are the appropriate performance parameters I should
>> investigate, and how should this be done? I only have 4 machines. At
>> first I was interested only in speedup and more in parallel
>> computing, but since my study is on distributed computing, which
>> differs somewhat from parallel computing, I have to investigate more
>> to justify distributed HTC.
>>
>> I thank you in advance. If you have further questions regarding the
>> nature of my study, feel free to ask me.
>>
>> Sincerely,
>> Leo