
Re: [Condor-users] Distributed computing performance



> Thank you very much JK for the quick response.
> I'm sorry for my slow response; we had a proxy problem yesterday.
> 
> ... 
> Maybe I was vague about what I'm trying to do. My applications/programs
> are Monte Carlo simulations, so the program can be split up at any
> granularity I want. Say, for one job/simulation, there are 10000 events
> and I'm going to use 4 machines; then the optimal granularity would be
> executing 2500 events on each machine. These machines have the same OS
> and Arch.

This should be ideal then.
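For example, a submit description file along these lines would split the
10000 events into four independent jobs (the executable name and the
--events/--seed arguments are just placeholders for however your simulation
takes its parameters):

    # run 4 independent simulations of 2500 events each
    universe   = vanilla
    executable = mc_sim
    arguments  = --events 2500 --seed $(Process)
    output     = mc.$(Process).out
    error      = mc.$(Process).err
    log        = mc.log
    queue 4

$(Process) runs from 0 to 3, so each job gets its own seed and its own
output files, and Condor farms them out to whichever machines are free.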

> So in my speedup test, I exclusively use all the machines in the pool
> for a month to simulate all the data I needed. I take the execution time
> using 1 machine, 2 machines, 3 machines and 4 machines. Then I vary the
> number of events (9000 events, 80000 events, and so on) and record the
> execution times again. These are CPU-intensive programs, and CPU usage
> is 99.9% as far as I can see when I execute the program... BUT I am not
> sure how to average the CPU usage because it fluctuates whenever I open
> a web browser (Mozilla Firefox), a new terminal or a new folder. It
> fluctuates much more when I use SSH with X forwarding from another
> machine so that I can see the progress of the job.
> (This is also another thing I want to ask the Condor team: how to get
> the CPU usage for each job in real time, or just the average CPU usage.)

It looks like you will get some sensible measurements doing this, but if you
want it to run faster, just add more machines. There will be a hit when
some machines are used more than others, but over time Condor will match
fewer jobs to the machines that are busier, so it should self-balance.
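You can watch how the work spreads out with condor_status, which lists the
state and activity of every slot in the pool:

    condor_status

Slots shown as Claimed/Busy are running jobs; slots in the Owner state are
being used by their owners and, under the default policy, will not accept
Condor jobs until the machine goes idle.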

You will have to decide whether, when a user returns to their machine, jobs
keep going, suspend, or terminate.
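That choice is expressed in the startd policy knobs in the Condor
configuration. As a rough sketch only (the thresholds are illustrative, and
your pool's existing defaults, such as WANT_SUSPEND, also come into play):

    # suspend a job as soon as the owner touches the keyboard,
    # resume it after 5 minutes of idleness, and never evict it
    SUSPEND  = (KeyboardIdle < 60)
    CONTINUE = (KeyboardIdle > 300)
    PREEMPT  = FALSE
    KILL     = FALSE

Setting SUSPEND = FALSE instead lets jobs carry on running alongside the
owner, which is the "run ALWAYS" behaviour you mention further down.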
 
> There's no need for me to use MPI; the job can be easily parallelized.
> 
> So basically, what I did is try to use the above-mentioned capability and
> flexibility of Condor, that is, that it can be configured to do such
> parallel computing.
> But I'm still confused about the difference between parallel and
> distributed computing; the two are not clearly differentiated in most
> books.

Parallel computing, broadly speaking, covers a wide range of activities:
* multi-threaded jobs
* multi-processing with shared resources
* distributed computing
* MPP
* SMP
* small clusters
* etc

If you want to make a distinction between distributed and parallel, then
the following are some guidelines:
* parallel jobs tend to be more closely coupled, passing a lot of data; they
  are not independent, so if one job dies, the rest may as well terminate too.
  Distributed jobs, on the other hand, are typically so-called serial
  (independent) jobs which, if they fail, can either be forgotten about (say
  in MCMC, when we may have enough results already) or restarted.
* parallel jobs tend to be homogeneous; distributed jobs can be heterogeneous.
  This is in terms of the process itself (distributed computing which is not
  performance related can have a variety of distinct "actors" acting as
  producer-consumer, client-server, etc.), and also in terms of processors,
  operating systems and geographical location.
  Note that for a parameter sweep or MC, the jobs are basically homogeneous,
  but they may run on heterogeneous OSes.

> Is speedup one of the performance tests of distributed computing too, or
> is it purely for parallel computing? And in my case, Condor sets up
> distributed HTC, am I right? Is throughput the only performance test I
> can perform in evaluating a distributed computing environment? Or are
> there others? And how can I perform these evaluations? Are there any
> standards?

High Throughput Computing is normally measured by how many jobs you can get
through in a year. If you want more jobs to run, add more resources.

> It happened to me that sometimes a computer won't execute any job even
> when it is idle and there are jobs in the queue, and this computer is
> configured to ALWAYS run jobs, even when the owner resumes using the
> computer.

That is the fun of Condor: there are sometimes a few teething troubles.
Often you just have to let the matchmaking happen; it can take a while.

If you have nodes which you suspect never run a job, you can target them
directly in a REQUIREMENTS statement (I have a Unix shell script that will
generate a submit file to run on EVERY node in a pool, or indeed on every
machine matching a particular ClassAd). Then you can check the logs to see
how soon the job gets matched and why it doesn't run.
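A minimal probe submit file for a single suspect node might look like this
(the hostname is just a placeholder):

    # trivial job pinned to one machine, to test whether it ever matches
    universe     = vanilla
    executable   = /bin/hostname
    requirements = (Machine == "node03.example.com")
    output       = probe.out
    error        = probe.err
    log          = probe.log
    queue

If it sits idle in the queue, the user log and the condor_q analysis below
will usually tell you why.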

The commands

    condor_q -anal
    condor_q -better-anal

may also help explain why a particular job is not matching.
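For example, run against a specific job (the job ID here is just a
placeholder):

    condor_q -better-anal 1234.0

It should print a summary of how many machines in the pool match the job's
requirements and how many reject it and why, which is usually enough to spot
a bad REQUIREMENTS expression or a machine that is refusing to start jobs.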
 
> So, perhaps if anyone has done this kind of study before, then I would
> have something to compare against. Or are there any standards?

I haven't done any work in this area, I'm afraid; maybe someone else has.

Cheers

JK