
Re: [HTCondor-users] Looking for a good HPC vs. HTC soundbite

Hi Michael,

I quite enjoy the FLOPY vs. FLOPS slides that Prof. Livny has used. I've found that they accurately convey that the problem space isn't as simple as "turn all these on, link them up and you're good to go."

I once had someone suggest (unfortunately I can't remember who) that very few problems actually require the power and inconvenience of an HPC installation. Most of the problems that claim to be HPC are actually extremely inefficient solutions that could be broken down and run as high-throughput jobs, with all the benefits that brings.

I genuinely think, looking forward into the future of computing (I know, dodgy ground here), that with the rise of computing as a commodity and the continuing march of the "Cloud", the cost per CPU-hour is going to continue nose-diving.

Having a quick glance at AWS Spot I notice that the m4.xlarge (2 cores, 8 GB) is currently 4 cents an hour.

We'll start seeing citizen scientists who throw $50-100 at cloud resources and get not 20 cores for a few hours, like now, but 1,000 or 10,000 cores for a few hours.
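As a rough back-of-the-envelope check of those numbers, here is a minimal sketch. It takes the spot price and core count quoted above at face value and ignores storage, data transfer and spot-price fluctuation; the function name and the integer-cents convention are just illustrative choices.

```python
def core_hours(budget_cents: int, price_cents_per_instance_hour: int,
               cores_per_instance: int) -> int:
    """Core-hours a budget buys at a flat per-instance hourly price."""
    instance_hours = budget_cents // price_cents_per_instance_hour
    return instance_hours * cores_per_instance


# Assumed inputs: 4 cents per instance-hour, 2 cores per instance (as quoted above).
for budget_dollars in (50, 100):
    total = core_hours(budget_dollars * 100, 4, 2)
    for cores in (1_000, 10_000):
        print(f"${budget_dollars}: {cores} cores for {total / cores:.2f} hours")
```

Under those assumptions, $100 buys 5,000 core-hours, which is 1,000 cores for five hours or 10,000 cores for half an hour.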

Which is why I think the HTCondor team is spot on when they focus on things like the grid universe and cloud GAHPs as well as the incoming condor_annex. They're the future.

As a company or research organization, you can call an API to provision 100,000 cores across multiple cloud providers in the time it takes to get a coffee, and get a bill in a couple of days.

Speaking hypothetically, as the above organization, if your problem space wasn't HPC-bound, why would you waste the time and expense on very complex and intricate HPC set-ups that require highly specialized networks, cooling and knowledge to run?

Then it gets even muddier when you see things like the Intel Xeon Phi coming in; those things are going to take the HTC world by storm in big installations. At the same time, people are going to be scratching their heads wondering why they now need that 1,000-core InfiniBand set-up that makes them feel a bit ill whenever they think of the cost.

Cheers, Iain

Indeed, I'll see you again at HTCondor Week this year.

On May 2, 2016, at 19:22, Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:

Hi folks,

I'm on my way to a company-internal symposium in Tucson where I'll be giving a presentation about some interesting recent work with HTCondor, in the High *Performance* Computing track. I'll be talking to an audience which, by and large, is unfamiliar with HTCondor capabilities, and if they've seen it before, it might be v6.x or earlier. This, of course, got me to thinking about something that I recall Prof. Livny talking about at HTCondor Week last year.

The emphasis in "High Performance" computing has been to orchestrate as many cooperating CPU cores as possible, running as fast as possible - you see things like the SGI UV 3000 series where you have up to 256 sockets (not cores, *sockets*... zomg...) on a cache-coherent memory image, the proliferation of InfiniBand fabrics, 10Gb, 40Gb, and 100Gb Ethernet, RDMA and RoCE, etc., all working to build out the fiercest Lamborghini of a computing system the world has ever seen.

But Prof. Livny's observation was that the paradigm of large-scale computing is shifting around us, and it will have the same kind of revolutionary impact on computing as the introduction of the PC. We are entering a world where, for an absurdly modest price, you can harness the power of tens or hundreds of thousands of CPU cores for only as long as you need them. Even with the densest Xeon chips in the biggest UV 3000 available, to the tune of millions upon millions of dollars, you can't even remotely come close to the power and scale that's available in Amazon EC2, Azure, and the rest for however many pennies per hour you want to spend.

Simple math dictates that a hundred machines which each take ten minutes to run a given task will complete more of those tasks in a given time than a tricked-out muscle machine which can complete the same task in ten seconds, and that's what "High Throughput" is all about. That was what I saw as the crux of Prof. Livny's observation: the most important work in large-scale computing in the coming years is going to be figuring out how to adapt the design of algorithms to this new reality - figuring out how to run your four-week 20-core MPI job in a few hours on 20,000 intermittently-available EC2 spot instances instead.
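The "simple math" can be spelled out in a few lines. This is only a sketch under idealized assumptions: the tasks are fully independent, there is no scheduling or data-transfer overhead, and each spot instance counts as a single core. The function names are illustrative.

```python
def tasks_per_hour(machines: int, seconds_per_task: int) -> float:
    """Aggregate throughput of independent tasks, ignoring all overhead."""
    return machines * 3600 / seconds_per_task


def ideal_wall_hours(weeks: int, cores_used: int, cores_available: int) -> float:
    """Wall-clock hours if the same core-hours could be spread perfectly."""
    core_hours = weeks * 7 * 24 * cores_used
    return core_hours / cores_available


print(tasks_per_hour(machines=100, seconds_per_task=600))  # 600.0 tasks/hour
print(tasks_per_hour(machines=1, seconds_per_task=10))     # 360.0 tasks/hour

# Four-week, 20-core job spread over 20,000 cores (assumes perfect decomposition).
print(ideal_wall_hours(weeks=4, cores_used=20, cores_available=20_000))
```

The hundred slow machines finish 600 tasks per hour against 360 for the single fast one. The four-week job amounts to 13,440 core-hours, so under perfect decomposition the lower bound on 20,000 cores is well under an hour; preemptions, startup time and any serial portion of the work are what stretch that toward "a few hours".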

Since I only have about twenty minutes in my time slot, I'd be delighted if someone who has thought through this issue could offer a pithy, memorable, and succinct way to express this idea to a potentially skeptical audience. Or a link to one.

Thanks for any suggestions!

Hope to see some of you at HTCondor Week two weeks hence!

Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681)
339.293.9149 cell
