I quite enjoy the FLOPY vs. FLOPS slides that Prof. Livny used. I find they accurately convey that the problem space isn't as simple as "turn all these on, link them up and you're good to go."
Someone once suggested to me (unfortunately I can't remember who) that very few problems actually require the power and inconvenience of an HPC installation; most of the problems that claim to be HPC are actually extremely inefficient solutions which could be broken down and run as high-throughput jobs, with all the benefits that brings.
I genuinely think, looking forward into the future of computing (I know, dodgy ground here), that with the rise of computing as a commodity and the continuing march of the "Cloud", the cost per CPU-hour is going to continue nose-diving.
Having a quick glance at AWS Spot, I notice that the m4.xlarge (2 cores, 8 GB) is currently 4 cents an hour.
We'll start seeing citizen scientists who throw $50-100 at cloud resources and get not 20 cores for a few hours, like now, but 1,000 or 10,000 cores for a few hours.
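
As a rough back-of-the-envelope check on those numbers, here is a minimal Python sketch. It takes the spot price and core count quoted above at face value and ignores storage, data transfer, and spot interruptions; the function names are made up for illustration. Prices are in cents so the arithmetic stays exact.

```python
def core_hours(budget_cents, price_cents_per_instance_hour, cores_per_instance):
    """Core-hours a budget buys at a flat hourly instance price (assumed constant)."""
    instance_hours = budget_cents / price_cents_per_instance_hour
    return instance_hours * cores_per_instance


def hours_at_scale(budget_cents, price_cents_per_instance_hour,
                   cores_per_instance, total_cores):
    """Wall-clock hours the same budget lasts when spread over total_cores."""
    bought = core_hours(budget_cents, price_cents_per_instance_hour, cores_per_instance)
    return bought / total_cores


# $50 at 4 cents per instance-hour, 2 cores per instance (figures quoted above)
fifty_dollars = core_hours(5000, 4, 2)                # 2500.0 core-hours
thousand_cores = hours_at_scale(5000, 4, 2, 1000)     # 2.5 hours
print(fifty_dollars, thousand_cores)
```

On those assumptions, $50 already stretches to roughly 1,000 cores for a couple of hours; 10,000 cores for the same time would need the price to drop by about a factor of ten.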
Which is why I think the HTCondor team is spot on when they focus on things like the grid universe and cloud GAHPs, as well as the incoming condor_annex. They're the future.
As a company or research organization, you can call an API for 100,000 cores on multiple cloud providers in the time it takes to get a coffee, and get a bill in a couple of days.
Speaking hypothetically, as the above organization, if your problem space wasn't HPC-bound, why would you waste the time and expense on very complex and very intricate HPC set-ups, requiring highly specialized networks, cooling and knowledge to run?
Then it gets even muddier when you see things like the Intel Phis coming in; those things are going to take the HTC world by storm in big installations. At the same time, people are going to be scratching their heads wondering why they now need that 1,000-core InfiniBand set-up that makes them feel a bit ill whenever they think of the cost.
Indeed, I'll see you again at HTCondor Week this year.
I'm on my way to a company-internal
symposium in Tucson where I'll be giving a presentation about some interesting
recent work with HTCondor, in the High *Performance* Computing track. I'll
be talking to an audience which, by and large, is unfamiliar with HTCondor
capabilities, and if they've seen it before, it might be v6.x or earlier.
This, of course, got me to thinking about something that I recall Prof.
Livny talking about at HTCondor Week last year.
The emphasis in "High Performance"
computing has been to orchestrate as many cooperating CPU cores as possible,
running as fast as possible - you see things like the SGI UV 3000 series
where you have up to 256 sockets (not cores, *sockets*... zomg...)
on a cache-coherent memory image, the proliferation of InfiniBand fabrics,
10Gb, 40Gb, and 100Gb Ethernet, RDMA and RoCE, etc., all working
to build out the most fierce Lamborghini of a computing system the world
has ever seen.
But Prof. Livny's observation was that
the paradigm of large-scale computing is shifting around us, and it will
have the same kind of revolutionary impact on computing as the introduction
of the PC. We are entering a world where for an absurdly modest price,
you can harness the power of tens or hundreds of thousands of CPU cores
for only as long as you need it. Even with the most dense Xeon chips in
the biggest UV 3000 available, to the tune of millions upon millions of
dollars, you can't even remotely come close to the power and scale that's
available in Amazon EC2, Azure, and the rest for however many pennies per
hour you want to spend.
Simple math dictates that a hundred
machines which take ten minutes to run a given task will complete more
of those tasks in a given time than a tricked-out muscle-machine which
can complete the same task in ten seconds, and that's what "High Throughput"
is all about, and that was what I saw as the crux of Prof. Livny's observation:
the most important work in large-scale computing in the coming years is
going to be figuring out how to adapt the design of algorithms to this
new reality - figuring out how to run your four-week 20-core MPI job in
a few hours on 20,000 intermittently-available EC2 spot instances instead.
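
The "simple math" above can be spelled out in a few lines of Python. This is only a sketch under idealized assumptions: tasks are fully independent, every machine is available for the whole window, and there is no scheduling or data-movement overhead. The function names are hypothetical.

```python
def tasks_completed(machines, seconds_per_task, window_seconds):
    """Tasks finished in a time window when each machine runs one task at a time."""
    return machines * (window_seconds // seconds_per_task)


def ideal_hours(cores_used, hours_used, cores_available):
    """Best-case wall-clock time if the same core-hours divide perfectly."""
    return cores_used * hours_used / cores_available


one_hour = 3600
farm = tasks_completed(100, 600, one_hour)   # 100 machines, ten minutes per task
muscle = tasks_completed(1, 10, one_hour)    # one machine, ten seconds per task
print(farm, muscle)                          # 600 360

# Four weeks on 20 cores is 13,440 core-hours; spread over 20,000 cores
# that is well under an hour in the best case.
print(ideal_hours(20, 4 * 7 * 24, 20000))
```

The best-case figure is a lower bound, not a prediction: a tightly coupled MPI job will not divide this cleanly, which is why the algorithm redesign is the hard part.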
Since I only have about twenty minutes
in my time slot, I'd be delighted if someone who has thought through this
issue could offer a pithy, memorable, and succinct way to express this
idea to a potentially skeptical audience. Or a link to one.
Thanks for any suggestions!
Hope to see some of you at HTCondor
Week two weeks hence!
Michael V. Pelletier
IT Program Execution