
[HTCondor-users] Looking for a good HPC vs. HTC soundbite



Hi folks,

I'm on my way to a company-internal symposium in Tucson where I'll be giving a presentation about some interesting recent work with HTCondor, in the High *Performance* Computing track. I'll be talking to an audience which, by and large, is unfamiliar with HTCondor capabilities, and if they've seen it before, it might be v6.x or earlier. This, of course, got me to thinking about something that I recall Prof. Livny talking about at HTCondor Week last year.

The emphasis in "High Performance" computing has been to orchestrate as many cooperating CPU cores as possible, running as fast as possible - you see things like the SGI UV 3000 series, where you have up to 256 sockets (not cores, *sockets*... zomg...) on a cache-coherent memory image, the proliferation of InfiniBand fabrics, 10Gb, 40Gb, and 100Gb Ethernets, RDMA and RoCE, etc. etc., all working to build out the fiercest Lamborghini of a computing system the world has ever seen.

But Prof. Livny's observation was that the paradigm of large-scale computing is shifting around us, and the shift will have the same kind of revolutionary impact on computing as the introduction of the PC. We are entering a world where, for an absurdly modest price, you can harness the power of tens or hundreds of thousands of CPU cores for only as long as you need them. Even with the densest Xeon chips in the biggest UV 3000 available, to the tune of millions upon millions of dollars, you can't come remotely close to the power and scale that's available in Amazon EC2, Azure, and the rest for however many pennies per hour you want to spend.

Simple math dictates that a hundred machines which each take ten minutes to run a given task will complete more of those tasks in a given time than a single tricked-out muscle machine which can complete the same task in ten seconds. That's what "High Throughput" is all about, and that was what I saw as the crux of Prof. Livny's observation: the most important work in large-scale computing in the coming years is going to be figuring out how to adapt the design of algorithms to this new reality - figuring out how to run your four-week, 20-core MPI job in a few hours on 20,000 intermittently available EC2 spot instances instead.
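For the skeptics in the room, that "simple math" is easy to make concrete. Here's a quick back-of-the-envelope sketch (the numbers are just the illustrative ones from above, not benchmarks of any real system):

```python
# Throughput comparison: a pool of many slow workers vs. one fast machine.
# All figures are illustrative, taken from the example in the text.

SECONDS_PER_HOUR = 3600

# A pool of 100 machines, each taking 10 minutes (600 s) per task.
pool_size = 100
pool_task_seconds = 10 * 60
pool_tasks_per_hour = pool_size * (SECONDS_PER_HOUR / pool_task_seconds)

# One "muscle machine" that finishes a task in 10 seconds.
muscle_task_seconds = 10
muscle_tasks_per_hour = SECONDS_PER_HOUR / muscle_task_seconds

print(f"Pool:   {pool_tasks_per_hour:.0f} tasks/hour")   # 600 tasks/hour
print(f"Muscle: {muscle_tasks_per_hour:.0f} tasks/hour")  # 360 tasks/hour
```

The single machine is 60x faster per task, yet the pool still finishes two-thirds more tasks per hour - and the gap only widens as the pool grows.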

Since I only have about twenty minutes in my time slot, I'd be delighted if someone who has thought through this issue could offer a pithy, memorable, and succinct way to express this idea to a potentially skeptical audience. Or a link to one.

Thanks for any suggestions!

Hope to see some of you at HTCondor Week two weeks hence!


Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681)
339.293.9149 cell
michael.v.pelletier@xxxxxxxxxxxx