On Wed, Sep 1, 2010 at 11:51 AM, Matt Hope <Matt.Hope@xxxxxxxxxxxxxxx> wrote:
We actually only use the speculative stuff for machines due for reboot now.
We have found that, due to the increased number of cores relative to local memory/disk in recent hardware, a simple partitioning based on fixed slots, with some of them having more cores assigned, is sufficient. Those 'large slots' then use RANK to strongly prefer jobs from a certain class over others, and big jobs won't run on small slots, but small jobs will run on big slots (just the one, though, not N where N is the number of cores in that slot).
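A fixed-slot layout like that can be expressed in condor_config roughly along these lines. This is a sketch, not our actual config: the slot sizes and counts are made up, and JobClass is a hypothetical custom attribute jobs would set in their submit file (e.g. +JobClass = "big"):

```
# Illustrative condor_config fragment (names and sizes are made up).
# One 8-core "large" slot plus four single-core slots per machine.
SLOT_TYPE_1        = cpus=8, memory=50%
NUM_SLOTS_TYPE_1   = 1
SLOT_TYPE_2        = cpus=1, memory=auto
NUM_SLOTS_TYPE_2   = 4
NUM_SLOTS          = 5

# Big jobs may only run on slot 1; small jobs may run anywhere.
START = (SlotID == 1) || (TARGET.JobClass =!= "big")

# The large slot strongly prefers big jobs over everything else.
RANK  = ifThenElse(SlotID == 1 && TARGET.JobClass =?= "big", 1000, 0)
```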
For the most part I found it was best to dedicate hardware to job classes: big, parallel jobs got a dedicated set of machines, and everything else went to a single-threaded pool with slots set to varying memory sizes, since the jobs we were running had pretty bin-able memory footprints we could exploit.
Hardware was just cheap enough to use the "throw more hardware at it" approach, and the queues had a near-constant stream of each class of job in them, so not much was going to waste.
I arrived at this solution after playing around with some fancy preemption setups where machines would advertise N slots for medium and small jobs, but a big job could take slot 1 and preempt slots 2 through N any time it wanted to. The idea was that we'd use this setup on a few machines where big jobs would go, but the machines wouldn't be under-utilized when there were no big jobs in the queue; they'd still run N medium and small jobs. It didn't take us long to see that these machines were running big jobs 100% of the time, so the setup could be simplified considerably and the machines set to always advertise just 1 slot.
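For reference, that cross-slot preemption can be sketched with standard startd knobs. This mirrors the setup we abandoned, not a recommendation, and it assumes jobs carry a custom JobClass attribute (a hypothetical name) that gets published into the slot ads:

```
# Sketch only. Publish each job's class into its slot ad, and
# cross-publish slot attributes so every slot can see what slot 1
# is doing (visible as Slot1_JobClass, Slot1_State, etc.).
STARTD_JOB_ATTRS  = JobClass
STARTD_SLOT_ATTRS = State, JobClass

# Slots 2..N evict whatever they're running when slot 1 picks up a big job.
PREEMPT     = (SlotID > 1) && (Slot1_JobClass =?= "big")
WANT_VACATE = TRUE
```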
I did look at dynamic slots, and decided that I had no immediate solution to the same problem you describe (starvation of big jobs) and that consistency trumped optimality, given the other limitations not related to core count.
I believe some people on this list are using or experimenting with dynamic slots. I personally (at work) am more likely to push for a move to Job Hooks and deal with everything myself, since we already have a database of tasks to run, and the schedd queue in between is actually a bit of a split-brain issue. I'd rather get rid of it and write my own fair-share code based on our domain-specific setup.
So the last rev of the system at Altera, before I left, took exactly this approach: Job Hooks to pull jobs from a DB using a pull algorithm that was completely our own design. I'd call it a mixed success. There were a ton of growing pains as we brought more and more machines over to the new approach, and Job Hooks have some issues on Windows that destroy some of their usefulness (namely, the startd crashes randomly because it tries to close a pipe before it's flushed when you're using hooks on Windows).
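For anyone curious what the pull side looks like: a fetch-work hook receives the slot ClassAd on stdin and prints a job ClassAd on stdout, with empty output meaning "no work right now". Here's a minimal sketch in Python; parse_classad and next_job_for are hypothetical helpers (our real DB query was in-house), not Condor APIs:

```python
#!/usr/bin/env python
"""Sketch of a fetch-work hook: slot ad in on stdin, job ad out on stdout."""
import sys

def parse_classad(text):
    """Minimal parser: 'Attr = value' lines into a dict (values kept as strings)."""
    ad = {}
    for line in text.splitlines():
        if "=" in line:
            attr, _, value = line.partition("=")
            ad[attr.strip()] = value.strip()
    return ad

def next_job_for(slot_ad):
    """Placeholder for the real task-DB query; return None when there's no work."""
    return {
        "Cmd": '"/usr/local/bin/worker"',
        "Owner": '"batch"',
        "JobUniverse": "5",     # vanilla universe
        "Iwd": '"/tmp"',
    }

def main():
    slot_ad = parse_classad(sys.stdin.read())
    job = next_job_for(slot_ad)
    if job is None:
        return  # print nothing: no work to fetch
    for attr, value in job.items():
        print("%s = %s" % (attr, value))

if __name__ == "__main__":
    main()
```

The hook itself is wired up in condor_config via the STARTD_JOB_HOOK_KEYWORD / <keyword>_HOOK_FETCH_WORK knobs.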
We worked around the Windows issue by having the machine start the hook script as the first job it ran. But then this meant I had to do session cleanup on Windows as part of the hook script which sucks.
The best advice I could offer to someone looking to go completely over to Job Hooks would be: consider doing job queues where the sorting logic is on the queue server side. You provide queues, and all your Job Hooks do is rotate through them and, if there's a job on a queue, pop the top job and run it, so the hooks aren't searching through queues themselves. We were doing that searching as part of the hook, and there were many rabbit holes to fall down along the way to getting it working efficiently and scalably. It took us almost 6 months to really get it tuned, and even then every new rack of blades kept me up at night wondering if it'd scale up one more time.
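The split I'm describing can be sketched like this (hypothetical names; the real thing would be a networked service, and your fair-share policy would live in the ordering the server maintains):

```python
# Sketch: all sorting lives on the queue server; the hook side is a dumb pop.
from collections import deque

class QueueServer:
    """Keeps one already-sorted queue per job class."""
    def __init__(self, classes):
        self.queues = {c: deque() for c in classes}

    def push(self, job_class, job, priority):
        q = self.queues[job_class]
        q.append((priority, job))
        # Sorting happens server-side, once per insert -- the hook never scans.
        self.queues[job_class] = deque(sorted(q))

    def pop(self, job_class):
        q = self.queues[job_class]
        return q.popleft()[1] if q else None

def fetch(server, classes_for_slot):
    """All the hook does: rotate through eligible classes, pop the first hit."""
    for c in classes_for_slot:
        job = server.pop(c)
        if job is not None:
            return job
    return None

# Usage: lower priority number wins, so job-b comes out first.
server = QueueServer(["big", "small"])
server.push("small", "job-a", priority=2)
server.push("small", "job-b", priority=1)
print(fetch(server, ["big", "small"]))  # -> job-b
```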