Re: [Condor-users] controlling memory intensive jobs
- Date: Mon, 9 Nov 2009 07:52:26 -0800
- From: Ian Chesal <ICHESAL@xxxxxxxxxx>
- Subject: Re: [Condor-users] controlling memory intensive jobs
> I am curious about your dynamic policies now. At our lab these servers
> keep having memory problems.
Mag, Matt Hope answered with much of what I would have said so I won't repeat it.
I don't actually use dynamic partitioning. It's new and most of my farm isn't running 7.2.x yet.
I too have a pretty good idea of how my jobs behave. They're all sliced up into nice memory buckets by our submission front end. And for the most part, because it's part of the engineer's job here, everyone knows just how much memory and CPU their jobs will need. The submission front end we use ensures no one gets into the system without a memory spec and a disk spec on their jobs. If they don't specify one, the system slaps a default spec on the job that assumes it uses the maximum of everything, so they quickly learn not to be lazy. Works wonders.
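To make the defaulting rule concrete, here is a minimal Python sketch of the idea. It is not our actual front end: the JobSpec type, the apply_defaults function, and the two default values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical worst-case defaults (assumed values, not real farm settings).
DEFAULT_MEMORY_MB = 16384
DEFAULT_DISK_MB = 100000


@dataclass
class JobSpec:
    """Resources a user declared for a job; None means not specified."""
    memory_mb: Optional[int] = None
    disk_mb: Optional[int] = None


def apply_defaults(spec: JobSpec) -> JobSpec:
    """Fill any missing field with the worst-case default."""
    return JobSpec(
        memory_mb=spec.memory_mb if spec.memory_mb is not None else DEFAULT_MEMORY_MB,
        disk_mb=spec.disk_mb if spec.disk_mb is not None else DEFAULT_DISK_MB,
    )
```

A job submitted with only a memory spec keeps that value and picks up the pessimistic disk default, which makes it harder to place and nudges the user to fill in the blank next time.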
On my nodes I unbalance the static partitions to create slots that deal with big memory jobs and slots that deal with small memory jobs. On my 4 core x 2 CPU machines I'll typically assign 1 core to each slot, but 4 of the slots will each get 15.5% of the RAM and disk and the other 4 will each get 9.5% of the RAM and disk. These numbers were arrived at after some careful study of the jobs people run, how often they're wrong with their memory guesses, and how we can best avoid out-of-memory problems when pairing jobs on machines.
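The split adds up to the whole machine: 4 x 15.5% + 4 x 9.5% = 62% + 38% = 100%. A small Python sketch of that arithmetic follows. The function name and the 16384 MB total are assumptions for illustration only, not figures from my farm.

```python
def slot_memory_mb(total_mb, shares):
    """Return (slot_count, MB per slot) for each slot type.

    shares is a list of (slot_count, percent_per_slot) pairs.
    Raises ValueError if the shares do not cover exactly 100%.
    """
    total_percent = sum(count * percent for count, percent in shares)
    if abs(total_percent - 100.0) > 1e-9:
        raise ValueError(f"shares add up to {total_percent}%, not 100%")
    return [(count, total_mb * percent / 100.0) for count, percent in shares]


# Four big slots at 15.5% each and four small slots at 9.5% each,
# on a hypothetical machine with 16384 MB of RAM.
layout = slot_memory_mb(16384, [(4, 15.5), (4, 9.5)])
```

On that assumed 16384 MB box the big slots come out at roughly 2540 MB each and the small slots at roughly 1556 MB each.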
For the most part all of my jobs require one core and only one core. So the 1:1 slot:core ratio works.
I have one more trick up my sleeve that we use for multi-threaded or multi-process jobs that want more than one core. Slot 1 on my machines will vacate all the other slots on the box and set them to Owner if a job with the IsMultiThreaded=1 attribute lands in that slot.
So if a user needs a whole machine they can submit, targeting only slot 1 on machines, with the attribute IsMultiThreaded=1 set on the job, and they'll be assured of getting the entire box when their job starts. It's obviously a very destructive option and needs to be used with care or you can end up killing forward progress on your non-MT jobs.
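As a rough illustration of the job side only, this Python sketch builds a submit description that targets slot 1 and carries the custom attribute. It assumes the machine ads expose SlotID, and the function name and executable path are hypothetical. The machine-side policy that reacts to IsMultiThreaded is not shown here.

```python
def whole_machine_submit(executable):
    """Build a submit description for a job that wants the whole box.

    Assumes slot 1 is the slot configured to react to IsMultiThreaded.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Only match slot 1 on any machine.
        "requirements = (SlotID == 1)",
        # Custom job attribute that the slot 1 policy looks for.
        "+IsMultiThreaded = 1",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical executable path, for illustration only.
submit_text = whole_machine_submit("/path/to/mt_job")
```

The resulting text can be written to a file and handed to condor_submit.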
If you want the config snippets for the above setup let me know and I'll try and get them into post shape. Actually, if you search the archives I may have, at one point, posted them in a non-Altera farm format. :)
Hope that helps!