
Re: [Condor-users] 7.8 and defrag or similar of dynamic slots

On Monday, 21 May, 2012 at 6:36 AM, Ian Cottam wrote:

On 18/05/2012 20:04, "Dan Bradley" <dan@xxxxxxxxxxxx> wrote:



On 5/18/12 12:19 PM, Ian Cottam wrote:
We are thinking about updating to 7.8.0.
I noticed that there is, with 7.8, a defrag daemon for dynamic slots.
On our (main) pool we have preemption off anyway: am I right in thinking
that this defrag then is not for us?

Defragmentation is desirable when jobs requiring large slots (e.g. many
cores or big memory) suffer from starvation (rarely or never getting
scheduled to run) due to fragmented machines. Machines become
fragmented when they are partitioned into small slots to fit small
jobs. If many small jobs are running on a machine at the same time, the
chance is small that they will all exit at the same time, freeing up a
large chunk of resources for large jobs to use. The Condor negotiator's
resource allocation algorithm currently just works with the slots that
exist. It does not make reservations or preempt multiple slots, so some
method of defragmenting machines is needed to avoid the problem of
starvation of large jobs.
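For reference, enabling the condor_defrag daemon is done in the condor_config; a minimal sketch (the specific values here are illustrative, not recommendations for your pool):

```
# Run the condor_defrag daemon alongside the existing daemons
DAEMON_LIST = $(DAEMON_LIST), DEFRAG

# How often (in seconds) defrag considers draining more machines
DEFRAG_INTERVAL = 600

# Throttle the rate and concurrency of draining
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10

# Stop initiating drains once this many whole machines are available
DEFRAG_MAX_WHOLE_MACHINES = 20
```

Tune the throttles to trade throughput of small jobs against how quickly whole machines become available for large ones.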

Defragmentation can cause jobs to be killed. If you do not want that,
MaxJobRetirementTime can be used to specify how long jobs should be
allowed to run on machines that are being drained.
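For example, to let running jobs finish for up to six hours before a draining machine evicts them, the startd configuration might include (value illustrative):

```
# Grace period (in seconds) that jobs get on a machine being drained
MAXJOBRETIREMENTTIME = 6 * 3600

# "graceful" draining honors the retirement time before killing jobs
DEFRAG_SCHEDULE = graceful
```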


I only ask because with 7.4/7.6 and dynamic slots we sometimes see
partial matches that don't go through, and wondered if there was
something in 7.8 that helps with this.

If by "partial matches that don't go through" you mean the starvation
problem I mentioned above, then condor_defrag can help. If it is some
other problem, then it may or may not help.

--Dan


What we have is jobs that Match but never start.
We have just demonstrated that if we move the Memory requirement from the
Requirements line to a Request_memory=n line, they work.
We are not entirely sure why.
-Ian

This is the new normal for 7.8.0 and beyond. In fact, I thought condor_submit was supposed to warn if you used the requirements string to specify this. There was a debate on condor-dev about it but I'm not sure how it ended up.

In any case, going forward it's recommended you use request_memory, request_cpus and request_disk for these job constraints. They behave better in dynamic-slot pool setups, and at some point in the future they will be the only way to specify these constraints, with use of the requirements expression for these three things becoming completely unsupported.
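As a minimal sketch of the recommended submit-file style (the executable name and resource values here are illustrative):

```
# Old style, discouraged under dynamic slots:
#   requirements = (Memory >= 2048)

# New style: state resource needs so the negotiator can carve a slot to fit
executable     = my_job      # hypothetical executable
request_memory = 2048        # in MB
request_cpus   = 1
request_disk   = 1048576     # in KB (i.e. 1 GB)
queue
```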

There should be some lines about this in the 7.8.0 or 7.7.x release notes.

Hope that helps.

- Ian


-- 
Ian Chesal
ichesal@xxxxxxxxxxxxxxxxxx
Cycle Computing, LLC