
Re: [Condor-users] our ideal configuration



On 6/29/07, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
On Fri, Jun 29, 2007 at 10:47:49AM +0200, Horvátth Szabolcs wrote:
> Hi,
>
> I was trying to find the ideal configuration settings for our needs and
> although part of it works nicely I still could not
> get everything working as intended. Any tip, example, or experience with
> similar setups is much appreciated!
>
> - No preemption.

Think about
# never suspend or vacate a running job
WANT_SUSPEND = False
WANT_VACATE = False
SUSPEND = False
CONTINUE = True
# never preempt a running job, and never hard-kill one
PREEMPT = False
KILL = False
# negotiator side: never preempt a claim for a better-priority user
PREEMPTION_REQUIREMENTS = False
and the like, and also have a look at the CondorWeek 2007 presentation
about large compute clusters.

And, for reference, make no use of RANK on the execute machines - a
startd RANK expression is itself a source of preemption.

> - The job execution order should be controlled by custom job attributes
> (low/mid/high priority) and the execution
> order should respect both this setting and condor job priority. User
> priority should balance job count between
> users but an important task should run in spite of user priority.
> Important stuff should be out the door as fast as it can be
> without preemption.

Use RANK, NEGOTIATOR_{PRE,POST}_JOB_RANK
The question is: would you let the users specify whether their job is
high-prio?
Would you seriously expect users to rank their stuff low-prio??
Who would define whether stuff is "important"?
To let others in, use CLAIM_WORKLIFE (keep it low, and "new" users will
be allowed to run jobs as long as their dynamic priority is better than
that of the long-time user)
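
A concrete sketch of that knob (the 20-minute value is only an
assumption; tune it to your job lengths):

# condor_config: stop reusing a claim after 20 minutes, so the slot
# goes back through negotiation and waiting users get a look-in
CLAIM_WORKLIFE = 1200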

In a corporate setting this can work well, so long as the jobs run for
a while: the person whose jobs get kicked by someone 'lying' about
their job's importance can go and find them and tell them off.
It works really well if you have a combination of long-running jobs
and shorter ones. The people who run big jobs on their 'fair
allocation' of the pool are happy to let you run fast jobs on it if
they know their jobs will always outrank the 'opportunistic' ones.
Attributing those jobs automatically (either by a wrapper round the
submit process or a standard submit template, as sketched below) keeps
it all running smoothly.
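
For illustration, a standard template could boil down to a shared
submit-file fragment like this (JobImportance is a made-up attribute
name; pick your own):

# every job gets tagged so that policy expressions and tools can see
# its class; raise the value for urgent work
+JobImportance = 0     # 0 = low, 1 = mid, 2 = high
priority = 0           # per-user job priority, adjustable with condor_prio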

> - Job execution should be as much first-in-first-out as can be but it
> should respect attribute changes of the users.

Choose a short PRIORITY_HALFLIFE.

Yes, I should have mentioned this one; it is very useful if you only
care about the here and now (i.e. you don't want to penalise people
for taking up the slack).

> - DAGs of a user should run one after the other: by default, if the
> maxidle or maxjobs flags are used and more DAGs are submitted, then
> the execution of jobs gets mingled between DAGs; they run in the
> order of job submission.
> I'd like to have the first submitted DAG complete and only then start
> the jobs of the next one.

I don't know how to handle this. Somehow it's against the concept of
Condor (which aims at High Throughput, not high fairness or whatever).

I strongly disagree. Implementing this would have no impact on
throughput at all; it is simply a different ordering of the pending
queue (which could be fair or unfair, it matters not, so long as it is
controllable).
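One untested way to approximate strict DAG-after-DAG ordering is to
nest the DAGs in a master DAG (a sketch; the .condor.sub files are
what condor_submit_dag -no_submit generates for each inner DAG):

# master.dag: run a.dag to completion, then b.dag
JOB A a.dag.condor.sub
JOB B b.dag.condor.sub
PARENT A CHILD B
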
That does, however, nicely bring up the issue of claim lifetime, which
is a trade-off between fine-grained control and throughput. In 6.6 you
couldn't really affect this, but from 6.7-something on you could, with
CLAIM_WORKLIFE.
If your jobs last longer than an hour and your negotiation cycles are
reasonably regular and quick (say every 5 minutes), then I wouldn't
worry too much about making CLAIM_WORKLIFE less than the job runtime.
If you tend to have very long jobs coming from one group of users and
shorter ones from another, then work out how much throughput you are
willing to lose for fairness: you add roughly (0.75 * negotiation
cycle time) to every block of n jobs, where n is the number of short
jobs that would run happily within the claim worklife. The 0.75 is
based on what I measured a while back (with a _very_ rough methodology
- if you care, benchmark the claim overhead and negotiation latency
under load).
Remember that this latency will apply (albeit reduced) even when the
pool is not saturated.
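
To make that concrete (all numbers assumed for illustration): with a
5-minute negotiation cycle, a 30-minute claim worklife and 10-minute
short jobs, n = 3, so each block of three jobs (30 minutes of work)
picks up roughly 0.75 * 5 = 3.75 extra minutes - about 12.5% overhead.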

Matt