Re: [Condor-users] how to limit no of running jobs ?
- Date: Mon, 5 Jun 2006 12:01:32 +0100
- From: "Matt Hope" <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] how to limit no of running jobs ?
On 6/5/06, Dr Ian C. Smith <i.c.smith@xxxxxxxxxxxxxxx> wrote:
Thanks for the speedy reply. I always thought this was part of the
Condor functionality but apparently not. The reason I ask is that I can
see two different groups of our Condor users developing.
The first run small numbers of long (as in weeks) jobs under DAGMan;
the second will be running large numbers of short (~30 min) jobs without
DAGMan. I'm worried that jobs from the first group will be edged out by
the second - is this likely to be the case? Should I in some way
increase the priority of the long jobs?
Hehe - welcome to my world; it's roughly the same for me, but I have the
additional requirement that certain jobs always run before others
(kicking others off as needed).
Since checkpointing on Windows is a bit of a nightmare, preemption of
long-running jobs should be avoided at all costs, so I have organised
things by partitioning the farm on a VM basis (all machines are SMP, so
I can very easily do different things keyed on the VirtualMachineId). I
then set the first VMs to always prefer the long jobs (users are
expected to indicate their job types - if they don't, they go to the
bottom of the pile*) and the second ones to prefer the short jobs.
Some more important long-running jobs are then also allowed to run on
VM2 with higher rank than anything else.
The users with long-running jobs tend not to allow their jobs to run on
VM2 (apart from those high-priority ones); the short-running jobs tend
to be allowed to run anywhere (filling in the cracks where possible).
By keeping the number of high-priority jobs manageable (by having a
special schedd and limiting the max jobs to about 2/3rds of the farm's
VM2s), most users get done in a reasonable amount of time, though
occasionally the fast ones can see a day or so of latency.
I make no use of user priority except for balancing users within the
same class of jobs.
* note - in all this I have the following assumptions:
1) My users won't lie (though they may occasionally screw up)
2) That if I need to segment some users jobs I can get them running on
their own schedd without too much effort (my systems guys are very
helpful that way)
You may not have these luxuries and will have to adapt accordingly.
I've never quite understood how Condor shares resources between users.
For schedulers like Sun Grid Engine there are a variety of policies
which can be employed.
This is conceptually reasonably simple - it attempts to distribute
resources such that, within a time window, the relative execution time
given to each user matches the relative weighting of the users as
defined by the admin.
Obviously the trick here is how the time window works: it is
essentially the half-life of the decay function applied to previous usage.
The tricky aspect is how that affects preemption, since if a job runs
for a long time you at some point need to decide whether to kick it to
balance the books. If there are short jobs in the mix this shouldn't
happen very often, since the negotiator gets more of a chance to keep
things in trim as they finish.
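The decay itself is just an exponential with the configured half-life. A
minimal model of it (my own sketch of the idea, not Condor's actual
accountant code) looks like:

```python
def decayed_usage(prev_usage, recent_usage, dt, half_life):
    """Decay previously accumulated usage by the configured half-life,
    then add the usage accrued during the last interval of length dt.
    A toy model of a fair-share accountant, not Condor's real code."""
    return prev_usage * 0.5 ** (dt / half_life) + recent_usage

# After two half-lives, old usage counts for a quarter of its value:
print(decayed_usage(8.0, 0.0, dt=2.0, half_life=1.0))   # 2.0

# With a much longer half-life, the same old usage still weighs heavily:
print(decayed_usage(8.0, 0.0, dt=2.0, half_life=24.0))  # ~7.55
```

The shorter the half-life, the faster a heavy user's accumulated usage is
forgiven, and the closer the negotiator tracks only current demand.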
Layered on top of this is the concept of group-based accounting (I don't
use it, since things change too often round here for me to maintain
reliable user-based group membership, versus job meta-information which
I can change in a hurry if need be).
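For reference, that group-based accounting is driven by negotiator
settings along these lines (the group names and quota numbers here are
made up for illustration; check the manual for your Condor version):

```
# Hypothetical negotiator config sketch for group quotas.
GROUP_NAMES = group_long, group_short
GROUP_QUOTA_group_long  = 50
GROUP_QUOTA_group_short = 150
```

Jobs then opt in from the submit file with something like
`+AccountingGroup = "group_long.username"` - which is exactly the sort
of static membership mapping that is a pain to keep accurate when users
move around a lot.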
As I said, I don't really use this (I have the half-life set to 1 second
so only immediate use counts), and startd-based ranking deals with
prioritization. Sadly this means users must self-select to avoid
preemption, but it works reasonably well when the number of groups is
very small relative to the number of machines (currently 3 distinct
groups across several hundred nodes).
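The 1-second half-life mentioned above corresponds to the negotiator's
PRIORITY_HALFLIFE setting (a real Condor parameter, value in seconds;
the default is one day):

```
# Negotiator config: decay half-life for accumulated usage, in seconds.
# Default is 86400 (one day); 1 means only immediate usage counts.
PRIORITY_HALFLIFE = 1
```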