Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] how to limit no of running jobs ?

Date: Mon, 05 Jun 2006 14:43:21 +0100
From: "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] how to limit no of running jobs ?

Wow - what a can of worms! I think I'll partition the
pool so that the long-running jobs run only on the PCs
in one of our classrooms. This info is advertised
through the machine ClassAds, so it's quite easy to set up.

cheers,

-ian.

--On 05 June 2006 12:01 +0100 Matt Hope <matthew.hope@xxxxxxxxx> wrote:

On 6/5/06, Dr Ian C. Smith <i.c.smith@xxxxxxxxxxxxxxx> wrote:


Thanks for the speedy reply. I always thought this was part of the
Condor functionality but apparently not. The reason I ask is
that I can see two different groups of our Condor users developing.
The first run small numbers of long (as in weeks) jobs under DAGMan,
the second will be running large numbers of short (~ 30 mins) jobs
without DAGMan.
I'm worried that jobs from the first group will be edged out by the
second - is this likely to be the case ? Should I in some way increase
the priority of the long jobs ?


Hehe - welcome to my world, roughly the same for me but I have the
additional requirement of certain jobs always running before others
(and kicking as needed)

Since checkpointing on Windows is a bit of a nightmare, preemption
of long-running jobs should be avoided at all costs. I have organised
it by partitioning the farm on a VM basis (all machines are SMP, so I
can very easily do different things based on multiplying by the
VirtualMachineId). I then set the first VMs to always prefer the long
jobs (users are expected to indicate their job types - if they don't,
they go to the bottom of the pile*) and the second ones to prefer the
fast ones.
Some more important long-running jobs are then allowed to run on VM2
with higher rank than anything else.
The users with slow-running jobs tend not to allow their jobs to run
on VM2 (apart from those high-priority ones). The short-running jobs
tend to be allowed to run anywhere (trying to fill in cracks where
possible).
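
The preference scheme described above can be sketched roughly as
follows. This is only an illustration in Python, not Condor
configuration syntax: the job type labels ("long", "short",
"priority_long"), the numeric scores and the helper names slot_rank
and pick_job are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    job_type: str  # hypothetical labels: "long", "short", "priority_long", or "" if unset


def slot_rank(vm_id: int, job: Job) -> int:
    """Return a preference score for a job on a given VM (higher wins).

    The scores are illustrative only: VM1 prefers long jobs, VM2 prefers
    short jobs but ranks the important long jobs above everything else.
    Jobs with no declared type fall to the bottom of the pile (score 0).
    """
    if vm_id == 1:
        table = {"long": 2, "priority_long": 2, "short": 1}
    else:
        table = {"priority_long": 3, "short": 2, "long": 1}
    return table.get(job.job_type, 0)


def pick_job(vm_id: int, jobs: list[Job]) -> Job:
    """Pick the job this VM would most like to run."""
    return max(jobs, key=lambda job: slot_rank(vm_id, job))


if __name__ == "__main__":
    queue = [Job("untyped", ""), Job("quick", "short"),
             Job("sim", "long"), Job("urgent", "priority_long")]
    print(pick_job(1, queue).name)  # first of the top-scoring long jobs on VM1
    print(pick_job(2, queue).name)  # the important long job on VM2
```

The sketch leaves out the job-side restriction mentioned above (long
job owners choosing not to run on VM2); it only models the preference
of each VM.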

By keeping the number of high-priority jobs manageable (by having
a special schedd and limiting the max jobs to about two thirds of the
farm's VM2s), most users get done in a reasonable amount of time,
though occasionally the fast ones can have a day or so of latency.

I make no use of user priority except for balancing users within the
same segment.

* note - in all this I have the following assumptions:
1) My users won't lie (though they may occasionally screw up)
2) That if I need to segment some users' jobs I can get them running on
their own schedd without too much effort (my systems guys are very
helpful that way)

You may not have these luxuries and will have to adapt accordingly

I've never quite understood how Condor shares resources between users.
For schedulers like Sun Grid Engine there are a variety
of policies which can be employed.


This is conceptually reasonably simple - it attempts to distribute
resources such that within a time window the relative execution time
available to each user sums to values which match the relative
weighting of the users as defined by the admin.

Obviously the trick here is how the time window works: it is
essentially the half-life of the decay function applied to previous
usage.
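
As a rough illustration of what a half-life on previous usage means,
here is a minimal Python sketch of plain exponential decay. It is not
Condor's actual accounting code; the function name decayed_usage and
the units (seconds) are assumptions made for the example.

```python
def decayed_usage(past_usage: float, elapsed_s: float, half_life_s: float) -> float:
    """Weight that past usage still carries after elapsed_s seconds.

    Generic exponential decay: the contribution halves every
    half_life_s seconds. Illustrative only.
    """
    if half_life_s <= 0:
        raise ValueError("half_life_s must be positive")
    return past_usage * 0.5 ** (elapsed_s / half_life_s)


if __name__ == "__main__":
    one_day = 86400.0
    # With a one-day half-life, usage from a day ago counts for half.
    print(decayed_usage(100.0, one_day, one_day))  # 50.0
    # With a one-second half-life, usage from a minute ago is negligible.
    print(decayed_usage(100.0, 60.0, 1.0) < 1e-9)  # True
```

A long half-life makes the scheduler remember heavy users for a long
time; a very short one means only current usage matters.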

The tweak aspect is how that affects preemption, since if a job runs
for a long time you at some point need to decide whether you will kick
it to try to balance the books. If there are short jobs then this
shouldn't happen very often, since the negotiator gets more of a chance
to keep things in trim each time one of them finishes.

There is, layered on top, the concept of group-based accounting (I
don't use this, since things change too often round here for me to
maintain reliable user-based group membership, versus job meta
information, which I can change in a hurry if need be).

As I said, I don't really use this. I have the half-life set to 1
second, so only immediate use counts, and startd-based ranking deals
with prioritization. Sadly this means users must self-select to avoid
preemption, but this works reasonably well if the number of groups is
very small relative to the number of machines (currently 3 distinct
groups with several hundred nodes).

Matt
