
Re: [Condor-users] counting licenses



On 5/4/05, Joshua Kolden <joshua@xxxxxxxxxxxxxxxxx> wrote:
> Matt Hope wrote:
> 
> >On 5/4/05, Joshua Kolden <joshua@xxxxxxxxxxxxxxxxx> wrote:
> >
> >
> >>I need to have a dynamic expression in which I can tick off licenses in
> >>use independently of the cpu I'm running on.  Are dynamic expressions
> >>like this even possible?
> >>
> >Dynamic user defined expressions are not really supported by condor itself.
> >
> Hmm, that's too bad.  Most of the render queue technology we use in
> visual effects production has this functionality; I had always kind of
> thought of it as a core feature.

I meant dynamic as in: defined externally to the system, but inherently
tied to events originating both within the system and outside it (so it
can update in response to both).

> >Conceptually (ignoring implementation and physical location for now) you need:
> >
> >A 'licence server' which hands out a licence, takes it back and
> >reports how many are free to use.
> >
> >
> In other applications there is simply a way to define local resources
> and global resources.  A cpu would be a local resource, while network
> load might be a global resource.  Advanced queue systems will even allow
> you to run a program (or a plug-in) that will assess the current state of
> the global resource.  I might consume one of each on a particular job,
> for example, by taking a cpu from a local resource and a software use
> from the global resource list.  But the global resource list can be
> anything you like: you just define a variable and a count, and in your
> job an expression to consume it.

I can't comment on why condor does not come with such functionality -
perhaps its heritage means this was never a core issue for it...

> >Realistically condor could do with a condor_licence daemon and a
> >simple way for jobs to indicate that they require a licence. Condor
> >would then be able to control the whole thing much more effectively.
> >Of course making this a reality is a slightly more complex issue...
> >
> >
> A) this is overly complex, and B) too specific; a general solution that
> has a non-CPU-constrained resource that can be consumed by an expression
> would be more consistent with the ClassAd idea.

I did mean it in a rather more general sense - but I would be wary of
trying to start with a system which has full generality. For example,
in condor the evaluation of which jobs run is done not only by the
negotiator but also by the individual machines themselves. In the event
of a two-way disagreement the machine decides (fair enough), but what
happens if there is a three-way disagreement between the node, the
negotiator and a "resource contention daemon"?

A simpler starting point that is easier to reason about might be a
better first try.

> >I know you don't want to hear this but at the moment any dynamic
> >solution you come up with is likely to be more wasteful/error-prone
> >than simply designating a set of machines as maya-enabled and having
> >jobs requiring these machines preempt any non-maya jobs running on
> >them.
> >
> >
> This is unfortunate; I had just assumed that there was a dynamic
> expression system, and I really can't see how the queue can function
> properly without it.  How, for example, do you keep too many jobs from
> running at once and thrashing the network?  How do you consume disk
> space in a quota environment?  How do you handle any licensed software
> that floats?

In a future release the plan appears to be that local disk space will
be handled by the starter enforcing constraints and killing the job if
it violates them. Remote disk space is the purview of the filesystem's
quota system...
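
In the meantime you can approximate something from the submit side
today - an untested sketch, assuming you want to cap a job at roughly
1GB of scratch space (condor reports DiskUsage in KB):

  # submit description snippet: put the job on hold if its
  # scratch usage exceeds roughly 1GB (DiskUsage is in KB;
  # the 1GB figure is just an example threshold)
  periodic_hold = DiskUsage > 1000000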

There seems to be code relating to sophisticated network management
built into condor but not yet enabled - I don't know whether it is
simply not ready for prime time...

Condor is very much about the individual users having a reasonable
awareness of the impact of their jobs on the wider world and
throttling them as they see fit.

The two ideas from John Wheez (using dagman to control the flow of
jobs requiring licences) and Miron Livny (preempt any non-maya job)
sound like the best current solutions:

1) dagman (rough sketch below)
pro - potentially reasonably close to optimal (might even be able to
get the single-licence/two-cpu case to work with some effort)
con - rather complex

2) Have a machine with a schedd solely for maya jobs. Throttle its
MAX_JOBS_RUNNING to the number of licences (config sketch below).
pro - very simple
con - non-optimal

2a) As 2, but set the max_jobs to licences * (number of smpX machines * X),
where X is the number of cpus on the machine. Ensure all jobs are
submitted on hold.
Then have a management job running which checks the usage of licences
on the farm and releases/holds jobs to maintain this number (rough
script sketch below).
For this to work the jobs should ensure that their requirements aren't
too restrictive, or else you could get no, or severely throttled,
forward progress.
pro - could achieve near-optimal behaviour, especially if the job rank
can be adjusted to prefer machines with an already-running job
con - may briefly allow jobs to start when the licence count is
exceeded. At that point the job should put itself on hold rather than
terminating (since termination would be viewed as completion and the
job would not run again).
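
On (1), the sort of thing I have in mind, very roughly - untested, and
the frame*.sub names are just hypothetical per-frame submit files:

  # licence.dag - one node per render; no dependencies are needed,
  # the submission throttle alone limits concurrency
  # (frame*.sub are placeholder submit file names)
  JOB frame001 frame001.sub
  JOB frame002 frame002.sub
  JOB frame003 frame003.sub

then submit it with something like

  condor_submit_dag -maxjobs 10 licence.dag

where 10 stands in for the number of licences; -maxjobs caps how many
node jobs dagman will have in the queue at any one time.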
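
On (2), the throttle itself is just schedd configuration on the
dedicated maya submit host, e.g. (adjust the number to match your
licence count):

  # condor_config.local on the maya-only submit machine
  MAX_JOBS_RUNNING = 10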
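
On (2a), an untested sketch of the management script. It assumes the
maya jobs are submitted with 'hold = True' and carry a custom attribute
via '+MayaJob = True' in their submit files, and that 10 licences
exist:

  #!/bin/sh
  # release held maya jobs while licences appear to be free;
  # run periodically, e.g. from cron
  LICENCES=10

  # MayaJob is the attribute assumed to be set via '+MayaJob = True'
  # in the submit file; JobStatus 2 = running, 5 = held
  RUNNING=`condor_q -constraint 'MayaJob =?= True && JobStatus == 2' -format "%d\n" ClusterId | wc -l`
  FREE=`expr $LICENCES - $RUNNING`

  if [ "$FREE" -gt 0 ]; then
    for job in `condor_q -constraint 'MayaJob =?= True && JobStatus == 5' -format "%d." ClusterId -format "%d\n" ProcId | head -n "$FREE"`; do
      condor_release "$job"
    done
  fi

Counting running jobs is only a proxy for licence usage; if your
licence server can be queried directly (lmstat or similar) it would be
more robust to take the free-licence count from there, since that also
catches use from outside condor.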

Matt