
Re: [HTCondor-users] load-balanced central manager?



On Nov 8, 2013, at 11:13 AM, Pek Daniel <pekdaniel@xxxxxxxxx> wrote:

> Hi,
> 
> Is it possible to use Condor in a way like there's multiple running
> instances of every component (including negotiator) in a pool, and in
> this way to provide a load-balanced fail-tolerant environment? Or is
> it possible to use only one single negotiator in a pool at once (I
> know it's possible to do fail-over with had)?

A few thoughts:
- You can set up a fault-tolerant pool with HTCondor HA.  RAL and CMS do this for their respective pools.
- It is possible to tell a negotiator to perform matchmaking for a subset of a pool (where the subset is determined by a given expression, NEGOTIATOR_SLOT_CONSTRAINT).  This provides a sort of primordial sharding.
- I believe there's a way to correctly forward data from sub-collectors to a central one so all the usage is recorded in one place.  I've never done this before, but Dan Bradley may be able to provide some help.
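For reference, a minimal sketch of what the HA and sharding pieces might look like in the config (hostnames, ports, and the constraint attribute are hypothetical; see the HTCondor manual for the full set of HAD/replication knobs):

```
# --- HA central manager (put on both CM machines) ---
# Hypothetical hostnames; replace with your central managers.
CONDOR_HOST = cm1.example.com, cm2.example.com
HAD_PORT = 51450
REPLICATION_PORT = 41450
HAD_LIST = cm1.example.com:$(HAD_PORT), cm2.example.com:$(HAD_PORT)
REPLICATION_LIST = cm1.example.com:$(REPLICATION_PORT), cm2.example.com:$(REPLICATION_PORT)
HAD_USE_REPLICATION = True
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
# Let HAD decide which machine runs the active negotiator.
MASTER_NEGOTIATOR_CONTROLLER = HAD

# --- Sharding: this negotiator only matches a subset of slots ---
# Any ClassAd expression over the slot ads works; SubCluster is
# a made-up custom attribute you'd have to advertise yourself.
NEGOTIATOR_SLOT_CONSTRAINT = (SubCluster == "A")
```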

I believe one missing piece is having fairshare work correctly across multiple negotiators (i.e., forwarding fairshare state from a central negotiator).  I don't think it would be that hard to implement (one might even be able to hack something together), but I'm pretty sure it doesn't exist today.

> 
> I've read about flocking also. So in that way there'd be a number of
> pools available with their own central managers. What happens before a
> job get flocked? Does flocking help to provide some kind of load
> balancing between several central managers? Or it makes the situation
> even worse because it requires extra work from central managers?
> 

The amount of work required is basically equal (although flocking is only considered after the local negotiator can't match the job).
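For concreteness, flocking is driven by a pair of knobs, one on each side (hostnames here are hypothetical):

```
# On the submit machine's schedd: pools to try, in order, after
# the local negotiator fails to match a job.
FLOCK_TO = cm-pool2.example.com, cm-pool3.example.com

# On each remote pool's central manager: submit machines whose
# flocked jobs this pool will consider.
FLOCK_FROM = submit.pool1.example.com
```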

One thing worth highlighting is that flocking does provide separate failure domains -- one subcluster can fail entirely with minimal effects elsewhere.

You end up with the same issue as above -- how do you make sure fairshare is done across the entire site?  Something to think about.

Brian