
Re: [HTCondor-users] can Condor somehow be a HA?



On 5/22/2017 6:45 AM, lejeczek wrote:
hi fellas

I've only started looking at htcondor, not having a good understanding of it yet I wonder - htcondor has that concept of "central manager" and I wonder if this makes it a valid candidate for HA setup?

Does anybody have any experience with/thoughts on htcondor as HA and could share it here?

many thanks
L.

Hi,

First off, understand that if your installation's central manager dies, currently running jobs will continue to run, and in many cases new jobs will continue to get scheduled as well (i.e. new jobs will still get scheduled to slots that are already claimed). Even in production pools, most sites have no problem with rebooting their central manager or even taking it down for an hour or two. While the central manager is down, users may notice that condor_status stops working, but practically all other common tools continue to work (condor_submit, condor_q, condor_rm, etc). Thus many pools never bother with an HA solution for the central manager.

If you are still concerned, the HTCondor central manager is actually very lightweight and holds very little state (just user priorities), so it is very amenable to a high availability (HA) setup. You essentially have two choices:

1. HTCondor can be configured to have two central managers (hot/hot) and to fail over automatically as needed. See the section in the HTCondor Manual titled "High Availability of the Central Manager" at

http://research.cs.wisc.edu/htcondor/manual/v8.6/3_13High_Availability.html#SECTION004132200000000000000

2. If you already run your services in a managed virtualized setup (Mesos+Marathon, OpenStack, vSphere, Hyper-V, etc) that supports failover, you could set up your HTCondor central manager for HA by leveraging that environment, i.e. the same way you would set up a redundant email server, for instance.
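
To give a feel for option 1, here is a rough sketch of the kind of configuration involved, loosely following the example in the manual section linked above. The hostnames cm1.example.com and cm2.example.com are placeholders, the port numbers are arbitrary choices, and the fragment is trimmed down (timeouts and replication intervals are omitted), so treat it as an illustration rather than a drop-in config and check the manual for your version before using it.

```
# Hypothetical hostnames for the two central managers
CENTRAL_MANAGER1 = cm1.example.com
CENTRAL_MANAGER2 = cm2.example.com

# Every machine in the pool lists both central managers
CONDOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)

# Ports for the HA daemon and the replication daemon (arbitrary choices)
HAD_PORT = 51450
HAD_ARGS = -p $(HAD_PORT)
REPLICATION_PORT = 41450
REPLICATION_ARGS = -p $(REPLICATION_PORT)

# Both lists must be identical, and in the same order, on both central managers
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT)
REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT),$(CENTRAL_MANAGER2):$(REPLICATION_PORT)

# Prefer the first machine in HAD_LIST as primary; replicate the state file
HAD_USE_PRIMARY = true
HAD_USE_REPLICATION = true

# Let condor_had decide which machine runs the negotiator
MASTER_NEGOTIATOR_CONTROLLER = HAD

# Daemons to run on each central manager
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
```

As I understand it, the last six settings (from HAD_LIST down) belong only on the two central manager machines, while the other machines in the pool only need the CONDOR_HOST setting (and the two definitions it refers to) so that they report to both collectors.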


Hope the above helps
Todd