
Re: [Condor-users] CM Failover with submits from CM



Janzen Brewer wrote:
> Thanks for the prompt replies.
> 
> I suppose my question has changed now. Is there any way to implement 
> Condor such that there is no single point of failure?
> 
> I've heard of DRBD, which I suppose could be used for redundancy in a 
> shared file system. I'd prefer not to have to implement it, though, as 
> my co-workers have told me it's more trouble than it's worth (e.g. 
> split-brain issues).
> 
> Thanks,
> Janzen

The split-brain problem is pretty standard in any distributed system,
including Condor.

You can run the Central Manager in an HA configuration, where each
Collector keeps an active copy of the data. For an HA Schedd (submit
node), we rely on a shared file system between the schedd nodes, and
only one node is active at a time.
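
To illustrate the "only one node is active" idea, here is a minimal
Python sketch of a lease file kept on the shared file system. It is not
Condor's actual implementation; the function name, the JSON lease
format, and the 60-second lease are all assumptions made up for this
example.

```python
import json
import os
import tempfile
import time


def try_acquire(lock_path, node, lease_seconds=60.0, now=None):
    """Return True if `node` holds the lease after this call (hypothetical helper)."""
    now = time.time() if now is None else now
    try:
        with open(lock_path) as f:
            lease = json.load(f)
    except (FileNotFoundError, ValueError):
        lease = None  # no lease yet, or an unreadable one

    # Another node holds an unexpired lease: stay passive.
    if lease and lease["owner"] != node and lease["expires"] > now:
        return False

    # Take or renew the lease. Write to a temp file and rename it into
    # place so readers never see a partially written lease.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(lock_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"owner": node, "expires": now + lease_seconds}, f)
    os.replace(tmp, lock_path)
    return True
```

The active node would call this periodically to renew its lease, and a
standby node takes over only once the lease has expired. Note that the
check and the write are not one atomic step, so two nodes racing at
expiry could both believe they are active for a moment. That is the
split-brain problem again, and a real setup needs locking the file
system actually guarantees.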

There are options other than DRBD for distributed file systems; some
have better fail-over characteristics than others. You might want to
weigh the faults you can't tolerate against what it costs to handle
them.

Best,


matt