Re: [Condor-users] High availability condor_schedd, shared file systems

Steven Timm wrote:

Has anyone yet tried to set up a high availability condor_schedd
as described in section 3.10.1 of the condor manual?
If so, what solution are you using for the shared file system? I would
like to find a solution that doesn't involve NFS, if possible.
Has anyone tried AFS for this purpose?  GFS?

In the cases where I've used it, it was always NFS. In theory it could be any shared filesystem that (a) can handle the load the schedd places on it [e.g. all the writes/syncs to the SPOOL directory], and (b) implements file rename and hard link creation as atomic operations.
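As a rough first check of requirement (b) on a candidate filesystem, something like the following Python sketch could be run against a directory on the shared mount. The function name and file names are made up for illustration. It only confirms that hard links and rename-over-an-existing-file are supported on a POSIX system; it does not prove that they are atomic when two machines race, and it says nothing about requirement (a).

```python
import os
import tempfile


def check_rename_and_link(directory):
    """Report whether hard links and rename-over-existing work in `directory`.

    Hypothetical helper: a support check only, not a proof of atomicity.
    """
    results = {"hard_link": False, "rename": False}
    # Work in a scratch subdirectory so nothing is left behind on the mount.
    with tempfile.TemporaryDirectory(dir=directory) as work:
        src = os.path.join(work, "src")
        link = os.path.join(work, "link")
        dst = os.path.join(work, "dst")

        # Write and sync a file, similar in spirit to queue-log updates.
        with open(src, "w") as f:
            f.write("new")
            f.flush()
            os.fsync(f.fileno())

        # Hard link: both names should refer to the same file.
        try:
            os.link(src, link)
            results["hard_link"] = (
                os.path.samefile(src, link) and os.stat(src).st_nlink == 2
            )
        except OSError:
            pass

        # Rename over an existing file: the target should end up with the
        # new contents and the source name should be gone.
        with open(dst, "w") as f:
            f.write("old")
        try:
            os.rename(src, dst)
            with open(dst) as f:
                results["rename"] = f.read() == "new" and not os.path.exists(src)
        except OSError:
            pass
    return results


if __name__ == "__main__":
    # Replace the argument with the shared directory to be tested.
    print(check_rename_and_link(tempfile.gettempdir()))
```

A filesystem that fails either check is not a candidate; one that passes still needs testing under real schedd load.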

And what is the effect when a backup schedd on a different IP takes
over the job queue?  condor_submit will not automatically fail over
to the new IP, will it?

A condor_submit that is in progress when the fault and failover occur will fail. However, once the failover schedd is running, new condor_submits will work, because condor_submit can look up the schedd's IP address from the collector. Set up the failover schedd to have the same schedd name by setting SCHEDD_NAME in the condor_config file, then use "condor_submit -n" with that name.
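To make the naming step concrete, here is a sketch of what the relevant condor_config lines might look like on both the primary and the backup machine. The name "ha-schedd@" and the path /share/spool are placeholders I made up; the other settings section 3.10.1 calls for are left out, so check the manual for the full list. As far as I recall, the trailing "@" keeps Condor from appending the local host name, so both machines advertise the same name, but verify that against your version.

```
# Same on the primary and the backup submit machine (values are examples only)
SCHEDD_NAME = ha-schedd@
SPOOL       = /share/spool

# Then submit by name; the address is looked up in the collector:
#   condor_submit -n ha-schedd@ myjob.sub
```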

What about all the ClaimId's, do those work OK?

Yes. Note the IP address in the ClaimIds is the condor_startd IP address, not the schedd address.

When the new schedd starts up, it will reconnect to starters of all active jobs, just like what happens when you reboot a submit machine.

Has anyone else tried to use linux-HA to move the schedd IP to the backup machine when the master schedd fails, so that the backup schedd starts
not only with the same job_queue.log but also with the same IP as before?

I have not, but in theory it should not be required, because the only fixed IP/port in Condor is that of the collector. All other tools/daemons can use the collector as a directory service. Think about it... the schedd (by default) starts up on a dynamic port every time. Condor doesn't care if the port changes between restarts, or if both the port and the IP change between restarts.