
Re: [Condor-users] Capabilities of schedd HA



Janzen Brewer wrote:
I've been playing with schedd HA. I haven't quite gotten the configuration right, but before I put any more time into it, I want to make sure that it can do what I'm hoping it can.

Those who have read/replied to my earlier posts will recall that my Condor setup must have no single point of failure. I'm currently working on schedd. Schedd now runs on the CMs and needs only to take submissions from the active CM. I've tested CM failover while a job was executing. While negotiator did failover, the job was not able to complete until failback. Is there any way around this (e.g. shared file system between CMs)?

Yes. As you guessed, the trick is to set up a shared filesystem between the two machines and tell the condor_schedd to write its state into the shared filesystem.

All schedd state is written into the "spool" subdirectory, as defined by SPOOL in your condor_config. This is the directory that must be placed on a shared filesystem. Let's say this shared subdirectory is "/share/spool". You will also need a scratch/tmp directory on your shared filesystem that has world-write permission (chmod 1777 on Unix); create one if you don't already have it. Let's call this directory "/share/tmp".
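
For illustration, the two shared directories could be created along these lines. This is a minimal sketch in Python, not something shipped with Condor: the function names are made up here, the default root is just the example path from this message, and ownership of the spool directory is deliberately left alone.

```python
import os
import stat


def prepare_shared_dirs(root="/share"):
    """Create <root>/spool and a world-writable, sticky <root>/tmp.

    `root` defaults to the example path used in this message.
    """
    spool = os.path.join(root, "spool")
    tmp = os.path.join(root, "tmp")
    os.makedirs(spool, exist_ok=True)
    os.makedirs(tmp, exist_ok=True)
    # Explicit chmod: the mode a new directory gets is filtered by the umask.
    os.chmod(tmp, 0o1777)
    return spool, tmp


def is_world_writable_sticky(path):
    """True if `path` has exactly mode 1777."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o1777
```

Calling is_world_writable_sticky("/share/tmp") on both machines is a cheap way to confirm that each of them sees the same directory with the expected mode.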

Another important point: With central manager failover, you can define a primary and a backup. Not so with schedd failover - the relationship is not primary/backup, but peers. Condor will ensure that one schedd is running across the two machines, but it doesn't care which one. For instance if the schedd is running on machine A and machine A dies, a schedd will be started on machine B -- but when machine A comes back, the schedd won't restart on machine A until such time that machine B fails.
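
The peer relationship can be pictured with a toy model of a leased lock. This is only a sketch of the behaviour described above, not the condor_master's actual locking code; the class, the function, and the poll times are invented for the illustration, and the 600-second hold time mirrors the HA_LOCK_HOLD_TIME value used in the configuration in this message.

```python
from dataclasses import dataclass
from typing import Optional

HOLD_TIME = 600  # seconds; mirrors HA_LOCK_HOLD_TIME in this message


@dataclass
class LeaseLock:
    """Toy stand-in for the lock file in the shared spool directory."""
    holder: Optional[str] = None
    refreshed_at: float = 0.0

    def poll(self, machine: str, now: float) -> bool:
        """One poll by `machine`; True means it should run the schedd."""
        stale = self.holder is None or (now - self.refreshed_at) > HOLD_TIME
        if self.holder == machine or stale:
            # Take a free or stale lock, or refresh our own lease.
            self.holder = machine
            self.refreshed_at = now
            return True
        return False  # a peer holds a fresh lease, so stay idle


def demo():
    lock = LeaseLock()
    events = [("A", 0)]                                  # A takes the lock
    events += [("B", t) for t in (180, 360, 540, 720)]   # A has died; B polls
    events += [("A", 900), ("B", 900)]                   # A is back; both poll
    return [(m, t, lock.poll(m, t)) for m, t in events]
```

In this model B takes over at its first poll after A's lease has gone stale, and A's poll after it returns comes back False because B's lease is still fresh. There is no failback step anywhere in the logic.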

For a suggested configuration on the two machines, do the following:

1) Shut down Condor on your two machines (condor_off). You need to shut down Condor because we are going to edit DAEMON_LIST, and that setting cannot be changed on the fly via condor_reconfig. :(

  2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.

3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note case is important).

  4) Append the following to the end of your condor_config :

# Tell the master to only start the schedd on one machine, the
# machine which obtained a lock in the shared filesystem.
MASTER_HA_LIST = SCHEDD
# Tell the schedd to use our shared subdirectory for its state.
SPOOL = /share/spool
# Tell the master to use our shared subdirectory for its lock file.
HA_LOCK_URL = file:/share/spool
# The lock has a lease; tell the master to refresh the lease every
# three minutes.
HA_POLL_PERIOD = 180
# Tell the master to consider a lock lease older than 10 minutes to be
# stale.
HA_LOCK_HOLD_TIME = 600
#
# When both machines are up, we have no idea which one will be
# running the schedd.
# So to enable client tools like condor_q, condor_submit, etc, to
# work from either submit machine, we need to give the condor_schedd
# instance a unique name and tell the clients to connect to this named
# schedd by default, no matter which machine is currently hosting it,
# instead of the usual default of connecting to the locally running
# schedd.
#
# Give this schedd instance a unique name independent of the machine's
# hostname (which is what would happen by default)
SCHEDD_NAME = myschedd@
# Tell client tools like condor_q to connect to this named schedd
# instance by default
SCHEDD_HOST = $(SCHEDD_NAME)
# Setup filesystem-based authentication to use /share/tmp if
# authentication via /tmp fails, which will happen if we are logged
# into machine A and the one schedd instance is currently running
# on machine B
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /share/tmp

  5) Start up Condor again (condor_on).

Make certain the condor_config changes above are only done on your two submit machines, not on every machine in the pool! The config changes above tell the condor_master to start one instance of the condor_schedd across the two machines.

In the above, we tell the schedd failover to happen within 10 minutes. I do this because the default job lease on a submitted job is 20 minutes - that means the submit machine can disappear for up to 20 minutes w/o the execute machine killing the job. Thus, by telling Condor to do the failover within 10 minutes, all jobs including jobs already in progress will continue to run and happily complete as if nothing happened. Section 2.15.4 of the v7.2 Condor Manual has more background on job leases:
http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#sec:Job-Lease
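
If you pick different values for these settings, it is worth comparing the takeover time with the job lease. The sketch below is hedged: it assumes (an assumption of the sketch, not something taken from the manual) that the lock was refreshed just before the crash and that the surviving master only notices the stale lock at its next poll, so the bound it uses is HA_LOCK_HOLD_TIME + HA_POLL_PERIOD. It also ignores the time the schedd needs to start. The function names are made up for the example.

```python
def worst_case_failover(hold_time, poll_period):
    """Upper bound in seconds on takeover time, under the stated assumption."""
    return hold_time + poll_period


def lease_margin(job_lease, hold_time, poll_period):
    """Seconds left on the job lease at takeover (negative means too late)."""
    return job_lease - worst_case_failover(hold_time, poll_period)


if __name__ == "__main__":
    # Values from this message: 20 minute job lease, 600 s hold, 180 s poll.
    print(lease_margin(job_lease=20 * 60, hold_time=600, poll_period=180))
```

With the values from this message the margin is positive (1200 - 780 = 420 seconds), so running jobs should still hold a valid lease when the peer schedd comes up.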

The above setup is off the top of my head, i.e. I didn't explicitly test it, but should be real close if not already good to go. Let us know if it helps you out. If you bless this as a good "recipe", I can update the Condor Manual appropriately and/or add this recipe to Condor admin recipes on the Condor Wiki at
 http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAdminRecipes

Much, but unfortunately not all, of the above wisdom can be gleaned from section 3.10 of the Version 7.2 Condor Manual.

Hope this helps...

regards,
Todd

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences