
Re: [Condor-users] Capabilities of schedd HA



Thanks for the detailed reply, Todd. I implemented your sample configuration in my submit machines' local config files (hope that doesn't make a difference). However, they are both still running separate SCHEDD processes. There is a shared file system between them to which the HA_LOCK_URL and SPOOL macros point. I've been tailing the files created in the shared spool directory, and the schedd appears to be writing its state information there. I suppose my question now is: should each submit machine be running its own SCHEDD process even though it is an HA daemon?

Thanks,
Janzen

Yes. As you guessed, the trick is to set up a shared filesystem between the two machines and tell the condor_schedd to write its state into that shared filesystem.

All schedd state is written into the "spool" subdirectory, as defined by SPOOL in your condor_config. This is the directory that must be placed on a shared filesystem. Let's say this shared directory is "/share/spool". You will also need a scratch/tmp directory on the shared filesystem with world-write permission and the sticky bit set (chmod 1777 on Unix), if you don't already have one. Let's call this directory "/share/tmp".
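As a concrete sketch, assuming the shared filesystem is mounted at /share on both machines (adjust the paths and the daemon-owner account to your site), the directories could be prepared like so:

   mkdir -p /share/spool
   chown condor /share/spool   # owned by whatever account your Condor daemons run as
   mkdir -p /share/tmp
   chmod 1777 /share/tmp       # world-writable with the sticky bit, as described above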

Another important point: with central manager failover, you can define a primary and a backup. Not so with schedd failover - the relationship is not primary/backup, but peers. Condor will ensure that exactly one schedd is running across the two machines, but it doesn't care which one. For instance, if the schedd is running on machine A and machine A dies, a schedd will be started on machine B -- but when machine A comes back, the schedd won't move back to machine A until machine B fails.
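Incidentally, if you want to check which of the two peers currently holds the schedd, querying the collector from any machine in the pool should show it (untested, but condor_status can report schedd ads):

   condor_status -schedd
   # the Machine column of the output shows which host is currently
   # running the single schedd instance named by SCHEDD_NAME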

For a suggested configuration on the two machines, do the following:

   1) Shut down Condor on your two submit machines (condor_off). You need to shut down Condor because we are going to edit DAEMON_LIST, and that setting cannot be changed on the fly via condor_reconfig. :(

   2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.

   3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note that case is important).

   4) Append the following to the end of your condor_config :

# Tell the master to only start the schedd on one machine, the
# machine which obtained a lock in the shared filesystem.
MASTER_HA_LIST = SCHEDD
# Tell the schedd to use our shared subdirectory for its state.
SPOOL = /share/spool
# Tell the master to use our shared subdirectory for its lock file.
HA_LOCK_URL = file:/share/spool
# The lock has a lease; tell the master to refresh the lease every
# three minutes.
HA_POLL_PERIOD = 180
# Tell the master to consider a lock lease older than 10 minutes to be
# stale.
HA_LOCK_HOLD_TIME = 600
#
# When both machines are up, we have no idea which one will be
# running the schedd.
# So to enable client tools like condor_q, condor_submit, etc, to
# work from either submit machine, we need to give the condor_schedd
# instance a unique name and tell the clients to connect to this named
# schedd by default, no matter which machine is currently hosting it,
# instead of the usual default of connecting to the locally running
# schedd.
#
# Give this schedd instance a unique name independent of the machine's
# hostname (which is what would happen by default)
SCHEDD_NAME = myschedd@
# Tell client tools like condor_q to connect to this named schedd
# instance by default
SCHEDD_HOST = $(SCHEDD_NAME)
# Set up filesystem-based authentication to use /share/tmp if
# authentication via /tmp fails, which will happen if we are logged
# into machine A and the one schedd instance is currently running
# on machine B
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /share/tmp

   5) Start up Condor again (condor_on).

Make certain the condor_config changes above are done only on your two submit machines, not on every machine in the pool! These config changes tell the condor_master to start a single instance of the condor_schedd across the two machines.
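For example (an untested sketch; myjob.sub is just a placeholder submit file), once the one schedd instance is up, the client tools should behave the same from either submit machine:

   condor_q                   # talks to the schedd named by SCHEDD_HOST, wherever it is running
   condor_submit myjob.sub    # likewise goes to that same schedd instance
   condor_q -name myschedd@   # explicit form, in case SCHEDD_HOST is not set locally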

In the above, we tell the schedd failover to happen within 10 minutes. I do this because the default job lease on a submitted job is 20 minutes - that means the submit machine can disappear for up to 20 minutes without the execute machine killing the job. Thus, by telling Condor to do the failover within 10 minutes, all jobs, including jobs already in progress, will continue to run and happily complete as if nothing happened. Section 2.15.4 of the v7.2 Condor Manual has more background on job leases:
http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#sec:Job-Lease
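If you ever need to allow failover to take longer than that, the lease can be lengthened per job in the submit description file; a rough example (the 2400 seconds, i.e. 40 minutes, is an arbitrary value):

   # in the job's submit description file, before the queue statement:
   # give the job a 40-minute lease so it tolerates a slower failover
   job_lease_duration = 2400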

The above setup is off the top of my head, i.e. I didn't explicitly test it, but it should be very close if not already good to go. Let us know if it helps you out. If you bless this as a good "recipe", I can update the Condor Manual appropriately and/or add it to the admin recipes on the Condor Wiki at
  http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAdminRecipes

Much, but unfortunately not all, of the above wisdom can be gleaned from section 3.10 of the Version 7.2 Condor Manual.

Hope this helps...

regards,
Todd