Re: [Condor-users] Capabilities of schedd HA
- Date: Thu, 16 Jul 2009 10:33:43 -0500
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Capabilities of schedd HA
The condor_masters on both machines actually collaborate to make sure
only one condor_schedd is running. If you don't run the condor_schedd
from the condor_master all bets are off.
Janzen Brewer wrote:
> Thanks for the detailed reply, Todd. I implemented your sample
> configuration in my submit machines' local config files (hope that
> doesn't make a difference). However, they are both still running
> separate SCHEDD processes. There is a shared file system between them to
> which the HA_LOCK_URL and SPOOL macros point. I've been keeping tails on
> the files created in the shared spool directory and SCHEDD appears to be
> writing its state information there. I suppose my question now is: should
> there be a SCHEDD process on each submit machine even though it is an
> HA daemon?
>> Yes. As you guessed, the trick is to setup a shared filesystem between
>> the two machines and tell the condor_schedd to write its state into the
>> shared filesystem.
>> All schedd state is written into the "spool" subdirectory, as defined by
>> SPOOL in your condor_config. This is the directory that must be placed
>> on a shared filesystem. Let's say this shared subdirectory is
>> "/share/spool". You will also need to create a scratch/tmp directory on
>> your shared filesystem, if you don't already have one, that has
>> world-write permission (chmod 1777 on Unix). Let's call this directory
>> "/share/tmp".
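>> The directory setup described above can be sketched as a few shell
>> commands (illustrative only; "./share-example" is a stand-in for your
>> real shared mount point such as /share):

```shell
# Stand-in for the shared filesystem mount point; at a real site this
# would be the shared mount, e.g. /share.
SHARE=./share-example

# Schedd state lives here; SPOOL and HA_LOCK_URL will point at it.
mkdir -p "$SHARE/spool"

# Scratch directory for FS_REMOTE authentication; it must be
# world-writable with the sticky bit set, like /tmp.
mkdir -p "$SHARE/tmp"
chmod 1777 "$SHARE/tmp"

ls -ld "$SHARE/spool" "$SHARE/tmp"
```

>> Since the filesystem is shared, running these once from either machine
>> is enough.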
>> Another important point: With central manager failover, you can define a
>> primary and a backup. Not so with schedd failover - the relationship
>> is not primary/backup, but peers. Condor will ensure that one schedd is
>> running across the two machines, but it doesn't care which one. For
>> instance, if the schedd is running on machine A and machine A dies, a
>> schedd will be started on machine B -- but when machine A comes back,
>> the schedd won't restart on machine A until such time that machine B's
>> schedd stops running.
>> For a suggested configuration on the two machines, do the following:
>> 1) Shut down Condor on your two submit machines (condor_off). You need
>> to shut down Condor because we are going to edit DAEMON_LIST, and that
>> setting cannot be changed on the fly via condor_reconfig. :(.
>> 2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.
>> 3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note case
>> is important).
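>> Concretely, steps 2 and 3 might look like this in condor_config (a
>> sketch; your existing DAEMON_LIST may name additional daemons, which
>> should be kept):

```
## SCHEDD is removed from DAEMON_LIST; the master will instead manage it
## through MASTER_HA_LIST.
DAEMON_LIST = MASTER
## Keep the schedd's HA lock file from being cleaned out of spool.
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES), SCHEDD.lock
```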
>> 4) Append the following to the end of your condor_config :
>> # Tell the master to only start the schedd on one machine, the
>> # machine which obtained a lock in the shared filesystem.
>> MASTER_HA_LIST = SCHEDD
>> # Tell the schedd to use our shared subdirectory for its state.
>> SPOOL = /share/spool
>> # Tell the master to use our shared subdirectory for its lock file.
>> HA_LOCK_URL = file:/share/spool
>> # The lock has a lease; tell the master to refresh the lease every
>> # three minutes.
>> HA_POLL_PERIOD = 180
>> # Tell the master to consider a lock lease older than 10 minutes to be
>> # stale.
>> HA_LOCK_HOLD_TIME = 600
>> # When both machines are up, we have no idea which one will be
>> # running the schedd.
>> # So to enable client tools like condor_q, condor_submit, etc, to
>> # work from either submit machine, we need to give the condor_schedd
>> # instance a unique name and tell the clients to connect to this named
>> # schedd by default, no matter which machine is currently hosting it,
>> # instead of the usual default of connecting to the locally running
>> # schedd.
>> # Give this schedd instance a unique name independent of the machine's
>> # hostname (which is what would happen by default)
>> SCHEDD_NAME = myschedd@
>> # Tell client tools like condor_q to connect to this named schedd
>> # instance by default
>> SCHEDD_HOST = $(SCHEDD_NAME)
>> # Setup filesystem-based authentication to use /share/tmp if
>> # authentication via /tmp fails, which will happen if we are logged
>> into machine A and the one schedd instance is currently running
>> # on machine B
>> SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
>> FS_REMOTE_DIR = /share/tmp
>> 5) Startup Condor again (condor_on).
>> Make certain the condor_config changes above are only made on your two
>> submit machines, not on every machine in the pool! The config changes
>> above tell the condor_master to start one instance of the condor_schedd
>> across the two machines.
>> In the above, we tell the schedd failover to happen within 10 minutes.
>> I do this because the default job lease on a submitted job is 20 minutes
>> - that means the submit machine can disappear for up to 20 minutes w/o
>> the execute machine killing the job. Thus, by telling Condor to do the
>> failover within 10 minutes, all jobs including jobs already in progress
>> will continue to run and happily complete as if nothing happened.
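>> The arithmetic above can be made explicit in a submit description (a
>> sketch; "my_job" is a placeholder executable, and 1200 seconds is just
>> the stated 20-minute default written out):

```
# The job lease must comfortably exceed HA_LOCK_HOLD_TIME (600s above),
# so a failover completes before any running job's lease expires.
universe           = vanilla
executable         = my_job
job_lease_duration = 1200
queue
```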
>> Section 2.15.4 of the v7.2 Condor Manual has more background on job leases:
>> The above setup is off the top of my head, i.e. I didn't explicitly test
>> it, but should be real close if not already good to go. Let us know if
>> it helps you out. If you bless this as a good "recipe", I can update
>> the Condor Manual appropriately and/or add this recipe to Condor admin
>> recipes on the Condor Wiki at
>> Much, but unfortunately not all, of the above wisdom can be gleaned
>> from section 3.10 of the Version 7.2 Condor Manual.
>> Hope this helps...