Re: [Condor-users] Capabilities of schedd HA
- Date: Wed, 15 Jul 2009 18:12:25 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Capabilities of schedd HA
Janzen Brewer wrote:
> I've been playing with schedd HA. I haven't quite gotten the
> configuration right, but before I put any more time into it, I want to
> make sure that it can do what I'm hoping it can.
>
> Those who have read/replied to my earlier posts will recall that my
> Condor setup must have no single point of failure. I'm currently
> working on schedd. Schedd now runs on the CMs and only needs to take
> submissions from the active CM. I've tested CM failover while a job
> was executing. While the negotiator did fail over, the job was not
> able to complete until the original machine came back up. Is there any
> way around this (e.g. a shared file system between the two machines)?
Yes. As you guessed, the trick is to set up a shared filesystem between
the two machines and tell the condor_schedd to write its state into
that shared filesystem.
All schedd state is written into the "spool" subdirectory, as defined by
SPOOL in your condor_config. This is the directory that must be placed
on a shared filesystem. Let's say this shared subdirectory is
"/share/spool". You will also need a scratch/tmp directory with
world-write permission (chmod 1777 on Unix) on your shared filesystem,
if you don't already have one. Let's call this directory "/share/tmp".
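For example, assuming the shared filesystem is already mounted at /share
on both machines and that Condor runs as user "condor" (adjust both to
match your site), the directories could be created like so:

# run once, from either machine
mkdir -p /share/spool /share/tmp
chmod 1777 /share/tmp              # world-writable scratch directory
chown condor:condor /share/spool   # spool must be writable by Condor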
Another important point: With central manager failover, you can define a
primary and a backup. Not so with schedd failover - the relationship
is not primary/backup, but peers. Condor will ensure that one schedd is
running across the two machines, but it doesn't care which one. For
instance, if the schedd is running on machine A and machine A dies, a
schedd will be started on machine B -- but when machine A comes back,
the schedd won't move back to machine A until such time as machine B
goes down.
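If you ever want to know which machine is currently hosting the schedd,
one way to check (a sketch - adjust to taste) is to ask the collector
where the schedd ad is coming from:

condor_status -schedd -format "%s\t" Name -format "%s\n" Machine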
For a suggested configuration on the two machines, do the following:
1) Shut down Condor on your two submit machines (condor_off). You need
to shut down Condor because we are going to edit DAEMON_LIST, and that
setting cannot be changed on the fly via condor_reconfig. :(.
2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.
3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note the
case-sensitive spelling). Example edits for steps 2 and 3 are shown
after step 5 below.
4) Append the following to the end of your condor_config:
# Tell the master to only start the schedd on one machine, the
# machine which obtained a lock in the shared filesystem.
MASTER_HA_LIST = SCHEDD
# Tell the schedd to use our shared subdirectory for its state.
SPOOL = /share/spool
# Tell the master to use our shared subdirectory for its lock file.
HA_LOCK_URL = file:/share/spool
# The lock has a lease; tell the master to refresh the lease every
# three minutes.
HA_POLL_PERIOD = 180
# Tell the master to consider a lock lease older than 10 minutes to be
# stale, so that the surviving machine can grab the lock.
HA_LOCK_HOLD_TIME = 600
# When both machines are up, we have no idea which one will be
# running the schedd.
# So to enable client tools like condor_q, condor_submit, etc, to
# work from either submit machine, we need to give the condor_schedd
# instance a unique name and tell the clients to connect to this named
# schedd by default, no matter which machine is currently hosting it,
# instead of the usual default of connecting to the locally running
# schedd.
# Give this schedd instance a unique name independent of the machine's
# hostname (which is what would happen by default)
SCHEDD_NAME = myschedd@
# Tell client tools like condor_q to connect to this named schedd
# instance by default
SCHEDD_HOST = $(SCHEDD_NAME)
# Setup filesystem-based authentication to use /share/tmp if
# authentication via /tmp fails, which will happen if we are logged
# into machine A and the one schedd instance is currently running
# on machine B
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /share/tmp
5) Start up Condor again (condor_on).
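As promised in steps 2 and 3, here is roughly what those edits might
look like. Your existing DAEMON_LIST may name a different set of
daemons, so treat these lines as a sketch, not gospel:

# Step 2: e.g. if the old list was MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
# Step 3: append the HA lock file to the files permitted in SPOOL
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES), SCHEDD.lock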
Make certain the condor_config changes above are only done on your two
submit machines, not on every machine in the pool! The config changes
above tell the condor_master to start one instance of the condor_schedd
across the two machines.
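Once both masters are back up, you can sanity-check that one of them
grabbed the lock and that the client tools work from both sides, e.g.:

ls -l /share/spool/SCHEDD.lock   # the lock file the two masters contend for
condor_q                         # should answer from either machine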
In the above, we configure schedd failover to happen within 10 minutes.
I do this because the default job lease on a submitted job is 20 minutes
- that means the submit machine can disappear for up to 20 minutes w/o
the execute machine killing the job. Thus, by telling Condor to do the
failover within 10 minutes, all jobs including jobs already in progress
will continue to run and happily complete as if nothing happened.
Section 2.15.4 of the v7.2 Condor Manual has more background on job leases.
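One related knob worth knowing: the lease can be set per-job in the
submit description file via job_lease_duration (in seconds). If you
change it, keep HA_LOCK_HOLD_TIME comfortably below it. For example, to
state the 20 minute default explicitly:

job_lease_duration = 1200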
The above setup is off the top of my head, i.e. I didn't explicitly test
it, but it should be real close if not already good to go. Let us know
if it helps you out. If you bless this as a good "recipe", I can update
the Condor Manual appropriately and/or add this recipe to the admin
recipes on the Condor Wiki.
Much, but unfortunately not all, of the above wisdom can be gleaned from
section 3.10 of the Version 7.2 Condor Manual.
Hope this helps...
Todd Tannenbaum
Condor Project Research
Department of Computer Sciences, University of Wisconsin-Madison