
Re: [Condor-users] Capabilities of schedd HA



Thanks for the detailed reply, Todd. I implemented your sample configuration in my submit machines' local config files (hope that doesn't make a difference). However, they are both still running separate SCHEDD processes. There is a shared file system between them to which the HA_LOCK_URL and SPOOL macros point. I've been tailing the files created in the shared spool directory, and the schedd appears to be writing its state information there. I suppose my question now is: should each submit machine be running its own SCHEDD process even though it is an HA daemon?

Thanks,
Janzen

Yes. As you guessed, the trick is to set up a shared filesystem between the two machines and tell the condor_schedd to write its state into that shared filesystem.

All schedd state is written into the "spool" subdirectory, as defined by SPOOL in your condor_config. This is the directory that must be placed on a shared filesystem. Let's say this shared directory is "/share/spool". You will also need a scratch/tmp directory on the shared filesystem with world-write permission and the sticky bit set (chmod 1777 on Unix), if you don't already have one. Let's call this directory "/share/tmp".
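As a concrete sketch, assuming the shared filesystem is mounted at /share on both machines (adjust the paths and the daemon-owner account to your site), the directories could be prepared like so:

   mkdir -p /share/spool
   chown condor /share/spool   # owned by whatever account your Condor daemons run as
   mkdir -p /share/tmp
   chmod 1777 /share/tmp       # world-writable with the sticky bit, as described above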

Another important point: with central manager failover, you can define a primary and a backup. Not so with schedd failover - the relationship is not primary/backup, but peers. Condor will ensure that exactly one schedd is running across the two machines, but it doesn't care which one. For instance, if the schedd is running on machine A and machine A dies, a schedd will be started on machine B -- but when machine A comes back, the schedd won't move back to machine A until machine B fails.
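Incidentally, if you want to check which of the two peers currently holds the schedd, querying the collector from any machine in the pool should show it (untested, but condor_status can report schedd ads):

   condor_status -schedd
   # the Machine column of the output shows which host is currently
   # running the single schedd instance named by SCHEDD_NAME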

For a suggested configuration on the two machines, do the following:

   1) Shut down Condor on your two submit machines (condor_off). You need to shut down Condor because we are going to edit DAEMON_LIST, and that setting cannot be changed on the fly via condor_reconfig. :(

   2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.

   3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note that case is important).

   4) Append the following to the end of your condor_config :

# Tell the master to only start the schedd on one machine, the
# machine which obtained a lock in the shared filesystem.
MASTER_HA_LIST = SCHEDD
# Tell the schedd to use our shared subdirectory for its state.
SPOOL = /share/spool
# Tell the master to use our shared subdirectory for its lock file.
HA_LOCK_URL = file:/share/spool
# The lock has a lease; tell the master to refresh the lease every
# three minutes.
HA_POLL_PERIOD = 180
# Tell the master to consider a lock lease older than 10 minutes to be
# stale.
HA_LOCK_HOLD_TIME = 600
#
# When both machines are up, we have no idea which one will be
# running the schedd.
# So to enable client tools like condor_q, condor_submit, etc, to
# work from either submit machine, we need to give the condor_schedd
# instance a unique name and tell the clients to connect to this named
# schedd by default, no matter which machine is currently hosting it,
# instead of the usual default of connecting to the locally running
# schedd.
#
# Give this schedd instance a unique name independent of the machine's
# hostname (which is what would happen by default)
SCHEDD_NAME = myschedd@
# Tell client tools like condor_q to connect to this named schedd
# instance by default
SCHEDD_HOST = $(SCHEDD_NAME)
# Set up filesystem-based authentication to use /share/tmp if
# authentication via /tmp fails, which will happen if we are logged
# into machine A and the one schedd instance is currently running
# on machine B
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /share/tmp

   5) Start up Condor again (condor_on).

Make certain the condor_config changes above are done only on your two submit machines, not on every machine in the pool! These config changes tell the condor_master to start a single instance of the condor_schedd across the two machines.
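For example (an untested sketch; myjob.sub is just a placeholder submit file), once the one schedd instance is up, the client tools should behave the same from either submit machine:

   condor_q                   # talks to the schedd named by SCHEDD_HOST, wherever it is running
   condor_submit myjob.sub    # likewise goes to that same schedd instance
   condor_q -name myschedd@   # explicit form, in case SCHEDD_HOST is not set locally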

In the above, we tell the schedd failover to happen within 10 minutes. I do this because the default job lease on a submitted job is 20 minutes - that means the submit machine can disappear for up to 20 minutes without the execute machine killing the job. Thus, by telling Condor to do the failover within 10 minutes, all jobs, including jobs already in progress, will continue to run and happily complete as if nothing happened. Section 2.15.4 of the v7.2 Condor Manual has more background on job leases:
http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#sec:Job-Lease
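If you ever need to allow failover to take longer than that, the lease can be lengthened per job in the submit description file; a rough example (the 2400 seconds, i.e. 40 minutes, is an arbitrary value):

   # in the job's submit description file, before the queue statement:
   # give the job a 40-minute lease so it tolerates a slower failover
   job_lease_duration = 2400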

The above setup is off the top of my head, i.e. I didn't explicitly test it, but it should be very close if not already good to go. Let us know if it helps you out. If you bless this as a good "recipe", I can update the Condor Manual appropriately and/or add it to the admin recipes on the Condor Wiki at
  http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAdminRecipes

Much, but unfortunately not all, of the above wisdom can be gleaned from section 3.10 of the Version 7.2 Condor Manual.

Hope this helps...

regards,
Todd