Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Capabilities of schedd HA

Date: Wed, 15 Jul 2009 18:12:25 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] Capabilities of schedd HA

Janzen Brewer wrote:

I've been playing with schedd HA. I haven't quite gotten the configuration right, but before I put any more time into it, I want to make sure that it can do what I'm hoping it can.
Those who have read/replied to my earlier posts will recall that my Condor setup must have no single point of failure. I'm currently working on schedd. Schedd now runs on the CMs and needs only to take submissions from the active CM. I've tested CM failover while a job was executing. While negotiator did failover, the job was not able to complete until failback. Is there any way around this (e.g. shared file system between CMs)?

Yes. As you guessed, the trick is to set up a shared filesystem between the two machines and tell the condor_schedd to write its state into the shared filesystem.

All schedd state is written into the "spool" subdirectory, as defined by SPOOL in your condor_config. This is the directory that must be placed on a shared filesystem. Let's say this shared subdirectory is "/share/spool". You will also need a scratch/tmp directory on your shared filesystem that has world-write permission (chmod 1777 on Unix); create one if you don't already have it. Let's call this directory "/share/tmp".
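As a rough illustration of the directory layout described above, here is a minimal Python sketch that creates a spool directory and a sticky, world-writable tmp directory under a given root. The function names are hypothetical, the demo uses a temporary directory as a stand-in for /share, and ownership of the spool directory (which your Condor installation will care about) is deliberately left out.

```python
import os
import stat
import tempfile


def make_shared_dirs(root):
    """Create <root>/spool and a sticky, world-writable <root>/tmp."""
    spool = os.path.join(root, "spool")
    tmp = os.path.join(root, "tmp")
    os.makedirs(spool, exist_ok=True)
    os.makedirs(tmp, exist_ok=True)
    # makedirs honours the umask, so set the mode explicitly (chmod 1777).
    os.chmod(tmp, 0o1777)
    return spool, tmp


def is_sticky_world_writable(path):
    """Return True if the permission bits of path are exactly 1777."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o1777


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()  # stand-in for /share in this demo
    spool_dir, tmp_dir = make_shared_dirs(demo_root)
    print(spool_dir, tmp_dir, is_sticky_world_writable(tmp_dir))
```

On the real machines you would pass the mount point of the shared filesystem instead of a temporary directory, and run it once from either host.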

Another important point: With central manager failover, you can define a primary and a backup. Not so with schedd failover: the relationship is not primary/backup, but peers. Condor will ensure that one schedd is running across the two machines, but it doesn't care which one. For instance, if the schedd is running on machine A and machine A dies, a schedd will be started on machine B, but when machine A comes back, the schedd won't restart on machine A until such time that machine B fails.
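To make the "peers, not primary/backup" behaviour concrete, here is a toy Python model of a lease-based lock. It is only a sketch of the idea and not how the condor_master actually implements its lock; the class, the function, and the 180/600 second values (taken from the configuration in this message) are used purely for illustration.

```python
class SharedLock:
    """Toy lease-based lock; NOT Condor's actual implementation."""

    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.holder = None
        self.refreshed_at = None

    def try_acquire(self, machine, now):
        """Refresh our own lease, or take the lock if it is free or stale."""
        stale = self.holder is None or now - self.refreshed_at > self.hold_time
        if self.holder == machine or stale:
            self.holder = machine
            self.refreshed_at = now
            return True
        return False


def lock_history(up_intervals, poll_period=180, hold_time=600, end=3600):
    """Poll every poll_period seconds; return [(time, holder)] on each change."""
    lock = SharedLock(hold_time)
    history = []
    for now in range(0, end + 1, poll_period):
        for machine, intervals in up_intervals.items():
            if any(start <= now < stop for start, stop in intervals):
                lock.try_acquire(machine, now)
        if not history or history[-1][1] != lock.holder:
            history.append((now, lock.holder))
    return history


if __name__ == "__main__":
    # Machine A dies at t=900 and comes back at t=2400; B stays up throughout.
    schedule = {"A": [(0, 900), (2400, 3601)], "B": [(0, 3601)]}
    print(lock_history(schedule))
```

In this model the lock moves to B once A's lease goes stale, and it stays with B after A returns, because B keeps refreshing the lease.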


For a suggested configuration on the two machines, do the following:

1) Shut down Condor on both machines (condor_off). You need to shut down Condor because we are going to edit DAEMON_LIST, and that setting cannot be changed on the fly via condor_reconfig. :(


  2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.

3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note that case is important).


  4) Append the following to the end of your condor_config :

# Tell the master to only start the schedd on one machine, the
# machine which obtained a lock in the shared filesystem.
MASTER_HA_LIST = SCHEDD
# Tell the schedd to use our shared subdirectory for its state.
SPOOL = /share/spool
# Tell the master to use our shared subdirectory for its lock file.
HA_LOCK_URL = file:/share/spool
# The lock has a lease; tell the master to refresh the lease every
# three minutes.
HA_POLL_PERIOD = 180
# Tell the master to consider a lock lease older than 10 minutes to be
# stale.
HA_LOCK_HOLD_TIME = 600
#
# When both machines are up, we have no idea which one will be
# running the schedd.
# So to enable client tools like condor_q, condor_submit, etc, to
# work from either submit machine, we need to give the condor_schedd
# instance a unique name and tell the clients to connect to this named
# schedd by default, no matter which machine is currently hosting it,
# instead of the usual default of connecting to the locally running
# schedd.
#
# Give this schedd instance a unique name independent of the machine's
# hostname (which is what would happen by default)
SCHEDD_NAME = myschedd@
# Tell client tools like condor_q to connect to this named schedd
# instance by default
SCHEDD_HOST = $(SCHEDD_NAME)
# Set up filesystem-based authentication to use /share/tmp if
# authentication via /tmp fails, which will happen if we are logged
# into machine A and the one schedd instance is currently running
# on machine B
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /share/tmp

  5) Start up Condor again (condor_on).

Make certain the condor_config changes above are only made on your two submit machines, not every machine in the pool! The config changes above tell the condor_master to start one instance of the condor_schedd on the two machines.
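Before restarting, it may help to sanity-check the edited settings on each submit machine. The sketch below is a hypothetical Python helper, not a Condor tool: it uses a naive "KEY = value" parser (no macro expansion, no include files), so treat it only as a rough checklist for steps 2 through 4. The rule that HA_LOCK_HOLD_TIME should exceed HA_POLL_PERIOD is my own assumption about what makes the lease workable.

```python
def parse_config(text):
    """Naive parser: last 'KEY = value' wins; comments and blanks are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().upper()] = value.strip()
    return settings


def tokens(value):
    """Split a comma- or space-separated list into its items."""
    return value.replace(",", " ").split()


def check_schedd_ha(settings):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if "SCHEDD" in [d.upper() for d in tokens(settings.get("DAEMON_LIST", ""))]:
        problems.append("SCHEDD is still listed in DAEMON_LIST")
    if "SCHEDD" not in [d.upper() for d in tokens(settings.get("MASTER_HA_LIST", ""))]:
        problems.append("SCHEDD is missing from MASTER_HA_LIST")
    # Case matters here, so compare without changing case.
    if "SCHEDD.lock" not in tokens(settings.get("VALID_SPOOL_FILES", "")):
        problems.append("SCHEDD.lock is missing from VALID_SPOOL_FILES")
    if "SPOOL" not in settings:
        problems.append("SPOOL is not set")
    poll = settings.get("HA_POLL_PERIOD")
    hold = settings.get("HA_LOCK_HOLD_TIME")
    if poll and hold and int(hold) <= int(poll):
        # Assumption: the lease must outlive the interval between refreshes.
        problems.append("HA_LOCK_HOLD_TIME should exceed HA_POLL_PERIOD")
    return problems


if __name__ == "__main__":
    sample = """
    DAEMON_LIST = MASTER
    VALID_SPOOL_FILES = SCHEDD.lock
    MASTER_HA_LIST = SCHEDD
    SPOOL = /share/spool
    HA_POLL_PERIOD = 180
    HA_LOCK_HOLD_TIME = 600
    """
    print(check_schedd_ha(parse_config(sample)))
```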

In the above, we tell the schedd failover to happen within 10 minutes. I do this because the default job lease on a submitted job is 20 minutes; that means the submit machine can disappear for up to 20 minutes without the execute machine killing the job. Thus, by telling Condor to do the failover within 10 minutes, all jobs, including jobs already in progress, will continue to run and happily complete as if nothing happened. Section 2.15.4 of the v7.2 Condor Manual has more background on job leases:

http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#sec:Job-Lease
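The timing argument above can be checked with a little arithmetic. The Python sketch below assumes, as a simplification of my own, that the standby master notices a stale lock at most one poll period after the lease expires and that starting the schedd takes negligible time; the function names are hypothetical.

```python
def worst_case_takeover(poll_period, hold_time):
    """Rough upper bound (seconds) from the last lease refresh to takeover.

    Assumption: the lease goes stale hold_time seconds after the last
    refresh, and the standby checks the lock once every poll_period seconds.
    """
    return hold_time + poll_period


def fits_within_job_lease(poll_period, hold_time, job_lease):
    """True if the estimated takeover completes before the job lease expires."""
    return worst_case_takeover(poll_period, hold_time) < job_lease


if __name__ == "__main__":
    # Values from this message: 3 minute poll, 10 minute hold, 20 minute job lease.
    print(worst_case_takeover(180, 600))
    print(fits_within_job_lease(180, 600, 20 * 60))
```

Under these assumptions the settings in this message leave a comfortable margin before the 20 minute job lease runs out; raising HA_LOCK_HOLD_TIME toward the job lease would eat into it.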

The above setup is off the top of my head, i.e. I didn't explicitly test it, but should be real close if not already good to go. Let us know if it helps you out. If you bless this as a good "recipe", I can update the Condor Manual appropriately and/or add this recipe to Condor admin recipes on the Condor Wiki at

 http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAdminRecipes

Much, but unfortunately not all, of the above wisdom can be gleaned from section 3.10 of the Version 7.2 Condor Manual.


Hope this helps...

regards,
Todd

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences

