
Re: [Condor-users] Capabilities of schedd HA

The condor_masters on both machines actually collaborate to make sure
only one condor_schedd is running. If you don't run the condor_schedd
from the condor_master all bets are off.



Janzen Brewer wrote:
> Thanks for the detailed reply, Todd. I implemented your sample 
> configuration in my submit machines' local config files (hope that 
> doesn't make a difference). However, they are both still running 
> separate SCHEDD processes. There is a shared file system between them to 
> which the HA_LOCK_URL and SPOOL macros point. I've been keeping tails on 
> the files created in the shared spool directory and SCHEDD appears to be 
> writing its state information there. I suppose my question now is should 
> there be processes on each submit machine for SCHEDD even though it is a 
> HA daemon?
> Thanks,
> Janzen
>> Yes.  As you guessed, the trick is to set up a shared filesystem between 
>> the two machines and tell the condor_schedd to write its state into the 
>> shared filesystem.
>> All schedd state is written into the "spool" subdirectory, as defined by 
>> SPOOL in your condor_config.  This is the directory that must be placed 
>> on a shared filesystem.  Let's say this shared subdirectory is 
>> "/share/spool".  You will also need to create a scratch/tmp directory on 
>> your shared filesystem if you don't already have one that has 
>> world-write permission (chmod 1777 on Unix).  Let's call this directory 
>> "/share/tmp".
>> Another important point: With central manager failover, you can define a 
>> primary and a backup.  Not so with schedd failover - the relationship 
>> is not primary/backup, but peers.  Condor will ensure that one schedd is 
>> running across the two machines, but it doesn't care which one. For 
>> instance if the schedd is running on machine A and machine A dies, a 
>> schedd will be started on machine B -- but when machine A comes back, 
>> the schedd won't restart on machine A until such time that machine B 
>> fails.
>> For a suggested configuration on the two machines, do the following:
>>    1) Shut down Condor on your two submit machines (condor_off).  You 
>> need to shut down Condor because we are going to edit DAEMON_LIST, and 
>> that setting cannot be changed on the fly via condor_reconfig.  :(
>>    2) Edit the DAEMON_LIST macro in condor_config and remove SCHEDD.
>>    3) Edit the VALID_SPOOL_FILES macro to include SCHEDD.lock (note case 
>> is important).
>>    4) Append the following to the end of your condor_config :
>> # Tell the master to only start the schedd on one machine, the
>> # machine which obtained a lock in the shared filesystem.
>> # Tell the schedd to use our shared subdirectory for its state.
>> SPOOL = /share/spool
>> # Tell the master to use our shared subdirectory for its lock file.
>> HA_LOCK_URL = file:/share/spool
>> # The lock has a lease; tell the master to refresh the lease every
>> # three minutes.
>> # Tell the master to consider a lock lease older than 10 minutes to be
>> # stale.
>> #
>> # When both machines are up, we have no idea which one will be
>> # running the schedd.
>> # So to enable client tools like condor_q, condor_submit, etc, to
>> # work from either submit machine, we need to give the condor_schedd
>> # instance a unique name and tell the clients to connect to this named
>> # schedd by default, no matter which machine is currently hosting it,
>> # instead of the usual default of connecting to the locally running
>> # schedd.
>> #
>> # Give this schedd instance a unique name independent of the machine's
>> # hostname (which is what would happen by default)
>> SCHEDD_NAME = myschedd@
>> # Tell client tools like condor_q to connect to this named schedd
>> # instance by default
>> # Set up filesystem-based authentication to use /share/tmp if
>> # authentication via /tmp fails, which will happen if we are logged
>> # into machine A and the one schedd instance is currently running
>> # on machine B
>> FS_REMOTE_DIR = /share/tmp
>>    5) Startup Condor again (condor_on).
>> Make certain the condor_config changes above are only done on your two 
>> submit machines, not every machine in the pool!  The config changes 
>> above tell the condor_master to start one instance of the condor_schedd 
>> across the two machines.
>> In the above, we tell the schedd failover to happen within 10 minutes. 
>> I do this because the default job lease on a submitted job is 20 minutes 
>> - that means the submit machine can disappear for up to 20 minutes w/o 
>> the execute machine killing the job.  Thus, by telling Condor to do the 
>> failover within 10 minutes, all jobs including jobs already in progress 
>> will continue to run and happily complete as if nothing happened. 
>> Section 2.15.4 of the v7.2 Condor Manual has more background on job leases:
>> http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#sec:Job-Lease
>> The above setup is off the top of my head, i.e. I didn't explicitly test 
>> it, but should be real close if not already good to go.  Let us know if 
>> it helps you out.  If you bless this as a good "recipe", I can update 
>> the Condor Manual appropriately and/or add this recipe to Condor admin 
>> recipes on the Condor Wiki at
>>   http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAdminRecipes
>> Much, but unfortunately not all, of the above wisdom can be gleaned from 
>> section 3.10 of the Version 7.2 Condor Manual.
>> Hope this helps...
>> regards,
>> Todd
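
Several comments in the quoted configuration describe settings (which
daemon the master manages as HA, the lease refresh interval, the
stale-lease threshold, the default schedd for client tools) with no
corresponding setting line in the text as archived here. A minimal sketch
of what those lines might look like follows. The macro names
MASTER_HA_LIST, HA_POLL_PERIOD, HA_LOCK_HOLD_TIME and SCHEDD_HOST are
assumptions taken from the high-availability and configuration sections
of the Condor Manual rather than from this message, and the values are
simply the intervals stated in the comments converted to seconds. Check
them against the manual for your Condor version before relying on them.

```
# ASSUMED lines: not present in the quoted message as archived.

# Have the condor_master run the schedd as an HA daemon, so that only
# the machine holding the lock starts it. SCHEDD should not also be
# listed in DAEMON_LIST (see step 2 of the recipe).
MASTER_HA_LIST = SCHEDD

# Refresh the lock lease every three minutes (seconds).
HA_POLL_PERIOD = 180

# Consider a lock lease older than ten minutes to be stale (seconds).
HA_LOCK_HOLD_TIME = 600

# Have client tools such as condor_q and condor_submit contact the
# named schedd by default instead of a local one.
SCHEDD_HOST = $(SCHEDD_NAME)
```

If these names are right, the ten-minute hold time is what keeps failover
inside the 20-minute job lease mentioned in the message. Leaving SCHEDD in
DAEMON_LIST would likely produce the symptom reported in the thread, with
each master starting its own schedd regardless of the lock.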