
Re: [Condor-users] Using condor to update condor



On Fri, Jun 24, 2005 at 07:23:23PM +0100, Kewley, J (John) wrote:
> On the contrary. My agreement that it would be nice to permit this software upgrade through condor channels
> had the key point of doing it when each machine was ready.
>  
> Out of interest: how do the Condor Team upgrade their machines with the new versions of Condor?
>  

We use some features of Condor to handle the updates. There are two 
parts of an upgrade: getting the new binaries in place and then telling
Condor to restart itself to use those binaries.

We have it easy when it comes to getting the new binaries in place, because
we have AFS available on each machine, so we can securely install software
owned by 'root' once and it's available to all the machines. This is going to
be the hard part of using Condor to update binaries, since Condor jobs run
as lower-privileged users. You'll need some sort of sudo script to
update your binaries if you're going to try to use a Condor job to push
out new binaries. (That would be nice: to update your pool, just queue
up 1 job per machine in your pool with
'Requirements = CondorVersion=="Condor.6.7.8"' to upgrade to 6.7.9.)
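
As a rough sketch of the "one job per machine" idea, the Python below
generates a submit description that pins one job to each host through the
Machine attribute. The host names and the upgrade_condor.sh executable are
hypothetical, and the script itself would still need the sudo arrangement
mentioned above; a version test like the one in the example could be ANDed
into the requirements expression.

```python
def build_submit_description(hosts, executable="upgrade_condor.sh"):
    """Return submit-description text that queues one job per host.

    `executable` is a hypothetical upgrade script, not something Condor ships.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "",
    ]
    for host in hosts:
        # Pin this job to exactly one machine, then queue a single instance.
        lines.append(f'requirements = (Machine == "{host}")')
        lines.append("queue")
        lines.append("")
    return "\n".join(lines)
```

The returned text would be written to a file and handed to condor_submit;
each requirements/queue pair yields one job that can only match the named
machine.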

So when we want to upgrade Condor, we create a new directory in AFS
like /unsup/condor/condor-6.7.8. We leave the old version installed in
/unsup/condor/condor-6.7.7.

Then we have a symbolic link, /unsup/condor/current, which points to
the version of Condor we want to run. The Condor config file uses this
path to say which binaries it should use. To "upgrade", we just change what
this symlink points at. Because we didn't delete the old version, the daemons
that are already running keep running from it. The condor_master checks the
timestamps of the daemons it's supposed to be running
(e.g. /unsup/condor/current/sbin/condor_startd), and when the symlink changes
it notices that it's a new file with a more recent timestamp and restarts
itself. We can (but don't currently) use MaxJobRetirementTime to give any
running jobs time to vacate.
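
A minimal sketch of the symlink switch, in Python. This is not the tooling
we actually use; it assumes a POSIX filesystem where rename() replaces an
existing symlink atomically (behaviour on AFS has not been checked here),
and the function name is made up for illustration.

```python
import os


def repoint_symlink(link_path, new_target):
    """Make link_path point at new_target without a window where it is missing."""
    tmp_path = f"{link_path}.tmp.{os.getpid()}"
    # Build the new link under a temporary name first.
    os.symlink(new_target, tmp_path)
    try:
        # rename() acts on the link itself, so the old link is swapped out
        # in a single step rather than removed and recreated.
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)
        raise
```

Something like repoint_symlink("/unsup/condor/current",
"/unsup/condor/condor-6.7.8") would do the switch; pointing the link back at
condor-6.7.7 is the rollback.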

(That description is actually a bit simplified, because we use the
AFS @sys links pretty extensively so we can run one version of Condor
on Solaris, and a different version on Linux. We've also got another layer
of symlinks between everything so we can have wrapper scripts around some
of the tools that detect if someone is trying to run condor_q on a machine
that we intentionally don't run a schedd on, etc.)

To protect against AFS outages, we actually run the condor_master
from the local hard drive, so if a machine falls off AFS the master will
survive and eventually restart the other daemons. What we do is define
in our config files:

MASTER = /usr/condor/ftsh_master
MASTER_BIN = /unsup/condor/sbin/condor_master
MASTER_WATCH_FILE = $(MASTER_BIN)

ftsh_master is a shell (well, fault-tolerant shell) script that looks at the
MASTER_BIN file in AFS and compares it to the local
/usr/condor/condor_master.bin file. If MASTER_BIN is more recent, it copies
the AFS file to /usr/condor/condor_master.new and then exchanges it with
/usr/condor/condor_master.bin. Then the script execs
/usr/condor/condor_master.bin. The MASTER_WATCH_FILE setting means "watch
the timestamp on $MASTER_WATCH_FILE instead of $MASTER, but when
$MASTER_WATCH_FILE changes, restart $MASTER".
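
The wrapper logic described above can be sketched in Python. The real
script is written in ftsh and is not reproduced here; this is a
reconstruction from the description, and the function names and the
decision to keep the local copy whenever AFS is unreachable are
assumptions.

```python
import os
import shutil


def refresh_local_master(afs_bin, local_bin, staging_path):
    """Copy afs_bin over local_bin if the AFS copy is newer.

    Returns True if the local copy was replaced, False otherwise.
    """
    try:
        afs_mtime = os.stat(afs_bin).st_mtime_ns
    except OSError:
        return False  # AFS unreachable: keep whatever is on local disk
    try:
        local_mtime = os.stat(local_bin).st_mtime_ns
    except FileNotFoundError:
        local_mtime = -1  # no local copy yet, so any AFS copy is newer
    if afs_mtime <= local_mtime:
        return False
    try:
        # Stage the copy first so a half-finished transfer never becomes
        # the live binary; copy2 keeps the mode bits and the timestamp.
        shutil.copy2(afs_bin, staging_path)
        os.replace(staging_path, local_bin)
    except OSError:
        if os.path.exists(staging_path):
            os.unlink(staging_path)
        return False
    return True


def run_master(afs_bin, local_bin, staging_path, args):
    """Refresh the local binary if possible, then exec it in place of this process."""
    refresh_local_master(afs_bin, local_bin, staging_path)
    os.execv(local_bin, [local_bin] + list(args))
```

With the paths from the config above, the call would be along the lines of
run_master("/unsup/condor/sbin/condor_master",
"/usr/condor/condor_master.bin", "/usr/condor/condor_master.new",
sys.argv[1:]).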

-Erik