Hi all,

We’re currently looking at upgrading our Condor pool from Condor 9.0.15 to Condor 10.0.9. I plan on upgrading the schedds first. In testing, this works as expected: the daemons get restarted, jobs in the queue are picked up again, and the schedd carries on where it left off. I then plan on upgrading the startds, which again goes smoothly. We have the config set up so that a graceful restart of the daemons happens: the jobs drain out, Condor is restarted, and jobs start to run on the startd once again.

However, when we upgrade the Central Managers, the startds lose communication with the Central Managers, and it is only re-established after a restart of the Condor daemons on the startd host, which would kill any running jobs on the node.

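For anyone wanting to reproduce the symptom, here is a minimal sketch of how the affected startds could be listed after the Central Manager upgrade. It assumes `condor_status` is on the PATH and can reach the collector, and it relies only on the standard `-startd` and `-af Machine` options; the function names and the host list are hypothetical and not part of our actual setup. Note that a startd's ad may only drop out of the collector once it has expired, so the check may need to be repeated a little while after the upgrade.

```python
import subprocess


def missing_startds(expected_hosts, condor_status_output):
    """Return the expected hosts that have no startd ad in the collector output."""
    # The output holds one Machine name per slot ad, so a host may repeat;
    # a set collapses the duplicates. Comparison is case-insensitive.
    reporting = {
        line.strip().lower()
        for line in condor_status_output.splitlines()
        if line.strip()
    }
    return sorted(h for h in expected_hosts if h.lower() not in reporting)


def query_collector():
    """Ask the collector for the Machine attribute of every startd ad."""
    result = subprocess.run(
        ["condor_status", "-startd", "-af", "Machine"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `missing_startds(expected, query_collector())` with `expected` set to the pool's worker node names (for example a hypothetical `["wn001.example.org", "wn002.example.org"]`) would return the nodes that are no longer advertising to the Central Managers.
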
Looking at the changes between the two versions, I believe this may be to do with the following:

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-283
- https://opensciencegrid.atlassian.net/browse/HTCONDOR-287
- https://opensciencegrid.atlassian.net/browse/HTCONDOR-1057

I’m guessing that, because of these changes in Condor 10, the startds need to re-negotiate security between the daemons, which requires this restart.

My question to the community is whether there is a way to upgrade the Condor pool without requiring the startd restart once the Central Managers are upgraded. Interestingly, this does not affect the schedds, which continue to communicate with the Central Managers.

Many thanks,

Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot