
[HTCondor-users] Condor upgrade from 9.0x to 10.0x



Hi all,

 

We’re currently looking at upgrading our Condor pool from Condor 9.0.15 to Condor 10.0.9. I plan to upgrade the schedds first; in testing, this works as expected: the daemons are restarted, jobs in the queue are picked up again, and the schedd carries on where it left off. I then plan to upgrade the startds, which also goes smoothly. Our configuration triggers a graceful restart of the daemons, so the jobs drain out, Condor is restarted, and jobs start running on the startd again. However, when we upgrade the Central Managers, the startds lose communication with the Central Managers, and it is only re-established after a restart of the Condor daemons on the startd host, which would kill any running jobs on the node.

 

Looking at the changes between the two versions, I believe this may be to do with the following:

 

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-283

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-287

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-1057

 

I’m guessing that, because of this change in Condor 10, the startds need to re-negotiate security between the daemons, which requires this restart. My question to the community is whether there is a way to upgrade the Condor pool without restarting the startds once the Central Managers are upgraded. Interestingly, this does not affect the schedds, which continue to communicate with the Central Managers.

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
