
Re: [HTCondor-users] Condor upgrade from 9.0x to 10.0x



I'm surprised that the schedds are able to reconnect to the central manager immediately after the central manager is upgraded but the startds are not. My first suspicion is that the central manager is unable to connect to the startds. Do you know if the startds can reconnect immediately when a 9.0.15 central manager restarts (without a version upgrade)?

I believe this is tied to the in-memory security session shared between each startd and the collector, which is affected by HTCONDOR-1057. When the collector restarts, all of its security sessions are lost, so when a startd attempts to connect using its existing security session, the connection will fail. The startd will keep trying to use that security session for future connections until either it's told the session is now invalid or the session duration expires (the default duration is 1 day). At that point, the startd will attempt to authenticate from scratch (using tokens, a pool password, etc.).

In HTCondor 9.0, the collector (or any daemon on the server side of a connection) sends a notification about an invalid session via a separate TCP connection back to the client. If this connection fails, then the client won't learn about the invalidation. Starting with 10.0, the server sends the invalidation notification over the same TCP connection that the client made. This is more reliable, but requires both sides to be 10.0 or later. In your upgrade case, the in-band notification won't be used, because the startd believes the collector is version 9.0 (based on the existing security session).
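
To make the reasoning above concrete, here is a small toy model in Python. It is not HTCondor code; the names (CachedSession, learns_of_invalidation, callback_reachable) are made up for illustration, and it assumes that when the in-band path is not used, the server falls back to the separate connection described for 9.0.

```python
from dataclasses import dataclass

# First version that supports in-band invalidation, per the description above.
IN_BAND_MIN_VERSION = (10, 0)


@dataclass
class CachedSession:
    """Hypothetical stand-in for a client's cached security session."""
    peer_version: tuple  # server version recorded when the session was created


def learns_of_invalidation(client_version, server_version, session,
                           callback_reachable):
    """Return True if the client finds out its session is invalid.

    In-band notification needs both sides at 10.0 or later, and the client
    must also believe the server is 10.0 or later (from the cached session).
    Otherwise the outcome depends on whether the server can open a separate
    TCP connection back to the client (assumed fallback).
    """
    in_band = (client_version >= IN_BAND_MIN_VERSION
               and server_version >= IN_BAND_MIN_VERSION
               and session.peer_version >= IN_BAND_MIN_VERSION)
    return in_band or callback_reachable


# Upgrade case: startd 10.0.9, collector just upgraded to 10.0.9, but the
# session was created against a 9.0.15 collector and the callback is blocked.
stale = CachedSession(peer_version=(9, 0, 15))
print(learns_of_invalidation((10, 0, 9), (10, 0, 9), stale, False))
```

In this model the upgrade case only succeeds if the collector can connect back to the startd, which is why the network path from the collector to the startds is the first thing to check.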

If you can't identify a fixable network issue that's preventing the collector from connecting to the startds, another option is to shorten the security session duration from 1 day to 1 hour (set SEC_DEFAULT_SESSION_DURATION=3600 in the startds' configuration). Do this at least one day before upgrading the central manager, so that sessions created with the old 1-day duration have expired by then (when upgrading the startds would be a good time). That way, after the central manager is upgraded, the amount of time the startd ads are missing from the collector is minimized.
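
As a rough illustration of the timing, the Python sketch below treats a session as simply expiring a fixed duration after it was created. That is a simplification of the behavior described above, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_outage(session_duration_s, session_age_at_restart_s=0):
    """Seconds a startd may keep retrying a stale session after the
    collector restarts, assuming it is never told the session is invalid."""
    return max(0, session_duration_s - session_age_at_restart_s)


def earliest_upgrade(config_changed_at, old_duration_s=86400):
    """Earliest time by which every session created under the old duration
    has expired, so only short-lived sessions remain."""
    return config_changed_at + timedelta(seconds=old_duration_s)


print(worst_case_outage(86400))        # default 1-day sessions
print(worst_case_outage(3600))         # with SEC_DEFAULT_SESSION_DURATION=3600
print(earliest_upgrade(datetime(2023, 10, 4, 9, 0)))
```

Under these assumptions, the window during which startd ads can be missing shrinks from up to a day to up to an hour.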

Keep in mind that when the startds can't connect to the collector, their jobs can continue running and the matched schedds can start new jobs (limited by the CLAIM_WORKLIFE configuration knob).

 - Jaime

On Oct 4, 2023, at 3:16 AM, Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx> wrote:

Hi all,
 
We're currently looking at upgrading our Condor pool from Condor 9.0.15 to Condor 10.0.9. I plan on upgrading the schedds first. In testing, this works as expected: the daemons get restarted, jobs in the queue are picked up again, and the schedd carries on where it left off. I then plan on upgrading the startds, which again goes smoothly. We have the config set up so that a graceful restart of the daemons happens: the jobs drain out, Condor is restarted, and jobs start to run on the startd once again. However, when we upgrade the central managers, the startds lose communication with the central managers, and it is only re-established after a restart of the Condor daemons on the startd host, which would kill any running jobs on the node.
 
Looking at the changes between the two versions, I believe this may be to do with the following:
 
 
I'm guessing that, because of this change in Condor 10, the startds need to renegotiate security with the central manager daemons, which requires this restart. My question to the community is whether there is a way to upgrade the Condor pool without requiring a startd restart once the central managers are upgraded. Interestingly, this does not affect the schedds, which continue to communicate with the central managers.
 
Many thanks,
 
Thomas Birkett
Senior Systems Administrator
Scientific Computing Department  
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX
 
 
 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/