Re: [HTCondor-users] Condor upgrade from 9.0x to 10.0x

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Hi Jaime,

Thank you for the in-depth information. Using this we have planned our upgrade accordingly and will be on Condor 10 soon! Once again thanks for the background.

Tom

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Friday, 6 October 2023 at 22:45
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jaime Frey <jfrey@xxxxxxxxxxx>, condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor upgrade from 9.0x to 10.0x

I’m surprised that the schedds are able to reconnect to the central manager immediately after upgrading the central manager but the startds are not. My first suspicion here is that the central manager is unable to connect to the startds. Do you know if the startds can reconnect immediately when a 9.0.15 central manager restarts (without a version upgrade)?

I believe this is tied to the in-memory security session shared between each startd and the collector, which is affected by HTCONDOR-1057. When the collector restarts, all of its security sessions are lost, so when a startd attempts to connect using the security session, it will fail. The startd will continue to attempt using the security session for future connections until either it’s told the session is now invalid or the session duration expires (default duration is 1 day). At that point, the startd will attempt to authenticate from scratch (using tokens, a pool password, etc).
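As a rough illustration of the behaviour described above, here is a minimal Python sketch of a client that keeps using a cached session until it is invalidated or expires. This is a toy model, not HTCondor's implementation: the class `CachedSession`, the function `connect`, and the returned strings are all hypothetical names chosen for the example. Only the 1-day default duration comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Default session duration mentioned above (1 day), in seconds.
DEFAULT_SESSION_DURATION = 86400


@dataclass
class CachedSession:
    """Hypothetical stand-in for a security session cached by the startd."""
    created_at: float
    duration: float = DEFAULT_SESSION_DURATION
    invalidated: bool = False

    def usable(self, now: float) -> bool:
        # The client keeps trusting the session until it is told the
        # session is invalid or the session duration has passed.
        return not self.invalidated and now < self.created_at + self.duration


def connect(session: Optional[CachedSession], now: float,
            server_has_session: bool) -> str:
    """Describe what one connection attempt does in this toy model."""
    if session is not None and session.usable(now):
        # A restarted collector has lost its sessions, so resuming fails.
        return "resumed" if server_has_session else "failed"
    # No usable session: fall back to full authentication.
    return "authenticated from scratch"
```

In this model, a startd whose session was created at time 0 keeps failing against a restarted collector until either `invalidated` is set or 86400 seconds have passed, after which it authenticates from scratch.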

In HTCondor 9.0, the collector (or any daemon on the server side of a connection) sends a notification about an invalid session via a separate TCP connection back to the client. If this connection fails, then the client won’t learn about the invalidation. Starting with 10.0, the server sends an invalidation notification in the same TCP connection made by the client. This is more reliable, but requires both sides to be 10.0 or later. In your upgrade case, the in-band notification won’t be used, because the startd believes the collector is version 9.0 (from the existing security session).
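The version rule above can be sketched as a small decision function. This is only a reading of the paragraph, not HTCondor code: `invalidation_path`, its parameters, and the returned labels are hypothetical, and the model simply assumes the in-band path is used only when the client, the actual server, and the server version recorded in the client's existing session are all 10.0 or later.

```python
def parse_version(version: str) -> tuple:
    """Turn a version string such as '10.0.9' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def invalidation_path(client_version: str,
                      server_version: str,
                      server_version_in_session: str) -> str:
    """Pick the notification path under the assumptions stated above.

    server_version_in_session is the collector version the client
    remembers from when the security session was created.
    """
    minimum = (10, 0)
    all_new = all(
        parse_version(v) >= minimum
        for v in (client_version, server_version, server_version_in_session)
    )
    return "in-band" if all_new else "separate TCP connection"


# The upgrade case from this thread: the startd and collector are both
# 10.0.9, but the startd's session was created against a 9.0.15 collector.
print(invalidation_path("10.0.9", "10.0.9", "9.0.15"))
```

With these inputs the sketch selects the separate TCP connection, which is the path that fails if the collector cannot reach the startd.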

If you can’t identify a fixable network issue that’s preventing the collector from connecting to the startds, another option is to shorten the security session duration from 1 day to 1 hour (set SEC_DEFAULT_SESSION_DURATION=3600 in the startds’ configuration). Do this at least one day before upgrading the central manager (when upgrading the startds would be a good time). That way, after the central manager is upgraded, the amount of time the startd ads are missing from the collector is minimized.
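To show why the lead time of one day matters, here is a small Python estimate of the worst-case gap. It rests on an assumption not stated in the thread: that a session created under the old duration keeps its original expiry after the configuration change. The function name `worst_case_gap` and its parameters are made up for this example.

```python
def worst_case_gap(old_duration: int, new_duration: int,
                   seconds_since_change: int) -> int:
    """Longest time (seconds) a startd might keep using a stale session.

    Assumes sessions created before the config change still expire on
    their original schedule, and sessions created afterwards use the
    new, shorter duration.
    """
    # A session created just before the change has this much life left
    # when the central manager restarts.
    oldest_remaining = old_duration - seconds_since_change
    # A session created just before the restart has the full new duration.
    return max(new_duration, oldest_remaining)


ONE_DAY = 86400   # default session duration, in seconds
ONE_HOUR = 3600   # SEC_DEFAULT_SESSION_DURATION=3600

for waited in (0, 12 * 3600, ONE_DAY):
    gap = worst_case_gap(ONE_DAY, ONE_HOUR, waited)
    print(f"waited {waited:>6} s before the upgrade: gap up to {gap} s")
```

Under that assumption, the gap only drops to one hour once a full day has passed since the startds picked up the shorter duration.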

Keep in mind that when the startds can’t connect to the collector, their jobs can continue running and the matched schedds can start new jobs (limited by configuration knob CLAIM_WORKLIFE).

- Jaime

On Oct 4, 2023, at 3:16 AM, Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx> wrote:

Hi all,

We’re currently looking at upgrading our Condor pool from Condor 9.0.15 to Condor 10.0.9. I plan on upgrading the schedds first. In testing, this works as expected: the daemons are restarted, jobs in the queue are picked up again, and the schedd carries on where it left off. I then plan on upgrading the startds, which again goes smoothly. We have the config set up so that a graceful restart of the daemons happens: the jobs drain out, Condor is restarted, and jobs start to run on the startd once again. However, when we upgrade the central managers, the startds lose communication with the central managers, and it is only re-established after a restart of the Condor daemons on the startd host, which would kill any running jobs on the node.

Looking at the changes between the two versions, I believe this may be to do with the following:

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-283

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-287

- https://opensciencegrid.atlassian.net/browse/HTCONDOR-1057

I’m guessing that, because of this change in Condor 10, the startds need to renegotiate the security between the daemons, which requires this restart. My question to the community is whether there is a way to upgrade the Condor pool without requiring a startd restart once the central managers are upgraded. Interestingly, this does not affect the schedds, which continue to communicate with the central managers.

Many thanks,

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
