
Re: [HTCondor-users] best way to switch a main negotiator/collector head towards a secondary?




Hi Thomas:

I think it might help to go down into the details a bit here, to understand what the best approach is.

The CM has two components, the collector and the negotiator. In an HA setup, there are usually two collectors, and all HTCondor daemons advertise to both; queries pick one collector. In HA terms, the collector is Active-Active. The negotiator is different: only one can be active at a time. However, if no negotiator is running in the pool, all jobs continue to run as usual, and the schedds can even start new jobs using the matches they currently hold.

The persistent state in the negotiator is the historical accounting information, stored in the "Accountantnew.log" file. In a HAD central manager setup, the HAD and REPLICATION daemons periodically transfer the Accountantnew.log file from the active to the backup machine and heartbeat the two machines. If the currently active negotiator fails, HAD starts the other negotiator, with potentially somewhat out-of-date accounting information.
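
To make the moving parts concrete, here is a minimal sketch of what the central-manager side of such a setup might look like, built from the knobs described in the high-availability section of the HTCondor manual. The host names are the ones from Thomas's example; the port numbers and the interval are placeholder assumptions, not recommendations, and a real deployment needs further settings (daemon arguments, security) that are left out here.

```
# Sketch only: two central managers, collectors Active-Active,
# one negotiator active at a time under HAD control.
CENTRAL_MANAGER1 = primary.site.foo
CENTRAL_MANAGER2 = fallback.site.foo
CONDOR_HOST = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

# Placeholder ports (assumed values; pick whatever suits the site).
HAD_PORT = 51450
REPLICATION_PORT = 41450

# Both lists must be identical, and in the same order, on both machines.
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)

# Treat the first HAD_LIST entry as the preferred active negotiator.
HAD_USE_PRIMARY = True

# Replicate the negotiator's accounting state to the backup machine.
HAD_USE_REPLICATION = True
# Seconds between replication checks (assumed value).
REPLICATION_INTERVAL = 300

# Let HAD, rather than condor_master on its own, start and stop the negotiator.
MASTER_NEGOTIATOR_CONTROLLER = HAD
DAEMON_LIST = $(DAEMON_LIST), HAD, REPLICATION
```

Note that in this sketch only the HAD and REPLICATION lists decide which machine runs the negotiator; the execute and submit nodes just need both hosts in their collector list.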

Most sites are willing to tolerate some time with no negotiator running at all, as little throughput is generally lost.

-greg

On 8/27/21 8:03 AM, Thomas Hartmann wrote:
Hi all,

what would probably be the best way to move gracefully from a master negotiator/collector to the secondary fallback master in an HA setup?

E.g., with something like [1], where the collector/negotiator on `primary.site.foo` would be the default and `fallback.site.foo` would only take over when primary does not answer within the backoff constant.

Now, for maintenance/rebuilds/... of `primary` I would like to play it safe and gracefully switch the negotiation to the secondary for the whole cluster, i.e., something like "draining" the resource updates and negotiations without affecting the existing jobs and shadows on the startds and schedds.

Would it be sufficient just to switch the ranking of the masters
 CENTRAL_MANAGER1 = fallback.site.foo
 CENTRAL_MANAGER2 = primary.site.foo
?

I am a bit unsure how best to take the backoff constant and negotiation cycle durations into account with respect to the deployment time on the cluster. Since our Puppet runs happen every 30m per node, this would be the worst-case time a config update might take to reach a node in the cluster.
I.e., if the cluster runs for up to 30m in a mixed state, with some nodes on the default and some already on the inverted master ranking, would we have two active collectors/negotiators? (which is probably not a good thing...)
Is this something to worry about and is there a better approach - or am I maybe overthinking it?

Cheers,
 Thomas


[1]
> cat 01masterd_ha.conf
CENTRAL_MANAGER1 = primary.site.foo
CENTRAL_MANAGER2 = fallback.site.foo
CONDOR_HOST = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

DAEMON_LIST = $(DAEMON_LIST), HAD, REPLICATION
MASTER_HAD_BACKOFF_CONSTANT = 360
...

