
[HTCondor-users] High rate of random "Failed to securely exchange session key" errors in 8.6.x



Hi,

I recently upgraded a large number of worker nodes from 8.6.1 to 8.6.2 and found that, despite having:

MASTER_NEW_BINARY_RESTART=PEACEFUL

jobs were killed on some (but not all) of the worker nodes. I then noticed that "condor_reconfig" fails intermittently. Here it is run four times in a row on one worker node, with the fourth attempt failing:

[root@lcg1377 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@lcg1377 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@lcg1377 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@lcg1377 ~]# condor_reconfig
ERROR
AUTHENTICATE:1005:Failed to securely exchange session key
Can't send Reconfig command to local master

And with the "-debug" option:

[root@lcg1377 ~]# condor_reconfig -debug
05/15/17 19:41:12 condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <X.X.X.X:9618>.
05/15/17 19:41:12 IO: Failed to read packet header
05/15/17 19:41:12 SECMAN: required authentication with <X.X.X.X:9618> failed, so aborting command DC_RECONFIG_FULL.
ERROR
AUTHENTICATE:1005:Failed to securely exchange session key
Can't send Reconfig command to local master
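
For anyone who wants to put a number on how often this happens, here is a minimal sketch that runs a command repeatedly and counts the failures. The function name count_failures is hypothetical, and it assumes that a failed condor_reconfig either exits with a non-zero status or prints "ERROR" as in the output above; I have not checked the exit status in this particular failure case.

```python
import subprocess


def count_failures(cmd, attempts):
    """Run `cmd` (a list of arguments) `attempts` times.

    Returns (number_of_failures, list_of_outputs_from_failed_runs).
    Assumption: a run failed if it exited non-zero or printed "ERROR".
    """
    failures = 0
    failed_outputs = []
    for _ in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        output = (result.stdout + result.stderr).strip()
        if result.returncode != 0 or "ERROR" in output:
            failures += 1
            failed_outputs.append(output)
    return failures, failed_outputs
```

On a worker node, something like count_failures(["condor_reconfig"], 100) would give a rough failure rate to compare between 8.4.x and 8.6.x hosts. Bear in mind that each call really does send a Reconfig command to the local master.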

Today I upgraded the central managers to 8.6.2 and have just noticed large numbers of "Failed to securely exchange session key" errors in CollectorLog and NegotiatorLog. Despite these frequent errors, things are generally working fine.
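
To see whether the errors in the daemon logs cluster at particular times, a rough sketch like the following counts them per hour. It assumes the log lines start with the same "MM/DD/YY HH:MM:SS" timestamp that appears in the -debug output above; the function name errors_per_hour is hypothetical, and lines without a leading timestamp are counted under "unknown".

```python
import re
from collections import Counter

NEEDLE = "Failed to securely exchange session key"
# Assumed timestamp prefix, e.g. "05/15/17 19:41:12 ..."
STAMP = re.compile(r"^(\d\d/\d\d/\d\d) (\d\d):\d\d:\d\d ")


def errors_per_hour(lines):
    """Count lines containing NEEDLE, keyed by "MM/DD/YY HH"."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = STAMP.match(line)
        key = f"{match.group(1)} {match.group(2)}" if match else "unknown"
        counts[key] += 1
    return counts
```

It can be fed an open log file directly, for example errors_per_hour(open(path)) with path pointing at wherever CollectorLog lives on the central manager.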

We are using FS and PASSWORD authentication.

Has anyone else seen this? I don't remember ever seeing issues like this with 8.4.x.

Thanks,
Andrew.