
Re: [HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1



On 11/03/19 15:45, Brian Lin wrote:
> That's curious, do you see any errors in /etc/condor/CollectorLog on
> htc-2.cr.cnaf.infn.it?

Yes, see below.

> What's `condor_config_val COLLECTOR_HOST` return on the CE?

[root@htc-2 condor]# condor_config_val COLLECTOR_HOST
htc-2.cr.cnaf.infn.it

> How about `condor_status -schedd` on the central manager?

At this very moment the cluster is quite broken and the CM does not start (CEDAR:6001:Failed to connect to <131.154.195.32:9618>). I downgraded and upgraded again, neutralizing the configurations from the Puppet classes.

> Thanks,
> Brian



I raised the log verbosity; my understanding (see logs below) is that the JobRouter at ce02-htc fails to authenticate with the CM at htc-2 because it attempts the FS method, which fails because the two hosts have no common filesystem. The SEC_*_AUTHENTICATION_METHODS (and most of the other settings) seem to be equivalent to those of the other cluster. I tried adding the PASSWORD method:
SEC_*_AUTHENTICATION_METHODS = ..., PASSWORD
but it didn't work; maybe I missed the right combination, though.
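
For reference, a minimal sketch of what a PASSWORD setup would roughly need, under assumptions that are not confirmed by the configuration at either site. The CollectorLog below shows the collector offering only FS (my_methods = 'FS'), so PASSWORD would have to appear in the method list on the collector side as well as on the client side, and both hosts would need the same pool password file. The file path and the choice of the SEC_DEFAULT_* and SEC_CLIENT_* knobs here are illustrative assumptions:

```
# Hypothetical sketch, not the configuration in use on htc-2 or ce02-htc.
# It would have to be present on BOTH the central manager and the CE.

# Offer PASSWORD in addition to FS when acting as a server (e.g. the collector) ...
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
# ... and when acting as a client (e.g. the JobRouter querying the collector).
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD

# Location of the shared pool password file (path is an assumption).
# The same password must be stored on both hosts, e.g. with
# `condor_store_cred -c add` run as root on each of them.
SEC_PASSWORD_FILE = /etc/condor/pool_password
```

Since the JobRouter on the CE is run by HTCondor-CE, it may also matter which configuration tree the setting lands in; comparing `condor_ce_config_val -dump | grep AUTHENTICATION_METHODS` on the CE with `condor_config_val -dump | grep AUTHENTICATION_METHODS` on the CM should show which lists are actually in effect.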

The IPs in the logs are:
(131.154.195.32 == htc-2.cr.cnaf.infn.it)
(131.154.192.41 == ce02-htc.cr.cnaf.infn.it)

From JobRouterLog at ce02-htc:

03/11/19 07:13:28 (D_ALWAYS:2) Will use TCP to update collector htc-2.cr.cnaf.infn.it <131.154.195.32:9618>
03/11/19 07:13:28 (D_ALWAYS:2) Trying to query collector <131.154.195.32:9618>
03/11/19 07:13:28 (D_ALWAYS) SECMAN: required authentication with collector at <131.154.195.32:9618> failed, so aborting command QUERY_SCHEDD_ADS.
03/11/19 07:13:28 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
03/11/19 07:13:28 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618) Can't find address of schedd
03/11/19 07:13:28 (D_ALWAYS) JobRouter failure (src=320.0,route=condor_pool_cms): failed to submit job

CollectorLog at htc-2.cr.cnaf.infn.it:

03/11/19 07:13:39 SECMAN: new session, doing initial authentication.
03/11/19 07:13:39 Returning to DC while we wait for socket to authenticate.
03/11/19 07:13:39 AUTHENTICATE: setting timeout for (unknown) to 20.
03/11/19 07:13:39 HANDSHAKE: in handshake(my_methods = 'FS')
03/11/19 07:13:39 HANDSHAKE: handshake() - i am the server
03/11/19 07:13:39 HANDSHAKE: client sent (methods == 4)
03/11/19 07:13:39 HANDSHAKE: i picked (method == 4)
03/11/19 07:13:39 HANDSHAKE: client received (method == 4)
03/11/19 07:13:39 FS: client template is /tmp/FS_XXXXXXXXX
03/11/19 07:13:39 FS: client filename is /tmp/FS_XXXU3AGXf
03/11/19 07:13:39 Will return to DC because authentication is incomplete.
03/11/19 07:13:39 AUTHENTICATE_FS: used dir /tmp/FS_XXXU3AGXf, status: 0
03/11/19 07:13:39 AUTHENTICATE: method -1 (FS) failed.
03/11/19 07:13:39 HANDSHAKE: in handshake(my_methods = 'FS')
03/11/19 07:13:39 AUTHENTICATE: handshake would block
03/11/19 07:13:39 Will return to DC to continue authentication..
03/11/19 07:13:39 HANDSHAKE: handshake() - i am the server
03/11/19 07:13:39 HANDSHAKE: client sent (methods == 0)
03/11/19 07:13:39 HANDSHAKE: i picked (method == 0)
03/11/19 07:13:39 HANDSHAKE: client received (method == 0)
03/11/19 07:13:39 DC_AUTHENTICATE: required authentication of 131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXU3AGXf)
03/11/19 07:13:39 DC_AUTHENTICATE: received DC_AUTHENTICATE from <131.154.192.41:12036>
03/11/19 07:13:39 DC_AUTHENTICATE: generating BLOWFISH key for session htc-2:13943:1552284819:2284...



Thanks for your help
Stefano