
Re: [HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1





On 11/03/19 21:13, Brian Lin wrote:
Does `condor_status -schedd -pool htc-2.cr.cnaf.infn.it` succeed from
the old CE but fail from ce02? I'd be surprised if anything worked since
`condor_status -schedd` from the central manager isn't working!
In the "old cluster" (ce01-htc, htc-1):

[root@ce01-htc ~]# condor_status -schedd -pool htc-1.cr.cnaf.infn.it
Name                     Machine                  RunningJobs IdleJobs HeldJobs
ce01-htc.cr.cnaf.infn.it ce01-htc.cr.cnaf.infn.it 32          42       0

               TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total                        32            42             0


[root@htc-1 ~]# condor_status -schedd
Name                     Machine                  RunningJobs IdleJobs HeldJobs
ce01-htc.cr.cnaf.infn.it ce01-htc.cr.cnaf.infn.it 32          42       0

               TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total                        32            42             0

Stefano



Is READ access to the collector restricted? Running `condor_ping
-verbose -type collector READ` from the CE host would give you a good
idea of the required permissions. However, I'm just realizing we don't
have a 'condor_ce_store_cred' [1], so the instructions for setting up
password auth [2] won't work on the CE side.

- Brian

[1] https://github.com/opensciencegrid/htcondor-ce/pull/218

[2] http://research.cs.wisc.edu/htcondor/manual/v8.8/Security.html#x36-2780003.8.3


On 3/11/19 11:53 AM, Stefano Dal Pra wrote:
On 11/03/19 15:45, Brian Lin wrote:
That's curious, do you see any errors in /etc/condor/CollectorLog on
htc-2.cr.cnaf.infn.it?
Yes, see below.
What does `condor_config_val COLLECTOR_HOST` return
[root@htc-2 condor]# condor_config_val COLLECTOR_HOST
htc-2.cr.cnaf.infn.it

on the CE? How about `condor_status -schedd` on the central manager?
At this very moment the cluster is in a bad state and the CM does not
start (CEDAR:6001:Failed to connect to <131.154.195.32:9618>).
I downgraded and upgraded again, neutralizing the configuration from
the Puppet classes.
Thanks,
Brian


I raised the log verbosity; my understanding (see logs below) is that the
JobRouter at ce02-htc fails to authenticate with the CM at htc-2
because it attempts the FS method, which fails because the two hosts
share no common filesystem.
The SEC_*_AUTHENTICATION_METHODS settings (and most other settings) seem
to be equivalent to those on the other cluster.
I tried adding the PASSWORD method: SEC_*_AUTHENTICATION_METHODS =
..., PASSWORD
but it didn't work; maybe I missed the right combination, though.

The IPs in the logs are:
(131.154.195.32 == htc-2.cr.cnaf.infn.it)
(131.154.192.41 == ce02-htc.cr.cnaf.infn.it)

From JobRouterLog at ce02-htc:

03/11/19 07:13:28 (D_ALWAYS:2) Will use TCP to update collector
htc-2.cr.cnaf.infn.it <131.154.195.32:9618>
03/11/19 07:13:28 (D_ALWAYS:2) Trying to query collector
<131.154.195.32:9618>
03/11/19 07:13:28 (D_ALWAYS) SECMAN: required authentication with
collector at <131.154.195.32:9618> failed, so aborting command
QUERY_SCHEDD_ADS.
03/11/19 07:13:28 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to
authenticate with any method|AUTHENTICATE:1004:Failed to authenticate
using FS
03/11/19 07:13:28 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
Can't find address of schedd
03/11/19 07:13:28 (D_ALWAYS) JobRouter failure
(src=320.0,route=condor_pool_cms): failed to submit job

CollectorLog at htc-2.cr.cnaf.infn.it:

03/11/19 07:13:39 SECMAN: new session, doing initial authentication.
03/11/19 07:13:39 Returning to DC while we wait for socket to
authenticate.
03/11/19 07:13:39 AUTHENTICATE: setting timeout for (unknown) to 20.
03/11/19 07:13:39 HANDSHAKE: in handshake(my_methods = 'FS')
03/11/19 07:13:39 HANDSHAKE: handshake() - i am the server
03/11/19 07:13:39 HANDSHAKE: client sent (methods == 4)
03/11/19 07:13:39 HANDSHAKE: i picked (method == 4)
03/11/19 07:13:39 HANDSHAKE: client received (method == 4)
03/11/19 07:13:39 FS: client template is /tmp/FS_XXXXXXXXX
03/11/19 07:13:39 FS: client filename is /tmp/FS_XXXU3AGXf
03/11/19 07:13:39 Will return to DC because authentication is incomplete.
03/11/19 07:13:39 AUTHENTICATE_FS: used dir /tmp/FS_XXXU3AGXf, status: 0
03/11/19 07:13:39 AUTHENTICATE: method -1 (FS) failed.
03/11/19 07:13:39 HANDSHAKE: in handshake(my_methods = 'FS')
03/11/19 07:13:39 AUTHENTICATE: handshake would block
03/11/19 07:13:39 Will return to DC to continue authentication..
03/11/19 07:13:39 HANDSHAKE: handshake() - i am the server
03/11/19 07:13:39 HANDSHAKE: client sent (methods == 0)
03/11/19 07:13:39 HANDSHAKE: i picked (method == 0)
03/11/19 07:13:39 HANDSHAKE: client received (method == 0)
03/11/19 07:13:39 DC_AUTHENTICATE: required authentication of
131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with
any method|AUTHENTICATE:1004:Failed to authenticate using FS|
FS:1004:Unable to lstat(/tmp/FS_XXXU3AGXf)
03/11/19 07:13:39 DC_AUTHENTICATE: received DC_AUTHENTICATE from
<131.154.192.41:12036>
03/11/19 07:13:39 DC_AUTHENTICATE: generating BLOWFISH key for session
htc-2:13943:1552284819:2284...
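
The DC_AUTHENTICATE failure above packs several errors into one
'|'-separated string, and the last entry is usually the most specific
one. A minimal Python sketch for splitting such a string follows. It
assumes each entry has the subsystem:code:message shape seen in these
logs; parse_error_stack and ErrorEntry are hypothetical helper names,
not part of HTCondor.

```python
from typing import List, NamedTuple


class ErrorEntry(NamedTuple):
    """One entry of an error stack as printed in the logs above."""
    subsystem: str
    code: int
    message: str


def parse_error_stack(stack: str) -> List[ErrorEntry]:
    """Split a '|'-separated error string into its entries.

    Assumes every entry looks like SUBSYSTEM:CODE:MESSAGE. The message
    itself may contain ':' so only the first two colons are split on.
    """
    entries = []
    for part in stack.split("|"):
        subsystem, code, message = part.split(":", 2)
        entries.append(ErrorEntry(subsystem.strip(), int(code), message.strip()))
    return entries


# Error string copied from the CollectorLog excerpt above.
stack = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXU3AGXf)"
)

for entry in parse_error_stack(stack):
    print(f"{entry.subsystem:<12} {entry.code}  {entry.message}")
```

Read this way, the innermost entry (FS, unable to lstat the temporary
directory) is consistent with the collector looking for a directory
that the client created on a different host.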



Thanks for your help
Stefano