
Re: [HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1



Oh, you're also spinning up a whole new cluster? I think we'll need to 
sort out the issues with your CM before we can tackle the CE issues. 
Anything interesting in /var/log/condor on the CM?

- Brian

On 3/11/19 4:05 PM, Stefano Dal Pra wrote:
>
>
> On 11/03/19 21:13, Brian Lin wrote:
>> Does `condor_status -schedd -pool htc-2.cr.cnaf.infn.it` succeed from
>> the old CE but fail from ce02? I'd be surprised if anything worked since
>> `condor_status -schedd` from the central manager isn't working!
> In the "old cluster" (ce01-htc , htc-1):
>
> [root@ce01-htc ~]# condor_status -schedd -pool htc-1.cr.cnaf.infn.it
> Name                     Machine                  RunningJobs IdleJobs HeldJobs
>
> ce01-htc.cr.cnaf.infn.it ce01-htc.cr.cnaf.infn.it          32       42        0
>
>                      TotalRunningJobs TotalIdleJobs TotalHeldJobs
>
>                Total               32            42             0
>
>
> [root@htc-1 ~]# condor_status -schedd
> Name                     Machine                  RunningJobs IdleJobs HeldJobs
>
> ce01-htc.cr.cnaf.infn.it ce01-htc.cr.cnaf.infn.it          32       42        0
>
>                      TotalRunningJobs TotalIdleJobs TotalHeldJobs
>
>                Total               32            42             0
>
> Stefano
>
>
>>
>> Is READ access to the collector restricted? Running `condor_ping
>> -verbose -type collector READ` from the CE host would give you a good
>> idea of the required permissions. However, I'm just realizing we don't
>> have a 'condor_ce_store_cred' [1], so the instructions for setting up
>> password auth [2] won't work on the CE side.
>>
>> - Brian
>>
>> [1] https://github.com/opensciencegrid/htcondor-ce/pull/218
>>
>> [2]
>> http://research.cs.wisc.edu/htcondor/manual/v8.8/Security.html#x36-2780003.8.3 
>>
>>
>>
>> On 3/11/19 11:53 AM, Stefano Dal Pra wrote:
>>> On 11/03/19 15:45, Brian Lin wrote:
>>>> That's curious. Do you see any errors in /etc/condor/CollectorLog on
>>>> htc-2.cr.cnaf.infn.it?
>>> Yes, see below.
>>>> What does `condor_config_val COLLECTOR_HOST` return
>>> [root@htc-2 condor]# condor_config_val COLLECTOR_HOST
>>> htc-2.cr.cnaf.infn.it
>>>
>>>> on the CE? How about `condor_status -schedd` on the central manager?
>>> At this very moment the cluster is in bad shape and the CM does not
>>> start (CEDAR:6001:Failed to connect to <131.154.195.32:9618>).
>>> I downgraded and upgraded again, neutralizing the configurations from
>>> the puppet classes.
>>>> Thanks,
>>>> Brian
>>>
>>>
>>> I raised the log verbosity; my understanding (see logs below) is that
>>> the JobRouter at ce02-htc fails to authenticate with the CM at htc-2
>>> because it attempts the FS method, which fails because the two hosts
>>> have no common filesystem.
>>> The SEC_*_AUTHENTICATION_METHODS (and most other settings) seem to
>>> be equivalent to those of the other cluster.
>>> I tried adding the PASSWORD method: SEC_*_AUTHENTICATION_METHODS =
>>> ..., PASSWORD
>>> but it didn't work; maybe I missed the right combination, though.
>>>
>>> The IPs in the logs are:
>>> (131.154.195.32 == htc-2.cr.cnaf.infn.it)
>>> (131.154.192.41 == ce02-htc.cr.cnaf.infn.it)
>>>
>>> From JobRouterLog at ce02-htc:
>>>
>>> 03/11/19 07:13:28 (D_ALWAYS:2) Will use TCP to update collector
>>> htc-2.cr.cnaf.infn.it <131.154.195.32:9618>
>>> 03/11/19 07:13:28 (D_ALWAYS:2) Trying to query collector
>>> <131.154.195.32:9618>
>>> 03/11/19 07:13:28 (D_ALWAYS) SECMAN: required authentication with
>>> collector at <131.154.195.32:9618> failed, so aborting command
>>> QUERY_SCHEDD_ADS.
>>> 03/11/19 07:13:28 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to
>>> authenticate with any method|AUTHENTICATE:1004:Failed to authenticate
>>> using FS
>>> 03/11/19 07:13:28 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
>>> Can't find address of schedd
>>> 03/11/19 07:13:28 (D_ALWAYS) JobRouter failure
>>> (src=320.0,route=condor_pool_cms): failed to submit job
>>>
>>> CollectorLog at htc-2.cr.cnaf.infn.it:
>>>
>>> 03/11/19 07:13:39 SECMAN: new session, doing initial authentication.
>>> 03/11/19 07:13:39 Returning to DC while we wait for socket to
>>> authenticate.
>>> 03/11/19 07:13:39 AUTHENTICATE: setting timeout for (unknown) to 20.
>>> 03/11/19 07:13:39 HANDSHAKE: in handshake(my_methods = 'FS')
>>> 03/11/19 07:13:39 HANDSHAKE: handshake() - i am the server
>>> 03/11/19 07:13:39 HANDSHAKE: client sent (methods == 4)
>>> 03/11/19 07:13:39 HANDSHAKE: i picked (method == 4)
>>> 03/11/19 07:13:39 HANDSHAKE: client received (method == 4)
>>> 03/11/19 07:13:39 FS: client template is /tmp/FS_XXXXXXXXX
>>> 03/11/19 07:13:39 FS: client filename is /tmp/FS_XXXU3AGXf
>>> 03/11/19 07:13:39 Will return to DC because authentication is 
>>> incomplete.
>>> 03/11/19 07:13:39 AUTHENTICATE_FS: used dir /tmp/FS_XXXU3AGXf, 
>>> status: 0
>>> 03/11/19 07:13:39 AUTHENTICATE: method -1 (FS) failed.
>>> 03/11/19 07:13:39 HANDSHAKE: in handshake(my_methods = 'FS')
>>> 03/11/19 07:13:39 AUTHENTICATE: handshake would block
>>> 03/11/19 07:13:39 Will return to DC to continue authentication..
>>> 03/11/19 07:13:39 HANDSHAKE: handshake() - i am the server
>>> 03/11/19 07:13:39 HANDSHAKE: client sent (methods == 0)
>>> 03/11/19 07:13:39 HANDSHAKE: i picked (method == 0)
>>> 03/11/19 07:13:39 HANDSHAKE: client received (method == 0)
>>> 03/11/19 07:13:39 DC_AUTHENTICATE: required authentication of
>>> 131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with
>>> any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to
>>> lstat(/tmp/FS_XXXU3AGXf)
>>> 03/11/19 07:13:39 DC_AUTHENTICATE: received DC_AUTHENTICATE from
>>> <131.154.192.41:12036>
>>> 03/11/19 07:13:39 DC_AUTHENTICATE: generating BLOWFISH key for session
>>> htc-2:13943:1552284819:2284...
>>>
>>>
>>>
>>> Thanks for your help
>>> Stefano
>>>
>
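
The error chains quoted in the logs above (entries joined by `|`, each of the form `SUBSYSTEM:code:message`) can be hard to read once the mail client wraps them. As a rough aid, here is a minimal Python sketch that splits such a chain into its entries so the innermost cause (here, the failed `lstat` of the FS temporary directory) stands out. The `CondorError` type and `parse_error_stack` helper are hypothetical names made up for illustration and are not part of HTCondor; the layout of the chain is assumed from the log lines in this thread only.

```python
from typing import List, NamedTuple


class CondorError(NamedTuple):
    """One entry of an error chain (hypothetical helper type, not an HTCondor API)."""
    subsystem: str
    code: int
    message: str


def parse_error_stack(text: str) -> List[CondorError]:
    """Split a 'SUBSYS:code:message|SUBSYS:code:message' chain into entries.

    Assumes the layout seen in the logs of this thread: entries separated
    by '|', fields separated by ':' (the message may itself contain ':').
    """
    entries = []
    for part in text.split("|"):
        subsystem, code, message = part.strip().split(":", 2)
        entries.append(CondorError(subsystem, int(code), message.strip()))
    return entries


# Chain copied from the CollectorLog excerpt, with the mail line wrap undone.
chain = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXU3AGXf)"
)

for err in parse_error_stack(chain):
    print(f"{err.subsystem:<12} {err.code}  {err.message}")
```

The last entry is the most specific one: the collector on htc-2 could not `lstat` the directory that the client on ce02-htc created under its own /tmp, which is consistent with the two hosts not sharing a filesystem.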