Re: [HTCondor-users] StartLog: Failed to authenticate



I think a major problem here is that we don't have a good explanation of how to start with a minicondor installation and then expand to a multi-machine pool. If you know your goal is a multi-machine pool, you should start with the Administrative Quick Start Guide (https://htcondor.readthedocs.io/en/latest/getting-htcondor/admin-quick-start.html), which invokes the get_htcondor script with different arguments.

One wrinkle is that you want your central manager to also be a submit point, which the installation process above doesn't allow for.
We recommend that the central manager be on its own machine that users can't log into, particularly in larger pools. This limits the damage that misbehaving user processes can do to the pool.

If you want your central manager to also be a submit point, start out by installing a Central Manager role on one machine and a Submit role on another machine using the instructions above. Then copy the file /etc/condor/config.d/01-submit.config from the Submit machine to the Central Manager machine and restart HTCondor on the Central Manager machine. You can then wipe out the installation on the Submit machine if you don't need a second submit point.

 - Jaime

On Aug 25, 2023, at 8:38 AM, Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hello Cole.  

Adding TRUST_DOMAIN to the EP config file helped (there are more successful actions in the EP StartLog), but I still see errors:

I guess I can safely ignore this:
Running: /usr/bin/docker container prune -f --filter=label=org.htcondorproject=True
08/25/23 07:53:05 Failed to read results from '/usr/bin/docker container prune -f

But:
08/25/23 08:22:59 Token requested not yet approved; please ask collector bench12.timehole.org admin to approve request ID 2446816.

And:
08/25/23 08:26:01 SECMAN: FAILED: Received "DENIED" from server for user condor_pool@ using method IDTOKENS.
08/25/23 08:26:01 ERROR: SECMAN:2010:Received "DENIED" from server for user condor_pool@ using method IDTOKENS.
08/25/23 08:26:01 Collector update failed; will try to get a token request for trust domain 192.168.1.12, identity (default).
08/25/23 08:26:01 Failed to start non-blocking update to <192.168.1.12:9618>.
08/25/23 08:26:01 Trying token request to remote host bench12.timehole.org for user (default).
08/25/23 08:26:01 SECMAN: command 60047 DC_START_TOKEN_REQUEST to collector bench12.timehole.org from TCP port 46127 (blocking).
08/25/23 08:26:01 SECMAN: using session bench12:1603:1692965161:95 for {<192.168.1.12:9618?alias=bench12.timehole.org>,<60047>}.
08/25/23 08:26:01 SECMAN: resume, NOT reauthenticating.
08/25/23 08:26:01 SECMAN: Server rejected our session id
08/25/23 08:26:01 SECMAN: Invalidating negotiated session rejected by peer
08/25/23 08:26:01 DC_INVALIDATE_KEY: removed key id bench12:1603:1692965161:95.
08/25/23 08:26:01 Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.1.12:9618?alias=bench12.timehole.org>'.|SECMAN:2004:Server rejected our session id

I tried to (auto) approve the token on the central manager, bench12,  with:
$ condor_token_request_approve -reqid 2446816
Remote daemon did not provide information for request ID 2446816.

Or:
$ condor_token_request_auto_approve -lifetime 3600 -netblock 192.168.1.0/24
Failed to create new auto-approval rule: SECMAN:2010:Received "DENIED" from server for user justin@xxxxxxxxxxxxxxxxxxxx using method FS.

Or sudo:
$ sudo condor_token_request_auto_approve -lifetime 3600 -netblock 192.168.1.0/24
[sudo] password for justin: 
Failed to create new auto-approval rule: SECMAN:2010:Received "DENIED" from server for user condor_pool@ using method IDTOKENS.

On both the EP and central manager:
$ condor_config_val TRUST_DOMAIN
192.168.1.12

Is this TRUST_DOMAIN confusion caused by not using FQDNs during installation?  Which is better, names (resolved via hosts files only) or IP addresses?

Is this a normal amount of installation woes or have I steered off the rails somehow?  I will have to install again to create the "real" cluster.

Thanks very much!
JK
 





On Aug 24, 2023, at 10:03 AM, Cole Bollig <cabollig@xxxxxxxx> wrote:


      External Email - Use Caution      


Hi Justin,

The problem is the trust domain on the EP. The EP is trying to do IDToken authentication with a trust domain value set to a host name while the other side (collector) is saying that its trust domain is an IP address. Try explicitly setting TRUST_DOMAIN = <ip address> in the EP configuration.

-Cole Bollig
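
[Editor's sketch] The mismatch Cole describes shows up in the StartLog as the line "Ignoring token as it is from trust domain ... (server trust domain is ...)", quoted further down this thread. A minimal Python sketch for pulling both values out of a log follows. The function name and the regular expression are illustrative assumptions built only from that one quoted log line; they are not part of HTCondor.

```python
import re

# Pattern modeled on the StartLog line quoted in this thread:
#   "Ignoring token as it is from trust domain X (server trust domain is Y)."
MISMATCH = re.compile(
    r"Ignoring token as it is from trust domain (?P<token>\S+) "
    r"\(server trust domain is (?P<server>[^)]+)\)"
)

def find_trust_domain_mismatches(lines):
    """Return the distinct (token_domain, server_domain) pairs found in log lines."""
    seen = []
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            pair = (match.group("token"), match.group("server"))
            if pair not in seen:
                seen.append(pair)
    return seen

# Sample lines copied from the StartLog excerpt in this thread.
sample = [
    "08/23/23 09:26:36 Ignoring token as it is from trust domain "
    "bench12.timehole.org (server trust domain is 192.168.1.12).",
    "08/23/23 09:26:36 TOKEN: No token found.",
]

for token_domain, server_domain in find_trust_domain_mismatches(sample):
    print(f"token issued for {token_domain!r}, collector expects {server_domain!r}")
```

If the two values differ, as they do here (a host name versus an IP address), the token is skipped, which matches the "No token found" message that follows in the log.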

From: Justin Killebrew <jk@xxxxxxx>
Sent: Wednesday, August 23, 2023 9:06 AM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] StartLog: Failed to authenticate
 
Thanks Cole.  

I added the appropriate configs:
    STARTD_DEBUG = D_SECURITY
on the EP (192.168.1.5) and on the central manager (192.168.1.12):
    COLLECTOR_DEBUG = D_SECURITY

The EP StartLog has some errors; here's a good example:
08/23/23 09:26:36 AUTHENTICATE: setting timeout for <192.168.1.12:9618?alias=bench12.timehole.org> to 20.
08/23/23 09:26:36 HANDSHAKE: in handshake(my_methods = 'TOKEN,FS')
08/23/23 09:26:36 HANDSHAKE: handshake() - i am the client
08/23/23 09:26:36 HANDSHAKE: sending (methods == 2052) to server
08/23/23 09:26:36 HANDSHAKE: server replied (method = 2048)
08/23/23 09:26:36 IDTOKENS: Examining /etc/condor/tokens.d/condor@xxxxxxxxxxxxxxxxxxxx for valid tokens from issuer 192.168.1.12.
08/23/23 09:26:36 Ignoring token as it is from trust domain bench12.timehole.org (server trust domain is 192.168.1.12).
08/23/23 09:26:36 TOKEN: No token found.
08/23/23 09:26:36 PW: Failed to fetch a login name
08/23/23 09:26:36 Client error: NULL in send?
08/23/23 09:26:36 Server sent status indicating not OK.
08/23/23 09:26:36 PW: Client received ERROR from server, propagating
08/23/23 09:26:36 Client error: don't know my own name?
08/23/23 09:26:36 Can't send null for random string.
08/23/23 09:26:36 Client error: I have no name?
08/23/23 09:26:36 AUTHENTICATE: method 2048 (IDTOKENS) failed.

and also:

08/23/23 09:46:17 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
08/23/23 09:46:17 Collector update failed; will try to get a token request for trust domain 192.168.1.12, identity (default).
08/23/23 09:46:17 Failed to start non-blocking update to <192.168.1.12:9618>.
08/23/23 09:46:17 Trying token request to remote host bench12.timehole.org for user (default).
08/23/23 09:46:17 SECMAN: command 60047 DC_START_TOKEN_REQUEST to collector bench12.timehole.org from TCP port 45105 (blocking).
08/23/23 09:46:17 SECMAN: new session, doing initial authentication.
08/23/23 09:46:17 SECMAN: Auth methods: TOKEN,FS
08/23/23 09:46:17 AUTHENTICATE: setting timeout for <192.168.1.12:9618?alias=bench12.timehole.org> to 20.
08/23/23 09:46:17 HANDSHAKE: in handshake(my_methods = 'TOKEN,FS')
08/23/23 09:46:17 HANDSHAKE: handshake() - i am the client
08/23/23 09:46:17 HANDSHAKE: sending (methods == 2052) to server
08/23/23 09:46:17 HANDSHAKE: server replied (method = 2048)
08/23/23 09:46:17 IDTOKENS: Examining /etc/condor/tokens.d/condor@xxxxxxxxxxxxxxxxxxxx for valid tokens from issuer 192.168.1.12.
08/23/23 09:46:17 Ignoring token as it is from trust domain bench12.timehole.org (server trust domain is 192.168.1.12).
08/23/23 09:46:17 TOKEN: No token found.
08/23/23 09:46:17 PW: Failed to fetch a login name
08/23/23 09:46:17 Client error: NULL in send?
08/23/23 09:46:17 Server sent status indicating not OK.
08/23/23 09:46:17 PW: Client received ERROR from server, propagating
08/23/23 09:46:17 Client error: don't know my own name?
08/23/23 09:46:17 Can't send null for random string.
08/23/23 09:46:17 Client error: I have no name?
08/23/23 09:46:17 AUTHENTICATE: method 2048 (IDTOKENS) failed.
08/23/23 09:46:17 HANDSHAKE: in handshake(my_methods = 'FS')
08/23/23 09:46:17 HANDSHAKE: handshake() - i am the client
08/23/23 09:46:17 HANDSHAKE: sending (methods == 4) to server
08/23/23 09:46:17 HANDSHAKE: server replied (method = 4)
08/23/23 09:46:17 AUTHENTICATE_FS: used dir /tmp/FS_XXX4FXmFJ, status: 0
08/23/23 09:46:17 AUTHENTICATE: method 4 (FS) failed.
08/23/23 09:46:17 HANDSHAKE: in handshake(my_methods = '')
08/23/23 09:46:17 HANDSHAKE: handshake() - i am the client
08/23/23 09:46:17 HANDSHAKE: sending (methods == 0) to server
08/23/23 09:46:17 HANDSHAKE: server replied (method = 0)
08/23/23 09:46:17 SECMAN: required authentication with collector bench12.timehole.org failed, so aborting command DC_START_TOKEN_REQUEST.
08/23/23 09:46:17 Failed to request a new token: DAEMON:1:failed to start command for token request with remote daemon at '<192.168.1.12:9618?alias=bench12.timehole.org>'.|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authe
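
[Editor's sketch] In the excerpt above, the "methods == N" values appear to be bitmasks: the log itself labels 2048 as IDTOKENS and 4 as FS, and 2052 = 2048 + 4 matches my_methods = 'TOKEN,FS'. A small Python sketch of that reading follows. Only the two bit values shown in these logs are included, and the helper name is hypothetical.

```python
# Bit values taken from the log lines in this thread:
#   "method 2048 (IDTOKENS)" and "method 4 (FS)".
# Other authentication methods have their own bits, which are not listed here.
KNOWN_METHOD_BITS = {4: "FS", 2048: "IDTOKENS"}

def decode_methods(mask):
    """Split a 'methods == N' value into known method names and leftover bits."""
    names = [name for bit, name in sorted(KNOWN_METHOD_BITS.items()) if mask & bit]
    leftover = mask
    for bit in KNOWN_METHOD_BITS:
        leftover &= ~bit
    return names, leftover

# The three values seen in the handshake lines above.
for mask in (2052, 4, 0):
    print(mask, decode_methods(mask))
```

Read this way, the handshake first offers both methods (2052), retries with FS only (4) after IDTOKENS fails, and ends with nothing left to try (0).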


The central manager CollectorLog shows authentication errors:
08/23/23 09:31:17 DC_AUTHENTICATE: required authentication of 192.168.1.5 failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXF5IHjw)|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
08/23/23 09:31:17 DC_AUTHENTICATE: received DC_AUTHENTICATE from <192.168.1.5:39869>
08/23/23 09:31:17 SECMAN: new session, doing initial authentication.
08/23/23 09:31:17 Returning to DC while we wait for socket to authenticate.
08/23/23 09:31:17 AUTHENTICATE: setting timeout for (unknown) to 20.
08/23/23 09:31:17 HANDSHAKE: in handshake(my_methods = 'TOKEN,FS')
08/23/23 09:31:17 HANDSHAKE: handshake() - i am the server
08/23/23 09:31:17 HANDSHAKE: client sent (methods == 2052)
08/23/23 09:31:17 HANDSHAKE: i picked (method == 2048)
08/23/23 09:31:17 HANDSHAKE: client received (method == 2048)
08/23/23 09:31:17 Will return to DC because authentication is incomplete.
08/23/23 09:31:17 PW: Server received ERROR from client, propagating
08/23/23 09:31:17 AUTHENTICATE: auth would still block
08/23/23 09:31:17 Will return to DC to continue authentication..
08/23/23 09:31:17 Error from client.
08/23/23 09:31:17 AUTHENTICATE: method 2048 (IDTOKENS) failed.

Is this sufficient debug level? 

Thanks for the help!

JK



On Aug 23, 2023, at 9:20 AM, Cole Bollig <cabollig@xxxxxxxx> wrote:


Hi Justin,

EP is short for Execution Point. An execution point in HTCondor is a host designated to run jobs. This host runs the startd.

Yes, you could add the increased log debugging level in any configuration file like the one under your config directory (/etc/condor/config.d/xx-myconfig.config). The format for this is <daemon>_DEBUG = D_SECURITY. So, for your case it would be COLLECTOR_DEBUG on the host with the collector and STARTD_DEBUG on the host with the new startd.

I believe that the condor_status -direct failed because it tried to get information about the startd, and that doesn't exist since the startd is failing to authenticate with the collector and subsequently not sending any ads.

I don't think you need to restart at this point, but rather add the security debugging to the collector and startd to get more information about why the authentication is failing.

-Cole Bollig
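
[Editor's sketch] The <daemon>_DEBUG = D_SECURITY format described above can be illustrated with a few lines of Python that render the config lines for a given set of daemons. The helper name is hypothetical; the output is the text that would go into a file such as /etc/condor/config.d/xx-myconfig.config on each host.

```python
def debug_config_lines(daemons, level="D_SECURITY"):
    """Render one '<DAEMON>_DEBUG = <level>' line per daemon name."""
    return [f"{daemon.upper()}_DEBUG = {level}" for daemon in daemons]

# Central manager: the collector. Execution point: the startd.
print("\n".join(debug_config_lines(["collector"])))
print("\n".join(debug_config_lines(["startd"])))
```

This prints COLLECTOR_DEBUG = D_SECURITY and STARTD_DEBUG = D_SECURITY, the two settings named in the message above.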



 
From: Justin Killebrew <jk@xxxxxxx>
Sent: Wednesday, August 23, 2023 7:43 AM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] StartLog: Failed to authenticate
 
Hi Cole.  Thanks for the suggestions, but I have more novice questions:

What does EP stand for?

How do I add D_SECURITY for the collector and the EP startd? Just add them to a config file like /etc/condor/config.d/xx-myconfig.config?

On the central manager, when I run _condor_TOOL_DEBUG="D_SECURITY" condor_status -debug -direct <hostname> I see the error:
    08/23/23 08:12:03 Can't find address for startd bench5.timehole.org
    Error: Failed to locate startd bench5.timehole.org
    Can't find address for startd bench5.timehole.org

So the central manager can't see the execute node, bench5, but it's in the hosts file and I can ping it from the command line.  How is condor resolving names?

Should I start over?!   Is this much configuration trouble typical for a fresh, clean install?

Thanks,
JK




On Aug 21, 2023, at 10:26 AM, Cole Bollig <cabollig@xxxxxxxx> wrote:


Hi Justin,

It seems like the EP startd is failing to authenticate to the collector when sending slot ads, which would explain why condor_status is not showing the EP: the collector has no ads for that machine. I would add D_SECURITY or D_SECURITY:2 to the debugging level for the collector on the central manager node and the startd of this EP. You could also try running: _condor_TOOL_DEBUG="D_SECURITY" condor_status -debug -direct <hostname>

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, August 18, 2023 2:42 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Justin Killebrew <jk@xxxxxxx>
Subject: Re: [HTCondor-users] StartLog: Failed to authenticate
 
I meant to include the condor_who -daemons output from the execute node:

Daemon       Alive  PID    PPID   Exit
------       -----  ---    ----   ----
Master       yes    7570   1      no
SharedPort   no     7604   no     no
Startd       yes    7605   7570   no

JK



> On Aug 18, 2023, at 3:38 PM, Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
>
> condor_who -daemons  on the central manager (also configured as submit role) shows:
>
> Daemon       Alive  PID    PPID   Exit
> ------       -----  ---    ----   ----
> Collector    yes    1608   1494   no
> Master       yes    1494   1      no
> Negotiator   yes    1609   1494   no
> Schedd       yes    1610   1494   no
> SharedPort   yes    1607   1494   no
>
> This looks correct but on the execute machine, StartLog has several
> ERROR: AUTHENTICATE:1003:Failed to authenticate with any method
> and
> SECMAN: required authentication with collector failed
>
> The central manager CollectorLog shows similar errors:
> DC_AUTHENTICATE: required authentication of 192.168.1.5 failed
>
> The firewall isn't active. Where else should I look?
>
> condor_status returns nothing on the central manager.  Is this because it doesn't see any execute machines?
>
>
> Thanks,
> JK
>
>
>
>> On Aug 17, 2023, at 12:28 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
>>
>>
>> One way to troubleshoot is to run
>>
>>  condor_who -daemons
>>
>> On the execute node.  This tool scrapes log files to determine which daemons are alive and which are not.
>>
>> If the condor_master is running, then you can use
>>
>>  condor_who -quick
>>
>> which sends a query to the condor_master about the state of the other daemons.
>>
>> -tj
>>
>> -----Original Message-----
>> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Justin Killebrew via HTCondor-users
>> Sent: Friday, August 11, 2023 3:03 PM
>> To: Todd L Miller <tlmiller@xxxxxxxxxxx>
>> Cc: Justin Killebrew <jk@xxxxxxx>; Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx>
>> Subject: Re: [HTCondor-users] condor_status returns nothing
>>
>> The StartLog showed that /var/lib/condor/execute didn't exist.  I created it and restarted condor and now condor_status works as expected.
>>
>> Thanks!
>>
>> JK
>>
>>
>>> On Aug 11, 2023, at 3:47 PM, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:
>>>
>>>
>>>> Should there be a startd running?  How do I troubleshoot this installation?
>>>
>>>     Yes.  First thing to do is look at the MasterLog and StartLog
>>> files (which will probably be in /var/log/condor, but you can run
>>> `condor_config_val LOG` to find out for sure).  From your process tree, it
>>> looks like either the master isn't starting the startd or the startd is
>>> crashing (almost?) immediately on start-up.
>>>
>>> - ToddM
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>