
Re: [HTCondor-users] StartLog: Failed to authenticate



Hi Justin,

EP is short for Execution Point. An execution point in HTCondor is a host designated to run jobs; this host runs the startd.

Yes, you could add the increased log debugging level in any configuration file like the one under your config directory (/etc/condor/config.d/xx-myconfig.config). The format for this is <daemon>_DEBUG = D_SECURITY. So, for your case it would be COLLECTOR_DEBUG on the host with the collector and STARTD_DEBUG on the host with the new startd.
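
As a rough sketch of what that could look like (the file name is only the placeholder from your question, and I am assuming each host reads its own config.d directory):

```
# /etc/condor/config.d/xx-myconfig.config on the central manager
# (the host running the collector)
COLLECTOR_DEBUG = D_SECURITY

# /etc/condor/config.d/xx-myconfig.config on the EP
# (the host running the new startd)
STARTD_DEBUG = D_SECURITY
```

Running condor_reconfig on each host afterwards should be enough for the daemons to pick up the change, and D_SECURITY:2 can be substituted if more detail is needed.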

I believe the condor_status -direct failed because it tried to look up information about the startd, and no such information exists: the startd is failing to authenticate with the collector and is therefore not sending any ads.

I don't think you need to restart at this point, but rather add the security debugging to the collector and startd to get more information about why the authentication is failing.

-Cole Bollig



From: Justin Killebrew <jk@xxxxxxx>
Sent: Wednesday, August 23, 2023 7:43 AM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] StartLog: Failed to authenticate
 
Hi Cole.  Thanks for the suggestions, but I have some more novice questions:

What does EP stand for?

How do I add D_SECURITY for the collector and the EP startd? Just add them to a config file like /etc/condor/config.d/xx-myconfig.config?

On the central manager, when I run _condor_TOOL_DEBUG="D_SECURITY" condor_status -debug -direct <hostname> I see the error:
    08/23/23 08:12:03 Can't find address for startd bench5.timehole.org
    Error: Failed to locate startd bench5.timehole.org
    Can't find address for startd bench5.timehole.org

So the central manager can’t see the execute node, bench5, but it’s in the hosts file and I can ping it from the command line.  How is condor resolving names?

Should I start over?!   Is this much configuration trouble typical for a fresh, clean install?

Thanks,
JK




On Aug 21, 2023, at 10:26 AM, Cole Bollig <cabollig@xxxxxxxx> wrote:


      External Email - Use Caution      


Hi Justin,

It seems like the EP startd is failing to authenticate to the collector when sending slot ads, which would explain why condor_status is not showing the EP: the collector has no ads for that machine. I would add D_SECURITY or D_SECURITY:2 to the debugging level for the collector on the central manager node and for the startd on this EP. You could also try running: _condor_TOOL_DEBUG="D_SECURITY" condor_status -debug -direct <hostname>

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, August 18, 2023 2:42 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Justin Killebrew <jk@xxxxxxx>
Subject: Re: [HTCondor-users] StartLog: Failed to authenticate
 
I meant to include the execute node's condor_who -daemons output:

Daemon       Alive  PID    PPID   Exit
------       -----  ---    ----   ----
Master       yes    7570   1      no
SharedPort   no     7604   no     no
Startd       yes    7605   7570   no

JK



> On Aug 18, 2023, at 3:38 PM, Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> condor_who -daemons  on the central manager (also configured as submit role) shows:
>
> Daemon       Alive  PID    PPID   Exit
> ------       -----  ---    ----   ----
> Collector    yes    1608   1494   no
> Master       yes    1494   1      no
> Negotiator   yes    1609   1494   no
> Schedd       yes    1610   1494   no
> SharedPort   yes    1607   1494   no
>
> This looks correct but on the execute machine, StartLog has several
> ERROR: AUTHENTICATE:1003:Failed to authenticate with any method
> and
> SECMAN: required authentication with collector failed
>
> The central manager CollectorLog shows similar errors:
> DC_AUTHENTICATE: required authentication of 192.168.1.5 failed
>
> The firewall isn’t active … Where else should I look?
>
> condor_status returns nothing on the central manager.  Is this because it doesn’t see any execute machines?
>
>
> Thanks,
> JK
>
>
>
>> On Aug 17, 2023, at 12:28 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
>>
>> One way to troubleshoot is to run
>>
>>  condor_who -daemons
>>
>> On the execute node.  This tool scrapes log files to determine which daemons are alive and which are not.
>>
>> If the condor_master is running, then you can use
>>
>>  condor_who -quick
>>
>> which sends a query to the condor_master about the state of the other daemons.
>>
>> -tj
>>
>> -----Original Message-----
>> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Justin Killebrew via HTCondor-users
>> Sent: Friday, August 11, 2023 3:03 PM
>> To: Todd L Miller <tlmiller@xxxxxxxxxxx>
>> Cc: Justin Killebrew <jk@xxxxxxx>; Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx>
>> Subject: Re: [HTCondor-users] condor_status returns nothing
>>
>> The StartLog showed that /var/lib/condor/execute didn’t exist.  I created it and restarted condor and now condor_status works as expected.
>>
>> Thanks!
>>
>> JK
>>
>>
>>> On Aug 11, 2023, at 3:47 PM, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:
>>>
>>>> Should there be a startd running?  How do I troubleshoot this installation?
>>>
>>>     Yes.  First thing to do is look at the MasterLog and StartLog
>>> files (which will probably be in /var/log/condor, but you can run
>>> `condor_config_val LOG` to find out for sure).  From your process tree, it
>>> looks like either the master isn't starting the startd or the startd is
>>> crashing (almost?) immediately on start-up.
>>>
>>> - ToddM
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>

