
Re: [HTCondor-users] Error with global Queue



Alright, I now have:

Central Manager = MASTER, COLLECTOR, NEGOTIATOR
Submit Nodes = MASTER, SCHEDD
Execute nodes = MASTER, SCHEDD, STARTD
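
For anyone checking the same thing on their own pool, here is a minimal sketch of the check tj suggested: take the value that condor_config_val reports for DAEMON_LIST on a node and flag COLLECTOR or NEGOTIATOR anywhere other than the central manager. The helper name and the role labels are made up for illustration; this is not part of HTCondor.

```python
# Illustrative only: the role labels and this helper are hypothetical,
# not HTCondor tooling.

# Daemons that, per the advice in this thread, belong only on the central manager.
CENTRAL_ONLY = {"COLLECTOR", "NEGOTIATOR"}


def unexpected_daemons(role, daemon_list):
    """Return central-manager-only daemons found on a node of another role.

    `daemon_list` is the DAEMON_LIST value as text, e.g. "MASTER, SCHEDD".
    Entries may be separated by commas and/or whitespace.
    """
    daemons = {d.upper() for d in daemon_list.replace(",", " ").split()}
    if role == "central_manager":
        return []
    return sorted(daemons & CENTRAL_ONLY)


print(unexpected_daemons("submit", "MASTER, SCHEDD"))
print(unexpected_daemons("submit", "MASTER, SCHEDD, NEGOTIATOR"))
```

An empty list means the node's DAEMON_LIST contains neither of the two central-manager daemons.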

That has fixed the error in the logs, but "condor_q -global" still presents:

>Failed to fetch ads from:
<xxx.xxx.xxx.xxx:9618?addrs=xxx.xxx.xxx.xxx-9618&noUDP&sock=3573919_bf24_3>
: host.my.domain.com
>AUTHENTICATE:1003:Failed to authenticate with any method
>AUTHENTICATE:1004:Failed to authenticate using GSI
>GSI:5003:Failed to authenticate. Globus is reporting error
(851968:50). There is probably a problem with your credentials. (Did
you run grid-proxy-init?)
>AUTHENTICATE:1004:Failed to authenticate using KERBEROS
>AUTHENTICATE:1004:Failed to authenticate using FS
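
To make it easier to compare the output above with the configured method list, here is a rough sketch that pulls the method names out of the "Failed to authenticate using ..." lines and reports which configured methods have no failure line. The line pattern is copied from the output above and the method list from SEC_CLIENT_AUTHENTICATION_METHODS in my config; the function names are hypothetical.

```python
import re

# Matches lines such as ">AUTHENTICATE:1004:Failed to authenticate using GSI".
# The pattern is taken from the error output in this thread.
_FAILED = re.compile(r"AUTHENTICATE:1004:Failed to authenticate using (\w+)")


def failed_methods(output):
    """Return the methods reported as failed, in order of appearance."""
    return _FAILED.findall(output)


def methods_without_failure_line(configured, output):
    """Return configured methods that have no failure line in the output."""
    tried = set(failed_methods(output))
    names = [m.strip() for m in configured.split(",")]
    return [m for m in names if m and m not in tried]


sample = """\
>AUTHENTICATE:1003:Failed to authenticate with any method
>AUTHENTICATE:1004:Failed to authenticate using GSI
>AUTHENTICATE:1004:Failed to authenticate using KERBEROS
>AUTHENTICATE:1004:Failed to authenticate using FS
"""

print(failed_methods(sample))
print(methods_without_failure_line("FS, PASSWORD, KERBEROS, GSI", sample))
```

On this output, PASSWORD is the only configured client method that does not show up with a failure line.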

Thanks again for your help. If you have any more ideas, I'd be very
appreciative.


--Brandon

On 1/8/18 1:16 PM, John M Knoeller wrote:
> This message:
>
>> 01/08/18 09:59:56 DaemonCore: Can't receive command request from
> xxx.xxx.xxx.105 (perhaps a timeout?)
>
> is generally not a problem. You will see it in working pools once per negotiation cycle. It happens because the negotiator hangs up after updating accounting ads, but the collector assumes that any socket opened for updates will never be closed, so it looks for a second command after the first, and we get a warning when it doesn't find one.
>
> This message:
>> 01/08/18 10:03:14 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxx from
> host xxx.xxx.xxx.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
> NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
> full reason
>
> is a problem. It indicates that host xxx.xxx.xxx.60 is running a NEGOTIATOR, and that negotiator is unable to negotiate because
> the COLLECTOR will not send it the information it needs to do so. If xxx.xxx.xxx.60 is not your central manager, then you probably just
> need to remove NEGOTIATOR from the DAEMON_LIST in the configuration of host xxx.xxx.xxx.60.
>
> Try running
>
> condor_config_val -verbose DAEMON_LIST
>
> on each of your submit nodes. The result should not contain either COLLECTOR or NEGOTIATOR.
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brandon Graves
> Sent: Monday, January 8, 2018 12:18 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Error with global Queue
>
> Both commands yield valid results without errors.
>
> In the Central Manager's CollectorLog I have:
>
>> 01/08/18 10:03:14 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxx from
> host xxx.xxx.xxx.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
> NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
> full reason
>> 01/08/18 10:03:14 DC_AUTHENTICATE: Command not authorized, done!
>> 01/08/18 10:03:20 Got QUERY_STARTD_ADS
>> 01/08/18 10:03:20 Number of Active Workers 0
>> 01/08/18 10:03:20 Got QUERY_STARTD_ADS
>> 01/08/18 10:03:20 Number of Active Workers 0
>> 01/08/18 10:03:26 Got QUERY_STARTD_PVT_ADS
>> 01/08/18 10:03:26 Number of Active Workers 0
>> 01/08/18 10:03:26 Number of Active Workers 0
>> 01/08/18 10:03:26 DaemonCore: Can't receive command request from
> xxx.xxx.xxx.105 (perhaps a timeout?)
>
> xxx.60 is one of my submit nodes, and xxx.105 is the central manager.
>
> There is also a similar entry for other nodes. I looked through for logs
> with a bit more detail and got:
>
>> 01/08/18 09:59:56 DaemonCore: Can't receive command request from
> xxx.xxx.xxx.105 (perhaps a timeout?)
>> 01/08/18 09:59:56 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxxxxxxxx
> from host xxx.xxx.xxx.52 for command 10 (QUERY_STARTD_PVT_ADS), access
> level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case
> for the full reason
>> 01/08/18 09:59:56 DC_AUTHENTICATE: Command not authorized, done!
> Thank you for any further insight you can provide!
> -Brandon
>
>
> On 1/8/18 9:46 AM, John M Knoeller wrote:
>> I think this means that condor_q is unable to fetch schedd ads from the collector.   
>>
>> Try running 
>>
>>    condor_status -schedd
>>
>> Do you get the same error?
>>
>> Does a simple
>>
>>    condor_status 
>>
>> work?
>>
>> If you look in the CollectorLog on the central manager, do you see any messages about the rejected query?
>>
>> -----Original Message-----
>> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brandon Graves
>> Sent: Monday, January 8, 2018 11:26 AM
>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>> Subject: [HTCondor-users] Error with global Queue
>>
>> Hello All,
>>
>> I recently replaced my Central Manager, and a few odd things have come
>> up. The only definite error message I can find though happens when
>> "condor_q -global" is run:
>>
>>> -- Failed to fetch ads from:
>> <xxx.xxx.xxx.49:9618?addrs=xxx.xxx.xxx.49-9618+[> : server1.my.domain.com
>>> AUTHENTICATE:1003:Failed to authenticate with any method
>>> AUTHENTICATE:1004:Failed to authenticate using GSI
>>> GSI:5003:Failed to authenticate. Globus is reporting error
>> (851968:50). There is probably a problem with your credentials. (Did
>> you run grid-proxy-init?)
>>> AUTHENTICATE:1004:Failed to authenticate using KERBEROS
>>> AUTHENTICATE:1004:Failed to authenticate using FS
>> My basic configuration is a Central Manager connected to 2 submit nodes.
>> Each submit node seems to be able to see its own queue. One of the
>> submit nodes seems to have trouble running jobs off and on, but I
>> can't find any errors that make sense. For now I'd like to
>> figure out the global queue error, as I suspect they are related.
>>
>> My config file as far as authentication goes looks like this:
>>
>>
>>> SEC_PASSWORD_FILE = /etc/condor/pool_password
>>> SEC_DAEMON_AUTHENTICATION = REQUIRED
>>> SEC_DAEMON_INTEGRITY = REQUIRED
>>> SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
>>> SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
>>> SEC_NEGOTIATOR_INTEGRITY = REQUIRED
>>> SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
>>> SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI
>> (I didn't do the initial install/configuration of HTCondor on these
>> systems; I'm just the new admin for them and still getting my footing.)
>>
>> I've looked through some of the logs, but I can't seem to find any
>> specific error messages that point me in a new direction. Any
>> tips/tricks/ideas would be appreciated.
>>
>>
>> --Brandon
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/