
Re: [HTCondor-users] Error with global Queue



This message:

>01/08/18 09:59:56 DaemonCore: Can't receive command request from
xxx.xxx.xxx.105 (perhaps a timeout?)

Is generally not a problem.  You will see it in working pools once per negotiation cycle.  It happens because the negotiator hangs up after updating accounting ads, while the collector assumes that any socket opened for updates will never be closed.  The collector therefore looks for a second command after the first, and logs a warning when it doesn't find one.

This message:
>01/08/18 10:03:14 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxx from
host xxx.xxx.xxx.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
full reason

Is a problem.  It indicates that host xxx.xxx.xxx.60 is running a NEGOTIATOR, and that negotiator is unable to negotiate because
the COLLECTOR will not send it the information it needs to do so.  If xxx.xxx.xxx.60 is not your central manager, then you probably just
need to remove NEGOTIATOR from the DAEMON_LIST in the configuration of host xxx.xxx.xxx.60.
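
To see at a glance which hosts the collector is refusing, the PERMISSION DENIED entries can be pulled out of the CollectorLog. The following is only a minimal Python sketch. It assumes each entry sits on a single line in the real log file (the lines quoted in this thread were wrapped by the mail client). denied_hosts is a hypothetical helper, not part of HTCondor, and the address and user name in the sample are made-up placeholders.

```python
import re

# Single-line form of the collector's PERMISSION DENIED message, as quoted in this thread.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>\S+) from host (?P<host>\S+) "
    r"for command (?P<num>\d+) \((?P<name>\w+)\), access level (?P<level>\w+)"
)


def denied_hosts(log_lines):
    """Map each refused host to the set of (command name, access level) pairs seen."""
    hosts = {}
    for line in log_lines:
        match = DENIED.search(line)
        if match:
            pair = (match.group("name"), match.group("level"))
            hosts.setdefault(match.group("host"), set()).add(pair)
    return hosts


# Placeholder log lines; the host and user are invented for illustration.
sample = [
    "01/08/18 10:03:14 PERMISSION DENIED to condor_pool@example.invalid "
    "from host 192.0.2.60 for command 10 (QUERY_STARTD_PVT_ADS), access level "
    "NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason",
    "01/08/18 10:03:20 Got QUERY_STARTD_ADS",
]

result = denied_hosts(sample)
```

Any host that shows up here with access level NEGOTIATOR, and is not the central manager, is a candidate for the DAEMON_LIST check.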

Try running

condor_config_val -verbose DAEMON_LIST

on each of your submit nodes.  The result should not include either COLLECTOR or NEGOTIATOR.
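
If there are many nodes to check, the same test can be scripted. This is a rough Python sketch, not an official tool. It assumes condor_config_val is on the PATH and that running it without -verbose prints only the value of DAEMON_LIST. The function names are hypothetical, and upper-casing the daemon names is a convenience assumption of this sketch.

```python
import re
import subprocess

# Daemons that should normally run only on the central manager.
CENTRAL_MANAGER_DAEMONS = {"COLLECTOR", "NEGOTIATOR"}


def unexpected_daemons(daemon_list):
    """Return the central-manager daemons found in a DAEMON_LIST value, sorted."""
    # DAEMON_LIST entries may be separated by commas and/or whitespace.
    names = {n.upper() for n in re.split(r"[,\s]+", daemon_list.strip()) if n}
    return sorted(names & CENTRAL_MANAGER_DAEMONS)


def read_daemon_list():
    """Ask the local HTCondor configuration for DAEMON_LIST (run this on the node itself)."""
    done = subprocess.run(
        ["condor_config_val", "DAEMON_LIST"],
        capture_output=True, text=True, check=True,
    )
    return done.stdout


# Example with a hard-coded value, as a misconfigured submit node might report it.
found = unexpected_daemons("MASTER, SCHEDD, NEGOTIATOR")
```

On a submit node, unexpected_daemons(read_daemon_list()) should come back empty; anything else names the daemon to remove from DAEMON_LIST.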

-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brandon Graves
Sent: Monday, January 8, 2018 12:18 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Error with global Queue

Both commands yield valid results without errors.

In the Central Manager's CollectorLog I have:

>01/08/18 10:03:14 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxx from
host xxx.xxx.xxx.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
full reason
>01/08/18 10:03:14 DC_AUTHENTICATE: Command not authorized, done!
>01/08/18 10:03:20 Got QUERY_STARTD_ADS
>01/08/18 10:03:20 Number of Active Workers 0
>01/08/18 10:03:20 Got QUERY_STARTD_ADS
>01/08/18 10:03:20 Number of Active Workers 0
>01/08/18 10:03:26 Got QUERY_STARTD_PVT_ADS
>01/08/18 10:03:26 Number of Active Workers 0
>01/08/18 10:03:26 Number of Active Workers 0
>01/08/18 10:03:26 DaemonCore: Can't receive command request from
xxx.xxx.xxx.105 (perhaps a timeout?)

xxx.60 is one of my submit nodes, and xxx.105 is the central manager.

There are similar entries for other nodes. I looked through the logs
for a bit more detail and found:

>01/08/18 09:59:56 DaemonCore: Can't receive command request from
xxx.xxx.xxx.105 (perhaps a timeout?)
>01/08/18 09:59:56 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxxxxxxxx
from host xxx.xxx.xxx.52 for command 10 (QUERY_STARTD_PVT_ADS), access
level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case
for the full reason
>01/08/18 09:59:56 DC_AUTHENTICATE: Command not authorized, done!

Thank you for any further insight you can provide!
-Brandon


On 1/8/18 9:46 AM, John M Knoeller wrote:
> I think this means that condor_q is unable to fetch schedd ads from the collector.   
>
> Try running 
>
>    condor_status -schedd
>
> Do you get the same error?
>
> Does a simple
>
>    condor_status 
>
> work?
>
> If you look in the CollectorLog on the central manager, do you see any messages about the rejected query?
>
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brandon Graves
> Sent: Monday, January 8, 2018 11:26 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Error with global Queue
>
> Hello All,
>
> I recently replaced my Central Manager, and a few odd things have come
> up. The only definite error message I can find though happens when
> "condor_q -global" is run:
>
>> -- Failed to fetch ads from: <xxx.xxx.xxx.49:9618?addrs=xxx.xxx.xxx.49-9618+[> : server1.my.domain.com
>> AUTHENTICATE:1003:Failed to authenticate with any method
>> AUTHENTICATE:1004:Failed to authenticate using GSI
>> GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
>> AUTHENTICATE:1004:Failed to authenticate using KERBEROS
>> AUTHENTICATE:1004:Failed to authenticate using FS
>
> My basic configuration is a Central Manager connected to 2 submit nodes.
> Each submit node seems to be able to see its own queue. One of the
> submit nodes intermittently has trouble running jobs, but I
> can't find any errors that make sense. For now I'd like to
> figure out the global queue error, as I suspect they are related.
>
> My config file as far as authentication goes looks like this:
>
>
>> SEC_PASSWORD_FILE = /etc/condor/pool_password
>> SEC_DAEMON_AUTHENTICATION = REQUIRED
>> SEC_DAEMON_INTEGRITY = REQUIRED
>> SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
>> SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
>> SEC_NEGOTIATOR_INTEGRITY = REQUIRED
>> SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
>> SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI
> (I didn't do the initial install/configuration of HTCondor on these
> systems; I'm just the new admin for them, and still getting my footing.)
>
> I've looked through some of the logs, but I can't find any
> specific error messages that point me in a new direction. Any
> tips/tricks/ideas would be appreciated.
>
>
> --Brandon
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/