Re: [HTCondor-users] Error with global Queue



Just for the record, the problem is resolved. I added:

CONDOR_Q_ONLY_MY_JOBS = FALSE


to my config files, and it resolved the issue.


Thanks for the help on the other errors!

-Brandon


On 1/8/18 3:09 PM, Brandon Graves wrote:
Alright, I now have:

Central Manager = MASTER, COLLECTOR, NEGOTIATOR
Submit Nodes = MASTER, SCHEDD
Execute nodes = MASTER, SCHEDD, STARTD

That has fixed the error in the logs, but "condor_q -global" still presents:

Failed to fetch ads from:
<xxx.xxx.xxx.xxx:9618?addrs=xxx.xxx.xxx.xxx-9618&noUDP&sock=3573919_bf24_3>
: host.my.domain.com
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error
(851968:50). There is probably a problem with your credentials. (Did
you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

Thanks again for your help. If you have any more ideas, I'd be very
appreciative.


--Brandon

On 1/8/18 1:16 PM, John M Knoeller wrote:
This message:

01/08/18 09:59:56 DaemonCore: Can't receive command request from
xxx.xxx.xxx.105 (perhaps a timeout?)

is generally not a problem. You will see it in working pools once per negotiation cycle. It happens because the negotiator hangs up after updating accounting ads, but the collector assumes that any socket opened for updates will never be closed, so it looks for a second command after the first, and we get a warning when it doesn't find one.

This message:
01/08/18 10:03:14 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxx from
host xxx.xxx.xxx.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
full reason

is a problem. It indicates that host xxx.xxx.xxx.60 is running a NEGOTIATOR, and that negotiator is unable to negotiate because
the COLLECTOR will not send it the information it needs to do so. If xxx.xxx.xxx.60 is not your central manager, then you probably just
need to remove NEGOTIATOR from the DAEMON_LIST in the configuration of host xxx.xxx.xxx.60.
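
To find every host that the collector is rejecting this way, the CollectorLog can be scanned for these entries. The following is a minimal sketch, not an HTCondor tool: the regular expression is written against the log lines quoted in this thread only, and the function name `denied_hosts`, the sample addresses, and the user name are placeholders chosen for illustration.

```python
import re
from collections import Counter

# Pattern for the collector's "PERMISSION DENIED" entries as quoted in this thread.
# Other HTCondor versions may format the line differently (assumption).
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>\S+) from host (?P<host>\S+) "
    r"for command (?P<num>\d+) \((?P<command>\w+)\), "
    r"access level (?P<level>\w+)"
)


def denied_hosts(log_text, level="NEGOTIATOR"):
    """Count denied requests per host at the given access level."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(
        m.group("host") for m in DENIED.finditer(flat) if m.group("level") == level
    )


# Placeholder addresses and user name, not values from a real pool.
sample = """\
01/08/18 10:03:14 PERMISSION DENIED to condor_pool@example.org from
host 192.0.2.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
full reason
01/08/18 10:03:14 DC_AUTHENTICATE: Command not authorized, done!
"""

for host, count in sorted(denied_hosts(sample).items()):
    print(host, count)
```

Any host reported here that is not the central manager is a candidate for the DAEMON_LIST check.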

Try running

condor_config_val -verbose DAEMON_LIST

on each of your submit nodes. The result should not include either COLLECTOR or NEGOTIATOR.
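
With more than a couple of nodes, that check can be scripted. The sketch below assumes you have already collected the plain value printed by `condor_config_val DAEMON_LIST` (without `-verbose`) on each node, and that the value is a comma- or whitespace-separated list of daemon names. The helper name `misplaced_daemons` is hypothetical, and normalizing names to upper case is a convenience of this sketch, not documented HTCondor behavior.

```python
# Daemons that, per the advice above, belong only on the central manager.
CENTRAL_MANAGER_ONLY = {"COLLECTOR", "NEGOTIATOR"}


def misplaced_daemons(daemon_list_value):
    """Return central-manager daemons found in a DAEMON_LIST value, sorted."""
    # Accept comma- or whitespace-separated names, in any letter case.
    names = {name.upper() for name in daemon_list_value.replace(",", " ").split()}
    return sorted(names & CENTRAL_MANAGER_ONLY)


print(misplaced_daemons("MASTER, SCHEDD"))              # prints []
print(misplaced_daemons("MASTER, SCHEDD, NEGOTIATOR"))  # prints ['NEGOTIATOR']
```

An empty result for a submit or execute node means its DAEMON_LIST passes this particular check.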

-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brandon Graves
Sent: Monday, January 8, 2018 12:18 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Error with global Queue

Both commands yield valid results without errors.

In the central manager's CollectorLog I have:

01/08/18 10:03:14 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxx from
host xxx.xxx.xxx.60 for command 10 (QUERY_STARTD_PVT_ADS), access level
NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the
full reason
01/08/18 10:03:14 DC_AUTHENTICATE: Command not authorized, done!
01/08/18 10:03:20 Got QUERY_STARTD_ADS
01/08/18 10:03:20 Number of Active Workers 0
01/08/18 10:03:20 Got QUERY_STARTD_ADS
01/08/18 10:03:20 Number of Active Workers 0
01/08/18 10:03:26 Got QUERY_STARTD_PVT_ADS
01/08/18 10:03:26 Number of Active Workers 0
01/08/18 10:03:26 Number of Active Workers 0
01/08/18 10:03:26 DaemonCore: Can't receive command request from
xxx.xxx.xxx.105 (perhaps a timeout?)

xxx.60 is one of my submit nodes, and xxx.105 is the central manager.

There are similar entries for other nodes. I looked through the logs for
a bit more detail and got:

01/08/18 09:59:56 DaemonCore: Can't receive command request from
xxx.xxx.xxx.105 (perhaps a timeout?)
01/08/18 09:59:56 PERMISSION DENIED to condor_pool@xxxxxxxxxxxxxxxxxxx
from host xxx.xxx.xxx.52 for command 10 (QUERY_STARTD_PVT_ADS), access
level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case
for the full reason
01/08/18 09:59:56 DC_AUTHENTICATE: Command not authorized, done!
Thank you for any further insight you can provide!
-Brandon


On 1/8/18 9:46 AM, John M Knoeller wrote:
I think this means that condor_q is unable to fetch schedd ads from the collector.   

Try running 

   condor_status -schedd

Do you get the same error?

Does a simple

   condor_status 

work?

If you look in the CollectorLog on the central manager, do you see any messages about the rejected query?

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brandon Graves
Sent: Monday, January 8, 2018 11:26 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Error with global Queue

Hello All,

I recently replaced my Central Manager, and a few odd things have come
up. The only definite error message I can find though happens when
"condor_q -global" is run:

-- Failed to fetch ads from:
<xxx.xxx.xxx.49:9618?addrs=xxx.xxx.xxx.49-9618+[> : server1.my.domain.com
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error
(851968:50). There is probably a problem with your credentials. (Did
you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

My basic configuration is a central manager connected to 2 submit nodes.
Each submit node seems to be able to see its own queue. One of the
submit nodes seems to have trouble running jobs off and on, but I
can't find any errors that make sense. For now I'd like to
figure out the global queue error, as I suspect they are related.

My config file as far as authentication goes looks like this:


SEC_PASSWORD_FILE = /etc/condor/pool_password
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI

(I didn't do the initial install/configuration of HTCondor on these
systems; I'm just the new admin for them and still getting my footing.)

I've looked through some of the logs, but I can't seem to find any
specific error messages that point me in a new direction. Any
tips/tricks/ideas would be appreciated.


--Brandon

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
