
[Condor-users] 6.7.18 problem: Kerberos authentication issues post-upgrade



Hi,

I have just upgraded my local Condor pool to 6.7.18 (from 6.7.16) and I'm running into what look like some Kerberos authentication issues.

Scenario:
========
Every machine uses the same global configuration file:
http://www.doc.ic.ac.uk/condor/doc-config/condor_config.global
(Locally retrieved from an NFS volume.)

Note the strong-authentication section at the tail of the file: all Condor daemons are required to authenticate using the local host keytab stored in /etc/krb5.keytab, and all WRITE operations must be authenticated with Kerberos credentials.

Two machines of note:
skimmer.doc.ic.ac.uk acts as Condor master.
lightyear.doc.ic.ac.uk acts as a submit-only node.

Both machines are running a distribution derived from Mandrake 10.2 on a locally-built 2.6.13 kernel; the local Kerberos packages are derived from MIT Kerberos 1.4.2:

# rpm -qa|grep krb
libkrb53-devel-1.4.2-0.1.102mdk
libkrbafs0-1.2.2-4mdk
libkrb53-1.4.2-0.1.102mdk
krb5-workstation-1.4.2-0.1.102mdk
libkrbafs0-devel-1.2.2-4mdk
ftp-client-krb5-1.4.2-0.1.102mdk
pam_krb5-2.1.8-1doc
telnet-client-krb5-1.4.2-0.1.102mdk

Failure case:
=============
User 'mwj' tries to submit a set of Condor jobs to the local schedd on lightyear. This succeeds, as the user holds a local Kerberos TGT.

The jobs, however, never start. Indeed, when running `condor_q -global` they do not appear at all, whereas they _are_ listed when queried using `condor_q` on lightyear itself. This suggests a communications issue of some kind.

The MasterLog on lightyear shows the following errors:

==> MasterLog <==
3/29 12:57:19 AUTHENTICATE: no available authentication methods succeeded, failing!
3/29 12:57:19 DC_AUTHENTICATE: authenticate failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS
3/29 12:57:23 AUTH_ERROR: Internal credentials cache error
3/29 12:57:23 AUTHENTICATE: no available authentication methods succeeded, failing!
3/29 12:57:23 ERROR: SECMAN:2004:Failed to start a session with TCP|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS
3/29 12:58:23 getpeername failed so connect must have failed
3/29 12:58:43 Connect failed for 20 seconds; returning FALSE
3/29 12:58:43 ERROR: SECMAN:2003:TCP connection to <146.169.1.113:9618> failed

3/29 12:59:43 getpeername failed so connect must have failed
3/29 13:00:03 Connect failed for 20 seconds; returning FALSE
3/29 13:00:03 ERROR: SECMAN:2003:TCP connection to <146.169.1.113:9618> failed

The "Internal credentials cache error" appears to be the significant issue here; it looks like the Master daemon on Lightyear is unable to mutually-authenticate with the daemons on Skimmer as a result of this cache problem, resulting in the observed communications breakdown.
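The chained error strings in the log excerpt above (entries joined with `|`, each apparently of the form SUBSYSTEM:code:message) can be pulled apart for easier reading. The following is only a minimal sketch in Python: the function name `parse_error_stack` is hypothetical and not part of Condor, and the entry format is an assumption inferred from these log lines alone.

```python
import re

# Assumed entry shape, inferred from the log lines above:
#   SUBSYSTEM:numeric_code:message, with entries separated by '|'
ENTRY_RE = re.compile(r"([A-Z_]+):(\d+):([^|]*)")

def parse_error_stack(line):
    """Return a list of (subsystem, code, message) tuples found in a log line."""
    return [
        (subsystem, int(code), message.strip())
        for subsystem, code, message in ENTRY_RE.findall(line)
    ]

if __name__ == "__main__":
    sample = (
        "3/29 12:57:23 ERROR: SECMAN:2004:Failed to start a session with TCP"
        "|AUTHENTICATE:1003:Failed to authenticate with any method"
        "|AUTHENTICATE:1004:Failed to authenticate using KERBEROS"
    )
    # Print one entry per line, outermost failure first
    for subsystem, code, message in parse_error_stack(sample):
        print(f"{subsystem:<14}{code:<6}{message}")
```

Read this way, the outermost entry (SECMAN:2004) is the symptom and the last entry (AUTHENTICATE:1004) points at the Kerberos method itself.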

After reconfiguring the logging to add D_SECURITY, the following fuller output appears on lightyear:

==> MasterLog <==
3/29 16:45:40 STARTCOMMAND: starting 2 to <146.169.1.113:9618> on UDP port 47686.
3/29 16:45:40 SECMAN: command 2 to <146.169.1.113:9618> on UDP port 47686.
3/29 16:45:40 SECMAN: command 60010 to <146.169.1.113:9618> on TCP port 43363.
3/29 16:45:40 SECMAN: new session, doing initial authentication.
3/29 16:45:40 SECMAN: Auth methods: KERBEROS
3/29 16:45:40 HANDSHAKE: in handshake(my_methods = 'KERBEROS')
3/29 16:45:40 HANDSHAKE: handshake() - i am the client
3/29 16:45:40 HANDSHAKE: sending (methods == 64) to server
3/29 16:45:40 HANDSHAKE: server replied (method = 64)
3/29 16:45:40 KERBEROS: krb5_unparse_name: host/skimmer.doc.ic.ac.uk@xxxxxxxxxxxx
3/29 16:45:40 KERBEROS: no user yet determined, will grab up to slash
3/29 16:45:40 KERBEROS: picked user: host
3/29 16:45:40 KERBEROS: remapping 'host' to 'condor'
3/29 16:45:40 unable to open map file (null), errno 14
3/29 16:45:40 Client is condor@(null)
3/29 16:45:40 KERBEROS: Server principal is host/skimmer.doc.ic.ac.uk@xxxxxxxxxxxx
3/29 16:45:40 init_daemon: client principal is 'host/lightyear.doc.ic.ac.uk@xxxxxxxxxxxx'
3/29 16:45:40 init_daemon: Using default keytab FILE:/etc/krb5.keytab
3/29 16:45:40 AUTH_ERROR: Internal credentials cache error
3/29 16:45:40 AUTHENTICATE: method 64 (KERBEROS) failed.
3/29 16:45:40 HANDSHAKE: in handshake(my_methods = '')
3/29 16:45:40 HANDSHAKE: handshake() - i am the client
3/29 16:45:40 HANDSHAKE: sending (methods == 0) to server
3/29 16:45:40 HANDSHAKE: server replied (method = 0)
3/29 16:45:40 AUTHENTICATE: no available authentication methods succeeded, failing!
3/29 16:45:40 SECMAN: unable to start session via TCP, failing.
3/29 16:45:40 ERROR: SECMAN:2004:Failed to start a session with TCP|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS

It looks like it either cannot determine its local identity properly (note the "Client is condor@(null)" entry) or it is unable to process the local /etc/krb5.keytab file properly -- perhaps it is attempting to do so as the local 'condor' user, and not as root?
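One way to test the second hypothesis is to check whether a given account can actually open the keytab. The following Python sketch is an illustration only: the helper name `describe_access` is hypothetical, and it says nothing about which user the Condor daemons really run as when they read the file.

```python
import os
import stat

def describe_access(path):
    """Report ownership, mode, and readability of `path` for the calling user."""
    # os.path.exists also returns False if the path cannot be stat'ed
    info = {"path": path, "euid": os.geteuid(), "exists": os.path.exists(path)}
    if not info["exists"]:
        return info
    st = os.stat(path)
    info["owner_uid"] = st.st_uid
    info["mode"] = stat.filemode(st.st_mode)
    # Actually opening the file tests access with the effective uid
    try:
        with open(path, "rb"):
            info["readable"] = True
    except OSError:
        info["readable"] = False
    return info

if __name__ == "__main__":
    # /etc/krb5.keytab is the default keytab named in the log above
    for key, value in describe_access("/etc/krb5.keytab").items():
        print(f"{key}: {value}")
```

Running it once as root and once as the 'condor' user (for example via `sudo -u condor`, assuming sudo is available) would show whether readability differs between the two accounts.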

Any assistance with this issue would be greatly appreciated.

Cheers,
David
--
David McBride <dwm@xxxxxxxxxxxx>
Department of Computing, Imperial College, London