
[HTCondor-users] Kerberos AS-REPs for Daemon communication not cached



Dear HTCondor experts,

we have observed bursts of AS-REQs (Kerberos Authentication Service Requests), at rates of up to several hundred requests per second, issued by the central manager node (running negotiator and collector)
when a lot of jobs are started and daemons (using Kerberos auth) need to talk to each other.

I can also reproduce this more easily by running "condor_q -all -global" on our condor-cm (central manager) as the "root" user, who does not have Kerberos credentials of its own
but can access the host principal (and hence use the service credentials to authenticate). A snippet from the condor_q debug log confirms my observation:

05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) KERBEROS: Server principal is host/schedd1.domain@REALM
05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) init_daemon: client principal is 'host/condor-cm1.domain@REALM'
05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) init_daemon: Using default keytab FILE:/etc/krb5.keytab
05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) init_daemon: Trying to get tgt credential for service host/schedd1@REALM
05/23/19 01:48:15 (fd:4) (pid:2411) (D_PRIV) PRIV_UNKNOWN --> PRIV_ROOT at /slots/10/dir_2560730/userdir/.tmpV7H12D/BUILD/condor-8.8.2/src/condor_io/condor_auth_kerberos.cpp:632
05/23/19 01:48:15 (fd:4) (pid:2411) (D_PRIV) PRIV_ROOT --> PRIV_UNKNOWN at /slots/10/dir_2560730/userdir/.tmpV7H12D/BUILD/condor-8.8.2/src/condor_io/condor_auth_kerberos.cpp:634
05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) init_daemon: gic_kt creds_->client is 'host/condor-cm1.domain@REALM'
05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) init_daemon: gic_kt creds_->server is 'host/schedd1.domain@REALM'
05/23/19 01:48:15 (fd:4) (pid:2411) (D_SECURITY) Success..........................
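To put numbers on this, the fetches can be counted straight from such a debug log. The following is only a rough sketch: it assumes that every "Trying to get tgt credential" line corresponds to one AS-REQ, which matches what we see but which I have not verified against the code, and the function name count_tgt_fetches is made up for illustration.

```python
import re
from collections import Counter

# Timestamp at the start of the line, then the message HTCondor logs just
# before the keytab-based credential fetch (format taken from the excerpt above).
TGT_FETCH = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r".*Trying to get tgt credential for service (\S+)"
)


def count_tgt_fetches(lines):
    """Count credential fetches per second and per service principal.

    Assumption: one matching log line equals one AS-REQ sent to the KDC.
    """
    per_second = Counter()
    per_service = Counter()
    for line in lines:
        match = TGT_FETCH.match(line)
        if match:
            per_second[match.group(1)] += 1
            per_service[match.group(2)] += 1
    return per_second, per_service
```

Passing it the lines of a D_SECURITY debug log and looking at per_second.most_common(5) shows the busiest seconds; per_service shows which daemons the fetches were for.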

It seems that in daemon authentication, a fresh credential is fetched for every single daemon-to-daemon interaction. We noticed this because the KDC of our computing centre was effectively DoSed by the load
and the service failed (twice so far).
Fetching a credential means, in "Kerberos speak", issuing an AS-REQ and having the KDC generate an AS-REP. This is computationally quite expensive on the KDC end.

Our computing centre is trying to improve the situation on their end so that the KDC withstands this load better, but it is still best practice in Kerberos to cache AS-REPs.

Could caching be added? 
Sadly, I do not have a straightforward suggestion as to what the implementation is missing to achieve that. For user credentials, the Kerberos library takes care of it automatically
(by using credential caches in files or the persistent kernel keyring), but that does not seem to happen for host / service credentials with HTCondor. Maybe HTCondor purges them after use?
I did not find that explicitly in the code, though.
However, issuing:
kinit -k host/condor-cm1.domain@REALM
successfully adds a TGT to the credential cache (in our case, the persistent kernel keyring), as I would expect. But that does not happen with HTCondor.
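To make the request a bit more concrete, the behaviour I have in mind is roughly the following. This is a toy model only, not HTCondor or Kerberos library code: the class CredentialCache and the fetch callable (which stands in for the AS-REQ / AS-REP round trip) are hypothetical, and a real implementation would presumably use the credential cache of the Kerberos library instead of a dictionary.

```python
import time


class CredentialCache:
    """Reuse a credential per (client, service) pair until it nears expiry."""

    def __init__(self, fetch, min_remaining=300, clock=time.time):
        # fetch(client, service) -> (credential, expiry_epoch_seconds).
        # Hypothetical stand-in for the expensive AS-REQ to the KDC.
        self._fetch = fetch
        # Refetch once fewer than this many seconds of lifetime remain.
        self._min_remaining = min_remaining
        self._clock = clock
        self._entries = {}

    def get(self, client, service):
        key = (client, service)
        entry = self._entries.get(key)
        if entry is not None:
            credential, expires_at = entry
            if expires_at - self._clock() > self._min_remaining:
                return credential  # cache hit: no KDC round trip
        # Cache miss or credential close to expiry: ask the KDC again.
        credential, expires_at = self._fetch(client, service)
        self._entries[key] = (credential, expires_at)
        return credential
```

With something like this, repeated connections from host/condor-cm1.domain@REALM to the same schedd within the credential lifetime would cost the KDC one request instead of one per connection.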

Cheers,
	Oliver
