
Re: [Condor-users] Kerberos: forwarding tickets

On Tue, Oct 09, 2007 at 05:10:28PM -0400, Jonathan D. Proulx wrote:
> On Tue, Oct 09, 2007 at 03:51:08PM -0500, Erik Paulson wrote:
> :Alas, no. It's a goal, but it's not present yet. 
> :
> :The UW is entirely Kerberos and AFS, so having ticket forwarding would
> :be very helpful for the Condor developers, so I'm hopeful it will be 
> :implemented someday. 
> I'd love it too, but UW has been developing Condor for a long time and
> (presumably) has been an AFS shop for a long time.  Is there actually
> any reason to hope for this?

With the caveat of "I don't work for Condor, but work very closely with
Condor", yes, there's hope for this feature. We're certainly not thrilled
about using IP-based ACLs for our jobs.

> I understand some of the difficulties, for example credentials
> expiring mid job or while the job is in the queue, and I don't see a
> fix for this.  Time limited credentials are central to the security
> Kerberos provides, but this is a fundamental problem for batch queued
> systems and long running jobs.

[I think I've got this right, but I'm no krb expert, it's been a long day,
and I'm hungry. This plan could have both real flaws, and flaws that
I introduced by writing it down wrong :)]

The scheme we've got in mind for running at UWCS is part Condor features,
and part custom Kerb/AFS setup. Our Kerberos administrators are going
to create additional krb principals for each user. For example, I'm
'epaulson'; there will also be an 'epaulson/condor' principal. I will
never know the password for this principal, but I can set ACLs on my
files such that 'epaulson/condor' can read or write the subset of my AFS
files I think my Condor job will need to access. (This way, I can give
read permissions to my simulations to Condor, without giving my Condor
job access to things like my email and ssh private keys.) I can also
easily get a token for it at any time I want, but no one else can. (The
pre-generated tickets are on the local disk, protected by regular file
system permissions, so only my UID and root can read them. This is no
worse security than what kerberos already has, since root can read the
'epaulson' KRB5CCNAME just as easily as the 'epaulson/condor' KRB5CCNAME
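
The ACL part of this can be sketched with ordinary OpenAFS commands. All
the paths here are made up, and I'm assuming the usual AFS convention that
the Kerberos principal 'epaulson/condor' maps to the PTS entry
'epaulson.condor':

```shell
# An AFS admin creates a protection database entry for the new principal
# (hypothetical -- our admins may script this differently):
pts createuser epaulson.condor

# I then grant that identity access to just the files my jobs need,
# leaving things like mail and ssh keys alone ('read' = rl, 'write' = rlidwk):
fs setacl -dir /afs/cs.wisc.edu/u/e/epaulson/sim-data -acl epaulson.condor read
fs setacl -dir /afs/cs.wisc.edu/u/e/epaulson/sim-out  -acl epaulson.condor write

# Sanity check:
fs listacl /afs/cs.wisc.edu/u/e/epaulson/sim-data
```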

The ticket for 'epaulson/condor' will be long-lived, and generated
automatically by our KDC and pushed out to the appropriate machines,
probably once an hour so it's always fresh. (The KDC doesn't need my
password for this; in fact I think it's a random password each time.) We
already do this for certain system services. For example, our automated
builds get an automatic ticket so they can read the CVS repository. We
never worry about making sure someone refreshes the ticket for this job
every few days. 
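
I don't know exactly how our KDC-side push is implemented, but the
per-host refresh could look something like this cron'd sketch (the keytab
and cache paths are made up):

```shell
# Hypothetical paths; assumes the KDC has pushed a keytab for
# epaulson/condor to this submit machine.
KEYTAB=/etc/condor/keytabs/epaulson.condor.keytab
CACHE=/var/condor/creds/epaulson.condor.cc

# Get a fresh long-lived ticket from the keytab -- no user password needed.
kinit -k -t "$KEYTAB" -l 30d -c "$CACHE" epaulson/condor

# Only my UID and root should be able to read the cache.
chown epaulson "$CACHE"
chmod 600 "$CACHE"
```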

When the job is matched and started, Condor will use the latest long-lived
automatic ticket to get a new ticket for the remote machine and start
the job. Our definition of long-lived is 30 days, which is longer than
any single expected run of a job, so we can always get a new ticket
the next time we start it up. If a single instance of the job keeps
a machine claimed and running the same process for 30 days, then we'd
have problems. I don't think we've ever seen a single job stay alive
for that long, though I guess I could imagine DAGMan jobs running that
long. Thankfully, they can "checkpoint" and restart with a new ticket.
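
On the execute side, a job wrapper could pick up the delivered cache and
turn it into an AFS token before the job proper starts. Again just a
sketch with invented paths; 'pagsh' and 'aklog' are the standard OpenAFS
tools:

```shell
# Hypothetical: Condor has placed the epaulson/condor cache in the job sandbox.
export KRB5CCNAME=/var/condor/execute/dir_1234/epaulson.condor.cc

# Run the job in its own PAG so the AFS token dies with the job.
pagsh -c '
  aklog                # convert the Kerberos ticket into an AFS token
  exec ./my_simulation # hypothetical job binary
'
```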

If, heaven forbid, one of our execute machines is compromised, the only
tickets that are stolen are from whatever Condor jobs happen to wind up
there while the machine was compromised, and even then they're only good
for accessing files that were made accessible to the epaulson/condor
principal, not epaulson.

Just to be clear, I can't imagine that Condor will require this setup, and
there will be a way to simply use the TGT to get a ticket for the remote
host and run the Condor job, all with the same krb5 principal. 
(But, if you use our setup, it'll be cooler :)

> What is achievable (in my mind at least :) is having the queue daemons
> authenticated so you could easily ACL a directory for that, weak
> though it is, or even system:authuser, which is effectively what that
> permission would be, since any authenticated user could submit a
> batch job that would get that ID.

You could do that today: create a 'condor-job' principal, start the
Condor daemons on each machine so each has a 'condor-job' token, and
any job born underneath that Condor instance will have a 'condor-job'
token. As you point out, effectively every Condor job has permissions to
every other Condor job. You can still use Kerberos to authenticate the
user to the schedd, and the schedd to the startd, so you'd be able to
control who is submitting Condor jobs, but once they're in all bets are
off. (What's missing today is having the schedd authenticate as 'epaulson'
to the startd.  It currently authenticates as a Condor daemon between
the two, with no notion of the owner of the job)
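
For what it's worth, that shared-principal workaround is only a few
commands (all the names here are invented):

```shell
# On each execute machine, at daemon startup:
kinit -k -t /etc/condor/condor-job.keytab condor-job   # shared job identity
aklog                                                  # AFS token for it

# A user who wants jobs to read a directory ACLs it for that one identity --
# effectively the same as granting system:authuser, since every job runs as it:
fs setacl -dir /afs/cs.wisc.edu/u/e/epaulson/job-input -acl condor-job read
```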

> <ramble ramble>
> Anyway is any work being done in this direction, or are IP based ACLs
> "good enough" for the developers?

That's not my department :)