
Re: [Condor-users] Kerberos: forwarding tickets

At Fermilab, which is also a huge Kerberos installation and
was one of the users that first requested the addition of Kerberos
authentication to condor, we came up with something similar.  All
the condor jobs (and our previous batch systems before them) ran
with a restricted credential, in our case username/cdf/headnode.
Since condor didn't and still doesn't generate them automatically,
we set up our own daemon to generate them for all the users in the
queue and transfer them out to the running jobs as a condor_transfer_file.

It's possible to store such credentials encrypted in keytab files
and just do the Kerberos kinit when the job starts running.  In our
setup that's good for a week.  Anyone who needs longer than that
should be using the standard universe and checkpointing, because the
half-life of jobs in a Condor pool means jobs longer than one week
are likely to crash anyway.
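
A minimal sketch of the keytab approach described above, in Python. It only
builds the kinit invocation and the environment for the job; it is not the
Fermilab daemon. The principal, realm, keytab path, and cache path are
hypothetical placeholders, and the one-week lifetime assumes the KDC policy
allows tickets that long.

```python
import os

def build_kinit_command(principal, keytab_path, cache_path, lifetime="7d"):
    """Return (argv, env) for a keytab-based kinit; nothing is executed here.

    -k -t <keytab>  obtain the ticket from a keytab instead of a password
    -l <lifetime>   requested ticket lifetime (the KDC may cap it)
    -c <cache>      credential cache to write the ticket into
    """
    argv = ["kinit", "-k", "-t", keytab_path,
            "-l", lifetime, "-c", cache_path, principal]
    # The job finds the ticket through KRB5CCNAME.
    env = dict(os.environ, KRB5CCNAME=cache_path)
    return argv, env

# Hypothetical values, for illustration only.
argv, env = build_kinit_command(
    "someuser/cdf/headnode@EXAMPLE.ORG",
    "/var/adm/krb5/someuser.keytab",
    "FILE:/tmp/krb5cc_someuser_job",
)
```

A job wrapper could then call subprocess.run(argv, env=env, check=True)
before starting the user's executable with the same environment.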

What we would have really needed, if we intended to stay with a
kerberos-authenticated condor pool, is condor schedd authenticating
the user timm@xxxxxxxx at submit time, and then signaling the startd
to make timm/cdf/hostname@xxxxxxxx credential at start time.  We're still
interested to see this feature, but right now most of our Condor pools
are transitioning to GSI (Globus/grid) authentication.

Steve Timm
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.

On Tue, 9 Oct 2007, Erik Paulson wrote:

On Tue, Oct 09, 2007 at 05:10:28PM -0400, Jonathan D. Proulx wrote:
On Tue, Oct 09, 2007 at 03:51:08PM -0500, Erik Paulson wrote:

:Alas, no. It's a goal, but it's not present yet.
:The UW is entirely Kerberos and AFS, so having ticket forwarding would
:be very helpful for the Condor developers, so I'm hopeful it will be
:implemented someday.

I'd love it too, but UW has been developing Condor for a long time and
(presumably) has been an AFS shop for a long time.  Is there actually
any reason to hope for this?

With the caveat of "I don't work for Condor, but work very closely with
Condor", yes, there's hope for this feature. We're certainly not thrilled
about using IP-based ACLs for our jobs.

I understand some of the difficulties, for example credentials
expiring mid job or while the job is in the queue, and I don't see a
fix for this.  Time limited credentials are central to the security
Kerberos provides, but this is a fundamental problem for batch queued
systems and long running jobs.

[I think I've got this right, but I'm no krb expert, it's been a long day,
and I'm hungry. This plan could have both real flaws, and flaws that
I introduced by writing it down wrong :)]

The scheme we've got in mind for running at UWCS is part Condor features,
and part custom Kerb/AFS setup. Our Kerberos administrators are going
to create additional krb principals for each user. For example, I'm
'epaulson', there will also be an 'epaulson/condor' principal. I will
never know the password for this principal, but I can set ACLs on my
files such that 'epaulson/condor' can read or write the subset of my AFS
files I think my Condor job will need to access. (This way, I can give
read permissions to my simulations to Condor, without giving my Condor
job access to things like my email and ssh private keys.) I can also
easily get a token for it at any time I want, but no one else can. (The
pre-generated tickets are on the local disk, protected by regular file
system permissions, so only my UID and root can read them. This is no
worse security than what kerberos already has, since root can read the
'epaulson' KRB5CCNAME just as easily as the 'epaulson/condor' KRB5CCNAME.)
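
As a rough illustration of the per-directory ACL idea above, the following
Python sketch builds an AFS "fs setacl" command for a restricted identity.
The directory path is hypothetical, and the AFS-side name 'epaulson.condor'
is an assumption about how the 'epaulson/condor' principal would appear in
the protection database at a given site.

```python
# AFS rights letters: r=read, l=lookup, i=insert, d=delete, w=write, k=lock.
AFS_RIGHTS = {"read": "rl", "write": "rlidwk"}

def build_setacl_command(directory, afs_name, access):
    """Return the argv for granting `access` on one directory; nothing is run."""
    if access not in AFS_RIGHTS:
        raise ValueError(f"unknown access level: {access!r}")
    return ["fs", "setacl", "-dir", directory,
            "-acl", afs_name, AFS_RIGHTS[access]]

# Hypothetical: let the Condor identity read the simulation inputs only.
cmd = build_setacl_command("/afs/example.org/u/epaulson/sim/input",
                           "epaulson.condor", "read")
```

AFS ACLs apply per directory, so anything outside the listed directories
(mail, ssh keys) stays unreadable to the restricted identity.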

The ticket for 'epaulson/condor' will be long-lived, and generated
automatically by our KDC and pushed out to the appropriate machines,
probably once an hour so it's always fresh. (The KDC doesn't need my
password for this; in fact, I think it's a random password each time.) We
already do this for certain system services. For example, our automated
builds get an automatic ticket so they can read the CVS repository. We
never worry about making sure someone refreshes the ticket for this job
every few days.

This sounds like a keytab is involved...that would be the natural way
to do it.

When the job is matched and started, Condor will use the latest long-lived
automatic ticket to get a new ticket for the remote machine and start
the job. Our definition of long-lived is 30 days, which is longer than
any single expected run of a job, so we can always get a new ticket
the next time we start it up. If a single instance of the job keeps
a machine claimed and running the same process for 30 days, then we'd
have problems. I don't think we've ever seen a single job stay alive
for that long, though I guess I could imagine DAGMan jobs running that
long. Thankfully, they can "checkpoint" and restart with a new ticket.
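
The lifetime reasoning above can be sketched as a small check, written here
in Python. The 30-day figure comes from the text; the function name and the
idea of comparing against an expected run length are illustrative
assumptions, not anything Condor does today.

```python
from datetime import datetime, timedelta

LONG_LIVED = timedelta(days=30)  # "long-lived" as defined in the text

def ticket_covers_run(issued_at, start_at, expected_run, lifetime=LONG_LIVED):
    """True if a ticket issued at `issued_at` is still valid when a run
    that starts at `start_at` and lasts `expected_run` finishes."""
    expires_at = issued_at + lifetime
    return start_at + expected_run <= expires_at

issued = datetime(2007, 10, 9, 12, 0)
start = issued + timedelta(hours=1)   # ticket refreshed about hourly

week_long_ok = ticket_covers_run(issued, start, timedelta(days=7))
too_long_ok = ticket_covers_run(issued, start, timedelta(days=31))
```

With an hourly refresh, the ticket is at most about an hour old at job start,
so only a single run approaching the full 30 days would outlive it.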

If, heaven forbid, one of our execute machines is compromised, the only
tickets that are stolen are from whatever Condor jobs happen to wind up
there while the machine was compromised, and even then they're only good
for accessing files that were made accessible to the epaulson/condor
principal, not epaulson.

Just to be clear, I can't imagine that Condor will require this setup, and
there will be a way to simply use the TGT to get a ticket for the remote
host and run the Condor job, all with the same krb5 principal.
(But, if you use our setup, it'll be cooler :)

What is achievable (in my mind at least :) is having the queue daemons
authenticated so you could easily ACL a directory for that, weak
though it is, or even system:authuser which is effectively what that
permission would be since any authenticated user could submit a
batch job that would get that ID.

You could do that today - create a 'condor-job' principal, start the
Condor daemons on each machine so it has a 'condor-job' token, and
any job born underneath that Condor instance will have a 'condor-job'
token. As you point out, effectively every Condor job has permissions to
every other Condor job. You can still use Kerberos to authenticate the
user to the schedd, and the schedd to the startd, so you'd be able to
control who is submitting Condor jobs, but once they're in all bets are
off. (What's missing today is having the schedd authenticate as 'epaulson'
to the startd.  It currently authenticates as a Condor daemon between
the two, with no notion of the owner of the job.)

<ramble ramble>

Anyway is any work being done in this direction, or are IP based ACLs
"good enough" for the developers?

That's not my department :)
