
Re: [Condor-users] Condor on Intel Macs (OS X 10.4 Tiger)


On Tue, Jun 19, 2007 at 09:48:34AM -0500, Ben Rogers wrote:
> That said it'd be nice to have a universal version at some point so that we
> don't have to advertise an additional classad to know which arch we are
> really on.

For OSes which go out of their way to provide binary compatibility,
it can get pretty hairy to figure out what works and what doesn't. Throw
in some edge cases and things get downright frustrating. Consequently, I
apologize for the forthcoming explanation of how these ports work on
different machines. This is a complicated topic with, as yet, no easy
solutions. Currently, the mac port of Condor is in a bit of flux, as I
need to determine what Apple's grand plan is in the long run in order to
stabilize good defaults for Condor.

In addition, this description does not take universal binaries into
account, because no Condor releases are compiled as such. However, I'll
talk about universal binaries at the end.

Current view of the supported arch/opsys combinations and ports of Condor:

arch/opsys               Condor Port
----------               -----------
ppc_macos_10.3           Native 32-bit
ppc_macos_10.4           Use ppc_macos_10.3 port
                         Native 32-bit (not released, probably never)
x86_macos_10.4           Use ppc_macos_10.3 port
                         Native 32-bit (not released, hopefully 6.8.6)

What works:

(ppc_macos_10.3/Native 32-bit) running on (ppc_macos_10.3):
	1. it works as root
	2. it works as a personal condor
	3. it identifies itself as PPC/OSX

(ppc_macos_10.3/Native 32-bit) running on (ppc_macos_10.4):
	1. it works as root
	2. it identifies itself as PPC/OSX
	3. it *could* work as a personal condor if you alter some default policy:
		A: (as root) sysctl -w kern.tfp.policy=1
		B: I'm not sure how long this will be available in versions beyond 10.4
(ppc_macos_10.3/Native 32-bit) running on (x86_macos_10.4/Rosetta):
	1. it works as root
	2. it identifies itself as PPC/OSX
	3. it *could* work as a personal condor if you alter some default policy:
		A: (as root) sysctl -w kern.tfp.policy=1
		B: I'm not sure how long this will be available in versions beyond 10.4
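
If you rely on that sysctl setting for a personal condor, you'll probably
also want it to survive reboots. On 10.4 I believe this can be done with an
/etc/sysctl.conf entry (a sketch on my part; verify on your own machine):

```
# /etc/sysctl.conf -- read at boot on 10.4 (an assumption on my part);
# relaxes the task_for_pid() policy that a personal condor needs:
kern.tfp.policy=1
```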

What doesn't work:
	You can't run a ppc_macos_10.4 binary on a ppc_macos_10.3 machine
	(or if you can, you've managed to avoid some common problems).
	This extends logically: you most likely can't run a binary
	compiled on one revision of macos on an earlier revision of the OS.

	You can't run an x86 binary on ppc.

Sadly, now this gets even more complex....

By default, when you condor_submit a job, the arch and opsys are derived
from the installation of Condor on the submitting machine.

So, the job's Requirements expression will be (for example, if you
submitted from an x86_macos_10.4 machine running the ppc_macos_10.3
version of Condor under Rosetta):

Requirements = (Arch == "PPC") && (Opsys == "OSX") && ....

However, since ALL versions of Condor identify themselves this way, if
you have a mixed pool of ppc/10.4 and x86/10.4, then the job can land on
a machine of the incorrect architecture and fail. Of course, there are
other combinations where this will not work properly, like submitting a
ppc/10.4-compiled binary which might match to a ppc/10.3 machine. And
there are non-intuitive places where it works, like submitting a ppc
binary which ends up on an x86 machine but works due to Rosetta.

So, let's brainstorm about what would happen if we changed Condor in
various ways to deal with this nasty problem:

Specific Ports

One might think you could fix this by having explicit ports for each
arch/opsys combination, and have them identify themselves as PPC/MACOS_10.3,
PPC/MACOS_10.4, and INTEL/MACOS_10.4. This would solve the problem of mixed
pools, since you would get the correct default behavior and could alter
your requirements expression to explicitly state which architectures you
know your code can run on. In some cases, assuming the binaries you are
submitting share the compilation characteristics of the installation of
Condor, it could help you out--our Solaris ports do something similar
with standard universe jobs.

In this world, suppose you were on a ppc/10.4 machine submitting a ppc/10.3
binary into a mixed pool with all three arch/opsys combinations; you'd write:

Requirements = (Arch == "PPC" || Arch == "INTEL") && (Opsys == "MACOS_10.3" ||
	Opsys == "MACOS_10.4")

Otherwise, the default that Condor would pick would be (assuming Rosetta is available):

Requirements = (Arch == "PPC" || Arch == "INTEL") && (Opsys == "MACOS_10.4")

In this case, the default behavior would miss out on a third of the available
categories of resources.

But now, there are some serious upgrade path problems. What if you upgrade
to macos 10.5? You get a new port, but--oops, everything in the job
queue is referencing the old opsys/arch combinations it needs, and now the
admin is left condor_qedit'ing the jobs and praying the edits are correct. Or
you leave some older execute machines around to drain the queue, and ensure
your submit nodes are the new machines.

These problems of pool fragmentation and upgrade paths are why we
really tried to have only PPC/OSX and INTEL/OSX as possible descriptors
for the macos ports of Condor. However, as we've seen, even that has problems.

Universal Binary

Suppose Condor were compiled as a universal binary, so it runs out of the
box on ppc or x86. Now, how should it identify itself? UNI/OSX? Or should
it figure out the underlying architecture at run time and advertise that?

It turns out both answers are wrong, and there isn't a right one in terms of
matchmaking.

If Condor identified itself as UNI/OSX, then no matter what you condor_submit,
it will be matched to any machine in the pool, regardless of whether it is
truly a universal binary or not.

If Condor identified itself as the specific architecture, then submitting
a job will result in partial utilization of available execution machines.

Neither of these is a good answer.

And, while a universal binary might blur the line between which architectures
you can run your binary on, it doesn't appear (to me, at least) that it solves
the problems concerning OS revision compatibility.


So, where does all of this mess currently leave us (keep the root/personal
condor issue in mind too)?  

For people who have a single arch/opsys mac cluster, do nothing
and Condor will just work.

For people who have mixed arch, but not opsys, pools: submit ppc
executables from a ppc machine and you'll be fine.

For people who have mixed opsys but not arch pools, submit programs compiled
on the oldest os revision.

For people who have mixed arch and mixed opsys pools, you must hand specify
additional attributes in the machine ads and adjust your machine START
and job Requirements expressions to utilize them to ensure jobs end up in 
correct places.

For people who don't fit into the above categories: you must hand
specify additional attributes in the machine ads and adjust your
machine START and job Requirements expressions to ensure your
jobs end up in correct places.
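
For the mixed-pool cases above, here's a minimal sketch of what
hand-specifying such attributes might look like (the attribute names
RealArch and RealOpsys are my own invention, not something Condor ships
with):

```
## In condor_config.local on each execute machine, state the true
## platform by hand and publish it in the machine ad:
RealArch = "x86"
RealOpsys = "MACOS_10.4"
STARTD_EXPRS = $(STARTD_EXPRS), RealArch, RealOpsys

## Then, in the job's submit file, match on those attributes instead
## of the ambiguous Arch/Opsys pair:
## requirements = (RealArch == "ppc") && (RealOpsys == "MACOS_10.3")
```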


As the maintainer of the ports of Condor, I've thought about the above
problems a lot. In a future revision of Condor, I think what I'm probably
going to do is something like this:

1. The Condor distribution itself will be composed of universal binaries.
2. It will identify itself as INTEL/OSX or PPC/OSX on the correct hardware.
3. On INTEL/OSX machines, it'll autodetect if Rosetta is available and
	mark an attribute in the machine ad.
4. If condor_submit is run on a ppc machine, then it'll fix up the requirements
	expression to utilize Rosetta-enabled INTEL/OSX machines.
5. A user could put +universal_binary = TRUE in their submit file and Condor
	will run it on either INTEL/OSX or PPC/OSX machines.
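
If that plan comes to pass, a submit file for a universal binary might look
something like this (a sketch only; +universal_binary is the proposed
attribute from item 5, not something any released Condor understands):

```
universe   = vanilla
executable = my_universal_app
output     = my_universal_app.out
error      = my_universal_app.err
## Proposed attribute from item 5 above: with it set, Condor would
## match the job to either INTEL/OSX or PPC/OSX machines.
+universal_binary = TRUE
queue
```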

This solves most of the problems a user will encounter and makes it easy
for them to use Condor in most cases, but it still leaves problems such as
not being able to submit a ppc/10.4 binary and have it run on a ppc/10.3
machine. I still need to think more about a good default solution to that
problem.

Thank you.

Peter Keller