
Re: [Condor-users] Submit Condor jobs such that <=N can land on one physical machine?



On 01/22/2010 02:35 PM, Ross, Jason wrote:
> I’ve got a Condor pool of multicore machines (running Windows 2k3
> Server), and an executable that does not play well with itself – running
> multiple instances of it on a single box, multicore or not, ends badly.
>  
> For a quick hack, I can specify (VirtualMachineID == 1) in my job
> submission’s requirements field.
> But that’s really not very good; if a machine’s slot 1 happens to be
> filled with another job, I can’t use that machine, even if it’s running
> no instances of my executable and even if it’s got other job slots free.
>  
> Is there a better way to write a job submission to express the idea,
> “Map no more than one of these per physical machine”?
> Or, even more generally, “Map no more than N of these per physical machine?”
>  
> It doesn’t seem like even a solution where I look at some machine
> classad field describing currently-running jobs would work, since
> negotiation happens all at once.
> I’d expect all my jobs to observe that none of themselves are running on
> box X, and for them all to map successfully at the same time to however
> many slots X has available.
>  
> Thanks!
> - Jason
>  

Jason,

This has been discussed quite a bit on the list in the context of GPUs.

You can definitely do this by setting an attribute on each of your jobs to identify them, and using that attribute as part of your START policy. Yes, many jobs might get matched to a single machine in one negotiation cycle, but only one can successfully start. There is a possibility of nasty thrashing, where all the jobs match and then get booted, over and over. Using partitionable slots avoids that issue, but I doubt they are available in your version of Condor if you use VirtualMachineID; it was deprecated in favor of SlotId many years ago.

This could be more elegant, but it will work right now on a 4-slot box (extending it to more slots should be obvious):

In your execute node's configuration...

STARTD_ATTRS = SLOT1_GRUMPY_APP_COUNT, SLOT2_GRUMPY_APP_COUNT, SLOT3_GRUMPY_APP_COUNT, SLOT4_GRUMPY_APP_COUNT, GRUMPY_APP_COUNT
STARTD_JOB_EXPRS = GRUMPY_APP
STARTD_SLOT_ATTRS = GRUMPY_APP

SLOT1_GRUMPY_APP_COUNT = ifThenElse(slot1_GRUMPY_APP =?= UNDEFINED, 0, 1)
SLOT2_GRUMPY_APP_COUNT = ifThenElse(slot2_GRUMPY_APP =?= UNDEFINED, 0, 1)
SLOT3_GRUMPY_APP_COUNT = ifThenElse(slot3_GRUMPY_APP =?= UNDEFINED, 0, 1)
SLOT4_GRUMPY_APP_COUNT = ifThenElse(slot4_GRUMPY_APP =?= UNDEFINED, 0, 1)

GRUMPY_APP_COUNT = (SLOT1_GRUMPY_APP_COUNT + SLOT2_GRUMPY_APP_COUNT + SLOT3_GRUMPY_APP_COUNT + SLOT4_GRUMPY_APP_COUNT)

START = GRUMPY_APP_COUNT < 1

In your submit file...

+GRUMPY_APP = "This app doesn't play well with itself"
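
As a rough model of what that policy evaluates, here is a small Python sketch. It is illustration only, not Condor configuration: None stands in for a ClassAd UNDEFINED, the function names are made up, and the max_per_machine parameter is an assumption added to show how changing the "< 1" comparison would correspond to the "no more than N per machine" case.

```python
# Illustrative model only; this is not Condor configuration.
# None stands in for a ClassAd UNDEFINED value.

def slot_count(grumpy_app):
    """Mirror ifThenElse(slotN_GRUMPY_APP =?= UNDEFINED, 0, 1)."""
    return 0 if grumpy_app is None else 1


def grumpy_app_count(slots):
    """Mirror GRUMPY_APP_COUNT: the sum of the per-slot counts."""
    return sum(slot_count(value) for value in slots)


def start(slots, max_per_machine=1):
    """Mirror START = GRUMPY_APP_COUNT < 1.

    max_per_machine is a hypothetical knob; the config above hard-codes 1.
    """
    return grumpy_app_count(slots) < max_per_machine
```

With four idle slots, start([None, None, None, None]) is true; once one slot carries the job's GRUMPY_APP string, the count is 1 and START goes false on the machine.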

There are two ways to make it more elegant. One is to always define GRUMPY_APP=0 on every slot (via STARTD_ATTRS) and have the job set +GRUMPY_APP=1, which avoids the "=?= UNDEFINED" test. The other would be the ability to loop over slots instead of statically referencing them, e.g. reduce(X.GRUMPY_APP, {slots}), which doesn't exist in the ClassAd language.
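
To show what such a reduce over slots would compute, here is a Python sketch of the idea. Again, this is only an illustration, not ClassAd syntax: it assumes every slot publishes GRUMPY_APP as 0 or 1, and count_grumpy_slots is a made-up name.

```python
from functools import reduce


def count_grumpy_slots(slot_flags):
    """Fold the per-slot GRUMPY_APP flags (0 or 1) into a single count.

    Works for any number of slots, so nothing has to be named statically.
    """
    return reduce(lambda total, flag: total + flag, slot_flags, 0)
```

For example, count_grumpy_slots([0, 1, 0, 0]) gives 1, and the START test stays a comparison of that count against the limit.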

Best,


matt