
Re: [Condor-users] Submitting a parallel job in the vanilla universe



Brandon Leeds wrote:


Matthew Farrellee wrote:
Cargnelli Matthieu wrote:
Hi,

I'd like to know whether it is possible to schedule a multithreaded "vanilla" job on a single multi-processor machine. As I use the usual configuration (1 VM per processor), I can use the totalCpus attribute in the job description file, for instance Requirements = (totalCpus >= 4). But then only one VM is reserved for the job. I suppose this method works if a single job is submitted, as in my tests, but what happens if 4 jobs are submitted?

I couldn't find an answer aside from using the parallel universe. Is it possible to reserve a full node with a vanilla job?

Best regards


Dan's HOWTO covers this under "How to allow some jobs to claim the whole machine instead of one slot" - http://nmi.cs.wisc.edu/node/1482

Best,


matt

I tried accessing this document because I thought it might help me decide whether Condor can be used to submit OpenMP-based programs; however, it requires an NMI login, which I don't have. Is there another place I can access this document?
Thanks,
Brandon Leeds
Lehigh University

Reproduced below (a direct cut & paste from Dan's How-to Recipes):

--

How to allow some jobs to claim the whole machine instead of one slot

Known to work with Condor version: 7.2

The simplest way to achieve this is to set NUM_CPUS=1 so that each machine advertises just a single slot. However, this prevents you from supporting a mix of single-cpu and whole-machine jobs. The following example supports both, with one caveat: the Condor accountant does not charge the whole-machine user for claiming all of the slots; it only charges the user for claiming one slot.

First, you would have whole-machine jobs advertise themselves as such with something like the following in the submit file:

+RequiresWholeMachine = True

Then put the following in your Condor configuration file. Make sure it comes after the definitions of the settings it appends to (such as START and SUSPEND), or merge the definitions together.

# require that whole-machine jobs only match to Slot1
START = ($(START)) && (TARGET.RequiresWholeMachine =!= TRUE || SlotID == 1)

# have the machine advertise when it is running a whole-machine job
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) RequiresWholeMachine

# Export the job expr to all other slots
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) RequiresWholeMachine

# require that no single-cpu jobs may start when a whole-machine job is running
START = ($(START)) && (SlotID == 1 || Slot1_RequiresWholeMachine =!= True)

# suspend existing single-cpu jobs when there is a whole-machine job
SUSPEND = ($(SUSPEND)) || (SlotID != 1 && Slot1_RequiresWholeMachine =?= True)
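To make the interaction of these START and SUSPEND clauses easier to follow, here is a small Python model of them (a sketch, not Condor code). It assumes the pre-existing $(START) is True and $(SUSPEND) is False, represents the ClassAd UNDEFINED value as None, and approximates the =?= and =!= meta-operators as "same type and same value, or both UNDEFINED". The function names are hypothetical.

```python
from typing import Any

# Stand-in for the ClassAd UNDEFINED value (an assumption of this sketch).
UNDEFINED = None


def meta_equal(a: Any, b: Any) -> bool:
    """Model of ClassAd =?= : true only if both sides are UNDEFINED, or both
    have the same type and value. It never yields UNDEFINED itself."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is UNDEFINED and b is UNDEFINED
    return type(a) is type(b) and a == b


def meta_not_equal(a: Any, b: Any) -> bool:
    """Model of ClassAd =!= , the negation of =?= ."""
    return not meta_equal(a, b)


def start(slot_id: int, job: dict, slot1_requires_whole_machine: Any,
          base_start: bool = True) -> bool:
    """Combined START policy for one slot; base_start plays $(START)."""
    job_wants_whole = job.get("RequiresWholeMachine", UNDEFINED)
    # Whole-machine jobs may only match slot 1.
    ok = base_start and (meta_not_equal(job_wants_whole, True) or slot_id == 1)
    # Nothing starts on the other slots while slot 1 runs a whole-machine job.
    return ok and (slot_id == 1
                   or meta_not_equal(slot1_requires_whole_machine, True))


def suspend(slot_id: int, slot1_requires_whole_machine: Any,
            base_suspend: bool = False) -> bool:
    """SUSPEND policy: suspend the other slots while slot 1 runs a
    whole-machine job; base_suspend plays $(SUSPEND)."""
    return base_suspend or (slot_id != 1
                            and meta_equal(slot1_requires_whole_machine, True))


whole_job = {"RequiresWholeMachine": True}
single_job = {}  # ordinary job: the attribute is absent, i.e. UNDEFINED

print(start(2, whole_job, UNDEFINED))      # False: only slot 1 may take it
print(start(1, whole_job, UNDEFINED))      # True
print(start(3, single_job, True))          # False: slot 1 is whole-machine
print(suspend(3, True), suspend(1, True))  # True False
```

An ordinary job lacks RequiresWholeMachine, so the attribute is UNDEFINED and "=!= TRUE" still holds. That is why the recipe uses the meta-operators: with != the comparison would evaluate to UNDEFINED, and ordinary jobs would never start.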

Instead of suspending the single-cpu jobs while the whole-machine job runs, you could suspend the whole-machine job while the single-cpu jobs finish. Example:

# advertise the activity of each slot into the ads of the other slots,
# so the SUSPEND expression can see it
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) Activity

# Suspend the whole-machine job until the other slots are empty
SUSPEND = ($(SUSPEND)) || (SlotID == 1 && Slot1_RequiresWholeMachine =?= True && \
(Slot2_Activity =?= "Busy" || Slot3_Activity =?= "Busy" || ... ) )
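The alternative policy can be modeled the same way. This sketch makes the same assumptions (UNDEFINED as None, hypothetical function names) and passes the other slots' Activity values as a dictionary that stands in for the Slot2_Activity, Slot3_Activity, ... attributes that STARTD_SLOT_EXPRS copies into slot 1's ad, so it works for any number of slots.

```python
from typing import Any, Dict

# Stand-in for the ClassAd UNDEFINED value (an assumption of this sketch).
UNDEFINED = None


def meta_equal(a: Any, b: Any) -> bool:
    """Model of ClassAd =?= (same type and value, or both UNDEFINED)."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is UNDEFINED and b is UNDEFINED
    return type(a) is type(b) and a == b


def suspend_until_others_idle(slot_id: int, slot1_requires_whole_machine: Any,
                              other_activity: Dict[int, Any],
                              base_suspend: bool = False) -> bool:
    """Hold the whole-machine job on slot 1 while any other slot still
    reports Activity "Busy".

    other_activity maps slot number -> the SlotN_Activity value seen in
    slot 1's ad (a hypothetical representation for this sketch).
    """
    any_busy = any(meta_equal(act, "Busy") for act in other_activity.values())
    return base_suspend or (
        slot_id == 1
        and meta_equal(slot1_requires_whole_machine, True)
        and any_busy
    )


print(suspend_until_others_idle(1, True, {2: "Busy", 3: "Idle"}))  # True
print(suspend_until_others_idle(1, True, {2: "Idle", 3: "Idle"}))  # False
```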

You might want to steer whole-machine jobs toward machines that are completely vacant, in particular machines whose single-cpu slots are idle.

Here's a simple example that just avoids machines with a high load:

NEGOTIATOR_PRE_JOB_RANK = -TARGET.LoadAvg*(MY.RequiresWholeMachine =?= True)
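A quick Python illustration of how this rank orders candidate machines for a whole-machine job. It assumes, as the negative sign implies, that a higher NEGOTIATOR_PRE_JOB_RANK is preferred, treats the boolean factor as 1 or 0 the way the multiplication does, and uses made-up machine names and load values.

```python
from typing import Any, Dict

# Stand-in for the ClassAd UNDEFINED value (an assumption of this sketch).
UNDEFINED = None


def meta_equal(a: Any, b: Any) -> bool:
    """Model of ClassAd =?= (same type and value, or both UNDEFINED)."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is UNDEFINED and b is UNDEFINED
    return type(a) is type(b) and a == b


def pre_job_rank(load_avg: float, job: Dict[str, Any]) -> float:
    """-TARGET.LoadAvg * (MY.RequiresWholeMachine =?= True), with the
    boolean counted as 1 or 0."""
    wants_whole = 1 if meta_equal(job.get("RequiresWholeMachine", UNDEFINED),
                                  True) else 0
    return -load_avg * wants_whole


# Hypothetical machines and their LoadAvg values.
load = {"node-a": 3.9, "node-b": 0.1, "node-c": 1.5}
whole_job = {"RequiresWholeMachine": True}

# Higher rank wins, so the least-loaded machine is chosen.
best = max(load, key=lambda name: pre_job_rank(load[name], whole_job))
print(best)  # node-b
```

For an ordinary job the factor is 0, so every machine gets the same rank and this expression has no effect on where the job lands.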

A more complicated expression would look at the attributes of the other slots when forming the rank:

STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) Activity

NEGOTIATOR_PRE_JOB_RANK = (MY.RequiresWholeMachine =?= True) * \
((Slot2_Activity =!= "Busy") + (Slot3_Activity =!= "Busy") + ... )
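The same idea in Python, scoring each machine by how many of its other slots are not "Busy". Each comparison is evaluated on its own and counted as 1 or 0 before the results are added. Machine names and slot activities are invented for the example, and a dictionary again stands in for however many SlotN_Activity attributes the "..." covers.

```python
from typing import Any, Dict

# Stand-in for the ClassAd UNDEFINED value (an assumption of this sketch).
UNDEFINED = None


def meta_equal(a: Any, b: Any) -> bool:
    """Model of ClassAd =?= (same type and value, or both UNDEFINED)."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is UNDEFINED and b is UNDEFINED
    return type(a) is type(b) and a == b


def meta_not_equal(a: Any, b: Any) -> bool:
    """Model of ClassAd =!= , the negation of =?= ."""
    return not meta_equal(a, b)


def idle_slot_rank(job: Dict[str, Any], other_activity: Dict[int, Any]) -> int:
    """(MY.RequiresWholeMachine =?= True) * (sum of (SlotN_Activity =!= "Busy"))."""
    wants_whole = 1 if meta_equal(job.get("RequiresWholeMachine", UNDEFINED),
                                  True) else 0
    not_busy = sum(1 for act in other_activity.values()
                   if meta_not_equal(act, "Busy"))
    return wants_whole * not_busy


# Hypothetical machines: Activity of slots 2..4 as seen from slot 1's ad.
machines = {
    "node-a": {2: "Busy", 3: "Busy", 4: "Idle"},
    "node-b": {2: "Idle", 3: "Idle", 4: "Idle"},
    "node-c": {2: "Busy", 3: UNDEFINED, 4: "Idle"},
}
whole_job = {"RequiresWholeMachine": True}

ranks = {name: idle_slot_rank(whole_job, acts) for name, acts in machines.items()}
print(ranks)  # {'node-a': 1, 'node-b': 3, 'node-c': 2}
```

A slot whose Activity is UNDEFINED (node-c, slot 3) counts as not busy, because =!= never yields UNDEFINED.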