[HTCondor-users] Parallel universe locality - how? (revisited)



On Tue, Mar 10, 2015 at 12:26:36PM +0100, Steffen Grunewald wrote:
> On Wed, Mar 04, 2015 at 10:15:22AM +0100, Steffen Grunewald wrote:
> > On Tue, Mar 03, 2015 at 02:51:59PM +0000, Peter F. Couvares wrote:
> > > The NEGOTIATOR_PRE_JOB_RANK expression is numeric, and it evaluates in the context of each machine ClassAd (including anything you publish in the machine ad from the ads of jobs already running there, via STARTD_JOB_ATTRS). So you can simply reference the universe and give it a different rank. Something like (in pseudocode):
> > >
> > > NEGOTIATOR_PRE_JOB_RANK = (is_parallel * 10) + (is_not_parallel * 20) + other_stuff
> > 
> > To avoid large amounts of parentheses, would it work to have
> > 
> > NEGOTIATOR_PRE_JOB_RANK = ifThenElse( (Target.JobUniverse =?= 11), \
> >                             $(PARALLEL_PRE_JOB_RANK), \
> >                             $(NONPARALLEL_PRE_JOB_RANK) )
> > 
> > with the two helper expressions defined accordingly? (NONPARALLEL_* would basically mimic
> > the default, "a - b*Memory - c*Cpus"; PARALLEL_* would be something like "x + y*Cpus - z*Memory".)
> 
> I tried this, and it works.
> But it's far from optimal: up-ranking nodes with many free Cpus also down-ranks
> that same node for the next slot, because apparently not all of its available Cpus
> are claimed in one go.
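For concreteness, the two helper macros used by the quoted ifThenElse() selector could be filled in roughly as follows. This is only a sketch. The coefficients are made-up placeholders that would need tuning against the pool's actual Cpus and Memory (MB) ranges. Each helper is wrapped in parentheses so that the textual $(...) expansion cannot change operator precedence inside the selector.

```
# Hypothetical coefficients; tune them to the pool's Cpus/Memory ranges.
# Non-parallel jobs: prefer slots with fewer Cpus and less Memory,
# mimicking the shape of the default "a - b*Memory - c*Cpus".
NONPARALLEL_PRE_JOB_RANK = (1000000 - 10 * Memory - 10000 * Cpus)
# Parallel universe jobs (JobUniverse 11): prefer slots with many free Cpus.
PARALLEL_PRE_JOB_RANK = (1000000 + 10000 * Cpus - 10 * Memory)
```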

What I'm now considering is ParallelSchedulingGroup.
What exactly would happen if I set it to the hostname of each execute node?
Would a parallel universe job still be able to run at all?
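The idea can be sketched as below, assuming the mechanism works as the HTCondor manual describes it. The execute node advertises the attribute through STARTD_ATTRS, and a job only honours scheduling groups when its submit file sets +WantParallelSchedulingGroups = True. FULL_HOSTNAME is the standard config macro for the machine's fully qualified name. Whether one group per host is a workable policy is exactly the open question.

```
# Execute-node configuration (sketch): one scheduling group per host.
ParallelSchedulingGroup = "$(FULL_HOSTNAME)"
STARTD_ATTRS = $(STARTD_ATTRS) ParallelSchedulingGroup

# Submit description file of the parallel job (sketch).
+WantParallelSchedulingGroups = True
```

If the dedicated scheduler does place all nodes of a job within a single group, such a job could presumably only match when one host can hold all of its machine_count nodes at once.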

What I obviously need (and cannot find in the documentation or the wiki, which doesn't
even know about the parallel universe...) is a sub-expression in NEGOTIATOR_PRE_JOB_RANK that
favours locality (in other words: "this machine has already been matched to at least one other
rank of this MPI job, and has fewer free Cpus for exactly that reason").
How do I do that?

For parallel universe jobs to perform reasonably in this low-budget network
environment, I need maximum locality, i.e.:
- favour machines with as many unmatched Cpus as possible
- schedule as many MPI nodes as possible onto a single machine

As a bonus, since I've observed rank-0 processes to have higher (memory) requirements,
I'd like to be able to specify individual (cpu and/or memory) requests for individual
ranks.
This could be done by extending the current syntax to something like
  request_cpus = 4, (1)
(the first rank gets 4 cores, every subsequent rank gets 1, following Fortran FORMAT repeat semantics)
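Part of this might already be expressible without new syntax. That assumes the parallel universe accepts several queue statements with different requests in one submit file, each contributing its machine_count nodes to the same job. In the sketch below, the executable name and the memory figures are hypothetical placeholders. It is also an assumption, to be checked against the MPI wrapper script in use, that MPI rank 0 lands on the node from the first queue statement.

```
universe       = parallel
# Hypothetical wrapper/executable name.
executable     = mpi_wrapper.sh
# Node assumed to host rank 0: more cores and memory (placeholder values).
machine_count  = 1
request_cpus   = 4
request_memory = 8192
queue
# Remaining nodes: one core each (placeholder memory value).
machine_count  = 7
request_cpus   = 1
request_memory = 2048
queue
```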

Dreaming... (of better support for parallel jobs in Condor)

Thanks,
 Steffen

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}