
Re: [Condor-users] Getting closer with Parallel Universe on Dynamic slots



On Fri, Nov 25, 2011 at 09:12:49AM -0500, Ian Chesal wrote:
> > > > RANK = 0
> > > > NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000 * Memory
> > > > 
> > > > universe = parallel
> > > > initialdir = /home/steffeng/tests/mpi/
> > > > executable = /home/steffeng/tests/mpi/mpitest
> > > > arguments = $(Process) $(NODE)
> > > > output = out.$(NODE)
> > > > error = err.$(NODE)
> > > > log = log
> > > > notification = Never
> > > > on_exit_remove = (ExitBySignal == False) || ((ExitBySignal == True) && (ExitSignal != 11))
> > > > should_transfer_files = yes
> > > > when_to_transfer_output = on_exit
> > > > Requirements = ( TotalCpus == 4 )
> > > > request_memory = 500
> > > > machine_count = 10
> 
> See the section labeled 'Macros' in the condor_submit manual:
> 
> http://research.cs.wisc.edu/condor/manual/v7.6/condor_submit.html#74467
> 
> Specifically:
> request_cpus = $$(totalcpus)
> 
> I'm not saying this is going to work for you, but just that it might be worth trying.

Thanks Ian, for pointing me to that.

It turns out that request_cpus=n, independent of n, results in one slot
claimed per machine: with "request_cpus=4" and "machine_count=4" I got a
single slot claimed on each of four machines, exactly as "request_cpus=1"
or "request_cpus=2" would have done.

"machine_count" obviously gets translated into the number of individual MPI jobs (nodes),
and "request_cpus" would define the number of CPU cores assigned to each of them.
It's my problem if the nodes don't know about multi-core on their own.
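If that reading is correct, one way to pass the core count on to each node
would be something like the following (a sketch, assuming mpitest were an
OpenMP-enabled binary; OMP_NUM_THREADS is the standard OpenMP variable and
has to be kept in sync with request_cpus by hand):

  universe      = parallel
  executable    = /home/steffeng/tests/mpi/mpitest
  machine_count = 10
  request_cpus  = 4
  environment   = OMP_NUM_THREADS=4
  queue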

Apparently, dynamic slot provisioning doesn't work well with parallel universe yet.

As soon as I return to old-style slot splitting (four static slots per machine,
each with one CPU and 25% of the memory) I get the "proximity" I'm looking for:
of machine_count=10, the first 4 nodes get sent to one machine, 4 to the next,
and 2 to a third.
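The static-slot setup I mean is roughly this condor_config fragment (a
sketch; SLOT_TYPE_<N> and NUM_SLOTS_TYPE_<N> are the standard knobs from
the manual's slot-type section):

  SLOT_TYPE_1      = cpus=1, memory=25%
  NUM_SLOTS_TYPE_1 = 4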

So I can either do hard partitioning and get proper MPI behaviour, or
dynamic partitioning and be able to run memory-hungry jobs.
Unfortunately, the users have been asking for both (and the mix is unpredictable).

To add to the inconvenience, for each such reconfiguration Condor has to be
stopped completely on the affected machines.
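In practice that means something like this for every execute machine whose
slot layout changes (condor_restart is the standard tool; execnode01 is a
hypothetical host name, and a plain condor_reconfig is not enough since the
startd's slot layout is only read at startup):

  condor_restart -startd execnode01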

Are there plans to make Condor more flexible here?
Being able to pack as many dynamic slots of one parallel job as possible onto
the same machine would help a lot.
In the manual, and everywhere else I looked, "dynamic slots" and "parallel universe"
seem to be disjoint concepts...

BTW:
*If* there were proper co-existence of dynamic slots and parallel universe, one
would have to look for a NEGOTIATOR_PRE_JOB_RANK expression that yields the best
result for the parallel job while harming as few other jobs as possible - perhaps
such a thing doesn't even exist if preemption is allowed?
Without preemption things should be easier:
- Rank by the number of unclaimed CPUs?
  How would one do that - introduce another machine ClassAd attribute, UnclaimedCpus?
  I vaguely remember someone had come up with a huge ifThenElse construction to
  sum up the resources "bound" by claimed dynamic slots, but there should be a
  solution that still works for 64 cores... (a rough attempt follows below)
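For what it's worth, on a partitionable slot the machine ClassAd attribute Cpus
already counts the cores not yet carved off into dynamic slots, so a candidate
expression might look like this (an untested sketch; PartitionableSlot and Cpus
are standard machine-ad attributes, the scale factor is arbitrary):

  NEGOTIATOR_PRE_JOB_RANK = 1000000000 * (TARGET.JobUniverse =?= 11) * \
      ifThenElse(MY.PartitionableSlot =?= True, MY.Cpus, 0)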

S