
Re: [Condor-users] Getting closer with Parallel Universe on Dynamic slots

On 11/25/11 8:12 AM, Ian Chesal wrote:
On Friday, 25 November, 2011 at 8:55 AM, Steffen Grunewald wrote:
On Fri, Nov 25, 2011 at 01:12:01PM +0100, Lukas Slebodnik wrote:
On Fri, Nov 25, 2011 at 12:14:19PM +0100, Steffen Grunewald wrote:
... but still no cigar.

The setup consists of five 4-core machines and several more 2-core machines.
All of them have been configured as single, partitionable slots.
Preemption is forbidden completely.
The rank definitions are as follows:
RANK = 0
NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000 * Memory

I'd expect this to favour big machines over small ones (for Parallel jobs),
and partially occupied ones over empty ones.
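A quick sanity check of that expression (the memory figures below are illustrative, and TotalSlots is assumed to stay 1 for simplicity; with dynamic slots carved off, TotalSlots would actually grow and push occupied machines even higher):

```
# Parallel job, so (TARGET.JobUniverse =?= 11) evaluates to 1:
#
#   empty 4-core machine, TotalSlots = 1, Memory = 8000:
#     1000000000 + 1000000000 * (4 + 1) - 1000 * 8000 = 5992000000
#
#   empty 2-core machine, TotalSlots = 1, Memory = 4000:
#     1000000000 + 1000000000 * (2 + 1) - 1000 * 4000 = 3996000000
#
#   partially occupied 4-core machine with only 4000 MB left:
#     1000000000 + 1000000000 * (4 + 1) - 1000 * 4000 = 5996000000
#
# So big machines outrank small ones, and fuller machines outrank
# emptier ones, as intended.
```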

What I see with the following submit file, is quite different:

universe = parallel
initialdir = /home/steffeng/tests/mpi/
executable = /home/steffeng/tests/mpi/mpitest
arguments = $(Process) $(NODE)
output = out.$(NODE)
error = err.$(NODE)
log = log
notification = Never
on_exit_remove = (ExitBySignal == False) || ((ExitBySignal == True) && (ExitSignal != 11))
should_transfer_files = yes
when_to_transfer_output = on_exit
Requirements = ( TotalCpus == 4 )
request_memory = 500
machine_count = 10

(mpitest is the ubiquitous "MPI hello world" program trying to get rank and
size from MPI_COMM_WORLD)

- if I leave the Requirements out, the 10 MPI nodes end up spread one per
machine: one on each of the five big machines plus five small ones
If you did not specify request_cpus, then the default value (1) will be used.

I cannot specify "request_cpus = 4", as that would leave my jobs idle if the big nodes
were taken by someone else.
And AFAICT, there's no "request_cpus=all" or "request_cpus=TARGET.TotalCpus".
See the section labeled 'Macros' in the condor_submit manual:



In addition to the normal macro, there is also a special kind of macro called a substitution macro that allows the substitution of a ClassAd attribute value defined on the resource machine itself (gotten after a match to the machine has been made) into specific commands within the submit description file. The substitution macro is of the form:

$$(MachineAdAttribute)

A common use of this macro is for the heterogeneous submission of an executable:

executable = povray.$$(opsys).$$(arch)

Values for the opsys and arch attributes are substituted at match time for any given resource. This allows Condor to automatically choose the correct executable for the matched machine.


So in your case:

request_cpus = $$(totalcpus)

Careful here!  The $$() substitution only happens inside string values.  In the condor submit file, string values are typically not quoted unless they are inside of a larger expression, so there is no lexical clue to tell you which values are string values and which are not.

The value given for executable is a string value, so $$() substitution works there.  The value given for request_cpus is an expression that evaluates to an integer, so $$() substitution does not make sense there, unless the $$() appears inside of a quoted string value within the expression.  However, there is no need to rely on $$() substitution in request_cpus.  You can just directly refer to TARGET.Cpus if you need to, because this expression is evaluated with the machine ad being TARGET.
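Concretely, a sketch of what that would look like in the submit file (untested here against partitionable slots, where Cpus on the partitionable slot ad is the number of cores still unclaimed, not the machine total):

```
# Sketch: request every core the matched machine still has available.
# TARGET.Cpus is evaluated against the matched machine's ad directly,
# so no $$() substitution is needed.
request_cpus = TARGET.Cpus
```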

Whether that could solve the underlying problem or not is another matter.  As Steffen points out, partitionable slots and parallel universe have not been tightly integrated, so this may be one of the resulting rough edges.