
Re: [Condor-users] Getting closer with Parallel Universe on Dynamic slots



On Fri, Nov 25, 2011 at 12:14:19PM +0100, Steffen Grunewald wrote:
> ... but still no cigar.
>
> The setup consists of five 4-core machines and some more 2-core machines.
> All of them have been configured as single, partitionable slots.
> Preemption is forbidden completely.
> The rank definitions are as follows:
> RANK = 0
> NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000 * Memory
>
> I'd expect this to favour big machines over small ones (for Parallel jobs),
> and partially occupied ones over empty ones.
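
For reference, plugging illustrative numbers into that expression for a
parallel-universe job (the =?= term evaluates to true, i.e. 1 in the
arithmetic; the memory figures below are made up) gives:

    empty 4-core machine, 8000 MB unclaimed memory:
        1000000000 + 1000000000*(4+1) - 1000*8000 = 5992000000
    empty 2-core machine, 4000 MB unclaimed memory:
        1000000000 + 1000000000*(2+1) - 1000*4000 = 3996000000
    4-core machine with one dynamic slot carved (TotalSlots=2), 7500 MB left:
        1000000000 + 1000000000*(4+2) - 1000*7500 = 6992500000

so larger and already partially occupied machines do come out on top.
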
>
> What I see with the following submit file, is quite different:
>
> universe   = parallel
> initialdir = /home/steffeng/tests/mpi/
> executable = /home/steffeng/tests/mpi/mpitest
> arguments  =  $(Process) $(NODE)
> output     = out.$(NODE)
> error      = err.$(NODE)
> log        = log
> notification = Never
> on_exit_remove = (ExitBySignal == False) || ((ExitBySignal == True) && (ExitSignal != 11))
> should_transfer_files = yes
> when_to_transfer_output = on_exit
> Requirements = ( TotalCpus == 4 )
> request_memory = 500
> machine_count = 10
>
> (mpitest is the ubiquitous "MPI hello world" program trying to get rank and
> size from MPI_COMM_WORLD)
>
> - if I leave the Requirements out, the 10 MPI nodes will end up on the 5 big
> machines (one per machine) plus 5 small ones
If you did not specify request_cpus, the default value (1) will be used.

I assume there are no other jobs in the queue. At the beginning of the
negotiation cycle you have only partitionable slots. According to
NEGOTIATOR_PRE_JOB_RANK the 4-core slots will have higher priority, which is
exactly what you want. But since you did not specify request_cpus, only ONE
core is taken from each partitionable slot (slot1@xxxxxxxxxxxxxxx) and a new
dynamic slot (slot1_1@xxxxxxxxxxxxxxx) is created. In the same negotiation
cycle the 2-core partitionable slots are also available, and the same thing
happens to them.

Result: 10 new slots will be created (5 on big machines and 5 on small machines)
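
As a side note, the per-node CPU request can be made explicit in the submit
file. A minimal sketch (the value shown is just the default; raising it makes
the negotiator carve a correspondingly larger dynamic slot for each node):

    # per-node resources; request_cpus defaults to 1 when omitted
    request_cpus   = 1
    request_memory = 500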

> - with the Requirements set as above, each of the big machines will run
> exactly two nodes instead of 4+4+2+0+0
As in the previous case, except that in the first negotiation cycle the 2-core
partitionable slots are not considered at all because of the job's own
Requirements. In the next negotiation cycle each "4-core" partitionable slot
has only 3 Cpus left, but TotalCpus is still 4, so the Requirements still
match. Therefore in that cycle another 5 dynamic slots are carved from the
"4-core" partitionable slots (named slot1_2@xxxxxxxxxxxxxxx).

Result: Each big machine ends up with two dynamic slots (nodes).
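
If you want to watch this happening, something along these lines (untested
here) should print the remaining versus total cores of each partitionable slot
between cycles:

    condor_status -constraint 'PartitionableSlot =?= True' \
        -format "%s " Name -format "Cpus=%d " Cpus -format "TotalCpus=%d\n" TotalCpus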

I hope this explanation helps you.

Regards,
Lukas

> - not all out.* and err.* files get written (the pattern looks semi-random)
> - all of them identify as "rank 0" of "size 1"
>
> Condor version is 7.6.0 (and should include the fixes of ticket 986 which
> went into 7.5.6).
>
> How can I debug this?
>
> Cheers,
>  Steffen
> --
> Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
> Cluster Admin * --------------------------------- * http://www.aei.mpg.de/
> * e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7274,fax:7298}