
[Condor-users] Getting closer with Parallel Universe on Dynamic slots



... but still no cigar.

The setup consists of five 4-core machines plus several 2-core machines.
All of them have been configured as single, partitionable slots.
Preemption is forbidden completely.
The rank definitions are as follows:
RANK = 0
NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000 * Memory

I'd expect this to favour big machines over small ones (for Parallel jobs),
and partially occupied ones over empty ones.
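To spell out my reading of that expression, here is a quick back-of-the-envelope
sketch (a Python stand-in for the ClassAd arithmetic; the function name and the
memory/slot numbers are made up for illustration):

```python
# Mimics: 1000000000 + 1000000000 * (JobUniverse =?= 11) * (TotalCpus+TotalSlots)
#         - 1000 * Memory
# for a parallel-universe job (JobUniverse == 11). My assumption: on a
# partitionable machine, TotalSlots grows as dynamic slots are carved off,
# and Memory (of the partitionable slot) shrinks - so a partially occupied
# big machine should rank highest.
def pre_job_rank(is_parallel, total_cpus, total_slots, memory_mb):
    return (1_000_000_000
            + 1_000_000_000 * int(is_parallel) * (total_cpus + total_slots)
            - 1_000 * memory_mb)

# Empty 4-core machine (one partitionable slot), 8000 MB free:
big_empty = pre_job_rank(True, 4, 1, 8000)   # 5_992_000_000
# Same machine with two dynamic slots claimed (3 slots, 7000 MB left):
big_busy  = pre_job_rank(True, 4, 3, 7000)   # 7_993_000_000
# Empty 2-core machine, 4000 MB free:
small     = pre_job_rank(True, 2, 1, 4000)   # 3_996_000_000

assert big_busy > big_empty > small
```

So by my reading the negotiator should fill the partially occupied big machines
first, then the empty big ones, and only then the small ones - which is not
what I observe below.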

What I see with the following submit file, is quite different:

universe   = parallel
initialdir = /home/steffeng/tests/mpi/
executable = /home/steffeng/tests/mpi/mpitest
arguments  = $(Process) $(NODE)
output     = out.$(NODE)
error      = err.$(NODE)
log        = log
notification = Never
on_exit_remove = (ExitBySignal == False) || ((ExitBySignal == True) && (ExitSignal != 11))
should_transfer_files = yes
when_to_transfer_output = on_exit
Requirements = ( TotalCpus == 4 )
request_memory = 500
machine_count = 10

(mpitest is the ubiquitous "MPI hello world" program trying to get rank and
size from MPI_COMM_WORLD)

- if I leave the Requirements out, the 10 MPI nodes end up on the five big
machines (one per machine) plus 5 small ones
- with the Requirements set as above, each of the big machines runs
exactly two nodes, instead of the expected 4+4+2+0+0 distribution
- not all out.* and err.* files get written (the pattern looks semi-random)
- all of them identify as "rank 0" of "size 1"

Condor version is 7.6.0 (and should include the fixes of ticket 986 which 
went into 7.5.6).

How can I debug this?

Cheers,
 Steffen
-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * --------------------------------- * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7274,fax:7298}