I am looking for suggestions on the best way to schedule MPI-based jobs that spawn a large number of threads but should be scheduled on the smallest number of multi-core nodes, in the hope of keeping inter-node, inter-process communication to a minimum.
A user would like to request 64 MPI threads. With no selection criteria, Condor allocates 64 "slots," which correspond to cores across the specified architecture. We have a 40-node cluster with dual quad-core CPUs (8 cores per node) and 1 GbE interconnects, giving 320 "slots" spanning 40 nodes. In theory, the job can be sent to 64 slots spanning all 40 nodes. The nature of the algorithm requires more inter-process communication as it runs, so the user would like Condor to use the minimum number of nodes necessary to accommodate his 64 threads. (On the cluster described above, with 8 slots per node, he would want to span only 8 nodes.)
I have found that in the Condor submit file if we use a construct such as:
requirements = (Subnet == "192.168.45")
+RequiresWholeMachine = True
we can have the MPI job take up, reserve, and use one whole machine (i.e., one node). That works for jobs up to the core count of a single node (our largest cluster nodes have 8 cores per node), but we have not found a way to construct a job submission that spans multiple whole machines. (If the MPI job needs more threads than there are cores, then with the submit-file construct above, Condor spawns all of the requested job threads on that one node rather than migrating them to other nodes, driving up the machine load on that one node.) Please let me know if I am using the wrong requirements parameter and there are others I should be looking at, or whether this is simply not possible. I believe it should be possible, since it is a feature of other batch submission systems, and I suspect I am just not looking at the right Condor submit constructs.
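For concreteness, this is the shape of submit file I imagine should exist; it is hypothetical, and the combination of the parallel universe's machine_count with the whole-machine attribute is a guess at the right construct, which is essentially my question:

    universe = parallel
    executable = my_mpi_wrapper.sh
    machine_count = 8
    +RequiresWholeMachine = True
    queue

Here my_mpi_wrapper.sh stands in for whatever script launches the 64-rank MPI job, and machine_count = 8 expresses "give me 8 whole 8-core nodes"; I do not know whether these two settings actually compose this way.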
Thank you for any help you can provide.