
[Condor-users] MPI jobs in the vanilla universe



Hi,

I am interested in running an MPI job on my cluster (which is already running Condor 6.6.6), but within the vanilla
universe (there are some restrictions to the MPI universe setup which we cannot abide by).


In the past, I've used Sun Grid Engine and PBS. There, I would submit a job asking for N nodes, and the batch queueing system would wait until N nodes were free. When the job ran, the batch system would start it on the "first" of the N machines and provide a file listing all the allocated nodes (under PBS, the environment variable $PBS_NODEFILE points to that file). I could then run mpirun with that list as the machines file. The batch system also managed the nodes properly: they were marked as in use, so no other jobs were scheduled there.
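
To make the PBS-side workflow above concrete, here is a minimal Python sketch of what the job does once it starts on the first node. It is an illustration under assumptions, not a definitive recipe: the helper names (read_nodes, build_mpirun_command) and the program name ./my_mpi_program are hypothetical, and the -np and -machinefile flags are the MPICH-style spelling, which may differ for other MPI implementations.

```python
import os
import shlex


def read_nodes(nodefile_path):
    """Return the host names listed in a PBS-style node file, one per line."""
    with open(nodefile_path) as fh:
        return [line.strip() for line in fh if line.strip()]


def build_mpirun_command(nodefile_path, program, args=()):
    """Build an mpirun argument list that uses the node file as the machines file.

    Assumption: MPICH-style flags (-np, -machinefile); check your MPI's mpirun.
    """
    nodes = read_nodes(nodefile_path)
    return ["mpirun", "-np", str(len(nodes)),
            "-machinefile", nodefile_path, program, *args]


if __name__ == "__main__":
    # PBS sets PBS_NODEFILE for the job; outside PBS it is simply absent.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile is None:
        print("PBS_NODEFILE is not set; not running under PBS")
    else:
        # "./my_mpi_program" is a hypothetical placeholder for the real binary.
        cmd = build_mpirun_command(nodefile, "./my_mpi_program")
        print(" ".join(shlex.quote(part) for part in cmd))
```

The sketch only prints the command rather than executing it, so it can be checked safely before being wired into a real job script.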

I've checked the manual, and I can't find an equivalent in Condor. The 'machine_count' directive in the Condor job file doesn't seem to apply to vanilla jobs, and I can't find any other way to schedule a group of machines together.

Any suggestions? Suggestions involving obscure ClassAd hackery are quite welcome.

Dave