On Tue, Oct 26, 2004 at 03:30:02PM -0700, David E. Konerding wrote:
From the manual:
Hi,
I am interested in running an MPI job on my cluster (which is already running Condor 6.6.6), but within the vanilla
universe (there are some restrictions to the MPI universe setup which we cannot abide by).
The vanilla universe has all of the same restrictions as the MPI universe (they're nearly identical code-bases) - what is giving you trouble?
We've solved the ssh problem by running the MPICH mpd daemon instead of the regular mpirun job startup.
In the past, I've used Sun Grid Engine and PBS; I submitted a job asking for "N nodes"; the batch queueing system would basically wait until N nodes were free. When the job ran, the batch system would start my job on the "first" machine of the N, and provide me with a file listing all the nodes ($PBS_NODEFILE is an env var pointing to the file). At that point, I could run mpirun with the machines file being the list of nodes. The batch system would properly manage the nodes, in that they would be marked as being used, rather than schedule more jobs there.
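For readers who haven't used that workflow, here is a minimal Python sketch of the step described above: read the node list the batch system provides and build the mpirun command line from it. The helper name `mpirun_command` is made up for illustration, and the `-np` / `-machinefile` flags are what MPICH's mpirun accepts; other MPI implementations may spell them differently.

```python
import os

def mpirun_command(nodefile, program, args=()):
    """Build an mpirun argv using one rank per line in the node file.

    Hypothetical helper; assumes the node file has one host per line,
    which is how PBS typically writes $PBS_NODEFILE.
    """
    with open(nodefile) as fh:
        nodes = [line.strip() for line in fh if line.strip()]
    # Assumption: MPICH-style flags (-np, -machinefile).
    return ["mpirun", "-np", str(len(nodes)),
            "-machinefile", nodefile, program, *args]

if __name__ == "__main__":
    # Inside a PBS job, the batch system sets PBS_NODEFILE for us.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        print(" ".join(mpirun_command(nodefile, "./my_mpi_program")))
```

The function only builds the argument list; actually running it (for example with `subprocess.run`) is left to the job script.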
The reason we don't do it this way is that there's no way for the batch system to clean up - mpirun just fires off ssh or rsh. No cleanup of the execute environment, no cleanup of errant processes, you have to set up ssh keys for all of the users beforehand... we went with a more managed solution.
You can, however, write a Perl program that submits vanilla jobs to Condor and
watches the userlog to see when they start running, and where they start
running, and then runs mpirun on those machines. It can keep watching and
see if any of the nodes gets evicted, and it can tear down the rest of the
MPI job. We call a program like this a "coordinator". If you submit your
coordinator program as a Condor job under the "scheduler" universe, it
will start running right away on your submit node - DAGMan does exactly this.
(In fact, this is why we call it the "scheduler" universe - it's meant to be
for jobs that schedule other jobs)
That's an interesting approach. I'll give it some consideration.
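To make the "coordinator" idea concrete, here is a rough Python sketch of its core logic (the suggestion above was Perl; Python is used here only for illustration). It scans the text of a Condor userlog, records which host each job started on, notes evictions, and decides whether to wait, launch mpirun, or tear down. The function names are made up, and the log-line shapes in the comments (event 001 for "executing", 004 for "evicted") are assumptions based on the classic userlog format; check them against the logs your Condor version actually writes.

```python
import re

# Assumed userlog line shapes (verify against your Condor version):
#   001 (12.000.000) 10/26 15:30:02 Job executing on host: <10.0.0.5:32779>
#   004 (12.000.000) 10/26 15:45:10 Job was evicted.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \S+ \S+ (.*)$")
HOST_RE = re.compile(r"<([^:>]+)")

def scan_userlog(text):
    """Return ({job_id: host} for running jobs, set of evicted job ids)."""
    running = {}
    evicted = set()
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # separator lines ("...") and event details
        code, cluster, proc, rest = m.groups()
        job = f"{int(cluster)}.{int(proc)}"
        if code == "001":            # job started executing
            host = HOST_RE.search(rest)
            if host:
                running[job] = host.group(1)
        elif code == "004":          # job was evicted
            running.pop(job, None)
            evicted.add(job)
    return running, evicted

def next_action(running, evicted, nodes_wanted):
    """Decide what the coordinator should do next."""
    if evicted:
        return "teardown"   # a node was lost: stop the rest of the MPI job
    if len(running) >= nodes_wanted:
        return "launch"     # all placeholder jobs are up: run mpirun
    return "wait"
```

A real coordinator would call `scan_userlog` in a polling loop on the log file, write the hosts from `running` into a machines file when `next_action` returns "launch", and remove the placeholder jobs on "teardown"; those steps are omitted here.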
Thanks, Dave