
Re: [Condor-users] MPI jobs in the vanilla universe

Erik Paulson wrote:

On Tue, Oct 26, 2004 at 03:30:02PM -0700, David E. Konerding wrote:


I am interested in running an MPI job on my cluster (which is already running Condor 6.6.6), but within the vanilla
universe (there are some restrictions to the MPI universe setup which we cannot abide by).

The vanilla universe has all of the same restrictions as the MPI universe (they're nearly identical code-bases) - what is giving you trouble?

From the manual:

> Administratively, Condor must be configured such that resources (machines) running MPI jobs are
> dedicated.

Not sure what that means, but it sounds to me like we would have to statically configure nodes to run
MPI jobs, fully exclusive of vanilla jobs (following the docs from the user MPI section, 2.10, through to the admin MPI section, 3.10.10, shows that you have to set up a dedicated scheduler that manages dedicated
resources). We've always used the pool for a mix of MPI and single-process jobs, so this is undesirable.

> This leads to a further restriction that jobs submitted to execute under the MPI
> universe (with dedicated machines) must be submitted from the machine running as the dedicated
> scheduler.

We would normally be starting these jobs from a laptop, far away from the pool. The laptop runs Windows; the pool runs Linux. So this is a constraint we cannot satisfy; we don't want to have to ssh into the pool to start the job.

In the past, I've used Sun Grid Engine and PBS; I submitted a job asking for "N nodes", and the batch queueing system would wait until N nodes were free. When the job ran, the batch system would start my job on the "first" machine of the N and provide me with a file listing all the nodes ($PBS_NODEFILE is an env var pointing to the file). At that point, I could run mpirun with that file as the machines file. The batch system would properly manage the nodes, marking them as in use rather than scheduling more jobs there.
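The PBS workflow described above can be sketched roughly as follows (in Python rather than a shell batch script, for concreteness). $PBS_NODEFILE is the real PBS variable; the -np/-machinefile flags are MPICH-style and vary by MPI flavor, and "./my_mpi_app" is a hypothetical application name:

```python
import os
import subprocess

def parse_nodefile(text):
    """Parse a PBS-style machines file: one hostname per line,
    repeated once per slot allocated on that host."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Only meaningful inside a PBS job, where PBS sets $PBS_NODEFILE
# on the "first" allocated node.
if "PBS_NODEFILE" in os.environ:
    nodefile = os.environ["PBS_NODEFILE"]
    with open(nodefile) as f:
        nodes = parse_nodefile(f.read())
    # -np and -machinefile are MPICH-style flags; other MPIs differ.
    subprocess.run(["mpirun", "-np", str(len(nodes)),
                    "-machinefile", nodefile, "./my_mpi_app"],
                   check=True)
```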

The reason we don't do it this way is that there's no way for the batch system to clean up - mpirun just fires off ssh or rsh. No cleanup of the execute environment, no cleanup of errant processes, and you have to set up ssh keys for all of the users beforehand... we went with a more managed solution.

We've solved the ssh problem by running the MPICH mpd daemon instead of the regular mpirun job-startup
mechanism. This handles cleanup of errant processes, and ssh keys are not required. Another approach is to use Condor itself as the job launch mechanism; this is analogous to the PBS mpiexec feature, which uses the PBS multi-node job launch mechanism to start all the MPI processes on their nodes.

You can, however, write a perl program that submits vanilla jobs to Condor and
watches the userlog to see when they start running, and where they start
running, and then runs mpirun on those machines. It can keep watching and
see if any of the nodes gets evicted, and it can tear down the rest of the
MPI job. We call a program like this a "coordinator". If you submit your
coordinator program as a Condor job under the "scheduler" universe, it
will start running right away on your submit node - DAGMan does exactly this.
(In fact, this is why we call it the "scheduler" universe - it's meant to be
for jobs that schedule other jobs.)
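A minimal sketch of such a coordinator, in Python rather than perl for illustration: the userlog event codes it watches (001 = job executing on host, 004 = job evicted) are standard Condor userlog events, but the submit file name, log name, poll interval, mpirun flags, and application name are all assumptions, not something Condor prescribes:

```python
import re
import subprocess
import time

# Condor userlog events of interest:
#   001 (<cluster>.<proc>.<sub>) ... Job executing on host: <ip:port>
#   004 (<cluster>.<proc>.<sub>) ... Job was evicted.
EXEC_RE = re.compile(r"^001 \((\d+)\.\d+\.\d+\).*executing on host: <([\d.]+):\d+>")
EVICT_RE = re.compile(r"^004 \((\d+)\.\d+\.\d+\)")

def scan_userlog(text):
    """Return ({cluster_id: host}, {evicted cluster_ids}) from userlog text."""
    running, evicted = {}, set()
    for line in text.splitlines():
        m = EXEC_RE.match(line)
        if m:
            running[int(m.group(1))] = m.group(2)
            continue
        m = EVICT_RE.match(line)
        if m:
            evicted.add(int(m.group(1)))
    return running, evicted

def coordinate(n_nodes, userlog="coord.log", submit_file="node.sub"):
    """Submit n_nodes vanilla placeholder jobs, wait until all are
    executing, then run mpirun against the hosts they landed on.
    The condor_submit/mpirun invocations here are a sketch only."""
    for _ in range(n_nodes):
        subprocess.run(["condor_submit", submit_file], check=True)
    while True:
        with open(userlog) as f:
            running, evicted = scan_userlog(f.read())
        if evicted:
            # A node was lost; a real coordinator would tear down
            # the rest of the MPI job here.
            raise RuntimeError("node evicted, tearing down MPI job")
        if len(running) >= n_nodes:
            break
        time.sleep(10)
    with open("machines", "w") as f:
        f.write("\n".join(running.values()) + "\n")
    subprocess.run(["mpirun", "-np", str(n_nodes),
                    "-machinefile", "machines", "./my_mpi_app"],
                   check=True)
```

Submitted under the scheduler universe, `coordinate()` would run on the submit node itself; eviction handling here just aborts, where a production coordinator would condor_rm the surviving placeholder jobs.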

That's an interesting approach. I'll give it some consideration.