Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] MPI jobs in the vanilla universe

Date: Wed, 27 Oct 2004 08:24:14 -0700
From: "David E. Konerding" <dekonerding@xxxxxxx>
Subject: Re: [Condor-users] MPI jobs in the vanilla universe

Erik Paulson wrote:

On Tue, Oct 26, 2004 at 03:30:02PM -0700, David E. Konerding wrote:

Hi,

I am interested in running an MPI job on my cluster (which is already running Condor 6.6.6), but within the vanilla universe (there are some restrictions to the MPI universe setup which we cannot abide by).

The vanilla universe has all of the same restrictions as the MPI universe (they're nearly identical code-bases) - what is giving you trouble?

From the manual:

> Administratively, Condor must be configured such that resources (machines) running MPI jobs are dedicated.

Not sure what that means, but it sounds to me like we would have to statically configure nodes to run MPI jobs, which would be fully exclusive of vanilla jobs (following the docs from the user MPI section, 2.10, to the admin MPI section, 3.10.10, shows that you have to set up a dedicated scheduler that manages dedicated resources). We've always used the pool as a combination of MPI and single-process jobs, so this is undesirable.

Also:

> This leads to a further restriction that jobs submitted to execute under the MPI universe (with dedicated machines) must be submitted from the machine running as the dedicated scheduler.

We would normally be starting these jobs from a laptop, far away from the pool. That laptop is running Windows; the pool is running Linux. So this is a constraint we cannot satisfy: we don't want to have to ssh into the pool to start the job.

In the past, I've used Sun Grid Engine and PBS; I submitted a job asking for "N nodes"; the batch queueing system would basically wait until N nodes were free. When the job ran, the batch system would start my job on the "first" machine of the N, and provide me with a file listing all the nodes ($PBS_NODEFILE is an env var pointing to the file). At that point, I could run mpirun with the machines file being the list of nodes. The batch system would properly manage the nodes, in that they would be marked as being used, rather than schedule more jobs there.
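
The PBS-style launch step described above can be sketched roughly as follows. This is only an illustration, not something from the thread: the helper names `read_nodefile` and `build_mpirun_command` are hypothetical, and it assumes the node file lists one host per line (one line per allocated slot) and that the local `mpirun` accepts MPICH-style `-np` and `-machinefile` options.

```python
import os

def read_nodefile(path):
    """Return the hosts listed in a PBS-style node file, one entry per slot."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

def build_mpirun_command(nodefile, program):
    """Build an mpirun command line that uses every slot in the node file.

    Assumption: mpirun takes MPICH-style -np and -machinefile options.
    """
    slots = read_nodefile(nodefile)
    if not slots:
        raise ValueError("node file %r lists no hosts" % nodefile)
    return ["mpirun", "-np", str(len(slots)), "-machinefile", nodefile, program]

# Inside a batch job one might then do something like:
#   import subprocess
#   cmd = build_mpirun_command(os.environ["PBS_NODEFILE"], "./my_mpi_program")
#   subprocess.call(cmd)
```

The command is built as a list rather than a shell string, so host names and program paths are passed to `mpirun` without any shell quoting concerns.
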
The reason we don't do it this way is that there's no way for the batch system to clean up - mpirun just fires off ssh or rsh. No cleanup of the execute environment, no cleanup of errant processes, you have to set up ssh keys for all of the users beforehand... we went with a more managed solution.

We've solved the ssh problem by running the MPICH mpd daemon instead of the regular mpirun job startup mechanism. This deals with cleanup of errant processes, and ssh keys are not required. Another approach is to use Condor itself as the job launch mechanism; this is analogous to the PBS mpiexec feature, which uses the PBS multi-node job launch mechanism to start all the MPI processes on their nodes.

You can, however, write a perl program that submits vanilla jobs to Condor and watches the userlog to see when they start running, and where they start running, and then runs mpirun on those machines. It can keep watching and see if any of the nodes gets evicted, and it can tear down the rest of the MPI job. We call a program like this a "coordinator". If you submit your coordinator program as a Condor job under the "scheduler" universe, it will start running right away on your submit node - DAGMan does exactly this. (In fact, this is why we call it the "scheduler" universe - it's meant to be for jobs that schedule other jobs.)
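
The log-watching part of such a coordinator might look something like the sketch below (in Python rather than perl, purely for illustration). It rests on assumptions that are not stated in the thread: that each userlog event starts with a header line of the form `001 (cluster.proc.subproc) MM/DD HH:MM:SS Job executing on host: <ip:port>`, that event code 001 means "executing" and 004 means "evicted", and that the function names `scan_userlog` and `ready_to_launch` are hypothetical. Check the log format of your own Condor version before relying on any of it.

```python
import re

# Assumed header format of a userlog event line (verify for your Condor version).
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) \S+ \S+ (?P<text>.*)$"
)
HOST_RE = re.compile(r"<(?P<host>[^:>]+):\d+")

EXECUTE = "001"  # assumed code for "Job executing on host"
EVICTED = "004"  # assumed code for "Job was evicted"

def scan_userlog(lines):
    """Return ({(cluster, proc): host}, evicted_flag) from userlog lines."""
    running = {}
    evicted = False
    for line in lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # event body text or the "..." separator
        job = (int(match.group("cluster")), int(match.group("proc")))
        if match.group("code") == EXECUTE:
            host = HOST_RE.search(match.group("text"))
            if host is not None:
                running[job] = host.group("host")
        elif match.group("code") == EVICTED:
            running.pop(job, None)
            evicted = True
    return running, evicted

def ready_to_launch(running, evicted, wanted):
    """True once `wanted` placeholder jobs are running and none was evicted."""
    return not evicted and len(running) >= wanted
```

A coordinator would call `scan_userlog` repeatedly on the log file, write the collected hosts to a machines file once `ready_to_launch` returns true, start mpirun, and tear the MPI job down if a later scan reports an eviction.
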

That's an interesting approach. I'll give it some consideration.

Thanks,
Dave
