Re: [Condor-users] What happens when an MPI job hangs?



On Thu, Feb 16, 2006 at 04:52:44PM -0600, Matt Baker wrote:
> We are looking into using the latest Condor to manage MPI jobs in a  
> Concurrent Computing class. We have a problem killing MPI jobs using  
> just "mpirun", since killing one process does not kill the other  
> processes that were spawned when calling mpirun.
> 
> We've read that both PBS and SGE have the ability to "sense" that the  
> head node (process 0) has died and can clean up (kill and clear  
> sockets) the other processes that block waiting for communication  
> with process 0.
> 
> Is there a similar functionality in Condor? If I submit an unsafe MPI  
> job and it hangs, will condor_rm take care of the process cleanup?
> 

Yes. That's probably the best way to think about what Condor provides for
parallel jobs:

A) Allocation of resources to run the jobs on
B) A way to start processes on those machines, and a guarantee that when
Condor de-allocates a machine, any processes that were spawned on it will be
killed.

That's it. Condor does nothing else for a parallel job - it knows nothing
about MPD, or lamboot, or Myrinet, or LINDA, or anything else. Condor
doesn't know what sockets your MPI job has open, and no MPI messages
flow through any of the Condor daemons*. The only primitives it provides are
allocating machines and letting processes be started on those machines.
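
To make that concrete, a parallel universe submit file is really just a
request for those two primitives. Here's a minimal sketch; the paths, the
machine count, and the file names are placeholders you'd adjust for your
own job, and only the submit commands themselves are meant literally:

  ## Sketch of a parallel universe submit file (placeholder names)
  universe        = parallel
  executable      = my_wrapper_script
  arguments       = my_mpi_binary
  transfer_input_files    = my_mpi_binary
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  machine_count   = 4
  log             = mpi_job.log
  output          = mpi_job.out.$(NODE)
  error           = mpi_job.err.$(NODE)
  queue

Condor claims machine_count slots, runs the executable once on each of them,
and kills anything those processes leave behind when the claims are released;
that's A and B above, and nothing more.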

The way the parallel universe manages starting new processes is condor_ssh.
If you use condor_ssh in your parallel job, you don't have to worry about
what user id you're running as on the remote execution side, or about
setting up shared ssh keys, NIS map files, passwordless ssh, or anything
like that;
condor_ssh sets all of that up for you, and when the machine is deallocated
it tears all of that setup down. Because condor_ssh creates jobs under 
condor_sshd, an execution node knows all of the processes started under
it and can clean them up when the job exits. 

If your MPI job needs to do something like start up MPD or lamd or whatever,
you can create a script that sets all of that up. The Condor daemons don't
know and don't care about it. We've included example scripts for MPICH 1.2
and LAM in the Condor distribution to handle that. We really should include
one for MPICH2 as well, so if anyone has one they'd like to contribute,
please share it with the list.
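
Boiled down, the MPICH 1.2 script we ship looks roughly like the sketch
below. The _CONDOR_PROCNO, _CONDOR_NPROCS and _CONDOR_SCRATCH_DIR
environment variables and the condor_ssh / sshd.sh / sshd_cleanup helpers
are what that script relies on, but treat this as a sketch and check the
copy in your distribution for the exact details:

  #!/bin/sh
  # Sketch of an MPICH 1.2 (ch_p4) wrapper for the parallel universe.
  # See the mp1script shipped with your Condor release for the real thing.

  # condor_ssh and sshd.sh live in Condor's libexec directory
  CONDOR_LIBEXEC=`condor_config_val libexec`
  CONDOR_SSH=$CONDOR_LIBEXEC/condor_ssh
  SSHD_SH=$CONDOR_LIBEXEC/sshd.sh

  # Source sshd.sh to start the per-node sshd that condor_ssh talks to
  # (it also defines the sshd_cleanup function used below)
  . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

  # Every node except node 0 just keeps its sshd alive and waits;
  # node 0 is the only one that actually runs mpirun.
  if [ $_CONDOR_PROCNO -ne 0 ]
  then
      wait
      sshd_cleanup
      exit 0
  fi

  # The real MPI binary is passed as the first argument; file transfer
  # clears the execute bit, so set it again.
  EXECUTABLE=$1
  shift
  chmod +x $EXECUTABLE

  # Tell MPICH's ch_p4 device to use condor_ssh instead of rsh/ssh
  P4_RSHCOMMAND=$CONDOR_SSH
  export P4_RSHCOMMAND

  # sshd.sh writes a contact file listing the allocated machines;
  # turn it into a machinefile that mpirun understands.
  CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
  sort -n < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

  mpirun -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE "$@"

  sshd_cleanup
  rm -f machines

The LAM script follows the same pattern, just with LAM's own startup and
teardown on node 0 in place of the ch_p4 machinery.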

The parallel universe has one extra feature: if the first process Condor
started on the first machine in the allocation exits, Condor will kill all
remaining processes on every other machine. This maps to the common case of
the rank 0 process of an MPI job exiting, where most people want all the
other processes killed too. We've talked about making that configurable, so
the other processes could keep running if you wanted them to, but we've
never found anyone who actually wants that behaviour, so we haven't
bothered.


-Erik

*This isn't true for the PVM universe. In the PVM universe, Condor acts
as the pvmd, and routes PVM messages itself. It also introduces 
additional PVM messages to manage adding and removing hosts from a 
running PVM.