
[Condor-users] What happens when an MPI job hangs?



We are looking into using the latest Condor to manage MPI jobs in a Concurrent Computing class. When we launch jobs with plain "mpirun", we have trouble killing them: killing one process does not kill the other processes that mpirun spawned.

We've read that both PBS and SGE can "sense" that the head node (process 0) has died and then clean up (kill, and clear the sockets of) the other processes, which would otherwise block waiting for communication with process 0.

Is there similar functionality in Condor? If I submit an unsafe MPI job and it hangs, will condor_rm take care of the process cleanup?

Thanks,

Matt
University of Arkansas