
Re: [Condor-users] Keeping Parallel Universe job alive even node0 is done

Natarajan, Senthil wrote:


I am trying to test a simple MPICH2 example (using Condor 7.0.5, MPICH2 1.0.8): MPI code that calculates the value of pi.

I am testing this with 3 nodes. As soon as node 0 is done, Condor shuts down node1 and node2 even though the jobs on them have not finished.

I know this is the way Condor is supposed to work, but is there any workaround to keep the job alive until all the nodes are done?


In your job submit file that you give to condor_submit, add the following line:

+ParallelShutdownPolicy = "WAIT_FOR_ALL"

(yes, it needs to start with a plus sign)
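For context, a minimal parallel-universe submit file using this attribute might look like the following sketch. The wrapper script, MPI binary name, and file paths here are hypothetical placeholders; adjust them for your setup:

```
universe      = parallel
executable    = mp2script     # hypothetical MPICH2 wrapper script
arguments     = cpi           # hypothetical MPI binary that computes pi
machine_count = 3
log           = pi.log
output        = pi.out.$(Node)
error         = pi.err.$(Node)

# Wait for every node to exit before considering the job finished.
# The leading + marks this as a custom job ClassAd attribute.
+ParallelShutdownPolicy = "WAIT_FOR_ALL"

queue
```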

If the job attribute ParallelShutdownPolicy is set to the string "WAIT_FOR_ALL", Condor waits until every node in the parallel job has completed before considering the job finished. If the attribute is not set, or is set to any other string, the default policy is in effect: when the first node exits, the whole job is considered done, and Condor kills all other running nodes in that parallel job.

Hope this helps,