
[Condor-users] Keeping Parallel Universe job alive even after node0 is done


I am trying to run a simple MPICH2 example (Condor 7.0.5, MPICH2 1.0.8): an MPI program that calculates the value of pi.
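For reference, a Parallel universe submit description for a 3-node run looks roughly like the sketch below. The script name `mp2script` and the program name `pi_mpi` are assumptions (my local wrapper and binary names), not taken from the original post; Condor ships example wrapper scripts for MPI implementations that you would adapt.

```
# Sketch of a Condor Parallel universe submit file (names are placeholders)
universe      = parallel
executable    = mp2script      # wrapper script that launches MPICH2
arguments     = pi_mpi         # the actual MPI binary
machine_count = 3              # node0, node1, node2
should_transfer_files = yes
when_to_transfer_output = on_exit
log           = pi.log
output        = pi.out.$(NODE)
error         = pi.err.$(NODE)
queue
```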


I am testing this with 3 nodes. As soon as node0 finishes, Condor shuts down node1 and node2 even though the processes on them have not finished.

I know this is the way Condor is supposed to work, but is there any workaround to keep node0 alive until all the nodes are done?


Because the individual nodes are geographically distributed and subject to network latency, node0 finishes first, which causes the other nodes to die and hence kills the whole Parallel universe job.
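One possible workaround at the application level (a sketch, not something from the Condor docs): have every rank, including rank 0, block on MPI_Barrier before MPI_Finalize, so rank 0 on node0 cannot exit until the slower ranks reach the end. The pi code below is my own minimal reconstruction of a standard MPI pi example, not the poster's actual program.

```c
/* Minimal MPI pi sketch: rank 0 waits at a barrier so it does not
 * exit (and let Condor tear down the job) before the other ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, i, n = 1000000;
    double h, x, sum = 0.0, mypi, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank integrates its slice of 4/(1+x^2) over [0,1] */
    h = 1.0 / (double)n;
    for (i = rank; i < n; i += size) {
        x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* keep rank 0 alive until every rank has arrived here */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}
```

Whether this helps depends on where the time is lost: if node0's rank truly finishes all communication first, the barrier delays its exit until the others catch up; it does not change how Condor itself treats node0's exit.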