
[Condor-users] MPI exe stalls



Hi,

I have a C MPI program that runs a brute-force solution to the
Traveling Salesman Problem (TSP).  I am running it under Condor 6.8
on Fedora 5 with mpich 1.2.4.

My code keeps stalling in mid-execution.  I was wondering if an MPI guru
could assist me.  I really hate to ask, but I've tried everything I can
think of and I'm a little new to MPI.

My code is at http://gis.sis.pitt.edu/temp/chris/mpi/mpi2/tspRunOneBranch.c
The output for one machine is at
http://gis.sis.pitt.edu/temp/chris/mpi/mpi2/tspRunOneBranch0.out

I understand the code may be a little big.  I can easily narrow down where
the problem occurs, but I can't find the solution.

The execution is stalling at two points.

1) As you can see, the coordinator function runs on rank 0.  It seems to run
up to the point directly before the while loop at while(globalLeaveAloneIndex <=
dimensionsOfGraph - 1);  Hence, it never reaches the MPI_Recv in the loop.
However, when the MPI_Recv is outside the loop, it is executed.  Why won't
the code enter this loop?

2) The worker seems to run right up to the message acknowledging that the
request was sent, and no further:
fprintf (stdout, "worker:  sent get request to coordinator.\n");

Why are these stalling at these points?  It's almost as if one process is
waiting for the other.  The execution just seems to sit there when I look at
the temporary execution bins Condor creates.
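
One thing that may be worth ruling out when reading the .out files: C stdio
normally buffers stdout fully when it is redirected to a file, so the last
line that shows up in the file is not necessarily the last fprintf that ran.
A minimal sketch of a logging helper that flushes after every line is below.
The name log_line is hypothetical and is not part of tspRunOneBranch.c; it
only shows the idea.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not in the original program): print one tagged
 * progress line and flush it immediately, so the line reaches the output
 * file even if the process blocks in the very next call.
 * Returns 0 on success, -1 if writing or flushing failed.
 */
static int log_line(FILE *out, const char *who, const char *msg)
{
    if (fprintf(out, "%s:  %s\n", who, msg) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

A call such as log_line(stdout, "worker", "sent get request to coordinator.")
would stand in for the existing fprintf.  If the stall points still look the
same with flushing in place, buffering is not what hides the real location.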

Also, it seems that output from all ranks appears on each machine.
Shouldn't MPI run each rank of my C code on a separate machine?
For example, shouldn't the output file contain output from rank 0 only?

Sincerely,

Christopher Jon Jursa
Geoinformatics Laboratory
School of Information Sciences
University of Pittsburgh
web: http://gis.sis.pitt.edu
email: cjursa@xxxxxxxxxxxx
phone: 412-624-8858