
[Condor-users] problems with MPI-example in manual

Being a complete newbie to Condor, I may be overlooking something very
simple here, but alas, that's what these lists are for, I guess.

I literally copied the little MPI example on pages 61/62 of the manual
(for Condor 7.0.5, section 2.9), leaving out only the lines that do
something with stdin, and then built and linked the program
(I tried both mpich2 and openmpi; both give the same result).
Then I copied the submit file on page 61 and submitted the job. It runs
nicely, but:

Contents of output.0:
Printing to out... 0
Contents of output.1:
Printing to out... 0
Contents of output.2 and output.3: (empty)

This is clearly not the intended result. There are two important issues:
- The node numbers are incorrect; apparently the executable cannot
obtain the correct node number.
- Nodes 2 and 3 are apparently killed before they even get a chance to
produce output.

The latter could have to do with a very small, well-hidden remark that I
found in the manual: 'When the first process exits, Condor shuts down
all the others, even if they have not completed their execution.'
This would make sense, because the first two nodes are typically
allocated on one machine and the others on another. On the other hand,
for this particular example that is extremely dumb behavior, all the
more so because not a single MPI implementation works this way. Even
worse, this is a very confusing example of what I call
'non-documentation', or documentation that is wrong, since the program
works differently than the documentation states (non-documentation
usually gets me extremely irritated; I apologize if this is reflected in
this mail).

So I have two questions:
- How do I get the behavior of mpirun (which IMHO is what one would
expect from Condor), namely that all nodes finish their work before the
job finishes?
- Why does the call to MPI_Comm_rank(MPI_COMM_WORLD, &myid) in the
example not return the right node number?

(And actually, I could repeat the whole story for Fortran 90, for which
I get precisely the same behavior and results.)

Yours sincerely,
Jakob van Bethlehem

Kapteyn Astronomical Institute
Groningen, the Netherlands