
Re: [Condor-users] MPICH2 question



Hello

Thank you for the answer. I have found the problem.

The problem is not MPICH2 but the mp2script that comes with Condor.
It has three problems:

1. These lines fail:
--------------------------
while [ $num_hosts -ne $_CONDOR_NPROCS ]
do
	num_hosts=`mpdtrace | wc -l`
.
.
.
done
-------------------------

This is because $_CONDOR_NPROCS has a value of 6, but mpdtrace returns
only 2 lines (there are only 2 machines in the ring). num_hosts therefore
never equals $_CONDOR_NPROCS, and the while loop always ends with an error.

I changed this part of the script and now it works well.

2. Another problem with the mp2script is that it tries to start an mpd on
each slot, which produces this error:
---------
An mpd is already running with console at /tmp/mpd2.console_condor on
vm-ubuntu64.xxxxx. 
Start mpd with the -n option for a second mpd on same host.
--------
But this is not a critical problem.

3. The other problem, which you mentioned, is that the script does not
prepare a machine file and calls mpiexec without one:
mpiexec -n $_CONDOR_NPROCS $EXECUTABLE $@
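
A minimal sketch of how points 1 and 3 could be handled. It is not the actual change made to mp2script. It assumes a list with one host name per allocated slot is available; the helper names and the file slot_hosts.txt are hypothetical. The idea is to wait until the ring size equals the number of distinct hosts rather than $_CONDOR_NPROCS.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the mp2script shipped with Condor.

# Print the number of distinct host names read from stdin
# (input: one host name per slot, so a host may appear several times).
count_unique_hosts() {
    sort -u | wc -l | tr -d ' '
}

# Poll a command until it lists the expected number of hosts, or give up.
#   $1    expected number of hosts in the ring
#   $2    maximum number of attempts (one second apart)
#   $3... command that prints one host per line (e.g. mpdtrace)
wait_for_ring() {
    expected=$1
    attempts=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        num_hosts=$("$@" | wc -l | tr -d ' ')
        [ "$num_hosts" -eq "$expected" ] && return 0
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

In the script this might be used roughly as follows, assuming the mpiexec in use accepts a -machinefile option and slot_hosts.txt lists each host once per slot:

    expected=$(count_unique_hosts < slot_hosts.txt)
    wait_for_ring "$expected" 30 mpdtrace || exit 1
    mpiexec -machinefile slot_hosts.txt -n $_CONDOR_NPROCS $EXECUTABLE $@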

Regards

Antoni Artigues

On Fri, 14-05-2010 at 11:06 +0200, Henning Fehrmann wrote:
> 
> Hello Antoni,
> 
> On Thu, May 13, 2010 at 03:20:43PM +0200, antoni artigues wrote:
> > Hello
> > 
> > Sorry, but I have another question again.
> > 
> > Here is my problem:
> > 
> > I have two machines, A and B. Machine A has 4 CPUs and machine B has 2
> > CPUs.
> > 
> > I want to launch a MPI(MPICH2) job that needs 6 processes. But I can't
> > do it with Condor.
> 
> If you launch it in a parallel universe all free slots should be
> assigned to this MPI job. The crucial thing is to prepare a machines list for 
> the MPI-universe. If one node provides two slots it should appear twice in this list.
> 
> We are using OpenMPI but I can't see a reason why it shouldn't work with MPICH.
> 
> 
> > Finally a single slot is responsible for starting the MPI job.
> > ------------CONFIGURATION 1----------------
> > NUM_SLOTS = 1 and NUM_CPUS= 4 for A
> > NUM_SLOTS = 1 and NUM_CPUS= 2 for B
> > 
> > in the job definition I put:
> > machine_count = 2
> > Because there are two machines on the cluster. But, how can I specify
> > that I want 6 processes for the mpi? Is there any configuration
> > parameter on the job definition?
> > 
> > -----------CONFIGURATION 2-----------------
> > NUM_SLOTS = 4 and NUM_CPUS= 4 for A
> > NUM_SLOTS = 2 and NUM_CPUS= 2 for B
> > 
> > in the job definition I put:
> > machine_count = 6
> > 
> > But the MPI execution fails, because Condor tries to start more than one
> > mpd on the same machine: the mp2script starts an mpd process for
> > each slot.
> 
> machine_count = 6 is correct. If only one mpd process can run per node, then MPICH2 is not the
> right candidate. Can you run 6 processes on two nodes without using Condor?
> 
> Cheers,
> Henning