
[Condor-users] MPI problem



Hello

I'm trying to run an MPI job, built with MPICH2, on my Condor cluster.

My submit description file is:
-----------------------------------------------------
universe = parallel
executable = mp2script
arguments = sim problem.input
Requirements = OpSys == "LINUX" && Arch == "X86_64"
Rank = machine == "xxxxxxxx"
log = logfile
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
queue
-----------------------------------------------------

So I am requesting 2 slots on the same machine. The job does not run,
however; here are the logs:

The output of node 0 is:
Too many retries, could not start all 2 nodes, only started 1, giving
up.  Here are the hosts I could start 

The output of node 1 is empty, and the mpd.out of node 1 contains:
An mpd is already running with console at /tmp/mpd2.console_condor on
vm-ubuntu64.intranet.iac3.eu. 
Start mpd with the -n option for a second mpd on same host.

In the log file I see:
015 (083.000.001) 05/13 12:19:53 Node 1 terminated.
	(1) Normal termination (return value 255)
015 (083.000.000) 05/13 12:23:19 Node 0 terminated.
	(1) Normal termination (return value 1)
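
To compare the per-node return values at a glance, the termination events
above can be summarised with a short script. This is only a minimal sketch,
assuming the log keeps the two-line layout shown here (a "Node N terminated."
line followed by a "return value" line); the function name node_exit_codes is
hypothetical and not part of Condor:

```python
import re

# Matches the first line of a node-termination event (event code 015),
# as it appears in the log excerpt quoted above.
EVENT = re.compile(r"^015 \(\d+\.\d+\.\d+\) \S+ \S+ Node (\d+) terminated\.")
# Matches the return value reported on the line that follows.
RETVAL = re.compile(r"return value (\d+)")


def node_exit_codes(log_text):
    """Return a dict mapping node number to its reported return value."""
    codes = {}
    node = None  # node number of the event header seen on the previous line
    for line in log_text.splitlines():
        header = EVENT.match(line)
        if header:
            node = int(header.group(1))
            continue
        if node is not None:
            value = RETVAL.search(line)
            if value:
                codes[node] = int(value.group(1))
            node = None
    return codes


# The two events copied from the log file above.
SAMPLE = (
    "015 (083.000.001) 05/13 12:19:53 Node 1 terminated.\n"
    "\t(1) Normal termination (return value 255)\n"
    "015 (083.000.000) 05/13 12:23:19 Node 0 terminated.\n"
    "\t(1) Normal termination (return value 1)\n"
)

if __name__ == "__main__":
    for node, code in sorted(node_exit_codes(SAMPLE).items()):
        print(f"node {node}: return value {code}")
```

With the excerpt above, this shows that node 1 exited first with 255 and
node 0 gave up afterwards with 1.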

Where is the problem? Why does Condor try to start a second mpd on the same
machine?

Thanks in advance

Regards

Antoni Artigues