Re: [Condor-users] Condor problem with MPI Jobs

Setting up MPI caused me a bit of trouble when I first tried it (but that was some time ago now).
I think I misunderstood the term "dedicated" for one thing.
I think more log files are necessary to see what is going on.
How many parallel processes is your MPI job set up to run? 2?
Have you tried the following:
* Running the job on one of the execution nodes outside of Condor? (ensures all MPI libs are OK)
* Trying the setup for 1 machine, and submitting under vanilla (ensures the submit file is copying all the
  files it needs and the MPI libs are accessible under Condor)
* Running a vanilla job that just runs "env" (on unix) to ensure that the MPI setup is there for the
  user the job runs as
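
A minimal sketch of that last check for Windows execute nodes, where there is no "env". It assumes a hypothetical one-line batch file, printenv.bat, containing only the command "set", and assumes that Condor on Windows accepts a batch file as a vanilla executable. The file names (env_check.sub, env.out, env.err, env.log) are placeholders, not anything from the original post; the requirements line is copied from the submit file quoted below.

```
# env_check.sub (hypothetical name): dump the environment the job actually sees
universe = vanilla
# printenv.bat is a hypothetical one-line batch file containing: set
executable = printenv.bat
requirements = Arch == "INTEL" && OpSys == "WINNT51"
output = env.out
error = env.err
log = env.log
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
```

If the MPICH directories are missing from PATH in env.out, the job's user is not picking up the system environment settings.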
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Natarajan, Senthil
Sent: Friday, May 19, 2006 4:24 PM
To: Condor-Users Mail List
Subject: [Condor-users] Condor problem with MPI Jobs


I have posted this a couple of times but got no response; hopefully this time I will get one.


I am trying to run an MPI job using Condor 6.6.10 on Windows. I am using the MPI implementation that Condor supports (MPICH 1.2.4).

The MPICH 1.2.4 libraries are installed properly on the Windows machines, and the path to the libraries is set in the system environment variables. I also configured the condor_config files on the execution nodes as dedicated resources suitable for running MPI jobs, following the Condor documentation.


When I submit the job, it stays idle: it reports no error and does not even try to contact the execution nodes. I have no clue what is going on.


Could someone please point out what might be the problem? I was also wondering whether the Condor MPI universe is a fully developed feature, and whether it can be used in a real production environment.


universe = MPI
executable = simplempi.exe
#executable = cpi.exe
requirements   = Arch == "INTEL" && OpSys == "WINNT51"
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit