
Re: [Condor-users] mpi and dedicated scheduler configuration



Hello,

I used to have a setup very similar to yours, Mike.  Then I found a bug that
arises when you use a Linux submitter with Windows execute nodes for MPI jobs.

I had a Linux server acting as the central manager and submitter, and all my
execute nodes were Windows.  I compiled my code with Cygwin.  The code ran fine
when I launched it with the GUI NT mpirun program included in the MPICH package
from http://www-unix.mcs.anl.gov/mpi/mpich/.  I could also run my MPI code on
Condor so long as machine_count = 1.  With anything greater, Condor would crash
and burn.
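
For reference, a minimal sketch of the kind of MPI-universe submit file I'm
describing (the executable and log names here are placeholders, not my actual
files):

-- snippet --
# Sketch of an MPI-universe submit file; names are placeholders.
universe      = MPI
executable    = my_mpi_prog.exe
# machine_count = 1 worked for me; anything greater crashed.
machine_count = 1
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
log    = mpi_job.log
output = mpi_job.out
error  = mpi_job.err
queue
-- snippet --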

I sent a bug report to condor-admin around November 2003.  It turned out to be
a larger bug in Condor than they had expected, and it is not fixed to this day.
It is a documented bug; I saw it documented some time in 2004 at
http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html:

-- snippet --
Condor 6.6.1 Release notes:
Known Bugs:
    * Submission of MPI jobs from a Unix machine to run on Windows machines (or
vice versa) fails for machine_count > 1. This is not a new bug. Cross-platform
submission of MPI jobs between Unix and Windows has always had this problem.
-- snippet --

Now I use a Windows machine as the central manager and submitter.  I've
installed as many Unix tools as I needed to make the server more friendly
(Cygwin with its support tools such as sshd, etc.).

I now run MPI jobs successfully with a Windows submitter.  I should also point
out that I use MPICH NT 1.2.5.  I have always used this version, and I know the
Condor documentation specifically notes that 1.2.5 is not supported.  Even so,
I have not suffered any MPI-related problems in my all-Windows pool.

However, I have uncovered what I think is a bug in the file transfer mechanism
when running MPI jobs on a Windows pool.  As the number of files to transfer
(transfer_input_files) and the machine_count value rise, the chance of the file
transfer failing gets very high - to the point that you can assume failure.  I
haven't heard many others talk about this, though I don't know how many people
are using a Windows pool to run MPI jobs the way I am.  I submitted the bug to
condor-admin a few months ago, but I have not received many replies back.  The
few replies I did receive simply stated that the developers were too busy to
read through the logs I sent in.  About a month ago I posted the problem to
this mailing list (dig through the archives and it should pop up).
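
To make the failure mode concrete, here is a sketch of the kind of submit file
that triggers it for me - the input file names are made up, but the shape
(several input files combined with a larger machine_count) is what matters:

-- snippet --
# Sketch only: several input files plus a high machine_count is the
# combination that makes file transfer fail in my all-Windows pool.
# File names and the OpSys value are placeholders.
universe      = MPI
executable    = my_mpi_prog.exe
machine_count = 8
requirements  = (OpSys == "WINNT50")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = mesh.dat, params.txt, coeffs.bin, restart.chk
log = mpi_job.log
queue
-- snippet --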

With all this said, if you successfully run Win32 MPI code from a Linux server
on two or more Windows execute nodes, let me know!  I'd be very interested in
your exact setup.  Cheers and good luck.

Regards,
Vahid



--- Mike Busch <zenlc2000@xxxxxxxxx> wrote:
> Erik,
> 
> You say, 
> 
> > With the vanilla universe, you won't be able to allocate multiple
> > machines in any sort of a group - you run the risk of a single node
> > disappearing.  With the MPI universe, the loss of a single node tells
> > Condor to shut down all of the other machines, since Condor assumes
> > your MPI implementation has no fault tolerance.
> 
> Let's say I'm willing to accept the lack of fault tolerance just for
> the sake of proving the concept.  Is it possible to submit an NT-MPICH
> job to a Linux Manager and have it run on a Win2k pool in the vanilla
> universe?


> 
> Thanks!
> Mike
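
For what it's worth, the closest vanilla-universe equivalent I know of is
simply queueing N independent jobs - a sketch with placeholder names, and note
that Condor will neither start these together nor stop the others if one node
disappears, which is exactly the limitation Erik describes:

-- snippet --
# Sketch: N independent vanilla jobs, no coordinated allocation.
universe   = vanilla
executable = my_mpi_prog.exe
requirements = (OpSys == "WINNT50")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
log    = node_$(Process).log
output = node_$(Process).out
queue 4
-- snippet --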


=====
< NPACI Education Center on Computational Science and Engineering >
< http://www.edcenter.sdsu.edu/>

"A friend is someone who knows the song in your heart and can sing it back to you when you have forgotten the words."  -Unknown Author 
=====


		