Re: [Condor-users] mpi and dedicated scheduler configuration
- Date: Mon, 28 Jun 2004 02:01:32 -0700 (PDT)
- From: Vahid Pazirandeh <vpaziran@xxxxxxxxx>
- Subject: Re: [Condor-users] mpi and dedicated scheduler configuration
I used to have a setup very similar to yours, Mike. Then I found a bug that
arises when you use a Linux submitter with Windows execute nodes for MPI jobs.
I had a Linux server acting as the central manager and submitter, and all my
execute nodes were Windows. I compiled my code with Cygwin. The code ran fine
when I launched it with the GUI NT mpirun program included in the package at
http://www-unix.mcs.anl.gov/mpi/mpich/. I could also run my MPI code under
Condor as long as machine_count = 1; with anything greater, Condor would crash.
I sent a bug report to condor-admin around November 2003. It turned out to be
a larger bug in Condor than they had expected, and it is not fixed to this day.
It is a documented bug; I saw it documented some time in 2004:
-- snippet --
Condor 6.6.1 Release notes:
* Submission of MPI jobs from a Unix machine to run on Windows machines (or
vice versa) fails for machine_count > 1. This is not a new bug. Cross-platform
submission of MPI jobs between Unix and Windows has always had this problem.
-- snippet --
Now I use a Windows machine as the central manager and submitter. I've
installed as many UNIX tools as I needed to make the server more friendly
(Cygwin with all its support tools, like sshd, etc.).
I run MPI jobs successfully now with a Windows submitter. I should also point
out that I use MPICH NT 1.2.5. I have always used this version, even though
the Condor documentation specifically notes that 1.2.5 is not supported. I
have not had any MPI-related problems in my all-Windows pool.
However, I have uncovered what I think is a bug in the file transfer mechanism
when running MPI jobs on a Windows pool. As the number of files to transfer
(transfer_input_files) and the machine_count value rise, the chance of the
file transfer failing gets very high - to the point that you can assume
failure. I haven't heard many others talk about this, though I don't know how
many people use a Windows pool to run MPI jobs as I do. I submitted the bug to
condor-admin a few months ago but have received few replies; the ones I did
get simply stated that they were too busy to read through the logs I sent in.
About a month ago I posted the problem to this mailing list (dig through the
archives and it should pop up).
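For anyone trying to reproduce this, a minimal Condor submit description for an MPI-universe job of the kind discussed above might look like the sketch below. This is an illustration only: the executable and input file names are hypothetical placeholders, not from my actual setup.

```
# Minimal sketch of an MPI-universe submit file (Condor 6.6.x era).
# my_mpi_job.exe, input1.dat, and input2.dat are placeholder names.
universe              = MPI
executable            = my_mpi_job.exe
machine_count         = 4
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files  = input1.dat, input2.dat
log    = mpi.log
output = mpi.out.$(NODE)
error  = mpi.err.$(NODE)
queue
```

In my experience, the failure rate climbs as machine_count and the length of the transfer_input_files list grow together.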
With all this said, if you successfully run win32 MPI code from a Linux server
on 2 or more Windows execute nodes, let me know! I'd be very interested to
hear about your exact setup. Cheers and good luck.
--- Mike Busch <zenlc2000@xxxxxxxxx> wrote:
> You say,
> > With the vanilla universe, you won't be able to allocate multiple machines
> > in any sort of a group - you run the risk of a single node disappearing.
> > With the MPI universe, the loss of a single node tells Condor to shut down
> > all of the other machines, since Condor assumes your MPI implementation
> > has no fault tolerance.
> Let's say I'm willing to accept the lack of fault tolerance just for
> the sake of proving the concept. Is it possible to submit an NT-MPICH
> job to a Linux Manager and have it run on a Win2k pool in the vanilla
< NPACI Education Center on Computational Science and Engineering >
"A friend is someone who knows the song in your heart and can sing it back to you when you have forgotten the words." -Unknown Author