
Re: [Condor-users] mpi and dedicated scheduler configuration



Hi Vahid,

Thank you for that report.  At this point I'm more than a little
frustrated.  Originally the project was to have Globus submitting jobs
to a Condor pool of MPI machines, but I find I can't run Globus due to
security restrictions, and now I find out that Condor won't submit MPI
jobs from Linux to Windows.  That pretty effectively kills my project.

I'll work with it a bit and let you know if I get any different
results.

Mike



--- Vahid Pazirandeh <vpaziran@xxxxxxxxx> wrote:
> Hello,
> 
> I used to have a setup very similar to yours, Mike.  Then I found a bug
> that arises when you use a Linux submitter with Windows execute nodes
> for MPI jobs.
> 
> I had a Linux server acting as the central manager and submitter.  All
> my execute nodes were Windows.  I compiled my code with cygwin.  My code
> ran fine when I ran it with the GUI NT mpirun program included in the
> package at http://www-unix.mcs.anl.gov/mpi/mpich/.  I could run my MPI
> code on Condor so long as machine_count=1.  If it was anything greater,
> Condor would crash and burn.
> 
> I sent a bug report to condor-admin around November 2003.  It turned
> out to be a larger bug in Condor than they had expected, and it is not
> fixed to this day.  It is a documented bug.  I saw it documented some
> time in 2004:
> http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html
> 
> -- snippet --
> Condor 6.6.1 Release notes:
> Known Bugs:
>     * Submission of MPI jobs from a Unix machine to run on Windows
>       machines (or vice versa) fails for machine_count > 1. This is not
>       a new bug. Cross-platform submission of MPI jobs between Unix and
>       Windows has always had this problem.
> -- snippet --
> 
> Now I use a Windows machine as the central manager and submitter.  I've
> installed as many UNIX tools as I needed to make the server more
> friendly (cygwin with all its support tools like sshd, etc).
> 
> I run MPI jobs successfully now with a Windows submitter.  I should also
> point out that I use MPICH NT 1.2.5.  I have always used this version
> and I know Condor documentation specifically notes that 1.2.5 is not
> supported.  I have not suffered any MPI related problems in my
> all-Windows pool.
> 
> However, I have uncovered what I think is a bug in the file transfer
> mechanism when running MPI jobs on a Windows pool.  As the number of
> files to transfer (transfer_input_files) and the machine_count value
> rise, the chance of the file transfer failing gets very high - to the
> point that you can assume failure.  I haven't heard many others talk
> about this, though I don't know how many people are using a Windows pool
> to run MPI jobs like I am.  I submitted the bug to condor-admin a few
> months ago but I have not received many replies.  The few replies I did
> receive simply stated that they are too busy to read through the logs
> that I sent in.  About a month ago I posted the problem to this mailing
> list (dig through the archives and it should pop up).
> 
> With all this said, if you successfully run win32 MPI code from a Linux
> server to 2 or more Windows execute nodes, let me know!  I'll be very
> interested to know your exact setup.  Cheers and good luck.
> 
> Regards,
> Vahid
> 
> 
> 
> --- Mike Busch <zenlc2000@xxxxxxxxx> wrote:
> > Erik,
> > 
> > You say, 
> > 
> > > With the vanilla universe, you won't be able to allocate multiple
> > > machines in any sort of a group - you run the risk of a single node
> > > disappearing.  With the MPI universe, the loss of a single node
> > > tells Condor to shut down all of the other machines, since Condor
> > > assumes your MPI implementation has no fault tolerance.
> > 
> > Let's say I'm willing to accept the lack of fault tolerance just for
> > the sake of proving the concept.  Is it possible to submit an NT-MPICH
> > job to a Linux Manager and have it run on a Win2k pool in the vanilla
> > universe?
> 
> 
> > 
> > Thanks!
> > Mike
> > 
> > 
> 
> 
> =====
> < NPACI Education Center on Computational Science and Engineering >
> < http://www.edcenter.sdsu.edu/>
> 
> "A friend is someone who knows the song in your heart and can sing it
> back to you when you have forgotten the words."  -Unknown Author 
> =====
> 
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 



	
		