
Re: [Condor-users] mpi and dedicated scheduler configuration



Hi Mike,
I actually ALMOST solved this problem (not entirely reliably, but it
worked more or less). I did the following:
The Linux machine submits a vanilla-universe job - a Perl script - to
the Windows machine, which is the dedicated scheduler. This Perl script
invokes condor_submit once again, this time with the real MPI
executable, and submits it into the scheduler's own queue. The Perl
script uses the Condor Perl module to monitor the job's execution and
does NOT exit before the real job finishes. When the MPI job completes,
Condor brings back all of its output files and the script wraps up as
well; Condor then transfers the files of both the MPI job and the Perl
script back to the Linux machine - job complete.
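
Roughly, the wrapper looks like this - just a minimal sketch, with all
file names and job parameters invented, and with the Condor Perl module
calls written from memory (check them against the manual for your
version before relying on them):

#!/usr/bin/perl -w
# mpi_wrapper.pl - runs as a vanilla job on the Windows dedicated
# scheduler and re-submits the real MPI job into the local queue.
use strict;
use Condor;

# Write the submit description for the real MPI job
# (all names and values below are illustrative).
open(my $fh, '>', 'mpi_job.submit') or die "cannot write submit file: $!";
print $fh <<'END';
universe      = MPI
executable    = my_mpi_app.exe
machine_count = 4
log           = mpi_job.log
output        = mpi_job.out
error         = mpi_job.err
queue
END
close($fh);

# Submit into the scheduler's own queue, then block until the MPI job
# leaves the queue - the wrapper must outlive the real job so that
# Condor transfers everything back when the wrapper itself exits.
my $cluster = Condor::Submit('mpi_job.submit')
    or die "condor_submit failed\n";
Condor::RegisterExitSuccess(sub { print "MPI job $cluster finished\n" });
Condor::Monitor($cluster);
Condor::Wait();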
It works fine, BUT!! There is a problem if a user wants to issue
condor_rm to remove the job. Sometimes this works (Condor removes the
Perl job, which catches the kill signal and runs condor_rm on the MPI
job), and sometimes - I have no idea why, and this is actually what
made the whole nice pyramid fail - the signal is not caught, and the
MPI job keeps running as if nothing had happened. One possible solution
is to issue condor_rm remotely against the Windows machine, but that
doesn't work, since Condor requires some kind of authentication there.
Another is to write a C program instead of the Perl script - maybe it
handles signals better.
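
The piece that makes condor_rm work - when it works - is a signal
handler in the wrapper, something like the sketch below. Whether the
soft kill actually reaches Perl as one of these signals on Windows is
exactly the unreliable part, so treat this as an illustration, not a
guarantee:

# Installed in the wrapper before Condor::Wait(): if the vanilla job
# is condor_rm'ed, try to take the inner MPI job ($cluster from the
# earlier sketch) down with it before exiting.
foreach my $sig ('TERM', 'INT', 'BREAK') {
    $SIG{$sig} = sub {
        system('condor_rm', $cluster);   # remove the inner MPI job
        exit(1);
    };
}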
In general, Condor sometimes fails to clean up MPI jobs on Windows. One
note - all of the above refers to 6.4.7, and it may already be fixed in
6.6.
If you want, I can send you the scripts.

On Mon, 2004-06-28 at 16:43, Mike Busch wrote:
> Hi Vahid,
> 
> Thank you for that report.  At this point I'm more than a little
> frustrated.  Originally the project was to have Globus submitting
> jobs to a Condor pool of MPI machines, but I find I can't run Globus
> due to security restrictions, and now I find out that Condor won't
> submit MPI jobs from Linux to Windows.  That pretty effectively kills
> my project.
> 
> I'll work with it a bit and let you know if I get any different
> results.
> 
> Mike
> 
> 
> 
> --- Vahid Pazirandeh <vpaziran@xxxxxxxxx> wrote:
> > Hello,
> > 
> > I used to have a setup very similar to yours, Mike.  Then I found a
> > bug that arises when you use a Linux submitter with Windows execute
> > nodes for MPI jobs.
> > 
> > I had a Linux server acting as the central manager and submitter.
> > All my execute nodes were Windows.  I compiled my code with cygwin.
> > My code ran fine when I ran it with the GUI NT mpirun program
> > included in the package at
> > http://www-unix.mcs.anl.gov/mpi/mpich/.  I could run my MPI code on
> > Condor so long as machine_count=1.  If it was anything greater,
> > Condor would crash and burn.
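> > 
> > For the record, the failing case was an ordinary MPI-universe
> > submit file, nothing exotic (the names here are made up):
> > 
> > # with a Linux submitter and Windows execute nodes, this
> > # only ever worked with machine_count = 1
> > universe      = MPI
> > executable    = my_app.exe
> > machine_count = 2
> > log           = my_app.log
> > queue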
> > 
> > I sent a bug report to condor-admin around November 2003.  It
> > turned out to be a larger bug in Condor than they had expected, and
> > it is not fixed to this day.  It is a documented bug.  I saw it
> > documented some time in 2004:
> > http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html.
> > 
> > -- snippet --
> > Condor 6.6.1 Release notes:
> > Known Bugs:
> >     * Submission of MPI jobs from a Unix machine to run on Windows
> > machines (or vice versa) fails for machine_count > 1. This is not a
> > new bug. Cross-platform submission of MPI jobs between Unix and
> > Windows has always had this problem.
> > -- snippet --
> > 
> > Now I use a Windows machine as the central manager and submitter.
> > I've installed as many UNIX tools as I needed to make the server
> > friendlier (cygwin with all its support tools, like sshd, etc.).
> > 
> > I run MPI jobs successfully now with a Windows submitter.  I should
> > also point out that I use MPICH NT 1.2.5.  I have always used this
> > version, and I know the Condor documentation specifically notes
> > that 1.2.5 is not supported.  I have not suffered any MPI-related
> > problems in my all-Windows pool.
> > 
> > However, I have uncovered what I think is a bug in the file
> > transfer mechanism when running MPI jobs on a Windows pool.  As the
> > number of files to transfer (transfer_input_files) and the
> > machine_count value rise, the chances of the file transfer failing
> > get very high - to the point that you can assume failure.  I
> > haven't heard many others talk about this, though I don't know how
> > many people are using a Windows pool to run MPI jobs as I am.  I
> > submitted the bug to condor-admin a few months ago but have not
> > received many replies back.  The few replies I did receive simply
> > stated that they are too busy to read through the logs that I sent
> > in.  About a month ago I posted the problem to this mail list (dig
> > through the archives and it should pop up).
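> > 
> > To give you an idea of the scale, a submit file along these lines
> > (file names invented) was already enough to make the transfer fail
> > more often than not for me:
> > 
> > # the more entries in transfer_input_files and the higher
> > # machine_count, the more likely the transfer is to fail
> > universe             = MPI
> > executable           = my_app.exe
> > machine_count        = 8
> > transfer_input_files = mesh.dat, params.txt, tables.bin
> > log                  = my_app.log
> > queue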
> > 
> > With all this said, if you successfully run win32 MPI code from a
> > Linux server on 2 or more Windows execute nodes, let me know!  I'll
> > be very interested to hear about your exact setup.  Cheers and good
> > luck.
> > 
> > Regards,
> > Vahid
> > 
> > 
> > 
> > --- Mike Busch <zenlc2000@xxxxxxxxx> wrote:
> > > Erik,
> > > 
> > > You say, 
> > > 
> > > > With the vanilla universe, you won't be able to allocate
> > > > multiple machines in any sort of a group - you run the risk of
> > > > a single node disappearing.  With the MPI universe, the loss of
> > > > a single node tells Condor to shut down all of the other
> > > > machines, since Condor assumes your MPI implementation has no
> > > > fault tolerance.
> > > 
> > > Let's say I'm willing to accept the lack of fault tolerance just
> > > for the sake of proving the concept.  Is it possible to submit an
> > > NT-MPICH job to a Linux Manager and have it run on a Win2k pool
> > > in the vanilla universe?
> > > 
> > > Thanks!
> > > Mike
> > > 
> > 
> > =====
> > < NPACI Education Center on Computational Science and Engineering >
> > < http://www.edcenter.sdsu.edu/>
> > 
> > "A friend is someone who knows the song in your heart and can sing it
> > back to you when you have forgotten the words."  -Unknown Author 
> > =====
> > 