
Re: [Condor-users] Parallel submission issue



OK, here is where I stand:

I edited the local config file to point EXECUTE at an NFS-shared directory: now I can run MPICH2 programs on 2 different boxes.
BUT this is not an efficient solution, since everything gets written over the network, which will slow down my simulations...

On the other hand, I tried some LAM/MPI jobs: they run perfectly well on several boxes (even with the initial setup, i.e. EXECUTE on each local hard disk).

So I can pinpoint the problem more precisely now:
When I run programs with MPICH _or_ MPICH2 on several computers, with EXECUTE located locally, it fails because one box can't find the "main" dir_XXXXX in which to write the output.
If I change EXECUTE to an NFS-shared directory, it runs fine (but that is not efficient in terms of network usage, IMO).
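
For reference, the whole change is a single line in each machine's local
config file (condor_config.local on my machines), followed by a restart of
the node's daemons (I used condor_restart; I am not sure a plain
condor_reconfig picks up an EXECUTE change):

  EXECUTE = /nfs/scratch-condor/execute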

Programs built with LAM/MPI run fine on several boxes, with a local EXECUTE.
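
By the way, to double-check what a given startd is actually using, a query
like this works ("node01" is just an example hostname, replace it with one
of your machines):

  condor_config_val -name node01 -startd EXECUTE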

What should I investigate now? Do you have a solution for this?

Thanks in advance
Nicolas


----------------
On Sat, 1 Sep 2007 08:40:07 +0100
Si Hammond wrote:

> 
> On 31 Aug 2007, at 16:23, Nicolas GUIOT wrote:
> 
> > Please people, I REALLY need help: I'm leaving this lab very soon, and
> > if I can't get this working for MPI, it's quite certain people will
> > give up on Condor, even for single-CPU jobs, which would be very sad...
> >
> > News:
> >
> > I tested MPI with another program, and I see exactly the same
> > symptoms: one computer stores the output files, and every process that
> > runs on this computer finds them, but the 2nd computer can't.
> >
> > I would like to run a test with the EXECUTE directory on the same NFS
> > folder. So I tried this:
> > condor_config_val -rset EXECUTE=/nfs/scratch-condor/execute
> >
> > but it failed, whether I ran it as root or as the condor user:
> > Attempt to set configuration "EXECUTE=/nfs/scratch-condor/execute"  
> > on master calisto.my.domain.fr <XXX.XXX.XXX.XXX:55829> failed.
> >
> > So :
> > 1- what's the correct solution to make the files visible to all the
> > computers?
> > 2- for my tests, how can I change the EXECUTE directory to be
> > NFS-shared?
> 
> Nicolas, have you tried specifying EXECUTE in each machine's
> configuration file (i.e. making every machine use the NFS-shared space)?
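> 
> About the failed condor_config_val -rset: as far as I remember, a daemon
> only accepts runtime changes when they are explicitly enabled in its
> configuration, along these lines (a sketch, untested -- do check the
> exact names against your Condor version's manual):
> 
>   ENABLE_RUNTIME_CONFIG = TRUE
>   SETTABLE_ATTRS_CONFIG = EXECUTE
> 
> and the host you run it from needs CONFIG access (HOSTALLOW_CONFIG, if I
> recall correctly). Editing each machine's local config file and
> restarting is the simpler route.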
> 
> 
> >
> >
> > ++
> > Nicolas
> >
> > ----------------
> > On Thu, 30 Aug 2007 12:36:08 +0200
> > Nicolas GUIOT wrote:
> >
> >> Hi
> >>
> >> I'm trying to submit an MPI job to my Condor pool.
> >>
> >> The problem is that when I ask it to run on 2 CPUs (i.e. 1 computer),
> >> it's fine, but when I ask for 4 CPUs (i.e. 2 computers), one of them
> >> seems unable to find the file to write its output to.
> >>
> >> Here is the submission script:
> >> $ cat sub-cond.cmd
> >> universe = parallel
> >> executable = mp2script
> >> arguments = /nfs/opt/amber/amber9/exe/sander.MPI -O -i md.in -o TGA07.1.out -p TGA07.top -c TGA07.0.rst -r TGA07.1.rst -x TGA07.1.trj -e TGA07.1.ene
> >> machine_count = 4
> >> should_transfer_files = yes
> >> when_to_transfer_output = on_exit_OR_EVICT
> >> transfer_input_files = /nfs/opt/amber/amber9/exe/sander.MPI,md.in,TGA07.top,TGA07.0.rst
> >> Output  = sanderMPI.out
> >> Error   = sanderMPI.err
> >> Log     = sanderMPI.log
> >> queue
> >>
> >> I'm starting the script from a directory that is NFS-shared:
> >>
> >> (/nfs/test-space/amber)$ ls
> >> blu.sh  clean.sh  md.in  mdinfo  mp2script  mpd.hosts  run_MD.sh  sub-cond.cmd  TGA07.0.rst  TGA07.top
> >>
> >> The error is a typical Amber error for when it can't find the result
> >> file (TGA07.1.out is an output file; it doesn't exist before running
> >> the program):
> >>
> >> $ more sanderMPI.err
> >> 0:
> >> 0:   Unit    6 Error on OPEN: TGA07.1.out
> >>
> >> 0: [cli_0]: aborting job:
> >> 0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> >> $
> >>
> >> So, where is my problem? NFS? File transfer?
> >>
> >> Any help would be greatly appreciated :)
> >>
> >> Nicolas
> >
> >
> 

----------


----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique

Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------