
Re: [Condor-users] Parallel submission issue



Nicolas,

Are you running Linux/Unix? If so, why don't you symbolically link the
files you are outputting to the NFS share? That way, most of the work
stays local but the output ends up on the shared space.
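
Something along these lines, run in the job's working directory
(untested, and the "output" directory on the share is just a made-up
example -- adjust the paths to your setup):

  $ mkdir -p /nfs/test-space/amber/output
  $ ln -s /nfs/test-space/amber/output/TGA07.1.out TGA07.1.out
  $ ln -s /nfs/test-space/amber/output/TGA07.1.rst TGA07.1.rst

The job then writes through the links onto the share, while everything
else stays on the local disk.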


Si Hammond,
University of Warwick


On 05/09/07, Nicolas GUIOT <nicolas.guiot@xxxxxxx> wrote:
> OK, here is where I stand:
>
> I edited the local config file to point EXECUTE at an NFS-shared directory: now I can run MPICH2 programs on 2 different boxes.
> BUT this is not an efficient solution, as everything will be written over the network, which will slow down my simulations...
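>
> (For reference, the change to the local config file is essentially a
> line like the one I tried to set remotely earlier, i.e.
>
>   EXECUTE = /nfs/scratch-condor/execute
>
> plus a reconfig/restart of Condor on the machine so the startd picks
> it up.)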
>
> On the other hand, I tried some LAM/MPI jobs: they run perfectly well on several boxes (with the initial setup as well, i.e. EXECUTE on each local HD).
>
> So, I can pinpoint the problem more precisely now:
> When I run programs with MPICH _or_ MPICH2 on several computers, with "EXECUTE" located locally, it fails because one box can't find the "main" dir_XXXXX in which to write the output.
> If I change EXECUTE to an NFS-shared directory, it runs fine (but is not efficient in terms of network usage, IMO).
>
> Programs that run with LAM/MPI run fine on several boxes, with a local "EXECUTE".
>
> What should I investigate now? Do you have a solution to this?
>
> Thanks in advance
> Nicolas
>
>
> ----------------
> On Sat, 1 Sep 2007 08:40:07 +0100
> Si Hammond wrote:
>
> >
> > On 31 Aug 2007, at 16:23, Nicolas GUIOT wrote:
> >
> > > Please, people, I REALLY need help: I'm leaving this lab very soon,
> > > and if I can't get this to work for MPI, it's quite certain people
> > > will give up using Condor, even for single-CPU jobs, which would be
> > > very sad...
> > >
> > > News:
> > >
> > > I tested MPI with another program, and I have exactly the same
> > > symptoms: one computer stores the output files, and each process
> > > that runs on that computer finds the files, but the 2nd computer
> > > can't find them.
> > >
> > > I would like to run a test with the EXECUTE directory on the same
> > > NFS share. So I tried this:
> > > condor_config_val -rset EXECUTE=/nfs/scratch-condor/execute
> > >
> > > but it failed, whether I run it as root or as the condor user:
> > > Attempt to set configuration "EXECUTE=/nfs/scratch-condor/execute"
> > > on master calisto.my.domain.fr <XXX.XXX.XXX.XXX:55829> failed.
> > >
> > > So:
> > > 1- What is the correct way to make the files visible to all the
> > > computers?
> > > 2- For my tests, how can I change the EXECUTE directory to be NFS-
> > > shared?
> >
> > Nicolas, have you tried specifying the EXECUTE directory in each machine's
> > configuration file (i.e. making every machine use the NFS-shared space)?
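> >
> > (As for the failed condor_config_val -rset: as far as I remember, the
> > target machine has to explicitly allow remote configuration before
> > -rset is honoured, with something roughly like this in its config --
> > names from memory, so please check the manual:
> >
> >   ENABLE_RUNTIME_CONFIG = TRUE
> >   SETTABLE_ATTRS_CONFIG = EXECUTE
> >   HOSTALLOW_CONFIG = your-submit-host
> >
> > Editing the local configuration file directly avoids all of that.)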
> >
> >
> > >
> > >
> > > ++
> > > Nicolas
> > >
> > > ----------------
> > > On Thu, 30 Aug 2007 12:36:08 +0200
> > > Nicolas GUIOT wrote:
> > >
> > >> Hi
> > >>
> > >> I'm trying to submit an MPI job to my Condor pool.
> > >>
> > >> The problem is that when I ask it to run on 2 CPUs (i.e. 1
> > >> computer), it's fine, but when I ask for 4 CPUs (i.e. 2 computers),
> > >> one of them seems unable to find the file in which to write the output.
> > >>
> > >> Here is the submit file:
> > >> $ cat sub-cond.cmd
> > >> universe = parallel
> > >> executable = mp2script
> > >> arguments = /nfs/opt/amber/amber9/exe/sander.MPI -O -i md.in -o TGA07.1.out -p TGA07.top -c TGA07.0.rst -r TGA07.1.rst -x TGA07.1.trj -e TGA07.1.ene
> > >> machine_count = 4
> > >> should_transfer_files = yes
> > >> when_to_transfer_output = on_exit_OR_EVICT
> > >> transfer_input_files = /nfs/opt/amber/amber9/exe/sander.MPI,md.in,TGA07.top,TGA07.0.rst
> > >> Output  = sanderMPI.out
> > >> Error   = sanderMPI.err
> > >> Log     = sanderMPI.log
> > >> queue
> > >>
> > >> I'm submitting the job from a directory that is NFS-shared:
> > >>
> > >> (/nfs/test-space/amber)$ ls
> > >> blu.sh  clean.sh  md.in  mdinfo  mp2script  mpd.hosts  run_MD.sh
> > >> sub-cond.cmd  TGA07.0.rst  TGA07.top
> > >>
> > >> The error is a typical Amber error when it can't open the result
> > >> file (TGA07.1.out is an output file; it doesn't exist before running
> > >> the program):
> > >>
> > >> $ more sanderMPI.err
> > >> 0:
> > >> 0:   Unit    6 Error on OPEN: TGA07.1.out
> > >>
> > >> 0: [cli_0]: aborting job:
> > >> 0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> > >> $
> > >>
> > >> So, where is my problem? NFS? File transfer?
> > >>
> > >> Any help would be greatly appreciated :)
> > >>
> > >> Nicolas
> > >
> > >
> > > ----------------------------------------------------
> > > CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
> > >
> > > Institut de Biologie Physico-Chimique
> > > 13 rue Pierre et Marie Curie
> > > 75005 PARIS - FRANCE
> > >
> > > Tel : +33 158 41 51 70
> > > Fax : +33 158 41 50 26
> > > ----------------------------------------------------
> >
> >
>
> ----------
>
>
> ----------------------------------------------------
> CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
>
> Institut de Biologie Physico-Chimique
> 13 rue Pierre et Marie Curie
> 75005 PARIS - FRANCE
>
> Tel : +33 158 41 51 70
> Fax : +33 158 41 50 26
> ----------------------------------------------------
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>