
[Condor-users] NFS and the parallel universe



I need some help understanding the parallel universe and a shared file system. I currently have a pool of machines that NFS-mount a 5.3 TByte file system for users to run their jobs out of. I am now able to run MPI/parallel jobs across the pool, but I noticed something odd in the file system behavior. I previously reported a chirp error in my parallel environment, and to fix it I was told to put the following entries in my submission script:
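(Roughly speaking, the advice amounted to enabling Condor's file transfer machinery in the submit description; something along these lines, though I may not be quoting the exact entries:)

```
# Sketch of the usual fix for chirp errors in the parallel universe:
# turn on Condor's file transfer so the starter sets up the chirp proxy.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
```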

From the NFS-mounted scratch directory I issued the condor_submit command in the parallel universe. The job failed with an error to the effect that the executable was nowhere to be found, even though it existed in the directory I had submitted from. I read the docs, added the following line to the submit script, and the job began working:

I also had to put the full path to the executable on the arguments line, because the mp1script I am using reported that it too couldn't find the xhpl binary.
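(Putting those pieces together, the working submit description ended up shaped roughly like this; the paths and counts below are illustrative, not my real ones:)

```
# Sketch of a parallel-universe submit file for xhpl via mp1script.
# /home/rnc/scratch/hpl/... are made-up example paths.
universe                = parallel
executable              = /home/rnc/scratch/hpl/mp1script
arguments               = /home/rnc/scratch/hpl/xhpl
machine_count           = 4
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = HPL.dat
output                  = hpl.out.$(NODE)
error                   = hpl.err.$(NODE)
log                     = hpl.log
queue
```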

These are the contents of my mp1script:
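(It is essentially the stock mp1script from the Condor manual's MPICH example; I'm sketching it from that example rather than pasting my exact copy, so treat the details as approximate:)

```
#!/bin/sh
# Stock-style mp1script for MPICH (ch_p4): node 0 runs mpirun,
# the other nodes just host sshd processes until the job ends.

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

# Start an sshd for this node and register it in the contact file.
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

# Every node except node 0 just waits, then cleans up its sshd.
if [ $_CONDOR_PROCNO -ne 0 ]
then
    wait
    sshd_cleanup
    exit 0
fi

EXECUTABLE=$1
shift

# File transfer clears the execute bit, so restore it.
chmod +x $EXECUTABLE

# Tell MPICH to use Condor's ssh wrapper between nodes.
P4_RSHCOMMAND=$CONDOR_SSH
export P4_RSHCOMMAND

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field of the contact file is the hostname condor_ssh knows.
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@

sshd_cleanup
rm -f machines
exit $?
```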

I poked around a little more and noticed that the execute directories on the nodes where the job was running contained my binary and the input file, along with the error/output files that were copied back when the job finished.

What settings might I be missing to allow NFS nodes to function in my parallel universe? Am I misunderstanding the way NFS should behave? My experience with clusters and NFS is from the PBS environment, where I submit from the directory that all of my input and output are read from and written to (cd $PBS_O_WORKDIR). The MPI and vanilla universes appear to work as expected, but the parallel universe does not. Any thoughts or ideas?
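(The working-directory difference I'm describing can be shown with a tiny shell experiment; the /tmp directory names are made up. Under PBS the job script does `cd $PBS_O_WORKDIR` onto the shared filesystem, so relative paths resolve; a Condor job starts in its own execute directory, where the same relative path points at nothing:)

```shell
# Simulate a submit directory (on NFS) and a node's execute directory.
mkdir -p /tmp/demo_submit /tmp/demo_execute
cat > /tmp/demo_submit/xhpl <<'EOF'
#!/bin/sh
echo ok
EOF
chmod +x /tmp/demo_submit/xhpl

cd /tmp/demo_execute        # stand-in for Condor's execute directory
./xhpl 2>/dev/null || echo "not found from execute dir"   # prints "not found from execute dir"
/tmp/demo_submit/xhpl       # full path works from anywhere; prints "ok"
```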


Richard N. Cleary
Sandia National Laboratories
Dept. 4324 Infrastructure Computing Systems
Email: rnclear@xxxxxxxxxx
Phone: 505.845.7836