[Condor-users] MPI jobs without a Shared NFS weirdness.
- Date: Sat, 11 Mar 2006 13:58:12 -0700
- From: rnayar@xxxxxxxx
- Subject: [Condor-users] MPI jobs without a Shared NFS weirdness.
So I was digging around in the scripts and ran across something a bit weird.
I wanted to see if anyone else has noticed this as well, so perhaps we can all
work towards a solution that will benefit everyone.
So when your MPI job actually gets run, a p4pg file is created that tells each
computer where the MPI executable is. Normally, if you had a shared file system,
you would expect it to be in the same place on all your nodes, right? Well, this
seems not to be the case when you aren't using a shared file system.
Let's take a closer look. When an MPI job is being run on your machine, Condor
creates a temp directory, usually in the form of dir_#####, in your
/home/of/condor/execute directory. Within that directory, two files of interest
are contact and PI#####.
Doing a quick "more" on them, we can see what they contain:
[root@panndaa execute]# more contact
1 tango 4444 nobody /home/condor/execute/dir_6729
0 panndaa 4444 condor /home/condor/execute/dir_31433
[root@panndaa execute]# more PI31478
panndaa 0 /home/condor/execute/dir_31433/john
tango 1 /home/condor/execute/dir_31433/john
OK, back to the p4pg file. It seems that the p4pg file is being created from the
PI##### file rather than from (what I think) should be the directory locations
in the contact file. Within the PI##### file, both entries point to the same
directory (dir_31433), but on the actual nodes the directories are the ones
described in the contact file (dir_31433 on panndaa, dir_6729 on tango).
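
To make the mismatch concrete, here is a rough Python sketch that compares the
two files. It is not part of the Condor scripts. The column meanings (contact:
rank, host, port, user, directory; PI#####: host, process count, executable
path) are my assumption based on the sample output above, and the function
names are made up for illustration.

```python
def parse_contact(contact_text):
    """Map host -> execute directory.

    Assumes each contact line is 'rank host port user directory'
    (inferred from the sample output, not from Condor documentation).
    """
    dirs = {}
    for line in contact_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            dirs[fields[1]] = fields[4]
    return dirs


def find_mismatches(contact_text, pi_text):
    """Return (host, path_in_pi_file, dir_in_contact_file) for each host
    whose PI##### path does not live under its contact-file directory.

    Assumes each PI##### line is 'host nprocs executable_path'.
    """
    contact_dirs = parse_contact(contact_text)
    mismatches = []
    for line in pi_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        host, path = fields[0], fields[2]
        expected = contact_dirs.get(host)
        if expected is not None and not path.startswith(expected + "/"):
            mismatches.append((host, path, expected))
    return mismatches


# Sample data copied from the "more" output above.
contact = """\
1 tango 4444 nobody /home/condor/execute/dir_6729
0 panndaa 4444 condor /home/condor/execute/dir_31433
"""
pi_file = """\
panndaa 0 /home/condor/execute/dir_31433/john
tango 1 /home/condor/execute/dir_31433/john
"""

for host, path, expected in find_mismatches(contact, pi_file):
    print(f"{host}: PI file says {path}, contact file says {expected}")
```

With the sample data, only tango is flagged: its PI##### entry points into
dir_31433, while the contact file says its directory is dir_6729.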
I edited the MPI scripts with some brief echo statements to see where I am and
what path is being traced through them, and my output was something like:
running /home/condor/execute/dir_31960/john on 2 LINUX ch_p4 processors
p0_32088: p4_error: Timeout in making connection to remote process on tango: 0
p0_32088: (302.012775) net_send: could not write to fd=4, errno = 32
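So it seems from the last line that something couldn't be written to. On Linux,
errno 32 is EPIPE (broken pipe), meaning the other end of the connection was
gone. If I'm correct in assuming the job on tango is being pointed at a
directory that doesn't exist there, then that's probably what is causing both
the timeout and the failed write.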
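
If that is the cause, one possible fix would be to build the p4pg entries from
the contact file instead. Below is a minimal Python sketch of that idea, not
what the Condor scripts actually do. It assumes the contact columns are rank,
host, port, user, directory, and that the p4pg lines should have the same
"host count path" shape as the PI##### file shown above (0 on the rank-0 line,
1 on the others). The function name is hypothetical.

```python
import posixpath


def p4pg_from_contact(contact_text, exe_name):
    """Build p4pg-style lines using each node's own execute directory.

    Assumes contact lines are 'rank host port user directory' and that the
    output should mirror the PI##### layout: 'host count path'.
    """
    entries = []
    for line in contact_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        rank, host, exec_dir = int(fields[0]), fields[1], fields[4]
        entries.append((rank, host, exec_dir))

    entries.sort()  # rank 0 (the master) must come first

    lines = []
    for rank, host, exec_dir in entries:
        count = 0 if rank == 0 else 1
        lines.append(f"{host} {count} {posixpath.join(exec_dir, exe_name)}")
    return "\n".join(lines) + "\n"


# Sample data copied from the contact file above.
contact = """\
1 tango 4444 nobody /home/condor/execute/dir_6729
0 panndaa 4444 condor /home/condor/execute/dir_31433
"""
print(p4pg_from_contact(contact, "john"), end="")
```

For the sample contact file this gives panndaa the path under dir_31433 and
tango the path under dir_6729, so each node is pointed at a directory that
exists locally.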
What do you guys think? Am I crazy?