
[Condor-users] MPI jobs without a Shared NFS weirdness.

Lo gang,

So I was digging around in the scripts and I ran across something a bit weird.
I wanted to see if anyone else has noticed this as well, and perhaps we can all
work towards a solution that will benefit everyone.

So when your MPI job actually gets run, a p4pg file gets made that tells each
computer where the MPI executable is. Normally, if you had a shared file system,
you would expect it to be in the same place on all your nodes, right? Well, this
seems not to be the case when you aren't using a shared file system.

Let's take a closer look. When an MPI job is being run on your machine, condor
will create a temp directory, usually in the form of dir_#####, in your
/home/of/condor/execute directory. Within that directory, 2 files of interest are

contact and PI#####

Doing a quick "more" command on them we can see what they contain:

[root@panndaa execute]# more contact
1 tango 4444 nobody /home/condor/execute/dir_6729
0 panndaa 4444 condor /home/condor/execute/dir_31433

[root@panndaa execute]# more PI31478
panndaa 0 /home/condor/execute/dir_31433/john
tango 1 /home/condor/execute/dir_31433/john

Okay, back to the p4pg file. It seems that the p4pg file is being created from
the PI##### file rather than from what I think it should use: the directory
locations in the contact file. Within the PI##### file, both entries point to
the same directory, but on the actual nodes the directories are the ones listed
in the contact file.
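
To make the mismatch concrete, here is a rough Python sketch of what I mean. It
rewrites the PI##### entries so each host gets its own directory from the
contact file. The column layouts (rank, host, port, user, directory for contact;
host, nprocs, executable path for PI#####) are my guess from the two samples
above, and the function names are made up for illustration. It also assumes the
executable has the same file name in every node's directory, which I haven't
verified.

```python
import os

def parse_contact(text):
    """Map host -> execute directory.
    Assumed column layout: rank host port user directory."""
    dirs = {}
    for line in text.strip().splitlines():
        rank, host, port, user, directory = line.split()
        dirs[host] = directory
    return dirs

def parse_procgroup(text):
    """Return (host, nprocs, path) tuples.
    Assumed column layout: host nprocs executable_path."""
    entries = []
    for line in text.strip().splitlines():
        host, nprocs, path = line.split()
        entries.append((host, int(nprocs), path))
    return entries

def rebuild_procgroup(contact_text, pi_text):
    """Rewrite each entry so its path uses that host's own directory."""
    dirs = parse_contact(contact_text)
    lines = []
    for host, nprocs, path in parse_procgroup(pi_text):
        exe = os.path.basename(path)  # keep the executable's file name
        lines.append(f"{host} {nprocs} {os.path.join(dirs[host], exe)}")
    return "\n".join(lines)

# Sample data copied from the two files shown above.
CONTACT = """\
1 tango 4444 nobody /home/condor/execute/dir_6729
0 panndaa 4444 condor /home/condor/execute/dir_31433
"""
PI = """\
panndaa 0 /home/condor/execute/dir_31433/john
tango 1 /home/condor/execute/dir_31433/john
"""

print(rebuild_procgroup(CONTACT, PI))
```

With the sample data, the tango entry ends up pointing at dir_6729 instead of
dir_31433, which is what I would have expected the generated file to contain.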

I edited the mpi scripts with some brief echo statements to know where I am and
what path is being traced through them and my output was something like:

running /home/condor/execute/dir_31960/john on 2 LINUX ch_p4 processors
Created /home/condor/execute/dir_31960/PI32004
p0_32088:  p4_error: Timeout in making connection to remote process on tango: 0
p0_32088: (302.012775) net_send: could not write to fd=4, errno = 32

So it seems that in the last line something couldn't be written to. If I'm
correct in assuming it's trying to write to a directory that doesn't exist, then
that's probably what's causing these errors.
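
For what it's worth, the errno value in that last line can be looked up with a
couple of lines of Python. This is only a lookup of the number, not a diagnosis.
On Linux, 32 is EPIPE (broken pipe), which usually means the other end of a
connection went away. That would at least be consistent with the process on
tango never starting properly, though I can't say for sure that the path
mismatch is the reason.

```python
import errno
import os

code = 32  # the errno reported by net_send in the output above

# errno.errorcode maps the number to its symbolic name on this platform;
# os.strerror gives the platform's human-readable message for it.
print(errno.errorcode.get(code), "-", os.strerror(code))
```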

What do you guys think? Am I crazy?