[Condor-users] A couple of MPI-universe oddities


I'm experiencing a couple of problems running MPI universe jobs. Which one I hit depends on whether I use the underlying NFS file system (which gives one error) or Condor's file transfer (which gives a different error). I'll describe the two separately:

1) Problem 1 - No NFS case

In this case I set the following in the submit script:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

and all goes well with a simple "hello world" program for jobs that use <= 6 processors. With more than 6, some processors do not return their output. It's not always the same processors, and not always the same number of them. The ShadowLog always has:

5/3 11:43:41 (25.0) (30724): Job 25.0 terminated: exited with status 0
5/3 11:43:42 (25.0) (30724): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

The StarterLogs all exit with "Status 0", and the job logfile confirms that some nodes returned zero bytes. I stress that the same MPI job, when run directly with mpirun, works perfectly well for an arbitrary number of nodes.
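
To pin down which nodes come back empty on a given run, a small script along the following lines can list them. This is only a sketch: it assumes the per-node output files are named outfile.N, as in the error message under Problem 2, and the function name empty_outputs is made up for illustration.

```python
import os
import re


def empty_outputs(directory, prefix="outfile."):
    """Return the sorted node numbers whose output file exists but is empty.

    Assumes per-node output files are named <prefix><node number>,
    e.g. outfile.0, outfile.1, ... (an assumption, adjust as needed).
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    empty = []
    for name in os.listdir(directory):
        match = pattern.fullmatch(name)
        if match is None:
            continue  # not one of the per-node output files
        if os.path.getsize(os.path.join(directory, name)) == 0:
            empty.append(int(match.group(1)))
    return sorted(empty)


# Example: report the empty output files in the current directory.
print("nodes with empty output:", empty_outputs("."))
```

Running it after each job makes it easy to see whether the set of empty nodes really changes from run to run.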

2) Problem 2 - NFS case

Now I set the following in the submit script:

should_transfer_files = IF_NEEDED

and for any number of processors in the job I get the following in the job logfile:

007 (026.000.000) 05/03 11:51:08 Shadow exception!
Error from starter on node2--srl.grid.private.cam.ac.uk: Failed to open standard output file '/home/mcal00/mpi/outfile.0': Permission denied (errno 13)

I have no problem when running the jobs via mpirun as the dedicated condor execute user. I have, however, come across an article suggesting that one possible source of the "Permission denied" message is using the su command to change the effective user id on some systems that use the ch_p4 device. This is pretty much the Condor-MPI setup, right? /home is NFS-exported with no_root_squash across the nodes, and both root and the dedicated condor user have passwordless access set up.
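
To separate a plain file-permission problem from something specific to how the job is launched, the open can be reproduced by hand on the execute node. The following is a minimal sketch, not anything Condor itself runs: check_writable is a made-up helper, and it assumes the starter opens the output file for writing and creates it if it is missing.

```python
import errno
import os


def check_writable(path):
    """Try to open `path` for writing and return (ok, message).

    Note: this creates the file if it does not already exist.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    except OSError as exc:
        # Report the failure in the same style as the job log entry.
        name = errno.errorcode.get(exc.errno, "?")
        return False, "%s (errno %d, %s)" % (exc.strerror, exc.errno, name)
    os.close(fd)
    return True, "ok (uid=%d, euid=%d)" % (os.getuid(), os.geteuid())
```

Calling check_writable('/home/mcal00/mpi/outfile.0') on node2, once as the submitting user and once as the dedicated condor user, should show whether errno 13 can be reproduced outside Condor and under which uid.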

Help to alleviate either of the above problems would be much appreciated!