
Re: [Condor-users] A couple of MPI-universe oddities


This is an interesting failure mode. We routinely run 16-node MPI jobs here, so there shouldn't be anything inherent in Condor that prevents this case. Can you turn on D_FULLDEBUG for the starter and the shadow, and see whether the logs say anything more informative? Also, is there anything unusual in the user log?
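
For reference, a minimal sketch of the configuration change this asks for, assuming the usual per-daemon debug macros and that the local configuration file is the place your site keeps such overrides. The shadow setting goes on the submit machine and the starter setting on each execute node:

SHADOW_DEBUG = D_FULLDEBUG
STARTER_DEBUG = D_FULLDEBUG

Running condor_reconfig on the affected machines should make the daemons pick up the new settings. The extra detail then shows up in the ShadowLog and StarterLog files for jobs started afterwards.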



Mark Calleja wrote:

I'm experiencing a couple of problems running MPI universe jobs, depending on whether I use the underlying NFS file system (which gives rise to one error) or not (which leads to a different error). I'll describe the two separately:

1) Problem 1 - No NFS case

In this case I set the following in the submit script:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

and all goes well with a simple "hello world" program for jobs that use <= 6 processors. Any more than that and some processors do not return their output. It's not always the same processors, and not always the same number of processors. The ShadowLog always has:

5/3 11:43:41 (25.0) (30724): Job 25.0 terminated: exited with status 0
5/3 11:43:42 (25.0) (30724): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

The StarterLogs all exit with "Status 0", and the job logfile confirms that some nodes returned zero bytes. I stress that the raw MPI job, when run with mpirun, works perfectly well for an arbitrary number of nodes.

2) Problem 2 - NFS case

Now I set the following in the submit script:

should_transfer_files = IF_NEEDED

and for any number of processors in the job I get the following in the job logfile:

007 (026.000.000) 05/03 11:51:08 Shadow exception!
Error from starter on node2--srl.grid.private.cam.ac.uk: Failed to open standard output file '/home/mcal00/mpi/outfile.0': Permission denied (errno 13)

I have no problem when running the jobs via mpirun as the dedicated Condor execute user. I have, however, come across an article suggesting that a possible source of the "Permission denied." message is the use of the su command to change the effective user ID on some systems that use the ch_p4 device. This is pretty much the Condor-MPI setup, right? /home is NFS-exported with no_root_squash across the nodes, and both root and the dedicated Condor user have passwordless access set up.

Help to alleviate either of the above problems would be much appreciated!

