
[Condor-users] Running MPI job using Open MPI



I'm trying to run an MPI job on Condor using the parallel universe. I use an Open MPI wrapper script that I found in the mail archives, https://www-auth.cs.wisc.edu/lists/condor-users/2009-February/msg00024.shtml. My submission file looks like the following:

universe = parallel
executable = openmpi-wrapper
arguments = process_images_parallel
getenv = True
log = condor.log
output = stdout.$(NODE)
error = stderr.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = process_images_parallel
queue

The job runs until it sources the sshd.sh script, then fails while generating a hostkey in the local/execute/tmp_directory folder. There is no way in my environment to add a condor user, so all my applications run as nobody with CONDOR_IDS = 99.99. I can run the sshd.sh script as nobody outside of Condor (after setting a few environment variables so it will work), and it creates the hostkey as expected. The permissions all look correct, and as far as I can tell everything works up to that point.

I'm using Red Hat Enterprise Linux 5, bash, Condor 7.4.0, and Open MPI 1.4.1. I currently don't have a firewall on any of the machines, ALLOW_WRITE/READ = *, LOCAL_DIR is on the local disk, RELEASE_DIR is on a network share along with LOCAL_CONFIG_FILE, and FILESYSTEM_DOMAIN = $(FULL_HOSTNAME). Any help would be much appreciated.
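For anyone wanting to reproduce the hostkey step in isolation, here is a minimal sketch of what sshd.sh does at that point, run as an unprivileged account the way Condor would. The directory name and key options here are assumptions for illustration, not copied from sshd.sh itself:

```shell
#!/bin/sh
# Stand-in for Condor's local/execute/tmp_directory scratch dir.
WORKDIR=$(mktemp -d)

# ssh-keygen may consult $HOME (e.g. for a known_hosts lookup), and
# "nobody" often has an unwritable or nonexistent home, so point it
# somewhere writable -- one suspect for the in-Condor failure.
HOME=$WORKDIR
export HOME

# Generate a passwordless RSA hostkey, as sshd.sh does for its
# per-job sshd instance.
ssh-keygen -q -t rsa -N '' -f "$WORKDIR/hostkey" < /dev/null

ls -l "$WORKDIR/hostkey" "$WORKDIR/hostkey.pub"
```

If this succeeds interactively but the same step fails under Condor, comparing the environment (HOME, PATH, TMPDIR, umask) between the two runs may show what differs.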

Kirt Lillywhite