[Condor-users] condor_shadow exits with STATUS 100 on MPI jobs



Hi,

I'm trying to get MPI universe jobs to work without using shared disc space, but I've hit a bit of a snag. The setup uses MPICH v1.2.4 compiled with Intel's ifc 7.1, and raw MPI jobs run fine. When I submit a simple "hello world" program via Condor's MPI universe, the job also runs to completion and returns its output, but the nodes don't exit cleanly: they remain in a Claimed/Idle state, and the ShadowLog on the submit host ends up with:

4/28 07:58:27 (22.0) (1040): Job 22.0 terminated: exited with status 0
4/28 07:58:27 (22.0) (1040): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
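
For reference, the test program is nothing more exotic than the textbook MPI "hello world"; a minimal C version (an illustrative sketch, not my exact source) looks like:

=================
#include <stdio.h>
#include <mpi.h>

/* Each rank reports itself as "rank N of M" and exits. */
int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
=================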


The StartLog and StarterLog on the execute nodes look happy enough, and jobs on those nodes run as the dedicated user condor_user, which has passwordless rsh set up between all the execute nodes.
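
(By "passwordless" I mean that a command like rsh <exec-node> true, run as condor_user, completes without any prompt; the hostname there is a placeholder.)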

The submit script is:

=================
universe = MPI
executable = hello
machine_count = 6

should_transfer_files = yes
when_to_transfer_output = ON_EXIT

log = logfile
input = /dev/null
output = outfile.$(NODE)
error = errfile.$(NODE)

queue
=================
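
For completeness: $(NODE) expands to the MPI rank number, so each node gets its own output and error files, and the job is queued in the usual way (submit file name illustrative):

=================
condor_submit hello.sub
=================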

Now, I realise I could get round all this by NFS-mounting all the home space, but I'd like to avoid that if possible for performance reasons. Any suggestions?

Cheers,
Mark