
Re: [Condor-users] condor_shadow exits with STATUS 100 on MPI jobs



Hi Mark,
I believe this is the expected behaviour: the MPI universe uses the
dedicated scheduler, which claims the resources and doesn't release them
until some timeout expires (hopefully configurable; I can't find the
exact name, although I'm sure I once managed to). For very short jobs
like yours the claims will naturally outlive the job for a while, which
is why the nodes sit in Claimed/Idle. As far as I remember, the shadow
exiting with status 100 just means the job exited normally, so that part
of the log looks fine.
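If it helps, the parameter I was thinking of might be
UNUSED_CLAIM_TIMEOUT, which (if I remember right) controls how long the
dedicated scheduler holds on to an idle claim before releasing the
machine back to the pool. A sketch of what I'd try in the condor_config
on the submit/scheduling machine, assuming that name is correct (the 120
is just an example value, in seconds):

=================
# Release claims held by the dedicated scheduler after they have
# sat unused for this many seconds (example value).
UNUSED_CLAIM_TIMEOUT = 120
=================

followed by a condor_reconfig on that machine. Please check the name
against the manual before relying on it; I'm going from memory here.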
Mark

On Thu, 2005-04-28 at 08:16 +0100, Mark Calleja wrote:
> Hi,
> 
> I'm trying to get MPI universe jobs to work without using shared disc 
> space, but have run into a bit of a hitch. The setup uses MPICH v1.2.4 
> compiled with Intel's ifc 7.1, and raw MPI jobs work well. When I submit 
> a simple "hello world" program via Condor's MPI universe, the jobs also 
> run to completion and return the data, but the nodes don't exit cleanly; 
> they remain in a Claimed/Idle state, and the ShadowLog on the submit 
> host ends up with:
> 
> 4/28 07:58:27 (22.0) (1040): Job 22.0 terminated: exited with status 0
> 4/28 07:58:27 (22.0) (1040): **** condor_shadow (condor_SHADOW) EXITING 
> WITH STATUS 100
> 
> The StartLog and StarterLog on the execute nodes seem happy enough, and 
> jobs on those nodes run as the dedicated user condor_user, which has 
> passwordless rsh set up between all the execute nodes.
> 
> The submit script is:
> 
> =================
> universe = MPI
> executable = hello
> machine_count = 6
> 
> should_transfer_files = yes
> when_to_transfer_output = ON_EXIT
> 
> log = logfile
> input = /dev/null
> output = outfile.$(NODE)
> error = errfile.$(NODE)
> 
> queue
> =================
> 
> Now, I realise I could get round all this by NFS-mounting all the home 
> space, but I'd like to avoid that if possible for performance reasons. 
> Any suggestions?
> 
> Cheers,
> Mark
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users