
Re: [Condor-users] Condor and MPI jobs



It has been a while since I tried running MPI jobs under Condor, but you
definitely need to work in the 'parallel' universe. Also, your executable
shell script is a bit too simple. Try these:
https://lists.cs.wisc.edu/archive/condor-users/2009-February/msg00024.shtml

This is all somewhat RTFM, I guess, but then again, the manual is very
complex and had not been updated for MPI the last time I checked (which
was some time ago, around Condor 6.8).
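
For what it's worth, a parallel-universe submit file might look roughly
like the sketch below. Treat it as a starting point, not a tested recipe.
It assumes that your pool has a dedicated scheduler configured, that you
have a wrapper script such as the openmpiscript example shipped with newer
Condor releases (copied next to the submit file and edited to point at
your Open MPI installation), and that your pw.x build accepts its input
file through "-in" instead of stdin. The output and error file names are
made up for illustration.

```
# Sketch only: check every line against your own pool and Condor version.
universe                = parallel
# Assumed wrapper: Condor's example Open MPI script, which starts mpirun.
executable              = openmpiscript
# The wrapper receives the real MPI binary and its arguments.
arguments               = /opt/espresso-mpi/bin/pw.x -in LIME-443-001.pw.inp
# Number of slots to claim; replaces "-np 2" in the hand-written mpirun call.
machine_count           = 2
transfer_input_files    = /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.inp
# $(NODE) expands to the node number, giving one stdout/stderr file per node.
output                  = LIME-443-001.out.$(NODE)
error                   = LIME-443-001.err.$(NODE)
log                     = LIME-443-001.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

The point is that Condor, not your own shell script, decides which
machines take part, and the wrapper builds the host list for mpirun from
that.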

Jakob


Ary Junior wrote:
> Hi, I'm trying to run a job with MPI and Condor. My .submit file looks
> like this:
> 
> universe        = vanilla
> requirements    = Activity == "Idle"
> executable      = LIME-443-001.sh
> output          = LIME-443-001.sh.out
> error           = LIME-443-001.sh.err
> log             = LIME-443-001.sh.log
> should_transfer_files = IF_NEEDED
> when_to_transfer_output = ON_EXIT
> queue
> 
> In this example, LIME-443-001.sh has the following content:
> 
> #!/bin/sh
> export OMP_NUM_THREADS=1
> export LD_LIBRARY_PATH=:/usr/lib64/mpi/gcc/openmpi/lib64
> /usr/lib64/mpi/gcc/openmpi/bin/mpirun -np 2 /opt/espresso-mpi/bin/pw.x \
>     < /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.inp \
>     > /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.out
> 
> If I don't use Condor and run the script directly with "sh
> LIME-443-001.sh", everything works fine. However, if I run
> "condor_submit LIME-443-001.submit", I get the following error in the
> LIME-443-001.sh.err file:
> 
> [xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 312
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_pls_base_select failed
>   --> Returned value -1 instead of ORTE_SUCCESS
> 
> --------------------------------------------------------------------------
> [xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_system_init.c at line 42
> [xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 52
> --------------------------------------------------------------------------
> Open RTE was unable to initialize properly.  The error occured while
> attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
> 
> Can anybody help me?
> 
> Thanks very much!!!
> 
> Ary Junior