
Re: [Condor-users] How to get the machine file in parallel jobs



Hi Chunbao,

I had the same problem before.
I could not find a proper script for submitting jobs in an OpenMPI environment.
I found one in the share/doc/condor-7.8.1/etc/examples/ directory which
sets up ssh daemons on the remote machines, but I could not use it in an SMP
environment.

Finally I found a solution; maybe it helps you.
- I created a shell script which collects the host info from the job status and
  creates a host file containing job IDs and slot numbers for starting mpirun.
- I force mpirun to use condor_ssh_to_job. The only problem is that mpirun
  checks the format of the host file, and if an entry starts with numbers it
  assumes these are IP addresses. So I prepend a constant string to the job
  IDs, and a wrapper starts condor_ssh_to_job after removing the constant
  string.
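
For illustration, a generated host file entry might look like this (the host
name and job ID here are made up):

    node01.example.com-CONDOR-123.1.0 slots=4 max_slots=4

mpirun treats the whole first field as a host name and hands it to the rsh
agent; the wrapper strips everything up to and including '-CONDOR-' and runs

    condor_ssh_to_job 123.1.0 <command>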

I have enclosed my scripts; I hope you find them useful as well.

If you are using openmpi-1.4, change the last command of the condor_openmpi.sh script to

exec $MPIRUN --prefix $MPI_HOME --mca plm_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER \
              --hostfile $_CONDOR_PARALLEL_HOSTS_FILE "$@"
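
(As far as I know, OpenMPI renamed this MCA parameter: the 1.4 series uses
plm_rsh_agent, while 1.5 and later use orte_rsh_agent, hence the two variants.)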


Best,

Imre


On 2012.08.26. 15:00, miaocb@xxxxxxx wrote:
Hi All,
     I successfully configured condor to run parallel jobs, but I can't figure out how to get a machine file that can be used by mpiexec or mpirun to start MPI jobs. Is there an environment variable that refers to the machine file?

thanks

Chunbao Miao


#**************************************************************
# mpimimd.job:
#
# submitting MPI programs in the SIMD/MIMD model
#**************************************************************

universe = parallel
 # name of the job 
JOBNAME = mpimimd

 # helper script for starting openmpi programs
executable = condor_openmpi.sh
 # the script passes all its arguments to the mpirun command
 # e.g. interface parameters
IF= -mca btl_openib_if_include mlx4_0:1
 # and number of requested master processes
MNUM = 1
 # and the name of the master executable
MPRG = /bin/date
 # and number of requested worker processes
WNUM = 3
 # and the name of the worker executable
WPRG = /bin/hostname
arguments = $(IF) -np $(MNUM) $(MPRG) : -np $(WNUM) $(WPRG)

 # the name of the hostfile generated for the mpirun command
 #   (the default is 'parallel_hosts')
environment = _CONDOR_PARALLEL_HOSTS_FILE=$(JOBNAME).hosts

 # standard output, error and log
output = $(JOBNAME).out
error  = $(JOBNAME).err
log    = $(JOBNAME).log
 # requirement 1:  the master should run on the HEAD_NODE
machine_count = 1
requirements = ( machine == HEAD_NODE )
queue

 # the workers' stdout and stderr are redirected by mpirun
 # (no need for additional redirect by condor)
output = /dev/null
error  = /dev/null
 # requirement 2:  the workers should not run on the HEAD_NODE
machine_count = 3
requirements = ( machine =!= HEAD_NODE )
queue
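
 # With condor_openmpi.sh next to this submit file and the two helper scripts
 # below installed in the Condor libexec directory, the job is submitted as
 # usual:
 #   condor_submit mpimimd.job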

#!/bin/bash
##**************************************************************
## condor_ssh_to_job_wrapper.sh:
##       Created by I.Sz. <szebi@xxxxxxxxxx> BME-IIT 2012.07.17
## This is an ssh wrapper for the mpirun command.
## It deletes the .*-CONDOR- prefix from the hostname (first)
## argument and invokes the condor_ssh_to_job command.
##**************************************************************
#

arg1=$1; shift
arg1=`sed 's/^.*-CONDOR-//' <<< "$arg1"`
exec condor_ssh_to_job "$arg1" "$@"


#!/bin/bash

##**************************************************************
## condor_parallel_hosts.sh 
##       Created by I.Sz. <szebi@xxxxxxxxxx> BME-IIT 2012.07.17
## Functions for collecting host and job information about the running parallel job.
## Function CONDOR_GET_PARALLEL_HOSTS_INFO creates a hostfile including contact info for remote hosts.
## Usage: source the script and call the CONDOR_GET_PARALLEL_HOSTS_INFO function.
##************************************************************** 

# Defaults for error testing
: ${_CONDOR_PROCNO:=0}
: ${_CONDOR_NPROCS:=1}
: ${_CONDOR_MACHINE_AD:="None"}
: ${_CONDOR_JOB_AD:="None"}

##************************************************************** 
## Usage: CONDOR_GET_PARALLEL_HOSTS_INFO [hostfile]
## If hostfile is omitted, 'parallel_hosts' is used.
## Return:
##   The function returns (with error status on failure) on the main process (_CONDOR_PROCNO==0).
##   The function never returns on the other nodes (they sleep forever).
## The created file structure:
##   HostName1'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
##   HostName2'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
##   HostName3'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
##   ...
##************************************************************** 
function CONDOR_GET_PARALLEL_HOSTS_INFO() {
    # take the hostfile name from $1 if _CONDOR_PARALLEL_HOSTS_FILE is not set
    : ${_CONDOR_PARALLEL_HOSTS_FILE:=$1}
    # setting defaults
    : ${_CONDOR_PARALLEL_HOSTS_FILE:=parallel_hosts}
    local hostname=`hostname -f`
    if [ $_CONDOR_PROCNO -eq 0 ]; then
    # collecting info on the main proc
        clusterid=`CONDOR_GET_JOB_ATTR ClusterId`
        local ret=$?
        if [ $ret -ne 0 ]; then 
            echo "Error: CONDOR_GET_JOB_ATTR ClusterId failed"
            return 1
        fi
        local line=""
        condor_q -l $clusterid | \
        awk '/^ProcId.=/ { ProcId=$3 } \
             /^ClusterId.=/ { ClusterId=$3 } \
             /^RequestCpus.=/ { RequestCpus=$3 } \
             /^RemoteHosts.=/ { RemoteHosts=$3 } \
             /^$/ { if (ClusterId != 0) print ClusterId" "ProcId" "RequestCpus" "RemoteHosts  }' | \
        while read line; do
            CONDOR_PRINT_HOSTS $line
        done | sort -d > ${_CONDOR_PARALLEL_HOSTS_FILE}
    else 
    # endless loop on the workers
        while true ; do
            sleep 30
        done
    fi
    return 0
}

## Helper fn for getting specific machine attributes from $_CONDOR_MACHINE_AD
function CONDOR_GET_MACHINE_ATTR() {
    local attr="$1"
    awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
        { ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
        END { exit 1-ret; }' $_CONDOR_MACHINE_AD
    return $?
} 

## Helper fn for getting specific job attributes from $_CONDOR_JOB_AD
function CONDOR_GET_JOB_ATTR() {
    local attr="$1"
    awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
        { ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
        END { exit 1-ret; }' $_CONDOR_JOB_AD
    return $?
} 
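
## Example usage of the helpers (sketch; RequestCpus is a standard job-ad
## attribute and Cpus a standard machine-ad attribute):
##   reqcpus=$(CONDOR_GET_JOB_ATTR RequestCpus)
##   cpus=$(CONDOR_GET_MACHINE_ATTR Cpus)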

## Helper fn for printing the host info of one job proc
## (the local/head node is filtered out of the list)
function CONDOR_PRINT_HOSTS() {
    local clusterid=$1
    local procid=$2
    local reqcpu=$3
    local rhosts=$4
    tr ',"' '\n' <<< "$rhosts" | grep -v "$hostname" | \
    awk '{ sub(/slot.*@/,""); if ($1 != "") { slots[$1]+='$reqcpu'; subproc[$1]=id++; } } \
        END { for (i in slots) print i"-CONDOR-"'$clusterid'".1."subproc[i]" slots="slots[i]" max_slots="slots[i]; }' 
}

#!/bin/bash

##**************************************************************
## condor_openmpi.sh: 
##    Created by I.Sz. <szebi@xxxxxxxxxx> BME-IIT 2012.07.17
## This is a script to run OpenMPI jobs under the Condor parallel universe.
## It collects the host and job information into $_CONDOR_PARALLEL_HOSTS_FILE
## and executes the command
##   $MPIRUN --prefix $MPI_HOME --mca orte_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER \
##           --hostfile $_CONDOR_PARALLEL_HOSTS_FILE "$@"
## The default value of _CONDOR_PARALLEL_HOSTS_FILE is 'parallel_hosts'.
##
## The script assumes:
##  On the head node (_CONDOR_PROCNO == 0):
##    * $MPIRUN points to the mpirun command
##    * the condor_ssh_to_job command is working (run as owner is true)
##    * the condor_parallel_hosts.sh and condor_ssh_to_job_wrapper.sh scripts
##      are installed in the Condor libexec directory
##  On all nodes:
##    * OpenMPI is installed in the $MPI_HOME directory
##**************************************************************

#----------------------------
MPIRUN=mpirun
MPI_HOME=/usr/lib64/openmpi
#----------------------------

_CONDOR_LIBEXEC=`condor_config_val libexec`
_CONDOR_PARALLEL_HOSTS=$_CONDOR_LIBEXEC/condor_parallel_hosts.sh
_CONDOR_SSH_TO_JOB_WRAPPER=$_CONDOR_LIBEXEC/condor_ssh_to_job_wrapper.sh

# Source the condor_parallel_hosts.sh script
. $_CONDOR_PARALLEL_HOSTS

# Creates parallel_hosts file containing contact info for hosts
# Returns on head node only
CONDOR_GET_PARALLEL_HOSTS_INFO
ret=$?
if [ $ret -ne 0 ]; then
        echo "Error: $ret creating $_CONDOR_PARALLEL_HOSTS_FILE"
        exit $ret
fi

# Starting mpirun cmd 
exec $MPIRUN --prefix $MPI_HOME --mca orte_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER \
              --hostfile $_CONDOR_PARALLEL_HOSTS_FILE "$@"