
[Condor-users] Fw: Configuring condor for execution of parallel amber jobs

Hi,

I am trying to run a parallel universe job through Condor. The Condor
version I am using is 7.1.0. My pool has three machines running Linux
(RHEL v5.0): server3 is the submit machine and central manager, while
server2 and server1 are execute nodes.
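
(For completeness, the kind of sanity check I run from server3 is just the
standard Condor commands, nothing job-specific; the libexec value is the
one that also appears in the error file below:)
--
condor_status              # should list slots on server1, server2 and server3
condor_config_val libexec  # -> /home1/condor_release/libexec on my install
--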

I submit jobs from the student user account. This account exists on all
three machines, the Condor binaries are in its PATH along with the MPICH2
binaries and the Intel compilers (ifort, icc and icpc), and
LD_LIBRARY_PATH is set to the corresponding lib directories.
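
(To rule out a per-node environment problem for the student account, a
quick check along these lines can be run from server3; the hostnames are
mine, the binaries are the ones mentioned above, and note that a
non-interactive ssh shell may pick up a different environment than a
login shell:)
--
for h in server1 server2 server3; do
    echo "== $h =="
    ssh $h 'which mpd mpiexec ifort icc icpc; echo $LD_LIBRARY_PATH'
done
--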

I am trying to run the Amber module sander.MPI simultaneously on all three
machines. The following is my mp2script:
--
#!/bin/sh -x

# File: mp2script
#       Adapted from mp1script by Mark Calleja
#
#   Edit MPDIR and LD_LIBRARY_PATH to suit your local configuration.
# Also don't forget to set the secretword in .mpd.conf.
#

export PWD=`pwd`
export MPD_CONF_FILE=~/.mpd.conf

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

EXECUTABLE=$1
shift

# The binary is copied but the executable flag is cleared,
# so the script has to take care of this.
chmod +x $EXECUTABLE

# Set this to the directory of MPICH2 installation
MPDIR=/opt/mpich2
PATH=$MPDIR/bin:.:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mpich2/lib:/opt/intel/fc/9.1.032/lib:/opt/intel/cc/9.1.038/lib:/opt/jdk1.5.0_06/lib
export PATH

# Keep track where Condor's scratch dir for this VM is
export SCRATCH_LOC=loclocloc
echo $PWD > ~/$SCRATCH_LOC
echo $SCRATCH_LOC
finalize()
{
 echo "In finalize"
  mpdallexit
  rm ~/$SCRATCH_LOC
  exit
}
trap finalize TERM

if [ $_CONDOR_PROCNO -ne 0 ]
then
    sleep 5
fi

if [ $_CONDOR_PROCNO -eq 0 ]
then
    CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
    export CONDOR_CONTACT_FILE

    CONDOR_MPI_SHELL=$(($RANDOM * $RANDOM))
    echo '#!/bin/sh' > $CONDOR_MPI_SHELL
    echo "cd \`cat ~/$SCRATCH_LOC\`" >> $CONDOR_MPI_SHELL
    echo 'exec $1 $@' >> $CONDOR_MPI_SHELL
    chmod a+r+x $CONDOR_MPI_SHELL

    NODEFILE=nodefile
    myTmp=`cat $CONDOR_CONTACT_FILE | cut -f 2 -d ' ' | sort -u`

    rootHost=`hostname`
    touch $NODEFILE

    for i in $myTmp;
    do
        echo $i >> $NODEFILE
        scp $CONDOR_MPI_SHELL $i:/tmp
    done

    nodes=`wc -l $NODEFILE | cut -f 1 -d ' '`

    # The second field in the contact file is the machine name
    # that condor_ssh knows how to use
    sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

    $CONDOR_CHIRP put $_CONDOR_SCRATCH_DIR/contact $_CONDOR_REMOTE_SPOOL_DIR/contact

    mpdboot -n $nodes -v -f $NODEFILE
    val=$?

    if [ $val -ne 0 ]
    then
        echo "mpdboot error : $val"
        exit 1
    fi

    mpdtrace -l

    ## run the actual mpijob
    mpiexec -machinefile machines -n $_CONDOR_NPROCS /tmp/$CONDOR_MPI_SHELL $EXECUTABLE $@
    mpdallexit

    for i in `cat $NODEFILE`;
    do
        ssh $i "rm -f /tmp/$CONDOR_MPI_SHELL"
    done

    rm ~/$SCRATCH_LOC $CONDOR_MPI_SHELL contact nodefile machines
fi


exit $?

---
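
(A note on how the script gets its host list, in case it helps anyone spot
the problem: sshd.sh writes one line per node into the contact file, and
mp2script only uses the second field, the hostname that condor_ssh knows
how to reach. The sample line is taken verbatim from my error file; the two
commands simply restate what the script already does:)
--
# contact file format: <node#> <hostname> <port> <user> <scratch dir>
#   0 server3.jublbiosys.com 4444 student /home/condor/execute/dir_17451

# nodefile: unique hostnames, fed to mpdboot
cat $CONDOR_CONTACT_FILE | cut -f 2 -d ' ' | sort -u > nodefile

# machines: one hostname per rank, in node order, fed to mpiexec
# (this is the sort invocation that fails with "open failed: +0" below)
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
--
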
My submit description file is:
--
universe = parallel
executable = mp2script
arguments = sander.MPI -O -i mm.in -o 1AOT.out -p 1AOT.prmtop -c 1AOT.prmcrd -r 1AOT.xyz
machine_count = 2
should_transfer_files = yes
transfer_executable = true
when_to_transfer_output = on_exit
transfer_input_files = sander.MPI, mm.in, 1AOT.prmtop, 1AOT.prmcrd
+WantParallelSchedulingGroups = True
notification = never
log = log
error = err
output = out
queue
--
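
(For completeness, I submit and watch the job in the usual way; the submit
file name below is only a placeholder, and "log" is the log file named in
the submit description above:)
--
condor_submit amber.submit
condor_q -run
tail -f log
--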

The following is the error file:
--
++ pwd
+ export PWD=/home/condor/execute/dir_17451
+ PWD=/home/condor/execute/dir_17451
+ export MPD_CONF_FILE=/home/student/.mpd.conf
+ MPD_CONF_FILE=/home/student/.mpd.conf
+ _CONDOR_PROCNO=0
+ _CONDOR_NPROCS=2
++ condor_config_val libexec
+ CONDOR_SSH=/home1/condor_release/libexec
+ CONDOR_SSH=/home1/condor_release/libexec/condor_ssh
++ condor_config_val libexec
+ SSHD_SH=/home1/condor_release/libexec
+ SSHD_SH=/home1/condor_release/libexec/sshd.sh
+ . /home1/condor_release/libexec/sshd.sh 0 2
++ trap sshd_cleanup 15
+++ condor_config_val CONDOR_SSHD
++ SSHD=/usr/sbin/sshd
+++ condor_config_val CONDOR_SSH_KEYGEN
++ KEYGEN=/usr/bin/ssh-keygen
+++ condor_config_val libexec
++ CONDOR_CHIRP=/home1/condor_release/libexec
++ CONDOR_CHIRP=/home1/condor_release/libexec/condor_chirp
++ PORT=4444
++ _CONDOR_REMOTE_SPOOL_DIR=/home/condor/spool/cluster31.proc0.subproc0
++ _CONDOR_PROCNO=0
++ _CONDOR_NPROCS=2
++ mkdir /home/condor/execute/dir_17451/tmp
++ hostkey=/home/condor/execute/dir_17451/tmp/hostkey
++ /bin/rm -f /home/condor/execute/dir_17451/tmp/hostkey
/home/condor/execute/dir_17451/tmp/hostkey.pub
++ /usr/bin/ssh-keygen -q -f /home/condor/execute/dir_17451/tmp/hostkey -t
rsa -N ''
++ '[' 0 -ne 0 ']'
++ idkey=/home/condor/execute/dir_17451/tmp/0.key
++ /usr/bin/ssh-keygen -q -f /home/condor/execute/dir_17451/tmp/0.key -t
rsa -N ''
++ '[' 0 -ne 0 ']'
++ /home1/condor_release/libexec/condor_chirp put -perm 0700
/home/condor/execute/dir_17451/tmp/0.key
/home/condor/spool/cluster31.proc0.subproc0/0.key
++ '[' 0 -ne 0 ']'
++ done=0
++ '[' 0 -eq 0 ']'
++ /usr/sbin/sshd -p4444 -oAuthorizedKeysFile=/home/condor/execute/dir_17451/tmp/0.key.pub -h/home/condor/execute/dir_17451/tmp/hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null -oAcceptEnv=_CONDOR
++ pid=17483
++ sleep 2
++ grep 'Server listening' sshd.out
++ done=1
++ '[' 1 -eq 0 ']'
++ /bin/rm sshd.out
+++ hostname
++ hostname=server3.jublbiosys.com
+++ pwd
++ currentDir=/home/condor/execute/dir_17451
+++ whoami
++ user=student
++ echo '0 server3.jublbiosys.com 4444 student
/home/condor/execute/dir_17451'
++ /home1/condor_release/libexec/condor_chirp put -mode cwa -
/home/condor/spool/cluster31.proc0.subproc0/contact
++ '[' 0 -ne 0 ']'
++ '[' 0 -eq 0 ']'
++ done=0
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/contact
/home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/contact
/home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/contact
/home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/contact
/home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/contact
/home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/contact
/home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=2
++ '[' 2 -eq 2 ']'
++ done=1
++ node=0
++ '[' 0 -ne 2 ']'
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/0.key
/home/condor/execute/dir_17451/tmp/0.key
++ /home1/condor_release/libexec/condor_chirp remove
/home/condor/spool/cluster31.proc0.subproc0/0.key
+++ expr 0 + 1
++ node=1
++ '[' 1 -ne 2 ']'
++ /home1/condor_release/libexec/condor_chirp fetch
/home/condor/spool/cluster31.proc0.subproc0/1.key
/home/condor/execute/dir_17451/tmp/1.key
++ /home1/condor_release/libexec/condor_chirp remove
/home/condor/spool/cluster31.proc0.subproc0/1.key
+++ expr 1 + 1
++ node=2
++ '[' 2 -ne 2 ']'
++ chmod 0700 /home/condor/execute/dir_17451/tmp/0.key
/home/condor/execute/dir_17451/tmp/1.key
++ /home1/condor_release/libexec/condor_chirp remove
/home/condor/spool/cluster31.proc0.subproc0/contact
++ '[' 1 -eq 0 ']'
+ EXECUTABLE=sander.MPI
+ shift
+ chmod +x sander.MPI
+ MPDIR=/opt/mpich2
+ PATH=/opt/mpich2/bin:.:/home1/condor_release/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/opt/amb/amber9/exe:/opt/mpich2/bin:/opt/intel/cc/9.1.038/bin:/opt/python24/bin:/opt/intel/fc/9.1.032/bin::/home1/condor_release/bin:/home1/condor_release/sbin:/usr/bin:/usr/local/bin:/usr/sbin:/usr/local/sbin:/home1/condor_release/bin:/home1/condor_release/sbin::/root/bin:/root/bin:/opt/amb/amber9/exe:/opt/mpich2/bin:/opt/intel/cc/9.1.038/bin:/opt/python24/bin:/opt/intel/fc/9.1.032/bin::/home1/condor_release/bin:/home1/condor_release/sbin:/usr/bin:/usr/local/bin:/usr/sbin:/usr/local/sbin:/home1/condor_release/bin:/home1/condor_release/sbin:
+ export LD_LIBRARY_PATH=:/opt/mpich2/lib:/opt/intel/fc/9.1.032/lib:/opt/intel/cc/9.1.038/lib:/opt/jdk1.5.0_06/lib
+ LD_LIBRARY_PATH=:/opt/mpich2/lib:/opt/intel/fc/9.1.032/lib:/opt/intel/cc/9.1.038/lib:/opt/jdk1.5.0_06/lib
+ export PATH
+ export SCRATCH_LOC=loclocloc
+ SCRATCH_LOC=loclocloc
+ echo /home/condor/execute/dir_17451
+ echo loclocloc
+ trap finalize TERM
+ '[' 0 -ne 0 ']'
+ '[' 0 -eq 0 ']'
+ CONDOR_CONTACT_FILE=/home/condor/execute/dir_17451/contact
+ export CONDOR_CONTACT_FILE
+ CONDOR_MPI_SHELL=309250656
+ echo '#!/bin/sh'
+ echo 'cd `cat ~/loclocloc`'
+ echo 'exec $1 $@'
+ chmod a+r+x 309250656
+ NODEFILE=nodefile
++ cat /home/condor/execute/dir_17451/contact
++ cut -f 2 -d ' '
++ sort -u
+ myTmp=server3.jublbiosys.com
++ hostname
+ rootHost=server3.jublbiosys.com
+ touch nodefile
+ for i in '$myTmp'
+ echo server3.jublbiosys.com
+ scp 309250656 server3.jublbiosys.com:/tmp
Host key verification failed.
lost connection
++ wc -l nodefile
++ cut -f 1 -d ' '
+ nodes=1
+ sort -n +0
sort: open failed: +0: No such file or directory
+ awk '{print $2}'
+ /home1/condor_release/libexec/condor_chirp put /home/condor/execute/dir_17451/contact /home/condor/spool/cluster31.proc0.subproc0/contact
+ mpdboot -n 1 -v -f nodefile
+ val=0
+ '[' 0 -ne 0 ']'
+ mpdtrace -l
+ mpiexec -machinefile machines -n 2 /tmp/309250656 sander.MPI -O -i mm.in -o 1AOT.out -p 1AOT.prmtop -c 1AOT.prmcrd -r 1AOT.xyz
+ mpdallexit
++ cat nodefile
+ for i in '`cat $NODEFILE`'
+ ssh server3.jublbiosys.com 'rm -f /tmp/309250656'
Host key verification failed.
+ rm /home/student/loclocloc 309250656 contact nodefile machines
+ exit 0
--
The following is the out file:
--
loclocloc
running mpdallexit on server3.jublbiosys.com
LAUNCHED mpd on server3.jublbiosys.com  via
RUNNING: mpd on server3.jublbiosys.com
server3.jublbiosys.com_48021 (180.190.40.23)
problem with execution of /tmp/309250656  on  server3.jublbiosys.com:
[Errno 2] No such file or directory
problem with execution of /tmp/309250656  on  server3.jublbiosys.com:
[Errno 2] No such file or directory
--

The following is the log file:
--
000 (031.000.000) 07/07 18:10:16 Job submitted from host:
<180.190.40.23:42236>
...
014 (031.000.000) 07/07 18:10:38 Node 0 executing on host:
<180.190.40.23:51669>
...
014 (031.000.001) 07/07 18:10:39 Node 1 executing on host:
<180.190.40.23:51669>
...
001 (031.000.000) 07/07 18:10:39 Job executing on host: MPI_job
...
015 (031.000.000) 07/07 18:10:48 Node 0 terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    7667  -  Run Bytes Sent By Node
    5750602  -  Run Bytes Received By Node
    7667  -  Total Bytes Sent By Node
    5750602  -  Total Bytes Received By Node
...
005 (031.000.000) 07/07 18:10:48 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    7667  -  Run Bytes Sent By Job
    11501204  -  Run Bytes Received By Job
    7667  -  Total Bytes Sent By Job
    11501204  -  Total Bytes Received By Job
--

In the mp2script file, I could not work out what the line
SCRATCH_LOC=loclocloc is supposed to point to. Should it be `$PWD`?
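
As far as I can tell from reading the script (and I may well be wrong,
which is why I am asking), SCRATCH_LOC is just an arbitrary file name under
$HOME: each node writes its Condor scratch directory into that file, and
the generated wrapper cds into it before exec'ing the real binary, i.e.
roughly:
--
# in mp2script: remember the scratch dir in ~/loclocloc
echo $PWD > ~/$SCRATCH_LOC

# in the generated wrapper ($CONDOR_MPI_SHELL) that mpiexec runs:
cd `cat ~/$SCRATCH_LOC`
exec $1 $@
--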

Any ideas on how to get this working, or any suggestions in this regard,
would be much appreciated.

Thanks :)

Pravinkumar