[Condor-users] Fw: Configuring condor for execution of parallel amber jobs
- Date: Tue, 8 Jul 2008 14:07:17 +0530
- From: Pravin_Kumar@xxxxxxxxxxxxxxxxxx
- Subject: [Condor-users] Fw: Configuring condor for execution of parallel amber jobs
Hi,
I am trying to run a parallel universe job through Condor. The version of
Condor I am using is 7.1.0. My pool has three machines running Linux (RHEL
v5.0) as the OS. My submit machine and central manager is server3; server2
and server1 are execute nodes.
I generally use the student user account to submit jobs. This account is
present on all three machines, and the Condor binaries are in the PATH along
with the MPICH2 binaries and the Intel compilers (ifort, icc and icpc).
LD_LIBRARY_PATH is set appropriately to the respective lib directories.
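As a sanity check, something like the following (a minimal sketch; condor_submit, mpiexec and mpdboot are the standard binaries, the rest are the compilers I mentioned) can confirm each binary is visible to the student account on every node:

```shell
#!/bin/sh
# Report whether each required binary is on PATH for the current user.
# Run this as the student user on every node in the pool.
for tool in condor_submit mpiexec mpdboot ifort icc icpc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: $(command -v "$tool")"
  else
    echo "$tool: NOT on PATH"
  fi
done
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Running this on server1, server2 and server3 should show the same paths everywhere.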
I am trying to execute the Amber module sander.MPI simultaneously on all
three machines. The following is my mp2script:
--
#!/bin/sh -x
# File: mp2script
# Adapted from mp1script by Mark Calleja
#
# Edit MPDIR and LD_LIBRARY_PATH to suit your local configuration.
# Also don't forget to set the secretword in .mpd.conf.
#
export PWD=`pwd`
export MPD_CONF_FILE=~/.mpd.conf
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh
SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
EXECUTABLE=$1
shift
# The binary is copied but the executable flag is cleared,
# so the script has to take care of this.
chmod +x $EXECUTABLE
# Set this to the directory of MPICH2 installation
MPDIR=/opt/mpich2
PATH=$MPDIR/bin:.:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mpich2/lib:/opt/intel/fc/9.1.032/lib:/opt/intel/cc/9.1.038/lib:/opt/jdk1.5.0_06/lib
export PATH
# Keep track of where Condor's scratch dir for this VM is
export SCRATCH_LOC=loclocloc
echo $PWD > ~/$SCRATCH_LOC
echo $SCRATCH_LOC
finalize()
{
echo "In finalize"
mpdallexit
rm ~/$SCRATCH_LOC
exit
}
trap finalize TERM
if [ $_CONDOR_PROCNO -ne 0 ]
then
sleep 5
fi
if [ $_CONDOR_PROCNO -eq 0 ]
then
CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
CONDOR_MPI_SHELL=$(($RANDOM * $RANDOM))
echo '#!/bin/sh' > $CONDOR_MPI_SHELL
echo "cd \`cat ~/$SCRATCH_LOC\`" >> $CONDOR_MPI_SHELL
echo 'exec $1 $@' >> $CONDOR_MPI_SHELL
chmod a+r+x $CONDOR_MPI_SHELL
NODEFILE=nodefile
myTmp=`cat $CONDOR_CONTACT_FILE | cut -f 2 -d ' ' | sort -u`
rootHost=`hostname`
touch $NODEFILE
for i in $myTmp;
do
echo $i >> $NODEFILE
scp $CONDOR_MPI_SHELL $i:/tmp
done
nodes=`wc -l $NODEFILE | cut -f 1 -d ' '`
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
$CONDOR_CHIRP put $_CONDOR_SCRATCH_DIR/contact $_CONDOR_REMOTE_SPOOL_DIR/contact
mpdboot -n $nodes -v -f $NODEFILE
val=$?
if [ $val -ne 0 ]
then
echo "mpdboot error : $val"
exit 1
fi
mpdtrace -l
## run the actual mpijob
mpiexec -machinefile machines -n $_CONDOR_NPROCS /tmp/$CONDOR_MPI_SHELL $EXECUTABLE $@
mpdallexit
for i in `cat $NODEFILE`;
do
ssh $i "rm -f /tmp/$CONDOR_MPI_SHELL"
done
rm ~/$SCRATCH_LOC $CONDOR_MPI_SHELL contact nodefile machines
fi
exit $?
--
My submit description file is:
--
universe = parallel
executable = mp2script
arguments = sander.MPI -O -i mm.in -o 1AOT.out -p 1AOT.prmtop -c 1AOT.prmcrd -r 1AOT.xyz
machine_count = 2
should_transfer_files = yes
transfer_executable = true
when_to_transfer_output = on_exit
transfer_input_files = sander.MPI, mm.in, 1AOT.prmtop, 1AOT.prmcrd
+WantParallelSchedulingGroups = True
notification = never
log = log
error = err
output = out
queue
--
The following is the error file:
--
++ pwd
+ export PWD=/home/condor/execute/dir_17451
+ PWD=/home/condor/execute/dir_17451
+ export MPD_CONF_FILE=/home/student/.mpd.conf
+ MPD_CONF_FILE=/home/student/.mpd.conf
+ _CONDOR_PROCNO=0
+ _CONDOR_NPROCS=2
++ condor_config_val libexec
+ CONDOR_SSH=/home1/condor_release/libexec
+ CONDOR_SSH=/home1/condor_release/libexec/condor_ssh
++ condor_config_val libexec
+ SSHD_SH=/home1/condor_release/libexec
+ SSHD_SH=/home1/condor_release/libexec/sshd.sh
+ . /home1/condor_release/libexec/sshd.sh 0 2
++ trap sshd_cleanup 15
+++ condor_config_val CONDOR_SSHD
++ SSHD=/usr/sbin/sshd
+++ condor_config_val CONDOR_SSH_KEYGEN
++ KEYGEN=/usr/bin/ssh-keygen
+++ condor_config_val libexec
++ CONDOR_CHIRP=/home1/condor_release/libexec
++ CONDOR_CHIRP=/home1/condor_release/libexec/condor_chirp
++ PORT=4444
++ _CONDOR_REMOTE_SPOOL_DIR=/home/condor/spool/cluster31.proc0.subproc0
++ _CONDOR_PROCNO=0
++ _CONDOR_NPROCS=2
++ mkdir /home/condor/execute/dir_17451/tmp
++ hostkey=/home/condor/execute/dir_17451/tmp/hostkey
++ /bin/rm -f /home/condor/execute/dir_17451/tmp/hostkey /home/condor/execute/dir_17451/tmp/hostkey.pub
++ /usr/bin/ssh-keygen -q -f /home/condor/execute/dir_17451/tmp/hostkey -t rsa -N ''
++ '[' 0 -ne 0 ']'
++ idkey=/home/condor/execute/dir_17451/tmp/0.key
++ /usr/bin/ssh-keygen -q -f /home/condor/execute/dir_17451/tmp/0.key -t rsa -N ''
++ '[' 0 -ne 0 ']'
++ /home1/condor_release/libexec/condor_chirp put -perm 0700 /home/condor/execute/dir_17451/tmp/0.key /home/condor/spool/cluster31.proc0.subproc0/0.key
++ '[' 0 -ne 0 ']'
++ done=0
++ '[' 0 -eq 0 ']'
++ /usr/sbin/sshd -p4444 -oAuthorizedKeysFile=/home/condor/execute/dir_17451/tmp/0.key.pub -h/home/condor/execute/dir_17451/tmp/hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null -oAcceptEnv=_CONDOR
++ pid=17483
++ sleep 2
++ grep 'Server listening' sshd.out
++ done=1
++ '[' 1 -eq 0 ']'
++ /bin/rm sshd.out
+++ hostname
++ hostname=server3.jublbiosys.com
+++ pwd
++ currentDir=/home/condor/execute/dir_17451
+++ whoami
++ user=student
++ echo '0 server3.jublbiosys.com 4444 student /home/condor/execute/dir_17451'
++ /home1/condor_release/libexec/condor_chirp put -mode cwa - /home/condor/spool/cluster31.proc0.subproc0/contact
++ '[' 0 -ne 0 ']'
++ '[' 0 -eq 0 ']'
++ done=0
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/contact /home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/contact /home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/contact /home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/contact /home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/contact /home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=1
++ '[' 1 -eq 2 ']'
++ sleep 1
++ '[' 0 -eq 0 ']'
++ /bin/rm -f contact
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/contact /home/condor/execute/dir_17451/contact
+++ wc -l /home/condor/execute/dir_17451/contact
+++ awk '{print $1}'
++ lines=2
++ '[' 2 -eq 2 ']'
++ done=1
++ node=0
++ '[' 0 -ne 2 ']'
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/0.key /home/condor/execute/dir_17451/tmp/0.key
++ /home1/condor_release/libexec/condor_chirp remove /home/condor/spool/cluster31.proc0.subproc0/0.key
+++ expr 0 + 1
++ node=1
++ '[' 1 -ne 2 ']'
++ /home1/condor_release/libexec/condor_chirp fetch /home/condor/spool/cluster31.proc0.subproc0/1.key /home/condor/execute/dir_17451/tmp/1.key
++ /home1/condor_release/libexec/condor_chirp remove /home/condor/spool/cluster31.proc0.subproc0/1.key
+++ expr 1 + 1
++ node=2
++ '[' 2 -ne 2 ']'
++ chmod 0700 /home/condor/execute/dir_17451/tmp/0.key /home/condor/execute/dir_17451/tmp/1.key
++ /home1/condor_release/libexec/condor_chirp remove /home/condor/spool/cluster31.proc0.subproc0/contact
++ '[' 1 -eq 0 ']'
+ EXECUTABLE=sander.MPI
+ shift
+ chmod +x sander.MPI
+ MPDIR=/opt/mpich2
+ PATH=/opt/mpich2/bin:.:/home1/condor_release/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/opt/amb/amber9/exe:/opt/mpich2/bin:/opt/intel/cc/9.1.038/bin:/opt/python24/bin:/opt/intel/fc/9.1.032/bin::/home1/condor_release/bin:/home1/condor_release/sbin:/usr/bin:/usr/local/bin:/usr/sbin:/usr/local/sbin:/home1/condor_release/bin:/home1/condor_release/sbin::/root/bin:/root/bin:/opt/amb/amber9/exe:/opt/mpich2/bin:/opt/intel/cc/9.1.038/bin:/opt/python24/bin:/opt/intel/fc/9.1.032/bin::/home1/condor_release/bin:/home1/condor_release/sbin:/usr/bin:/usr/local/bin:/usr/sbin:/usr/local/sbin:/home1/condor_release/bin:/home1/condor_release/sbin:
+ export LD_LIBRARY_PATH=:/opt/mpich2/lib:/opt/intel/fc/9.1.032/lib:/opt/intel/cc/9.1.038/lib:/opt/jdk1.5.0_06/lib
+ LD_LIBRARY_PATH=:/opt/mpich2/lib:/opt/intel/fc/9.1.032/lib:/opt/intel/cc/9.1.038/lib:/opt/jdk1.5.0_06/lib
+ export PATH
+ export SCRATCH_LOC=loclocloc
+ SCRATCH_LOC=loclocloc
+ echo /home/condor/execute/dir_17451
+ echo loclocloc
+ trap finalize TERM
+ '[' 0 -ne 0 ']'
+ '[' 0 -eq 0 ']'
+ CONDOR_CONTACT_FILE=/home/condor/execute/dir_17451/contact
+ export CONDOR_CONTACT_FILE
+ CONDOR_MPI_SHELL=309250656
+ echo '#!/bin/sh'
+ echo 'cd `cat ~/loclocloc`'
+ echo 'exec $1 $@'
+ chmod a+r+x 309250656
+ NODEFILE=nodefile
++ cat /home/condor/execute/dir_17451/contact
++ cut -f 2 -d ' '
++ sort -u
+ myTmp=server3.jublbiosys.com
++ hostname
+ rootHost=server3.jublbiosys.com
+ touch nodefile
+ for i in '$myTmp'
+ echo server3.jublbiosys.com
+ scp 309250656 server3.jublbiosys.com:/tmp
Host key verification failed.
lost connection
++ wc -l nodefile
++ cut -f 1 -d ' '
+ nodes=1
+ sort -n +0
sort: open failed: +0: No such file or directory
+ awk '{print $2}'
+ /home1/condor_release/libexec/condor_chirp put /home/condor/execute/dir_17451/contact /home/condor/spool/cluster31.proc0.subproc0/contact
+ mpdboot -n 1 -v -f nodefile
+ val=0
+ '[' 0 -ne 0 ']'
+ mpdtrace -l
+ mpiexec -machinefile machines -n 2 /tmp/309250656 sander.MPI -O -i mm.in -o 1AOT.out -p 1AOT.prmtop -c 1AOT.prmcrd -r 1AOT.xyz
+ mpdallexit
++ cat nodefile
+ for i in '`cat $NODEFILE`'
+ ssh server3.jublbiosys.com 'rm -f /tmp/309250656'
Host key verification failed.
+ rm /home/student/loclocloc 309250656 contact nodefile machines
+ exit 0
--
The following is the out file:
--
loclocloc
running mpdallexit on server3.jublbiosys.com
LAUNCHED mpd on server3.jublbiosys.com via
RUNNING: mpd on server3.jublbiosys.com
server3.jublbiosys.com_48021 (180.190.40.23)
problem with execution of /tmp/309250656 on server3.jublbiosys.com: [Errno 2] No such file or directory
problem with execution of /tmp/309250656 on server3.jublbiosys.com: [Errno 2] No such file or directory
--
The following is the log file:
--
000 (031.000.000) 07/07 18:10:16 Job submitted from host: <180.190.40.23:42236>
...
014 (031.000.000) 07/07 18:10:38 Node 0 executing on host: <180.190.40.23:51669>
...
014 (031.000.001) 07/07 18:10:39 Node 1 executing on host: <180.190.40.23:51669>
...
...
001 (031.000.000) 07/07 18:10:39 Job executing on host: MPI_job
...
015 (031.000.000) 07/07 18:10:48 Node 0 terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
7667 - Run Bytes Sent By Node
5750602 - Run Bytes Received By Node
7667 - Total Bytes Sent By Node
5750602 - Total Bytes Received By Node
...
005 (031.000.000) 07/07 18:10:48 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
7667 - Run Bytes Sent By Job
11501204 - Run Bytes Received By Job
7667 - Total Bytes Sent By Job
11501204 - Total Bytes Received By Job
--
In the mp2script file, I could not work out what the line
SCRATCH_LOC=loclocloc should point to. Should it be `$PWD`?
Any ideas on how to implement this, or suggestions in this regard, would be
appreciated.
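For example, I wondered whether something like this sketch (untested, and the naming scheme is just my guess) would be safer than a fixed string, since two jobs landing on the same machine would otherwise overwrite each other's marker file:

```shell
#!/bin/sh
# Sketch (my guess, untested): derive the marker file name from the Condor
# node number and the script's PID instead of the fixed string "loclocloc",
# so concurrent jobs on the same machine don't clobber each other's marker.
SCRATCH_LOC=condor_scratch_loc.${_CONDOR_PROCNO:-0}.$$
echo "$PWD" > ~/"$SCRATCH_LOC"
```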
Thanks :)
Pravinkumar