
[Condor-users] How to get unoccupied nodes?



Hi,

I am new to the Condor system.
I want to use Condor to run GROMACS simulations in parallel.
I read the user manual and wrote a submit script myself.
Unfortunately, the nodes assigned to me are always already occupied by
other jobs, so the CPU utilization I get is always below 10%.
Our cluster uses the bash shell, LAM/MPI v7.1.4, and Condor v6.8.8.
I took the lamscript from an earlier message on this mailing list, which
is supposed to work under bash.
Apparently there is still something wrong in my submit script or
lamscript, so I post both of them below.
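
(For reference, this is how I have been checking which machines are
free; I assume condor_status behaves the same way on v6.8.8:)

# machines in the Unclaimed state, i.e. available for new jobs
condor_status -avail
# the same query with an explicit constraint
condor_status -constraint 'State == "Unclaimed"'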

condor_mpi:
# Condor submit description file (not a shell script, so no shebang needed)
Universe = parallel
Executable = ./lamscript
machine_count = 2
output = md_$(NODE).out
error = md_$(NODE).err
log = md.log
arguments = /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md.sh
+WantIOProxy = True
should_transfer_files = yes
when_to_transfer_output = on_exit
Queue
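
(In case it is relevant: I understand Requirements and Rank are the
standard submit-file knobs for steering jobs toward particular machines;
the exact expressions below are only my untested guess:)

# prefer machines that are not already busy (threshold is a guess)
Requirements = (LoadAvg < 0.3)
# among the machines that match, prefer the least loaded one
Rank = 0 - LoadAvg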

lamscript:
#!/bin/sh

# Condor sets these in the job's environment; the assignments below are
# no-ops kept from the example script this was based on
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
_CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

# Set this to the bin directory of your LAM installation.
# It also has to be exported from your shell startup file (~/.bashrc
# here, since we use bash), so the remote side can find it!
export LAMDIR=/stathome/jiangsl/soft/lam-7.1.4
export PATH=${LAMDIR}/bin:${PATH}
export LD_LIBRARY_PATH=$LAMDIR/lib/:lib:/usr/lib:.:/opt/intel/compilers/lib
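
# for reference, the matching lines in ~/.bashrc on every node would be
# (assuming bash is the login shell everywhere):
#   export LAMDIR=/stathome/jiangsl/soft/lam-7.1.4
#   export PATH=$LAMDIR/bin:$PATH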

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
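# (sshd.sh starts a per-node sshd; as far as I can tell it also defines
# the sshd_cleanup function that is called below)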

# If this is not the head node, just block until the sshd started above
# exits, so the sshds keep running for the duration of the job
if [ $_CONDOR_PROCNO -ne 0 ]
then
                wait
                sshd_cleanup
                exit 0
fi

EXECUTABLE=$1
shift

#### debug output
echo $EXECUTABLE
echo $SSHD_SH
echo $_CONDOR_PROCNO
echo $_CONDOR_NPROCS
echo $_CONDOR_REMOTE_SPOOL_DIR
####
# the binary is transferred by Condor but the executable flag is
# cleared, so the script has to restore it
chmod +x $EXECUTABLE

# to allow multiple LAM jobs to run on a single machine, we have to give
# each one a reasonably unique session suffix; the PID of this script works
export LAM_MPI_SESSION_SUFFIX=$$
echo $$
export LAMRSH=$CONDOR_SSH
echo $CONDOR_SSH
# when a job is killed by the user, this script will get SIGTERM.
# The script has to catch it and clean up the LAM environment
finalize()
{
sshd_cleanup
lamhalt
exit
}
trap finalize TERM

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
echo $_CONDOR_SCRATCH_DIR
echo $CONDOR_CONTACT_FILE
# The second field in the contact file is the machine name
# that condor_ssh knows how to use. Note that this used to
# say "sort -n +0 ...", but -n option is now deprecated.
sort < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
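# each contact file line starts with the node number (hence the sort)
# and, as noted above, the machine name is the second field; e.g. a line
# might look like (hypothetical values):
#   0 node01 ...
# so the pipeline leaves one hostname per line in "machines"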

# start the lam environment
# For older versions of lam you may need to remove the -ssi boot rsh line
lamboot -ssi boot rsh -ssi rsh_agent "$LAMRSH -x" machines

if [ $? -ne 0 ]
then
        echo "lamscript error booting lam"
        exit 1
fi
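
# optional sanity check, commented out: lamnodes ships with LAM and
# prints the booted nodes (this line is my addition, not the original)
# lamnodes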

## run the actual MPI job; in LAM's mpirun, "C" schedules one process
## per available CPU on the booted nodes
mpirun C -ssi rpi usysv -ssi coll_smp 1 $EXECUTABLE "$@" &
###############################################
CHILD=$!
# wait returns 128+signal when interrupted by a trapped signal, so keep
# waiting until the child's real exit status comes back
TMP=130
while [ $TMP -gt 128 ] ; do
        wait $CHILD
        TMP=$?
done

# clean up files
sshd_cleanup
/bin/rm -f machines

# clean up lam
lamhalt

exit $TMP
----

Could anyone kindly tell me how to correct my scripts?

Hsin-Lin