
Re: [Condor-users] parallel jobs getting stuck



Diana,

I am not an expert in the parallel universe. The developer who is happens to be very busy this week at a workshop.

A couple of things to try:

Look in the ShadowLog and SchedLog on your submit machine. Are there any error messages when the job completes?
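
For example, something like this on the submit machine should pull out the relevant entries (12.0 here is just a placeholder for your stuck job's id):

grep '12.0' `condor_config_val LOG`/ShadowLog
grep '12.0' `condor_config_val LOG`/SchedLog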

If you want to get rid of a job that is stuck in X state after being removed, you can use the -forcex option to condor_rm.
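
For example, if the stuck job shows up in condor_q as 12.0:

condor_rm -forcex 12.0

Be aware that -forcex bypasses the normal removal protocol; it only scrubs the schedd's local record of the job, and won't clean up anything still running on the execute machines.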

--Dan

Diana Lousa wrote:

Hello,

We have been experiencing some problems when running parallel jobs
in Condor's parallel universe. The main problem is that our jobs
often get stuck. The jobs terminate (all the expected output files are
properly written to the correct directory, and running the top command
on the machine where the job was running shows that it has finished).
However, condor_q reports that the job is still running long after (2
days after) its effective termination. Moreover, using condor_rm to
kill those jobs is ineffective: they do not die (they appear in the X
state in condor_q), and our central manager (which is also our
dedicated scheduler) gets stuck and no longer answers condor_q and
condor_submit commands. To solve the problem we tried rebooting our
central manager/dedicated scheduler, but even this brute-force approach
didn't help. The system came up and all the processors appeared as
unclaimed, but the condor_q and condor_rm commands still didn't work.
Has anyone faced similar problems, or does anyone have a clue about the
source of the problems we are experiencing and how to solve them?
Below I send a copy of our scripts:

Submission file:

Universe = parallel
Executable = run.sh
initialdir = xxx
machine_count = 2
output = $(NODE).out
error = $(NODE).error
log = parallel.log
+WantParallelSchedulingGroups = True
Requirements = (machine!="xxx")
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
Queue

run.sh:

#!/bin/sh

# This is for LAM operation on a *single* SMP machine.

# Where's my LAM installation?
LAMDIR=/etc/lam
#export PATH=$LAMDIR/bin:$PATH
export PATH=$PATH:$LAMDIR/bin
export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:/opt/intel/compilers/lib

# These are provided in the environment by Condor's parallel universe
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
_CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR

# If not the head node, just sleep forever so the node
# stays alive while the head node does its work
if [ $_CONDOR_PROCNO -ne 0 ]
then
       # 'wait' has no child processes to wait on here and would
       # return immediately, so sleep in a loop instead; Condor
       # tears this node down once node 0 exits
       while true; do sleep 60; done
fi


# the binary is copied but the executable flag is cleared,
# so the script has to take care of this (assuming the
# transferred binary is program_mpi, used below)
chmod +x program_mpi


# to allow multiple LAM jobs to run on a single machine,
# we have to give each one a somewhat unique session suffix
export LAM_MPI_SESSION_SUFFIX=$$

# when a job is killed by the user, this script will get SIGTERM.
# The script has to catch it and clean up the LAM environment
finalize()
{
lamhalt
exit
}
trap finalize TERM

# Each of my machines has 4 cores, so I can set this here. For
# a heterogeneous mix one would need to use a machine file on the
# execute host, e.g. put in $LAMDIR/etc

/bin/echo "$HOSTNAME cpu=2" > machines

# start the lam environment

lamboot machines

if [ $? -ne 0 ]
then
       echo "lamscript error booting lam"
       exit 1
fi

## run the actual mpijob; choose your rpi module



for i in 1 2 3 4 5
do
       # background mpirun so the TERM trap above can fire while we
       # wait; 'C' already starts one copy per CPU (2 here, per the
       # machines file), so the stray '-np 2' that would have been
       # passed through to the program is dropped
       mpirun C program_mpi input_${i} output_${i} &
       CHILD=$!
       TMP=130
       # wait returns >128 when interrupted by a trapped signal;
       # keep waiting until the child itself has exited
       while [ $TMP -gt 128 ] ; do
              wait $CHILD
              TMP=$?
       done
done

/bin/rm -f machines

# clean up lam
lamhalt

exit 0




Thanks in advance

Diana Lousa
PhD Student
Instituto de Tecnologia Química e Biológica (ITQB)
Universidade Nova de Lisboa
Avenida da República, EAN
Apartado 127
2780-901 Oeiras
Portugal

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/