
Re: [Condor-users] parallel jobs getting stuck


I am not an expert in the parallel universe. The developer who is happens to be very busy this week at a workshop.

A couple of things to try:

Look in the ShadowLog and SchedLog on your submit machine. Are there any error messages when the job completes?
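For example (the /var/log/condor path is an assumption; the actual location is set by the LOG entry in your condor_config):

```shell
# Assumed default log location -- check the LOG setting in condor_config.
grep -i 'error' /var/log/condor/ShadowLog
grep -i 'error' /var/log/condor/SchedLog
```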

If you want to get rid of a job that is stuck in X state after being removed, you can use the -forcex option to condor_rm.
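For example (the job id 42.0 is hypothetical; use the id shown by condor_q):

```shell
condor_rm 42.0           # normal removal; may leave the job in the X state
condor_rm -forcex 42.0   # force removal of a job stuck in the X state
```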


Diana Lousa wrote:


We have been experiencing some problems when running parallel jobs
in Condor's parallel universe. The main problem is that our jobs get
stuck. The jobs terminate (all the expected output files are properly
written to the correct directory, and running top on the machine
where the job was running shows that it has finished). However,
condor_q reports that the job is still running long (two days) after
its effective termination. Moreover, using condor_rm to kill those
jobs is ineffective: they do not die (they appear in the X state in
condor_q), and our central manager (which is also our dedicated
scheduler) gets stuck and stops answering condor_q and condor_submit
commands. To solve the problem we tried rebooting our central
manager/dedicated scheduler, but even this brute-force approach
didn't help. The system came back up and all the processors appeared
as unclaimed, but the condor_q and condor_rm commands still didn't
work.
Has anyone faced similar problems, or does anyone have any clue about
the source of the problems we are experiencing and how to solve them?
Below is a copy of our scripts:

Submission file:

Universe = parallel
Executable = run.sh
initialdir = xxx
machine_count = 2
output = $(NODE).out
error = $(NODE).error
log = parallel.log
+WantParallelSchedulingGroups = True
Requirements = (machine!="xxx")
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"



#!/bin/bash
# This is for LAM operation on a *single* SMP machine.

# Where's my LAM installation? Adjust LAMDIR to your site.
export PATH=$PATH:$LAMDIR/bin

# Only the head node (proc 0) does any work here; since LAM runs
# entirely on one machine, the other nodes can exit immediately.
if [ "$_CONDOR_PROCNO" -ne 0 ]; then
        exit 0
fi

# The binary is copied by Condor but the executable flag is cleared,
# so the script has to restore it.
chmod +x program_mpi

# To allow multiple LAM jobs to run on a single machine, give each
# LAM session a somewhat unique value (here, the script's PID).
export LAM_MPI_SESSION_SUFFIX=$$

# When a job is removed by the user, this script gets SIGTERM.
# Catch it and clean up the LAM environment before exiting.
finalize()
{
        lamhalt
        /bin/rm -f machines
        exit 1
}
trap finalize TERM

# Each of my machines gives the job 2 CPUs, so the count can be
# hard-coded here. For a heterogeneous mix one would need to use a
# machine file on the execute host, e.g. put in $LAMDIR/etc.
/bin/echo "$HOSTNAME cpu=2" > machines

# Start the LAM environment.
lamboot machines
if [ $? -ne 0 ]; then
        echo "run.sh: error booting LAM"
        exit 1
fi

## Run the actual MPI jobs; "C" tells mpirun to start one process
## per CPU listed in the machines file. Each job is started in the
## background and waited on, so that a trapped SIGTERM can interrupt
## the wait and trigger the cleanup above (wait returns >128 when
## interrupted by a signal).
for i in 1 2 3 4 5; do
        mpirun C program_mpi input_${i} output_${i} &
        CHILD=$!
        TMP=129
        while [ $TMP -gt 128 ]; do
                wait $CHILD
                TMP=$?
        done
done

# Clean up LAM and the machine file.
lamhalt
/bin/rm -f machines

exit 0

Thanks in advance

Diana Lousa
PhD Student
Instituto de Tecnologia Química e Biológica (ITQB)
Universidade Nova de Lisboa
Avenida da República, EAN
Apartado 127
2780-901 Oeiras

Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/