
Re: [Condor-users] parallel jobs getting stuck




Is your job's "user log" on an NFS filesystem? We are making some changes to improve this case, because it is problematic in current versions.

You could try adding this to your configuration:

IGNORE_NFS_LOCK_ERRORS = True

or, if you still have problems, this:

ENABLE_USERLOG_LOCKING = False
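For example, to set one (or both) of these on the submit machine -- a minimal sketch; /etc/condor/condor_config.local is only an example path, check LOCAL_CONFIG_FILE for your installation:

echo "IGNORE_NFS_LOCK_ERRORS = True" >> /etc/condor/condor_config.local
echo "ENABLE_USERLOG_LOCKING = False" >> /etc/condor/condor_config.local

# verify the values are picked up, then have the daemons re-read the config
condor_config_val IGNORE_NFS_LOCK_ERRORS
condor_config_val ENABLE_USERLOG_LOCKING
condor_reconfig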

--Dan

Diana Lousa wrote:
Hello again,

First, I want to thank you for your reply. Unfortunately, our problem
remains unsolved, although we've followed your advice:

On Wed, 2007-08-08 at 15:21, Dan Bradley wrote:
Diana,

I am not an expert in the parallel universe. The developer who is an expert happens to be very busy this week at a workshop.

A couple things to try:

Look in the ShadowLog and SchedLog on your submit machine. Are there any error messages when the job completes?

We had this kind of error message in the ShadowLog:


8/7 20:43:36 (362.0) (3553): FileLock::obtain(1) failed - errno 37 (No
locks available)
8/7 20:43:36 (362.0) (3553): Job 362.0 terminated: exited with status 0

It seems to us that there was some problem with the locks. Can anyone
shed light on the meaning of this error message?
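In case it is relevant, one thing we could check is whether the directory holding the job's user log is actually on NFS (the path below is just a placeholder for our real initialdir):

df -T /path/to/job/initialdir
stat -f /path/to/job/initialdir/parallel.log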
If you want to get rid of a job that is stuck in X state after being removed, you can use the -forcex option to condor_rm.
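For example, for the job from the ShadowLog excerpt above, something like:

condor_rm -forcex 362.0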

We had already tried the -forcex option, but then condor_q stopped
working, as did the other condor commands on the dedicated
scheduler. We could only fix this by rebooting the dedicated
scheduler. Any idea why this happened? We also tried kill -9 on the
condor_shadows corresponding to the idle jobs: the result was that the
processes became zombies.
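Next time, before rebooting, we could at least check whether the schedd itself is still alive and what it is logging (paths are just examples for our setup; condor_config_val LOG gives the real log directory):

condor_config_val LOG
tail -n 50 /var/log/condor/SchedLog
ps -ef | grep -E 'condor_(schedd|shadow)'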

Thanks again

Diana Lousa
PhD Student
Instituto de Tecnologia Química e Biológica (ITQB)
Universidade Nova de Lisboa
Avenida da República, EAN
Apartado 127
2780-901 Oeiras
Portugal

--Dan

Diana Lousa wrote:

Hello,

We have been experiencing some problems running parallel jobs in
Condor's parallel universe. The main problem is that our jobs often get
stuck. The jobs terminate (all the expected output files are properly
written to the correct directory, and running the top command on the
machine where the job was running shows that it has finished). However,
condor_q reports that the job is still running long after (2 days
after) its effective termination. Moreover, using condor_rm to kill
those jobs is ineffective, since they do not die (they appear as X in
condor_q), and our central manager (which is also our dedicated
scheduler) gets stuck and doesn't answer condor_q and condor_submit
commands. To solve the problem we tried rebooting our central
manager/dedicated scheduler, but even this brute-force approach didn't
work. The system came back up, all the processors appeared as
unclaimed, but the condor_q and condor_rm commands still didn't work.
Has anyone faced similar problems, or does anyone have any clue about
the source of the problems we are experiencing and how to solve them?
Below I send a copy of our scripts:

Submission file:

Universe = parallel
Executable = run.sh
initialdir = xxx
machine_count = 2
output = $(NODE).out
error = $(NODE).error
log = parallel.log
+WantParallelSchedulingGroups = True
Requirements = (machine!="xxx")
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
Queue

run.sh:

#!/bin/sh

# This is for LAM operation on a *single* SMP machine.

# Where's my LAM installation?
LAMDIR=/etc/lam
#export PATH=$LAMDIR/bin:$PATH
export PATH=$PATH:$LAMDIR/bin
export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:/opt/intel/compilers/lib

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
_CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR

# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
               wait
               exit 0
fi


# the binary is copied but the executable flag is cleared,
# so the script has to take care of this


# to allow multiple LAM jobs to run on a single machine,
# we have to use a reasonably unique value
export LAM_MPI_SESSION_SUFFIX=$$

# when a job is killed by the user, this script will get SIGTERM.
# The script has to catch it and clean up the
# LAM environment
finalize()
{
lamhalt
exit
}
trap finalize TERM

# Each of my machines has 4 cores, so I can set this here. For a
# heterogeneous mix one would need to use a machine file on the
# execute host, e.g. put in $LAMDIR/etc

/bin/echo "$HOSTNAME cpu=2" > machines

# start the lam environment

lamboot machines

if [ $? -ne 0 ]
then
       echo "lamscript error booting lam"
       exit 1
fi

## run the actual mpijob; choose your rpi module



for i in 1 2 3 4 5
do
       mpirun C program_mpi -np 2 input_${i} output_${i} &

       # wait for this mpirun; re-wait if the wait itself is
       # interrupted by a signal (exit status > 128)
       CHILD=$!
       TMP=130
       while [ $TMP -gt 128 ] ; do
              wait $CHILD
              TMP=$?
       done
done

/bin/rm -f machines

# clean up lam
lamhalt

exit 0




Thanks in advance

Diana Lousa
PhD Student
Instituto de Tecnologia Química e Biológica (ITQB)
Universidade Nova de Lisboa
Avenida da República, EAN
Apartado 127
2780-901 Oeiras
Portugal

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/