
Re: [Condor-users] parallel jobs getting stuck



Hello again,

First, I want to thank you for your reply. Unfortunately, our problem
remains unsolved, although we have followed your advice:

On Wed, 2007-08-08 at 15:21, Dan Bradley wrote:
> Diana,
> 
> I am not an expert in the parallel universe.  The developer who is 
> happens to be very busy this week at a workshop.
> 
> A couple things to try:
> 
> Look in the ShadowLog and SchedLog on your submit machine.  Are there 
> any error messages when the job completes?

We had this kind of error message in the ShadowLog:


8/7 20:43:36 (362.0) (3553): FileLock::obtain(1) failed - errno 37 (No
locks available)
8/7 20:43:36 (362.0) (3553): Job 362.0 terminated: exited with status 0

It seems to us that there was some problem obtaining file locks. Can anyone
shed some light on the meaning of this error message?
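
Could this be related to the user log (parallel.log) living on an NFS mount?
As far as we understand, errno 37 (ENOLCK) means the kernel could not obtain
a file lock, which on NFS usually points to the lock daemons (lockd/statd).
Below is a rough sketch of the checks we intend to run; the hostname
"nfs-server" and the path are placeholders, not our real names:

# Sketch only: checks for the "No locks available" (ENOLCK) error
# Is the directory holding the job's user log on an NFS mount?
df -T /path/to/submit/dir              # look for "nfs" in the Type column

# Are the NFS locking services registered on the client and the server?
rpcinfo -p localhost  | grep -E 'nlockmgr|status'
rpcinfo -p nfs-server | grep -E 'nlockmgr|status'

# A possible workaround would be to keep the job log on a local disk,
# e.g. in the submit description file:
#   log = /var/tmp/parallel.log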

> 
> If you want to get rid of a job that is stuck in X state after being 
> removed, you can use the -forcex option to condor_rm.

We had already tried the -forcex option, but then condor_q, like the other
Condor commands on the dedicated scheduler, stopped working. We could only
recover by rebooting the dedicated scheduler. Any idea why this happened?
We also tried kill -9 on the condor_shadow processes corresponding to the
idle jobs; the result was that those processes became zombies.
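
For completeness, here is roughly what we do to clean things up (the job id
362.0 is just an example taken from the ShadowLog excerpt above; JobStatus 3
corresponds to the removed/X state):

# List jobs stuck in the removed (X) state on the dedicated scheduler
condor_q -constraint 'JobStatus == 3'

# Force the schedd to forget a job that refuses to leave the queue
condor_rm -forcex 362.0

# See which condor_shadow processes are still hanging around
ps -ef | grep condor_shadow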

Thanks again

Diana Lousa
PhD Student
Instituto de Tecnologia Química e Biológica (ITQB)
Universidade Nova de Lisboa
Avenida da República, EAN
Apartado 127
2780-901 Oeiras
Portugal

> 
> --Dan
> 
> Diana Lousa wrote:
> 
> >Hello,
> >
> >We have been experiencing some problems when submitting parallel jobs
> >to Condor's parallel universe. The main problem is that our jobs often
> >get stuck. The jobs terminate (all the expected output files are
> >properly written to the correct directory, and the top command on the
> >machine where the job was running shows that it has finished). However,
> >condor_q reports that the job is still running long after (two days
> >after) its effective termination. Moreover, using condor_rm to kill
> >those jobs is ineffective, since they do not die (they appear as X in
> >condor_q), and our central manager (which is also our dedicated
> >scheduler) gets stuck and does not answer condor_q and condor_submit
> >commands.
> >To solve the problem we tried rebooting our central manager/dedicated
> >scheduler, but even this brute-force approach did not help. The system
> >came up and all the processors appeared as unclaimed, but the condor_q
> >and condor_rm commands still did not work.
> >Has anyone faced similar problems, or does anyone have any clue about
> >the source of the problems we are experiencing and how to solve them?
> >Our submit file and job script are below:
> >
> >Submission file:
> >
> >Universe = parallel
> >Executable = run.sh
> >initialdir = xxx
> >machine_count = 2
> >output = $(NODE).out
> >error = $(NODE).error
> >log = parallel.log
> >+WantParallelSchedulingGroups = True
> >Requirements = (machine!="xxx")
> >Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
> >Queue
> >
> >run.sh:
> >
> >#!/bin/sh
> >
> ># This is for LAM operation on a *single* SMP machine.
> >
> ># Where's my LAM installation?
> >LAMDIR=/etc/lam
> >#export PATH=$LAMDIR/bin:$PATH
> >export PATH=$PATH:$LAMDIR/bin
> >export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:/opt/intel/compilers/lib
> >
> ># Condor sets these environment variables for parallel universe jobs
> >_CONDOR_PROCNO=$_CONDOR_PROCNO
> >_CONDOR_NPROCS=$_CONDOR_NPROCS
> >_CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR
> >
> ># If this is not the head node, just wait for any child processes
> ># (e.g. sshds in a multi-machine setup) and then exit
> >if [ $_CONDOR_PROCNO -ne 0 ]
> >then
> >        wait
> >        exit 0
> >fi
> >
> >
> ># the binary is copied but the executable flag is cleared,
> ># so the script has to take care of this
> >
> >
> ># to allow multiple LAM jobs to run on a single machine,
> ># we have to give each one a somewhat unique session suffix
> >export LAM_MPI_SESSION_SUFFIX=$$
> >
> ># when a job is killed by the user, this script will get a SIGTERM.
> ># The script has to catch it and clean up the LAM environment
> >finalize()
> >{
> >lamhalt
> >exit
> >}
> >trap finalize TERM
> >
> ># Each of my machines has 4 cores, so I can set this here. For a
> ># heterogeneous mix one would need to use a machine file on the
> ># execute host, e.g. put it in $LAMDIR/etc
> >
> >/bin/echo "$HOSTNAME cpu=2" > machines
> >
> ># start the lam environment
> >
> >lamboot machines
> >
> >if [ $? -ne 0 ]
> >then
> >        echo "lamscript error booting lam"
> >        exit 1
> >fi
> >
> >## run the actual MPI jobs; choose your RPI module.
> ># Note: the loop list must be space-separated, and an -np option would
> ># have to go before the program name (here the "C" specifier together
> ># with the machines file already gives one process per CPU).
> >
> >for i in 1 2 3 4 5
> >do
> >        # run each MPI job in the background so that the wait below can
> >        # be retried if it returns because of a trapped signal
> >        mpirun C program_mpi input_${i} output_${i} &
> >        CHILD=$!
> >        TMP=130
> >        while [ $TMP -gt 128 ] ; do
> >                wait $CHILD
> >                TMP=$?
> >        done
> >done
> >
> >/bin/rm -f machines
> >
> ># clean up lam
> >lamhalt
> >
> >exit 0
> >
> >
> >
> >
> >Thanks in advance
> >
> >Diana Lousa
> >PhD Student
> >Instituto de Tecnologia Química e Biológica (ITQB)
> >Universidade Nova de Lisboa
> >Avenida da República, EAN
> >Apartado 127
> >2780-901 Oeiras
> >Portugal
> >