
[Condor-users] job wrappers and parallel jobs problem



Hi Folks,

I'm having an issue using job wrappers with parallel jobs, with Condor
v7.0.1 on RHEL 4 x86_64.

To begin, I've seen a number of user jobs overcommit memory on compute nodes, which has had a significant negative impact on stability. In particular, Condor's parallel universe does not recover well from the reboot of a compute node that has a
parallel job running on it.

To avoid this, I've tried adding a job wrapper (at the end of this email) that limits memory use to 1/4 of RAM (these are 4-processor compute nodes). This
works fine except, it seems, for parallel jobs.

Parallel jobs that use a condor-mpirun startup script have trouble starting up;
the error I see is:

014 (120586.000.000) 04/14 07:42:32 Node 0 executing on host: <10.101.12.197:32833>
   ...
022 (120586.000.000) 04/14 07:42:32 007 (120586.000.000) 04/14 07:42:32 Shadow exception! JobDisconnectedEvent::writeEvent() called without startd_addr
           0  -  Run Bytes Sent By Job
           14264  -  Run Bytes Received By Job

Any insight into this issue would be very greatly appreciated.

I've also tried putting something very much like the wrapper script (i.e. a script to set shell limits) into /etc/profile.d on each compute node, but the limits don't seem to be in place for vanilla universe jobs, despite the script. I'm not sure why. What shells are used to start jobs, and for which universes?
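
One way to check this from the job's side is to submit a small diagnostic script as the job and look at its output. The sketch below is only an illustration: the function name report_limits is made up here, and it assumes bash is available on the compute nodes. It rests on the usual behaviour that scripts in /etc/profile.d are sourced from /etc/profile, which bash reads for login shells; if jobs are started without a login shell, those limits would not be applied.

```shell
#!/bin/bash
# Hypothetical diagnostic job: report how this shell was started and
# which resource limits are in effect for the job.

report_limits() {
    # bash sets the read-only "login_shell" option when it was started
    # as a login shell (the case in which /etc/profile is read).
    if shopt -q login_shell; then
        echo "login_shell=yes"
    else
        echo "login_shell=no"
    fi
    # Current soft limits as seen by this process.
    echo "virtual_memory_kb=$(ulimit -v)"
    echo "data_seg_kb=$(ulimit -d)"
    echo "stack_kb=$(ulimit -s)"
    echo "cpu_seconds=$(ulimit -t)"
}

report_limits
```

Comparing the output from a vanilla universe job with the output from an interactive login on the same node should show whether the profile.d limits ever reach the job.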


Thanks so much for some insight into this issue.

rob



------------------------------------------------------------------------

#!/bin/bash

#
# determine memory size to use
# This computes the total memory divided by the number of processors,
#  plus a fudge factor of 100 MB

# total physical memory in kB, as reported by free
PHYS_MEM=`/usr/bin/free | /bin/grep Mem | /bin/awk '{ print $2 }'`
# number of processors on this node
NUM_PROCS=`grep -c "^processor" /proc/cpuinfo`
# int() keeps awk from printing values like 2.04363e+06, which ulimit rejects
VMEM_PER_SLICE=`echo $PHYS_MEM $NUM_PROCS | /bin/awk '{ print int(100000 + $1/$2) }'`
DMEM_PER_SLICE=`echo $PHYS_MEM $NUM_PROCS | /bin/awk '{ print int($1/$2) }'`

MEM_PER_SLICE_MB=`echo $VMEM_PER_SLICE | /bin/awk '{ print int($1/1024) }'`

#
# Set process limits
#

# 1 week CPU time per process
ulimit -t  604800

# limit virtual memory used
ulimit -v $VMEM_PER_SLICE

# limit data segment and resident set size to memory / number of processors
ulimit -d $DMEM_PER_SLICE
ulimit -m $DMEM_PER_SLICE

# limit stack size to 50 MB
ulimit -s  50000

# run the command
exec "$@"
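
The per-slice arithmetic in the wrapper can also be done with shell integer arithmetic, which sidesteps awk's default %.6g formatting of non-integer results (a value such as 2.04363e+06 is not accepted by ulimit). This is a minimal sketch, not the wrapper itself: the function name per_slice_kb and the sample numbers are invented for illustration.

```shell
#!/bin/bash
# per_slice_kb TOTAL_KB NUM_PROCS [FUDGE_KB]
# Hypothetical helper: integer kB per processor, plus an optional fudge factor.
per_slice_kb() {
    local total_kb=$1 procs=$2 fudge_kb=${3:-0}
    # Guard against an empty or zero processor count.
    if [ -z "$procs" ] || [ "$procs" -le 0 ]; then
        echo "per_slice_kb: bad processor count" >&2
        return 1
    fi
    # $(( )) truncates toward zero, so the result is always an integer.
    echo $(( total_kb / procs + fudge_kb ))
}

# Made-up example: 8174509 kB of RAM, 4 processors, 100000 kB fudge factor.
per_slice_kb 8174509 4 100000   # prints 2143627
```

In the wrapper, the result could then be passed directly to ulimit -v or ulimit -d.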