
Re: [Condor-users] job wrappers and parallel jobs problem



Can anyone comment on this question?

thanks
rob

On Apr 14, 2008, at 10:57 AM, Robert E. Parrott wrote:

Hi Folks,

I'm having an issue using job wrappers with parallel jobs, on Condor
v7.0.1 on RHEL 4 x86_64.

To begin, I've seen a number of user jobs overcommit memory on compute nodes,
which has had a significant negative impact on stability. In particular,
Condor's parallel universe does not recover well from a reboot of a compute
node that has a parallel job running on it.

To avoid this, I've tried adding a job wrapper (at the end of this email)
that limits each job's memory use to 1/4 of RAM (these are 4-processor
compute nodes). This works fine, except, it seems, for parallel jobs.

Parallel jobs, with a condor-mpirun startup script, have issues starting up;
the error I see is:

   014 (120586.000.000) 04/14 07:42:32 Node 0 executing on host: <10.101.12.197:32833>
   ...
   022 (120586.000.000) 04/14 07:42:32 007 (120586.000.000) 04/14 07:42:32 Shadow exception!
           JobDisconnectedEvent::writeEvent() called without startd_addr
           0  -  Run Bytes Sent By Job
           14264  -  Run Bytes Received By Job

Any insight into this issue would be very greatly appreciated.

I've also tried putting something very much like the wrapper script
(i.e. a script to set shell limits) into /etc/profile.d on each compute
node, but the limits don't seem to be in place for vanilla universe jobs,
despite the script. I'm not sure why. What shells are used to start jobs,
and for which universes?
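
One way to check which limits a job really runs under is to submit a small
probe script as the job's executable and read its output. This is only a
minimal sketch; the function name show_limits and the output labels are made
up for illustration and are not part of Condor:

```shell
#!/bin/bash
# Hypothetical probe job: print the resource limits this process inherited.
# Each value is either a number or the word "unlimited".
show_limits() {
    echo "cpu_seconds=$(ulimit -t)"
    echo "virtual_kb=$(ulimit -v)"
    echo "data_kb=$(ulimit -d)"
    echo "stack_kb=$(ulimit -s)"
}

show_limits
```

Comparing the output with and without the limit-setting script in place
should show whether that script was ever read when the job started.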


Thanks so much for some insight into this issue.

rob



------------------------------------------------------------------------

#!/bin/bash

#
# determine memory size to use
# This computes the total memory divided by the number of processors,
#  plus a fudge factor of 100 MB

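# Total physical memory in kB (the "Mem:" row of free) and processor count
PHYS_MEM=$(/usr/bin/free | /bin/grep Mem | /bin/awk '{ print $2 }')
NUM_PROCS=$(grep -c "^processor" /proc/cpuinfo)

# Per-processor limits in kB; int() keeps the values integral for ulimit
VMEM_PER_SLICE=$(echo "$PHYS_MEM" "$NUM_PROCS" | /bin/awk '{ print int(100000 + $1/$2) }')
DMEM_PER_SLICE=$(echo "$PHYS_MEM" "$NUM_PROCS" | /bin/awk '{ print int($1/$2) }')

# Informational only; not used below
MEM_PER_SLICE_MB=$(echo "$VMEM_PER_SLICE" | /bin/awk '{ print int($1/1024) }')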

#
# Set process limits
#

# 1 week CPU time per process
ulimit -t  604800

# limit virtual memory used
ulimit -v $VMEM_PER_SLICE

# limit data segment and resident set size to the per-processor share
# of memory (1/4 on these 4-processor nodes)
ulimit -d $DMEM_PER_SLICE
ulimit -m $DMEM_PER_SLICE

# limit stack size to 50 MB
ulimit -s  50000

# run the command
exec "$@"
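
The per-processor arithmetic used by the wrapper can be checked on its own,
without changing any real limits. The sketch below is illustrative only: the
helper name per_slice_kb is hypothetical, and the 8 GB / 4 processor figures
are assumed example values, not measurements from these nodes.

```shell
#!/bin/bash
# Hypothetical helper (not part of the wrapper): compute a per-processor
# limit in kB from total memory (kB), processor count, and an optional
# fudge value (kB, default 0).
per_slice_kb() {
    local phys_kb=$1 procs=$2 fudge_kb=${3:-0}
    # Shell integer arithmetic, so the result is always a whole number
    echo $(( phys_kb / procs + fudge_kb ))
}

# Assumed example: 8388608 kB of RAM shared by 4 processors
per_slice_kb 8388608 4 100000   # virtual memory limit, with the fudge factor
per_slice_kb 8388608 4          # data segment limit, no fudge factor
```

Integer arithmetic is used here because ulimit expects whole numbers.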