
Re: [Condor-users] enforcing job memory limits?



One limitation of this approach is the relatively long time scale on which
the policy is enforced. A job can still cross the Linux OOM-killer threshold
(or some other critical resource boundary) before the Condor startd can take
action.

A more robust solution, in my opinion, would be for Condor to make
calls to setrlimit(), or equivalent, to let the execute machine kernel
enforce resource limits much more rigorously.
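
To illustrate the kind of kernel-side enforcement I have in mind, here is a
minimal Python sketch. It is not Condor code: run_with_memory_limit is a
hypothetical helper name and the 512 MB cap in the usage note is arbitrary.
It sets RLIMIT_AS in the child between fork() and exec(), so the limit applies
to the job only. Note that on Linux RLIMIT_AS caps virtual address space
rather than resident memory, so a real limit would probably need some headroom.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(argv, limit_bytes):
    """Run argv as a child whose address space is capped by the kernel.

    Hypothetical helper for illustration only; not part of Condor.
    """

    def apply_limit():
        # Executed in the child after fork() and before exec(), so the
        # parent (standing in for the startd here) is left unrestricted.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit)


if __name__ == "__main__":
    # Assumed example: cap the job at 512 MB of address space.
    result = run_with_memory_limit(sys.argv[1:], 512 * 1024 * 1024)
    sys.exit(result.returncode)
```

With a limit like this in place, an allocation beyond the cap fails inside the
job itself (malloc() returns NULL, or the process dies) instead of pushing the
whole machine toward the OOM killer.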

Thanks.


On Fri, Nov 16, 2007 at 05:02:34PM -0600, Todd Tannenbaum wrote:
> Todd Tannenbaum wrote:
> > Jonathan D. Proulx wrote:
> >> Hi,
> >>
> >> it appears as though a condor job in my flock exhausted the memory on a
> >> user workstation over night ($CondorVersion: 6.8.2 Oct 12 2006 $
> >> $CondorPlatform: X86_64-LINUX_RHEL3 $).  This triggered Linux's OOM
> >> killer which killed several desktop apps and sshd on the system.
> >>
> >> this seems a pretty serious violation of the do no harm principle and
> >> I'm a bit surprised.
> >>
> >> is there a config setting I need to tweak on the workstations?
> >>
> > 
> > You could add a clause to your preempt expression, i.e. something like
> > 
> >    PREEMPT = ( whatever was there before ) || (ImageSize > (Memory-20))
> > 
> > This should work at least in the v6.9 series (and maybe in v6.8 as well? 
> > cannot recall offhand), since the condor_startd's value for "ImageSize" 
> > will be updated several times a minute to the total memory usage of the 
> > job.
> 
> Ouch, in my example above, I didn't deal with the fact that Memory is in 
> megs and ImageSize is in Kbytes.  So something like the following is 
> what I meant:
> 
> PREEMPT = (whatever was there before) || \
>            (ImageSize > ((Memory*0.8)*1024) )
> 
> regards,
> Todd
> 
> -- 
> Todd Tannenbaum                       University of Wisconsin-Madison
> Condor Project Research               Department of Computer Sciences
> tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                 Madison, WI 53706-1685

-- 
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson