
Re: [Condor-users] out-of-memory issues in parallel universe

Thanks for the pointers.

The point you make about virtual memory size is relevant. On quad core, 4 GB machines (so 1 GB per "slot"), I'm seeing jobs with virtual memory reported to be 1.5 GB, but resident size of 850 MB.

As such, enforcing limits based on the virtual memory value would have a significant impact on usability.

Too bad that the kernel OOM killer isn't smarter.


On Mar 19, 2008, at 1:44 PM, Dan Bradley wrote:

Robert E. Parrott wrote:

I'm also/instead looking for a solution to enforce memory limits at

It looks as if a USER_JOB_WRAPPER with a ulimit line is the solution
here. Does that jibe with what others have done?
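
For reference, a minimal wrapper of that shape might look like the following. The script path and the 1 GB cap are assumptions for illustration, not values from this thread; Condor invokes the wrapper with the job's own command line, so the limit set here is inherited by the exec'd job.

```shell
#!/bin/sh
# Hypothetical USER_JOB_WRAPPER: cap per-process virtual memory at
# 1 GB (1048576 kB -- an assumed value), then exec the real job so
# the limit carries over to it.
ulimit -v 1048576
exec "$@"
```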

That is one option.  Here are two others:

1. Have Condor preempt jobs from the machine when their virtual image
size exceeds some amount.  Example:

MEMORY_EXCEEDED = ( ImageSize > 1.5*Memory*1024 )
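
(For context: ImageSize is reported in KB and Memory in MB, hence the 1024 factor.) Wiring the macro into the startd's preemption policy might then look like this -- the PREEMPT line below is an assumption about the local configuration, not part of the suggestion above:

PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))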


2. Have Condor (on the submit side) put jobs on hold when their virtual image size exceeds some amount. It is a little more awkward to set the amount based on the size of the machine's memory in this case, but it is
possible.  Example:

# When a job matches, insert the machine memory into the
# job ClassAd so the periodic expression can refer to it.
MachineMemory = "$$(Memory)"

periodic_hold = ImageSize > 1.5*int(MATCH_EXP_MachineMemory)*1024
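
Put together in a submit description file this could read as follows. Everything except the two memory-related lines is an assumed skeleton; note that custom job attributes take a leading + in submit files:

# sketch of a submit description using the hold-on-memory idea
universe       = parallel
executable     = my_job          # assumed name
+MachineMemory = "$$(Memory)"
periodic_hold  = ImageSize > 1.5*int(MATCH_EXP_MachineMemory)*1024
queue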

Both of these techniques suffer from the shortcoming that they are based on the virtual memory size of the job, which may not be an accurate measure of the job's actual demand on physical memory.
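
The gap between the two numbers is easy to see on Linux by comparing VmSize (virtual) and VmRSS (resident) in /proc. A small illustrative sketch, not Condor code:

```python
# Compare a process's virtual size (VmSize) with its resident size
# (VmRSS) by parsing /proc/self/status -- Linux-only, illustrative.
def memory_kb(status_path="/proc/self/status"):
    sizes = {}
    with open(status_path) as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                sizes[key] = int(value.split()[0])  # values are in kB
    return sizes

sizes = memory_kb()
print("virtual: %d kB, resident: %d kB" % (sizes["VmSize"], sizes["VmRSS"]))
```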


Robert E. Parrott, Ph.D. (Phys. '06)
Associate Director, Grid and
       Supercomputing Platforms
Project Manager, CrimsonGrid Initiative
Harvard University Sch. of Eng. and App. Sci.
Maxwell-Dworkin  211,
33 Oxford St.
Cambridge, MA 02138