Re: [Condor-users] STARTD-based memory limit
- Date: Tue, 07 Jun 2011 15:46:06 +0100
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] STARTD-based memory limit
On 6/6/11 6:40 PM, Matthew Farrellee wrote:
On 06/02/2011 10:15 AM, Steven Timm wrote:
In my cluster I have been using a schedd-based method of
killing jobs that are using too much memory.
[root@fcdf1x1 local]# condor_config_val SYSTEM_PERIODIC_REMOVE
(NumJobStarts > 10) || (ImageSize>=2500000) || (JobRunCount>=1 &&
JobStatus==1 && ImageSize>=1000000)
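
For reference, the same policy as it would be written in the schedd's
condor_config (a sketch with the value reflowed; note that ImageSize is
reported in KiB, so 2500000 is roughly 2.4 GB):

  SYSTEM_PERIODIC_REMOVE = (NumJobStarts > 10) || \
      (ImageSize >= 2500000) || \
      (JobRunCount >= 1 && JobStatus == 1 && ImageSize >= 1000000)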
But this has two weaknesses. One is that it can sometimes take the
shadow a long time to send the high memory value back to the schedd
so the schedd can act; in the meantime the job grows too fast, sucks
up all the RAM on the node, and starts killing other jobs.

The second is that I have a diverse pool of nodes, and I would like
jobs running on the nodes with bigger memory to use it if it is
there.

So is there a way to evict jobs for which (ImageSize*2 > Memory)? And
would you use the KILL or the PREEMPT expression?
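
Something like the following startd configuration is what I have in
mind (just a sketch, not tested: ImageSize is in KiB while the slot's
Memory attribute is in MiB, hence the /1024; MEMORY_EXCEEDED is only a
helper name chosen here, and it assumes PREEMPT and KILL already have
stock definitions to extend):

  # Helper: true when twice the job's image size exceeds slot memory.
  # ImageSize is in KiB, Memory in MiB, so convert before comparing.
  MEMORY_EXCEEDED = ( (2 * ImageSize / 1024) > Memory )
  # PREEMPT triggers a graceful eviction when the condition holds...
  PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
  # ...and KILL decides when to stop waiting and hard-kill the job.
  KILL = ($(KILL)) || ($(MEMORY_EXCEEDED))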
Often policy evaluation is delegated to the Shadow. Maybe it's a bug
that SYSTEM_PERIODIC_REMOVE is not.
SYSTEM_PERIODIC_REMOVE is evaluated by the shadow while the job is
running. Therefore, I would expect the delay in this policy seeing
updated memory usage to be related to the delay in sending updates from
the starter to the shadow, which is the same delay as sending updates
from the starter to the startd. This delay is controlled by
STARTER_UPDATE_INTERVAL (default 300 seconds).
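
If reaction time matters more than the extra update traffic, one knob
is to shorten that interval, e.g. (a sketch; 300 seconds is the stock
default):

  # Send starter resource-usage updates every 60s instead of 300s.
  STARTER_UPDATE_INTERVAL = 60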