
Re: [HTCondor-users] job memory requirements and free memory vs cached

On 23/10/2015 22:42, Michael Paterson wrote:

I'm trying to run 4 single-core jobs on a 4-CPU box with partitionable slots and ~7500m of memory; the jobs have their memory requirement set to 1500m.

Sometimes all 4 slots will start up on a machine, but at other times only 3 do.

 7804 slot03    30  10 1381m 787m  26m R 100.0 10.5  63:25.24 basf2
10279 slot02    30  10 1347m 793m  64m R 100.0 10.6  41:47.69 basf2
 6322 slot01    30  10 1546m 891m  21m R 98.4 11.9 386:24.91 basf2

# free -m
                 total       used       free     shared    buffers     cached
Mem:              7514       6824        689          0         69       3320
-/+ buffers/cache:          3434       4080
Swap:            16383         10      16373

Is the ~3G in cache preventing the 4th job from getting a slot?
No, the VFS cache has nothing to do with this. The OS will evict pages from the cache whenever it needs more RAM.
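As a quick sanity check on the numbers above (bearing in mind that the memory HTCondor detects may differ slightly from what free reports), the four requests fit comfortably in RAM, so raw memory isn't the blocker:

```shell
# Four jobs, each requesting 1500 MB, against the ~7514 MB total
# reported by `free -m` above.
requested=$(( 4 * 1500 ))
echo "$requested"   # 6000 -- well under 7514
```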

To work out what's happening, look at the condor_status output: this will show you all the allocated slots, plus the top-level partitionable slot, which holds all the remaining unallocated resources. And when three jobs are running but the fourth isn't, look at condor_q -better-analyze <job_id>, where <job_id> is the ID (cluster.proc) of the job which isn't running.
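Concretely, those two diagnostic steps look something like this (123.0 stands in for the real job ID of the idle job):

```shell
condor_status                      # all slots; the partitionable parent shows leftover resources
condor_q -better-analyze 123.0     # explains why this job isn't matching any slot
```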

I can't tell without seeing that output what's happening, but there are lots of reasons why a job might not start.

The one which has bitten me in the past is condor deciding the machine is in the "owner" state (i.e. a human is sitting in front of the machine doing real work) and therefore refusing to start new jobs. Condor's way of splitting the load average into condor-generated and non-condor load appears to be quite rough; in my experience it can easily conclude that non-condor processes account for a load average of more than 0.3, which is enough to trip the default policy.
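For reference, the relevant knobs in the classic example policy shipped with HTCondor look roughly like this (a sketch; exact defaults vary by version):

```
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad   = 0.3
CPUIdle          = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
```

If the load attributed to non-condor processes exceeds BackgroundLoad, the machine stops looking idle and the default policy won't start new jobs.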

Because these machines were dedicated to HTCondor jobs, I fixed this problem by setting

START = True

in /etc/condor/condor_config.local.

Recently, after reading some more, I think a better way may be to set

IS_OWNER = False

but I've not tested that. (The benefit is that you can still use the START expression to decide whether to run particular jobs, based on the attributes of the job, without ever ending up in the 'owner' state.)
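A sketch of that combination in /etc/condor/condor_config.local (the 2000 MB threshold is purely illustrative):

```
# Never drop into the Owner state...
IS_OWNER = False
# ...but still be selective about which jobs to accept,
# e.g. only jobs requesting up to 2000 MB of memory.
START = (TARGET.RequestMemory <= 2000)
```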