Re: [HTCondor-users] job memory requirements and free memory vs cached
- Date: Sun, 25 Oct 2015 15:48:12 +0000
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: Re: [HTCondor-users] job memory requirements and free memory vs cached
On 23/10/2015 22:42, Michael Paterson wrote:
> I'm trying to run 4 single-core jobs on a 4-CPU box with partitionable
> slots and ~7500m memory; the jobs have 1500m set as the memory
> requirement. Sometimes all 4 slots will start up on a machine, but
> other times only 3.
>
>   PID USER    PR NI  VIRT  RES SHR S  %CPU %MEM    TIME+ COMMAND
>  7804 slot03  30 10 1381m 787m 26m R 100.0 10.5  63:25.24 basf2
> 10279 slot02  30 10 1347m 793m 64m R 100.0 10.6  41:47.69 basf2
>  6322 slot01  30 10 1546m 891m 21m R  98.4 11.9 386:24.91 basf2
>
> # free -m
>              total   used   free  shared  buffers  cached
> Mem:          7514   6824    689       0       69     3320
> -/+ buffers/cache:   3434   4080
> Swap:        16383     10  16373
>
> Is the ~3G in cache preventing the 4th job from getting a slot?

No, the VFS cache has nothing to do with this. The OS will always evict
pages from the cache when it needs more RAM.
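To make that concrete, here is a small illustrative Python sketch (not HTCondor code) that reads the figures from the `free -m` output quoted above and shows that the cached pages are effectively available memory:

```python
# Illustrative only: parse the "free -m" output from the post and show
# that the page cache counts as reclaimable memory, not "used" memory.
free_output = """\
             total   used   free  shared  buffers  cached
Mem:          7514   6824    689       0       69     3320
-/+ buffers/cache:   3434   4080
Swap:        16383     10  16373"""

mem_line = next(l for l in free_output.splitlines() if l.startswith("Mem:"))
total, used, free, shared, buffers, cached = map(int, mem_line.split()[1:])

# Memory the kernel can hand out on demand: free pages plus the
# buffer/page cache, which is evicted under memory pressure.
reclaimable = free + buffers + cached
print(reclaimable)  # 4078 MB, close to the 4080 MB "-/+ buffers/cache" row
```

The small difference from the 4080 MB shown by `free` itself is just per-field rounding to whole megabytes.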
To work out what's happening, look at condor_status output: this will
show you all the allocated slots and also the top partitionable slot
which has all the remaining resources assigned to it. And when three
jobs are running but the fourth isn't, look at condor_q -better-analyze
<jobid>, where <jobid> is the ID of the job which isn't running.
I can't tell without seeing that output what's happening, but there are
lots of reasons why a job might not start.
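As a quick sanity check on the memory arithmetic alone (an illustrative sketch, not how the negotiator actually rounds or matches requests):

```python
# Illustrative arithmetic only: do four 1500 MB requests fit into a
# ~7514 MB partitionable slot? (HTCondor may round the detected memory
# and the requests, so treat this as an approximation.)
total_memory = 7514      # MB, as the machine reports it
request_memory = 1500    # MB per job, from the submit description

running = 3
remaining = total_memory - running * request_memory
print(remaining)                     # 3014 MB left in the partitionable slot
print(remaining >= request_memory)   # True: memory alone shouldn't block job 4
```

So if condor_status shows roughly these numbers, raw memory is unlikely to be the blocker, and the reason will be something else in the -better-analyze output.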
The one which has bitten me in the past is that condor thinks the
machine is in "owner" state (i.e. a human is sitting in front of the
machine doing real work) and therefore won't start any new jobs. This is
because condor has what appears to be a very rough way of tracking how
much of the load average is due to condor jobs and how much to
non-condor jobs; in my experience it can easily conclude that non-condor
jobs account for a load average of more than 0.3.
Because these machines were dedicated to HTCondor jobs, I fixed this
problem by setting
START = True
Recently after reading some more I think a better way may be to set
IS_OWNER = False
but I've not tested that. (The benefit is that you can still use the START
expression to decide whether to run particular jobs, based on the
attributes of the job, but you will never end up going into the 'owner'
state.)
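Putting the two settings together, a minimal configuration sketch for a node dedicated to batch work might look like this (please verify the exact semantics of IS_OWNER against the HTCondor manual for your version before relying on it):

```
# condor_config.local on an execute node dedicated to HTCondor jobs

# Never enter the Owner state, even when non-condor load is detected.
IS_OWNER = False

# START can still gate jobs on their attributes; here, accept everything.
START = True
```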