
Re: [HTCondor-users] job memory requirements and free memory vs cached

On 23/10/2015 22:42, Michael Paterson wrote:

I'm trying to run 4 single-core jobs on a 4-CPU box with partitionable slots and ~7500m of memory; the jobs have their memory requirement set to 1500m.

Sometimes all 4 slots will start up on a machine, but at other times only 3 do.

 7804 slot03    30  10 1381m 787m  26m R 100.0 10.5  63:25.24 basf2
10279 slot02    30  10 1347m 793m  64m R 100.0 10.6  41:47.69 basf2
 6322 slot01    30  10 1546m 891m  21m R 98.4 11.9 386:24.91 basf2

# free -m
                 total       used       free     shared    buffers     cached
Mem:              7514       6824        689          0         69       3320
-/+ buffers/cache:          3434       4080
Swap:            16383         10      16373

Is the ~3G in cache preventing the 4th job from getting a slot?
No, the VFS cache has nothing to do with this. The OS will evict pages from the cache whenever it needs more RAM.
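As a quick sanity check on the numbers above (bearing in mind that the memory HTCondor detects may differ slightly from what free reports), the four requests fit comfortably in RAM, so raw memory isn't the blocker:

```shell
# Four jobs, each requesting 1500 MB, against the ~7514 MB total
# reported by `free -m` above.
requested=$(( 4 * 1500 ))
echo "$requested"   # 6000 -- well under 7514
```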

To work out what's happening, look at the condor_status output: this will show you all the allocated slots, plus the top-level partitionable slot, which holds all the remaining unallocated resources. And when three jobs are running but the fourth isn't, look at condor_q -better-analyze <job_id>, where <job_id> is the ID (cluster.proc) of the job which isn't running.
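Concretely, those two diagnostic steps look something like this (123.0 stands in for the real job ID of the idle job):

```shell
condor_status                      # all slots; the partitionable parent shows leftover resources
condor_q -better-analyze 123.0     # explains why this job isn't matching any slot
```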

I can't tell without seeing that output what's happening, but there are lots of reasons why a job might not start.

The one which has bitten me in the past is condor deciding the machine is in the "owner" state (i.e. a human is sitting in front of the machine doing real work) and therefore refusing to start new jobs. Condor's way of splitting the load average into condor-generated and non-condor load appears to be quite rough; in my experience it can easily conclude that non-condor processes account for a load average of more than 0.3, which is enough to trip the default policy.
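For reference, the relevant knobs in the classic example policy shipped with HTCondor look roughly like this (a sketch; exact defaults vary by version):

```
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad   = 0.3
CPUIdle          = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
```

If the load attributed to non-condor processes exceeds BackgroundLoad, the machine stops looking idle and the default policy won't start new jobs.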

Because these machines were dedicated to HTCondor jobs, I fixed this problem by setting

START = True

in /etc/condor/condor_config.local.

Recently, after reading some more, I think a better way may be to set

IS_OWNER = False

but I've not tested that. (The benefit is that you can still use the START expression to decide whether to run particular jobs, based on the attributes of the job, without ever ending up in the 'owner' state.)
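A sketch of that combination in /etc/condor/condor_config.local (the 2000 MB threshold is purely illustrative):

```
# Never drop into the Owner state...
IS_OWNER = False
# ...but still be selective about which jobs to accept,
# e.g. only jobs requesting up to 2000 MB of memory.
START = (TARGET.RequestMemory <= 2000)
```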