[HTCondor-users] cgroups + soft memory limit issues

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Hi,

Condor 8.4.11 here.

I’m seeing jobs being put on hold much too early.

In the startd logs, I see this kind of thing:

03/06/17 04:28:26 (pid:1158843) Limiting (soft) memory usage to 16777216000 bytes

03/06/17 04:28:26 (pid:1158843) Limiting (hard) memory usage to 9135382528 bytes

03/06/17 04:28:26 (pid:1158843) Unable to commit memory soft limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument

03/06/17 04:28:26 (pid:1158843) Limiting memsw usage to 9135386624 bytes

03/06/17 04:28:26 (pid:1158843) Unable to commit memsw limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument

03/06/17 04:28:26 (pid:1158843) Unable to commit CPU shares for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx: 50016 Invalid argument

03/06/17 05:14:44 (pid:1158843) Hold all jobs

03/06/17 05:14:44 (pid:1158843) Job was held due to OOM event: Job has gone over memory limit of 16000 megabytes.

In the condor history, I see:

MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )

LastHoldReasonSubCode = 0

RequestMemory = 16000

LastHoldReasonCode = 34

ResidentSetSize = 3000000

RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to being in hold state for 6 hours."

ResidentSetSize_RAW = 2966680

LastHoldReason = "Error from slot1@xxxxxxxxxxxxxxxxxxxxx: Job has gone over memory limit of 16000 megabytes."

Requirements = ( ( RequestCpus == 8 || RequestCpus == 1 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )

What I’m failing to understand is the following:

- 3 000 000 KB of RSS is well below 16 000 MB of memory, so why was the job put on hold?

- Why is the job assigned a soft memory limit that’s higher than the hard limit? In that case, only the hard limit can ever take effect, can’t it?

- What do the “Invalid argument” cgroup errors mean? I presume I’ll have to dig into these to find the real issue behind them.
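To make the mismatch in the first two questions concrete, here is a small Python sketch that puts the numbers from the log and the job ad into the same unit. It assumes ResidentSetSize is reported in KiB, RequestMemory in MiB, and that the MemoryUsage expression uses integer division; the variable names are only for illustration.

```python
# Values copied from the startd log lines and the job ad above.
MIB = 1024 * 1024

soft_limit_bytes = 16777216000   # "Limiting (soft) memory usage"
hard_limit_bytes = 9135382528    # "Limiting (hard) memory usage"
memsw_limit_bytes = 9135386624   # "Limiting memsw usage"
resident_set_size_kib = 3000000  # ResidentSetSize (assumed to be KiB)
request_memory_mib = 16000       # RequestMemory (assumed to be MiB)

# MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 ), assuming integer division
memory_usage_mib = (resident_set_size_kib + 1023) // 1024

soft_limit_mib = soft_limit_bytes / MIB
hard_limit_mib = hard_limit_bytes / MIB

print(f"soft limit   : {soft_limit_mib:.2f} MiB")
print(f"hard limit   : {hard_limit_mib:.2f} MiB")
print(f"memsw - hard : {memsw_limit_bytes - hard_limit_bytes} bytes")
print(f"MemoryUsage  : {memory_usage_mib} MiB (RequestMemory = {request_memory_mib})")
```

Under those assumptions the soft limit equals RequestMemory (16000 MiB), the hard limit is roughly 8712 MiB, and the reported usage is about 2930 MiB, which is below both.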

I have checked that the kernel tuning options are on.

One more question: I understood from the various condor docs and wikis that cgroups should let jobs use swap when they exceed their memory limit, but I don’t see any swap being allocated. Can this behavior be controlled, and if so, how?

Any hints?
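One way to see what the kernel actually applied to a slot, including the swap-related settings, is to read the cgroup v1 memory controller files directly. The sketch below is only an illustration: the file names are the standard cgroup v1 ones, but the helper name `read_memory_settings` is made up here, and the directory of the slot cgroup depends on where the memory controller is mounted on the node.

```python
from pathlib import Path

# Standard cgroup v1 memory controller files. The memsw file may be absent
# if swap accounting is not enabled in the kernel.
MEMORY_FILES = (
    "memory.limit_in_bytes",        # hard limit
    "memory.soft_limit_in_bytes",   # soft limit
    "memory.memsw.limit_in_bytes",  # memory + swap limit
    "memory.swappiness",
)


def read_memory_settings(cgroup_dir):
    """Return {file name: int value, or None if missing or unreadable}.

    Hypothetical helper: cgroup_dir is the directory of one slot's cgroup
    under the memory controller mount point.
    """
    settings = {}
    for name in MEMORY_FILES:
        try:
            settings[name] = int((Path(cgroup_dir) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings
```

Calling `read_memory_settings()` on the slot's cgroup directory while a job is running would show whether the soft and memsw limits from the log were really rejected, and a `None` for the memsw entry would suggest that swap accounting is not active.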

Frederic
