
[HTCondor-users] cgroups + soft memory limit issues



Hi,

 

Condor 8.4.11 here.

 

I’m seeing jobs that are put on hold much too quickly.

 

In the startd logs, I see entries like these:

 

03/06/17 04:28:26 (pid:1158843) Limiting (soft) memory usage to 16777216000 bytes

03/06/17 04:28:26 (pid:1158843) Limiting (hard) memory usage to 9135382528 bytes

03/06/17 04:28:26 (pid:1158843) Unable to commit memory soft limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument

03/06/17 04:28:26 (pid:1158843) Limiting memsw usage to 9135386624 bytes

03/06/17 04:28:26 (pid:1158843) Unable to commit memsw limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument

03/06/17 04:28:26 (pid:1158843) Unable to commit CPU shares for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx: 50016 Invalid argument

03/06/17 05:14:44 (pid:1158843) Hold all jobs

03/06/17 05:14:44 (pid:1158843) Job was held due to OOM event: Job has gone over memory limit of 16000 megabytes.
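
To make the limits in the log excerpt above easier to compare, here is a minimal Python sketch that converts the three byte values to MiB. It assumes 1 MiB = 1048576 bytes; the helper name `bytes_to_mib` and the dictionary are only for illustration and are not part of HTCondor.

```python
MIB = 1024 * 1024  # assumption: binary mebibytes

def bytes_to_mib(n_bytes):
    """Convert a byte count to MiB (hypothetical helper, not an HTCondor API)."""
    return n_bytes / MIB

# Values copied verbatim from the startd log lines above.
limits = {
    "soft": 16777216000,
    "hard": 9135382528,
    "memsw": 9135386624,
}

for name, value in limits.items():
    print(f"{name:>5}: {bytes_to_mib(value):10.2f} MiB")

# The logged hard limit is below the logged soft limit.
print("hard < soft:", limits["hard"] < limits["soft"])
```

The soft limit works out to exactly the 16000 MB requested, while the hard and memsw limits are both well below it and differ from each other by 4096 bytes.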

 

In the condor history, I see:

 

MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )

LastHoldReasonSubCode = 0

RequestMemory = 16000

LastHoldReasonCode = 34

ResidentSetSize = 3000000

RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to being in hold state for 6 hours."

ResidentSetSize_RAW = 2966680

LastHoldReason = "Error from slot1@xxxxxxxxxxxxxxxxxxxxx: Job has gone over memory limit of 16000 megabytes."

Requirements = ( ( RequestCpus == 8 || RequestCpus == 1 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
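
As a sanity check on the attributes above, the following Python sketch evaluates the MemoryUsage expression for both RSS values and compares the result with RequestMemory. It assumes ResidentSetSize is expressed in KiB and that the division in the expression truncates like integer division; the function name `memory_usage_mb` is hypothetical.

```python
def memory_usage_mb(resident_set_size_kib):
    """Mirror MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 ).

    Assumption: integer operands, so the division truncates.
    """
    return (resident_set_size_kib + 1023) // 1024

request_memory_mb = 16000  # RequestMemory from the job ad

# RSS values copied from the job ad above (assumed to be in KiB).
samples = [("ResidentSetSize", 3000000), ("ResidentSetSize_RAW", 2966680)]

for label, rss_kib in samples:
    usage_mb = memory_usage_mb(rss_kib)
    share = 100 * usage_mb / request_memory_mb
    print(f"{label}: {usage_mb} MB, {share:.1f}% of RequestMemory")
```

Under these assumptions the reported usage is under 3000 MB in both cases, well below the 16000 MB request.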

 

What I fail to understand is the following:

- 3,000,000 KB of RSS is far less than 16,000 MB of memory, so why was the job put on hold?

- Why is the job assigned a soft memory limit that is higher than the hard limit? In that case, only the hard limit can take effect, can’t it?

- What do the “Invalid argument” cgroup errors mean? I presume I’ll have to dig into them to find the real issue behind these errors.

 

I have checked that the kernel tuning options are enabled.

 

One more question: I understood from the various Condor docs and wikis that cgroups should allow swap to be used when a job exceeds its memory limit, but I don’t see any swap being allocated. Can this behavior be controlled, and if so, how?

 

Any hints?

 

Frederic