Condor 8.4.11 here.
I’m seeing jobs being put on hold much too quickly.
In the startd logs, I see this kind of thing:
03/06/17 04:28:26 (pid:1158843) Limiting (soft) memory usage to 16777216000 bytes
03/06/17 04:28:26 (pid:1158843) Limiting (hard) memory usage to 9135382528 bytes
03/06/17 04:28:26 (pid:1158843) Unable to commit memory soft limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument
03/06/17 04:28:26 (pid:1158843) Limiting memsw usage to 9135386624 bytes
03/06/17 04:28:26 (pid:1158843) Unable to commit memsw limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument
03/06/17 04:28:26 (pid:1158843) Unable to commit CPU shares for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx: 50016 Invalid argument
03/06/17 05:14:44 (pid:1158843) Hold all jobs
03/06/17 05:14:44 (pid:1158843) Job was held due to OOM event: Job has gone over memory limit of 16000 megabytes.
In the condor history, I see:
MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
LastHoldReasonSubCode = 0
RequestMemory = 16000
LastHoldReasonCode = 34
ResidentSetSize = 3000000
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to being in hold state for 6 hours."
ResidentSetSize_RAW = 2966680
LastHoldReason = "Error from slot1@xxxxxxxxxxxxxxxxxxxxx: Job has gone over memory limit of 16000 megabytes."
Requirements = ( ( RequestCpus == 8 || RequestCpus == 1 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
What I fail to understand is the following:
- 3,000,000 KB of RSS (about 2,930 MB) is well below the 16,000 MB memory limit, so why was the job put on hold?
- Why is the job assigned a soft memory limit (16777216000 bytes) that is higher than the hard limit (9135382528 bytes)? In that case, only the hard limit can ever take effect, can’t it?
- What do the “Invalid argument” cgroup errors mean? I presume I will have to dig into them to fix the real issue behind these errors.
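To make the numbers above concrete, here is a minimal Python sketch of the unit arithmetic. It assumes ResidentSetSize is in KiB and RequestMemory is in MiB (which is what the MemoryUsage expression suggests); the variable and function names are only illustrative.

```python
# Values copied from the startd log and the condor history above.
MIB = 1024 * 1024

request_memory_mib = 16000        # RequestMemory (assumed MiB)
resident_set_size_kib = 3000000   # ResidentSetSize (assumed KiB)
soft_limit_bytes = 16777216000    # "Limiting (soft) memory usage"
hard_limit_bytes = 9135382528     # "Limiting (hard) memory usage"
memsw_limit_bytes = 9135386624    # "Limiting memsw usage"


def memory_usage_mib(rss_kib):
    """Same integer arithmetic as the MemoryUsage expression in the job ad."""
    return (rss_kib + 1023) // 1024


if __name__ == "__main__":
    print("MemoryUsage (MiB):", memory_usage_mib(resident_set_size_kib))
    print("soft limit == RequestMemory:",
          soft_limit_bytes == request_memory_mib * MIB)
    print("hard limit (MiB):", hard_limit_bytes / MIB)
    print("memsw - hard (bytes):", memsw_limit_bytes - hard_limit_bytes)
    print("hard < soft:", hard_limit_bytes < soft_limit_bytes)
```

If those unit assumptions hold, the soft limit is exactly RequestMemory, the hard limit is only about 8712 MiB, and the memsw limit is the hard limit plus 4096 bytes.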
I have checked that the kernel tuning options are enabled.
One more question: I understood from the various Condor docs and wikis that cgroups should let a job spill into swap when it exceeds its memory limit, but I don’t see any swap being allocated. Can this behavior be controlled, and if so, how?
Any hints?
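In case it helps with diagnosing this, here is a small Python sketch that dumps what the kernel actually holds for a slot's memory cgroup. It assumes cgroup v1 and that the slot's cgroup directory is passed on the command line (typically somewhere under /sys/fs/cgroup/memory, but the mount point is an assumption and varies by distribution). The file names are the standard cgroup v1 memory controller ones; read_memory_cgroup is just an illustrative helper. The memsw files only exist when swap accounting is enabled in the kernel, so they are reported as None when missing.

```python
import sys
from pathlib import Path

# Standard cgroup v1 memory controller files.
MEMORY_FILES = (
    "memory.limit_in_bytes",
    "memory.soft_limit_in_bytes",
    "memory.memsw.limit_in_bytes",
    "memory.max_usage_in_bytes",
    "memory.memsw.max_usage_in_bytes",
    "memory.failcnt",
    "memory.swappiness",
)


def read_memory_cgroup(cgroup_dir):
    """Return {file name: integer value, or None if the file is missing}."""
    values = {}
    for name in MEMORY_FILES:
        path = Path(cgroup_dir) / name
        try:
            values[name] = int(path.read_text().strip())
        except FileNotFoundError:
            # File (or the whole directory) is not there, e.g. no swap accounting.
            values[name] = None
    return values


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to the slot's memory cgroup directory>
    for name, value in read_memory_cgroup(sys.argv[1]).items():
        print(f"{name:35s} {value}")
```

Comparing memory.limit_in_bytes and memory.memsw.limit_in_bytes with the values in the startd log would show which of the limits were really committed.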