
Re: [HTCondor-users] [HTCondor-Users] Cgroup memory hard limit



I can easily reproduce this behavior on RHEL 6.10.

If I copy a file larger than the per-core memory limit into /dev/shm, the job goes into held status.
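The test looks roughly like this minimal sketch (the file name and size are placeholders, not my exact test). As I understand it, tmpfs pages are charged to the memory cgroup of the process that writes them, so once the file outgrows the limit the OOM handling fires and the job is held:

#include <cstdio>
#include <vector>

int main()
{
    std::vector<char> chunk(1024 * 1024, 'x');            // 1 MB buffer
    std::FILE *f = std::fopen("/dev/shm/bigfile", "wb");  // hypothetical path
    if (f == NULL)
        return 1;
    // Write ~12 GB, comfortably past the per-core memory limit.
    for (int mb = 0; mb < 12 * 1024; mb++)
        std::fwrite(chunk.data(), 1, chunk.size(), f);
    std::fclose(f);
    return 0;
}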

If I try to read a file larger than the request memory, the job goes into status 4 (completed) instead of held status. I can see the error in the stderr log file; it should have gone into removed status, and I am not sure why it is marked as completed.
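The reader is along these lines (a sketch of what I mean, assuming the file is slurped into one heap buffer rather than streamed). If the allocation or copy fails, the process exits nonzero on its own, which HTCondor then records as a completed job:

#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 2;
    std::ifstream in(argv[1], std::ios::binary | std::ios::ate);
    if (!in)
        return 1;
    std::streamsize size = in.tellg();  // file size, via seek-to-end open mode
    in.seekg(0);
    try {
        std::vector<char> buf(size);  // heap buffer the size of the whole file
        in.read(buf.data(), size);    // touching all of it counts against the cgroup
    } catch (const std::bad_alloc &) {
        std::cerr << "allocation failed" << std::endl;  // lands in the job's stderr log
        return 1;
    }
    return 0;
}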

Thanks & Regards,
Vikrant Aggarwal


On Fri, May 22, 2020 at 1:18 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

Any thoughts here?

Thanks & Regards,
Vikrant Aggarwal


On Wed, May 20, 2020 at 1:25 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

I have a query about how memory-based cgroups behave with the hard limit policy setting.

With the following message in the starter logs, the job went into held status. First of all, the peak usage is confusing: 3.5 GB when the limit is approximately 10 GB (per core); probably virtual memory was being counted toward the OOM condition. The system had sufficient RAM, so that was not an issue.

05/19/20 13:02:29 (pid:1461781) Job was held due to OOM event: Job has gone over memory limit of 10069 megabytes. Peak usage: 3565 megabytes.
From the job history:
condor_history 7524291.93 -json -attr ResidentSetSize_RAW
[
{
"ResidentSetSize_RAW": 3650128
}
]
(ResidentSetSize_RAW is reported in KB, so 3650128 KB is about 3565 MB, matching the peak usage in the log line above.)
Sometimes I have seen the job not going into held status but being marked as completed instead.

For example, with the following C++ code: I expect it to allocate approximately 10 GB of virtual memory, and the inner loop should then turn the virtual allocation into physical memory by touching every page.

#include <cstdlib> // required for malloc()

int main()
{
    unsigned int *ptr[10];
    for (int i = 0; i < 10; i++) {
        // Reserve ~1000 MB of virtual memory per iteration (~10 GB total).
        ptr[i] = (unsigned int *) malloc(1024 * 1024 * 1000);
        if (ptr[i] == NULL)
            return 1; // an unchecked NULL here would itself segfault below
        // Touch every element so the pages get backed by physical memory.
        for (int j = 0; j < 1024 * 256 * 1000; j++) {
            ptr[i][j] = j;
        }
    }
    return 0;
}
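(Each unsigned int is 4 bytes, so the inner loop's 1024*256*1000 iterations cover exactly the 1024*1024*1000 bytes returned by malloc. When trying this, it is worth compiling without optimization, e.g. g++ -O0, since an optimizer may drop the dead stores and the pages would never actually be touched.)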

I was expecting the above test to also make the job go into held status, but to my surprise the starter log shows it exited with status 11, and it was marked as completed. It should have been removed or held, right?

05/20/20 03:24:16 (pid:3752231) Limiting (hard) memory usage to 6012534784 bytes
05/20/20 03:24:16 (pid:3752231) Limiting memsw usage to 135110701056 bytes
05/20/20 03:24:21 (pid:3752231) Process exited, pid=3752397, status=11
05/20/20 03:24:21 (pid:3752231) Got SIGQUIT. Performing fast shutdown.
05/20/20 03:24:21 (pid:3752231) ShutdownFast all jobs.
05/20/20 03:24:21 (pid:3752231) **** condor_starter (condor_STARTER) pid 3752231 EXITING WITH STATUS 0
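If the status in that "Process exited" line is the raw wait() status (my assumption; I have not checked the starter source), then 11 would mean the process was killed by signal 11 (SIGSEGV) rather than calling exit(11); a NULL returned by malloc and then dereferenced would do exactly that. A standalone sketch showing how the two cases decode differently:

#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>

int main()
{
    pid_t pid = fork();
    if (pid == 0) {
        raise(SIGSEGV);  // child dies from signal 11
        _exit(0);        // never reached
    }
    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))          // killed by a signal: raw status is the signal number (11)
        std::printf("raw status %d: killed by signal %d\n", status, WTERMSIG(status));
    else if (WIFEXITED(status))       // exit(11) instead yields a raw status of 0x0b00 (2816)
        std::printf("raw status %d: exited with code %d\n", status, WEXITSTATUS(status));
    return 0;
}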

Can anyone please help me understand in which scenarios a job goes into held versus completed status when it tries to use more memory than the per-core limit?

condor version: 8.5.8
# condor_config_val CGROUP_MEMORY_LIMIT_POLICY
hard
OS : RHEL 6.10

Thanks & Regards,
Vikrant Aggarwal