
Re: [HTCondor-users] cgroups issue (cgroup invalid operation)



And I think this will help more (I stripped the domain name from the paths).

 

This is an strace excerpt from the procd daemon:

 

write(3, "03/18/16 21:41:32 : gathering usage data for family with root pid 3278427\n", 74) = 74

lseek(3, 0, SEEK_CUR)                   = 570505

open("/cgroup/memory/htcondor/condor_home_condor_slot1_3@wn296 //memory.stat", O_RDONLY|O_CLOEXEC) = 9

fstat(9, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0

 

(seems to work)

 

munmap(0x7fc321dbb000, 4096)            = 0

open("/cgroup/cpuacct/htcondor/condor_home_condor_slot1_2@wn296 //tasks", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0

 

(clearly fails)

 

Note the // in the paths... but that's not the issue.

The thing is that there's nothing useful in /cgroup/cpuacct/htcondor/:

[root@wn296 htcondor]# ll /cgroup/cpuacct/htcondor/

total 0

--w--w---- 1 root root 0 Mar 17 12:26 cgroup.event_control

-rw-rw-r-- 1 root root 0 Mar 17 12:26 cgroup.procs

-r--r--r-- 1 root root 0 Mar 17 12:26 cpuacct.stat

-rw-rw-r-- 1 root root 0 Mar 17 12:26 cpuacct.usage

-r--r--r-- 1 root root 0 Mar 17 12:26 cpuacct.usage_percpu

-rw-rw-r-- 1 root root 0 Mar 17 12:26 notify_on_release

-rw-rw-r-- 1 root root 0 Mar 17 12:26 tasks

 

But it's mounted correctly, so...?

[root@wn296 htcondor]# grep cpuacct /proc/mounts

cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0

 

?
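The strace excerpt suggests that the per-slot cgroup exists under the memory controller but not under cpuacct. A minimal sketch for checking this across all slots is shown here; `slot_cgroups` and `missing_slots` are hypothetical helper names, and the `/cgroup` root and `htcondor` base are taken from the paths above.

```python
import os


def slot_cgroups(root, controller, base):
    """Return the names of child cgroups under <root>/<controller>/<base>."""
    path = os.path.join(root, controller, base)
    if not os.path.isdir(path):
        # The controller is not mounted here, or the base cgroup is absent.
        return set()
    return {
        entry
        for entry in os.listdir(path)
        if os.path.isdir(os.path.join(path, entry))
    }


def missing_slots(root, base, reference, other):
    """List child cgroups present under `reference` but absent under `other`."""
    present = slot_cgroups(root, reference, base)
    return sorted(present - slot_cgroups(root, other, base))
```

On this node, `missing_slots("/cgroup", "htcondor", "memory", "cpuacct")` would list the slot cgroups that exist in the memory hierarchy but were never created in the cpuacct one.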

 

Fred

 

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On behalf of SCHAER Frederic
Sent: Friday, 18 March 2016 21:32
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [PROVENANCE INTERNET] Re: [HTCondor-users] cgroups issue (cgroup invalid operation)

 

Hi Brian :]

 

Ah, yes, I forgot to give that part of the config. It matches what I read in various guides:

 

[root@wn296 htcondor]# condor_config_val -startd -dump|grep -i cgroup

BASE_CGROUP = htcondor

CGROUP_MEMORY_LIMIT_POLICY = soft

 

Fred

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On behalf of Brian Bockelman
Sent: Friday, 18 March 2016 21:23
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] cgroups issue (cgroup invalid operation)

 

Hi Frederic,

 

What's the value of the "BASE_CGROUP" configuration variable?  I ask because I would have thought the ProcLog would show something like "Unable to read cgroup /htcondor/..." instead of "Unable to read cgroup htcondor/...".  Maybe there's a base "/" that is getting automatically added in some places but not others?
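To illustrate why a stray leading "/" could matter when a cgroup name is combined with a controller mount point, here is a small sketch. It is not HTCondor's code: `cgroup_file` is a hypothetical helper, and it only shows that normalising the pieces makes "htcondor" and "/htcondor" resolve to the same file, and that a doubled slash inside a path is harmless.

```python
import posixpath


def cgroup_file(mount_point, base_cgroup, child, filename):
    """Build the path of a control file inside a child cgroup."""
    # Strip slashes so "htcondor" and "/htcondor" behave alike. Without this,
    # posixpath.join would discard mount_point when a later part is absolute.
    parts = [mount_point, base_cgroup.strip("/"), child.strip("/"), filename]
    # normpath collapses repeated slashes inside the path.
    return posixpath.normpath(posixpath.join(*parts))
```

With this helper, `cgroup_file("/cgroup/cpuacct", "htcondor", "some_slot", "tasks")` and the same call with `"/htcondor"` give an identical result.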

 

Brian

 

On Mar 18, 2016, at 3:19 PM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:

 

Hi,

 

So, I deployed cgroups for HTCondor... and I'm having issues.

First: it looks like I have to restart condor, not just condor_reconfig it (right?).

Without a service restart, I get logs about "cgroups not initialized".

 

The problem is that even after restarting condor, the ProcLog shows these errors:

 

03/18/16 20:59:53 : Unable to read cgroup htcondor/condor_home_condor_slot1_2@wn296 cpuacct stats (ProcFamily 3278233): Cgroup invalid operation.

03/18/16 20:59:53 : Internal cgroup error when retrieving CPU statistics: Cgroup invalid operation

03/18/16 20:59:53 : Unable to read cgroup htcondor/condor_home_condor_slot1_2@wn296 memory stats (ProcFamily 3278233): 50016 No such file or directory.
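To see at a glance which controllers and slots are affected, lines like these can be pulled apart with a regular expression. This is a sketch only: the pattern is inferred from the three log lines above, not from any documented ProcLog format, and `parse_cgroup_errors` is a hypothetical helper name.

```python
import re

# Pattern inferred from the "Unable to read cgroup ..." lines quoted above.
PROCLOG_CGROUP_ERROR = re.compile(
    r"Unable to read cgroup (?P<cgroup>\S+) (?P<controller>\w+) stats "
    r"\(ProcFamily (?P<family>\d+)\): (?P<error>.+)"
)


def parse_cgroup_errors(log_text):
    """Return (cgroup, controller, family_pid, error) for each matching line."""
    errors = []
    for line in log_text.splitlines():
        match = PROCLOG_CGROUP_ERROR.search(line)
        if match:
            errors.append((
                match.group("cgroup"),
                match.group("controller"),
                int(match.group("family")),
                match.group("error").strip(),
            ))
    return errors
```

Lines that do not follow the "Unable to read cgroup" shape, such as the "Internal cgroup error" line, are skipped.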

 

I am deploying cgroups so that memory accounting (RSS) doesn't double- or triple-count memory, which in the end causes jobs to be killed after they are reported as consuming more memory than requested.

I'm wondering how to fix this?

To answer the questions that are likely to come up:

 

- Yes, cgconfig is running, and cgroups are mounted (though not visible with the mount command; sl6x/sl6.7 here):

[root@wn296 htcondor]# service cgconfig status

Running

[root@wn296 htcondor]# grep cgroup /proc/mounts

cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0

cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0

cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0

cgroup /cgroup/devices cgroup rw,relatime,devices 0 0

cgroup /cgroup/memory cgroup rw,relatime,memory 0 0

cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0

cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0

cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0

 

- Yes, I see htcondor subdirectories, and I even see PIDs in the subdirectories' tasks files:

 

[root@wn296 htcondor]# wc -l /cgroup/memory/htcondor/condor_home_condor_slot1_*/tasks|sed -r -e 's%(wn...).*/%\1/%'

0 /cgroup/memory/htcondor/condor_home_condor_slot1_10@wn296/tasks

0 /cgroup/memory/htcondor/condor_home_condor_slot1_11@wn296/tasks

6 /cgroup/memory/htcondor/condor_home_condor_slot1_1@wn296/tasks

22 /cgroup/memory/htcondor/condor_home_condor_slot1_2@wn296/tasks

22 /cgroup/memory/htcondor/condor_home_condor_slot1_3@wn296/tasks

22 /cgroup/memory/htcondor/condor_home_condor_slot1_4@wn296/tasks

0 /cgroup/memory/htcondor/condor_home_condor_slot1_5@wn296/tasks

0 /cgroup/memory/htcondor/condor_home_condor_slot1_6@wn296/tasks

0 /cgroup/memory/htcondor/condor_home_condor_slot1_7@wn296/tasks

0 /cgroup/memory/htcondor/condor_home_condor_slot1_8@wn296/tasks

0 /cgroup/memory/htcondor/condor_home_condor_slot1_9@wn296/tasks

72 total

 

- Cgroups config:

 

[root@wn296 htcondor]# cat /etc/cgconfig.{conf,d/htcondor.conf}

# This file is being maintained by Puppet.

# DO NOT EDIT

 

mount {

      cpu     = /cgroup/cpu;

      cpuset  = /cgroup/cpuset;

      cpuacct = /cgroup/cpuacct;

      devices = /cgroup/devices;

      memory  = /cgroup/memory;

      freezer = /cgroup/freezer;

      net_cls = /cgroup/net_cls;

      blkio   = /cgroup/blkio;

}

group htcondor {

      cpu {}

      cpuacct {}

      memory {}

      freezer {}

      blkio {}

}

 

- And condor version: 8.4.3-2
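The first two checks in the list above (which controllers are mounted where, and how many PIDs sit in each slot's tasks file) can also be scripted. The following is a sketch under stated assumptions: the controller names are simply those seen in the /proc/mounts output above, and `cgroup_mounts` and `task_counts` are hypothetical helper names.

```python
import os

# Controller names as they appear in the /proc/mounts listing above.
CONTROLLERS = ("cpu", "cpuset", "cpuacct", "devices",
               "memory", "freezer", "net_cls", "blkio")


def cgroup_mounts(mounts_text, controllers=CONTROLLERS):
    """Map each mounted cgroup controller to its mount point."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        for option in fields[3].split(","):
            if option in controllers:
                result[option] = fields[1]
    return result


def task_counts(controller_dir):
    """Count the PIDs listed in the tasks file of each child cgroup."""
    counts = {}
    for name in sorted(os.listdir(controller_dir)):
        tasks = os.path.join(controller_dir, name, "tasks")
        if os.path.isfile(tasks):
            with open(tasks) as handle:
                counts[name] = sum(1 for line in handle if line.strip())
    return counts
```

For example, `cgroup_mounts(open("/proc/mounts").read())` gives the controller-to-mount-point map, and `task_counts("/cgroup/memory/htcondor")` mirrors the `wc -l` output shown above.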

 

The problem is that jobs are submitted (not by me) with memory requirements, and these jobs are still being killed because of the approximate RSS accounting without cgroups. For now, the killing is still going on :'(

 

Any ideas about what's wrong and, better, how to fix it? I admit I'm very new to cgroups, so it might be that I'm mistaken somewhere...

Thanks

 

Frederic
