
Re: [HTCondor-users] cgroup error



Oops,

I forgot to mention that I mainly used the config from the manual:

[root@wn2-test]/cgroup/cpu/htcondor# cat /etc/cgconfig.conf
[ snip ]
group htcondor {
      cpu {}
      cpuacct {}
      memory {}
      freezer {}
      blkio {}
}

In htcondor.cfg:
# Enable CGROUP control
BASE_CGROUP = htcondor
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated when there is free memory
CGROUP_MEMORY_LIMIT_POLICY = hard
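
A minimal sketch for checking that the controllers named in the group above are actually mounted, which is what the "not mounted" warnings in the procd log further down point at. It assumes cgroup v1, where each mounted hierarchy shows up in /proc/mounts with filesystem type "cgroup" and the controller names among the mount options. The function names and the sample text are hypothetical and not part of HTCondor.

```python
# Controllers listed in the "group htcondor" block of cgconfig.conf above.
REQUIRED = {"cpu", "cpuacct", "memory", "freezer", "blkio"}


def mounted_cgroup_controllers(mounts_text):
    """Return the required controllers found in /proc/mounts-style text."""
    found = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        # Controller names appear as mount options next to rw, relatime, etc.
        found.update(opt for opt in fields[3].split(",") if opt in REQUIRED)
    return found


def missing_controllers(mounts_text):
    """Return the required controllers that are not mounted."""
    return REQUIRED - mounted_cgroup_controllers(mounts_text)


# Hypothetical sample: only cpu and memory are mounted.
sample = (
    "proc /proc proc rw,relatime 0 0\n"
    "cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0\n"
    "cgroup /cgroup/memory cgroup rw,relatime,memory 0 0\n"
)
print(sorted(missing_controllers(sample)))
```

On a worker node the same function could be fed the contents of /proc/mounts instead of the sample string; an empty result would mean all five controllers are mounted.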

Are there any further knobs for fine-tuning, or things I should be aware of? Would cgroups also be the tool to prevent fork bombs?

cheers
        ~christoph


-- 
/*   Christoph Beyer     |   Office: Building 2b / 23     *\
 *   DESY                |    Phone: 040-8998-2317        *
 *   - IT -              |      Fax: 040-8994-2317        *
\*   22603 Hamburg       |     http://www.desy.de         */

----- Original Message -----
From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 26 August, 2015 15:42:31
Subject: Re: [HTCondor-users] cgroup error

Hi Iain,

Thanks for the offer. I tried again with a longer job runtime, and everything seems to work as expected now :)

[chbeyer@bm-test]/mnt/bshare/chbeyer% condor_q -hold
 250.1   chbeyer         8/26 09:53 Error from slot1@xxxxxxxxxxxxxxxx: Job has gone over memory limit of 1024 megabytes.                                                               
 250.4   chbeyer         8/26 09:54 Error from slot1@xxxxxxxxxxxxxxxx: Job has gone over memory limit of 1024 megabytes.                                                               
 250.8   chbeyer         8/26 09:54 Error from slot1@xxxxxxxxxxxxxxxx: Job has gone over memory limit of 1024 megabytes.                                                               
[ snip ]
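
To summarize such holds across many jobs, the lines can be parsed with a small script. This is only a sketch: it assumes the column layout and the exact hold message shown in the output above, and the names HOLD_LINE and memory_holds are made up for illustration.

```python
import re

# Matches hold lines of the form shown above:
#   <cluster.proc> <owner> <date> <time> Error from <slot>: Job has gone over
#   memory limit of <N> megabytes.
HOLD_LINE = re.compile(
    r"^\s*(?P<job_id>\d+\.\d+)\s+(?P<owner>\S+)\s+\S+\s+\S+\s+"
    r"Error from (?P<slot>[^:\s]+): Job has gone over memory limit of "
    r"(?P<limit_mb>\d+) megabytes\."
)


def memory_holds(condor_q_output):
    """Return (job_id, owner, limit_mb) for each memory-limit hold line."""
    holds = []
    for line in condor_q_output.splitlines():
        match = HOLD_LINE.match(line)
        if match:
            holds.append(
                (match.group("job_id"), match.group("owner"),
                 int(match.group("limit_mb")))
            )
    return holds


# Two of the lines from the output above, plus a line that should be ignored.
output = (
    " 250.1   chbeyer         8/26 09:53 Error from slot1@xxxxxxxxxxxxxxxx: "
    "Job has gone over memory limit of 1024 megabytes.\n"
    " 250.4   chbeyer         8/26 09:54 Error from slot1@xxxxxxxxxxxxxxxx: "
    "Job has gone over memory limit of 1024 megabytes.\n"
    "[ snip ]\n"
)
for job_id, owner, limit_mb in memory_holds(output):
    print(job_id, owner, limit_mb)
```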

I will be looking at the hierarchical accounting groups next. I remember a sophisticated entry from Andrew with wildcards and some kind of guessing logic; maybe you can help me out with that if you use it the same way? ;)
 
cheers
        chris


-- 
/*   Christoph Beyer     |   Office: Building 2b / 23     *\
 *   DESY                |    Phone: 040-8998-2317        *
 *   - IT -              |      Fax: 040-8994-2317        *
\*   22603 Hamburg       |     http://www.desy.de         */

----- Original Message -----
From: "Iain Steers" <iain.steers@xxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, 25 August, 2015 16:17:17
Subject: Re: [HTCondor-users] cgroup error

Hi Christoph,

We've been running SLC6 cgroups in our HTCondor pool for a couple of months now without issue.

Give me a shout and mail me your config if you'd like.

Cheers, Iain

On Tue, Aug 25, 2015 at 08:53:08AM -0500, Lincoln Bryant wrote:
> Hi,
> 
> Shot in the dark, but.. do you have the cgroups service running? /etc/init.d/cgconfig status?
> 
> Cheers,
> Lincoln
> 
> > On Aug 25, 2015, at 8:48 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> > 
> > 
> > Hi, 
> > 
> > I am using SL6 (2.6.32-504.8.1.el6.x86_64) and HTC 8.3.7 Jul 23 2015 BuildID: 331383
> > 
> > I enabled cgroups as described in the manual and sent 'stress' jobs using 10 GB of memory while announcing 1 GB of memory usage via the submit file.
> > 
> > The result is not what I would expect:
> > 
> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
> > 20378 chbeyer   20   0  9.8g 1.0g  144 D  2.0  6.4   0:01.03 stress                                                                          
> > 20400 chbeyer   20   0  9.8g 1.0g  144 D  2.0  6.4   0:00.77 stress                                                                          
> > 20381 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.4   0:00.97 stress                                                                          
> > 20384 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.97 stress                                                                          
> > 20386 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.4   0:01.03 stress                                                                          
> > 20388 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.86 stress                                                                          
> > 20392 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.80 stress                                                                          
> > 20398 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.71 stress           
> > 
> > 
> > The procd log file shows some errors: 
> > 
> > 
> > 08/25/15 15:40:26 : PROC_FAMILY_GET_USAGE
> > 08/25/15 15:40:26 : gathering usage data for family with root pid 20361
> > 08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxx cpuacct stats (ProcFamily 20373): Cgroup invalid operation.
> > 08/25/15 15:40:26 : Internal cgroup error when retrieving CPU statistics: Cgroup invalid operation
> > 08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxx memory stats (ProcFamily 20373): 50016 No such file or directory.
> > 08/25/15 15:40:26 : PROC_FAMILY_GET_USAGE
> > 08/25/15 15:40:26 : gathering usage data for family with root pid 20360
> > 08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxx cpuacct stats (ProcFamily 20372): Cgroup invalid operation.
> > [ snip ]
> > 08/25/15 13:46:41 : PROC_FAMILY_TRACK_FAMILY_VIA_CGROUP
> > 08/25/15 13:46:41 : Setting cgroup to htcondor/condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxx for ProcFamily 15896.
> > 08/25/15 13:46:41 : Warning - cgroup controller cpuacct not mounted (but not required).
> > 08/25/15 13:46:41 : Warning - cgroup controller memory not mounted (but not required).
> > 08/25/15 13:46:41 : Warning - cgroup controller freezer not mounted (but not required).
> > 08/25/15 13:46:41 : Warning - cgroup controller blkio not mounted (but not required).
> > 08/25/15 13:46:41 : Warning - cgroup controller cpu not mounted (but not required).
> > 08/25/15 13:46:41 : Cannot attach pid 15896 to cgroup htcondor/condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxx for ProcFamily 15896: 50014 Cgroup not initialized
> > 
> > I thought jobs that exceed the memory limit by far would be killed and put on hold, but that seems to happen only from time to time (?)
> > 
> > best regards
> >        ~christoph
> > 
> > 
> > -- 
> > /*   Christoph Beyer     |   Office: Building 2b / 23     *\
> > *   DESY                |    Phone: 040-8998-2317        *
> > *   - IT -              |      Fax: 040-8994-2317        *
> > \*   22603 Hamburg       |     http://www.desy.de         */
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
> > 
> 