Re: [HTCondor-users] cgroup error



Yep, it is running, but in the meantime I figured out that the total running time of the job may simply be too short: it is 200 sec, which is not very realistic anyway.
I will try a longer runtime tomorrow ...

Thanks for the input!

best regards
        ~christoph


-- 
/*   Christoph Beyer     |   Office: Building 2b / 23     *\
 *   DESY                |    Phone: 040-8998-2317        *
 *   - IT -              |      Fax: 040-8994-2317        *
\*   22603 Hamburg       |     http://www.desy.de         */


----- Original Message -----
From: Lincoln Bryant <lincolnb@xxxxxxxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Sent: Tue, 25 Aug 2015 15:53:08 +0200 (CEST)
Subject: Re: [HTCondor-users] cgroup error

Hi,

Shot in the dark, but.. do you have the cgroups service running? /etc/init.d/cgconfig status?
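
For anyone hitting the same procd warnings: on SL6 the cgconfig service is normally what mounts the cgroup controllers, so besides the init script status one can also look at /proc/mounts directly. Below is a minimal Python sketch of that check, assuming cgroup v1 (as on a 2.6.32 kernel). The function names are made up for illustration and are not part of HTCondor. The controller list is simply the one named in the procd warnings.

```python
from pathlib import Path

# Controllers named in the procd "not mounted" warnings.
CONTROLLERS = ("cpuacct", "memory", "freezer", "blkio", "cpu")


def mounted_cgroup_controllers(mounts_text):
    """Return the set of cgroup v1 controllers that appear as mounted.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields (device, mount point, fs type, options, ...).
    """
    found = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        # For cgroup v1 mounts the controller names are listed among the options.
        found.update(opt for opt in fields[3].split(",") if opt in CONTROLLERS)
    return found


def missing_controllers(mounts_text):
    """Return the controllers from CONTROLLERS that are not mounted, in order."""
    mounted = mounted_cgroup_controllers(mounts_text)
    return [name for name in CONTROLLERS if name not in mounted]


if __name__ == "__main__":
    proc_mounts = Path("/proc/mounts")
    if proc_mounts.exists():
        missing = missing_controllers(proc_mounts.read_text())
        print("missing controllers:", ", ".join(missing) or "none")
    else:
        print("/proc/mounts not found; this check only applies to Linux")
```

If this reports all five controllers as missing on the execute node, the "Cgroup not initialized" error is most likely just a consequence of nothing being mounted, rather than an HTCondor problem.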

Cheers,
Lincoln

> On Aug 25, 2015, at 8:48 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> 
> 
> Hi, 
> 
> I am using SL6 (2.6.32-504.8.1.el6.x86_64) and HTC 8.3.7 Jul 23 2015 BuildID: 331383
> 
> I enabled cgroups as described in the manual and sent 'stress' jobs that use 10 GB of memory while declaring only 1 GB of memory usage in the submit file.
> 
> The result is not what I would expect:
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 20378 chbeyer   20   0  9.8g 1.0g  144 D  2.0  6.4   0:01.03 stress
> 20400 chbeyer   20   0  9.8g 1.0g  144 D  2.0  6.4   0:00.77 stress
> 20381 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.4   0:00.97 stress
> 20384 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.97 stress
> 20386 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.4   0:01.03 stress
> 20388 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.86 stress
> 20392 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.80 stress
> 20398 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.71 stress
> 
> 
> The procd log file shows some errors: 
> 
> 
> 08/25/15 15:40:26 : PROC_FAMILY_GET_USAGE
> 08/25/15 15:40:26 : gathering usage data for family with root pid 20361
> 08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxx cpuacct stats (ProcFamily 20373): Cgroup invalid operation.
> 08/25/15 15:40:26 : Internal cgroup error when retrieving CPU statistics: Cgroup invalid operation
> 08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxx memory stats (ProcFamily 20373): 50016 No such file or directory.
> 08/25/15 15:40:26 : PROC_FAMILY_GET_USAGE
> 08/25/15 15:40:26 : gathering usage data for family with root pid 20360
> 08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxx cpuacct stats (ProcFamily 20372): Cgroup invalid operation.
> [ snip ]
> 08/25/15 13:46:41 : PROC_FAMILY_TRACK_FAMILY_VIA_CGROUP
> 08/25/15 13:46:41 : Setting cgroup to htcondor/condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxx for ProcFamily 15896.
> 08/25/15 13:46:41 : Warning - cgroup controller cpuacct not mounted (but not required).
> 08/25/15 13:46:41 : Warning - cgroup controller memory not mounted (but not required).
> 08/25/15 13:46:41 : Warning - cgroup controller freezer not mounted (but not required).
> 08/25/15 13:46:41 : Warning - cgroup controller blkio not mounted (but not required).
> 08/25/15 13:46:41 : Warning - cgroup controller cpu not mounted (but not required).
> 08/25/15 13:46:41 : Cannot attach pid 15896 to cgroup htcondor/condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxx for ProcFamily 15896: 50014 Cgroup not initialized
> 
> I expected jobs that exceed the memory limit by this much to be killed and put on hold, but that only seems to happen from time to time (?)
> 
> best regards
>        ~christoph
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 