
Re: [HTCondor-users] cgroups question/problem



Hi Bob,

Was the job accessing CVMFS at the same time?  We've seen the kernel deadlock (in non-HTCondor contexts) when mixing FUSE filesystems and cgroup-based memory limits.
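
If it helps to confirm whether FUSE mounts are present on the node, a rough Python sketch follows. It assumes the usual /proc/mounts layout (device, mount point, filesystem type, ...) and that FUSE filesystems report a type of "fuse" or "fuse.<subtype>"; the function name fuse_mounts is only illustrative.

```python
def fuse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs for FUSE filesystems.

    mounts_text is the content of /proc/mounts (or /proc/<pid>/mounts).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Assumption: FUSE mounts show up as "fuse" or "fuse.<subtype>".
        # The "fusectl" control filesystem is deliberately not matched.
        if fs_type == "fuse" or fs_type.startswith("fuse."):
            found.append((mount_point, fs_type))
    return found
```

On a worker node this could be called as fuse_mounts(open("/proc/mounts").read()).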

What happens when you SIGKILL these by hand?
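
Before sending the signal, it may be worth checking whether the leftover processes are in uninterruptible sleep (state "D" in /proc/<pid>/stat), since a process in that state does not act on SIGKILL until it leaves it. A minimal Python sketch, with illustrative function names:

```python
def proc_state(stat_text):
    """Return the one-letter state from the content of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so take everything after the LAST ')'.
    rest = stat_text[stat_text.rfind(")") + 1:]
    return rest.split()[0]


def read_state(pid):
    """Read the state of a live process; return None if it cannot be read."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return proc_state(f.read())
    except OSError:
        # The process has exited or /proc is not available.
        return None
```

For example, read_state(1748422) would report the state of the top-xaod process from the pstree output quoted below; a result of "D" would suggest it is stuck in the kernel rather than ignoring the signal.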

Brian

> On Mar 15, 2016, at 9:37 AM, Bob Ball <ball@xxxxxxxxx> wrote:
> 
> We have recently upgraded to HTCondor 8.4.4 on our worker nodes, and are encountering problems with cgroups.  The issue is this.  From the end of StarterLog.slot10 on this particular WN, I find
> 
> 03/14/16 22:07:10 (pid:1722453) Running job as user usatlas1
> 03/14/16 22:07:10 (pid:1722453) Create_Process succeeded, pid=1722478
> 03/14/16 22:07:10 (pid:1722453) Limiting (soft) memory usage to 4294967296 bytes
> 03/14/16 22:07:10 (pid:1722453) Limiting (hard) memory usage to 4174556160 bytes
> 03/14/16 22:07:10 (pid:1722453) Limiting memsw usage to 4174560256 bytes
> 03/15/16 02:18:26 (pid:1722453) Hold all jobs
> 03/15/16 02:18:26 (pid:1722453) Job was held due to OOM event: Job has gone over memory limit of 4096 megabytes.
> 03/15/16 02:18:26 (pid:1722453) ShutdownFast all jobs.
> 
> The thing is, the top of that process tree is still around, along with most (if not all) of the subprocesses.
> 
> condor   1722453       1  0 Mar14 ?        00:00:00 condor_starter -f -a slot10 gate04.local
> 
> [root@c-117-13 ~]# pstree -h -p -l 1722453
> condor_starter(1722453)---bash(1722478)-+-bash(2453448)
> `-python(1722744)-+-python(1724660)-+-sh(1725396)---python(1725397)---sh(1738769)---python(1748382)---top-xaod(1748422)
> | `-sh(1725882)---MemoryMonitor(1764264)---sh(2453440)---sh(2453441)
> `-{python}(1724539)
> 
> We have set the following:
> BASE_CGROUP = htcondor
> CGROUP_MEMORY_LIMIT_POLICY = soft
> 
> So, why are jobs like this still hanging around?  There is no progress on this or similar jobs, and they end up wedging the WN.  The WN no longer appears in condor_status output.
> 
> The condor_startd is no longer running, with nothing in the StartLog to indicate why.  The MasterLog contains this:
> 03/15/16 03:05:01 ERROR: Child pid 178946 appears hung! Killing it hard.
> 03/15/16 03:05:01 DefaultReaper unexpectedly called on pid 178946, status 9.
> 03/15/16 03:05:01 The STARTD (pid 178946) was killed because it was no longer responding
> 
> The last entry in the ProcLog is simultaneous with the hold on the slot10 job above, i.e.,
> 03/15/16 02:18:19 : PROC_FAMILY_GET_USAGE
> 03/15/16 02:18:19 : gathering usage data for family with root pid 1651340
> 03/15/16 02:18:26 : PROC_FAMILY_GET_USAGE
> 03/15/16 02:18:26 : gathering usage data for family with root pid 1722478
> 03/15/16 02:18:26 : taking a snapshot...
> 03/15/16 02:18:26 : ProcAPI: new boottime = 1456240746; old_boottime = 1456240745; /proc/stat boottime = 1456240746; /proc/uptime boottime = 1456240746
> 
> Is this a bug in 8.4.4?  We had upgraded from 8.2.10 and never saw anything like this.
> 
> Thanks,
> bob
>