
[HTCondor-users] cgroups question/problem



We recently upgraded to HTCondor 8.4.4 on our worker nodes (WNs) and are encountering problems with cgroups. At the end of StarterLog.slot10 on one affected WN, I find:

03/14/16 22:07:10 (pid:1722453) Running job as user usatlas1
03/14/16 22:07:10 (pid:1722453) Create_Process succeeded, pid=1722478
03/14/16 22:07:10 (pid:1722453) Limiting (soft) memory usage to 4294967296 bytes
03/14/16 22:07:10 (pid:1722453) Limiting (hard) memory usage to 4174556160 bytes
03/14/16 22:07:10 (pid:1722453) Limiting memsw usage to 4174560256 bytes
03/15/16 02:18:26 (pid:1722453) Hold all jobs
03/15/16 02:18:26 (pid:1722453) Job was held due to OOM event: Job has gone over memory limit of 4096 megabytes.
03/15/16 02:18:26 (pid:1722453) ShutdownFast all jobs.
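
For reference, the byte limits in the excerpt above can be converted to megabytes with a few lines of Python. This is only a sketch: the regular expression and the function name `limits_in_mib` are my own and not part of HTCondor, and the sample lines are simply copied from the log.

```python
import re

# Sample lines copied from the StarterLog excerpt.
LOG = """\
03/14/16 22:07:10 (pid:1722453) Limiting (soft) memory usage to 4294967296 bytes
03/14/16 22:07:10 (pid:1722453) Limiting (hard) memory usage to 4174556160 bytes
03/14/16 22:07:10 (pid:1722453) Limiting memsw usage to 4174560256 bytes
"""

# Matches "Limiting (soft) memory usage to N bytes" (also "(hard)")
# and "Limiting memsw usage to N bytes".
LIMIT_RE = re.compile(r"Limiting (?:\((\w+)\) memory|(memsw)) usage to (\d+) bytes")

def limits_in_mib(text):
    """Return {limit name: size in MiB} for every limit message found in text."""
    limits = {}
    for m in LIMIT_RE.finditer(text):
        name = m.group(1) or m.group(2)  # "soft", "hard" or "memsw"
        limits[name] = int(m.group(3)) / (1024 * 1024)
    return limits

if __name__ == "__main__":
    for name, mib in limits_in_mib(LOG).items():
        print(f"{name}: {mib:.1f} MiB")
```

The soft limit works out to exactly 4096 MiB, matching the figure in the hold message, while the hard and memsw limits come out lower, at roughly 3981 MiB.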

However, the top of that process tree is still running, along with most (if not all) of its subprocesses:

condor 1722453 1 0 Mar14 ? 00:00:00 condor_starter -f -a slot10 gate04.local

[root@c-117-13 ~]# pstree -h -p -l 1722453
condor_starter(1722453)---bash(1722478)-+-bash(2453448)
`-python(1722744)-+-python(1724660)-+-sh(1725396)---python(1725397)---sh(1738769)---python(1748382)---top-xaod(1748422)
| `-sh(1725882)---MemoryMonitor(1764264)---sh(2453440)---sh(2453441)
`-{python}(1724539)

We have set the following:
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
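
To check which processes are still attached to a job's cgroup and how its memory usage compares to the limits, something like the following could be run on the WN. It is a sketch under assumptions: it presumes the cgroup v1 memory controller, and the directory is passed in as an argument because the mount point and the per-slot directory name under `htcondor` vary by system. The function name `cgroup_summary` is hypothetical.

```python
import os
import sys

def cgroup_summary(cgroup_dir):
    """Read the PIDs and memory counters from one cgroup v1 memory directory."""
    def read(name):
        # Each control file holds whitespace-separated integers.
        with open(os.path.join(cgroup_dir, name)) as f:
            return f.read().split()

    return {
        "pids": [int(p) for p in read("cgroup.procs")],
        "usage_bytes": int(read("memory.usage_in_bytes")[0]),
        "limit_bytes": int(read("memory.limit_in_bytes")[0]),
        "soft_limit_bytes": int(read("memory.soft_limit_in_bytes")[0]),
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python cgroup_summary.py <path to the job's memory cgroup directory>
    print(cgroup_summary(sys.argv[1]))
```

Comparing the reported PIDs against the pstree output would show whether the leftover processes are still inside the cgroup.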

So why are jobs like this still hanging around? This and similar jobs make no progress at all, and they end up wedging the WN; condor_status no longer shows it.

The condor_startd is no longer running, and nothing in the StartLog indicates why. The MasterLog contains this:
03/15/16 03:05:01 ERROR: Child pid 178946 appears hung! Killing it hard.
03/15/16 03:05:01 DefaultReaper unexpectedly called on pid 178946, status 9.
03/15/16 03:05:01 The STARTD (pid 178946) was killed because it was no longer responding

The last entry in the ProcLog is simultaneous with the hold on the slot10 job above, i.e.:
03/15/16 02:18:19 : PROC_FAMILY_GET_USAGE
03/15/16 02:18:19 : gathering usage data for family with root pid 1651340
03/15/16 02:18:26 : PROC_FAMILY_GET_USAGE
03/15/16 02:18:26 : gathering usage data for family with root pid 1722478
03/15/16 02:18:26 : taking a snapshot...
03/15/16 02:18:26 : ProcAPI: new boottime = 1456240746; old_boottime = 1456240745; /proc/stat boottime = 1456240746; /proc/uptime boottime = 1456240746

Is this a bug in 8.4.4? We upgraded from 8.2.10, where we never saw anything like this.

Thanks,
bob