
Re: [HTCondor-users] increasing schedd memory usage [v8.6.0?]



Hi Todd and Brian,

yes, it is probably not version related.
We downgraded from 8.6.0 to 8.4.11 and the node got into the strange
behaviour again.

The thing is that after some time I see multiple condor_schedd processes
running, each using 4-5 GB of memory [1].
The master spawns just one schedd after being restarted [2], and I have
no idea where the other schedds are coming from. Judging from the PIDs,
they are spawned pretty close to each other (unfortunately, I just
restarted condor and forgot to dive into their /proc/PIDs first).
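Next time it happens I will try to record the parentage before
restarting; roughly something like this (the PIDs are just the ones from
[1], as an example):

# list all schedd processes together with their parent PIDs
> ps -o pid,ppid,etime,rss,cmd -C condor_schedd
# or check the parent of each suspicious PID directly in /proc
> for pid in 1950398 1950399 1951012 1950418; do grep '^PPid' /proc/$pid/status; done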

Suspiciously, dmesg reports several times that the original schedd
2060869 started by the master had run out of memory [3].
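A simple way to pull the OOM events out of dmesg, and to check the
current badness score of the surviving schedd for comparison with the
~339 reported below, would be something like:

> dmesg | grep -E 'Out of memory|Killed process'
# oom score of the oldest (master-started) condor_schedd process
> cat /proc/$(pgrep -o -x condor_schedd)/oom_score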

Cheers,
  Thomas

[1]
1950398 condor    20   0 6797m 4.3g  644 D 38.3 27.6   0:31.61 condor_schedd
1950399 condor    20   0 6797m 5.2g  624 D 38.3 33.4   0:39.68 condor_schedd
...
1951012 condor    20   0 6797m 5.2g  624 R 27.7 33.2   0:38.87 condor_schedd
1950418 condor    20   0 6797m 4.7g  576 D 25.5 30.4   0:36.00 condor_schedd


[2]
> grep "/usr/sbin/condor_schedd" /var/log/condor/MasterLog | tail -n 6
12/19/16 16:41:41 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3157
02/01/17 16:42:44 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3734842
02/09/17 13:29:19 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3163
02/09/17 18:00:00 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 196569
02/10/17 15:58:47 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1852855
02/10/17 16:30:08 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 2060869


[3]
> dmesg
...
[1948198]   497 1948198    10372       17   5       0             0 nrpe
Out of memory: Kill process 2060869 (condor_schedd) score 339 or sacrifice child
Killed process 1947478, UID 0, (condor_schedd) total-vm:6960724kB, anon-rss:5505256kB, file-rss:148kB
condor_shadow invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
condor_shadow cpuset=/ mems_allowed=0
Pid: 1953543, comm: condor_shadow Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81131420>] ? dump_header+0x90/0x1b0
...
[1954251]     0 1954251     3576       15   5       0             0 arc-lcmaps
Out of memory: Kill process 2060869 (condor_schedd) score 339 or sacrifice child
Killed process 1950394, UID 0, (condor_schedd) total-vm:6960724kB, anon-rss:5156116kB, file-rss:636kB
condor_q invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
condor_q cpuset=/ mems_allowed=0
Pid: 1950796, comm: condor_q Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81131420>] ? dump_header+0x90/0x1b0
...
Out of memory: Kill process 2060869 (condor_schedd) score 339 or sacrifice child
Killed process 1950398, UID 0, (condor_schedd) total-vm:6960724kB, anon-rss:4717704kB, file-rss:8kB



On 2017-02-10 21:02, Todd Tannenbaum wrote:
> Just as a data point, fwiw, I just looked at the ganglia chart for a
> fairly busy (~7000 jobs running at any moment) schedd here at UW-Madison
> which has been running v8.6.0 for three weeks.  No sign of memory leaks
> or bursts.
> 
> regards,
> Todd
> 
> 
> On 2/10/2017 9:13 AM, Thomas Hartmann wrote:
>> Hi Brian,
>>
>> thanks for the suggestion
>>
>> On 2017-02-10 03:19, Brian Bockelman wrote:
>>> Is it possible the extra memory usage is coming from when the
>>> condor_schedd process forks to respond to condor_q queries?  Are you
>>> seeing an abnormally large amount of queries?
>>
>> not that I am aware of - any queries should come only from the ARC
>> CE, but as far as I can see both our ARC CEs have been ~equally busy.
>> As a cross-check, I restarted the CE daemon, but it has had no effect
>> on the memory consumption so far and only reduced the number of
>> connections to the outside [1] compared to its sibling (which should
>> be the expected behaviour).
>> On the affected node quite a number of shadows were kept open [2],
>> but that should be OK, shouldn't it?
>>
>> We have now downgraded the version to 8.4.11 and will keep an eye on
>> it over the weekend.
>> If the behaviour gets back to normal, we can at least exclude Condor.
>>
>> Cheers,
>>   Thomas
>>
>>
>>
>>
>> [1]
>>> grid-arcce1 > wc -l /proc/net/tcp*
>> 335 /proc/net/tcp
>> 10 /proc/net/tcp6
>> 345 total
>>
>>> grid-arcce0 > wc -l /proc/net/tcp*
>> 2733 /proc/net/tcp
>> 16 /proc/net/tcp6
>> 2749 total
>>
>>
>> [2]
>>> lsof -i TCP | grep condor | cut -d " " -f 1 | sort | uniq -c
>>       1 condor_de
>>       1 condor_ma
>>       4 condor_sc
>>> lsof | grep condor | cut -d " " -f 1 | sort | uniq -c
>>      27 condor_de
>>      30 condor_ma
>>      19 condor_pr
>>      45 condor_sc
>>   44776 condor_sh
>>       1 scan-cond
>>
>>
>>
