
Re: [Condor-users] collector memory leak / history truncation



All -

Thanks for all the follow-ups. I am working with the Condor team to
debug this, and we will post what we learn.

Doug


On 7/1/08, ucarlino@xxxxxxxxxx <ucarlino@xxxxxxxxxx> wrote:
> Master host is Red Hat Linux RHEL 3.
>
> Total machines:
>
> 101:condor@lnxgen7/home/condor> condor_status -t
>
>                      Total Owner Claimed Unclaimed Matched Preempting Backfill
>
>          INTEL/LINUX   204    24      88        92       0          0        0
>        INTEL/WINNT51  2347   204     477      1658       8          0        0
>        INTEL/WINNT52    12     5       0         7       0          0        0
>      SUN4u/SOLARIS28    15    11       0         4       0          0        0
>      SUN4u/SOLARIS29     6     2       0         4       0          0        0
>
>                Total  2584   246     565      1765       8          0        0
>
> 105:condor@lnxgen7/home/condor> ps auxw | grep condor_
> condor    3606  0.1  0.4 21084 17332 ?       S    Mar12 166:59 /home/condor/6.8.6/Linux-2.4-i386/sbin/condor_master
> condor    3674  0.0  0.1  7992 3912 ?        S    Mar12  19:16 condor_startd -f
> condor    3675  0.0  0.0  8168 3572 ?        S    Mar12   1:36 condor_schedd -f
> condor    3676 43.8  3.6 143232 139488 ?     S    Mar12 69754:49 condor_negotiator -f
> condor   25070 16.0  2.5 102648 98528 ?      S    Jun30 181:47 condor_collector -f
> condor    2537  0.0  0.0  1616  472 pts/1    S    01:52   0:00 grep condor_
>
> 109:condor@lnxgen7/home/condor> top -bn1 | grep condor_
>  3676 condor    25   0  136M 136M  2512 R    22.5  3.6 69755m   2 condor_negotiat
> 25070 condor    15   0 98560  96M  2460 S     3.1  2.5 182:11   0 condor_collecto
>  3606 condor    15   0 17332  16M  2596 S     0.0  0.4 166:59   0 condor_master
>  3674 condor    15   0  3912 3912  2840 S     0.0  0.1  19:16   0 condor_startd
>  3675 condor    15   0  3572 3572  2792 S     0.0  0.0   1:36   3 condor_schedd
>
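For anyone who wants to track growth over time rather than eyeball single snapshots, here is a minimal sketch that pulls the RSS column out of `ps auxw` output like the listing above. The function name `parse_ps_rss` is made up for illustration, and the sketch assumes the standard 11-column `ps auxw` layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) with wrapped lines already joined.

```python
import os


def parse_ps_rss(ps_output):
    """Map each condor_* daemon name to its resident set size (RSS) in KiB.

    Assumes `ps auxw` columns; if a daemon name appears more than once,
    the last matching line wins.
    """
    rss_by_daemon = {}
    for line in ps_output.splitlines():
        # Split into the 10 fixed columns plus the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header, blank, or truncated line
        # Use the executable's base name so full paths also match.
        name = os.path.basename(fields[10].split()[0])
        if name.startswith("condor_"):
            rss_by_daemon[name] = int(fields[5])  # RSS is the 6th column
    return rss_by_daemon


# Sample lines copied from the listing above (wrapped lines joined).
sample = """\
condor    3676 43.8  3.6 143232 139488 ?     S    Mar12 69754:49 condor_negotiator -f
condor   25070 16.0  2.5 102648 98528 ?      S    Jun30 181:47 condor_collector -f
condor    2537  0.0  0.0  1616  472 pts/1    S    01:52   0:00 grep condor_
"""

print(parse_ps_rss(sample))
```

Logging this dictionary with a timestamp at regular intervals would show whether the collector's RSS keeps climbing between restarts; the `grep` line is skipped because its executable is not a `condor_*` binary.
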
> Collector is restarted by cron twice a week:
> 110:condor@lnxgen7/home/condor> crontab -l
> # Cron entries for Micron 'is' Condor pool
> # Activate on the pool controller using: crontab /home/condor/cron/collector_crontab
> #
> # Restart the Collector at 7:00 AM on Mondays and Thursdays
>
> 0 7 * * mon,thu /home/condor/6.8.6/Linux-2.4-i386/sbin/condor_restart -subsystem Collector >/tmp/collector_restart.log 2>&1
>
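As a possible variation on the fixed twice-a-week schedule, the restart could be made conditional on the collector's actual memory use. The sketch below is only an illustration under stated assumptions: it assumes a Linux `/proc/<pid>/status` file with a `VmRSS:` line, the 200 MiB limit is an arbitrary placeholder, and the helper names `rss_kib_from_status` and `restart_command` are hypothetical. Only the `condor_restart -subsystem Collector` invocation comes from the crontab above.

```python
import re


def rss_kib_from_status(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def restart_command(rss_kib, limit_kib, condor_restart):
    """Return the argv that restarts the collector, or None if under the limit."""
    if rss_kib is None or rss_kib < limit_kib:
        return None
    return [condor_restart, "-subsystem", "Collector"]


# Path taken from the crontab entry above.
CONDOR_RESTART = "/home/condor/6.8.6/Linux-2.4-i386/sbin/condor_restart"
# Hypothetical threshold (200 MiB); tune it to the pool.
LIMIT_KIB = 200 * 1024

# Illustrative status text; a real run would read /proc/<collector pid>/status.
status = "Name:\tcondor_collector\nVmRSS:\t   98528 kB\n"
argv = restart_command(rss_kib_from_status(status), LIMIT_KIB, CONDOR_RESTART)
print(argv)
```

A cron job could run such a check hourly and pass a non-`None` result to `subprocess.call`, so the collector is only bounced when it has actually grown past the limit.
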
> Regards,
> 	Umberto
>
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steffen Grunewald
> Sent: 01 July 2008 09:47
> To: condor-users@xxxxxxxxxxx
> Subject: Re: [Condor-users] collector memory leak / history truncation
>
> On Tue, Jul 01, 2008 at 09:37:11AM +0200, ucarlino@xxxxxxxxxx wrote:
>> We've been experiencing the same problem for a long time now,
>> with 6.8.6, 7.0.1, 7.0.2, 7.1.0 and 7.0.3. They all have the same
>> problem.
>> And I agree that it seems proportional to the number
>> of machines in the pool.
>
> How many are there? We're running a 600+ node cluster, and after more than 1
> million hours of accumulated usage:
>
> # ps auxw | grep condor
> condor     491  0.0  0.1  17020  3260 ?        Ss   Jun02  28:43 /usr/sbin/condor_master
> condor     492 14.2  2.8  71480 57960 ?        Ss   Jun02 5930:59 condor_collector -f
> condor     495  1.7  3.0  74520 61300 ?        Ss   Jun02 730:49 condor_negotiator -f
> condor     496  0.0  0.1  18152  3916 ?        Ss   Jun02   0:24 condor_schedd -f
> root       497  0.0  0.1  11812  2492 ?        S    Jun02   9:03 condor_procd -A /usr/share/condor/local/log/procd_pipe.SCHEDD -C 666
> root      9903  0.0  0.0   2748   604 pts/1    R+   09:44   0:00 grep condor
> # top -bn1 | grep condor
>   492 condor    20   0 71480  56m 2564 R   34  2.8   5931:00 condor_collecto
>   491 condor    20   0 17020 3260 2168 S    0  0.2  28:43.55 condor_master
>   495 condor    20   0 74520  59m 2764 S    0  3.0 730:49.44 condor_negotiat
>   496 condor    20   0 18152 3916 3144 S    0  0.2   0:24.44 condor_schedd
>   497 root      20   0 11812 2492 1064 S    0  0.1   9:03.47 condor_procd
>
> Extra ClassAd attributes? (We don't have any.)
>
> Steffen
>
> --
> Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
> Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
> * e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
> No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/