
Re: [Condor-users] memory leak in Condor 7.4.2 schedd ???



We're using SSL authentication. There are only a few machines entering
and leaving the pool, but a reasonable number of jobs (~6000).
Most of the machine classads represent offline machines,
which are updated every 15 minutes. I'm not sure what protocol
is used - I haven't altered the defaults, so I think it's UDP.
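
If it helps, something like the following should confirm those settings
on this host (UPDATE_COLLECTOR_WITH_TCP being the knob that controls
whether collector updates go over TCP rather than UDP):

  condor_config_val SEC_DEFAULT_AUTHENTICATION
  condor_config_val SEC_DEFAULT_AUTHENTICATION_METHODS
  condor_config_val SEC_DEFAULT_SESSION_DURATION
  condor_config_val UPDATE_COLLECTOR_WITH_TCP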

regards,

-ian.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
> Sent: 24 June 2010 15:47
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] memory leak in Condor 7.4.2 schedd ???
> 
> What kind of authentication is your pool using?
> SEC_DEFAULT_AUTHENTICATION
> 
> What is the value of
> SEC_DEFAULT_SESSION_DURATION
> 
> Do you have a lot of nodes
> entering and leaving the pool?  What about a lot of jobs?
> 
> Security session bloat is one big reason for collectors, and sometimes
> schedds too, to grow to a large memory size, especially if TCP is
> involved.
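> 
> If the session cache does turn out to be the culprit, one thing worth
> trying (just a sketch - the value is in seconds) is shortening how long
> cached sessions live, e.g. in the config:
> 
>   SEC_DEFAULT_SESSION_DURATION = 3600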
> 
> Steve
> 
> 
> On Thu, 24 Jun 2010, Smith, Ian wrote:
> 
> > Hi Dan,
> >
> > I've copied this here:
> >
> > http://pcwww.liv.ac.uk/~smithic/core.17281.Z
> >
> > It's about 500 MB so I'm not sure how much luck you will have
> > with downloading it. As I write, the scheduler is using a stonking
> > 1700 MB and we have only one job in the queue!
> >
> > regards,
> >
> > -ian.
> >
> >> -----Original Message-----
> >> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> >> bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> >> Sent: 23 June 2010 15:12
> >> To: Condor-Users Mail List
> >> Subject: Re: [Condor-users] memory leak in Condor 7.4.2 schedd ???
> >>
> >> Ian,
> >>
> >> We might be able to tell where the problem is by looking at a core file
> >> from the bloated schedd process.  One way to generate one is this:
> >>
> >> gdb -p <PID of schedd>
> >> (gdb) gcore
> >> (gdb) quit
> >>
> >> It will write the core file into your current working directory, so make
> >> sure there is enough space.  Also, it will take some time (a minute or
> >> two, I imagine), during which the schedd will be unresponsive.
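> >>
> >> (Since this looks like a Solaris box, the standalone gcore(1) utility
> >> should also work if it's installed, without attaching a debugger:
> >>
> >> gcore -o core <PID of schedd>
> >>
> >> which writes core.<PID> into the current directory.)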
> >>
> >> --Dan
> >>
> >> Smith, Ian wrote:
> >>> Apologies for the rather long-running thread, but I've just now seen a
> >>> repeat of the excessive schedd memory usage described earlier.
> >>>
> >>> Running top
> >>>
> >>>    PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
> >>>
> >>>  17281 root       1  59    0 1043M 1038M sleep   55:11  0.50% condor_schedd
> >>>
> >>> and pmap:
> >>>
> >>> 17281:  condor_schedd -f
> >>> 00010000    6752K r-x--  /opt1/condor_7.4.3/sbin/condor_schedd
> >>> 006B6000     536K rwx--  /opt1/condor_7.4.3/sbin/condor_schedd
> >>> 0073C000     784K rwx--    [ heap ]
> >>> 00800000 1056768K rwx--    [ heap ]
> >>> FEF00000     608K r-x--  /lib/libm.so.2
> >>> FEFA6000      24K rwx--  /lib/libm.so.2
> >>> FF000000    1216K r-x--  /lib/libc.so.1
> >>> FF130000      40K rwx--  /lib/libc.so.1
> >>> FF13A000       8K rwx--  /lib/libc.so.1
> >>> FF160000      64K rwx--    [ anon ]
> >>> FF180000     584K r-x--  /lib/libnsl.so.1
> >>> FF222000      40K rwx--  /lib/libnsl.so.1
> >>> FF22C000      24K rwx--  /lib/libnsl.so.1
> >>> FF240000      64K rwx--    [ anon ]
> >>> FF260000      64K rwx--    [ anon ]
> >>> FF280000      16K r-x--  /lib/libm.so.1
> >>> FF292000       8K rwx--  /lib/libm.so.1
> >>> FF2A0000     240K r-x--  /lib/libresolv.so.2
> >>> FF2E0000      24K rwx--    [ anon ]
> >>> FF2EC000      16K rwx--  /lib/libresolv.so.2
> >>> FF300000      48K r-x--  /lib/libsocket.so.1
> >>> FF310000       8K rwx--    [ anon ]
> >>> FF31C000       8K rwx--  /lib/libsocket.so.1
> >>> FF320000     128K r-x--  /lib/libelf.so.1
> >>> FF340000       8K rwx--  /lib/libelf.so.1
> >>> FF350000       8K rwx--    [ anon ]
> >>> FF360000       8K r-x--  /lib/libkstat.so.1
> >>> FF372000       8K rwx--  /lib/libkstat.so.1
> >>> FF380000       8K r-x--  /lib/libdl.so.1
> >>> FF38E000       8K rwxs-    [ anon ]
> >>> FF392000       8K rwx--  /lib/libdl.so.1
> >>> FF3A0000       8K r-x--  /platform/sun4u-us3/lib/libc_psr.so.1
> >>> FF3B0000     208K r-x--  /lib/ld.so.1
> >>> FF3F0000       8K r--s-  dev:32,12 ino:70306
> >>> FF3F4000       8K rwx--  /lib/ld.so.1
> >>> FF3F6000       8K rwx--  /lib/ld.so.1
> >>> FFBEC000      80K rwx--    [ stack ]
> >>>  total   1068448K
> >>>
> >>> So it does look to me like around 1 GB of heap is allocated to the schedd.
> >>> Currently I have 889 jobs in total, 450 idle and 439 running, which seems
> >>> pretty modest.
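> >>>
> >>> (A quick sanity check on that figure, summing the [ heap ] lines of the
> >>> pmap output, would be something like:
> >>>
> >>>   pmap 17281 | nawk '/\[ heap \]/ { sum += $2 } END { print sum " KB of heap" }'
> >>>
> >>> which should come out at a little over 1 GB here.)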
> >>>
> >>> regards,
> >>>
> >>> -ian.
> >>>
> >>>
> >>>> Is that virtual, resident or private memory usage?
> >>>>
> >>>> Output of,
> >>>>
> >>>>   top -n1 -b -p $(pidof condor_schedd)
> >>>>   pmap -d $(pidof condor_schedd)
> >>>>
> >>>> ?
> >>>>
> >>>> FYI, Condor uses string interning to minimize the memory footprint of
> >>>> jobs (all classads actually), but, iirc, does not always garbage collect
> >>>> the string pool. If you have a lot of jobs passing through your Schedd,
> >>>> say with large unique Environments, you could certainly see memory usage
> >>>> increase. Then of course there could just be a memory leak.
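> >>>>
> >>>> A rough way to check for that is to look at how large the environment
> >>>> strings of the queued jobs are (the attribute is Env or Environment
> >>>> depending on the submit syntax), e.g.:
> >>>>
> >>>>   condor_q -long | egrep '^(Env|Environment) =' | nawk '{ print length($0) }' | sort -n | tail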
> >>>>
> >>>> Best,
> >>>>
> >>>>
> >>>> matt
> 
> --
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D  (630) 840-8525
> timm@xxxxxxxx  http://home.fnal.gov/~timm/
> Fermilab Computing Division, Scientific Computing Facilities,
> Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/