
Re: [Condor-users] loadavg thread died, restarting. (exit code=2)




> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Matt Hope
> Sent: Friday, December 09, 2005 3:57 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] loadavg thread died, restarting. (exit
> code=2)
> 
> 
> On 12/8/05, Orchard, Bob <Robert.Orchard@xxxxxxxxxxxxxx> wrote:
> >
> > Running on Windows 2000. Condor client version 6.6.10.
> 
> On the 64-bit Windows platform this is a known bug (it is to do with
> adding performance counters), though in that instance you get an exit
> code of 1, not 2, so it appears to be getting past initializing the WMI
> perf counters query but is not able to add a counter.
> 
> Definitely keep on 6.6.10 since on 6.6.8 and below this would cause a
> massive memory leak in your startd.
> 
> > When a job is NOT running I get the following messages
> > every 5 minutes or so.
> >
> > 11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
> > 11/30 10:23:27 no loadavg samples this minute, maybe thread died???
> 
> This suggests it is failing to add a counter to "\\System\\Processor
> Queue Length" but this is likely to not be specific to that counter
> and more likely to be a general issue polling the perf counters.
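
To confirm the "every 5 minutes or so" pattern and see whether the exit
code is always 2, the restart lines can be pulled out of the startd log
and the gaps between them measured. The following is a minimal sketch,
not part of Condor. It assumes the log lines have exactly the shape quoted
above; the log carries no year, so the year is passed in as an assumption,
and the third sample line is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
#   11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
LINE_RE = re.compile(
    r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) "
    r"loadavg thread died, restarting\. \(exit code=(\d+)\)"
)

def parse_restarts(lines, year=2005):
    """Return (timestamp, exit_code) pairs for each loadavg restart line.

    The year is an assumption: the quoted log format does not include one.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            month, day, hh, mm, ss, code = (int(g) for g in m.groups())
            events.append((datetime(year, month, day, hh, mm, ss), code))
    return events

def gaps_in_seconds(events):
    """Seconds between consecutive restarts, in log order."""
    times = [t for t, _ in events]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# The first two lines are from the report; the third is a hypothetical
# follow-up line added only to illustrate the gap calculation.
sample = [
    "11/30 10:23:22 loadavg thread died, restarting. (exit code=2)",
    "11/30 10:23:27 no loadavg samples this minute, maybe thread died???",
    "11/30 10:28:22 loadavg thread died, restarting. (exit code=2)",
]
events = parse_restarts(sample)
print(gaps_in_seconds(events))       # [300.0]
print({code for _, code in events})  # {2}
```

A steady gap with a single exit code would point at the same failure
recurring on each restart rather than at an intermittent one.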
> 
> Is your "Windows Management Instrumentation" service running (or is it
> set to disabled)?

This service is running.
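
For checking this on several pool machines, the service state can be read
from a script instead of the Services panel. The following is a rough
sketch under some assumptions: the `sc` command-line tool is available on
the machine (it may need to be installed separately on Windows 2000), the
WMI service is registered under the name `winmgmt`, and the `sc query`
output has the usual "STATE : <number> <word>" line. The sample text is
illustrative, not captured from the machine in question.

```python
import re
import subprocess

def parse_sc_state(sc_output):
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` text."""
    m = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return m.group(1) if m else None

def wmi_service_state(service="winmgmt"):
    """Run `sc query <service>` and return its state, or None if unparseable.

    Only meaningful on Windows with sc.exe on the PATH (an assumption).
    """
    result = subprocess.run(
        ["sc", "query", service], capture_output=True, text=True
    )
    return parse_sc_state(result.stdout)

# Illustrative output in the shape `sc query` normally prints.
SAMPLE = """SERVICE_NAME: winmgmt
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING
"""
print(parse_sc_state(SAMPLE))  # RUNNING
```

On a Windows machine, `wmi_service_state()` would run the real query; the
parsing function is kept separate so it can be tried on saved output.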

> 
> > 12/6 11:24:03 ProcFamily::currentfamily: ERROR: family_size is 0
> > 12/6 11:24:03 WARNING: No processes found in starter's family
> 
> This is not necessarily a critical error; if your job creates a lot
> of short-lived processes it could just have stale data.
> 

The jobs submitted are quite long; each runs for 40 minutes to 2 hours.


> 
> > Has anyone had this problem or does anyone know what the source of the
> > problem could be? It seems specific to my machine and not others in
> > our pool.
> >
> > Some supplemental information. My machine sometimes also allows
> > more than 1 job to be scheduled at the same time. So I end up with many
> > sub-directories under condor/execute. I've had up to 65 directories
> > created and many of these were the same job running at the same time.
> > Output from StarterLog file below shows the same job being started within
> > 30 seconds and both running at the same time. This is not
> > supposed to happen.
> 
> That looks bad. Are you running multiple startd's on this
> machine by accident?

No, there are not multiple startd's, but there are many condor_exec.exe processes.

> 
> What does your process list show for executables starting
> with "condor_"?

Not running right now, and I didn't capture that, but I'm quite certain that there
was just the master, schedd, and startd plus the condor_exec processes.
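
Next time the problem shows up, the process list could be captured and
summarized rather than recalled from memory. The following is a sketch
only. It assumes a CSV process listing in the shape produced by
`tasklist /FO CSV /NH` (that tool ships with Windows XP and later, so on
Windows 2000 some equivalent listing would have to be substituted), and
the sample rows, PIDs and memory figures are hypothetical.

```python
import csv
import io
from collections import Counter

def condor_process_counts(tasklist_csv):
    """Count processes whose image name starts with 'condor_'.

    Expects CSV rows whose first column is the image name, as in
    `tasklist /FO CSV /NH` output (an assumption about the input format).
    """
    reader = csv.reader(io.StringIO(tasklist_csv))
    names = [row[0].lower() for row in reader if row]
    return Counter(n for n in names if n.startswith("condor_"))

# Hypothetical listing: one master, startd and schedd, two condor_exec jobs.
SAMPLE = '''"System Idle Process","0","Console","0","16 K"
"condor_master.exe","812","Console","0","4,120 K"
"condor_startd.exe","940","Console","0","6,300 K"
"condor_schedd.exe","952","Console","0","5,010 K"
"condor_exec.exe","1500","Console","0","20,480 K"
"condor_exec.exe","1620","Console","0","20,512 K"
'''
counts = condor_process_counts(SAMPLE)
print(counts["condor_startd.exe"], counts["condor_exec.exe"])  # 1 2
```

A startd count above 1 would answer the "multiple startd's" question
directly, and the condor_exec count can be compared with the number of
directories under condor/execute.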

> 
> > A second bit of information that may be relevant. It is possible that some
> > time ago, when I was cleaning up user accounts, I deleted the condor_reuse_vm1
> ...
> > I've installed and uninstalled condor several times to try to get rid of this
> > unusual problem.
> 
> Sounds very unhappy. Is the machine suffering from other
> issues/symptoms? Is it SMP/Hyperthreaded?

No, the machine behaves quite well.

> 
> Have you considered the 'nuclear' option of reinstalling from 
> the OS up...
> 

I've thought that this might be the only option but it is a significant
effort to get back to my current state and I'll only do this if I
think this workstation is critical to the condor pool. I was hoping
for a simple fix ... 

> Matt
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>