
Re: [Condor-users] loadavg thread died, restarting. (exit code=2)



On 12/8/05, Orchard, Bob <Robert.Orchard@xxxxxxxxxxxxxx> wrote:
>
> Running on Windows 2000. Condor client version 6.6.10.

On the 64-bit Windows platform this is a known bug (it is to do with
adding performance counters), though in that instance you get an exit
code of 1, not 2. So it appears to be getting past initializing the WMI
perf counters query but is not able to add a counter.

Definitely stay on 6.6.10, since on 6.6.8 and below this would cause a
massive memory leak in your startd.

> When a job is NOT running I get the following messages every 5 minutes or so.
>
> 11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
> 11/30 10:23:27 no loadavg samples this minute, maybe thread died???

This suggests it is failing to add the "\\System\\Processor Queue
Length" counter, but this is unlikely to be specific to that counter
and more likely to be a general issue polling the perf counters.
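
To check whether the thread always dies with the same exit code, the
log messages can be tallied. A rough sketch, assuming only the message
format quoted above; the function name is made up for illustration, and
the log text would come from wherever your startd log lives:

```python
import re
from collections import Counter

# Matches the message format quoted above, e.g.
# "11/30 10:23:22 loadavg thread died, restarting. (exit code=2)"
LOADAVG_DIED = re.compile(
    r"loadavg thread died, restarting\. \(exit code=(\d+)\)"
)


def tally_loadavg_exit_codes(log_text):
    """Count how often each exit code appears in the given log text.

    Hypothetical helper: it only knows the message format shown in
    this thread and ignores every other line.
    """
    return Counter(int(m.group(1)) for m in LOADAVG_DIED.finditer(log_text))
```

If every entry shows exit code 2, that is consistent with the query
being initialized but the counter failing to add; a mix of codes would
point somewhere else.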

Is your "Windows Management Instrumentation" service running, or is it
set to disabled?
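
One way to check this from a script, as a sketch only: it assumes the
"sc" command-line tool is available on the machine (it may need
installing separately on Windows 2000), that the service's short name
is "winmgmt", and that the output is in English. The function names are
made up for illustration.

```python
import re
import subprocess
from typing import Optional


def parse_sc_field(sc_output: str, field: str) -> Optional[str]:
    """Pull the symbolic value of a field such as STATE or START_TYPE
    out of `sc query` / `sc qc` output, e.g. 'RUNNING' or 'DISABLED'.
    Returns None if the field is not present."""
    pattern = r"^\s*" + re.escape(field) + r"\s*:\s*\d+\s+(\w+)"
    match = re.search(pattern, sc_output, re.MULTILINE)
    return match.group(1) if match else None


def wmi_service_status():
    """Run sc against the WMI service and return (state, start_type).

    Assumption: `sc` is on the PATH; this only works on Windows.
    """
    query = subprocess.run(["sc", "query", "winmgmt"],
                           capture_output=True, text=True).stdout
    config = subprocess.run(["sc", "qc", "winmgmt"],
                            capture_output=True, text=True).stdout
    return (parse_sc_field(query, "STATE"),
            parse_sc_field(config, "START_TYPE"))
```

Calling wmi_service_status() on the affected machine should give
something like a running state with an automatic start type; a stopped
state or a DISABLED start type would be worth fixing first.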

> 12/6 11:24:03 ProcFamily::currentfamily: ERROR: family_size is 0
> 12/6 11:24:03 WARNING: No processes found in starter's family

This is not necessarily a critical error. If your job creates a lot
of short-lived processes, it could just have stale data.


> Has anyone had this problem or does anyone know what the source of the
> problem could be? It seems specific to my machine and not others in our pool.
>
> Some supplemental information. My machine sometimes also allows
> more than 1 job to be scheduled at the same time. So I end up with many
> sub-directories under condor/execute. I've had up to 65 directories
> created and many of these were the same job running at the same time.
> Output from StarterLog file below shows the same job being started within
> 30 seconds and both running at the same time. This is not
> supposed to happen.

That looks bad. Are you running multiple startds on this machine by accident?

What does your process list show for executables starting with "condor_"?
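
If it helps, here is a sketch for counting them. It assumes a process
list in the CSV format that "tasklist /fo csv /nh" produces (image name
in the first column); as far as I know tasklist ships with Windows XP
and later, so on Windows 2000 you may need another tool to produce the
list. The function name is made up for illustration.

```python
import csv
import io
from collections import Counter


def condor_process_counts(tasklist_csv):
    """Count processes whose image name starts with "condor_".

    Assumes headerless CSV rows with the image name in column 0,
    as in: "condor_startd.exe","1234","Console","0","3,456 K"
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row and row[0].lower().startswith("condor_"):
            counts[row[0].lower()] += 1
    return counts
```

More than one condor_startd.exe (or more than one condor_master.exe) in
the result would fit with the same job being started twice.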

> A second bit of information that may be relevant. It is possible that some
> time ago when I was cleaning up user accounts, that I deleted the condor_reuse_vm1
...
> I've installed and uninstalled condor several times to try to get rid of this
> unusual problem

Sounds very unhappy. Is the machine suffering from other
issues/symptoms? Is it SMP/Hyperthreaded?

Have you considered the 'nuclear' option of reinstalling from the OS up?

Matt