
[Condor-users] loadavg thread died, restarting. (exit code=2)



I am running Condor client version 6.6.10 on Windows 2000.

When no job is running, I get the following messages every 5 minutes or so:

11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
11/30 10:23:27 no loadavg samples this minute, maybe thread died???
11/30 10:28:27 loadavg thread died, restarting. (exit code=2)
11/30 10:28:32 no loadavg samples this minute, maybe thread died???
11/30 10:33:32 loadavg thread died, restarting. (exit code=2)
11/30 10:33:37 no loadavg samples this minute, maybe thread died???
11/30 10:38:37 loadavg thread died, restarting. (exit code=2)
11/30 10:38:42 no loadavg samples this minute, maybe thread died???

When a job is running, the messages come about every 5 seconds:

12/6 11:23:38 loadavg thread died, restarting. (exit code=2)
12/6 11:23:43 no loadavg samples this minute, maybe thread died???
12/6 11:23:48 loadavg thread died, restarting. (exit code=2)
12/6 11:23:53 no loadavg samples this minute, maybe thread died???
12/6 11:23:58 loadavg thread died, restarting. (exit code=2)
12/6 11:24:03 no loadavg samples this minute, maybe thread died???
12/6 11:24:03 ProcFamily::currentfamily: ERROR: family_size is 0
12/6 11:24:03 WARNING: No processes found in starter's family
12/6 11:24:08 loadavg thread died, restarting. (exit code=2)
12/6 11:24:13 no loadavg samples this minute, maybe thread died???
12/6 11:24:18 loadavg thread died, restarting. (exit code=2)
12/6 11:24:23 no loadavg samples this minute, maybe thread died???
12/6 11:24:28 loadavg thread died, restarting. (exit code=2)
12/6 11:24:33 no loadavg samples this minute, maybe thread died???
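The two cadences can be measured directly from a log excerpt like the ones above. The following is a minimal sketch, not a Condor tool: the function name `restart_intervals` is hypothetical, and it assumes every line starts with a `month/day hh:mm:ss` timestamp as shown. The log carries no year, so the code supplies a placeholder year (2000) purely so the timestamps can be parsed and subtracted.

```python
import re
from datetime import datetime

# Matches lines such as:
#   11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
DIED_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) loadavg thread died")

# Assumption: the log has no year, so a placeholder is used for parsing only.
PLACEHOLDER_YEAR = "2000"


def restart_intervals(lines):
    """Return the seconds between consecutive 'loadavg thread died' messages."""
    times = []
    for line in lines:
        match = DIED_RE.match(line.strip())
        if match:
            stamp = PLACEHOLDER_YEAR + "/" + match.group(1)
            times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    # Pair each restart with the next one and take the difference.
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Applied to the excerpts above, consecutive restarts are 305 seconds apart when no job is running and 10 seconds apart when a job is running (the two message types alternate, which is why a message appears about every 5 seconds).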

Has anyone seen this problem, or does anyone know what its source could be?
It seems specific to my machine; the other machines in our pool are not affected.

Some supplemental information: my machine sometimes also allows more than
one job to run at the same time, so I end up with many sub-directories
under condor/execute. I've had up to 65 directories created, and many of
these were the same job running at the same time. The StarterLog output
below shows the same job (ID 39.2136) being started twice within 30 seconds,
with both instances running at the same time. This is not supposed to happen.

11/29 00:21:45 ******************************************************
11/29 00:21:45 Using config file: C:\Condor\condor_config
11/29 00:21:45 Using local config files: C:\Condor/condor_config.local
11/29 00:21:45 DaemonCore: Command Socket at <10.10.23.114:4619>
11/29 00:21:45 Setting resource limits not implemented!
11/29 00:21:45 Starter communicating with condor_shadow <10.10.23.142:3419>
11/29 00:21:45 Submitting machine is "iitm50ws0380.iit-iti.priv"
11/29 00:21:46 File transfer completed successfully.
11/29 00:21:47 Starting a VANILLA universe job with ID: 39.2136
11/29 00:21:47 IWD: C:\Condor/execute\dir_3472
11/29 00:21:47 Output file: C:\Condor/execute\dir_3472\izanya.out_2137
11/29 00:21:47 Renice expr "10" evaluated to 10
11/29 00:21:47 About to exec C:\Condor\execute\dir_3472\condor_exec.exe izanya.cfg_2137
11/29 00:21:47 Create_Process succeeded, pid=3532
11/29 00:22:06 ******************************************************
11/29 00:22:06 ** condor_starter (CONDOR_STARTER) STARTING UP
11/29 00:22:06 ** C:\Condor\bin\condor_starter.exe
11/29 00:22:06 ** $CondorVersion: 6.6.10 Jun 22 2005 $
11/29 00:22:06 ** $CondorPlatform: INTEL-WINNT50 $
11/29 00:22:06 ** PID = 844
11/29 00:22:06 ******************************************************
11/29 00:22:06 Using config file: C:\Condor\condor_config
11/29 00:22:06 Using local config files: C:\Condor/condor_config.local
11/29 00:22:07 DaemonCore: Command Socket at <10.10.23.114:4627>
11/29 00:22:07 Setting resource limits not implemented!
11/29 00:22:07 Starter communicating with condor_shadow <10.10.23.142:3437>
11/29 00:22:07 Submitting machine is "iitm50ws0380.iit-iti.priv"
11/29 00:22:07 File transfer completed successfully.
11/29 00:22:08 Starting a VANILLA universe job with ID: 39.2136
11/29 00:22:08 IWD: C:\Condor/execute\dir_844
11/29 00:22:08 Output file: C:\Condor/execute\dir_844\izanya.out_2137
11/29 00:22:08 Renice expr "10" evaluated to 10
11/29 00:22:08 About to exec C:\Condor\execute\dir_844\condor_exec.exe izanya.cfg_2137
11/29 00:22:08 Create_Process succeeded, pid=3516
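Repeated starts of the same job can be picked out of a StarterLog mechanically. The sketch below is an illustration only, under the assumption that each start is logged as a "Starting a ... universe job with ID: ..." line in the format shown above. The function name `find_duplicate_starts` and the 60-second window are hypothetical choices, and a placeholder year (2000) is supplied because the log does not record one.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   11/29 00:21:47 Starting a VANILLA universe job with ID: 39.2136
START_RE = re.compile(
    r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"Starting a \w+ universe job with ID: (\S+)"
)

# Assumption: the log has no year, so a placeholder is used for parsing only.
PLACEHOLDER_YEAR = "2000"


def find_duplicate_starts(lines, window_seconds=60):
    """Return (job_id, first_start, second_start) for every job ID that is
    started again within window_seconds of its previous start."""
    last_start = {}
    duplicates = []
    for line in lines:
        match = START_RE.match(line.strip())
        if not match:
            continue
        stamp = PLACEHOLDER_YEAR + "/" + match.group(1)
        started = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        job_id = match.group(2)
        previous = last_start.get(job_id)
        if previous is not None and started - previous <= timedelta(seconds=window_seconds):
            duplicates.append((job_id, previous, started))
        last_start[job_id] = started
    return duplicates
```

For the excerpt above this reports job 39.2136 once, with the two starts 21 seconds apart. It could be run over a whole file with `find_duplicate_starts(open(path))`, where `path` points at the StarterLog.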

A second bit of information that may be relevant: some time ago, when I was
cleaning up user accounts, I may have deleted the condor_reuse_vm1 account.
It gets recreated, but Windows warns about doing this and says that even if
you recreate the account later with the same name it is 'not the same account'.
I don't know all the ramifications of this. I did try something a bit
bizarre: I looked for the string 'condor_reuse' in all of the Condor
executable files and used a hex editor to change it to 'condor_xxuse'.
When I started Condor, sure enough, it created the user account
condor_xxuse_vm1 and used it to run jobs, but I had the same problems. I
have no way of knowing whether this was enough to 'fix' the user account
problem, but it did appear to create the newly named account.

I've installed and uninstalled Condor several times to try to get rid of this
unusual problem.

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557 
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx 
Government of Canada | Gouvernement du Canada