
Re: [Condor-users] loadavg thread died, restarting. (exit code=2)



A bit more info: the StartLog messages below show a situation where
the same job (39.3011) is started twice. The ACCESS_VIOLATION exception
is probably key here. Perhaps it is related to the removal of the
user account condor_reuse_vm1?

12/8 12:09:27 State change: IS_OWNER is false
12/8 12:09:27 Changing state: Owner -> Unclaimed
12/8 12:10:07 DaemonCore: Command received via UDP from host <10.10.6.33:36266>
12/8 12:10:07 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/8 12:10:07 match_info called
12/8 12:10:07 Received match <10.10.23.114:3893>#1972313343
12/8 12:10:07 State change: match notification protocol successful
12/8 12:10:07 Changing state: Unclaimed -> Matched
12/8 12:10:08 DaemonCore: Command received via TCP from host <10.10.23.142:2609>
12/8 12:10:08 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/8 12:10:08 Request accepted.
12/8 12:10:08 Remote owner is jjvaldes_admin@ir41165valdes
12/8 12:10:08 State change: claiming protocol successful
12/8 12:10:08 Changing state: Matched -> Claimed
12/8 12:10:13 DaemonCore: Command received via TCP from host <10.10.23.142:2615>
12/8 12:10:13 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
12/8 12:10:13 Got activate_claim request from shadow (<10.10.23.142:2615>)
12/8 12:10:13 loadavg thread died, restarting. (exit code=2)
12/8 12:10:18 no loadavg samples this minute, maybe thread died???
12/8 12:10:18 Remote job ID is 39.3011
12/8 12:10:18 ProcFamily::currentfamily: ERROR: family_size is 0
12/8 12:10:18 WARNING: No processes found in starter's family
12/8 12:10:18 Got universe "VANILLA" (5) from request classad
12/8 12:10:18 State change: claim-activation protocol successful
12/8 12:10:18 Changing activity: Idle -> Busy
12/8 12:10:23 loadavg thread died, restarting. (exit code=2)
12/8 12:10:28 no loadavg samples this minute, maybe thread died???
12/8 12:10:28 DaemonCore: Command received via UDP from host <10.10.23.114:3346>
12/8 12:10:28 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
12/8 12:10:28 Starter pid 2856 died on signal -1073741819 (exception ACCESS_VIOLATION)
12/8 12:10:28 ERROR: C:\Condor\execute\dir_2856 still exists after trying to add Full control to ACLs for PRIV_UNKNOWN
12/8 12:10:28 State change: starter exited
12/8 12:10:28 Changing activity: Busy -> Idle
12/8 12:10:33 loadavg thread died, restarting. (exit code=2)
12/8 12:10:38 no loadavg samples this minute, maybe thread died???
12/8 12:10:38 DaemonCore: Command received via TCP from host <10.10.23.142:2624>
12/8 12:10:39 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
12/8 12:10:39 Got activate_claim request from shadow (<10.10.23.142:2624>)
12/8 12:10:39 loadavg thread died, restarting. (exit code=2)
12/8 12:10:44 no loadavg samples this minute, maybe thread died???
12/8 12:10:44 Remote job ID is 39.3011
12/8 12:10:44 ProcFamily::currentfamily: ERROR: family_size is 0
12/8 12:10:44 WARNING: No processes found in starter's family
12/8 12:10:44 Got universe "VANILLA" (5) from request classad
12/8 12:10:44 State change: claim-activation protocol successful
12/8 12:10:44 Changing activity: Idle -> Busy
12/8 12:10:44 loadavg thread died, restarting. (exit code=2)
12/8 12:10:49 no loadavg samples this minute, maybe thread died???
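
For anyone reading the "died on signal -1073741819" line above: the negative number is a Windows status code printed as a signed 32-bit integer. A minimal sketch of the conversion follows; the function name and the one-entry lookup table are only illustrative and are not part of Condor.

```python
def to_ntstatus(signed_code: int) -> str:
    """Reinterpret a signed 32-bit value as an unsigned NTSTATUS hex string."""
    return f"0x{signed_code & 0xFFFFFFFF:08X}"

# Illustrative lookup table; only the code seen in the log above is listed.
KNOWN_STATUS = {"0xC0000005": "STATUS_ACCESS_VIOLATION"}

code = to_ntstatus(-1073741819)
print(code, KNOWN_STATUS.get(code, "unknown"))  # 0xC0000005 STATUS_ACCESS_VIOLATION
```

This agrees with the "exception ACCESS_VIOLATION" text that the startd prints next to the number.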

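To check how often this happens, something like the following rough sketch could scan StartLog lines for a job ID that is activated more than once. It assumes only the "Remote job ID is ..." line format shown in the log above; the function name and sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches the "Remote job ID is <cluster>.<proc>" lines seen in StartLog.
JOB_RE = re.compile(r"Remote job ID is (\d+\.\d+)")

def repeated_activations(lines):
    """Return {job_id: count} for job IDs that appear more than once."""
    counts = Counter()
    for line in lines:
        match = JOB_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {job: n for job, n in counts.items() if n > 1}

# Three lines copied from the log above, used as sample input.
sample = [
    "12/8 12:10:18 Remote job ID is 39.3011",
    "12/8 12:10:28 Starter pid 2856 died on signal -1073741819 (exception ACCESS_VIOLATION)",
    "12/8 12:10:44 Remote job ID is 39.3011",
]
print(repeated_activations(sample))  # {'39.3011': 2}
```

Note that a repeated job ID alone does not show that two copies ran at the same time; in the log above the first starter had already died before the second activation.
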
Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557 
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx 
Government of Canada | Gouvernement du Canada



> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Orchard, Bob
> Sent: Thursday, December 08, 2005 1:51 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] loadavg thread died, restarting. (exit code=2)
> 
> 
> 
> Running on Windows 2000. Condor client version 6.6.10.
> 
> When a job is NOT running I get the following messages every 5 minutes
> or so.
> 
> 11/30 10:23:22 loadavg thread died, restarting. (exit code=2)
> 11/30 10:23:27 no loadavg samples this minute, maybe thread died???
> 11/30 10:28:27 loadavg thread died, restarting. (exit code=2)
> 11/30 10:28:32 no loadavg samples this minute, maybe thread died???
> 11/30 10:33:32 loadavg thread died, restarting. (exit code=2)
> 11/30 10:33:37 no loadavg samples this minute, maybe thread died???
> 11/30 10:38:37 loadavg thread died, restarting. (exit code=2)
> 11/30 10:38:42 no loadavg samples this minute, maybe thread died???
> 
> When a job is running, the messages come about every 5 seconds.
> 
> 12/6 11:23:38 loadavg thread died, restarting. (exit code=2)
> 12/6 11:23:43 no loadavg samples this minute, maybe thread died???
> 12/6 11:23:48 loadavg thread died, restarting. (exit code=2)
> 12/6 11:23:53 no loadavg samples this minute, maybe thread died???
> 12/6 11:23:58 loadavg thread died, restarting. (exit code=2)
> 12/6 11:24:03 no loadavg samples this minute, maybe thread died???
> 12/6 11:24:03 ProcFamily::currentfamily: ERROR: family_size is 0
> 12/6 11:24:03 WARNING: No processes found in starter's family
> 12/6 11:24:08 loadavg thread died, restarting. (exit code=2)
> 12/6 11:24:13 no loadavg samples this minute, maybe thread died???
> 12/6 11:24:18 loadavg thread died, restarting. (exit code=2)
> 12/6 11:24:23 no loadavg samples this minute, maybe thread died???
> 12/6 11:24:28 loadavg thread died, restarting. (exit code=2)
> 12/6 11:24:33 no loadavg samples this minute, maybe thread died???
> 
> Has anyone had this problem, or does anyone know what the source of
> the problem could be? It seems specific to my machine and not others
> in our pool.
> 
> Some supplemental information. My machine sometimes also allows more
> than one job to be scheduled at the same time, so I end up with many
> sub-directories under condor/execute. I've had up to 65 directories
> created, and many of these were the same job running at the same time.
> The output from the StarterLog file below shows the same job being
> started twice within 30 seconds, with both copies running at the same
> time. This is not supposed to happen.
> 
> 11/29 00:21:45 ******************************************************
> 11/29 00:21:45 Using config file: C:\Condor\condor_config
> 11/29 00:21:45 Using local config files: C:\Condor/condor_config.local
> 11/29 00:21:45 DaemonCore: Command Socket at <10.10.23.114:4619>
> 11/29 00:21:45 Setting resource limits not implemented!
> 11/29 00:21:45 Starter communicating with condor_shadow <10.10.23.142:3419>
> 11/29 00:21:45 Submitting machine is "iitm50ws0380.iit-iti.priv"
> 11/29 00:21:46 File transfer completed successfully.
> 11/29 00:21:47 Starting a VANILLA universe job with ID: 39.2136
> 11/29 00:21:47 IWD: C:\Condor/execute\dir_3472
> 11/29 00:21:47 Output file: C:\Condor/execute\dir_3472\izanya.out_2137
> 11/29 00:21:47 Renice expr "10" evaluated to 10
> 11/29 00:21:47 About to exec C:\Condor\execute\dir_3472\condor_exec.exe izanya.cfg_2137
> 11/29 00:21:47 Create_Process succeeded, pid=3532
> 11/29 00:22:06 ******************************************************
> 11/29 00:22:06 ** condor_starter (CONDOR_STARTER) STARTING UP
> 11/29 00:22:06 ** C:\Condor\bin\condor_starter.exe
> 11/29 00:22:06 ** $CondorVersion: 6.6.10 Jun 22 2005 $
> 11/29 00:22:06 ** $CondorPlatform: INTEL-WINNT50 $
> 11/29 00:22:06 ** PID = 844
> 11/29 00:22:06 ******************************************************
> 11/29 00:22:06 Using config file: C:\Condor\condor_config
> 11/29 00:22:06 Using local config files: C:\Condor/condor_config.local
> 11/29 00:22:07 DaemonCore: Command Socket at <10.10.23.114:4627>
> 11/29 00:22:07 Setting resource limits not implemented!
> 11/29 00:22:07 Starter communicating with condor_shadow <10.10.23.142:3437>
> 11/29 00:22:07 Submitting machine is "iitm50ws0380.iit-iti.priv"
> 11/29 00:22:07 File transfer completed successfully.
> 11/29 00:22:08 Starting a VANILLA universe job with ID: 39.2136
> 11/29 00:22:08 IWD: C:\Condor/execute\dir_844
> 11/29 00:22:08 Output file: C:\Condor/execute\dir_844\izanya.out_2137
> 11/29 00:22:08 Renice expr "10" evaluated to 10
> 11/29 00:22:08 About to exec C:\Condor\execute\dir_844\condor_exec.exe izanya.cfg_2137
> 11/29 00:22:08 Create_Process succeeded, pid=3516
> 
> A second bit of information that may be relevant: some time ago, when
> I was cleaning up user accounts, I may have deleted the
> condor_reuse_vm1 account. It gets recreated, but Windows warns about
> deleting an account and says that even if you recreate the account
> later with the same name it is 'not the same account'. I don't know
> all the ramifications of this, but it may be relevant.
>
> I did try something a bit bizarre: I looked for the string
> 'condor_reuse' in all of the Condor executable files and used a hex
> editor to change it to 'condor_xxuse'. Then I started Condor and, sure
> enough, it created the user account condor_xxuse_vm1 and used it when
> it ran jobs, but I had the same problems. I have no way of knowing
> whether this was enough to 'fix' the user account problem, but it did
> appear to create the newly named user account.
> 
> I've installed and uninstalled Condor several times to try to get rid
> of this unusual problem.
> 