
[Condor-users] Schedd and Startd crashes



Hello
 
Our Condor configuration includes an email address for reporting problems. About once a month, a couple of our machines (different ones each time) send consecutive emails about schedd and startd crashes. The content of the emails is similar from one schedd crash to the next, and the same applies to the startd crashes. Any pointers on how to debug this? All PCs are running Windows XP.
 
Thanks
 
 
Schedd crashes - email chain #1
This is an automated email from the Condor system on machine "nbs60.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_schedd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 10 seconds.
 
*** Last 20 line(s) of file SchedLog:
5/7 00:10:42 (pid:956) -------- Begin starting jobs --------
5/7 00:10:42 (pid:956) -------- Done starting jobs --------
5/7 00:14:15 (pid:956) Getting monitoring info for pid 956
5/7 00:15:42 (pid:956) JobsRunning = 0
5/7 00:15:43 (pid:956) JobsIdle = 0
5/7 00:15:43 (pid:956) JobsHeld = 0
5/7 00:15:43 (pid:956) JobsRemoved = 0
5/7 00:15:43 (pid:956) LocalUniverseJobsRunning = 0
5/7 00:15:43 (pid:956) LocalUniverseJobsIdle = 0
5/7 00:15:43 (pid:956) SchedUniverseJobsRunning = 0
5/7 00:15:43 (pid:956) SchedUniverseJobsIdle = 0
5/7 00:15:43 (pid:956) N_Owners = 0
5/7 00:15:43 (pid:956) MaxJobsRunning = 200
5/7 00:15:43 (pid:956) Trying to update collector <172.26.21.99:9618>
5/7 00:15:43 (pid:956) Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
 
5/7 00:15:43 (pid:956) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
5/7 00:15:43 (pid:956) ============ Begin clean_shadow_recs =============
5/7 00:15:43 (pid:956) ============ End clean_shadow_recs =============
5/7 00:15:43 (pid:956) -------- Begin starting jobs --------
5/7 00:15:43 (pid:956) -------- Done starting jobs --------
*** End of file SchedLog
 
 
 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: system_canvas_grid@xxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor
 
 
 
 
Schedd crashes - email chain #2
This is an automated email from the Condor system on machine "nbs60.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_schedd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 11 seconds.
 
*** Last 20 line(s) of file SchedLog:
5/7 00:18:45 (pid:4012) JobsHeld = 0
5/7 00:18:45 (pid:4012) JobsRemoved = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) N_Owners = 0
5/7 00:18:45 (pid:4012) MaxJobsRunning = 200
5/7 00:18:45 (pid:4012) Trying to update collector <172.26.21.99:9618>
5/7 00:18:45 (pid:4012) Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
 
5/7 00:18:45 (pid:4012) File descriptor limits: max 2000, safe 1600
5/7 00:18:45 (pid:4012) Ignoring file descriptor safety limit (1600), because only 4 sockets are registered (fd is 1776)
 
5/7 00:18:45 (pid:4012) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
5/7 00:18:45 (pid:4012) ============ Begin clean_shadow_recs =============
5/7 00:18:45 (pid:4012) ============ End clean_shadow_recs =============
5/7 00:18:45 (pid:4012) Getting monitoring info for pid 4012
5/7 00:18:45 (pid:4012) DaemonCore: in SendAliveToParent()
5/7 00:18:45 (pid:4012) DaemonCore: attempting to connect to '<172.26.21.23:1916>'
5/7 00:18:55 (pid:4012) -------- Begin starting jobs --------
5/7 00:18:56 (pid:4012) -------- Done starting jobs --------
*** End of file SchedLog
 
 
 
 
 
 
 
Schedd crashes - email chain #3
This is an automated email from the Condor system on machine "nbs60.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_schedd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 13 seconds.
 
*** Last 20 line(s) of file SchedLog:
5/7 00:18:45 (pid:4012) JobsHeld = 0
5/7 00:18:45 (pid:4012) JobsRemoved = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) N_Owners = 0
5/7 00:18:45 (pid:4012) MaxJobsRunning = 200
5/7 00:18:45 (pid:4012) Trying to update collector <172.26.21.99:9618>
5/7 00:18:45 (pid:4012) Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
 
5/7 00:18:45 (pid:4012) File descriptor limits: max 2000, safe 1600
5/7 00:18:45 (pid:4012) Ignoring file descriptor safety limit (1600), because only 4 sockets are registered (fd is 1776)
 
5/7 00:18:45 (pid:4012) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
5/7 00:18:45 (pid:4012) ============ Begin clean_shadow_recs =============
5/7 00:18:45 (pid:4012) ============ End clean_shadow_recs =============
5/7 00:18:45 (pid:4012) Getting monitoring info for pid 4012
5/7 00:18:45 (pid:4012) DaemonCore: in SendAliveToParent()
5/7 00:18:45 (pid:4012) DaemonCore: attempting to connect to '<172.26.21.23:1916>'
5/7 00:18:55 (pid:4012) -------- Begin starting jobs --------
5/7 00:18:56 (pid:4012) -------- Done starting jobs --------
*** End of file SchedLog
 
 
 
 
 
 
 

Schedd crashes - email chain #4 (this one reports the startd on the same machine)
This is an automated email from the Condor system on machine "nbs60.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_startd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 10 seconds.
 
*** Last 20 line(s) of file StartLog:
5/7 01:41:59 no loadavg samples this minute, maybe thread died???
5/7 01:42:05 loadavg thread died, restarting. (exit code=2)
5/7 01:42:10 no loadavg samples this minute, maybe thread died???
5/7 01:42:16 loadavg thread died, restarting. (exit code=2)
5/7 01:42:21 no loadavg samples this minute, maybe thread died???
5/7 01:42:27 loadavg thread died, restarting. (exit code=2)
5/7 01:42:32 no loadavg samples this minute, maybe thread died???
5/7 01:42:37 loadavg thread died, restarting. (exit code=2)
5/7 01:42:43 no loadavg samples this minute, maybe thread died???
5/7 01:42:48 loadavg thread died, restarting. (exit code=2)
5/7 01:42:54 no loadavg samples this minute, maybe thread died???
5/7 01:42:59 loadavg thread died, restarting. (exit code=2)
5/7 01:43:05 no loadavg samples this minute, maybe thread died???
5/7 01:43:10 loadavg thread died, restarting. (exit code=2)
5/7 01:43:16 no loadavg samples this minute, maybe thread died???
5/7 01:43:21 loadavg thread died, restarting. (exit code=2)
5/7 01:43:27 no loadavg samples this minute, maybe thread died???
5/7 01:43:32 loadavg thread died, restarting. (exit code=2)
5/7 01:43:37 no loadavg samples this minute, maybe thread died???
5/7 01:43:43 loadavg thread died, restarting. (exit code=2)
*** End of file StartLog
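
The StartLog tail above shows the load-average thread being restarted over and over. As a rough way to quantify that from a saved log, here is a minimal Python sketch that measures the gap between successive "loadavg thread died" lines. The function names and the placeholder year are assumptions made for illustration (the log timestamps carry no year); the sample lines are copied from the log above.

```python
from datetime import datetime

# A few lines copied from the StartLog tail above.
SAMPLE = """\
5/7 01:41:59 no loadavg samples this minute, maybe thread died???
5/7 01:42:05 loadavg thread died, restarting. (exit code=2)
5/7 01:42:10 no loadavg samples this minute, maybe thread died???
5/7 01:42:16 loadavg thread died, restarting. (exit code=2)
5/7 01:42:21 no loadavg samples this minute, maybe thread died???
5/7 01:42:27 loadavg thread died, restarting. (exit code=2)
"""


def restart_times(log_text, year=2000):
    """Return a timestamp for every 'loadavg thread died' line.

    The log has no year, so `year` is only a placeholder (assumption).
    """
    times = []
    for line in log_text.splitlines():
        if "loadavg thread died" not in line:
            continue
        date_part, time_part = line.split()[:2]
        stamp = f"{year}/{date_part} {time_part}"
        times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    return times


def restart_gaps(log_text):
    """Return the seconds between consecutive thread restarts."""
    times = restart_times(log_text)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(restart_gaps(SAMPLE))
```

On the lines quoted above the restarts are 10 to 11 seconds apart, so the thread appears to die again almost immediately after each restart.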
 
 
 
 
 
 
 
Startd crashes - email chain #1
This is an automated email from the Condor system on machine "nbs50.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_startd.exe" on "nbs50.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 10 seconds.
 
*** Last 20 line(s) of file StartLog:
5/19 19:32:36 Trying to update collector <172.26.21.99:9618>
5/19 19:32:36 Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 19:32:36 vm2: Sent update to 1 collector(s)
5/19 19:32:42 DaemonCore: Command received via UDP from host <172.26.21.75:1619>
5/19 19:32:42 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
5/19 19:32:42 DaemonCore: Command received via UDP from host <172.26.21.75:1638>
5/19 19:32:42 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
5/19 19:34:28 DaemonCore: Command received via UDP from host <172.26.21.13:2342>
5/19 19:34:28 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
 
5/19 19:36:16 Getting monitoring info for pid 1536
5/19 19:37:31 Swap space: 2724296
5/19 19:37:31 Looking up RESERVED_DISK parameter
5/19 19:37:31 Reserving 5120 kbytes for file system
5/19 19:37:31 Disk space: 132614452
5/19 19:37:35 Trying to update collector <172.26.21.99:9618>
5/19 19:37:35 Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 19:37:35 vm1: Sent update to 1 collector(s)
5/19 19:37:36 Trying to update collector <172.26.21.99:9618>
5/19 19:37:36 Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 19:37:36 vm2: Sent update to 1 collector(s)
*** End of file StartLog
 
 
 
 
 
 
 
Startd crashes - email chain #2
This is an automated email from the Condor system on machine "nbs50.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_startd.exe" on "nbs50.bbnet.ad" died due to exception ACCESS_VIOLATION.
Condor will automatically restart this process in 11 seconds.
 
*** Last 20 line(s) of file StartLog:
5/19 20:03:01 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
5/19 20:03:01 ** D:\condor\bin\condor_startd.exe
5/19 20:03:01 ** $CondorVersion: 6.8.4 Feb  1 2007 $
5/19 20:03:01 ** $CondorPlatform: INTEL-WINNT50 $
5/19 20:03:01 ** PID = 4084
5/19 20:03:01 ** Log last touched 5/19 20:02:47
5/19 20:03:01 ******************************************************
5/19 20:03:01 Using config source: D:\Condor\condor_config
5/19 20:03:01 Using local config sources:
5/19 20:03:01    D:\condor/condor_config.local
5/19 20:03:01 DaemonCore: Command Socket at <172.26.21.13:2574>
5/19 20:03:01 Memory: Detected 2038 megs RAM
5/19 20:03:01 my_popen: CreateProcess failed
5/19 20:03:01 Failed to execute D:\condor/bin/condor_starter.pvm.exe, ignoring
5/19 20:03:01 my_popen: CreateProcess failed
5/19 20:03:01 Failed to execute D:\condor/bin/condor_starter.std.exe, ignoring
5/19 20:03:01 Will use UDP to update collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 20:03:01 command_x_event() called.
5/19 20:03:01 Attempting to remove D:\condor\execute\dir_3172 as SuperUser (system)
5/19 20:03:01 my_popen: CreateProcess failed
*** End of file StartLog
 
*** Last entry in core file core.STARTD.WIN32
 
========================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  0048D517 01:0008C517 D:\condor\bin\condor_startd.exe
 
Registers:
EAX:FFFFFFFF
EBX:00000001
ECX:004E4898
EDX:00C700C0
ESI:00000000
EDI:FFFFFFFF
CS:EIP:001B:0048D517
SS:ESP:0023:0012F224  EBP:0012F240
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010286
 
Call stack:
Address   Frame     Logical addr  Module
0048D517  0012F240  0001:0008C517 D:\condor\bin\condor_startd.exe
0042C3D4  0012F2B0  0001:0002B3D4 D:\condor\bin\condor_startd.exe
0042D93B  0012F348  0001:0002C93B D:\condor\bin\condor_startd.exe
0042D91B  0012F390  0001:0002C91B D:\condor\bin\condor_startd.exe
0042D8A6  0012F3B4  0001:0002C8A6 D:\condor\bin\condor_startd.exe
00417138  0012FDEC  0001:00016138 D:\condor\bin\condor_startd.exe
*** End of file core.STARTD.WIN32
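
A note for reading the call stack above: the "Logical addr" column looks like a section:offset pair, and subtracting the offset from the absolute address gives the same base for every frame. The Python sketch below only checks that arithmetic on the values from the dump. The helper name is hypothetical, and whether a linker map file or symbols for this 6.8.4 build are available to look the offsets up in is an assumption.

```python
# (absolute address, logical address) pairs copied from the call stack above.
FRAMES = [
    ("0048D517", "0001:0008C517"),
    ("0042C3D4", "0001:0002B3D4"),
    ("0042D93B", "0001:0002C93B"),
    ("0042D91B", "0001:0002C91B"),
    ("0042D8A6", "0001:0002C8A6"),
    ("00417138", "0001:00016138"),
]


def section_base(address_hex, logical):
    """Return (section number, base address) implied by one stack frame."""
    section_hex, offset_hex = logical.split(":")
    base = int(address_hex, 16) - int(offset_hex, 16)
    return int(section_hex, 16), base


# Every frame should imply the same section and the same base address.
bases = {section_base(addr, logical) for addr, logical in FRAMES}

if __name__ == "__main__":
    for section, base in sorted(bases):
        print(f"section {section:04X} starts at {base:08X}")
```

All six frames fall in section 0001 with a base of 00401000, so the faulting instruction sits at offset 0008C517 within that section of condor_startd.exe.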
 
 
 
 

 
 
 
Startd crashes - email chain #3
This is an automated email from the Condor system on machine "nbs50.bbnet.ad".  Do not reply.
 
"D:\condor/bin/condor_startd.exe" on "nbs50.bbnet.ad" died due to exception ACCESS_VIOLATION.
Condor will automatically restart this process in 13 seconds.
 
*** Last 20 line(s) of file StartLog:
5/19 20:03:16 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
5/19 20:03:16 ** D:\condor\bin\condor_startd.exe
5/19 20:03:16 ** $CondorVersion: 6.8.4 Feb  1 2007 $
5/19 20:03:16 ** $CondorPlatform: INTEL-WINNT50 $
5/19 20:03:16 ** PID = 3264
5/19 20:03:16 ** Log last touched 5/19 20:03:01
5/19 20:03:16 ******************************************************
5/19 20:03:16 Using config source: D:\Condor\condor_config
5/19 20:03:16 Using local config sources:
5/19 20:03:16    D:\condor/condor_config.local
5/19 20:03:16 DaemonCore: Command Socket at <172.26.21.13:2581>
5/19 20:03:16 Memory: Detected 2038 megs RAM
5/19 20:03:16 my_popen: CreateProcess failed
5/19 20:03:16 Failed to execute D:\condor/bin/condor_starter.pvm.exe, ignoring
5/19 20:03:16 my_popen: CreateProcess failed
5/19 20:03:16 Failed to execute D:\condor/bin/condor_starter.std.exe, ignoring
5/19 20:03:16 Will use UDP to update collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 20:03:16 command_x_event() called.
5/19 20:03:16 Attempting to remove D:\condor\execute\dir_3172 as SuperUser (system)
5/19 20:03:16 my_popen: CreateProcess failed
*** End of file StartLog
 
*** Last entry in core file core.STARTD.WIN32
 
========================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  0048D517 01:0008C517 D:\condor\bin\condor_startd.exe
 
Registers:
EAX:FFFFFFFF
EBX:00000001
ECX:004E4898
EDX:00C700C0
ESI:00000000
EDI:FFFFFFFF
CS:EIP:001B:0048D517
SS:ESP:0023:0012F224  EBP:0012F240
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010286
 
Call stack:
Address   Frame     Logical addr  Module
0048D517  0012F240  0001:0008C517 D:\condor\bin\condor_startd.exe
0042C3D4  0012F2B0  0001:0002B3D4 D:\condor\bin\condor_startd.exe
0042D93B  0012F348  0001:0002C93B D:\condor\bin\condor_startd.exe
0042D91B  0012F390  0001:0002C91B D:\condor\bin\condor_startd.exe
0042D8A6  0012F3B4  0001:0002C8A6 D:\condor\bin\condor_startd.exe
00417138  0012FDEC  0001:00016138 D:\condor\bin\condor_startd.exe
*** End of file core.STARTD.WIN32
 
 
 
 

 

 
 
Best Regards,
Rick
 
