[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Intermittent Condor startd crashes



I forgot one other bit of information:

The startd will crash periodically on the machine, even with no jobs
assigned to run on it.

Here's the StartLog output when that happens:

8/11 19:07:16 ERROR "Close_Pipe error" at line 1804 in file
..\src\condor_daemon_core.V6\daemon_core.cpp
8/11 19:07:16 Deleting Cronmgr
8/11 19:07:16 Cron: Deleting job 'update'
(d:/arc/scripts/hooks/update_hooks_and_modules.bat)
8/11 19:07:16 All resources are free, exiting.
*** End of file StartLog

*** Last entry in core file core.STARTD.WIN32

====================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  7D4D8820 01:00008820 C:\WINDOWS\syswow64\kernel32.dll

Registers:
EAX:01658F5C
EBX:00000000
ECX:01658F5C
EDX:00000000
ESI:016533C0
EDI:016533C0
CS:EIP:0023:7D4D8820
SS:ESP:002B:018311F0  EBP:018313A0
DS:002B  ES:002B  FS:0053  GS:002B
Flags:00010206

Call stack:
Address   Frame
7D4D8820  018313A0  InterlockedExchange+8

*** End of file core.STARTD.WIN32

It just ran a Startd cron job for me. It runs that Startd cron job every
hour. But it doesn't crash every hour. The cron job runs robocopy to
sync some stuff on the box that our jobs use. In a 24 hour period I saw
4 crashes. In all 24 cases when the cron job was run nothing was sync'ed
by robocopy -- all 24 jobs did the exact same thing, output the same
messages, etc.

- Ian

> -----Original Message-----
> From: Ian Chesal
> Sent: Tuesday, August 11, 2009 3:46 PM
> To: 'Condor-Users Mail List'
> Subject: Intermittent Condor startd crashes
>
> This is on Windows XP 64-bit. I have a feeling the problem is
> my machine image, and not Condor. But I was wondering if any
> of the Condor developers have seen this problem. Insight on
> what specifically might be wrong on the machine?
>
> The setup is the same every time: I submit a slew of sleep(1)
> jobs that target this host. The host is pulling jobs using
> hook scripts. It's okay for about 1200 jobs or so and then
> the startd crashes. The crash is always the same: the startd
> exits because of a fatal exception. The email snippet of the
> StartLog always has an error line like this in it:
>
> 8/11 20:39:09 Close_Pipe on invalid pipe end: -1
> 8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
> ..\src\condor_daemon_core.V6\daemon_core.cpp
>
> Now the really crummy part: on top of running with hooks, I'm running:
>
> $CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $
> $CondorPlatform: INTEL-WINNT50 $
>
> So this might be a problem with this pre-release version.
>
> - Ian
>
>
> "d:/arc/condor/bin/condor_startd.exe" on
> "uk-swcf5173.altera.com" exited with status 4.
> Condor will automatically restart this process in 10 seconds.
>
> *** Last 100 line(s) of file d:/arc/condor/log/StartLog:
> 8/11 20:39:07 condor_read(): recv() returned -1, errno =
> 10054, assuming failure reading 5 bytes from <127.0.0.1:4678>.
> 8/11 20:39:07 IO: Failed to read packet header
> 8/11 20:39:07 Return from Handler <receiveJobClassAdUpdate>
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4818>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:07 DaemonCore: pid 3064 exited with status 0,
> invoking reaper 4 <reaper>
> 8/11 20:39:07 Starter pid 3064 exited with status 0
> 8/11 20:39:07 slot3: State change: starter exited
> 8/11 20:39:07 slot3: Changing activity: Busy -> Idle
> 8/11 20:39:07 Executable is a batch script, so executing
> C:\WINDOWS\system32\cmd.exe  /Q /C
> "d:/arc/scripts/hooks/arc_job_fetch.bat"
> 8/11 20:39:07 DaemonCore: return from reaper for pid 3064
> 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> for Pipe end=65540 <DC stderr pipe>
> 8/11 20:39:07 Return from pipe Handler
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4821>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
> for Pipe end=65538 <DC stdout pipe>
> 8/11 20:39:07 Return from pipe Handler
> 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> for Pipe end=65540 <DC stderr pipe>
> 8/11 20:39:07 Return from pipe Handler
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4824>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4826>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:07 DaemonCore: pid 2544 exited with status 0,
> invoking reaper 1 <HookClientMgr Output Reaper>
> 8/11 20:39:07 Warning, hook
> d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2544) printed to
> stderr: DEBUG: Slot = 1
> DEBUG: State = "Claimed"
> DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
> (((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
> && ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
> ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
> ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
> ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
> < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
> - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
> State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
> 0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
> || ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
> TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
> && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
> (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
> || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
> <= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
> 1439)))))) && (((LoadAvg - CondorLoa!
>  dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
> "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
> TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
> DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
> Error: Slot state is unknown or not unclaimed
> 8/11 20:39:07 slot1: Changing state and activity:
> Claimed/Idle -> Preempting/Vacating
> 8/11 20:39:07 slot1: State change: No preempting claim,
> returning to owner
> 8/11 20:39:07 slot1: Changing state and activity:
> Preempting/Vacating -> Owner/Idle
> 8/11 20:39:07 slot1: State change: IS_OWNER is false
> 8/11 20:39:07 slot1: Changing state: Owner -> Unclaimed
> 8/11 20:39:07 DaemonCore: return from reaper for pid 2544
> 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> for Pipe end=65542 <DC stderr pipe>
> 8/11 20:39:07 Return from pipe Handler
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4827>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.016s)
> 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> for Pipe end=65542 <DC stderr pipe>
> 8/11 20:39:07 Return from pipe Handler
> 8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
> for Pipe end=65539 <DC stdout pipe>
> 8/11 20:39:07 Return from pipe Handler
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4828>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4829>, access level READ
> 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:07 DaemonCore: pid 2116 exited with status 0,
> invoking reaper 1 <HookClientMgr Output Reaper>
> 8/11 20:39:07 Warning, hook
> d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2116) printed to
> stderr: DEBUG: Slot = 3
> DEBUG: State = "Claimed"
> DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
> (((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
> && ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
> ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
> ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
> ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
> < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
> - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
> State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
> 0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
> || ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
> TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
> && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
> (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
> || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
> <= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
> 1439)))))) && (((LoadAvg - CondorLoa!
>  dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
> "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
> TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
> DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
> Error: Slot state is unknown or not unclaimed
> 8/11 20:39:07 slot3: Changing state and activity:
> Claimed/Idle -> Preempting/Vacating
> 8/11 20:39:07 slot3: State change: No preempting claim,
> returning to owner
> 8/11 20:39:07 slot3: Changing state and activity:
> Preempting/Vacating -> Owner/Idle
> 8/11 20:39:07 slot3: State change: IS_OWNER is false
> 8/11 20:39:07 slot3: Changing state: Owner -> Unclaimed
> 8/11 20:39:07 DaemonCore: return from reaper for pid 2116
> 8/11 20:39:08 Calling Handler <receiveJobClassAdUpdate>
> 8/11 20:39:08 condor_read(): recv() returned -1, errno =
> 10054, assuming failure reading 5 bytes from <127.0.0.1:4719>.
> 8/11 20:39:08 IO: Failed to read packet header
> 8/11 20:39:08 Return from Handler <receiveJobClassAdUpdate>
> 8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4839>, access level READ
> 8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:08 DaemonCore: pid 2320 exited with status 0,
> invoking reaper 4 <reaper>
> 8/11 20:39:08 Starter pid 2320 exited with status 0
> 8/11 20:39:08 slot2: State change: starter exited
> 8/11 20:39:08 slot2: Changing activity: Busy -> Idle
> 8/11 20:39:08 Executable is a batch script, so executing
> C:\WINDOWS\system32\cmd.exe  /Q /C
> "d:/arc/scripts/hooks/arc_job_fetch.bat"
> 8/11 20:39:08 DaemonCore: return from reaper for pid 2320
> 8/11 20:39:08 Calling pipe Handler <DC stderr pipe handler>
> for Pipe end=65540 <DC stderr pipe>
> 8/11 20:39:08 Return from pipe Handler
> 8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4842>, access level READ
> 8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:09 Calling pipe Handler <DC stderr pipe handler>
> for Pipe end=65540 <DC stderr pipe>
> 8/11 20:39:09 Return from pipe Handler
> 8/11 20:39:09 Received UDP command 60011 (DC_NOP) from
> <137.57.183.22:4843>, access level READ
> 8/11 20:39:09 Calling HandleReq <handle_nop()> (0)
> 8/11 20:39:09 Return from HandleReq <handle_nop()> (handler:
> 0.000s, sec: 0.000s)
> 8/11 20:39:09 Close_Pipe on invalid pipe end: -1
> 8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
> ..\src\condor_daemon_core.V6\daemon_core.cpp
> 8/11 20:39:09 slot8: Changing state and activity:
> Claimed/Busy -> Preempting/Killing
> 8/11 20:39:09 slot5: Changing state and activity:
> Claimed/Busy -> Preempting/Killing
> 8/11 20:39:09 slot2: Changing state and activity:
> Claimed/Idle -> Preempting/Killing
> 8/11 20:39:09 slot2: State change: No preempting claim,
> returning to owner
> 8/11 20:39:09 slot2: Changing state and activity:
> Preempting/Killing -> Owner/Idle
> 8/11 20:39:09 slot2: State change: IS_OWNER is false
> 8/11 20:39:09 slot2: Changing state: Owner -> Unclaimed
> 8/11 20:39:09 slot6: Changing state and activity:
> Claimed/Busy -> Preempting/Killing
> 8/11 20:39:09 startd exiting because of fatal exception.
> *** End of file StartLog

Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.