[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Intermittent Condor startd crashes



I have an update. After looking at the source code it appears the crash
is indeed happening as part of the pipe cleanup process for the spawned
pipes use to run the hooks.

I have a theory: because I'm only seeing this on my 8-way boxes I'm
wondering if there's just a big synchornization issue with this many
pipes all writing to stderr on Windows. The Close_Pipe routine is trying
to close a pipe or two that might still be writing or flushing its
output.

I'm testing the theory now. I was using STDERR from my hook scripts to
print debug information for my hook routine into the StartLog file. I'm
no longer doing that. Instead I've got my hook script creating its own
log file, one per slot, in the log directory and writing the debug
information there.

If I can keep a machine up for a few days without a random startd crash
I think I'll close the book on this one.

- Ian

> -----Original Message-----
> From: Ian Chesal
> Sent: Tuesday, August 11, 2009 3:49 PM
> To: Ian Chesal; 'Condor-Users Mail List'
> Subject: RE: Intermittent Condor startd crashes
>
> I forgot one other bit of information:
>
> The startd will crash periodically on the machine, even with
> no jobs assigned to run on it.
>
> Here's the StartLog output when that happens:
>
> 8/11 19:07:16 ERROR "Close_Pipe error" at line 1804 in file
> ..\src\condor_daemon_core.V6\daemon_core.cpp
> 8/11 19:07:16 Deleting Cronmgr
> 8/11 19:07:16 Cron: Deleting job 'update'
> (d:/arc/scripts/hooks/update_hooks_and_modules.bat)
> 8/11 19:07:16 All resources are free, exiting.
> *** End of file StartLog
>
> *** Last entry in core file core.STARTD.WIN32
>
> ====================
> Exception code: C0000005 ACCESS_VIOLATION
> Fault address:  7D4D8820 01:00008820 C:\WINDOWS\syswow64\kernel32.dll
>
> Registers:
> EAX:01658F5C
> EBX:00000000
> ECX:01658F5C
> EDX:00000000
> ESI:016533C0
> EDI:016533C0
> CS:EIP:0023:7D4D8820
> SS:ESP:002B:018311F0  EBP:018313A0
> DS:002B  ES:002B  FS:0053  GS:002B
> Flags:00010206
>
> Call stack:
> Address   Frame
> 7D4D8820  018313A0  InterlockedExchange+8
>
> *** End of file core.STARTD.WIN32
>
> It just ran a Startd cron job for me. It runs that Startd
> cron job every hour. But it doesn't crash every hour. The
> cron job runs robocopy to sync some stuff on the box that our
> jobs use. In a 24 hour period I saw 4 crashes. In all 24
> cases when the cron job was run nothing was sync'ed by
> robocopy -- all 24 jobs did the exact same thing, output the
> same messages, etc.
>
> - Ian
>
> > -----Original Message-----
> > From: Ian Chesal
> > Sent: Tuesday, August 11, 2009 3:46 PM
> > To: 'Condor-Users Mail List'
> > Subject: Intermittent Condor startd crashes
> >
> > This is on Windows XP 64-bit. I have a feeling the problem is
> > my machine image, and not Condor. But I was wondering if any
> > of the Condor developers have seen this problem. Insight on
> > what specifically might be wrong on the machine?
> >
> > The setup is the same every time: I submit a slew of sleep(1)
> > jobs that target this host. The host is pulling jobs using
> > hook scripts. It's okay for about 1200 jobs or so and then
> > the startd crashes. The crash is always the same: the startd
> > exits because of a fatal exception. The email snippet of the
> > StartLog always has an error line like this in it:
> >
> > 8/11 20:39:09 Close_Pipe on invalid pipe end: -1
> > 8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
> > ..\src\condor_daemon_core.V6\daemon_core.cpp
> >
> > Now the really crummy part: on top of running with hooks,
> I'm running:
> >
> > $CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $
> > $CondorPlatform: INTEL-WINNT50 $
> >
> > So this might be a problem with this pre-release version.
> >
> > - Ian
> >
> >
> > "d:/arc/condor/bin/condor_startd.exe" on
> > "uk-swcf5173.altera.com" exited with status 4.
> > Condor will automatically restart this process in 10 seconds.
> >
> > *** Last 100 line(s) of file d:/arc/condor/log/StartLog:
> > 8/11 20:39:07 condor_read(): recv() returned -1, errno =
> > 10054, assuming failure reading 5 bytes from <127.0.0.1:4678>.
> > 8/11 20:39:07 IO: Failed to read packet header
> > 8/11 20:39:07 Return from Handler <receiveJobClassAdUpdate>
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4818>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:07 DaemonCore: pid 3064 exited with status 0,
> > invoking reaper 4 <reaper>
> > 8/11 20:39:07 Starter pid 3064 exited with status 0
> > 8/11 20:39:07 slot3: State change: starter exited
> > 8/11 20:39:07 slot3: Changing activity: Busy -> Idle
> > 8/11 20:39:07 Executable is a batch script, so executing
> > C:\WINDOWS\system32\cmd.exe  /Q /C
> > "d:/arc/scripts/hooks/arc_job_fetch.bat"
> > 8/11 20:39:07 DaemonCore: return from reaper for pid 3064
> > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > for Pipe end=65540 <DC stderr pipe>
> > 8/11 20:39:07 Return from pipe Handler
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4821>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
> > for Pipe end=65538 <DC stdout pipe>
> > 8/11 20:39:07 Return from pipe Handler
> > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > for Pipe end=65540 <DC stderr pipe>
> > 8/11 20:39:07 Return from pipe Handler
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4824>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4826>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:07 DaemonCore: pid 2544 exited with status 0,
> > invoking reaper 1 <HookClientMgr Output Reaper>
> > 8/11 20:39:07 Warning, hook
> > d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2544) printed to
> > stderr: DEBUG: Slot = 1
> > DEBUG: State = "Claimed"
> > DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
> > (((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
> > && ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
> > ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
> > ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
> > ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
> > < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
> > - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
> > State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
> > 0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
> > || ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
> > TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
> > && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
> > (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
> > || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
> > <= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
> > 1439)))))) && (((LoadAvg - CondorLoa!
> >  dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
> > "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
> > TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
> > DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
> > Error: Slot state is unknown or not unclaimed
> > 8/11 20:39:07 slot1: Changing state and activity:
> > Claimed/Idle -> Preempting/Vacating
> > 8/11 20:39:07 slot1: State change: No preempting claim,
> > returning to owner
> > 8/11 20:39:07 slot1: Changing state and activity:
> > Preempting/Vacating -> Owner/Idle
> > 8/11 20:39:07 slot1: State change: IS_OWNER is false
> > 8/11 20:39:07 slot1: Changing state: Owner -> Unclaimed
> > 8/11 20:39:07 DaemonCore: return from reaper for pid 2544
> > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > for Pipe end=65542 <DC stderr pipe>
> > 8/11 20:39:07 Return from pipe Handler
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4827>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.016s)
> > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > for Pipe end=65542 <DC stderr pipe>
> > 8/11 20:39:07 Return from pipe Handler
> > 8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
> > for Pipe end=65539 <DC stdout pipe>
> > 8/11 20:39:07 Return from pipe Handler
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4828>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4829>, access level READ
> > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:07 DaemonCore: pid 2116 exited with status 0,
> > invoking reaper 1 <HookClientMgr Output Reaper>
> > 8/11 20:39:07 Warning, hook
> > d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2116) printed to
> > stderr: DEBUG: Slot = 3
> > DEBUG: State = "Claimed"
> > DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
> > (((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
> > && ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
> > ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
> > ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
> > ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
> > < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
> > - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
> > State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
> > 0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
> > || ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
> > TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
> > && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
> > (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
> > || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
> > <= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
> > 1439)))))) && (((LoadAvg - CondorLoa!
> >  dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
> > "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
> > TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
> > DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
> > Error: Slot state is unknown or not unclaimed
> > 8/11 20:39:07 slot3: Changing state and activity:
> > Claimed/Idle -> Preempting/Vacating
> > 8/11 20:39:07 slot3: State change: No preempting claim,
> > returning to owner
> > 8/11 20:39:07 slot3: Changing state and activity:
> > Preempting/Vacating -> Owner/Idle
> > 8/11 20:39:07 slot3: State change: IS_OWNER is false
> > 8/11 20:39:07 slot3: Changing state: Owner -> Unclaimed
> > 8/11 20:39:07 DaemonCore: return from reaper for pid 2116
> > 8/11 20:39:08 Calling Handler <receiveJobClassAdUpdate>
> > 8/11 20:39:08 condor_read(): recv() returned -1, errno =
> > 10054, assuming failure reading 5 bytes from <127.0.0.1:4719>.
> > 8/11 20:39:08 IO: Failed to read packet header
> > 8/11 20:39:08 Return from Handler <receiveJobClassAdUpdate>
> > 8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4839>, access level READ
> > 8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:08 DaemonCore: pid 2320 exited with status 0,
> > invoking reaper 4 <reaper>
> > 8/11 20:39:08 Starter pid 2320 exited with status 0
> > 8/11 20:39:08 slot2: State change: starter exited
> > 8/11 20:39:08 slot2: Changing activity: Busy -> Idle
> > 8/11 20:39:08 Executable is a batch script, so executing
> > C:\WINDOWS\system32\cmd.exe  /Q /C
> > "d:/arc/scripts/hooks/arc_job_fetch.bat"
> > 8/11 20:39:08 DaemonCore: return from reaper for pid 2320
> > 8/11 20:39:08 Calling pipe Handler <DC stderr pipe handler>
> > for Pipe end=65540 <DC stderr pipe>
> > 8/11 20:39:08 Return from pipe Handler
> > 8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4842>, access level READ
> > 8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:09 Calling pipe Handler <DC stderr pipe handler>
> > for Pipe end=65540 <DC stderr pipe>
> > 8/11 20:39:09 Return from pipe Handler
> > 8/11 20:39:09 Received UDP command 60011 (DC_NOP) from
> > <137.57.183.22:4843>, access level READ
> > 8/11 20:39:09 Calling HandleReq <handle_nop()> (0)
> > 8/11 20:39:09 Return from HandleReq <handle_nop()> (handler:
> > 0.000s, sec: 0.000s)
> > 8/11 20:39:09 Close_Pipe on invalid pipe end: -1
> > 8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
> > ..\src\condor_daemon_core.V6\daemon_core.cpp
> > 8/11 20:39:09 slot8: Changing state and activity:
> > Claimed/Busy -> Preempting/Killing
> > 8/11 20:39:09 slot5: Changing state and activity:
> > Claimed/Busy -> Preempting/Killing
> > 8/11 20:39:09 slot2: Changing state and activity:
> > Claimed/Idle -> Preempting/Killing
> > 8/11 20:39:09 slot2: State change: No preempting claim,
> > returning to owner
> > 8/11 20:39:09 slot2: Changing state and activity:
> > Preempting/Killing -> Owner/Idle
> > 8/11 20:39:09 slot2: State change: IS_OWNER is false
> > 8/11 20:39:09 slot2: Changing state: Owner -> Unclaimed
> > 8/11 20:39:09 slot6: Changing state and activity:
> > Claimed/Busy -> Preempting/Killing
> > 8/11 20:39:09 startd exiting because of fatal exception.
> > *** End of file StartLog
>

Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.