[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Intermittent Condor startd crashes



Didn't help. Not long after implementing this I got another random
startd crash. Same crash point in Close_Pipe.

So it appears this has nothing to do with writing to the StartLog file
from your hook. More likely it's just the number of concurrent hooks
that are being fired. I don't see this problem on my 2- and 4-way
Windows machines. Only on my 8-way Windows machines.

I am running my hook at a fairly aggressive rate:

##  Hooks Setup
##  The following hooks setup is global to all operating systems in an
ARC
##  system.

# The FetchWorkDelay expression must evaluate to an integer, which
defines
# the number of seconds since the last fetch attempt completed before
the
# condor_startd will attempt to fetch more work. However, as a ClassAd
# expression (evaluated in the context of the ClassAd of the slot
# considering if it should fetch more work, and the ClassAd of the
# currently running job, if any), the length of the delay can be based
on
# the current state the slot and even the currently running job.
#
# For example, a very common configuration would be to always wait 5
# minutes (300 seconds) between attempts to fetch work, unless the slot
is
# Claimed/Idle, in which case the condor_startd should fetch
immediately:
#FetchWorkDelay = ifThenElse(State == "Claimed" && Activity == "Idle",
0, 5)
FetchWorkDelay = ifThenElse(State == "Claimed" && Activity == "Idle", 0,
ifThenElse(State == "Unclaimed" && Activity == "Idle", 5, 1200))

## Hooks Configuration
## This section describes the default hooks setup and behaviour on this
## platform.
STARTD_JOB_HOOK_KEYWORD = ARC
ARC_HOOK_DIR = d:/arc/scripts/hooks
ARC_HOOK_FETCH_WORK = $(ARC_HOOK_DIR)/arc_job_fetch.bat
ARC_HOOK_REPLY_FETCH =
ARC_HOOK_EVICT_CLAIM =
ARC_HOOK_PREPARE_JOB =
ARC_HOOK_UPDATE_JOB_INFO =
ARC_HOOK_JOB_EXIT =

And my queue is empty. So there are 8 hook script instances firing
nearly continuously. It takes each instance about 0.5 seconds to figure
out there's nothing to run.

I'm going to try slowing down the hook execution rate on my Windows
machine. Specifically I'm going to introduce a random sleep between 1-10
seconds to the start of the hook script that will apply when its run on
Windows machines. That should help prevent them from all running and
completing at the same time.

But if someone from the Condor team can try to reproduce this, run a
hook on a Windows machine with that kind of aggressive schedule when
you've got 8 or more slots, I'd appreciate it.

- Ian

> I have an update. After looking at the source code it appears
> the crash is indeed happening as part of the pipe cleanup
> process for the spawned pipes use to run the hooks.
>
> I have a theory: because I'm only seeing this on my 8-way
> boxes I'm wondering if there's just a big synchornization
> issue with this many pipes all writing to stderr on Windows.
> The Close_Pipe routine is trying to close a pipe or two that
> might still be writing or flushing its output.
>
> I'm testing the theory now. I was using STDERR from my hook
> scripts to print debug information for my hook routine into
> the StartLog file. I'm no longer doing that. Instead I've got
> my hook script creating its own log file, one per slot, in
> the log directory and writing the debug information there.
>
> If I can keep a machine up for a few days without a random
> startd crash I think I'll close the book on this one.
>
> - Ian
>
> > -----Original Message-----
> > From: Ian Chesal
> > Sent: Tuesday, August 11, 2009 3:49 PM
> > To: Ian Chesal; 'Condor-Users Mail List'
> > Subject: RE: Intermittent Condor startd crashes
> >
> > I forgot one other bit of information:
> >
> > The startd will crash periodically on the machine, even with
> > no jobs assigned to run on it.
> >
> > Here's the StartLog output when that happens:
> >
> > 8/11 19:07:16 ERROR "Close_Pipe error" at line 1804 in file
> > ..\src\condor_daemon_core.V6\daemon_core.cpp
> > 8/11 19:07:16 Deleting Cronmgr
> > 8/11 19:07:16 Cron: Deleting job 'update'
> > (d:/arc/scripts/hooks/update_hooks_and_modules.bat)
> > 8/11 19:07:16 All resources are free, exiting.
> > *** End of file StartLog
> >
> > *** Last entry in core file core.STARTD.WIN32
> >
> > ====================
> > Exception code: C0000005 ACCESS_VIOLATION
> > Fault address:  7D4D8820 01:00008820
> C:\WINDOWS\syswow64\kernel32.dll
> >
> > Registers:
> > EAX:01658F5C
> > EBX:00000000
> > ECX:01658F5C
> > EDX:00000000
> > ESI:016533C0
> > EDI:016533C0
> > CS:EIP:0023:7D4D8820
> > SS:ESP:002B:018311F0  EBP:018313A0
> > DS:002B  ES:002B  FS:0053  GS:002B
> > Flags:00010206
> >
> > Call stack:
> > Address   Frame
> > 7D4D8820  018313A0  InterlockedExchange+8
> >
> > *** End of file core.STARTD.WIN32
> >
> > It just ran a Startd cron job for me. It runs that Startd
> > cron job every hour. But it doesn't crash every hour. The
> > cron job runs robocopy to sync some stuff on the box that our
> > jobs use. In a 24 hour period I saw 4 crashes. In all 24
> > cases when the cron job was run nothing was sync'ed by
> > robocopy -- all 24 jobs did the exact same thing, output the
> > same messages, etc.
> >
> > - Ian
> >
> > > -----Original Message-----
> > > From: Ian Chesal
> > > Sent: Tuesday, August 11, 2009 3:46 PM
> > > To: 'Condor-Users Mail List'
> > > Subject: Intermittent Condor startd crashes
> > >
> > > This is on Windows XP 64-bit. I have a feeling the problem is
> > > my machine image, and not Condor. But I was wondering if any
> > > of the Condor developers have seen this problem. Insight on
> > > what specifically might be wrong on the machine?
> > >
> > > The setup is the same every time: I submit a slew of sleep(1)
> > > jobs that target this host. The host is pulling jobs using
> > > hook scripts. It's okay for about 1200 jobs or so and then
> > > the startd crashes. The crash is always the same: the startd
> > > exits because of a fatal exception. The email snippet of the
> > > StartLog always has an error line like this in it:
> > >
> > > 8/11 20:39:09 Close_Pipe on invalid pipe end: -1
> > > 8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
> > > ..\src\condor_daemon_core.V6\daemon_core.cpp
> > >
> > > Now the really crummy part: on top of running with hooks,
> > I'm running:
> > >
> > > $CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $
> > > $CondorPlatform: INTEL-WINNT50 $
> > >
> > > So this might be a problem with this pre-release version.
> > >
> > > - Ian
> > >
> > >
> > > "d:/arc/condor/bin/condor_startd.exe" on
> > > "uk-swcf5173.altera.com" exited with status 4.
> > > Condor will automatically restart this process in 10 seconds.
> > >
> > > *** Last 100 line(s) of file d:/arc/condor/log/StartLog:
> > > 8/11 20:39:07 condor_read(): recv() returned -1, errno =
> > > 10054, assuming failure reading 5 bytes from <127.0.0.1:4678>.
> > > 8/11 20:39:07 IO: Failed to read packet header
> > > 8/11 20:39:07 Return from Handler <receiveJobClassAdUpdate>
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4818>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:07 DaemonCore: pid 3064 exited with status 0,
> > > invoking reaper 4 <reaper>
> > > 8/11 20:39:07 Starter pid 3064 exited with status 0
> > > 8/11 20:39:07 slot3: State change: starter exited
> > > 8/11 20:39:07 slot3: Changing activity: Busy -> Idle
> > > 8/11 20:39:07 Executable is a batch script, so executing
> > > C:\WINDOWS\system32\cmd.exe  /Q /C
> > > "d:/arc/scripts/hooks/arc_job_fetch.bat"
> > > 8/11 20:39:07 DaemonCore: return from reaper for pid 3064
> > > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > > for Pipe end=65540 <DC stderr pipe>
> > > 8/11 20:39:07 Return from pipe Handler
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4821>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
> > > for Pipe end=65538 <DC stdout pipe>
> > > 8/11 20:39:07 Return from pipe Handler
> > > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > > for Pipe end=65540 <DC stderr pipe>
> > > 8/11 20:39:07 Return from pipe Handler
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4824>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4826>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:07 DaemonCore: pid 2544 exited with status 0,
> > > invoking reaper 1 <HookClientMgr Output Reaper>
> > > 8/11 20:39:07 Warning, hook
> > > d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2544) printed to
> > > stderr: DEBUG: Slot = 1
> > > DEBUG: State = "Claimed"
> > > DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
> > > (((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
> > > && ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
> > > ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
> > > ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
> > > ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
> > > < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
> > > - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
> > > State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
> > > 0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
> > > || ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
> > > TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
> > > && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
> > > (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
> > > || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
> > > <= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
> > > 1439)))))) && (((LoadAvg - CondorLoa!
> > >  dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
> > > "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
> > > TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
> > > DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
> > > Error: Slot state is unknown or not unclaimed
> > > 8/11 20:39:07 slot1: Changing state and activity:
> > > Claimed/Idle -> Preempting/Vacating
> > > 8/11 20:39:07 slot1: State change: No preempting claim,
> > > returning to owner
> > > 8/11 20:39:07 slot1: Changing state and activity:
> > > Preempting/Vacating -> Owner/Idle
> > > 8/11 20:39:07 slot1: State change: IS_OWNER is false
> > > 8/11 20:39:07 slot1: Changing state: Owner -> Unclaimed
> > > 8/11 20:39:07 DaemonCore: return from reaper for pid 2544
> > > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > > for Pipe end=65542 <DC stderr pipe>
> > > 8/11 20:39:07 Return from pipe Handler
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4827>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.016s)
> > > 8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
> > > for Pipe end=65542 <DC stderr pipe>
> > > 8/11 20:39:07 Return from pipe Handler
> > > 8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
> > > for Pipe end=65539 <DC stdout pipe>
> > > 8/11 20:39:07 Return from pipe Handler
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4828>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4829>, access level READ
> > > 8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:07 DaemonCore: pid 2116 exited with status 0,
> > > invoking reaper 1 <HookClientMgr Output Reaper>
> > > 8/11 20:39:07 Warning, hook
> > > d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2116) printed to
> > > stderr: DEBUG: Slot = 3
> > > DEBUG: State = "Claimed"
> > > DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
> > > (((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
> > > && ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
> > > ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
> > > ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
> > > ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
> > > < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
> > > - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
> > > State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
> > > 0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
> > > || ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
> > > TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
> > > && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
> > > (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
> > > || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
> > > <= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
> > > 1439)))))) && (((LoadAvg - CondorLoa!
> > >  dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
> > > "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
> > > TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
> > > DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
> > > Error: Slot state is unknown or not unclaimed
> > > 8/11 20:39:07 slot3: Changing state and activity:
> > > Claimed/Idle -> Preempting/Vacating
> > > 8/11 20:39:07 slot3: State change: No preempting claim,
> > > returning to owner
> > > 8/11 20:39:07 slot3: Changing state and activity:
> > > Preempting/Vacating -> Owner/Idle
> > > 8/11 20:39:07 slot3: State change: IS_OWNER is false
> > > 8/11 20:39:07 slot3: Changing state: Owner -> Unclaimed
> > > 8/11 20:39:07 DaemonCore: return from reaper for pid 2116
> > > 8/11 20:39:08 Calling Handler <receiveJobClassAdUpdate>
> > > 8/11 20:39:08 condor_read(): recv() returned -1, errno =
> > > 10054, assuming failure reading 5 bytes from <127.0.0.1:4719>.
> > > 8/11 20:39:08 IO: Failed to read packet header
> > > 8/11 20:39:08 Return from Handler <receiveJobClassAdUpdate>
> > > 8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4839>, access level READ
> > > 8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:08 DaemonCore: pid 2320 exited with status 0,
> > > invoking reaper 4 <reaper>
> > > 8/11 20:39:08 Starter pid 2320 exited with status 0
> > > 8/11 20:39:08 slot2: State change: starter exited
> > > 8/11 20:39:08 slot2: Changing activity: Busy -> Idle
> > > 8/11 20:39:08 Executable is a batch script, so executing
> > > C:\WINDOWS\system32\cmd.exe  /Q /C
> > > "d:/arc/scripts/hooks/arc_job_fetch.bat"
> > > 8/11 20:39:08 DaemonCore: return from reaper for pid 2320
> > > 8/11 20:39:08 Calling pipe Handler <DC stderr pipe handler>
> > > for Pipe end=65540 <DC stderr pipe>
> > > 8/11 20:39:08 Return from pipe Handler
> > > 8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4842>, access level READ
> > > 8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:09 Calling pipe Handler <DC stderr pipe handler>
> > > for Pipe end=65540 <DC stderr pipe>
> > > 8/11 20:39:09 Return from pipe Handler
> > > 8/11 20:39:09 Received UDP command 60011 (DC_NOP) from
> > > <137.57.183.22:4843>, access level READ
> > > 8/11 20:39:09 Calling HandleReq <handle_nop()> (0)
> > > 8/11 20:39:09 Return from HandleReq <handle_nop()> (handler:
> > > 0.000s, sec: 0.000s)
> > > 8/11 20:39:09 Close_Pipe on invalid pipe end: -1
> > > 8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
> > > ..\src\condor_daemon_core.V6\daemon_core.cpp
> > > 8/11 20:39:09 slot8: Changing state and activity:
> > > Claimed/Busy -> Preempting/Killing
> > > 8/11 20:39:09 slot5: Changing state and activity:
> > > Claimed/Busy -> Preempting/Killing
> > > 8/11 20:39:09 slot2: Changing state and activity:
> > > Claimed/Idle -> Preempting/Killing
> > > 8/11 20:39:09 slot2: State change: No preempting claim,
> > > returning to owner
> > > 8/11 20:39:09 slot2: Changing state and activity:
> > > Preempting/Killing -> Owner/Idle
> > > 8/11 20:39:09 slot2: State change: IS_OWNER is false
> > > 8/11 20:39:09 slot2: Changing state: Owner -> Unclaimed
> > > 8/11 20:39:09 slot6: Changing state and activity:
> > > Claimed/Busy -> Preempting/Killing
> > > 8/11 20:39:09 startd exiting because of fatal exception.
> > > *** End of file StartLog
> >

Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.