[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Intermittent Condor startd crashes



Thanks for reporting this, Ian. I was beginning to go crazy trying to track down a similar issue.

Interestingly, I've been having problems with Condor 7.3.1 on an 8 slot OS X 10.5 machine. I haven't been able to figure out why this one machine would have slots fail after running a job, while all the others (2 slot) using the same binaries and configuration ran fine.

Condor also fails to exec daemons when using condor_restart or condor_off followed by condor_on when using 8 slots.

Reducing the number of slots to 4 makes everything run fine.

    Craig

On Aug 19, 2009, at 3:39 PM, Ian Chesal wrote:

Just one more update:

If I push any machine (even a 2-core machine) up to >6 slots I can make
it crash. I can make it crash faster if #cores < #slots, but crash it
always will.

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Wednesday, August 19, 2009 3:54 PM
To: Ian Chesal; Condor-Users Mail List
Subject: RE: Intermittent Condor startd crashes

Didn't help. Not long after implementing this I got another
random startd crash. Same crash point in Close_Pipe.

So it appears this has nothing to do with writing to the
StartLog file from your hook. More likely it's just the
number of concurrent hooks that are being fired. I don't see
this problem on my 2- and 4-way Windows machines. Only on my
8-way Windows machines.

I am running my hook at a fairly aggressive rate:

##  Hooks Setup
##  The following hooks setup is global to all operating
systems in an ARC
##  system.

# The FetchWorkDelay expression must evaluate to an integer,
which defines
# the number of seconds since the last fetch attempt
completed before the
# condor_startd will attempt to fetch more work. However, as a ClassAd
# expression (evaluated in the context of the ClassAd of the slot
# considering if it should fetch more work, and the ClassAd of the
# currently running job, if any), the length of the delay can
be based on
# the current state the slot and even the currently running job.
#
# For example, a very common configuration would be to always wait 5
# minutes (300 seconds) between attempts to fetch work,
unless the slot is
# Claimed/Idle, in which case the condor_startd should fetch
immediately:
#FetchWorkDelay = ifThenElse(State == "Claimed" && Activity
== "Idle", 0, 5)
FetchWorkDelay = ifThenElse(State == "Claimed" && Activity ==
"Idle", 0, ifThenElse(State == "Unclaimed" && Activity ==
"Idle", 5, 1200))

## Hooks Configuration
## This section describes the default hooks setup and
behaviour on this
## platform.
STARTD_JOB_HOOK_KEYWORD = ARC
ARC_HOOK_DIR = d:/arc/scripts/hooks
ARC_HOOK_FETCH_WORK = $(ARC_HOOK_DIR)/arc_job_fetch.bat
ARC_HOOK_REPLY_FETCH =
ARC_HOOK_EVICT_CLAIM =
ARC_HOOK_PREPARE_JOB =
ARC_HOOK_UPDATE_JOB_INFO =
ARC_HOOK_JOB_EXIT =

And my queue is empty. So there are 8 hook script instances
firing nearly continuously. It takes each instance about 0.5
seconds to figure out there's nothing to run.

I'm going to try slowing down the hook execution rate on my
Windows machine. Specifically I'm going to introduce a random
sleep between 1-10 seconds to the start of the hook script
that will apply when its run on Windows machines. That should
help prevent them from all running and completing at the same time.

But if someone from the Condor team can try to reproduce
this, run a hook on a Windows machine with that kind of
aggressive schedule when you've got 8 or more slots, I'd
appreciate it.

- Ian

I have an update. After looking at the source code it appears
the crash is indeed happening as part of the pipe cleanup
process for the spawned pipes use to run the hooks.

I have a theory: because I'm only seeing this on my 8-way
boxes I'm wondering if there's just a big synchornization
issue with this many pipes all writing to stderr on Windows.
The Close_Pipe routine is trying to close a pipe or two that
might still be writing or flushing its output.

I'm testing the theory now. I was using STDERR from my hook
scripts to print debug information for my hook routine into
the StartLog file. I'm no longer doing that. Instead I've got
my hook script creating its own log file, one per slot, in
the log directory and writing the debug information there.

If I can keep a machine up for a few days without a random
startd crash I think I'll close the book on this one.

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Tuesday, August 11, 2009 3:49 PM
To: Ian Chesal; 'Condor-Users Mail List'
Subject: RE: Intermittent Condor startd crashes

I forgot one other bit of information:

The startd will crash periodically on the machine, even with
no jobs assigned to run on it.

Here's the StartLog output when that happens:

8/11 19:07:16 ERROR "Close_Pipe error" at line 1804 in file
..\src\condor_daemon_core.V6\daemon_core.cpp
8/11 19:07:16 Deleting Cronmgr
8/11 19:07:16 Cron: Deleting job 'update'
(d:/arc/scripts/hooks/update_hooks_and_modules.bat)
8/11 19:07:16 All resources are free, exiting.
*** End of file StartLog

*** Last entry in core file core.STARTD.WIN32

====================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  7D4D8820 01:00008820
C:\WINDOWS\syswow64\kernel32.dll

Registers:
EAX:01658F5C
EBX:00000000
ECX:01658F5C
EDX:00000000
ESI:016533C0
EDI:016533C0
CS:EIP:0023:7D4D8820
SS:ESP:002B:018311F0  EBP:018313A0
DS:002B  ES:002B  FS:0053  GS:002B
Flags:00010206

Call stack:
Address   Frame
7D4D8820  018313A0  InterlockedExchange+8

*** End of file core.STARTD.WIN32

It just ran a Startd cron job for me. It runs that Startd
cron job every hour. But it doesn't crash every hour. The
cron job runs robocopy to sync some stuff on the box that our
jobs use. In a 24 hour period I saw 4 crashes. In all 24
cases when the cron job was run nothing was sync'ed by
robocopy -- all 24 jobs did the exact same thing, output the
same messages, etc.

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Tuesday, August 11, 2009 3:46 PM
To: 'Condor-Users Mail List'
Subject: Intermittent Condor startd crashes

This is on Windows XP 64-bit. I have a feeling the problem is
my machine image, and not Condor. But I was wondering if any
of the Condor developers have seen this problem. Insight on
what specifically might be wrong on the machine?

The setup is the same every time: I submit a slew of sleep(1)
jobs that target this host. The host is pulling jobs using
hook scripts. It's okay for about 1200 jobs or so and then
the startd crashes. The crash is always the same: the startd
exits because of a fatal exception. The email snippet of the
StartLog always has an error line like this in it:

8/11 20:39:09 Close_Pipe on invalid pipe end: -1
8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
..\src\condor_daemon_core.V6\daemon_core.cpp

Now the really crummy part: on top of running with hooks,
I'm running:

$CondorVersion: 7.2.2 Mar 20 2009 BuildID: none
PRE-RELEASE-UWCS $
$CondorPlatform: INTEL-WINNT50 $

So this might be a problem with this pre-release version.

- Ian


"d:/arc/condor/bin/condor_startd.exe" on
"uk-swcf5173.altera.com" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 100 line(s) of file d:/arc/condor/log/StartLog:
8/11 20:39:07 condor_read(): recv() returned -1, errno =
10054, assuming failure reading 5 bytes from <127.0.0.1:4678>.
8/11 20:39:07 IO: Failed to read packet header
8/11 20:39:07 Return from Handler <receiveJobClassAdUpdate>
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4818>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:07 DaemonCore: pid 3064 exited with status 0,
invoking reaper 4 <reaper>
8/11 20:39:07 Starter pid 3064 exited with status 0
8/11 20:39:07 slot3: State change: starter exited
8/11 20:39:07 slot3: Changing activity: Busy -> Idle
8/11 20:39:07 Executable is a batch script, so executing
C:\WINDOWS\system32\cmd.exe  /Q /C
"d:/arc/scripts/hooks/arc_job_fetch.bat"
8/11 20:39:07 DaemonCore: return from reaper for pid 3064
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
for Pipe end=65540 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4821>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
for Pipe end=65538 <DC stdout pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
for Pipe end=65540 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4824>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4826>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:07 DaemonCore: pid 2544 exited with status 0,
invoking reaper 1 <HookClientMgr Output Reaper>
8/11 20:39:07 Warning, hook
d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2544) printed to
stderr: DEBUG: Slot = 1
DEBUG: State = "Claimed"
DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
(((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
&& ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
< 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
- CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
|| ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
&& (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
(ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
|| ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
<= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
1439)))))) && (((LoadAvg - CondorLoa!
dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
"Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
Error: Slot state is unknown or not unclaimed
8/11 20:39:07 slot1: Changing state and activity:
Claimed/Idle -> Preempting/Vacating
8/11 20:39:07 slot1: State change: No preempting claim,
returning to owner
8/11 20:39:07 slot1: Changing state and activity:
Preempting/Vacating -> Owner/Idle
8/11 20:39:07 slot1: State change: IS_OWNER is false
8/11 20:39:07 slot1: Changing state: Owner -> Unclaimed
8/11 20:39:07 DaemonCore: return from reaper for pid 2544
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
for Pipe end=65542 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4827>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.016s)
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler>
for Pipe end=65542 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler>
for Pipe end=65539 <DC stdout pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4828>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4829>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:07 DaemonCore: pid 2116 exited with status 0,
invoking reaper 1 <HookClientMgr Output Reaper>
8/11 20:39:07 Warning, hook
d:/arc/scripts/hooks/arc_job_fetch.bat (pid 2116) printed to
stderr: DEBUG: Slot = 3
DEBUG: State = "Claimed"
DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) &&
(((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0
&& ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 &&
ClockMin <= 480)) || (480 < 1225 && (ClockMin >= 1225 ||
ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay == 6) &&
((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439
< 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg
- CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" &&
State != "Owner")) && (KeyboardIdle >= 180)) || ((SlotID <=
0))) =!= TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))
|| ((FALSE =!= TRUE) && (((TRUE =!= TRUE) || (((FALSE =?=
TRUE) && ((((((ClockDay > 0 && ClockDay < 6) && ((480 > 1225
&& (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
(ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0
|| ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin
<= 1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <=
1439)))))) && (((LoadAvg - CondorLoa!
dAvg) <= 0.300000) || (State != "Unclaimed" && State !=
"Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!=
TRUE)) || ((FALSE =!= TRUE) && FALSE))) =?= FALSE))))
DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
Error: Slot state is unknown or not unclaimed
8/11 20:39:07 slot3: Changing state and activity:
Claimed/Idle -> Preempting/Vacating
8/11 20:39:07 slot3: State change: No preempting claim,
returning to owner
8/11 20:39:07 slot3: Changing state and activity:
Preempting/Vacating -> Owner/Idle
8/11 20:39:07 slot3: State change: IS_OWNER is false
8/11 20:39:07 slot3: Changing state: Owner -> Unclaimed
8/11 20:39:07 DaemonCore: return from reaper for pid 2116
8/11 20:39:08 Calling Handler <receiveJobClassAdUpdate>
8/11 20:39:08 condor_read(): recv() returned -1, errno =
10054, assuming failure reading 5 bytes from <127.0.0.1:4719>.
8/11 20:39:08 IO: Failed to read packet header
8/11 20:39:08 Return from Handler <receiveJobClassAdUpdate>
8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4839>, access level READ
8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:08 DaemonCore: pid 2320 exited with status 0,
invoking reaper 4 <reaper>
8/11 20:39:08 Starter pid 2320 exited with status 0
8/11 20:39:08 slot2: State change: starter exited
8/11 20:39:08 slot2: Changing activity: Busy -> Idle
8/11 20:39:08 Executable is a batch script, so executing
C:\WINDOWS\system32\cmd.exe  /Q /C
"d:/arc/scripts/hooks/arc_job_fetch.bat"
8/11 20:39:08 DaemonCore: return from reaper for pid 2320
8/11 20:39:08 Calling pipe Handler <DC stderr pipe handler>
for Pipe end=65540 <DC stderr pipe>
8/11 20:39:08 Return from pipe Handler
8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4842>, access level READ
8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
8/11 20:39:08 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:09 Calling pipe Handler <DC stderr pipe handler>
for Pipe end=65540 <DC stderr pipe>
8/11 20:39:09 Return from pipe Handler
8/11 20:39:09 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4843>, access level READ
8/11 20:39:09 Calling HandleReq <handle_nop()> (0)
8/11 20:39:09 Return from HandleReq <handle_nop()> (handler:
0.000s, sec: 0.000s)
8/11 20:39:09 Close_Pipe on invalid pipe end: -1
8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
..\src\condor_daemon_core.V6\daemon_core.cpp
8/11 20:39:09 slot8: Changing state and activity:
Claimed/Busy -> Preempting/Killing
8/11 20:39:09 slot5: Changing state and activity:
Claimed/Busy -> Preempting/Killing
8/11 20:39:09 slot2: Changing state and activity:
Claimed/Idle -> Preempting/Killing
8/11 20:39:09 slot2: State change: No preempting claim,
returning to owner
8/11 20:39:09 slot2: Changing state and activity:
Preempting/Killing -> Owner/Idle
8/11 20:39:09 slot2: State change: IS_OWNER is false
8/11 20:39:09 slot2: Changing state: Owner -> Unclaimed
8/11 20:39:09 slot6: Changing state and activity:
Claimed/Busy -> Preempting/Killing
8/11 20:39:09 startd exiting because of fatal exception.
*** End of file StartLog


Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution, or copying of this message, or any attachments, is strictly prohibited. If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments. Thank you.

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

--
Craig A. Struble, Ph.D. | 369 Cudahy Hall  | Marquette University
Associate Professor of Computer Science    | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx