[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Intermittent Condor startd crashes



This is on Windows XP 64-bit. I have a feeling the problem is my machine
image, and not Condor. But I was wondering if any of the Condor
developers have seen this problem. Insight on what specifically might be
wrong on the machine?

The setup is the same every time: I submit a slew of sleep(1) jobs that
target this host. The host is pulling jobs using hook scripts. It's okay
for about 1200 jobs or so and then the startd crashes. The crash is
always the same: the startd exits because of a fatal exception. The
email snippet of the StartLog always has an error line like this in it:

8/11 20:39:09 Close_Pipe on invalid pipe end: -1
8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
..\src\condor_daemon_core.V6\daemon_core.cpp

Now the really crummy part: on top of running with hooks, I'm running:

$CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $
$CondorPlatform: INTEL-WINNT50 $

So this might be a problem with this pre-release version.

- Ian


"d:/arc/condor/bin/condor_startd.exe" on "uk-swcf5173.altera.com" exited
with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 100 line(s) of file d:/arc/condor/log/StartLog:
8/11 20:39:07 condor_read(): recv() returned -1, errno = 10054, assuming
failure reading 5 bytes from <127.0.0.1:4678>.
8/11 20:39:07 IO: Failed to read packet header
8/11 20:39:07 Return from Handler <receiveJobClassAdUpdate>
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4818>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:07 DaemonCore: pid 3064 exited with status 0, invoking reaper
4 <reaper>
8/11 20:39:07 Starter pid 3064 exited with status 0
8/11 20:39:07 slot3: State change: starter exited
8/11 20:39:07 slot3: Changing activity: Busy -> Idle
8/11 20:39:07 Executable is a batch script, so executing
C:\WINDOWS\system32\cmd.exe  /Q /C
"d:/arc/scripts/hooks/arc_job_fetch.bat"
8/11 20:39:07 DaemonCore: return from reaper for pid 3064
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65540 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4821>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler> for Pipe
end=65538 <DC stdout pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65540 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4824>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4826>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:07 DaemonCore: pid 2544 exited with status 0, invoking reaper
1 <HookClientMgr Output Reaper>
8/11 20:39:07 Warning, hook d:/arc/scripts/hooks/arc_job_fetch.bat (pid
2544) printed to stderr: DEBUG: Slot = 1
DEBUG: State = "Claimed"
DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) && (((TRUE =!=
TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0 && ClockDay < 6) &&
((480 > 1225 && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
(ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay
== 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439 <
0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg -
CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State !=
"Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!= TRUE)) ||
((FALSE =!= TRUE) && FALSE))) =?= FALSE)) || ((FALSE =!= TRUE) &&
(((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0 &&
ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 && ClockMin <= 480))
|| (480 < 1225 && (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay
== 0 || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin <=
1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) &&
(((LoadAvg - CondorLoa!
 dAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")) &&
(KeyboardIdle >= 180)) || ((SlotID <= 0))) =!= TRUE)) || ((FALSE =!=
TRUE) && FALSE))) =?= FALSE))))
DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
Error: Slot state is unknown or not unclaimed
8/11 20:39:07 slot1: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
8/11 20:39:07 slot1: State change: No preempting claim, returning to
owner
8/11 20:39:07 slot1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
8/11 20:39:07 slot1: State change: IS_OWNER is false
8/11 20:39:07 slot1: Changing state: Owner -> Unclaimed
8/11 20:39:07 DaemonCore: return from reaper for pid 2544
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65542 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4827>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.016s)
8/11 20:39:07 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65542 <DC stderr pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Calling pipe Handler <DC stdout pipe handler> for Pipe
end=65539 <DC stdout pipe>
8/11 20:39:07 Return from pipe Handler
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4828>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:07 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4829>, access level READ
8/11 20:39:07 Calling HandleReq <handle_nop()> (0)
8/11 20:39:07 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:07 DaemonCore: pid 2116 exited with status 0, invoking reaper
1 <HookClientMgr Output Reaper>
8/11 20:39:07 Warning, hook d:/arc/scripts/hooks/arc_job_fetch.bat (pid
2116) printed to stderr: DEBUG: Slot = 3
DEBUG: State = "Claimed"
DEBUG: Start = ((TRUE =?= TRUE) && (((FALSE =?= TRUE) && (((TRUE =!=
TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0 && ClockDay < 6) &&
((480 > 1225 && (ClockMin >= 1225 && ClockMin <= 480)) || (480 < 1225 &&
(ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay == 0 || ClockDay
== 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin <= 1439)) || ((1439 <
0 && (ClockMin >= 0 || ClockMin <= 1439)))))) && (((LoadAvg -
CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State !=
"Owner")) && (KeyboardIdle >= 180)) || ((SlotID <= 0))) =!= TRUE)) ||
((FALSE =!= TRUE) && FALSE))) =?= FALSE)) || ((FALSE =!= TRUE) &&
(((TRUE =!= TRUE) || (((FALSE =?= TRUE) && ((((((ClockDay > 0 &&
ClockDay < 6) && ((480 > 1225 && (ClockMin >= 1225 && ClockMin <= 480))
|| (480 < 1225 && (ClockMin >= 1225 || ClockMin <= 480)))) || ((ClockDay
== 0 || ClockDay == 6) && ((1439 > 0 && (ClockMin >= 0 && ClockMin <=
1439)) || ((1439 < 0 && (ClockMin >= 0 || ClockMin <= 1439)))))) &&
(((LoadAvg - CondorLoa!
 dAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")) &&
(KeyboardIdle >= 180)) || ((SlotID <= 0))) =!= TRUE)) || ((FALSE =!=
TRUE) && FALSE))) =?= FALSE))))
DEBUG: Requirements = (START) && (IsValidCheckpointPlatform)
Error: Slot state is unknown or not unclaimed
8/11 20:39:07 slot3: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
8/11 20:39:07 slot3: State change: No preempting claim, returning to
owner
8/11 20:39:07 slot3: Changing state and activity: Preempting/Vacating ->
Owner/Idle
8/11 20:39:07 slot3: State change: IS_OWNER is false
8/11 20:39:07 slot3: Changing state: Owner -> Unclaimed
8/11 20:39:07 DaemonCore: return from reaper for pid 2116
8/11 20:39:08 Calling Handler <receiveJobClassAdUpdate>
8/11 20:39:08 condor_read(): recv() returned -1, errno = 10054, assuming
failure reading 5 bytes from <127.0.0.1:4719>.
8/11 20:39:08 IO: Failed to read packet header
8/11 20:39:08 Return from Handler <receiveJobClassAdUpdate>
8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4839>, access level READ
8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
8/11 20:39:08 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:08 DaemonCore: pid 2320 exited with status 0, invoking reaper
4 <reaper>
8/11 20:39:08 Starter pid 2320 exited with status 0
8/11 20:39:08 slot2: State change: starter exited
8/11 20:39:08 slot2: Changing activity: Busy -> Idle
8/11 20:39:08 Executable is a batch script, so executing
C:\WINDOWS\system32\cmd.exe  /Q /C
"d:/arc/scripts/hooks/arc_job_fetch.bat"
8/11 20:39:08 DaemonCore: return from reaper for pid 2320
8/11 20:39:08 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65540 <DC stderr pipe>
8/11 20:39:08 Return from pipe Handler
8/11 20:39:08 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4842>, access level READ
8/11 20:39:08 Calling HandleReq <handle_nop()> (0)
8/11 20:39:08 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:09 Calling pipe Handler <DC stderr pipe handler> for Pipe
end=65540 <DC stderr pipe>
8/11 20:39:09 Return from pipe Handler
8/11 20:39:09 Received UDP command 60011 (DC_NOP) from
<137.57.183.22:4843>, access level READ
8/11 20:39:09 Calling HandleReq <handle_nop()> (0)
8/11 20:39:09 Return from HandleReq <handle_nop()> (handler: 0.000s,
sec: 0.000s)
8/11 20:39:09 Close_Pipe on invalid pipe end: -1
8/11 20:39:09 ERROR "Close_Pipe error" at line 1804 in file
..\src\condor_daemon_core.V6\daemon_core.cpp
8/11 20:39:09 slot8: Changing state and activity: Claimed/Busy ->
Preempting/Killing
8/11 20:39:09 slot5: Changing state and activity: Claimed/Busy ->
Preempting/Killing
8/11 20:39:09 slot2: Changing state and activity: Claimed/Idle ->
Preempting/Killing
8/11 20:39:09 slot2: State change: No preempting claim, returning to
owner
8/11 20:39:09 slot2: Changing state and activity: Preempting/Killing ->
Owner/Idle
8/11 20:39:09 slot2: State change: IS_OWNER is false
8/11 20:39:09 slot2: Changing state: Owner -> Unclaimed
8/11 20:39:09 slot6: Changing state and activity: Claimed/Busy ->
Preempting/Killing
8/11 20:39:09 startd exiting because of fatal exception.
*** End of file StartLog

Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.