
RE: [Condor-users] Strange schedd crash (exit status 44)



We got the same schedd crash again on Windows. This is the 6.7.2
branch. Is there anything in the output that might tip us off to the
problem? In both cases it looks like it's dying while trying to fork a
condor_shadow for a new job.
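
One way to check that observation against a pasted SchedLog tail is to
re-join the lines the mailer wrapped and look at the final complete
entry. The following is only a rough sketch: the function names
(log_entries, last_entry) are made up for illustration and are not part
of Condor, and it assumes every entry starts with a "MM/DD HH:MM:SS"
timestamp as in the excerpts in this message.

```python
import re

# Assumption: every SchedLog entry begins with "MM/DD HH:MM:SS ".
# Mail wrapping splits some entries in two, so lines without a
# timestamp are treated as continuations of the previous entry.
TIMESTAMP = re.compile(r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def log_entries(text):
    """Return complete log entries, re-joining lines wrapped by the mailer."""
    entries = []
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # drop "> " mail quoting, if present
        if not line or line.startswith("***"):
            continue  # skip blanks and the "*** ... SchedLog" markers
        if TIMESTAMP.match(line):
            entries.append(line)
        elif entries:
            entries[-1] += " " + line
    return entries


def last_entry(text):
    """Return the final complete entry, or None if there are no entries."""
    entries = log_entries(text)
    return entries[-1] if entries else None


# The last few lines of the SchedLog tail from the 11/24 crash.
sample = """\
11/24 09:24:39 match or classad for job 20.236 was deleted - not forking
a shadow
11/24 09:24:39 Started shadow for job 20.237 on "<137.57.176.180:1047>",
(shadow pid = 3912)
*** End of file SchedLog
"""

print(last_entry(sample))
```

Run on either tail in this thread, the last entry is a "Started shadow
for job ..." line, which is what the guess above is based on. It only
shows the last thing the schedd logged, not the cause of the exit.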

Thanks!
Ian

----
This is an automated email from the Condor system on machine
"TTC-GQUAN3.altera.priv.altera.com".  Do not reply.

"d:\abc\condor/bin/condor_schedd.exe" on
"TTC-GQUAN3.altera.priv.altera.com" exited with status 44.
Condor will automatically restart this process in 10 seconds.

*** Last 100 line(s) of file SchedLog:
11/24 09:14:42 attempt to add pre-existing match
"<137.57.176.183:4197>#1099203124#1706" ignored
11/24 09:14:42 attempt to add pre-existing match
"<137.57.176.179:2712>#1099202607#1606" ignored
11/24 09:14:42 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/24 09:14:42 Match record (<137.57.176.180:1047>, 20, 234) deleted
11/24 09:14:49 DaemonCore: Command received via UDP from host
<137.57.142.51:1319>
11/24 09:14:49 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:14:52 Started shadow for job 20.232 on "<137.57.176.180:1047>",
(shadow pid = 2152)
11/24 09:14:52 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:14:53 DaemonCore: Command received via TCP from host
<137.57.176.180:1877>
11/24 09:14:53 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/24 09:14:53 Got VACATE_SERVICE from <137.57.176.180:1877>
11/24 09:14:53 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/24 09:14:53 Match record (<137.57.176.180:1047>, 20, 232) deleted
11/24 09:14:53 DaemonCore: Command received via UDP from host
<137.57.142.51:1331>
11/24 09:14:53 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:14:54 DaemonCore: Command received via UDP from host
<137.57.142.51:1332>
11/24 09:14:54 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:14:54 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/24 09:14:54 Null parameter --- match not deleted
11/24 09:14:56 Started shadow for job 20.233 on "<137.57.176.180:1047>",
(shadow pid = 2720)
11/24 09:14:58 Started shadow for job 20.234 on "<137.57.176.180:1047>",
(shadow pid = 2100)
11/24 09:14:58 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:16:41 Response problem from startd.
11/24 09:16:41 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
11/24 09:16:41 Match record (<137.57.176.183:4197>, 20, 235) deleted
11/24 09:16:42 Activity on stashed negotiator socket
11/24 09:16:42 Negotiating for owner: gquan@xxxxxxxxxx
11/24 09:16:42 Checking consistency running and runnable jobs
11/24 09:16:42 Tables are consistent
11/24 09:16:43 Out of servers - 0 jobs matched, 36 jobs idle, 1 jobs
rejected
11/24 09:16:43 Response problem from startd.
11/24 09:16:43 Sent RELEASE_CLAIM to startd on <137.57.176.179:2712>
11/24 09:16:43 Match record (<137.57.176.179:2712>, 20, 236) deleted
11/24 09:17:28 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:18:43 Activity on stashed negotiator socket
11/24 09:18:43 Negotiating for owner: gquan@xxxxxxxxxx
11/24 09:18:43 Checking consistency running and runnable jobs
11/24 09:18:43 Tables are consistent
11/24 09:18:43 Out of servers - 0 jobs matched, 36 jobs idle, 1 jobs
rejected
11/24 09:19:43 DaemonCore: Command received via UDP from host
<137.57.142.51:1395>
11/24 09:19:43 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:19:43 DaemonCore: Command received via UDP from host
<137.57.142.51:1398>
11/24 09:19:43 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:19:45 Started shadow for job 20.232 on "<137.57.176.180:1047>",
(shadow pid = 2672)
11/24 09:19:47 Started shadow for job 20.235 on "<137.57.176.180:1047>",
(shadow pid = 1448)
11/24 09:19:47 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:20:43 Activity on stashed negotiator socket
11/24 09:20:43 Negotiating for owner: gquan@xxxxxxxxxx
11/24 09:20:43 Checking consistency running and runnable jobs
11/24 09:20:43 Tables are consistent
11/24 09:20:44 Out of servers - 4 jobs matched, 30 jobs idle, 1 jobs
rejected
11/24 09:20:58 DaemonCore: Command received via UDP from host
<137.57.142.51:1427>
11/24 09:20:58 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:21:01 Started shadow for job 20.236 on "<137.57.176.183:4197>",
(shadow pid = 2108)
11/24 09:21:01 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:21:07 DaemonCore: Command received via TCP from host
<137.57.176.183:2328>
11/24 09:21:07 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/24 09:21:07 Got VACATE_SERVICE from <137.57.176.183:2328>
11/24 09:21:07 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
11/24 09:21:07 Match record (<137.57.176.183:4197>, 20, 236) deleted
11/24 09:21:07 DaemonCore: Command received via UDP from host
<137.57.142.51:1440>
11/24 09:21:07 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:21:07 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/24 09:21:07 Null parameter --- match not deleted
11/24 09:21:10 Started shadow for job 20.238 on "<137.57.176.183:4197>",
(shadow pid = 2772)
11/24 09:21:10 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:22:34 DaemonCore: Command received via UDP from host
<137.57.142.51:1462>
11/24 09:22:34 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:22:37 Started shadow for job 20.236 on "<137.57.176.179:2712>",
(shadow pid = 2292)
11/24 09:22:37 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:22:42 DaemonCore: Command received via TCP from host
<137.57.176.179:2089>
11/24 09:22:42 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/24 09:22:42 Got VACATE_SERVICE from <137.57.176.179:2089>
11/24 09:22:42 Sent RELEASE_CLAIM to startd on <137.57.176.179:2712>
11/24 09:22:42 Match record (<137.57.176.179:2712>, 20, 236) deleted
11/24 09:22:43 DaemonCore: Command received via UDP from host
<137.57.142.51:1473>
11/24 09:22:43 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:22:43 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/24 09:22:43 Null parameter --- match not deleted
11/24 09:22:44 Activity on stashed negotiator socket
11/24 09:22:44 Negotiating for owner: gquan@xxxxxxxxxx
11/24 09:22:44 Checking consistency running and runnable jobs
11/24 09:22:45 Tables are consistent
11/24 09:22:45 Out of servers - 3 jobs matched, 29 jobs idle, 1 jobs
rejected
11/24 09:22:45 attempt to add pre-existing match
"<137.57.176.180:1047>#1100637096#502" ignored
11/24 09:22:45 attempt to add pre-existing match
"<137.57.176.180:1047>#1100637096#501" ignored
11/24 09:22:45 attempt to add pre-existing match
"<137.57.176.179:2712>#1099202607#1607" ignored
11/24 09:22:45 Started shadow for job 20.239 on "<137.57.176.179:2712>",
(shadow pid = 1144)
11/24 09:22:45 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/24 09:24:36 DaemonCore: Command received via UDP from host
<137.57.142.51:1505>
11/24 09:24:36 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:24:37 DaemonCore: Command received via UDP from host
<137.57.142.51:1508>
11/24 09:24:37 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/24 09:24:39 DaemonCore: Command received via TCP from host
<137.57.176.180:2306>
11/24 09:24:39 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/24 09:24:39 Got VACATE_SERVICE from <137.57.176.180:2306>
11/24 09:24:39 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/24 09:24:39 Match record (<137.57.176.180:1047>, 20, 236) deleted
11/24 09:24:39 match or classad for job 20.236 was deleted - not forking
a shadow
11/24 09:24:39 Started shadow for job 20.237 on "<137.57.176.180:1047>",
(shadow pid = 3912)
*** End of file SchedLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: swttcabca@xxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor


 

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> Sent: November 23, 2004 2:45 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] Strange schedd crash (exit status 44)
> 
> I get a schedd crash from this user's machine every time he 
> queues up 100 or more jobs. What does exit status 44 indicate?
> 
> Thanks!
> Ian
> 
> -----Original Message-----
> From: SYSTEM@xxxxxxxxxx [mailto:SYSTEM@xxxxxxxxxx]
> Sent: November 23, 2004 2:32 PM
> To: SW TOR Batch System Admins
> Subject: [Condor] Problem
> 
> This is an automated email from the Condor system on machine 
> "TTC-GQUAN3.altera.priv.altera.com".  Do not reply.
> 
> "d:\abc\condor/bin/condor_schedd.exe" on 
> "TTC-GQUAN3.altera.priv.altera.com" exited with status 44.
> Condor will automatically restart this process in 10 seconds.
> 
> *** Last 100 line(s) of file SchedLog:
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.180:1047>#1100637096#282" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.182:1151>#1099422886#1224" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.182:1151>#1099422886#1223" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.183:4197>#1099203124#1580" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.183:4197>#1099203124#1579" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.185:1407>#1099202749#1981" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.185:1407>#1099202749#1982" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.177:1213>#1100703290#277" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.186:2147>#1099203682#1256" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.177:1213>#1100703290#276" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.186:2147>#1099203682#1257" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.178:3591>#1099202664#1406" ignored
> 11/23 14:28:58 attempt to add pre-existing match 
> "<137.57.176.179:2712>#1099202607#1468" ignored
> 11/23 14:28:59 attempt to add pre-existing match 
> "<137.57.176.179:2712>#1099202607#1467" ignored
> 11/23 14:29:31 DaemonCore: Command received via UDP from host 
> <137.57.142.51:4119>
> 11/23 14:29:31 DaemonCore: received command 60001 
> (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 
> 11/23 14:29:36 Started shadow for job 19.130 on 
> "<137.57.176.179:2712>", (shadow pid = 472)
> 11/23 14:29:36 Sent ad to 1 collectors for gquan@xxxxxxxxxx
> 11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:29:40 DaemonCore: Command received via TCP from host 
> <137.57.176.179:4906>
> 11/23 14:29:40 DaemonCore: received command 443 
> (VACATE_SERVICE), calling handler (vacate_service)
> 11/23 14:29:40 Got VACATE_SERVICE from <137.57.176.179:4906>
> 11/23 14:29:40 Sent RELEASE_CLAIM to startd on <137.57.176.179:2712>
> 11/23 14:29:40 Match record (<137.57.176.179:2712>, 19, 130) deleted
> 11/23 14:29:40 DaemonCore: Command received via UDP from host 
> <137.57.142.51:4133>
> 11/23 14:29:40 DaemonCore: received command 60001 
> (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 
> 11/23 14:29:40 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 11/23 14:29:40 Null parameter --- match not deleted
> 11/23 14:29:44 Started shadow for job 19.159 on 
> "<137.57.176.179:2712>", (shadow pid = 2972)
> 11/23 14:29:44 Sent ad to 1 collectors for gquan@xxxxxxxxxx
> 11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:29:45 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:29:45 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:02 DaemonCore: Command received via UDP from host 
> <137.57.142.51:4146>
> 11/23 14:30:02 DaemonCore: received command 60001 
> (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 
> 11/23 14:30:05 condor_read(): recv() returned -1, errno = 
> 10054, assuming failure.
> 11/23 14:30:05 Response problem from startd.
> 11/23 14:30:05 Sent RELEASE_CLAIM to startd on <137.57.176.182:1151>
> 11/23 14:30:05 Match record (<137.57.176.182:1151>, 19, 129) deleted
> 11/23 14:30:07 Started shadow for job 19.130 on 
> "<137.57.176.182:1151>", (shadow pid = 1036)
> 11/23 14:30:07 Sent ad to 1 collectors for gquan@xxxxxxxxxx
> 11/23 14:30:07 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:30:08 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:08 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:30:08 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:13 DaemonCore: Command received via TCP from host 
> <137.57.176.182:4778>
> 11/23 14:30:13 DaemonCore: received command 443 
> (VACATE_SERVICE), calling handler (vacate_service)
> 11/23 14:30:13 Got VACATE_SERVICE from <137.57.176.182:4778>
> 11/23 14:30:13 Sent RELEASE_CLAIM to startd on <137.57.176.182:1151>
> 11/23 14:30:13 Match record (<137.57.176.182:1151>, 19, 130) deleted
> 11/23 14:30:13 DaemonCore: Command received via UDP from host 
> <137.57.142.51:4176>
> 11/23 14:30:13 DaemonCore: received command 60001 
> (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 
> 11/23 14:30:13 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 11/23 14:30:13 Null parameter --- match not deleted
> 11/23 14:30:17 Started shadow for job 19.133 on 
> "<137.57.176.182:1151>", (shadow pid = 2300)
> 11/23 14:30:17 Sent ad to 1 collectors for gquan@xxxxxxxxxx
> 11/23 14:30:17 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:30:17 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:17 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:30:17 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:42 DaemonCore: Command received via UDP from host 
> <137.57.142.51:4190>
> 11/23 14:30:42 DaemonCore: received command 60001 
> (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 
> 11/23 14:30:45 Started shadow for job 19.130 on 
> "<137.57.176.180:1047>", (shadow pid = 3624)
> 11/23 14:30:45 Sent ad to 1 collectors for gquan@xxxxxxxxxx
> 11/23 14:30:45 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:30:46 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:46 timed out requesting claim from <137.57.176.180:1047>
> 11/23 14:30:46 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:52 DaemonCore: Command received via TCP from host 
> <137.57.176.180:3514>
> 11/23 14:30:52 DaemonCore: received command 443 
> (VACATE_SERVICE), calling handler (vacate_service)
> 11/23 14:30:52 Got VACATE_SERVICE from <137.57.176.180:3514>
> 11/23 14:30:52 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:52 Match record (<137.57.176.180:1047>, 19, 130) deleted
> 11/23 14:30:52 DaemonCore: Command received via UDP from host 
> <137.57.142.51:4204>
> 11/23 14:30:52 DaemonCore: received command 60001 
> (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 
> 11/23 14:30:52 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 11/23 14:30:52 Null parameter --- match not deleted
> 11/23 14:30:55 Response problem from startd.
> 11/23 14:30:55 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
> 11/23 14:30:55 Match record (<137.57.176.180:1047>, 19, 131) deleted
> 11/23 14:30:56 Response problem from startd.
> 11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>
> 11/23 14:30:56 Match record (<137.57.176.185:1407>, 19, 151) deleted
> 11/23 14:30:56 Response problem from startd.
> 11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
> 11/23 14:30:56 Match record (<137.57.176.183:4197>, 19, 147) deleted
> 11/23 14:30:56 Response problem from startd.
> 11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
> 11/23 14:30:56 Match record (<137.57.176.183:4197>, 19, 149) deleted
> 11/23 14:30:56 Response problem from startd.
> 11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>
> 11/23 14:30:56 Match record (<137.57.176.185:1407>, 19, 150) deleted
> 11/23 14:30:57 Response problem from startd.
> 11/23 14:30:57 Sent RELEASE_CLAIM to startd on <137.57.176.186:2147>
> 11/23 14:30:57 Match record (<137.57.176.186:2147>, 19, 155) deleted
> 11/23 14:30:57 Started shadow for job 19.130 on 
> "<137.57.176.180:1047>", (shadow pid = 2692)
> *** End of file SchedLog
> 
> 
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Questions about this message or Condor in general?
> Email address of the local Condor administrator: 
> swttcabca@xxxxxxxxxx The Official Condor Homepage is 
> http://www.cs.wisc.edu/condor
> 
> 
> 