
[Condor-users] Strange schedd crash (exit status 44)



I get a schedd crash from this user's machine every time he queues up 100
or more jobs. What does exit status 44 indicate?

Thanks!
Ian

-----Original Message-----
From: SYSTEM@xxxxxxxxxx [mailto:SYSTEM@xxxxxxxxxx] 
Sent: November 23, 2004 2:32 PM
To: SW TOR Batch System Admins
Subject: [Condor] Problem

This is an automated email from the Condor system on machine
"TTC-GQUAN3.altera.priv.altera.com".  Do not reply.

"d:\abc\condor/bin/condor_schedd.exe" on
"TTC-GQUAN3.altera.priv.altera.com" exited with status 44.
Condor will automatically restart this process in 10 seconds.

*** Last 100 line(s) of file SchedLog:
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.180:1047>#1100637096#282" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.182:1151>#1099422886#1224" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.182:1151>#1099422886#1223" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.183:4197>#1099203124#1580" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.183:4197>#1099203124#1579" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.185:1407>#1099202749#1981" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.185:1407>#1099202749#1982" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.177:1213>#1100703290#277" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.186:2147>#1099203682#1256" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.177:1213>#1100703290#276" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.186:2147>#1099203682#1257" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.178:3591>#1099202664#1406" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.179:2712>#1099202607#1468" ignored
11/23 14:28:59 attempt to add pre-existing match
"<137.57.176.179:2712>#1099202607#1467" ignored
11/23 14:29:31 DaemonCore: Command received via UDP from host
<137.57.142.51:4119>
11/23 14:29:31 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/23 14:29:36 Started shadow for job 19.130 on "<137.57.176.179:2712>",
(shadow pid = 472)
11/23 14:29:36 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:40 DaemonCore: Command received via TCP from host
<137.57.176.179:4906>
11/23 14:29:40 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/23 14:29:40 Got VACATE_SERVICE from <137.57.176.179:4906>
11/23 14:29:40 Sent RELEASE_CLAIM to startd on <137.57.176.179:2712>
11/23 14:29:40 Match record (<137.57.176.179:2712>, 19, 130) deleted
11/23 14:29:40 DaemonCore: Command received via UDP from host
<137.57.142.51:4133>
11/23 14:29:40 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/23 14:29:40 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/23 14:29:40 Null parameter --- match not deleted
11/23 14:29:44 Started shadow for job 19.159 on "<137.57.176.179:2712>",
(shadow pid = 2972)
11/23 14:29:44 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:45 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:45 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:02 DaemonCore: Command received via UDP from host
<137.57.142.51:4146>
11/23 14:30:02 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/23 14:30:05 condor_read(): recv() returned -1, errno = 10054,
assuming failure.
11/23 14:30:05 Response problem from startd.
11/23 14:30:05 Sent RELEASE_CLAIM to startd on <137.57.176.182:1151>
11/23 14:30:05 Match record (<137.57.176.182:1151>, 19, 129) deleted
11/23 14:30:07 Started shadow for job 19.130 on "<137.57.176.182:1151>",
(shadow pid = 1036)
11/23 14:30:07 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:30:07 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:08 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:08 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:08 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:13 DaemonCore: Command received via TCP from host
<137.57.176.182:4778>
11/23 14:30:13 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/23 14:30:13 Got VACATE_SERVICE from <137.57.176.182:4778>
11/23 14:30:13 Sent RELEASE_CLAIM to startd on <137.57.176.182:1151>
11/23 14:30:13 Match record (<137.57.176.182:1151>, 19, 130) deleted
11/23 14:30:13 DaemonCore: Command received via UDP from host
<137.57.142.51:4176>
11/23 14:30:13 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/23 14:30:13 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/23 14:30:13 Null parameter --- match not deleted
11/23 14:30:17 Started shadow for job 19.133 on "<137.57.176.182:1151>",
(shadow pid = 2300)
11/23 14:30:17 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:30:17 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:17 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:17 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:17 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:42 DaemonCore: Command received via UDP from host
<137.57.142.51:4190>
11/23 14:30:42 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/23 14:30:45 Started shadow for job 19.130 on "<137.57.176.180:1047>",
(shadow pid = 3624)
11/23 14:30:45 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:30:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:46 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:46 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:46 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:52 DaemonCore: Command received via TCP from host
<137.57.176.180:3514>
11/23 14:30:52 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/23 14:30:52 Got VACATE_SERVICE from <137.57.176.180:3514>
11/23 14:30:52 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:52 Match record (<137.57.176.180:1047>, 19, 130) deleted
11/23 14:30:52 DaemonCore: Command received via UDP from host
<137.57.142.51:4204>
11/23 14:30:52 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

11/23 14:30:52 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/23 14:30:52 Null parameter --- match not deleted
11/23 14:30:55 Response problem from startd.
11/23 14:30:55 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:55 Match record (<137.57.176.180:1047>, 19, 131) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>
11/23 14:30:56 Match record (<137.57.176.185:1407>, 19, 151) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
11/23 14:30:56 Match record (<137.57.176.183:4197>, 19, 147) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
11/23 14:30:56 Match record (<137.57.176.183:4197>, 19, 149) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>
11/23 14:30:56 Match record (<137.57.176.185:1407>, 19, 150) deleted
11/23 14:30:57 Response problem from startd.
11/23 14:30:57 Sent RELEASE_CLAIM to startd on <137.57.176.186:2147>
11/23 14:30:57 Match record (<137.57.176.186:2147>, 19, 155) deleted
11/23 14:30:57 Started shadow for job 19.130 on "<137.57.176.180:1047>",
(shadow pid = 2692)
*** End of file SchedLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: swttcabca@xxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor
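
For triaging an excerpt like the SchedLog above, a small script can collapse
the repeated entries into counts so the unusual lines stand out. The following
is a minimal sketch, not a Condor tool: the function names (join_wrapped,
normalize, summarize) and the normalization rules are assumptions made for
illustration, and the only thing it relies on is the "MM/DD HH:MM:SS" timestamp
prefix seen in the excerpt. It does not interpret exit status 44.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the excerpt, e.g. "11/23 14:29:36 ...".
TIMESTAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")

def join_wrapped(lines):
    """Merge mail-wrapped continuation lines into the timestamped entry before them."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between entries
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries

def normalize(entry):
    """Drop the timestamp and mask values that vary between otherwise equal entries."""
    m = TIMESTAMP.match(entry)
    text = m.group(2) if m else entry
    # Host:port addresses, optionally quoted and followed by a "#n#n" claim suffix.
    text = re.sub(r'"?<\d+\.\d+\.\d+\.\d+:\d+>(#\d+#\d+)?"?', "<ADDR>", text)
    text = re.sub(r"job \d+\.\d+", "job N.N", text)    # cluster.proc job ids
    text = re.sub(r"pid = \d+", "pid = N", text)       # shadow process ids
    text = re.sub(r", \d+, \d+\)", ", N, N)", text)    # match record tuples
    return text

def summarize(lines):
    """Count how often each normalized message occurs."""
    return Counter(normalize(e) for e in join_wrapped(lines))

# A few lines copied from the excerpt above, including one wrapped entry.
sample = """\
11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:05 condor_read(): recv() returned -1, errno = 10054,
assuming failure.
""".splitlines()

for message, count in summarize(sample).most_common():
    print(count, message)
```

To run it against a real log, pass the lines of the file instead of the sample,
for example summarize(open(path).read().splitlines()), where path is wherever
the SchedLog lives on the submit machine. Error codes such as the errno value
are deliberately left unmasked so they remain visible in the summary.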