[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Jobs no longer complete with upgrade to 7.0.0



Hey Dan,

Yes in that particular starter log there is an entry that matches but it occurred when I asked the system to go for a restart (and while another similar job was running):

5/23 10:29:24 In VanillaProc::PublishUpdateAd()
5/23 10:29:24 About to get usage data from ProcD for family with root 968
5/23 10:29:24 Result of "get_usage" operation from ProcD: SUCCESS
5/23 10:29:24 Inside OsProc::PublishUpdateAd()
5/23 10:29:24 Sent job ClassAd update to startd.
5/23 10:29:24 Leaving JICShadow::updateShadow(): success
5/23 10:30:53 DaemonCore: in SendAliveToParent()
5/23 10:30:53 DaemonCore: Leaving SendAliveToParent() - success
5/23 10:31:41 Got SIGTERM. Performing graceful shutdown.
5/23 10:31:41 Started timer to call main_shutdown_fast in 1800 seconds
5/23 10:31:41 ShutdownGraceful all jobs.
5/23 10:31:41 in VanillaProc::ShutdownGraceful()
5/23 10:31:41 called DaemonCore::Shutdown_Graceful(968)
5/23 10:31:41 return value from my_system for softkill: 0
5/23 10:31:41 Process exited, pid=968, status=-1073741510
5/23 10:31:41 in VanillaProc::JobCleanup()
5/23 10:31:41 About to kill family with root process 968 using the ProcD
5/23 10:31:41 Result of "signal_family" operation from ProcD: SUCCESS
5/23 10:31:41 Inside OsProc::JobCleanup()
5/23 10:31:41 TokenCache contents:
condor-reuse-slot1@.
5/23 10:31:41 TokenCache contents:
condor-reuse-slot1@.
5/23 10:31:41 Reaper: all=1 handled=1 ShuttingDown=1
5/23 10:31:41 Inside OsProc::JobExit()
5/23 10:31:41 In VanillaProc::PublishUpdateAd()
5/23 10:31:41 About to get usage data from ProcD for family with root 968
5/23 10:31:41 Result of "get_usage" operation from ProcD: SUCCESS
5/23 10:31:41 Inside OsProc::PublishUpdateAd()
5/23 10:31:41 Sent job ClassAd update to startd.
5/23 10:31:41 condor_write(): Socket closed when trying to write 200 bytes to <SUBMIT:1249>, fd is 1840, errno=10054
5/23 10:31:41 Buf::write(): condor_write() failed
5/23 10:31:41 Failed to send job exit status to shadow
5/23 10:31:41 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
5/23 10:31:41 About to unregister family with root 968 from the ProcD
5/23 10:31:41 Result of "unregister_family" operation from ProcD: SUCCESS
5/23 10:32:19 Accepted request to reconnect from <SUBMIT:1249>
5/23 10:32:19 Ignoring old shadow <SUBMIT:1248>
5/23 10:32:19 Communicating with shadow <SUBMIT:2076>
5/23 10:32:19 Shadow version: $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
5/23 10:32:19 Switching to new filetrans server, sock=<SUBMIT:2076> key=1#4836171f5be76a32
5/23 10:32:19 Closing old syscall sock <SUBMIT:1249>
5/23 10:32:19 Using new syscall sock <SUBMIT:2081>
5/23 10:32:19 Inside OsProc::JobExit()
5/23 10:32:19 In VanillaProc::PublishUpdateAd()
5/23 10:32:19 About to get usage data from ProcD for family with root 968
5/23 10:32:19 Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
5/23 10:32:19 error getting family usage in VanillaProc::PublishUpdateAd()
5/23 10:32:19 Job cleanup finished, now Starter is exiting
5/23 10:32:19 Removing C:\condor\execute\dir_3492
5/23 10:32:19 Attempting to remove C:\condor\execute\dir_3492 as SuperUser (system)
5/23 10:32:20 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0


2008/5/24 Dan Bradley <dan@xxxxxxxxxxxx>:
In your StarterLog, do you see any lines like the following?

Accepted request to reconnect from ...
Ignoring old shadow ...

--Dan

Alan Cass wrote:
> On the Shadow the error is:
>
> 5/22 07:19:03 (14967.0) (1660): condor_read(): recv() returned -1,
> errno = 10054, assuming failure reading 5 bytes from <HOST:1079>.
> 5/22 07:19:03 (14967.0) (1660): IO: Failed to read packet header
> 5/22 07:19:03 (14967.0) (1660): Can no longer talk to condor_starter
> <HOST:1079>
> 5/22 07:19:03 (14967.0) (1660): Trying to reconnect to disconnected job
> 5/22 07:19:03 (14967.0) (1660): LastJobLeaseRenewal: 1211399360 Thu
> May 22 05:19:20 2008
> 5/22 07:19:03 (14967.0) (1660): JobLeaseDuration: 1200 seconds
> 5/22 07:19:03 (14967.0) (1660): JobLeaseDuration remaining: EXPIRED!
> 5/22 07:19:03 (14967.0) (1660): Reconnect FAILED: Job disconnected too
> long: JobLeaseDuration (1200 seconds) expired
> 5/22 07:19:03 (14967.0) (1660): **** condor_shadow (condor_SHADOW)
> EXITING WITH STATUS 107
>
> and on the STARTER this looks like:
>
> 5/22 07:16:12 In VanillaProc::PublishUpdateAd()
> 5/22 07:16:12 About to get usage data from ProcD for family with root 1364
> 5/22 07:16:13 Result of "get_usage" operation from ProcD: SUCCESS
> 5/22 07:16:13 Inside OsProc::PublishUpdateAd()
> 5/22 07:16:13 condor_write(): Socket closed when trying to write 204
> bytes to <SUBMIT:1647>, fd is 1840, errno=10054
> 5/22 07:16:13 Buf::write(): condor_write() failed
> 5/22 07:16:13 Sent job ClassAd update to startd.
> 5/22 07:16:13 JICShadow::updateShadow(): failed to send update
> 5/22 07:16:44 Getting monitoring info for pid 2972
> 5/22 07:20:45 Getting monitoring info for pid 2972
> 5/22 07:20:54 Got SIGTERM. Performing graceful shutdown.
> 5/22 07:20:54 Started timer to call main_shutdown_fast in 1800 seconds
> 5/22 07:20:54 ShutdownGraceful all jobs.
> 5/22 07:20:54 in VanillaProc::ShutdownGraceful()
> 5/22 07:20:54 called DaemonCore::Shutdown_Graceful(1364)
> 5/22 07:20:54 return value from my_system for softkill: 0
> 5/22 07:20:55 Process exited, pid=1364, status=-1073741510
> 5/22 07:20:55 in VanillaProc::JobCleanup()
> 5/22 07:20:55 About to kill family with root process 1364 using the ProcD
> 5/22 07:20:55 Result of "signal_family" operation from ProcD: SUCCESS
> 5/22 07:20:55 Inside OsProc::JobCleanup()
> 5/22 07:20:55 TokenCache contents:
> condor-reuse-slot1@.
> 5/22 07:20:55 TokenCache contents:
> condor-reuse-slot1@.
> 5/22 07:20:55 Reaper: all=1 handled=1 ShuttingDown=1
> 5/22 07:20:55 Inside OsProc::JobExit()
>
> This was after 2 hours of updates that had
>
> 5/22 05:25:55 condor_write(): Socket closed when trying to write 204
> bytes to <SUBMIT:1647>, fd is 1840, errno=10054
> 5/22 05:25:55 Buf::write(): condor_write() failed
> 5/22 05:25:55 Sent job ClassAd update to startd.
> 5/22 05:25:55 JICShadow::updateShadow(): failed to send update
>
>
>
> 2008/5/23 Dan Bradley <dan@xxxxxxxxxxxx <mailto:dan@xxxxxxxxxxxx>>:
>
>
>     At the time when the shadow log indicates a read failure on the
>     connection to the starter, what appears in the  corresponding
>     StarterLog?
>
>     --Dan
>
>     Alan Cass wrote:
>
>     > Hi,
>     >
>     > I've upgraded to Condor 7.0.0 on our cluster of Student Lab Windows
>     > PCs but have not been able to have a job complete that takes a
>     'long'
>     > amount of time since. The jobs do the computation (since I can
>     see the
>     > updates being applied to the SIZE in condor_q). As a test I sent a
>     > node a 7MB file and got it to 'touch' it so it would be
>     automatically
>     > sent back. This works without a problem. However, if I tell the node
>     > to 'sleep' for 7 hours before exiting it will never finish,
>     > communication with the starter fails, the job requeues and this
>     > behaviour cycles.
>     >
>     > I'm worried it might be a problem with the University port scanner.
>     > Every so often I get an entry like this in the nodes' Starter
>     log (and
>     > similar in the Master log) file:
>     >
>     > 5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054,
>     > assuming failure reading 4 bytes from <SCANNER_IP:PORT>.
>     > 5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054,
>     > assuming failure reading 5 bytes from <SCANNER_IP:PORT>.
>     > 5/21 07:11:34 IO: Failed to read packet header
>     > 5/21 07:11:34 DaemonCore: Can't receive command request from
>     > SCANNER_IP (perhaps a timeout?)
>     > 5/21 07:11:37 IO: Incoming packet header unrecognized
>     > 5/21 07:11:37 DaemonCore: Can't receive command request from
>     > SCANNER_IP (perhaps a timeout?)
>     > 5/21 07:11:37 condor_read(): Socket closed when trying to read 4
>     bytes
>     > from <SCANNER_IP:PORT>
>     > 5/21 07:11:37 condor_read(): Socket closed when trying to read 5
>     bytes
>     > from <SCANNER_IP:PORT>
>     > 5/21 07:11:37 IO: EOF reading packet header
>     > 5/21 07:11:37 DaemonCore: Can't receive command request from
>     > SCANNER_IP (perhaps a timeout?)
>     > 5/21 07:11:40 Received HTTP GET connection from <SCANNER_IP:PORT> --
>     > DENIED because ENABLE_WEB_SERVER=FALSE
>     > 5/21 07:11:40 IO: Incoming packet header unrecognized
>     > 5/21 07:11:40 DaemonCore: Can't receive command request from
>     > SCANNER_IP (perhaps a timeout?)
>     > 5/21 07:11:40 condor_read(): Socket closed when trying to read 4
>     bytes
>     > from <SCANNER_IP:PORT>
>     > 5/21 07:11:40 condor_read(): Socket closed when trying to read 5
>     bytes
>     > from <SCANNER_IP:PORT>
>     > 5/21 07:11:40 IO: EOF reading packet header
>     > 5/21 07:11:40 DaemonCore: Can't receive command request from
>     > SCANNER_IP (perhaps a timeout?)
>     > 5/21 07:11:45 Entering JICShadow::updateShadow()
>     > 5/21 07:11:45 TokenCache contents:
>     > condor-reuse-slot1@.
>     > 5/21 07:11:45 In VanillaProc::PublishUpdateAd()
>     > 5/21 07:11:45 About to get usage data from ProcD for family with
>     root 4036
>     > 5/21 07:11:45 Result of "get_usage" operation from ProcD: SUCCESS
>     > 5/21 07:11:45 Inside OsProc::PublishUpdateAd()
>     > 5/21 07:11:45 Sent job ClassAd update to startd.
>     > 5/21 07:11:45 Leaving JICShadow::updateShadow(): success
>     > 5/21 07:11:49 condor_read(): Socket closed when trying to read 4
>     bytes
>     > from <SCANNER_IP:PORT>
>     > 5/21 07:11:49 condor_read(): Socket closed when trying to read 5
>     bytes
>     > from <SCANNER_IP:PORT>
>     > 5/21 07:11:49 IO: EOF reading packet header
>     > 5/21 07:11:49 DaemonCore: Can't receive command request from
>     > SCANNER_IP (perhaps a timeout?)
>     >
>     >
>     > and the shadow eventually bombs out with:
>     >
>     > 5/21 23:11:22 (14933.0) (3964): condor_read(): recv() returned -1,
>     > errno = 10054, assuming failure reading 5 bytes from <EXEC_IP:PORT>.
>     > 5/21 23:11:22 (14933.0) (3964): IO: Failed to read packet header
>     > 5/21 23:11:22 (14933.0) (3964): Can no longer talk to condor_starter
>     > <EXEC_IP:PORT>
>     > 5/21 23:11:22 (14933.0) (3964): Trying to reconnect to
>     disconnected job
>     > 5/21 23:11:22 (14933.0) (3964): LastJobLeaseRenewal: 1211370100 Wed
>     > May 21 21:11:40 2008
>     > 5/21 23:11:22 (14933.0) (3964): JobLeaseDuration: 1200 seconds
>     > 5/21 23:11:22 (14933.0) (3964): JobLeaseDuration remaining: EXPIRED!
>     > 5/21 23:11:22 (14933.0) (3964): Reconnect FAILED: Job
>     disconnected too
>     > long: JobLeaseDuration (1200 seconds) expired
>     > 5/21 23:11:22 (14933.0) (3964): **** condor_shadow (condor_SHADOW)
>     > EXITING WITH STATUS 107
>     >
>     >
>     >
>     > Is the scanner somehow stealing the starter port and not
>     allowing the
>     > shadow to get information back? What settings can I give the
>     config to
>     > get it to completely ignore anything coming from the port
>     scanner? Or
>     > could it be something else?
>     >
>     > Thanks,
>     >
>     > Alan
>     >
>     >------------------------------------------------------------------------
>     >
>     >_______________________________________________
>     >Condor-users mailing list
>     >To unsubscribe, send a message to
>     condor-users-request@xxxxxxxxxxx
>     <mailto:condor-users-request@xxxxxxxxxxx> with a
>     >subject: Unsubscribe
>     >You can also unsubscribe by visiting
>     >https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>     >
>     >The archives can be found at:
>     >https://lists.cs.wisc.edu/archive/condor-users/
>     >
>     >
>     _______________________________________________
>     Condor-users mailing list
>     To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
>     <mailto:condor-users-request@xxxxxxxxxxx> with a
>     subject: Unsubscribe
>     You can also unsubscribe by visiting
>     https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>     The archives can be found at:
>     https://lists.cs.wisc.edu/archive/condor-users/
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/