
Re: [Condor-users] Jobs no longer complete with upgrade to 7.0.0




At the time when the shadow log indicates a read failure on the connection to the starter, what appears in the corresponding StarterLog?

--Dan

Alan Cass wrote:

Hi,

I've upgraded to Condor 7.0.0 on our cluster of student lab Windows PCs, and since then no job that runs for a 'long' time has completed. The jobs do perform the computation (I can see the SIZE being updated in condor_q). As a test, I sent a node a 7MB file and had it 'touch' the file so that it would be sent back automatically; this works without a problem. However, if I tell the node to 'sleep' for 7 hours before exiting, the job never finishes: communication with the starter fails, the job is requeued, and the cycle repeats.

I'm worried it might be a problem with the University port scanner. Every so often I get entries like these in each node's Starter log file (and similar ones in the Master log):

5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <SCANNER_IP:PORT>.
5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <SCANNER_IP:PORT>.
5/21 07:11:34 IO: Failed to read packet header
5/21 07:11:34 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
5/21 07:11:37 IO: Incoming packet header unrecognized
5/21 07:11:37 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
5/21 07:11:37 condor_read(): Socket closed when trying to read 4 bytes from <SCANNER_IP:PORT>
5/21 07:11:37 condor_read(): Socket closed when trying to read 5 bytes from <SCANNER_IP:PORT>
5/21 07:11:37 IO: EOF reading packet header
5/21 07:11:37 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
5/21 07:11:40 Received HTTP GET connection from <SCANNER_IP:PORT> -- DENIED because ENABLE_WEB_SERVER=FALSE
5/21 07:11:40 IO: Incoming packet header unrecognized
5/21 07:11:40 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
5/21 07:11:40 condor_read(): Socket closed when trying to read 4 bytes from <SCANNER_IP:PORT>
5/21 07:11:40 condor_read(): Socket closed when trying to read 5 bytes from <SCANNER_IP:PORT>
5/21 07:11:40 IO: EOF reading packet header
5/21 07:11:40 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
5/21 07:11:45 Entering JICShadow::updateShadow()
5/21 07:11:45 TokenCache contents:
condor-reuse-slot1@.
5/21 07:11:45 In VanillaProc::PublishUpdateAd()
5/21 07:11:45 About to get usage data from ProcD for family with root 4036
5/21 07:11:45 Result of "get_usage" operation from ProcD: SUCCESS
5/21 07:11:45 Inside OsProc::PublishUpdateAd()
5/21 07:11:45 Sent job ClassAd update to startd.
5/21 07:11:45 Leaving JICShadow::updateShadow(): success
5/21 07:11:49 condor_read(): Socket closed when trying to read 4 bytes from <SCANNER_IP:PORT>
5/21 07:11:49 condor_read(): Socket closed when trying to read 5 bytes from <SCANNER_IP:PORT>
5/21 07:11:49 IO: EOF reading packet header
5/21 07:11:49 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
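
One way to check whether every read failure in a log really comes from the scanner, and none from the submit machine, is to tally the condor_read() errors by peer address. The following is only a rough sketch: the regular expression is an assumption based on the lines quoted in this message, the function name tally_read_errors is made up for illustration, and SCANNER_IP / EXEC_IP are the redacted placeholders used in this post (a real log would contain numeric addresses).

```python
import re
from collections import Counter

# Assumed pattern, based only on the condor_read() lines quoted in this thread:
# "... condor_read(): <reason> ... bytes from <HOST:PORT>"
READ_ERR = re.compile(r"condor_read\(\):.*?bytes from <([^:>]+):")

def tally_read_errors(lines):
    """Count condor_read() failures per peer host (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = READ_ERR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the logs in this message.
sample = [
    "5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054, "
    "assuming failure reading 4 bytes from <SCANNER_IP:PORT>.",
    "5/21 07:11:37 condor_read(): Socket closed when trying to read 4 bytes "
    "from <SCANNER_IP:PORT>",
    "5/21 07:11:45 Sent job ClassAd update to startd.",
    "5/21 23:11:22 (14933.0) (3964): condor_read(): recv() returned -1, "
    "errno = 10054, assuming failure reading 5 bytes from <EXEC_IP:PORT>.",
]

counts = tally_read_errors(sample)
for peer, n in counts.most_common():
    print(peer, n)
```

If a peer other than the scanner shows up in the Starter log tally, that would suggest the failures are not limited to scanner traffic.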


and the shadow eventually bombs out with:

5/21 23:11:22 (14933.0) (3964): condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <EXEC_IP:PORT>.
5/21 23:11:22 (14933.0) (3964): IO: Failed to read packet header
5/21 23:11:22 (14933.0) (3964): Can no longer talk to condor_starter <EXEC_IP:PORT>
5/21 23:11:22 (14933.0) (3964): Trying to reconnect to disconnected job
5/21 23:11:22 (14933.0) (3964): LastJobLeaseRenewal: 1211370100 Wed May 21 21:11:40 2008
5/21 23:11:22 (14933.0) (3964): JobLeaseDuration: 1200 seconds
5/21 23:11:22 (14933.0) (3964): JobLeaseDuration remaining: EXPIRED!
5/21 23:11:22 (14933.0) (3964): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (1200 seconds) expired
5/21 23:11:22 (14933.0) (3964): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
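
To make the lease arithmetic in the shadow log explicit, here is a small sketch that recomputes it from the logged values. It assumes the lease simply expires JobLeaseDuration seconds after LastJobLeaseRenewal, which is what the log messages suggest; it is not taken from the condor_shadow source. The function name lease_remaining is invented for this example, and the timestamps are treated as local wall-clock times as printed in the log.

```python
from datetime import datetime, timedelta

def lease_remaining(last_renewal, now, lease_seconds):
    """Time left on the lease; negative once it has expired (assumed model)."""
    return last_renewal + timedelta(seconds=lease_seconds) - now

# Values copied from the shadow log above (local time, year 2008).
last_renewal = datetime(2008, 5, 21, 21, 11, 40)     # LastJobLeaseRenewal
disconnect_seen = datetime(2008, 5, 21, 23, 11, 22)  # time of the recv() failure
lease_seconds = 1200                                 # JobLeaseDuration

silent_for = disconnect_seen - last_renewal
remaining = lease_remaining(last_renewal, disconnect_seen, lease_seconds)

print(f"no lease renewal for {silent_for}")
print(f"lease remaining: {remaining.total_seconds():.0f} s")
```

With these values the last renewal is almost two hours older than the failure, so the 1200-second lease had run out long before the shadow noticed the broken connection.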



Is the scanner somehow stealing the starter port and not allowing the shadow to get information back? What settings can I give the config to get it to completely ignore anything coming from the port scanner? Or could it be something else?

Thanks,

Alan
