[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Jobs no longer complete with upgrade to 7.0.0



In your StarterLog, do you see any lines like the following?

Accepted request to reconnect from ...
Ignoring old shadow ...

--Dan

Alan Cass wrote:
On the Shadow the error is:

5/22 07:19:03 (14967.0) (1660): condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <HOST:1079>.
5/22 07:19:03 (14967.0) (1660): IO: Failed to read packet header
5/22 07:19:03 (14967.0) (1660): Can no longer talk to condor_starter <HOST:1079>
5/22 07:19:03 (14967.0) (1660): Trying to reconnect to disconnected job
5/22 07:19:03 (14967.0) (1660): LastJobLeaseRenewal: 1211399360 Thu May 22 05:19:20 2008
5/22 07:19:03 (14967.0) (1660): JobLeaseDuration: 1200 seconds
5/22 07:19:03 (14967.0) (1660): JobLeaseDuration remaining: EXPIRED!
5/22 07:19:03 (14967.0) (1660): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (1200 seconds) expired 5/22 07:19:03 (14967.0) (1660): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107

and on the STARTER this looks like:

5/22 07:16:12 In VanillaProc::PublishUpdateAd()
5/22 07:16:12 About to get usage data from ProcD for family with root 1364
5/22 07:16:13 Result of "get_usage" operation from ProcD: SUCCESS
5/22 07:16:13 Inside OsProc::PublishUpdateAd()
5/22 07:16:13 condor_write(): Socket closed when trying to write 204 bytes to <SUBMIT:1647>, fd is 1840, errno=10054
5/22 07:16:13 Buf::write(): condor_write() failed
5/22 07:16:13 Sent job ClassAd update to startd.
5/22 07:16:13 JICShadow::updateShadow(): failed to send update
5/22 07:16:44 Getting monitoring info for pid 2972
5/22 07:20:45 Getting monitoring info for pid 2972
5/22 07:20:54 Got SIGTERM. Performing graceful shutdown.
5/22 07:20:54 Started timer to call main_shutdown_fast in 1800 seconds
5/22 07:20:54 ShutdownGraceful all jobs.
5/22 07:20:54 in VanillaProc::ShutdownGraceful()
5/22 07:20:54 called DaemonCore::Shutdown_Graceful(1364)
5/22 07:20:54 return value from my_system for softkill: 0
5/22 07:20:55 Process exited, pid=1364, status=-1073741510
5/22 07:20:55 in VanillaProc::JobCleanup()
5/22 07:20:55 About to kill family with root process 1364 using the ProcD
5/22 07:20:55 Result of "signal_family" operation from ProcD: SUCCESS
5/22 07:20:55 Inside OsProc::JobCleanup()
5/22 07:20:55 TokenCache contents:
condor-reuse-slot1@.
5/22 07:20:55 TokenCache contents:
condor-reuse-slot1@.
5/22 07:20:55 Reaper: all=1 handled=1 ShuttingDown=1
5/22 07:20:55 Inside OsProc::JobExit()

This was after 2 hours of updates that had

5/22 05:25:55 condor_write(): Socket closed when trying to write 204 bytes to <SUBMIT:1647>, fd is 1840, errno=10054
5/22 05:25:55 Buf::write(): condor_write() failed
5/22 05:25:55 Sent job ClassAd update to startd.
5/22 05:25:55 JICShadow::updateShadow(): failed to send update



2008/5/23 Dan Bradley <dan@xxxxxxxxxxxx <mailto:dan@xxxxxxxxxxxx>>:


    At the time when the shadow log indicates a read failure on the
    connection to the starter, what appears in the  corresponding
    StarterLog?

    --Dan

    Alan Cass wrote:

    > Hi,
    >
    > I've upgraded to Condor 7.0.0 on our cluster of Student Lab Windows
    > PCs but have not been able to have a job complete that takes a
    'long'
    > amount of time since. The jobs do the computation (since I can
    see the
    > updates being applied to the SIZE in condor_q). As a test I sent a
    > node a 7MB file and got it to 'touch' it so it would be
    automatically
    > sent back. This works without a problem. However, if I tell the node
    > to 'sleep' for 7 hours before exiting it will never finish,
    > communication with the starter fails, the job requeues and this
    > behaviour cycles.
    >
    > I'm worried it might be a problem with the University port scanner.
    > Every so often I get an entry like this in the nodes' Starter
    log (and
    > similar in the Master log) file:
    >
    > 5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054,
    > assuming failure reading 4 bytes from <SCANNER_IP:PORT>.
    > 5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054,
    > assuming failure reading 5 bytes from <SCANNER_IP:PORT>.
    > 5/21 07:11:34 IO: Failed to read packet header
    > 5/21 07:11:34 DaemonCore: Can't receive command request from
    > SCANNER_IP (perhaps a timeout?)
    > 5/21 07:11:37 IO: Incoming packet header unrecognized
    > 5/21 07:11:37 DaemonCore: Can't receive command request from
    > SCANNER_IP (perhaps a timeout?)
    > 5/21 07:11:37 condor_read(): Socket closed when trying to read 4
    bytes
    > from <SCANNER_IP:PORT>
    > 5/21 07:11:37 condor_read(): Socket closed when trying to read 5
    bytes
    > from <SCANNER_IP:PORT>
    > 5/21 07:11:37 IO: EOF reading packet header
    > 5/21 07:11:37 DaemonCore: Can't receive command request from
    > SCANNER_IP (perhaps a timeout?)
    > 5/21 07:11:40 Received HTTP GET connection from <SCANNER_IP:PORT> --
    > DENIED because ENABLE_WEB_SERVER=FALSE
    > 5/21 07:11:40 IO: Incoming packet header unrecognized
    > 5/21 07:11:40 DaemonCore: Can't receive command request from
    > SCANNER_IP (perhaps a timeout?)
    > 5/21 07:11:40 condor_read(): Socket closed when trying to read 4
    bytes
    > from <SCANNER_IP:PORT>
    > 5/21 07:11:40 condor_read(): Socket closed when trying to read 5
    bytes
    > from <SCANNER_IP:PORT>
    > 5/21 07:11:40 IO: EOF reading packet header
    > 5/21 07:11:40 DaemonCore: Can't receive command request from
    > SCANNER_IP (perhaps a timeout?)
    > 5/21 07:11:45 Entering JICShadow::updateShadow()
    > 5/21 07:11:45 TokenCache contents:
    > condor-reuse-slot1@.
    > 5/21 07:11:45 In VanillaProc::PublishUpdateAd()
    > 5/21 07:11:45 About to get usage data from ProcD for family with
    root 4036
    > 5/21 07:11:45 Result of "get_usage" operation from ProcD: SUCCESS
    > 5/21 07:11:45 Inside OsProc::PublishUpdateAd()
    > 5/21 07:11:45 Sent job ClassAd update to startd.
    > 5/21 07:11:45 Leaving JICShadow::updateShadow(): success
    > 5/21 07:11:49 condor_read(): Socket closed when trying to read 4
    bytes
    > from <SCANNER_IP:PORT>
    > 5/21 07:11:49 condor_read(): Socket closed when trying to read 5
    bytes
    > from <SCANNER_IP:PORT>
    > 5/21 07:11:49 IO: EOF reading packet header
    > 5/21 07:11:49 DaemonCore: Can't receive command request from
    > SCANNER_IP (perhaps a timeout?)
    >
    >
    > and the shadow eventually bombs out with:
    >
    > 5/21 23:11:22 (14933.0) (3964): condor_read(): recv() returned -1,
    > errno = 10054, assuming failure reading 5 bytes from <EXEC_IP:PORT>.
    > 5/21 23:11:22 (14933.0) (3964): IO: Failed to read packet header
    > 5/21 23:11:22 (14933.0) (3964): Can no longer talk to condor_starter
    > <EXEC_IP:PORT>
    > 5/21 23:11:22 (14933.0) (3964): Trying to reconnect to
    disconnected job
    > 5/21 23:11:22 (14933.0) (3964): LastJobLeaseRenewal: 1211370100 Wed
    > May 21 21:11:40 2008
    > 5/21 23:11:22 (14933.0) (3964): JobLeaseDuration: 1200 seconds
    > 5/21 23:11:22 (14933.0) (3964): JobLeaseDuration remaining: EXPIRED!
    > 5/21 23:11:22 (14933.0) (3964): Reconnect FAILED: Job
    disconnected too
    > long: JobLeaseDuration (1200 seconds) expired
    > 5/21 23:11:22 (14933.0) (3964): **** condor_shadow (condor_SHADOW)
    > EXITING WITH STATUS 107
    >
    >
    >
    > Is the scanner somehow stealing the starter port and not
    allowing the
    > shadow to get information back? What settings can I give the
    config to
    > get it to completely ignore anything coming from the port
    scanner? Or
    > could it be something else?
    >
    > Thanks,
    >
    > Alan
    >
    >------------------------------------------------------------------------
    >
    >_______________________________________________
    >Condor-users mailing list
    >To unsubscribe, send a message to
    condor-users-request@xxxxxxxxxxx
    <mailto:condor-users-request@xxxxxxxxxxx> with a
    >subject: Unsubscribe
    >You can also unsubscribe by visiting
    >https://lists.cs.wisc.edu/mailman/listinfo/condor-users
    >
    >The archives can be found at:
    >https://lists.cs.wisc.edu/archive/condor-users/
    >
    >
    _______________________________________________
    Condor-users mailing list
    To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
    <mailto:condor-users-request@xxxxxxxxxxx> with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/condor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/condor-users/


------------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/