
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



On Wednesday 22 March 2017 17:54:04 Jason Patton wrote:
> If it works for your condor pool, you can have condor communicate over your
> infiniband network by setting NETWORK_INTERFACE appropriately. The daemons
> should continue to listen on all interfaces as long as BIND_ALL_INTERFACES
> is set to true.

O.k., this is a good hint. We have restricted the scheduler's NETWORK_INTERFACE
to the private ethernet address, but we could also use
NETWORK_INTERFACE = 192.168.*
Is a condor_reconfig on this host, which has
DAEMON_LIST = SCHEDD, COLLECTOR, MASTER, NEGOTIATOR
enough, and will all vanilla jobs keep running?
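For concreteness, this is the kind of change I mean in the local config of the
scheduler host (just a sketch: 192.168.* stands for our private ethernet range
and may need adjusting, and whether a plain condor_reconfig picks it up without
disturbing the running vanilla jobs is exactly what I am asking):

  # condor_config.local on the schedd/collector/negotiator host (sketch)
  NETWORK_INTERFACE   = 192.168.*
  BIND_ALL_INTERFACES = True

followed by running condor_reconfig on that host.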

> 
> Assuming that both your vanilla and parallel jobs were disconnected for the
> same amount of time, it does sound like you've found a bug with the
> reconnection of parallel jobs. We will see if we can reproduce the behavior
> here.

I assume this is the case, because there are always vanilla jobs and starters
from parallel jobs running on node 3 (some of the vanilla jobs have been
running there since the cluster was last rebooted).

Harald
> 
> Jason Patton
> 
> On Tue, Mar 21, 2017 at 12:11 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
> wrote:
> > Hello all,
> > 
> > now I have found the reason for the problems:
> > Our Cisco SG200-50 switch reboots every couple of days, sometimes even every
> > couple of hours. I will try a firmware update or replace it, but in the
> > 100-day cluster that is not so easy.
> > 
> > From the condor side, the difference seems to be that for a parallel
> > universe job we get:
> > 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to
> > starter, so job exiting
> > whereas for a vanilla universe job we get:
> > 03/20/17 07:54:13 (3854.0) (1465927): Trying to reconnect to disconnected
> > job
> > 
> > Is this just a feature to avoid problems with mpi, or could one configure
> > condor to try the same reconnect as in the vanilla universe, given that
> > mpi runs over infiniband?
> > 
> > Or do you suggest making all condor communication go via infiniband? Or
> > should/can I add a second network?
> > 
> > Nevertheless, if we can solve the problem of the programs that are not
> > terminated, we can go into production with openmpi soon.
> > 
> > Best regards
> > Harald
> > 
> > On Monday 20 March 2017 18:11:47 Harald van Pee wrote:
> > > Hello,
> > > 
> > > here is my status update:
> > > 
> > > - We are now sure that this problem was never seen for the vanilla
> > > universe; the only job in question was restarted because of a broken
> > > node.
> > > - The new options in openmpiscript from htcondor 8.6.1 make no
> > > difference. In particular, excluding ethernet interfaces does not help,
> > > as Jason Patton assumed based on the openmpi documentation.
> > > 
> > > - It is clearly caused by minor ethernet problems, because:
> > >  a) We see connection problems to several nodes at the same time, but
> > >  all connections except some from the parallel universe are reestablished.
> > >  b) We have identified 3 nodes which cause more problems than others;
> > >  if we exclude these nodes, we have managed to run several mpi jobs with
> > >  40 mpi nodes for longer than 16 days without restarts (the jobs were
> > >  removed by the user or finished).
> > > But there is no reason to assume that these nodes have any severe
> > > problem, because we see no mpi errors even with high verbosity, and on
> > > the problem nodes there are vanilla jobs running for up to 101 days now.
> > > 
> > > - Here is what happened today:
> > > Summary: Condor assumes that a starter on node 3 has a problem and sends
> > > a kill command. Even though condor assumes that the kill command could
> > > not be sent, it reaches node 2 as it should, because this was the first
> > > node, the one where mpirun is running.
> > > It also sends SIGQUIT before SIGTERM, which I had not expected, and maybe
> > > this is the reason why our trap handler does not work, because it expects
> > > only SIGTERM? In that case some mpi programs are still running after the
> > > job has been removed from condor.
> > > Later the starters of job 3939 on node 3 also get a kill signal and
> > > handle it. But this is proof that these starters were still alive and
> > > there was no reason to kill them, right?
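(Regarding the SIGQUIT/SIGTERM point above, a minimal sketch of a wrapper that
traps both signals; this is only an illustration, not our actual script, and
the real wrapper around mpirun is of course more involved:)

  #!/bin/bash
  # Hypothetical wrapper sketch: forward SIGTERM *and* SIGQUIT to mpirun so
  # the mpi processes get cleaned up when condor shuts the starter down.
  cleanup () {
      if [ -n "$MPIRUN_PID" ]; then
          kill -TERM "$MPIRUN_PID" 2>/dev/null   # ask mpirun to shut down
          wait "$MPIRUN_PID" 2>/dev/null         # reap it before we exit
      fi
      exit 143
  }
  trap cleanup TERM QUIT

  mpirun "$@" &      # run mpirun in the background so the traps can fire
  MPIRUN_PID=$!
  wait "$MPIRUN_PID"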
> > > 
> > > Even if there are workarounds, I think this behaviour could and should
> > > be improved.
> > > Or can we expect that something has changed with htcondor 8.6.1?
> > > 
> > > Best
> > > Harald
> > > 
> > > Here are the most relevant log messages:
> > > 
> > > ShadowLog:
> > > 03/20/17 07:54:11 (3939.0) (1617771): condor_read() failed: recv(fd=10)
> > > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > > startd at <192.168.123.3:29143>.
> > > 03/20/17 07:54:11 (3939.0) (1617771): condor_read(): UNEXPECTED read
> > > timeout after 0s during non-blocking read from startd at
> > > <192.168.123.3:29143> (desired timeout=300s)
> > > 03/20/17 07:54:11 (3939.0) (1617771): IO: Failed to read packet header
> > > 03/20/17 07:54:11 (3939.0) (1617771): Can no longer talk to
> > > condor_starter <192.168.123.3:29143>
> > > 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to
> > > starter, so job exiting
> > > 03/20/17 07:54:12 (3939.0) (1617771): attempt to connect to
> > > <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> > > 03/20/17 07:54:12 (3939.0) (1617771): RemoteResource::killStarter():
> > > Could not send command to startd
> > > 03/20/17 07:54:15 (3939.0) (1617771): attempt to connect to
> > > <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> > > 03/20/17 07:54:15 (3939.0) (1617771): RemoteResource::killStarter():
> > > Could not send command to startd
> > > 03/20/17 07:54:21 (3939.0) (1617771): attempt to connect to
> > > <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> > > 03/20/17 07:54:21 (3939.0) (1617771): RemoteResource::killStarter():
> > > Could not send command to startd
> > > 03/20/17 07:54:24 (3939.0) (1617771): ERROR "Can no longer talk to
> > > condor_starter <192.168.123.3:29143>" at line 209 in file
> > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > 
> > > StarterLog.slot1_2 on node 2 (mpirun of job 3939.0)
> > > 03/20/17 07:54:11 (pid:1066066) Got SIGQUIT.  Performing fast shutdown.
> > > 03/20/17 07:54:11 (pid:1066066) ShutdownFast all jobs.
> > > 03/20/17 07:54:11 (pid:1066066) Got SIGTERM. Performing graceful
> > > shutdown.
> > > 03/20/17 07:54:11 (pid:1066066) ShutdownGraceful all jobs.
> > > 03/20/17 07:54:11 (pid:1066066) Process exited, pid=1066068, status=0
> > > 03/20/17 07:54:24 (pid:1066066) condor_read() failed: recv(fd=8)
> > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > from <192.168.123.100:18658>.
> > > 03/20/17 07:54:24 (pid:1066066) IO: Failed to read packet header
> > > 03/20/17 07:54:24 (pid:1066066) Lost connection to shadow, waiting 2400
> > > secs for reconnect
> > > 03/20/17 07:54:24 (pid:1066066) Failed to send job exit status to shadow
> > > 03/20/17 07:54:24 (pid:1066066) Last process exited, now Starter is
> > > exiting
> > > 03/20/17 07:54:24 (pid:1066066) **** condor_starter (condor_STARTER)
> > > pid 1066066 EXITING WITH STATUS 0
> > > 
> > > 
> > > StarterLog.slot1_2 on node 3 running job 3939.0
> > > 03/20/17 07:54:33 (pid:1056820) condor_read() failed: recv(fd=8)
> > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > from <192.168.123.100:24154>.
> > > 03/20/17 07:54:33 (pid:1056820) IO: Failed to read packet header
> > > 03/20/17 07:54:33 (pid:1056820) Lost connection to shadow, waiting 2400
> > > secs for reconnect
> > > 03/20/17 07:54:33 (pid:1056820) Got SIGTERM. Performing graceful
> > > shutdown.
> > > 03/20/17 07:54:33 (pid:1056820) ShutdownGraceful all jobs.
> > > 03/20/17 07:54:33 (pid:1056820) Process exited, pid=1056824, status=0
> > > 03/20/17 07:54:33 (pid:1056820) Failed to send job exit status to shadow
> > > 03/20/17 07:54:33 (pid:1056820) Last process exited, now Starter is
> > > exiting
> > > 03/20/17 07:54:33 (pid:1056820) **** condor_starter (condor_STARTER)
> > > pid 1056820 EXITING WITH STATUS 0
> > > 
> > > StarterLog.slot1_3 on node 3 running job 3939.0
> > > 03/20/17 07:54:46 (pid:1056821) condor_read() failed: recv(fd=8)
> > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > from <192.168.123.100:3768>.
> > > 03/20/17 07:54:46 (pid:1056821) IO: Failed to read packet header
> > > 03/20/17 07:54:46 (pid:1056821) Lost connection to shadow, waiting 2400
> > > secs for reconnect
> > > 03/20/17 07:54:46 (pid:1056821) Got SIGTERM. Performing graceful
> > > shutdown.
> > > 03/20/17 07:54:46 (pid:1056821) ShutdownGraceful all jobs.
> > > 03/20/17 07:54:46 (pid:1056821) Process exited, pid=1056823, status=0
> > > 03/20/17 07:54:46 (pid:1056821) Failed to send job exit status to shadow
> > > 03/20/17 07:54:46 (pid:1056821) Last process exited, now Starter is
> > > exiting
> > > 03/20/17 07:54:46 (pid:1056821) **** condor_starter (condor_STARTER)
> > > pid 1056821 EXITING WITH STATUS 0
> > > 
> > > On Thursday 23 February 2017 15:12:38 Harald van Pee wrote:
> > > > Hello,
> > > > 
> > > > it happens again. What we have learned so far is:
> > > > - a communication problem occurs between the scheduler node and the
> > > > starter node
> > > > - condor kills the starter process and afterwards kills the job
> > > > - several different nodes are affected; due to lack of statistics we
> > > > can neither claim that all nodes are affected nor exclude that some
> > > > have more problems than others.
> > > > - it is very unlikely that the program itself has a problem, because
> > > > we have seen that 2 starter processes of 2 independent parallel jobs
> > > > were killed on the same node at the same time.
> > > > - at least within the last 2 months only parallel jobs are affected,
> > > > but there is no hint of an mpi problem; any help on how one can prove
> > > > that no mpi problem exists is welcome.
> > > > We have many more vanilla starters running than parallel ones.
> > > > 
> > > > This morning the same node as last week was affected. On this node
> > > > 9 single vanilla starters are running, 2 of them for more than 47
> > > > days, plus 5 starters of 2 parallel jobs; only one starter of one
> > > > parallel job was killed.
> > > > From the ShadowLog below, one can see that several starters from
> > > > several jobs, and not only those on node 37, have communication
> > > > problems, and the period during which the problem occurs is less than
> > > > one minute. Therefore I would expect no problem reconnecting to the
> > > > starters, and this is true for all vanilla jobs.
> > > > But why were the parallel starters killed so fast?
> > > > 
> > > > Any idea is welcome
> > > > Harald
> > > > 
> > > > ShadowLog (beginning 2 lines before and ending 2 lines after the
> > > > minute of the problem):
> > > > 02/23/17 07:10:59 (1835.0) (2412453): Job 1835.0 terminated: exited
> > > > with status 0
> > > > 02/23/17 07:10:59 (1835.0) (2412453): **** condor_shadow
> > > > (condor_SHADOW) pid 2412453 EXITING WITH STATUS 115
> > > > 02/23/17 07:16:02 (1209.3) (49060): condor_read() failed: recv(fd=4)
> > > > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > > > startd slot1@xxxxxxxxxxxxxxxxxxxxxxxx
> > > > 02/23/17 07:16:02 (1209.3) (49060): condor_read(): UNEXPECTED read
> > > > timeout after 0s during non-blocking read from startd
> > > > slot1@xxxxxxxxxxxxxxxxxxxxxxx (desired timeout=300s)
> > > > 02/23/17 07:16:02 (1209.3) (49060): IO: Failed to read packet header
> > > > 02/23/17 07:16:02 (1209.3) (49060): Can no longer talk to
> > > > condor_starter <192.168.123.37:30389>
> > > > 02/23/17 07:16:02 (1209.3) (49060): Trying to reconnect to
> > > > disconnected job
> > > > 02/23/17 07:16:02 (1209.3) (49060): LastJobLeaseRenewal: 1487830176
> > > > Thu Feb 23 07:09:36 2017
> > > > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration: 2400 seconds
> > > > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration remaining: 2014
> > > > 02/23/17 07:16:02 (1209.3) (49060): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:03 (1209.3) (49060): attempt to connect to
> > > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > > 02/23/17 07:16:03 (1209.3) (49060): locateStarter(): Failed to connect
> > > > to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> > > > 02/23/17 07:16:03 (1209.3) (49060): JobLeaseDuration remaining: 2399
> > > > 02/23/17 07:16:03 (1209.3) (49060): Scheduling another attempt to
> > > > reconnect in 8 seconds
> > > > 02/23/17 07:16:04 (1208.16) (46751): condor_read() failed: recv(fd=4)
> > > > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > > > starter at <192.168.123.51:49120>.
> > > > 02/23/17 07:16:04 (1208.16) (46751): condor_read(): UNEXPECTED read
> > > > timeout after 0s during non-blocking read from starter at
> > > > <192.168.123.51:49120> (desired timeout=300s)
> > > > 02/23/17 07:16:04 (1208.16) (46751): IO: Failed to read packet header
> > > > 02/23/17 07:16:04 (1208.16) (46751): Can no longer talk to
> > > > condor_starter <192.168.123.51:49120>
> > > > 02/23/17 07:16:04 (1208.16) (46751): JobLeaseDuration remaining: 2014
> > > > 02/23/17 07:16:04 (1208.16) (46751): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:05 (2143.0) (2719507): condor_read() failed:
> > > > recv(fd=25) returned -1, errno = 110 Connection timed out, reading 5
> > > > bytes from startd at <192.168.123.37:30389>.
> > > > 02/23/17 07:16:05 (2143.0) (2719507): condor_read(): UNEXPECTED read
> > > > timeout after 0s during non-blocking read from startd at
> > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > 02/23/17 07:16:05 (2143.0) (2719507): IO: Failed to read packet header
> > > > 02/23/17 07:16:05 (2143.0) (2719507): Can no longer talk to
> > > > condor_starter <192.168.123.37:30389>
> > > > 02/23/17 07:16:05 (2143.0) (2719507): This job cannot reconnect to
> > > > starter, so job exiting
> > > > 02/23/17 07:16:06 (1208.16) (46751): attempt to connect to
> > > > <192.168.123.51:29246> failed: No route to host (connect errno = 113).
> > > > 02/23/17 07:16:06 (1208.16) (46751): locateStarter(): Failed to
> > > > connect to startd <192.168.123.51:29246?addrs=192.168.123.51-29246>
> > > > 02/23/17 07:16:06 (1208.16) (46751): JobLeaseDuration remaining: 2398
> > > > 02/23/17 07:16:06 (1208.16) (46751): Scheduling another attempt to
> > > > reconnect in 8 seconds
> > > > 02/23/17 07:16:07 (683.9) (2270376): condor_read() failed: recv(fd=4)
> > > > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > > > starter at <192.168.123.37:30325>.
> > > > 02/23/17 07:16:07 (683.9) (2270376): condor_read(): UNEXPECTED read
> > > > timeout after 0s during non-blocking read from starter at
> > > > <192.168.123.37:30325> (desired timeout=300s)
> > > > 02/23/17 07:16:07 (683.9) (2270376): IO: Failed to read packet header
> > > > 02/23/17 07:16:07 (683.9) (2270376): Can no longer talk to
> > > > condor_starter <192.168.123.37:30325>
> > > > 02/23/17 07:16:07 (683.9) (2270376): JobLeaseDuration remaining: 2014
> > > > 02/23/17 07:16:07 (683.9) (2270376): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:08 (2143.0) (2719507): attempt to connect to
> > > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > > 02/23/17 07:16:08 (683.9) (2270376): attempt to connect to
> > > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > > 02/23/17 07:16:08 (2143.0) (2719507): RemoteResource::killStarter():
> > > > Could not send command to startd
> > > > 02/23/17 07:16:08 (683.9) (2270376): locateStarter(): Failed to
> > > > connect to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> > > > 02/23/17 07:16:08 (683.9) (2270376): JobLeaseDuration remaining: 2399
> > > > 02/23/17 07:16:08 (683.9) (2270376): Scheduling another attempt to
> > > > reconnect in 8 seconds
> > > > 02/23/17 07:16:11 (1209.3) (49060): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:11 (1209.3) (49060): Found starter:
> > > > <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > > > 02/23/17 07:16:11 (1209.3) (49060): Attempting to reconnect to
> > > > starter <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > > > 02/23/17 07:16:11 (2143.0) (2719507): ERROR "Can no longer talk to
> > > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > 02/23/17 07:16:14 (1208.16) (46751): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:14 (1208.16) (46751): Found starter:
> > > > <192.168.123.51:49120?addrs=192.168.123.51-49120>
> > > > 02/23/17 07:16:14 (1208.16) (46751): Attempting to reconnect to
> > > > starter <192.168.123.51:49120?addrs=192.168.123.51-49120>
> > > > 02/23/17 07:16:15 (1208.16) (46751): Reconnect SUCCESS: connection
> > > > re-established
> > > > 02/23/17 07:16:16 (683.9) (2270376): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:16 (683.9) (2270376): Found starter:
> > > > <192.168.123.37:30325?addrs=192.168.123.37-30325>
> > > > 02/23/17 07:16:16 (683.9) (2270376): Attempting to reconnect to
> > > > starter <192.168.123.37:30325?addrs=192.168.123.37-30325>
> > > > 02/23/17 07:16:25 (683.9) (2270376): Reconnect SUCCESS: connection
> > > > re-established
> > > > 02/23/17 07:16:41 (1209.3) (49060): condor_read(): timeout reading 5
> > > > bytes from starter at <192.168.123.37:38618>.
> > > > 02/23/17 07:16:41 (1209.3) (49060): IO: Failed to read packet header
> > > > 02/23/17 07:16:41 (1209.3) (49060): Attempt to reconnect failed:
> > > > Failed to read reply ClassAd
> > > > 02/23/17 07:16:41 (1209.3) (49060): JobLeaseDuration remaining: 2361
> > > > 02/23/17 07:16:41 (1209.3) (49060): Scheduling another attempt to
> > > > reconnect in 16 seconds
> > > > 02/23/17 07:16:57 (1209.3) (49060): Attempting to locate disconnected
> > > > starter
> > > > 02/23/17 07:16:57 (1209.3) (49060): Found starter:
> > > > <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > > > 02/23/17 07:16:57 (1209.3) (49060): Attempting to reconnect to
> > > > starter <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > > > 02/23/17 07:16:57 (1209.3) (49060): Reconnect SUCCESS: connection
> > > > re-established
> > > > 02/23/17 07:43:17 (2102.0) (2559295): Job 2102.0 terminated: killed
> > > > by signal 6
> > > > 02/23/17 07:43:17 (2102.0) (2559295): **** condor_shadow
> > > > (condor_SHADOW) pid 2559295 EXITING WITH STATUS 115
> > > > 
> > > > On Tuesday 21 February 2017 22:41:54 Harald van Pee wrote:
> > > > > Hi Todd,
> > > > > 
> > > > > thank you for your help.
> > > > > 
> > > > > Concerning the "no route to host": I see no ethernet port going down
> > > > > on any machine during that time, but maybe changing to
> > > > > /etc/host.conf:
> > > > > order hosts,bind
> > > > > 
> > > > > /etc/nsswitch.conf:
> > > > > hosts:      files dns
> > > > > 
> > > > > instead of the debian default will help anyway; /etc/hosts has the
> > > > > ip addresses of all nodes in it.
> > > > > 
> > > > > Regards
> > > > > Harald
> > > > > 
> > > > > On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
> > > > > > On 2/21/2017 1:33 PM, Harald van Pee wrote:
> > > > > > > It seems that openmpi (or mpi) is not used very often with
> > > > > > > htcondor and the information is sparse, and I got some questions
> > > > > > > about how I managed to get it running at all. I will share all I
> > > > > > > know about this in a new thread soon, or is there a wiki where
> > > > > > > I should put the information?
> > > > > > 
> > > > > > Off-list I put Harald in touch with the folks who can put
> > > > > > Harald's info into the Manual or the HTCondor Wiki (from the web
> > > > > > homepage, look for the links "HOWTO recipes" and "HTCondor
> > > > > > Wiki").
> > > > > > 
> > > > > > Also, we did some work for the upcoming HTCondor v8.6.1 release so
> > > > > > it works properly with the latest releases of OpenMPI - for details
> > > > > > see
> > > > > >    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
> > > > > > > 
> > > > > > > Now back to our problem:
> > > > > > > One hint that it is related to the network (ethernet or
> > > > > > > infiniband) is that we had one job running for 11 days without
> > > > > > > problems while we had fewer jobs running, and we got problems
> > > > > > > within a few days once we had started 200 more jobs.
> > > > > > > I have now found 2 independent parallel mpi jobs which share
> > > > > > > one machine, with one job each, and there are no ethernet
> > > > > > > problems seen, neither on the scheduler machine nor on the
> > > > > > > starter node. Unfortunately there is no error output in the
> > > > > > > jobs' error files.
> > > > > > > It is clear that condor kills the jobs, but to me it is unclear
> > > > > > > why, because it seems both starter processes are still running,
> > > > > > > if I understand the logfiles correctly.
> > > > > > 
> > > > > > At first blush, it looks to me like the condor_shadow on the
> > > > > > submit node could no longer contact the execute node at IP
> > > > > > address 192.168.123.37 due to "No route to host".  The "No route
> > > > > > to host" error comes from the operating system, not from HTCondor -
> > > > > > you can google this error and see lots of opinions/ideas on how to
> > > > > > troubleshoot and fix, but basically there is no route for the
> > > > > > execute node IP address in the client's routing table... not sure
> > > > > > why this would happen all of a sudden, maybe some interface on
> > > > > > your submit machine is being disabled or some switch port?
> > > > > > 
> > > > > > regards,
> > > > > > Todd
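(For the record, the kind of checks one can run on the submit node the next
time this happens; just a sketch, with the execute node's address taken from
the logs above:)

  # is there currently a route and an ARP entry for the execute node?
  ip route get 192.168.123.37
  ip neigh | grep 192.168.123.37
  # is the node reachable at all right now?
  ping -c 3 192.168.123.37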
> > > > > > 
> > > > > > > Maybe one of you can find a hint in the condor logs below and
> > > > > > > tell me what happened, or what I can do to find out.
> > > > > > > 
> > > > > > > Best
> > > > > > > Harald
> > > > > > > 
> > > > > > > ShadowLog:
> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed:
> > > > > > > recv(fd=12) returned -1, errno = 110 Connection timed out,
> > > > > > > reading 5 bytes from startd at <192.168.123.37:30389>.
> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed:
> > > > > > > recv(fd=9) returned -1, errno = 110 Connection timed out,
> > > > > > > reading 5 bytes from startd at <192.168.123.37:30389>.
> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED
> > > > > > > read timeout after 0s during non-blocking read from startd at
> > > > > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED
> > > > > > > read timeout after 0s during non-blocking read from startd at
> > > > > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet
> > > > > > > header
> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet
> > > > > > > header
> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to
> > > > > > > condor_starter <192.168.123.37:30389>
> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to
> > > > > > > condor_starter <192.168.123.37:30389>
> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect
> > > > > > > to starter, so job exiting
> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect
> > > > > > > to starter, so job exiting
> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to
> > > > > > > <192.168.123.37:30389> failed: No route to host (connect errno
> > > > > > > = 113).
> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to
> > > > > > > <192.168.123.37:30389> failed: No route to host (connect errno
> > > > > > > = 113).
> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180):
> > > > > > > RemoteResource::killStarter(): Could not send command to startd
> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179):
> > > > > > > RemoteResource::killStarter(): Could not send command to startd
> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk
> > > > > > > to condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk
> > > > > > > to condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > > > >
> > > > > > > StarterLog of job 1745.0 on node 192.168.123.37
> > > > > > > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded,
> > > > > > > pid=751405
> > > > > > > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1
> > > > > > > bytes to <127.0.0.1:10238> returned -1, timeout=0, errno=32
> > > > > > > Broken pipe.
> > > > > > > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8)
> > > > > > > returned -1, errno = 104 Connection reset by peer, reading 5
> > > > > > > bytes from <192.168.123.100:25500>.
> > > > > > > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
> > > > > > > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow,
> > > > > > > waiting 2400 secs for reconnect
> > > > > > > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful
> > > > > > > shutdown.
> > > > > > > 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
> > > > > > > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405,
> > > > > > > status=0
> > > > > > > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status
> > > > > > > to shadow
> > > > > > > 02/19/17 03:10:05 (pid:751398) Last process exited, now Starter
> > > > > > > is exiting
> > > > > > > 02/19/17 03:10:05 (pid:751398) **** condor_starter
> > > > > > > (condor_STARTER) pid 751398 EXITING WITH STATUS 0
> > > > > > > 
> > > > > > > StarterLog of job 1744.0 on node 192.168.123.37
> > > > > > > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded,
> > > > > > > pid=751400
> > > > > > > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1
> > > > > > > bytes to <127.0.0.1:48689> returned -1, timeout=0, errno=32
> > > > > > > Broken pipe.
> > > > > > > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8)
> > > > > > > returned -1, errno = 104 Connection reset by peer, reading 5
> > > > > > > bytes from <192.168.123.100:34337>.
> > > > > > > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
> > > > > > > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow,
> > > > > > > waiting 2400 secs for reconnect
> > > > > > > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful
> > > > > > > shutdown.
> > > > > > > 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
> > > > > > > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400,
> > > > > > > status=0
> > > > > > > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status
> > > > > > > to shadow
> > > > > > > 02/19/17 03:10:03 (pid:751399) Last process exited, now Starter
> > > > > > > is exiting
> > > > > > > 02/19/17 03:10:03 (pid:751399) **** condor_starter
> > > > > > > (condor_STARTER) pid 751399 EXITING WITH STATUS 0
> > > > > > > 
> > > > > > > StartLog:
> > > > > > > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
> > > > > > > 02/19/17 03:09:48 slot1_11: Changing state and activity:
> > > > > > > Claimed/Busy -> Preempting/Vacating
> > > > > > > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
> > > > > > > 02/19/17 03:09:48 slot1_13: Changing state and activity:
> > > > > > > Claimed/Busy -> Preempting/Vacating
> > > > > > > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
> > > > > > > 02/19/17 03:10:03 slot1_11: State change: starter exited
> > > > > > > 02/19/17 03:10:03 slot1_11: State change: No preempting claim,
> > > > > > > returning to owner
> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state and activity:
> > > > > > > Preempting/Vacating -> Owner/Idle
> > > > > > > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
> > > > > > > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
> > > > > > > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
> > > > > > > 02/19/17 03:10:05 slot1_13: State change: starter exited
> > > > > > > 02/19/17 03:10:05 slot1_13: State change: No preempting claim,
> > > > > > > returning to owner
> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state and activity:
> > > > > > > Preempting/Vacating -> Owner/Idle
> > > > > > > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
> > > > > > > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
> > > > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > > > > > (<192.168.123.37:30389>#1481221329#1484#...) for 443
> > > > > > > (RELEASE_CLAIM); perhaps this claim was removed already.
> > > > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to
> > > > > > > write 13 bytes to <192.168.123.100:20962>, fd is 8
> > > > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > > > > > (<192.168.123.37:30389>#1481221329#1487#...) for 443
> > > > > > > (RELEASE_CLAIM); perhaps this claim was removed already.
> > > > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to
> > > > > > > write 13 bytes to <192.168.123.100:34792>, fd is 8
> > > > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > > > > > 
> > > > > > > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> > > > > > >> Dear experts,
> > > > > > >> 
> > > > > > >> I have some questions for debugging:
> > > > > > >> Can I avoid the restarting of a job in the vanilla and/or
> > > > > > >> parallel universe if I use Requirements = (NumJobStarts==0)
> > > > > > >> in the condor submit description file?
> > > > > > >> If it works, will the job stay idle or will it be removed?
> > > > > > >> 
> > > > > > >> I found a job in the vanilla universe that was started at 12/9,
> > > > > > >> restarted shortly before Christmas, and is still running. I
> > > > > > >> assume the reason was also network problems, but unfortunately
> > > > > > >> our condor and system log files only go back to January.
> > > > > > >> Is there any possibility to make condor a little bit more
> > > > > > >> robust against network problems via configuration? Just wait a
> > > > > > >> little bit longer or make more reconnection attempts?
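(For reference, the kind of knob I mean; only a sketch, and whether the
parallel universe honours it the same way as the vanilla universe is part of
my question:)

  # per job, in the submit description file (sketch)
  job_lease_duration = 7200

  # or pool-wide, in the condor configuration (sketch)
  JOB_DEFAULT_LEASE_DURATION = 7200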
> > > > > > >> 
> > > > > > >> We are working on automatic restart of the mpi jobs and trying
> > > > > > >> to use more frequent checkpoints, but it seems a lot of work,
> > > > > > >> so any idea would be welcome.
> > > > > > >> 
> > > > > > >> Best
> > > > > > >> Harald
> > > > > > >> 
> > > > > > >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> > > > > > >>> There is one important argument why I think the problem is
> > > > > > >>> condor related, not mpi related (of course I can be wrong).
> > > > > > >>> The condor communication goes via ethernet, and the ethernet
> > > > > > >>> connection has a problem for several minutes.
> > > > > > >>> The mpi communication goes via infiniband, and there is no
> > > > > > >>> infiniband problem during this time.
> > > > > > >>> 
> > > > > > >>> Harald
> > > > > > >>> 
> > > > > > >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > > > > > >>>> Hi Greg,
> > > > > > >>>> 
> > > > > > >>>> thanks for your answer.
> > > > > > >>>> 
> > > > > > >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > > > > > >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > > > > >>>>>> Hello,
> > > > > > >>>>>> 
> > > > > > >>>>>> we got mpi running in parallel universe with htcondor 8.4
> > > > > > >>>>>> using openmpiscript and its working in general without any
> > > > > > >>>>>> problem.
> > > > > > >>>>> 
> > > > > > >>>>> In general, the MPI jobs themselves cannot survive a
> > > > > > >>>>> network outage or partition, even a temporary one.
> > > > > > >>>>> HTCondor will reconnect the shadow to the starters, if the
> > > > > > >>>>> problem is just between the submit machine and the execute
> > > > > > >>>>> machines, but if the network problem also impacts
> > > > > > >>>>> node-to-node communication, then the job has to be aborted
> > > > > > >>>>> and restarted from scratch because of the way MPI works.
> > > > > > >>>> 
> > > > > > >>>> The problem seems to be between the submit machine and one
> > > > > > >>>> running node (not the node where mpirun was started).
> > > > > > >>>> If you are right, it should be possible to get or find an
> > > > > > >>>> error from mpirun because it lost one node, right?
> > > > > > >>>> But it seems condor kills the job because of a shadow
> > > > > > >>>> exception. Unfortunately we do not see the output of the
> > > > > > >>>> stopped job because it is overwritten by the newly started
> > > > > > >>>> one.
> > > > > > >>>> Any suggestion how to find out whether it is really an mpi
> > > > > > >>>> related problem?
> > > > > > >>>> 
> > > > > > >>>>> If possible, we would recommend that long-running jobs that
> > > > > > >>>>> suffer from this problem try to self-checkpoint themselves,
> > > > > > >>>>> so that when they are restarted, they don't need to be
> > > > > > >>>>> restarted from scratch.
> > > > > > >>>>> 
> > > > > > >>>>> -greg
> > > > 

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx