
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



On Wednesday 22 March 2017 21:13:27 Jason Patton wrote:
> Before changing NETWORK_INTERFACE (which requires a condor_restart and
> would end up killing all your vanilla universe jobs when applied to
> the STARTDs), 

O.k., that's important information; this is what I suspected. But if I do
that, I have no chance at the users' "Admin of the Year" prize...
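
For the record, the startd-side change being discussed would look roughly like
the following in condor_config.local (10.10.0.* is only a placeholder for our
IPoIB subnet), and it would still need a condor_restart of the startds:

    # route HTCondor's own traffic over the IPoIB interface
    NETWORK_INTERFACE = 10.10.0.*
    # keep the daemons listening on all interfaces
    BIND_ALL_INTERFACES = True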
 

> you can try adding this to your parallel jobs' submit
> files:
> 
> +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> 
> The manual says that this tells condor to only consider the job
> finished when all the nodes' processes have exited. What the manual
> *doesn't* say is that this tells condor to reconnect to *all* of the
> execute nodes in a parallel universe job if there is a network
> interruption. Under the default configuration, where the job exits
> only if node 0 exits, reconnection will only happen for node 0 if
> there is an interruption.

Ah, that's interesting, we will try this next!
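
Just so I have it written down: a minimal sketch of how this would go into one
of our parallel universe submit files (the executable, arguments and
machine_count are placeholders from our setup, not a recommendation):

    universe      = parallel
    executable    = openmpiscript
    arguments     = my_mpi_program
    machine_count = 40
    +ParallelShutdownPolicy = "WAIT_FOR_ALL"
    queue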

> 
> We will consider changing this behavior in a future release. In the
> meantime, if you use WAIT_FOR_ALL, you might have to be diligent
> watching for hung parallel universe jobs in case the processes on
> nodes > 0 decide not to exit when mpirun exits. I may be able to add
> some more cleanup code to openmpiscript to decrease the chance that
> other nodes hang if mpirun exits.
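
That would be great. On our side we will probably also extend the trap handler
in our wrapper script so that it catches SIGQUIT as well as SIGTERM (further
down in the quoted mails I noted that it currently only expects SIGTERM). A
rough sketch of what I mean, where cleanup_mpi is just a placeholder for our
own cleanup code:

    #!/bin/bash
    cleanup_mpi() {
        # placeholder: terminate whatever MPI processes this wrapper started
        pkill -TERM -P $$ 2>/dev/null
    }
    trap 'cleanup_mpi; exit 143' TERM   # 128 + SIGTERM(15)
    trap 'cleanup_mpi; exit 131' QUIT   # 128 + SIGQUIT(3)
    # ... start mpirun / the actual work here ...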

Many thanks
Harald

> 
> Jason Patton
> 
> On Wed, Mar 22, 2017 at 12:24 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
> > On Wednesday 22 March 2017 17:54:04 Jason Patton wrote:
> >> If it works for your condor pool, you can have condor communicate over
> >> your infiniband network by setting NETWORK_INTERFACE appropriately. The
> >> daemons should continue to listen on all interfaces as long as
> >> BIND_ALL_INTERFACES is set to true.
> > 
> > o.k. this is a good hint. We have restricted the scheduler's
> > NETWORK_INTERFACE to the private ethernet address, but we can also use
> > NETWORK_INTERFACE=192.168.*
> > Is a condor_reconfig enough on this host, which has
> > DAEMON_LIST = SCHEDD, COLLECTOR, MASTER, NEGOTIATOR
> > and will all vanilla jobs keep running?
> > 
> >> Assuming that both your vanilla and parallel jobs were disconnected for
> >> the same amount of time, it does sound like you've found a bug with
> >> the reconnection of parallel jobs. We will see if we can reproduce the
> >> behavior here.
> > 
> > I assume this is the case, because there are always vanilla jobs and
> > starters from parallel jobs running on node 3 (some of the vanilla jobs
> > have been running there since the cluster was last rebooted).
> > 
> > Harald
> > 
> >> Jason Patton
> >> 
> >> On Tue, Mar 21, 2017 at 12:11 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
> >> > Hello all,
> >> > 
> >> > now I have found the reason for the problems:
> >> > Our Cisco SG200-50 switch reboots every couple of days, sometimes even
> >> > every couple of hours. I will try a firmware update or replace it, but
> >> > with the cluster 100 days into its run that is not so easy.
> >> > 
> >> > From the condor side the difference seems to be that for a parallel
> >> > universe job we get:
> >> > 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter, so job exiting
> >> > whereas for a vanilla universe job we get:
> >> > 03/20/17 07:54:13 (3854.0) (1465927): Trying to reconnect to disconnected job
> >> > 
> >> > Is this just a feature to avoid problems with mpi, or could condor be
> >> > configured to try the same reconnect as in the vanilla universe, given
> >> > that mpi runs over infiniband?
> >> > 
> >> > Or do you suggest running the whole condor communication via
> >> > infiniband? Or should/can I add a second network?
> >> > 
> >> > Nevertheless, if we can solve the problem of the programs that are not
> >> > terminated, we can go into production with openmpi soon.
> >> > 
> >> > Best regards
> >> > Harald
> >> > 
> >> > On Monday 20 March 2017 18:11:47 Harald van Pee wrote:
> >> > > Hello,
> >> > > 
> >> > > here my status update:
> >> > > 
> >> > > - We are now sure that this problem has never been seen in the vanilla
> >> > >   universe; the only job in question was restarted because of a broken node.
> >> > > - The new options in openmpiscript from htcondor 8.6.1 make no difference.
> >> > >   In particular, excluding the ethernet interfaces does not help, as
> >> > >   Jason Patton had assumed from the openmpi documentation.
> >> > > 
> >> > > - It is clearly caused by minor ethernet problems because:
> >> > >   a) We see connection problems to some nodes at the same time, but
> >> > >      all connections except some from the parallel universe are reestablished.
> >> > >   b) We have identified 3 nodes which cause more problems than others;
> >> > >      if we exclude these nodes we have managed to run several mpi jobs
> >> > >      with 40 mpinodes for longer than 16 days without restarts (the
> >> > >      jobs were removed by the user or finished).
> >> > > But there is no reason to assume that these nodes have any severe
> >> > > problem, because we see no mpi errors even with high verbosity, and on
> >> > > the problem nodes there are vanilla jobs that have been running for up
> >> > > to 101 days now.
> >> > > 
> >> > > - Here is what happened today:
> >> > > Summary: Condor assumes that a starter on node 3 has a problem and
> >> > > sends a kill command. Even though condor assumes that the kill command
> >> > > could not be sent, it reaches node 2 as it should, because that was the
> >> > > first node, where mpirun was running.
> >> > > It also sends SIGQUIT before SIGTERM, which I had not expected, and
> >> > > maybe this is the reason why our trap handler does not work, because it
> >> > > expects only SIGTERM? In this case some mpi programs are still running
> >> > > after the job has been removed from condor.
> >> > > Then later the starters of job 3939 on node 3 also get a kill signal
> >> > > and handle it. But that is proof that these starters were still alive
> >> > > and there was no reason to kill them, right?
> >> > > 
> >> > > Even if there are workarounds, I think this behaviour could and should
> >> > > be improved.
> >> > > Or can we expect that something has changed with htcondor 8.6.1?
> >> > > 
> >> > > Best
> >> > > Harald
> >> > > 
> >> > > Here are the most relevant log messages:
> >> > > 
> >> > > ShadowLog:
> >> > > 03/20/17 07:54:11 (3939.0) (1617771): condor_read() failed: recv(fd=10) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.3:29143>.
> >> > > 03/20/17 07:54:11 (3939.0) (1617771): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.3:29143> (desired timeout=300s)
> >> > > 03/20/17 07:54:11 (3939.0) (1617771): IO: Failed to read packet header
> >> > > 03/20/17 07:54:11 (3939.0) (1617771): Can no longer talk to condor_starter <192.168.123.3:29143>
> >> > > 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter, so job exiting
> >> > > 03/20/17 07:54:12 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> >> > > 03/20/17 07:54:12 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
> >> > > 03/20/17 07:54:15 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> >> > > 03/20/17 07:54:15 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
> >> > > 03/20/17 07:54:21 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> >> > > 03/20/17 07:54:21 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
> >> > > 03/20/17 07:54:24 (3939.0) (1617771): ERROR "Can no longer talk to condor_starter <192.168.123.3:29143>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> >> > > 
> >> > > StarterLog.slot1_2 on node 2 (mpirun of job 3939.0)
> >> > > 03/20/17 07:54:11 (pid:1066066) Got SIGQUIT.  Performing fast shutdown.
> >> > > 03/20/17 07:54:11 (pid:1066066) ShutdownFast all jobs.
> >> > > 03/20/17 07:54:11 (pid:1066066) Got SIGTERM. Performing graceful shutdown.
> >> > > 03/20/17 07:54:11 (pid:1066066) ShutdownGraceful all jobs.
> >> > > 03/20/17 07:54:11 (pid:1066066) Process exited, pid=1066068, status=0
> >> > > 03/20/17 07:54:24 (pid:1066066) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:18658>.
> >> > > 03/20/17 07:54:24 (pid:1066066) IO: Failed to read packet header
> >> > > 03/20/17 07:54:24 (pid:1066066) Lost connection to shadow, waiting 2400 secs for reconnect
> >> > > 03/20/17 07:54:24 (pid:1066066) Failed to send job exit status to shadow
> >> > > 03/20/17 07:54:24 (pid:1066066) Last process exited, now Starter is exiting
> >> > > 03/20/17 07:54:24 (pid:1066066) **** condor_starter (condor_STARTER) pid 1066066 EXITING WITH STATUS 0
> >> > > 
> >> > > 
> >> > > StarterLog.slot1_2 on node 3  running job 3939.0
> >> > > 03/20/17 07:54:33 (pid:1056820) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:24154>.
> >> > > 03/20/17 07:54:33 (pid:1056820) IO: Failed to read packet header
> >> > > 03/20/17 07:54:33 (pid:1056820) Lost connection to shadow, waiting 2400 secs for reconnect
> >> > > 03/20/17 07:54:33 (pid:1056820) Got SIGTERM. Performing graceful shutdown.
> >> > > 03/20/17 07:54:33 (pid:1056820) ShutdownGraceful all jobs.
> >> > > 03/20/17 07:54:33 (pid:1056820) Process exited, pid=1056824, status=0
> >> > > 03/20/17 07:54:33 (pid:1056820) Failed to send job exit status to shadow
> >> > > 03/20/17 07:54:33 (pid:1056820) Last process exited, now Starter is exiting
> >> > > 03/20/17 07:54:33 (pid:1056820) **** condor_starter (condor_STARTER) pid 1056820 EXITING WITH STATUS 0
> >> > > 
> >> > > StarterLog.slot1_3 on node 3  running job 3939.0
> >> > > 03/20/17 07:54:46 (pid:1056821) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:3768>.
> >> > > 03/20/17 07:54:46 (pid:1056821) IO: Failed to read packet header
> >> > > 03/20/17 07:54:46 (pid:1056821) Lost connection to shadow, waiting 2400 secs for reconnect
> >> > > 03/20/17 07:54:46 (pid:1056821) Got SIGTERM. Performing graceful shutdown.
> >> > > 03/20/17 07:54:46 (pid:1056821) ShutdownGraceful all jobs.
> >> > > 03/20/17 07:54:46 (pid:1056821) Process exited, pid=1056823, status=0
> >> > > 03/20/17 07:54:46 (pid:1056821) Failed to send job exit status to shadow
> >> > > 03/20/17 07:54:46 (pid:1056821) Last process exited, now Starter is exiting
> >> > > 03/20/17 07:54:46 (pid:1056821) **** condor_starter (condor_STARTER) pid 1056821 EXITING WITH STATUS 0
> >> > > 
> >> > > On Thursday 23 February 2017 15:12:38 Harald van Pee wrote:
> >> > > > Hello,
> >> > > > 
> >> > > > it happened again. What we have learned so far is:
> >> > > > - a communication problem occurs between the scheduler node and the
> >> > > >   starter node
> >> > > > - condor kills the starter process and afterwards kills the job
> >> > > > - several different nodes are affected; due to lack of statistics we
> >> > > >   can not claim that all nodes are affected, nor exclude that some
> >> > > >   have more problems than others.
> >> > > > - it is very unlikely that the program itself has a problem, because
> >> > > >   we have seen that 2 starter processes of 2 independent parallel
> >> > > >   jobs were killed on the same node at the same time.
> >> > > > - at least within the last 2 months only parallel jobs have been
> >> > > >   affected, but there is no hint of an mpi problem; any help on how
> >> > > >   one can prove that no mpi problem exists is welcome.
> >> > > > We have many more vanilla starters running than parallel ones.
> >> > > > 
> >> > > > This morning the same node as last week was affected. On this node 9
> >> > > > single vanilla starters are running, 2 of them for more than 47 days,
> >> > > > plus 5 starters of 2 parallel jobs, and only one starter of one
> >> > > > parallel job was killed.
> >> > > > From the ShadowLog below, one can see that several starters from
> >> > > > several jobs, and not only from node 37, have communication problems,
> >> > > > and the time during which the problem occurs is less than one minute.
> >> > > > Therefore I would expect no problem reconnecting to the starters, and
> >> > > > this is indeed true for all vanilla jobs.
> >> > > > But why were the parallel starters killed so fast?
> >> > > > 
> >> > > > Any idea is welcome
> >> > > > Harald
> >> > > > 
> >> > > > ShadowLog (beginning 2 lines before and ending 2 lines after the
> >> > > > minute of the problem):
> >> > > > 02/23/17 07:10:59 (1835.0) (2412453): Job 1835.0 terminated: exited with status 0
> >> > > > 02/23/17 07:10:59 (1835.0) (2412453): **** condor_shadow (condor_SHADOW) pid 2412453 EXITING WITH STATUS 115
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): condor_read() failed: recv(fd=4) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxxxx
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd slot1@xxxxxxxxxxxxxxxxxxxxxxx (desired timeout=300s)
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): IO: Failed to read packet header
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): Can no longer talk to condor_starter <192.168.123.37:30389>
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): Trying to reconnect to disconnected job
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): LastJobLeaseRenewal: 1487830176 Thu Feb 23 07:09:36 2017
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration: 2400 seconds
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration remaining: 2014
> >> > > > 02/23/17 07:16:02 (1209.3) (49060): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:03 (1209.3) (49060): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> >> > > > 02/23/17 07:16:03 (1209.3) (49060): locateStarter(): Failed to connect to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> >> > > > 02/23/17 07:16:03 (1209.3) (49060): JobLeaseDuration remaining: 2399
> >> > > > 02/23/17 07:16:03 (1209.3) (49060): Scheduling another attempt to reconnect in 8 seconds
> >> > > > 02/23/17 07:16:04 (1208.16) (46751): condor_read() failed: recv(fd=4) returned -1, errno = 110 Connection timed out, reading 5 bytes from starter at <192.168.123.51:49120>.
> >> > > > 02/23/17 07:16:04 (1208.16) (46751): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from starter at <192.168.123.51:49120> (desired timeout=300s)
> >> > > > 02/23/17 07:16:04 (1208.16) (46751): IO: Failed to read packet header
> >> > > > 02/23/17 07:16:04 (1208.16) (46751): Can no longer talk to condor_starter <192.168.123.51:49120>
> >> > > > 02/23/17 07:16:04 (1208.16) (46751): JobLeaseDuration remaining: 2014
> >> > > > 02/23/17 07:16:04 (1208.16) (46751): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:05 (2143.0) (2719507): condor_read() failed: recv(fd=25) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.37:30389>.
> >> > > > 02/23/17 07:16:05 (2143.0) (2719507): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.37:30389> (desired timeout=300s)
> >> > > > 02/23/17 07:16:05 (2143.0) (2719507): IO: Failed to read packet header
> >> > > > 02/23/17 07:16:05 (2143.0) (2719507): Can no longer talk to condor_starter <192.168.123.37:30389>
> >> > > > 02/23/17 07:16:05 (2143.0) (2719507): This job cannot reconnect to starter, so job exiting
> >> > > > 02/23/17 07:16:06 (1208.16) (46751): attempt to connect to <192.168.123.51:29246> failed: No route to host (connect errno = 113).
> >> > > > 02/23/17 07:16:06 (1208.16) (46751): locateStarter(): Failed to connect to startd <192.168.123.51:29246?addrs=192.168.123.51-29246>
> >> > > > 02/23/17 07:16:06 (1208.16) (46751): JobLeaseDuration remaining: 2398
> >> > > > 02/23/17 07:16:06 (1208.16) (46751): Scheduling another attempt to reconnect in 8 seconds
> >> > > > 02/23/17 07:16:07 (683.9) (2270376): condor_read() failed: recv(fd=4) returned -1, errno = 110 Connection timed out, reading 5 bytes from starter at <192.168.123.37:30325>.
> >> > > > 02/23/17 07:16:07 (683.9) (2270376): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from starter at <192.168.123.37:30325> (desired timeout=300s)
> >> > > > 02/23/17 07:16:07 (683.9) (2270376): IO: Failed to read packet header
> >> > > > 02/23/17 07:16:07 (683.9) (2270376): Can no longer talk to condor_starter <192.168.123.37:30325>
> >> > > > 02/23/17 07:16:07 (683.9) (2270376): JobLeaseDuration remaining: 2014
> >> > > > 02/23/17 07:16:07 (683.9) (2270376): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:08 (2143.0) (2719507): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> >> > > > 02/23/17 07:16:08 (683.9) (2270376): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> >> > > > 02/23/17 07:16:08 (2143.0) (2719507): RemoteResource::killStarter(): Could not send command to startd
> >> > > > 02/23/17 07:16:08 (683.9) (2270376): locateStarter(): Failed to connect to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> >> > > > 02/23/17 07:16:08 (683.9) (2270376): JobLeaseDuration remaining: 2399
> >> > > > 02/23/17 07:16:08 (683.9) (2270376): Scheduling another attempt to reconnect in 8 seconds
> >> > > > 02/23/17 07:16:11 (1209.3) (49060): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:11 (1209.3) (49060): Found starter: <192.168.123.37:38618?addrs=192.168.123.37-38618>
> >> > > > 02/23/17 07:16:11 (1209.3) (49060): Attempting to reconnect to starter <192.168.123.37:38618?addrs=192.168.123.37-38618>
> >> > > > 02/23/17 07:16:11 (2143.0) (2719507): ERROR "Can no longer talk to condor_starter <192.168.123.37:30389>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> >> > > > 02/23/17 07:16:14 (1208.16) (46751): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:14 (1208.16) (46751): Found starter: <192.168.123.51:49120?addrs=192.168.123.51-49120>
> >> > > > 02/23/17 07:16:14 (1208.16) (46751): Attempting to reconnect to starter <192.168.123.51:49120?addrs=192.168.123.51-49120>
> >> > > > 02/23/17 07:16:15 (1208.16) (46751): Reconnect SUCCESS: connection re-established
> >> > > > 02/23/17 07:16:16 (683.9) (2270376): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:16 (683.9) (2270376): Found starter: <192.168.123.37:30325?addrs=192.168.123.37-30325>
> >> > > > 02/23/17 07:16:16 (683.9) (2270376): Attempting to reconnect to starter <192.168.123.37:30325?addrs=192.168.123.37-30325>
> >> > > > 02/23/17 07:16:25 (683.9) (2270376): Reconnect SUCCESS: connection re-established
> >> > > > 02/23/17 07:16:41 (1209.3) (49060): condor_read(): timeout reading 5 bytes from starter at <192.168.123.37:38618>.
> >> > > > 02/23/17 07:16:41 (1209.3) (49060): IO: Failed to read packet header
> >> > > > 02/23/17 07:16:41 (1209.3) (49060): Attempt to reconnect failed: Failed to read reply ClassAd
> >> > > > 02/23/17 07:16:41 (1209.3) (49060): JobLeaseDuration remaining: 2361
> >> > > > 02/23/17 07:16:41 (1209.3) (49060): Scheduling another attempt to reconnect in 16 seconds
> >> > > > 02/23/17 07:16:57 (1209.3) (49060): Attempting to locate disconnected starter
> >> > > > 02/23/17 07:16:57 (1209.3) (49060): Found starter: <192.168.123.37:38618?addrs=192.168.123.37-38618>
> >> > > > 02/23/17 07:16:57 (1209.3) (49060): Attempting to reconnect to starter <192.168.123.37:38618?addrs=192.168.123.37-38618>
> >> > > > 02/23/17 07:16:57 (1209.3) (49060): Reconnect SUCCESS: connection re-established
> >> > > > 02/23/17 07:43:17 (2102.0) (2559295): Job 2102.0 terminated: killed by signal 6
> >> > > > 02/23/17 07:43:17 (2102.0) (2559295): **** condor_shadow (condor_SHADOW) pid 2559295 EXITING WITH STATUS 115
> >> > > > 
> >> > > > On Tuesday 21 February 2017 22:41:54 Harald van Pee wrote:
> >> > > > > Hi Todd,
> >> > > > > 
> >> > > > > thank you for your help.
> >> > > > > 
> >> > > > > Concerning the "no route to host": I see no ethernet port going
> >> > > > > down on any machine during that time, but maybe a change to
> >> > > > > /etc/host.conf:
> >> > > > > order hosts,bind
> >> > > > > 
> >> > > > > /etc/nsswitch.conf:
> >> > > > > hosts:      files dns
> >> > > > > 
> >> > > > > instead of the Debian default will help anyway; /etc/hosts contains
> >> > > > > the IP addresses of all nodes.
> >> > > > > 
> >> > > > > Regards
> >> > > > > Harald
> >> > > > > 
> >> > > > > On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
> >> > > > > > On 2/21/2017 1:33 PM, Harald van Pee wrote:
> >> > > > > > > It seems that openmpi (or mpi) is not used very often with
> >> > > > > > > htcondor and the information is sparse, and I got some questions
> >> > > > > > > about how I managed to get it running at all. I will share all
> >> > > > > > > I know about this in a new thread soon, or is there a wiki
> >> > > > > > > where I should put the information?
> >> > > > > > 
> >> > > > > > Off-list I put Harald in touch with the folks who can put
> >> > > > > > Harald's info into the Manual or the HTCondor Wiki (from the
> >> > > > > > web homepage, look for the links "HOWTO recipes" and "HTCondor
> >> > > > > > Wiki").
> >> > > > > > 
> >> > > > > > Also we did some work for upcoming HTCondor v8.6.1 release so
> >> > > > > > it works properly with the latest releases of OpenMPI - for
> >> > > > > > details
> >> > 
> >> > see
> >> > 
> >> > > > > >    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
> >> > > > > > > 
> >> > > > > > > Now back to our problem:
> >> > > > > > > One hint that it is related to the network (ethernet or
> >> > > > > > > infiniband) is that we had one job running for 11 days without
> >> > > > > > > problems while we had fewer jobs running, and we got problems
> >> > > > > > > within a few days after we started 200 more jobs. I have now
> >> > > > > > > found 2 independent parallel mpi jobs which share one machine
> >> > > > > > > with one job each, and no ethernet problems are seen there,
> >> > > > > > > neither on the scheduler machine nor on the starter node.
> >> > > > > > > Unfortunately there is no error output in the jobs' error files.
> >> > > > > > > It is clear that condor kills the jobs, but to me it is unclear
> >> > > > > > > why, because it seems both starter processes are still running,
> >> > > > > > > if I understand the logfiles correctly.
> >> > > > > > 
> >> > > > > > At first blush, it looks to me like the condor_shadow on the
> >> > > > > > submit node could no longer contact the execute node at IP
> >> > > > > > address 192.168.123.37 due to "No route to host".  The "No
> >> > > > > > route to host" error comes from the operating system, not from
> >> > > > > > HTCondor - you can google this error and see lots of
> >> > > > > > opinions/ideas on how to troubleshoot and fix it, but basically
> >> > > > > > there is no route for the execute node IP address in the
> >> > > > > > client's routing table... not sure why this would happen all of
> >> > > > > > a sudden, maybe some interface on your submit machine is being
> >> > > > > > disabled, or some switch port?
> >> > > > > > 
> >> > > > > > regards,
> >> > > > > > Todd
> >> > > > > > 
> >> > > > > > > Maybe one of you can find something in the condor logs below and
> >> > > > > > > give me a hint about what is happening, or what I can do to find out.
> >> > > > > > > 
> >> > > > > > > Best
> >> > > > > > > Harald
> >> > > > > > > 
> >> > > > > > > ShadowLog:
> >> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed: recv(fd=12) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.37:30389>.
> >> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed: recv(fd=9) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.37:30389>.
> >> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.37:30389> (desired timeout=300s)
> >> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.37:30389> (desired timeout=300s)
> >> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
> >> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
> >> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to condor_starter <192.168.123.37:30389>
> >> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to condor_starter <192.168.123.37:30389>
> >> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to starter, so job exiting
> >> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to starter, so job exiting
> >> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> >> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> >> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter(): Could not send command to startd
> >> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter(): Could not send command to startd
> >> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to condor_starter <192.168.123.37:30389>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> >> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to condor_starter <192.168.123.37:30389>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> >> > 
> >> > > > > > > StarterLog of job 1745.0 on node 192.168.123.37
> >> > > > > > > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
> >> > > > > > > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes to <127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:25500>.
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting 2400 secs for reconnect
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful shutdown.
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is exiting
> >> > > > > > > 02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER) pid 751398 EXITING WITH STATUS 0
> >> > > > > > > 
> >> > > > > > > StarterLog of job 1744.0 on node 192.168.123.37
> >> > > > > > > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
> >> > > > > > > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes to <127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:34337>.
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting 2400 secs for reconnect
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful shutdown.
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is exiting
> >> > > > > > > 02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER) pid 751399 EXITING WITH STATUS 0
> >> > > > > > > 
> >> > > > > > > StartLog:
> >> > > > > > > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
> >> > > > > > > 02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> >> > > > > > > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
> >> > > > > > > 02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> >> > > > > > > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
> >> > > > > > > 02/19/17 03:10:03 slot1_11: State change: starter exited
> >> > > > > > > 02/19/17 03:10:03 slot1_11: State change: No preempting claim, returning to owner
> >> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
> >> > > > > > > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
> >> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
> >> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
> >> > > > > > > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
> >> > > > > > > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
> >> > > > > > > 02/19/17 03:10:05 slot1_13: State change: starter exited
> >> > > > > > > 02/19/17 03:10:05 slot1_13: State change: No preempting claim, returning to owner
> >> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state and activity: Preempting/Vacating -> Owner/Idle
> >> > > > > > > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
> >> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
> >> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
> >> > > > > > > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
> >> > > > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId (<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
> >> > > > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes to <192.168.123.100:20962>, fd is 8
> >> > > > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> >> > > > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId (<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
> >> > > > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes to <192.168.123.100:34792>, fd is 8
> >> > > > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> >> > > > > > > 
> >> > > > > > > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> >> > > > > > >> Dear experts,
> >> > > > > > >> 
> >> > > > > > >> I have some questions for debugging:
> >> > > > > > >> Can I avoid the restart of a job in the vanilla and/or parallel
> >> > > > > > >> universe if I use Requirements = (NumJobStarts==0)
> >> > > > > > >> in the condor submit description file?
> >> > > > > > >> If it works, will the job stay idle or will it be removed?
> >> > > > > > >> 
> >> > > > > > >> I found a job in the vanilla universe that was started on 12/9,
> >> > > > > > >> was restarted shortly before Christmas, and is still running. I
> >> > > > > > >> assume the reason was also network problems, but unfortunately
> >> > > > > > >> our last condor and system log files are from January. Is there
> >> > > > > > >> any possibility to make condor a little bit more robust against
> >> > > > > > >> network problems via configuration? Just wait a little bit
> >> > > > > > >> longer or make more reconnection attempts?
> >> > > > > > >> 
> >> > > > > > >> We are working on automatic restart of the mpi jobs and trying
> >> > > > > > >> to use more frequent checkpoints, but it seems to be a lot of
> >> > > > > > >> work, therefore any idea would be welcome.
> >> > > > > > >> 
> >> > > > > > >> Best
> >> > > > > > >> Harald
> >> > > > > > >> 
> >> > > > > > >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> >> > > > > > >>> There is one important argument why I think the problem is
> >> > > > > > >>> condor related, not mpi (of course I can be wrong). The
> >> > > > > > >>> condor communication goes via ethernet, and the ethernet
> >> > > > > > >>> connection has a problem for several minutes.
> >> > > > > > >>> The mpi communication goes via infiniband, and there is no
> >> > > > > > >>> infiniband problem during this time.
> >> > > > > > >>> 
> >> > > > > > >>> Harald
> >> > > > > > >>> 
> >> > > > > > >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> >> > > > > > >>>> Hi Greg,
> >> > > > > > >>>> 
> >> > > > > > >>>> thanks for your answer.
> >> > > > > > >>>> 
> >> > > > > > >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> >> > > > > > >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> >> > > > > > >>>>>> Hello,
> >> > > > > > >>>>>> 
> >> > > > > > >>>>>> we got mpi running in the parallel universe with htcondor
> >> > > > > > >>>>>> 8.4 using openmpiscript and it is working in general
> >> > > > > > >>>>>> without any problem.
> >> > > > > > >>>>> 
> >> > > > > > >>>>> In general, the MPI jobs themselves cannot survive a
> >> > > > > > >>>>> network outage or partition, even a temporary one.
> >> > > > > > >>>>> HTCondor will reconnect the shadow to the starters if
> >> > > > > > >>>>> the problem is just between the submit machine and the
> >> > > > > > >>>>> execute machines, but if the network problem also impacts
> >> > > > > > >>>>> node-to-node communication, then the job has to be aborted
> >> > > > > > >>>>> and restarted from scratch because of the way MPI works.
> >> > > > > > >>>> 
> >> > > > > > >>>> The problem seems to be between the submit machine and one
> >> > > > > > >>>> running node (not the node where mpirun was started).
> >> > > > > > >>>> If you are right it should be possible to get or find an
> >> > > > > > >>>> error from mpirun because it lost one node, right?
> >> > > > > > >>>> But it seems condor kills the job because of a shadow
> >> > > > > > >>>> exception.
> >> > > > > > >>>> Unfortunately we do not see the output of the stopped job
> >> > > > > > >>>> because it is overwritten by the newly started one.
> >> > > > > > >>>> Any suggestion on how to find out whether it is really an
> >> > > > > > >>>> mpi related problem?
> >> > > > > > >>>> 
> >> > > > > > >>>>> If possible, we would recommend that long-running jobs
> >> > > > > > >>>>> that suffer from this problem try to self-checkpoint
> >> > > > > > >>>>> themselves, so that when they are restarted, they don't
> >> > > > > > >>>>> need to be restarted from scratch.
> >> > > > > > >>>>> 
> >> > > > > > >>>>> -greg
> >> > > > 
> >> > > 
> >> > 
> > 
> > --
> > Harald van Pee
> > 
> > Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
> > Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
> > mail: pee@xxxxxxxxxxxxxxxxx
> 

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx