
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Before changing NETWORK_INTERFACE (which requires a condor_restart and
would end up killing all your vanilla universe jobs when applied to
the STARTDs), you can try adding this to your parallel jobs' submit
files:

+ParallelShutdownPolicy = "WAIT_FOR_ALL"

The manual says that this tells condor to consider the job finished only
when all of the nodes' processes have exited. What the manual *doesn't*
say is that it also tells condor to reconnect to *all* of the execute
nodes of a parallel universe job if there is a network interruption.
Under the default policy, where the job is considered finished as soon as
node 0 exits, reconnection will only happen for node 0 if there is an
interruption.
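
For example, a parallel universe submit file using openmpiscript might look
roughly like the sketch below (the executable name, machine count, and file
names are placeholders for whatever your jobs actually use):

  # placeholder executable and file names -- adjust to your own job
  universe                = parallel
  executable              = openmpiscript
  arguments               = my_mpi_program
  machine_count           = 4
  should_transfer_files   = yes
  when_to_transfer_output = on_exit_or_evict
  transfer_input_files    = my_mpi_program
  +ParallelShutdownPolicy = "WAIT_FOR_ALL"
  output                  = out.$(Node)
  error                   = err.$(Node)
  log                     = mpi.log
  queue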

We will consider changing this behavior in a future release. In the
meantime, if you use WAIT_FOR_ALL, you might have to be diligent about
watching for hung parallel universe jobs in case the processes on
nodes > 0 decide not to exit when mpirun exits. I may be able to add
some more cleanup code to openmpiscript to decrease the chance that
other nodes hang after mpirun exits.
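
Very roughly, I have in mind something like the sketch below in the wrapper
script that launches mpirun (this is only an illustration, not the actual
openmpiscript, and the way the arguments are passed is an assumption):

  #!/bin/bash
  # Sketch only: launch mpirun in the background so the wrapper can react
  # to signals from the starter.
  mpirun "$@" &
  mpirun_pid=$!

  cleanup () {
      # Forward the shutdown request to mpirun, wait for it to exit,
      # then exit cleanly so no orphaned processes are left behind.
      kill -TERM "$mpirun_pid" 2>/dev/null
      wait "$mpirun_pid"
      exit 0
  }

  # The starter may send SIGQUIT (fast shutdown) as well as SIGTERM
  # (graceful shutdown), so trap both.
  trap cleanup TERM QUIT INT

  wait "$mpirun_pid"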

Jason Patton

On Wed, Mar 22, 2017 at 12:24 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
> On Wednesday 22 March 2017 17:54:04 Jason Patton wrote:
>> If it works for your condor pool, you can have condor communicate over your
>> infiniband network by setting NETWORK_INTERFACE appropriately. The daemons
>> should continue to listen on all interfaces as long as BIND_ALL_INTERFACES
>> is set to true.
>
> O.k., this is a good hint. We have restricted the scheduler's NETWORK_INTERFACE
> to the private ethernet address, but we could also use
> NETWORK_INTERFACE = 192.168.*
> Is a condor_reconfig enough on this host, which has
> DAEMON_LIST = SCHEDD, COLLECTOR, MASTER, NEGOTIATOR
> and will all vanilla jobs keep running?
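>
> In other words, the local configuration on this host would then contain
> something like the following (this is just my guess at the exact form, based
> on your suggestion), followed by a condor_reconfig:
>
> NETWORK_INTERFACE   = 192.168.*
> BIND_ALL_INTERFACES = True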
>
>>
>> Assuming that both your vanilla and parallel jobs were disconnected for the
>> same amount of time, it does sound like you've found a bug with the
>> reconnection of parallel jobs. We will see if we can reproduce the behavior
>> here.
>
> I assume this is the case, because there are always vanilla jobs and starters
> from parallel jobs running on node 3 (some of the vanilla jobs have been
> running there since the cluster was last rebooted).
>
> Harald
>>
>> Jason Patton
>>
>> On Tue, Mar 21, 2017 at 12:11 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
>>
>> wrote:
>> > Hello all,
>> >
>> > now I have found the reason for the problems:
>> > Our cisco sg200-50 switch reboots every couple of days, and sometimes even
>> > every couple of hours. I will try a firmware update or replace it, but in a
>> > cluster that has been running for 100 days that is not so easy.
>> >
>> > From the condor side the difference seems to be that for a parallel
>> > universe job we get:
>> > 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter, so job exiting
>> > whereas for a vanilla universe job we get:
>> > 03/20/17 07:54:13 (3854.0) (1465927): Trying to reconnect to disconnected job
>> >
>> > Is this just a feature to avoid problems with mpi, or could one have a
>> > configuration where condor tries the same thing as in the vanilla universe,
>> > given that mpi runs over infiniband?
>> >
>> > Or do you suggest making the whole condor communication go via
>> > infiniband? Or should/can I add a second network?
>> >
>> > Nevertheless, if we can solve the problem of the programs that are not
>> > terminated, we can go into production with openmpi soon.
>> >
>> > Best regards
>> > Harald
>> >
>> > On Monday 20 March 2017 18:11:47 Harald van Pee wrote:
>> > > Hello,
>> > >
>> > > here is my status update:
>> > >
>> > > - We are now sure that this problem was never seen in the vanilla
>> > > universe; the only job in question was restarted because of a broken node.
>> > >
>> > > - The new options in openmpiscript from htcondor 8.6.1 make no difference.
>> > >  In particular, excluding ethernet interfaces does not help, as Jason
>> > >  Patton assumed it would based on the openmpi documentation.
>> > >
>> > > - It is clearly caused by minor ethernet problems because:
>> > >  a) We see connection problems to some nodes at the same time, but all
>> > >  connections except some from the parallel universe are reestablished.
>> > >  b) We have identified 3 nodes which cause more problems than others;
>> > >  if we exclude these nodes, we have managed to run several mpi jobs with
>> > >  40 mpi nodes for longer than 16 days without restarts (the jobs were
>> > >  removed by the user or finished).
>> > > But there is no reason to assume that these nodes have any severe problem,
>> > > because we see no mpi errors even with high verbosity, and on the problem
>> > > nodes there are vanilla jobs that have been running for up to 101 days now.
>> > >
>> > > - Here is what happened today:
>> > > Summary: Condor assumes that a starter on node 3 has a problem and sends
>> > > a kill command. Even though condor assumes that the kill command could
>> > > not be sent, it reaches node 2, as it should, because this was the first
>> > > node where mpirun is running.
>> > > It also sends SIGQUIT before SIGTERM, which I had not expected; maybe
>> > > this is the reason why our trap handler does not work, because it expects
>> > > only SIGTERM? In that case some mpi programs are still running after the
>> > > job has been removed from condor.
>> > > Then later the starters of job 3939 on node 3 also get a kill signal
>> > > and handle it. But this is proof that these starters were still alive
>> > > and there was no reason to kill them, right?
>> > >
>> > > Even if there are workarounds, I think this behaviour could and should
>> > > be improved.
>> > > Or can we expect that something has changed in htcondor 8.6.1?
>> > >
>> > > Best
>> > > Harald
>> > >
>> > > Here are the most relevant log messages:
>> > >
>> > > ShadowLog:
>> > > 03/20/17 07:54:11 (3939.0) (1617771): condor_read() failed: recv(fd=10) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.3:29143>.
>> > > 03/20/17 07:54:11 (3939.0) (1617771): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.3:29143> (desired timeout=300s)
>> > > 03/20/17 07:54:11 (3939.0) (1617771): IO: Failed to read packet header
>> > > 03/20/17 07:54:11 (3939.0) (1617771): Can no longer talk to condor_starter <192.168.123.3:29143>
>> > > 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter, so job exiting
>> > > 03/20/17 07:54:12 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
>> > > 03/20/17 07:54:12 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
>> > > 03/20/17 07:54:15 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
>> > > 03/20/17 07:54:15 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
>> > > 03/20/17 07:54:21 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
>> > > 03/20/17 07:54:21 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
>> > > 03/20/17 07:54:24 (3939.0) (1617771): ERROR "Can no longer talk to condor_starter <192.168.123.3:29143>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>> > >
>> > > StarterLog.slot1_2 on node 2 (mpirun of job 3939.0)
>> > > 03/20/17 07:54:11 (pid:1066066) Got SIGQUIT.  Performing fast shutdown.
>> > > 03/20/17 07:54:11 (pid:1066066) ShutdownFast all jobs.
>> > > 03/20/17 07:54:11 (pid:1066066) Got SIGTERM. Performing graceful shutdown.
>> > > 03/20/17 07:54:11 (pid:1066066) ShutdownGraceful all jobs.
>> > > 03/20/17 07:54:11 (pid:1066066) Process exited, pid=1066068, status=0
>> > > 03/20/17 07:54:24 (pid:1066066) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:18658>.
>> > > 03/20/17 07:54:24 (pid:1066066) IO: Failed to read packet header
>> > > 03/20/17 07:54:24 (pid:1066066) Lost connection to shadow, waiting 2400 secs for reconnect
>> > > 03/20/17 07:54:24 (pid:1066066) Failed to send job exit status to shadow
>> > > 03/20/17 07:54:24 (pid:1066066) Last process exited, now Starter is exiting
>> > > 03/20/17 07:54:24 (pid:1066066) **** condor_starter (condor_STARTER) pid 1066066 EXITING WITH STATUS 0
>> > >
>> > >
>> > > StarterLog.slot1_2 on node 3  running job 3939.0
>> > > 03/20/17 07:54:33 (pid:1056820) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:24154>.
>> > > 03/20/17 07:54:33 (pid:1056820) IO: Failed to read packet header
>> > > 03/20/17 07:54:33 (pid:1056820) Lost connection to shadow, waiting 2400 secs for reconnect
>> > > 03/20/17 07:54:33 (pid:1056820) Got SIGTERM. Performing graceful shutdown.
>> > > 03/20/17 07:54:33 (pid:1056820) ShutdownGraceful all jobs.
>> > > 03/20/17 07:54:33 (pid:1056820) Process exited, pid=1056824, status=0
>> > > 03/20/17 07:54:33 (pid:1056820) Failed to send job exit status to shadow
>> > > 03/20/17 07:54:33 (pid:1056820) Last process exited, now Starter is exiting
>> > > 03/20/17 07:54:33 (pid:1056820) **** condor_starter (condor_STARTER) pid 1056820 EXITING WITH STATUS 0
>> > >
>> > > StarterLog.slot1_3 on node 3  running job 3939.0
>> > > 03/20/17 07:54:46 (pid:1056821) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:3768>.
>> > > 03/20/17 07:54:46 (pid:1056821) IO: Failed to read packet header
>> > > 03/20/17 07:54:46 (pid:1056821) Lost connection to shadow, waiting 2400 secs for reconnect
>> > > 03/20/17 07:54:46 (pid:1056821) Got SIGTERM. Performing graceful shutdown.
>> > > 03/20/17 07:54:46 (pid:1056821) ShutdownGraceful all jobs.
>> > > 03/20/17 07:54:46 (pid:1056821) Process exited, pid=1056823, status=0
>> > > 03/20/17 07:54:46 (pid:1056821) Failed to send job exit status to shadow
>> > > 03/20/17 07:54:46 (pid:1056821) Last process exited, now Starter is exiting
>> > > 03/20/17 07:54:46 (pid:1056821) **** condor_starter (condor_STARTER) pid 1056821 EXITING WITH STATUS 0
>> > >
>> > > On Thursday 23 February 2017 15:12:38 Harald van Pee wrote:
>> > > > Hello,
>> > > >
>> > > > it has happened again. What we have learned so far is:
>> > > > - a communication problem occurs between the scheduler node and a
>> > > > starter node
>> > > > - condor kills the starter process and afterwards kills the job
>> > > > - several different nodes are affected; due to the lack of statistics we
>> > > > can neither claim that all nodes are affected nor exclude that some have
>> > > > more problems than others.
>> > > > - it is very unlikely that the program itself has a problem, because we
>> > > > have seen 2 starter processes of 2 independent parallel jobs being killed
>> > > > on the same node at the same time.
>> > > > - at least within the last 2 months only parallel jobs have been
>> > > > affected, but there is no hint of an mpi problem; any help on how one
>> > > > can prove that no mpi problem exists is welcome.
>> > > > We have many more vanilla starters running than parallel ones.
>> > > >
>> > > > This morning the same node as last week was affected. On this node
>> > > > 9 single vanilla starters are running, 2 of them for more than 47
>> > > > days, plus 5 starters from 2 parallel jobs, and only one starter
>> > > > of one parallel job was killed.
>> > > > From the ShadowLog below, one can see that several starters from several
>> > > > jobs, and not only those on node 37, have communication problems, and
>> > > > the period during which the problem occurs is less than one minute.
>> > > > Therefore I would expect that there is no problem reconnecting to the
>> > > > starters, and this is true for all vanilla jobs.
>> > > > But why were the parallel starters killed so fast?
>> > > >
>> > > > Any idea is welcome
>> > > > Harald
>> > > >
>> > > > ShadowLog (begin 2 lines before,  end 2 lines after the minute of the
>> > > > problem):
>> > > > 02/23/17 07:10:59 (1835.0) (2412453): Job 1835.0 terminated: exited with status 0
>> > > > 02/23/17 07:10:59 (1835.0) (2412453): **** condor_shadow (condor_SHADOW) pid 2412453 EXITING WITH STATUS 115
>> > > > 02/23/17 07:16:02 (1209.3) (49060): condor_read() failed: recv(fd=4) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxxxx
>> > > > 02/23/17 07:16:02 (1209.3) (49060): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd slot1@xxxxxxxxxxxxxxxxxxxxxxx (desired timeout=300s)
>> > > > 02/23/17 07:16:02 (1209.3) (49060): IO: Failed to read packet header
>> > > > 02/23/17 07:16:02 (1209.3) (49060): Can no longer talk to condor_starter <192.168.123.37:30389>
>> > > > 02/23/17 07:16:02 (1209.3) (49060): Trying to reconnect to disconnected job
>> > > > 02/23/17 07:16:02 (1209.3) (49060): LastJobLeaseRenewal: 1487830176 Thu Feb 23 07:09:36 2017
>> > > > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration: 2400 seconds
>> > > > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration remaining: 2014
>> > > > 02/23/17 07:16:02 (1209.3) (49060): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:03 (1209.3) (49060): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
>> > > > 02/23/17 07:16:03 (1209.3) (49060): locateStarter(): Failed to connect to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
>> > > > 02/23/17 07:16:03 (1209.3) (49060): JobLeaseDuration remaining: 2399
>> > > > 02/23/17 07:16:03 (1209.3) (49060): Scheduling another attempt to reconnect in 8 seconds
>> > > > 02/23/17 07:16:04 (1208.16) (46751): condor_read() failed: recv(fd=4) returned -1, errno = 110 Connection timed out, reading 5 bytes from starter at <192.168.123.51:49120>.
>> > > > 02/23/17 07:16:04 (1208.16) (46751): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from starter at <192.168.123.51:49120> (desired timeout=300s)
>> > > > 02/23/17 07:16:04 (1208.16) (46751): IO: Failed to read packet header
>> > > > 02/23/17 07:16:04 (1208.16) (46751): Can no longer talk to condor_starter <192.168.123.51:49120>
>> > > > 02/23/17 07:16:04 (1208.16) (46751): JobLeaseDuration remaining: 2014
>> > > > 02/23/17 07:16:04 (1208.16) (46751): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:05 (2143.0) (2719507): condor_read() failed: recv(fd=25) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.37:30389>.
>> > > > 02/23/17 07:16:05 (2143.0) (2719507): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.37:30389> (desired timeout=300s)
>> > > > 02/23/17 07:16:05 (2143.0) (2719507): IO: Failed to read packet header
>> > > > 02/23/17 07:16:05 (2143.0) (2719507): Can no longer talk to condor_starter <192.168.123.37:30389>
>> > > > 02/23/17 07:16:05 (2143.0) (2719507): This job cannot reconnect to starter, so job exiting
>> > > > 02/23/17 07:16:06 (1208.16) (46751): attempt to connect to <192.168.123.51:29246> failed: No route to host (connect errno = 113).
>> > > > 02/23/17 07:16:06 (1208.16) (46751): locateStarter(): Failed to connect to startd <192.168.123.51:29246?addrs=192.168.123.51-29246>
>> > > > 02/23/17 07:16:06 (1208.16) (46751): JobLeaseDuration remaining: 2398
>> > > > 02/23/17 07:16:06 (1208.16) (46751): Scheduling another attempt to reconnect in 8 seconds
>> > > > 02/23/17 07:16:07 (683.9) (2270376): condor_read() failed: recv(fd=4) returned -1, errno = 110 Connection timed out, reading 5 bytes from starter at <192.168.123.37:30325>.
>> > > > 02/23/17 07:16:07 (683.9) (2270376): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from starter at <192.168.123.37:30325> (desired timeout=300s)
>> > > > 02/23/17 07:16:07 (683.9) (2270376): IO: Failed to read packet header
>> > > > 02/23/17 07:16:07 (683.9) (2270376): Can no longer talk to condor_starter <192.168.123.37:30325>
>> > > > 02/23/17 07:16:07 (683.9) (2270376): JobLeaseDuration remaining: 2014
>> > > > 02/23/17 07:16:07 (683.9) (2270376): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:08 (2143.0) (2719507): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
>> > > > 02/23/17 07:16:08 (683.9) (2270376): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
>> > > > 02/23/17 07:16:08 (2143.0) (2719507): RemoteResource::killStarter(): Could not send command to startd
>> > > > 02/23/17 07:16:08 (683.9) (2270376): locateStarter(): Failed to connect to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
>> > > > 02/23/17 07:16:08 (683.9) (2270376): JobLeaseDuration remaining: 2399
>> > > > 02/23/17 07:16:08 (683.9) (2270376): Scheduling another attempt to reconnect in 8 seconds
>> > > > 02/23/17 07:16:11 (1209.3) (49060): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:11 (1209.3) (49060): Found starter: <192.168.123.37:38618?addrs=192.168.123.37-38618>
>> > > > 02/23/17 07:16:11 (1209.3) (49060): Attempting to reconnect to starter <192.168.123.37:38618?addrs=192.168.123.37-38618>
>> > > > 02/23/17 07:16:11 (2143.0) (2719507): ERROR "Can no longer talk to condor_starter <192.168.123.37:30389>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>> > > > 02/23/17 07:16:14 (1208.16) (46751): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:14 (1208.16) (46751): Found starter: <192.168.123.51:49120?addrs=192.168.123.51-49120>
>> > > > 02/23/17 07:16:14 (1208.16) (46751): Attempting to reconnect to starter <192.168.123.51:49120?addrs=192.168.123.51-49120>
>> > > > 02/23/17 07:16:15 (1208.16) (46751): Reconnect SUCCESS: connection re-established
>> > > > 02/23/17 07:16:16 (683.9) (2270376): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:16 (683.9) (2270376): Found starter: <192.168.123.37:30325?addrs=192.168.123.37-30325>
>> > > > 02/23/17 07:16:16 (683.9) (2270376): Attempting to reconnect to starter <192.168.123.37:30325?addrs=192.168.123.37-30325>
>> > > > 02/23/17 07:16:25 (683.9) (2270376): Reconnect SUCCESS: connection re-established
>> > > > 02/23/17 07:16:41 (1209.3) (49060): condor_read(): timeout reading 5 bytes from starter at <192.168.123.37:38618>.
>> > > > 02/23/17 07:16:41 (1209.3) (49060): IO: Failed to read packet header
>> > > > 02/23/17 07:16:41 (1209.3) (49060): Attempt to reconnect failed: Failed to read reply ClassAd
>> > > > 02/23/17 07:16:41 (1209.3) (49060): JobLeaseDuration remaining: 2361
>> > > > 02/23/17 07:16:41 (1209.3) (49060): Scheduling another attempt to reconnect in 16 seconds
>> > > > 02/23/17 07:16:57 (1209.3) (49060): Attempting to locate disconnected starter
>> > > > 02/23/17 07:16:57 (1209.3) (49060): Found starter: <192.168.123.37:38618?addrs=192.168.123.37-38618>
>> > > > 02/23/17 07:16:57 (1209.3) (49060): Attempting to reconnect to starter <192.168.123.37:38618?addrs=192.168.123.37-38618>
>> > > > 02/23/17 07:16:57 (1209.3) (49060): Reconnect SUCCESS: connection re-established
>> > > > 02/23/17 07:43:17 (2102.0) (2559295): Job 2102.0 terminated: killed by signal 6
>> > > > 02/23/17 07:43:17 (2102.0) (2559295): **** condor_shadow (condor_SHADOW) pid 2559295 EXITING WITH STATUS 115
>> > > >
>> > > > On Tuesday 21 February 2017 22:41:54 Harald van Pee wrote:
>> > > > > Hi Todd,
>> > > > >
>> > > > > thank you for your help.
>> > > > >
>> > > > > Concerning the "no route to host": I see no ethernet port going down on
>> > > > > any machine during that time, but maybe changing to
>> > > > > /etc/host.conf:
>> > > > > order hosts,bind
>> > > > >
>> > > > > /etc/nsswitch.conf:
>> > > > > hosts:      files dns
>> > > > >
>> > > > > instead of the debian default will help anyway; /etc/hosts contains the
>> > > > > ip addresses of all nodes.
>> > > > >
>> > > > > Regards
>> > > > > Harald
>> > > > >
>> > > > > On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
>> > > > > > On 2/21/2017 1:33 PM, Harald van Pee wrote:
>> > > > > > > It seems that openmpi (or mpi in general) is not used very often with
>> > > > > > > htcondor, the information is sparse, and I got some questions about
>> > > > > > > how I managed to get it running at all. I will share all I
>> > > > > > > know about this in a new thread soon, or is there a wiki where
>> > > > > > > I should put the information?
>> > > > > >
>> > > > > > Off-list I put Harald in touch with the folks who can put
>> > > > > > his info into the Manual or the HTCondor Wiki (from the web
>> > > > > > homepage, look for the links "HOWTO recipes" and "HTCondor
>> > > > > > Wiki").
>> > > > > >
>> > > > > > Also we did some work for the upcoming HTCondor v8.6.1 release so it
>> > > > > > works properly with the latest releases of OpenMPI - for details see
>> > > > > >    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
>> > > > > > >
>> > > > > > > Now back to our problem:
>> > > > > > > One hint that it is related to the network (ethernet or
>> > > > > > > infiniband) is that we had one job running for 11 days
>> > > > > > > without problems while fewer jobs were running, and we got
>> > > > > > > problems within a few days after we started 200 more jobs.
>> > > > > > > I have now found 2 independent parallel mpi jobs which share
>> > > > > > > one machine with one job each, and there are no ethernet problems
>> > > > > > > seen, neither on the scheduler machine nor on the starter node.
>> > > > > > > Unfortunately there is no error output in the jobs' error files.
>> > > > > > > It is clear that condor kills the jobs, but to me it is unclear
>> > > > > > > why, because it seems both starter processes are still running,
>> > > > > > > if I understand the logfiles correctly.
>> > > > > >
>> > > > > > At first blush, it looks to me like the condor_shadow on the
>> > > > > > submit node could no longer contact the execute node at IP
>> > > > > > address 192.168.123.37 due to "No route to host".  The "No route
>> > > > > > to host" error comes from the operating system, not from HTCondor -
>> > > > > > you can google this error and see lots of opinions/ideas on how to
>> > > > > > troubleshoot and fix it, but basically there is no route for the
>> > > > > > execute node's IP address in the client's routing table... not sure
>> > > > > > why this would happen all of a sudden, maybe some interface on your
>> > > > > > submit machine or some switch port is being disabled?
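>> > > > > >
>> > > > > > For example, when it happens again you could check on the submit
>> > > > > > machine (the interface name below is just a guess for your setup):
>> > > > > >
>> > > > > >   ip route get 192.168.123.37   # does the kernel currently have a route?
>> > > > > >   ip -s link show eth0          # carrier state and error counters
>> > > > > >   arp -n | grep 192.168.123.37  # is the ARP entry missing or incomplete?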
>> > > > > >
>> > > > > > regards,
>> > > > > > Todd
>> > > > > >
>> > > > > > > Maybe one of you can find a hint in the condor log below and
>> > > > > > > give me a hint as to what happened, or what I can do to find out.
>> > > > > > >
>> > > > > > > Best
>> > > > > > > Harald
>> > > > > > >
>> > > > > > > ShadowLog:
>> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed: recv(fd=12) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.37:30389>.
>> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed: recv(fd=9) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.37:30389>.
>> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.37:30389> (desired timeout=300s)
>> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.37:30389> (desired timeout=300s)
>> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
>> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
>> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to condor_starter <192.168.123.37:30389>
>> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to condor_starter <192.168.123.37:30389>
>> > > > > > > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to starter, so job exiting
>> > > > > > > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to starter, so job exiting
>> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
>> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to <192.168.123.37:30389> failed: No route to host (connect errno = 113).
>> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter(): Could not send command to startd
>> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter(): Could not send command to startd
>> > > > > > > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to condor_starter <192.168.123.37:30389>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>> > > > > > > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to condor_starter <192.168.123.37:30389>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>> > > > > > >
>> > > > > > > StarterLog of job 1745.0 on node 192.168.123.37
>> > > > > > > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
>> > > > > > > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes to <127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
>> > > > > > > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:25500>.
>> > > > > > > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
>> > > > > > > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting 2400 secs for reconnect
>> > > > > > > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful shutdown.
>> > > > > > > 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
>> > > > > > > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
>> > > > > > > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
>> > > > > > > 02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is exiting
>> > > > > > > 02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER) pid 751398 EXITING WITH STATUS 0
>> > > > > > >
>> > > > > > > StarterLog of job 1744.0 on node 192.168.123.37
>> > > > > > > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
>> > > > > > > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes to <127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
>> > > > > > > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:34337>.
>> > > > > > > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
>> > > > > > > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting 2400 secs for reconnect
>> > > > > > > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful shutdown.
>> > > > > > > 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
>> > > > > > > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
>> > > > > > > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
>> > > > > > > 02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is exiting
>> > > > > > > 02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER) pid 751399 EXITING WITH STATUS 0
>> > > > > > >
>> > > > > > > StartLog:
>> > > > > > > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
>> > > > > > > 02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>> > > > > > > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
>> > > > > > > 02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>> > > > > > > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
>> > > > > > > 02/19/17 03:10:03 slot1_11: State change: starter exited
>> > > > > > > 02/19/17 03:10:03 slot1_11: State change: No preempting claim, returning to owner
>> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
>> > > > > > > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
>> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
>> > > > > > > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
>> > > > > > > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
>> > > > > > > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
>> > > > > > > 02/19/17 03:10:05 slot1_13: State change: starter exited
>> > > > > > > 02/19/17 03:10:05 slot1_13: State change: No preempting claim, returning to owner
>> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state and activity: Preempting/Vacating -> Owner/Idle
>> > > > > > > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
>> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
>> > > > > > > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
>> > > > > > > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
>> > > > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId (<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
>> > > > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes to <192.168.123.100:20962>, fd is 8
>> > > > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
>> > > > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId (<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
>> > > > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes to <192.168.123.100:34792>, fd is 8
>> > > > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
>> > > > > > >
>> > > > > > > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
>> > > > > > >> Dear experts,
>> > > > > > >>
>> > > > > > >> I have some questions for debugging:
>> > > > > > >> Can I avoid the restarting of a job in the vanilla and/or parallel
>> > > > > > >> universe if I use Requirements = (NumJobStarts == 0)
>> > > > > > >> in the condor submit description file?
>> > > > > > >> If it works, will the job stay idle or will it be removed?
>> > > > > > >>
>> > > > > > >> I found a job in the vanilla universe that was started on 12/9,
>> > > > > > >> restarted shortly before Christmas, and is still running. I assume
>> > > > > > >> the reason was also network problems, but unfortunately our
>> > > > > > >> remaining condor and system log files only go back to January.
>> > > > > > >> Is there any possibility to make condor a little bit more
>> > > > > > >> robust against network problems via configuration? Just wait a
>> > > > > > >> little bit longer or make more reconnection attempts?
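>> > > > > > >>
>> > > > > > >> For example (just my guess, I have not tried it), would a longer job
>> > > > > > >> lease in the submit description file give the shadow more time to
>> > > > > > >> reconnect before the job is given up:
>> > > > > > >>
>> > > > > > >> # default lease appears to be 2400 seconds (40 minutes); the value
>> > > > > > >> # below is only an example
>> > > > > > >> job_lease_duration = 7200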
>> > > > > > >>
>> > > > > > >> We are working on automatic restarting of the mpi jobs and trying to
>> > > > > > >> use more frequent checkpoints, but it seems to be a lot of work, so
>> > > > > > >> any idea would be welcome.
>> > > > > > >>
>> > > > > > >> Best
>> > > > > > >> Harald
>> > > > > > >>
>> > > > > > >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
>> > > > > > >>> There is one important argument for why I think the problem is
>> > > > > > >>> condor-related, not mpi-related (of course I could be wrong).
>> > > > > > >>> The condor communication goes via ethernet, and the ethernet
>> > > > > > >>> connection has a problem for several minutes.
>> > > > > > >>> The mpi communication goes via infiniband, and there is no
>> > > > > > >>> infiniband problem during this time.
>> > > > > > >>>
>> > > > > > >>> Harald
>> > > > > > >>>
>> > > > > > >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
>> > > > > > >>>> Hi Greg,
>> > > > > > >>>>
>> > > > > > >>>> thanks for your answer.
>> > > > > > >>>>
>> > > > > > >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
>> > > > > > >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
>> > > > > > >>>>>> Hello,
>> > > > > > >>>>>>
>> > > > > > >>>>>> we got mpi running in the parallel universe with htcondor 8.4
>> > > > > > >>>>>> using openmpiscript, and in general it is working without any
>> > > > > > >>>>>> problem.
>> > > > > > >>>>>
>> > > > > > >>>>> In general, the MPI jobs themselves cannot survive a
>> > > > > > >>>>> network outage or partition, even a temporary one.
>> > > > > > >>>>> HTCondor will reconnect the shadow to the starters, if the
>> > > > > > >>>>> problem is just between the submit machine and the execute
>> > > > > > >>>>> machines, but if the network problem also impacts node-to-node
>> > > > > > >>>>> communication, then the job has to be aborted and restarted from
>> > > > > > >>>>> scratch because of the way MPI works.
>> > > > > > >>>>
>> > > > > > >>>> The problem seems to be between the submit machine and one running
>> > > > > > >>>> node (not the node where mpirun was started).
>> > > > > > >>>> If you are right, it should be possible to get or find an error
>> > > > > > >>>> from mpirun because it lost one node, right?
>> > > > > > >>>> But it seems condor kills the job because of a shadow exception.
>> > > > > > >>>> Unfortunately we do not see the output of the stopped job
>> > > > > > >>>> because it is overwritten by the newly started one.
>> > > > > > >>>> Any suggestion on how to find out whether it is really an
>> > > > > > >>>> mpi-related problem?
>> > > > > > >>>>
>> > > > > > >>>>> If possible, we would recommend that long-running jobs that
>> > > > > > >>>>> suffer from this problem try to self-checkpoint themselves, so
>> > > > > > >>>>> that when they are restarted, they don't need to be
>> > > > > > >>>>> restarted from scratch.
>> > > > > > >>>>>
>> > > > > > >>>>> -greg
>> > > >
>
> --
> Harald van Pee
>
> Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
> Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
> mail: pee@xxxxxxxxxxxxxxxxx
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/