
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Hello,

It has happened again. What we have learned so far is:
- a communication problem occurs between the scheduler node and the starter node
- condor kills the starter process and afterwards kills the job
- several different nodes are affected; due to a lack of statistics we can neither
claim that all nodes are affected nor exclude that some have more problems
than others (see the sketch below)
- it is very unlikely that the program itself has a problem, because we have seen
that 2 starter processes of 2 independent parallel jobs were killed on the
same node at the same time
- at least within the last 2 months only parallel jobs have been affected, but
there is no hint of an MPI problem; any help on how one can prove that no MPI
problem exists is welcome
We have many more vanilla starters running than parallel ones.
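To get better statistics on which jobs reconnect and which get killed, one
could tally the outcomes per job directly from the ShadowLog; a rough sketch
(the log path is only a common default, and the pattern follows the log
format quoted below):

#!/usr/bin/env python
# Rough sketch: group "Can no longer talk" / reconnect outcomes per job
# from a ShadowLog. The path is an assumption (a common default); adjust it.
import re
from collections import defaultdict

SHADOW_LOG = "/var/log/condor/ShadowLog"

# ShadowLog lines look like: 02/23/17 07:16:02 (1209.3) (49060): <message>
LINE = re.compile(r'^\S+ \S+ \((\d+\.\d+)\) \(\d+\): (.*)$')

events = defaultdict(list)
with open(SHADOW_LOG) as f:
    for raw in f:
        m = LINE.match(raw)
        if not m:
            continue
        job, msg = m.groups()
        if ("Can no longer talk to condor_starter" in msg
                or "Reconnect SUCCESS" in msg
                or "cannot reconnect to starter" in msg):
            events[job].append(msg.strip())

for job in sorted(events):
    msgs = events[job]
    outcome = "reconnected" if any("Reconnect SUCCESS" in m for m in msgs) \
        else "not reconnected"
    print("%s: %s" % (job, outcome))
    for m in msgs:
        print("    " + m)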

This morning the same node as last week was affected. On this node
9 single vanilla starters are running, 2 of them for more than 47 days, plus
5 starters belonging to 2 parallel jobs; only one starter of one parallel
job was killed.
From the ShadowLog below, one can see that several starters from several jobs,
and not only from node 37, have communication problems, and the period during
which the problem occurs is less than one minute. Therefore I would expect that
there would be no problem reconnecting to the starters, and this is indeed true
for all the vanilla jobs.
But why were the parallel starters killed so fast?
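If it turns out that the reconnect window is simply too short for the parallel
case (I do not know whether the lease is even consulted there), the lease could
be lengthened per job in the submit description file; the 2400 seconds shown in
the log below appears to be the default JobLeaseDuration, and the value here is
only an example:

job_lease_duration = 7200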

Any idea is welcome
Harald

ShadowLog (beginning 2 lines before and ending 2 lines after the minute of the
problem):
02/23/17 07:10:59 (1835.0) (2412453): Job 1835.0 terminated: exited with 
status 0
02/23/17 07:10:59 (1835.0) (2412453): **** condor_shadow (condor_SHADOW) pid 
2412453 EXITING WITH STATUS 115
02/23/17 07:16:02 (1209.3) (49060): condor_read() failed: recv(fd=4) returned 
-1, errno = 110 Connection timed out, reading 5 bytes from startd 
slot1@xxxxxxxxxxxxxxxxxxxxxxxx
02/23/17 07:16:02 (1209.3) (49060): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from startd slot1@xxxxxxxxxxxxxxxxxxxxxxx 
(desired timeout=300s)
02/23/17 07:16:02 (1209.3) (49060): IO: Failed to read packet header
02/23/17 07:16:02 (1209.3) (49060): Can no longer talk to condor_starter 
<192.168.123.37:30389>
02/23/17 07:16:02 (1209.3) (49060): Trying to reconnect to disconnected job
02/23/17 07:16:02 (1209.3) (49060): LastJobLeaseRenewal: 1487830176 Thu Feb 23 
07:09:36 2017
02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration: 2400 seconds
02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration remaining: 2014
02/23/17 07:16:02 (1209.3) (49060): Attempting to locate disconnected starter
02/23/17 07:16:03 (1209.3) (49060): attempt to connect to 
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/23/17 07:16:03 (1209.3) (49060): locateStarter(): Failed to connect to 
startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
02/23/17 07:16:03 (1209.3) (49060): JobLeaseDuration remaining: 2399
02/23/17 07:16:03 (1209.3) (49060): Scheduling another attempt to reconnect in 
8 seconds
02/23/17 07:16:04 (1208.16) (46751): condor_read() failed: recv(fd=4) returned 
-1, errno = 110 Connection timed out, reading 5 bytes from starter at 
<192.168.123.51:49120>.
02/23/17 07:16:04 (1208.16) (46751): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from starter at <192.168.123.51:49120> 
(desired timeout=300s)
02/23/17 07:16:04 (1208.16) (46751): IO: Failed to read packet header
02/23/17 07:16:04 (1208.16) (46751): Can no longer talk to condor_starter 
<192.168.123.51:49120>
02/23/17 07:16:04 (1208.16) (46751): JobLeaseDuration remaining: 2014
02/23/17 07:16:04 (1208.16) (46751): Attempting to locate disconnected starter
02/23/17 07:16:05 (2143.0) (2719507): condor_read() failed: recv(fd=25) 
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at 
<192.168.123.37:30389>.
02/23/17 07:16:05 (2143.0) (2719507): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from startd at <192.168.123.37:30389> 
(desired timeout=300s)
02/23/17 07:16:05 (2143.0) (2719507): IO: Failed to read packet header
02/23/17 07:16:05 (2143.0) (2719507): Can no longer talk to condor_starter 
<192.168.123.37:30389>
02/23/17 07:16:05 (2143.0) (2719507): This job cannot reconnect to starter, so 
job exiting
02/23/17 07:16:06 (1208.16) (46751): attempt to connect to 
<192.168.123.51:29246> failed: No route to host (connect errno = 113).
02/23/17 07:16:06 (1208.16) (46751): locateStarter(): Failed to connect to 
startd <192.168.123.51:29246?addrs=192.168.123.51-29246>
02/23/17 07:16:06 (1208.16) (46751): JobLeaseDuration remaining: 2398
02/23/17 07:16:06 (1208.16) (46751): Scheduling another attempt to reconnect 
in 8 seconds
02/23/17 07:16:07 (683.9) (2270376): condor_read() failed: recv(fd=4) returned 
-1, errno = 110 Connection timed out, reading 5 bytes from starter at 
<192.168.123.37:30325>.
02/23/17 07:16:07 (683.9) (2270376): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from starter at <192.168.123.37:30325> 
(desired timeout=300s)
02/23/17 07:16:07 (683.9) (2270376): IO: Failed to read packet header
02/23/17 07:16:07 (683.9) (2270376): Can no longer talk to condor_starter 
<192.168.123.37:30325>
02/23/17 07:16:07 (683.9) (2270376): JobLeaseDuration remaining: 2014
02/23/17 07:16:07 (683.9) (2270376): Attempting to locate disconnected starter
02/23/17 07:16:08 (2143.0) (2719507): attempt to connect to 
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/23/17 07:16:08 (683.9) (2270376): attempt to connect to 
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/23/17 07:16:08 (2143.0) (2719507): RemoteResource::killStarter(): Could not 
send command to startd
02/23/17 07:16:08 (683.9) (2270376): locateStarter(): Failed to connect to 
startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
02/23/17 07:16:08 (683.9) (2270376): JobLeaseDuration remaining: 2399
02/23/17 07:16:08 (683.9) (2270376): Scheduling another attempt to reconnect 
in 8 seconds
02/23/17 07:16:11 (1209.3) (49060): Attempting to locate disconnected starter
02/23/17 07:16:11 (1209.3) (49060): Found starter: 
<192.168.123.37:38618?addrs=192.168.123.37-38618>
02/23/17 07:16:11 (1209.3) (49060): Attempting to reconnect to starter 
<192.168.123.37:38618?addrs=192.168.123.37-38618>
02/23/17 07:16:11 (2143.0) (2719507): ERROR "Can no longer talk to 
condor_starter <192.168.123.37:30389>" at line 209 in file 
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
02/23/17 07:16:14 (1208.16) (46751): Attempting to locate disconnected starter
02/23/17 07:16:14 (1208.16) (46751): Found starter: 
<192.168.123.51:49120?addrs=192.168.123.51-49120>
02/23/17 07:16:14 (1208.16) (46751): Attempting to reconnect to starter 
<192.168.123.51:49120?addrs=192.168.123.51-49120>
02/23/17 07:16:15 (1208.16) (46751): Reconnect SUCCESS: connection re-
established
02/23/17 07:16:16 (683.9) (2270376): Attempting to locate disconnected starter
02/23/17 07:16:16 (683.9) (2270376): Found starter: 
<192.168.123.37:30325?addrs=192.168.123.37-30325>
02/23/17 07:16:16 (683.9) (2270376): Attempting to reconnect to starter 
<192.168.123.37:30325?addrs=192.168.123.37-30325>
02/23/17 07:16:25 (683.9) (2270376): Reconnect SUCCESS: connection re-
established
02/23/17 07:16:41 (1209.3) (49060): condor_read(): timeout reading 5 bytes 
from starter at <192.168.123.37:38618>.
02/23/17 07:16:41 (1209.3) (49060): IO: Failed to read packet header
02/23/17 07:16:41 (1209.3) (49060): Attempt to reconnect failed: Failed to 
read reply ClassAd
02/23/17 07:16:41 (1209.3) (49060): JobLeaseDuration remaining: 2361
02/23/17 07:16:41 (1209.3) (49060): Scheduling another attempt to reconnect in 
16 seconds
02/23/17 07:16:57 (1209.3) (49060): Attempting to locate disconnected starter
02/23/17 07:16:57 (1209.3) (49060): Found starter: 
<192.168.123.37:38618?addrs=192.168.123.37-38618>
02/23/17 07:16:57 (1209.3) (49060): Attempting to reconnect to starter 
<192.168.123.37:38618?addrs=192.168.123.37-38618>
02/23/17 07:16:57 (1209.3) (49060): Reconnect SUCCESS: connection re-
established
02/23/17 07:43:17 (2102.0) (2559295): Job 2102.0 terminated: killed by signal 
6
02/23/17 07:43:17 (2102.0) (2559295): **** condor_shadow (condor_SHADOW) pid 
2559295 EXITING WITH STATUS 115




On Tuesday 21 February 2017 22:41:54 Harald van Pee wrote:
> Hi Todd,
> 
> thank you for your help.
> 
> Concerning the "no route to host": I see no ethernet port going down on any
> machine during that time, but maybe a change in
> /etc/host.conf:
> order hosts,bind
> 
> /etc/nsswitch.conf:
> hosts:      files dns
> 
> instead of the Debian default will help anyway; /etc/hosts has the IP
> addresses of all nodes in it.
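To verify on each node which address a name actually resolves to after this
change (without waiting for the next incident), a small check like the
following could be run; the node names below are only placeholders:

# Rough sketch: compare what the resolver returns with what /etc/hosts says.
# The node names are placeholders; replace them with the real ones.
import socket

NODES = ["node37", "node51", "node100"]

hosts_entries = {}
with open("/etc/hosts") as f:
    for line in f:
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            for name in fields[1:]:
                hosts_entries[name] = fields[0]

for node in NODES:
    try:
        resolved = socket.gethostbyname(node)
    except socket.error as err:
        resolved = "lookup failed (%s)" % err
    print("%s -> %s | /etc/hosts: %s"
          % (node, resolved, hosts_entries.get(node, "no entry")))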
> 
> Regards
> Harald
> 
> On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
> > On 2/21/2017 1:33 PM, Harald van Pee wrote:
> > > It seems that OpenMPI (or MPI in general) is not used very often with
> > > HTCondor and the information is sparse, and I got some questions about
> > > how I managed to get it running at all. I will share all I know about
> > > this in a new thread soon, or is there a wiki where I should put the
> > > information?
> > 
> > Off-list I put Harald in touch with the folks who can put Harald's info
> > into the Manual or the HTCondor Wiki (from the web homepage, look for
> > the links "HOWTO recipes" and "HTCondor Wiki").
> > 
> > Also we did some work for the upcoming HTCondor v8.6.1 release so that it
> > works properly with the latest releases of OpenMPI - for details see
> > 
> >    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
> > > 
> > > Now back to our problem:
> > > One hint that it is related to the network (ethernet or infiniband)
> > > is that we had one job running for 11 days without problems while we
> > > had fewer jobs running, and we got problems within a few days after we
> > > started 200 more jobs.
> > > I have now found 2 independent parallel MPI jobs which share one
> > > machine with one job each, and there are no ethernet problems seen,
> > > neither on the scheduler machine nor on the starter node. Unfortunately
> > > there is no error output in the jobs' error files.
> > > It is clear that condor kills the jobs, but to me it is unclear why,
> > > because it seems both starter processes are still running, if I
> > > understand the log files correctly.
> > 
> > At first blush, it looks to me like the condor_shadow on the submit node
> > could no longer contact the execute node at IP address 192.168.123.37
> > due to "No route to host".  The "No route to host" error comes from the
> > operating system, not from HTCondor - you can google this error and see
> > lots of opinions/ideas on how to troubleshoot and fix it, but basically
> > there is no route for the execute node IP address in the client's
> > routing table... not sure why this would happen all of a sudden; maybe
> > some interface on your submit machine is being disabled, or some switch
> > port?
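To catch the exact moment the route disappears, something like the following
could be left running on the submit machine against the startd address from
the log; the address and the interval are only examples taken from the
ShadowLog above:

# Rough sketch: periodically try to open (and immediately close) a TCP
# connection to a startd address and log the errno, to timestamp when
# "No route to host" (errno 113, EHOSTUNREACH) starts and stops.
import errno
import socket
import time

TARGET = ("192.168.123.37", 30389)   # startd address seen in the ShadowLog
INTERVAL = 30                        # seconds between probes (example)

while True:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(10)
    try:
        s.connect(TARGET)
        result = "connect ok"
    except socket.error as err:
        name = errno.errorcode.get(err.errno, "?") if err.errno else "timeout"
        result = "connect failed: errno %s (%s)" % (err.errno, name)
    finally:
        s.close()
    print("%s %s:%d %s" % (time.strftime("%m/%d/%y %H:%M:%S"),
                           TARGET[0], TARGET[1], result))
    time.sleep(INTERVAL)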
> > 
> > regards,
> > Todd
> > 
> > > Maybe one of you will find a hint in the condor log below and can give
> > > me a hint about what happens, or about what I can do to find out.
> > > 
> > > Best
> > > Harald
> > > 
> > > ShadowLog:
> > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed: recv(fd=12)
> > > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > > startd at <192.168.123.37:30389>.
> > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed: recv(fd=9)
> > > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > > startd at <192.168.123.37:30389>.
> > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read
> > > timeout after 0s during non-blocking read from startd at
> > > <192.168.123.37:30389> (desired timeout=300s)
> > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read
> > > timeout after 0s during non-blocking read from startd at
> > > <192.168.123.37:30389> (desired timeout=300s)
> > > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
> > > 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
> > > 02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to
> > > condor_starter <192.168.123.37:30389>
> > > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to
> > > condor_starter <192.168.123.37:30389>
> > > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to
> > > starter, so job exiting
> > > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to
> > > starter, so job exiting
> > > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to
> > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to
> > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > 02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter():
> > > Could not send command to startd
> > > 02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter():
> > > Could not send command to startd
> > > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to
> > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to
> > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > 
> > > 
> > > StarterLog of job 1745.0 on node 192.168.123.37
> > > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
> > > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes to
> > > <127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
> > > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8)
> > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > from <192.168.123.100:25500>.
> > > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
> > > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting 2400
> > > secs for reconnect
> > > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful shutdown.
> > > 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
> > > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
> > > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
> > > 02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is exiting
> > > 02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER) pid
> > > 751398 EXITING WITH STATUS 0
> > > 
> > > StarterLog of job 1744.0 on node 192.168.123.37
> > > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
> > > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes to
> > > <127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
> > > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8)
> > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > from <192.168.123.100:34337>.
> > > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
> > > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting 2400
> > > secs for reconnect
> > > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful shutdown.
> > > 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
> > > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
> > > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
> > > 02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is exiting
> > > 02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER) pid
> > > 751399 EXITING WITH STATUS 0
> > > 
> > > StartLog:
> > > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
> > > 02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy
> > > -> Preempting/Vacating
> > > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
> > > 02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy
> > > -> Preempting/Vacating
> > > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
> > > 02/19/17 03:10:03 slot1_11: State change: starter exited
> > > 02/19/17 03:10:03 slot1_11: State change: No preempting claim,
> > > returning to owner
> > > 02/19/17 03:10:03 slot1_11: Changing state and activity:
> > > Preempting/Vacating -> Owner/Idle
> > > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
> > > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
> > > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
> > > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
> > > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
> > > 02/19/17 03:10:05 slot1_13: State change: starter exited
> > > 02/19/17 03:10:05 slot1_13: State change: No preempting claim,
> > > returning to owner
> > > 02/19/17 03:10:05 slot1_13: Changing state and activity:
> > > Preempting/Vacating -> Owner/Idle
> > > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
> > > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
> > > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
> > > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
> > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > (<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM);
> > > perhaps this claim was removed already.
> > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13
> > > bytes to <192.168.123.100:20962>, fd is 8
> > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > (<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM);
> > > perhaps this claim was removed already.
> > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13
> > > bytes to <192.168.123.100:34792>, fd is 8
> > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > 
> > > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> > >> Dear experts,
> > >> 
> > >> I have some questions for debugging:
> > >> Can I avoid restarting of a job in the vanilla and/or parallel universe
> > >> if I use Requirements = (NumJobStarts==0)
> > >> in the condor submit description file?
> > >> If it works, will the job stay idle or will it be removed?
> > >> 
> > >> I found a job in the vanilla universe that was started on 12/9,
> > >> restarted shortly before Christmas, and is still running. I assume the
> > >> reason was also network problems, but unfortunately our remaining
> > >> condor and system log files only go back to January.
> > >> Is there any possibility to make condor a little bit more robust
> > >> against network problems via configuration? Just wait a little bit
> > >> longer or make more reconnection attempts?
> > >> 
> > >> We are working on automatic restart of the MPI jobs and trying to use
> > >> more frequent checkpoints, but it seems to be a lot of work, so any
> > >> idea would be welcome.
> > >> 
> > >> Best
> > >> Harald
> > >> 
> > >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> > >>> There is one important argument why I think the problem is condor
> > >>> related and not MPI related (of course I can be wrong):
> > >>> the condor communication goes via ethernet, and the ethernet
> > >>> connection has a problem for several minutes, while
> > >>> the MPI communication goes via infiniband, and there is no infiniband
> > >>> problem during this time.
> > >>> 
> > >>> Harald
> > >>> 
> > >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > >>>> Hi Greg,
> > >>>> 
> > >>>> thanks for your answer.
> > >>>> 
> > >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > >>>>>> Hello,
> > >>>>>> 
> > >>>>>> we got MPI running in the parallel universe with HTCondor 8.4 using
> > >>>>>> openmpiscript, and in general it is working without any problem.
> > >>>>> 
> > >>>>> In general, the MPI jobs themselves cannot survive a network outage
> > >>>>> or partition, even a temporary one.  HTCondor will reconnect the
> > >>>>> shadow to the starters, if the problem is just between the submit
> > >>>>> machine and the execute machines, but if the network problem also
> > >>>>> impacts node-to-node communication, then the job has to be aborted
> > >>>>> and restarted from scratch because of the way MPI works.
> > >>>> 
> > >>>> The problem seems to be between the submit machine and one running
> > >>>> node (not the node where mpirun was started).
> > >>>> If you are right, it should be possible to see an error from mpirun
> > >>>> because it lost one node, right?
> > >>>> But it seems condor kills the job because of a shadow exception.
> > >>>> Unfortunately we do not see the output of the stopped job because it
> > >>>> is overwritten by the newly started one.
> > >>>> Any suggestion on how to find out whether it is really an MPI
> > >>>> related problem?
> > >>>> 
> > >>>>> If possible, we would recommend that long-running jobs that suffer
> > >>>>> from this problem try to self-checkpoint themselves, so that when
> > >>>>> they are restarted, they don't need to be restarted from scratch.
> > >>>>> 
> > >>>>> -greg