
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Hello, 

here is my status update:

- We are now sure that this problem has never been seen in the vanilla universe;
the only job in question was restarted because of a broken node.

- The new options in openmpiscript from HTCondor 8.6.1 make no difference.
 In particular, excluding the ethernet interfaces, which Jason Patton expected
 to help based on the OpenMPI documentation, does not help (see the first
 sketch after this list for how such an exclusion is typically expressed).

- It is clearly caused by minor ethernet problems, because:
 a) We see connection problems to some nodes at the same time, but all
 connections except some from the parallel universe are re-established.
 b) We have identified 3 nodes which cause more problems than others;
 if we exclude these nodes (see the second sketch after this list), we have
 managed to run several MPI jobs with 40 MPI nodes for longer than 16 days
 without restarts (the jobs were removed by the user or finished).
But there is no reason to assume that these nodes have any severe problem,
because we see no MPI errors even with high verbosity, and on the problem
nodes vanilla jobs have been running for up to 101 days now.
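
For reference, here is roughly how the two exclusions mentioned above are
expressed; this is only a sketch, and the interface and host names are
placeholders, not our real ones. The interface exclusion from the
openmpiscript item is, as far as I understand, done with OpenMPI MCA
parameters such as

  mpirun --mca btl_tcp_if_exclude lo,eth0 --mca oob_tcp_if_exclude lo,eth0 ...

and the node exclusion is simply a requirements expression in the submit
description file, along the lines of

  requirements = (Machine != "node1.example") && (Machine != "node2.example") && (Machine != "node3.example")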

- Here is what happened today:
Summary: Condor assumes that a starter on node 3 has a problem and sends a
kill command. Even though Condor reports that the kill command could not be
sent, a kill does reach node 2, as it should, because that was the first node,
where mpirun is running.
It also sends SIGQUIT before SIGTERM, which I had not expected, and maybe this
is the reason why our trap handler does not work, because it expects only
SIGTERM? In that case some MPI programs are still running after the job has
been removed from condor (a sketch of a handler that catches both signals
follows after this summary).
Then, later, the starters of job 3939 on node 3 also get a kill signal and
handle it. But this proves that these starters were still alive and that there
was no reason to kill them, right?
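
To make the trap handler point concrete, here is a minimal sketch of the kind
of handler I mean, assuming the job wrapper is a bash script; the cleanup body
and the MPIRUN_PID variable are placeholders for our actual logic, not what we
currently run:

  #!/bin/bash
  # Catch both SIGQUIT (fast shutdown) and SIGTERM (graceful shutdown),
  # since the starter may deliver either one.
  cleanup() {
      echo "caught shutdown signal, terminating mpirun" >&2
      # Placeholder cleanup: forward the signal to mpirun and wait for it.
      [ -n "$MPIRUN_PID" ] && kill -TERM "$MPIRUN_PID" 2>/dev/null
      wait "$MPIRUN_PID" 2>/dev/null
      exit 1
  }
  trap cleanup SIGTERM SIGQUIT

  mpirun "$@" &
  MPIRUN_PID=$!
  wait "$MPIRUN_PID"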

Even if there are workarounds, I think this behaviour could and should be
improved.
Or can we expect that something has changed in this respect with HTCondor 8.6.1?

Best
Harald

Here are the most relevant log messages:
 
ShadowLog:
03/20/17 07:54:11 (3939.0) (1617771): condor_read() failed: recv(fd=10) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.3:29143>.
03/20/17 07:54:11 (3939.0) (1617771): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.3:29143> (desired timeout=300s)
03/20/17 07:54:11 (3939.0) (1617771): IO: Failed to read packet header
03/20/17 07:54:11 (3939.0) (1617771): Can no longer talk to condor_starter <192.168.123.3:29143>
03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter, so job exiting
03/20/17 07:54:12 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
03/20/17 07:54:12 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
03/20/17 07:54:15 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
03/20/17 07:54:15 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
03/20/17 07:54:21 (3939.0) (1617771): attempt to connect to <192.168.123.3:29143> failed: No route to host (connect errno = 113).
03/20/17 07:54:21 (3939.0) (1617771): RemoteResource::killStarter(): Could not send command to startd
03/20/17 07:54:24 (3939.0) (1617771): ERROR "Can no longer talk to condor_starter <192.168.123.3:29143>" at line 209 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp

StarterLog.slot1_2 on node 2 (mpirun of job 3939.0)
03/20/17 07:54:11 (pid:1066066) Got SIGQUIT.  Performing fast shutdown.
03/20/17 07:54:11 (pid:1066066) ShutdownFast all jobs.
03/20/17 07:54:11 (pid:1066066) Got SIGTERM. Performing graceful shutdown.
03/20/17 07:54:11 (pid:1066066) ShutdownGraceful all jobs.
03/20/17 07:54:11 (pid:1066066) Process exited, pid=1066068, status=0
03/20/17 07:54:24 (pid:1066066) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:18658>.
03/20/17 07:54:24 (pid:1066066) IO: Failed to read packet header
03/20/17 07:54:24 (pid:1066066) Lost connection to shadow, waiting 2400 secs for reconnect
03/20/17 07:54:24 (pid:1066066) Failed to send job exit status to shadow
03/20/17 07:54:24 (pid:1066066) Last process exited, now Starter is exiting
03/20/17 07:54:24 (pid:1066066) **** condor_starter (condor_STARTER) pid 1066066 EXITING WITH STATUS 0


StarterLog.slot1_2 on node 3 running job 3939.0
03/20/17 07:54:33 (pid:1056820) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:24154>.
03/20/17 07:54:33 (pid:1056820) IO: Failed to read packet header
03/20/17 07:54:33 (pid:1056820) Lost connection to shadow, waiting 2400 secs for reconnect
03/20/17 07:54:33 (pid:1056820) Got SIGTERM. Performing graceful shutdown.
03/20/17 07:54:33 (pid:1056820) ShutdownGraceful all jobs.
03/20/17 07:54:33 (pid:1056820) Process exited, pid=1056824, status=0
03/20/17 07:54:33 (pid:1056820) Failed to send job exit status to shadow
03/20/17 07:54:33 (pid:1056820) Last process exited, now Starter is exiting
03/20/17 07:54:33 (pid:1056820) **** condor_starter (condor_STARTER) pid 1056820 EXITING WITH STATUS 0

StarterLog.slot1_3 on node 3 running job 3939.0
03/20/17 07:54:46 (pid:1056821) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <192.168.123.100:3768>.
03/20/17 07:54:46 (pid:1056821) IO: Failed to read packet header
03/20/17 07:54:46 (pid:1056821) Lost connection to shadow, waiting 2400 secs for reconnect
03/20/17 07:54:46 (pid:1056821) Got SIGTERM. Performing graceful shutdown.
03/20/17 07:54:46 (pid:1056821) ShutdownGraceful all jobs.
03/20/17 07:54:46 (pid:1056821) Process exited, pid=1056823, status=0
03/20/17 07:54:46 (pid:1056821) Failed to send job exit status to shadow
03/20/17 07:54:46 (pid:1056821) Last process exited, now Starter is exiting
03/20/17 07:54:46 (pid:1056821) **** condor_starter (condor_STARTER) pid 1056821 EXITING WITH STATUS 0





On Thursday 23 February 2017 15:12:38 Harald van Pee wrote:
> Hello,
> 
> it has happened again. What we have learned so far is:
> - a communication problem occurs between the scheduler node and a starter node
> - condor kills the starter process and afterwards kills the job
> - several different nodes are affected; due to a lack of statistics we can
> neither claim that all nodes are affected nor exclude that some have more
> problems than others.
> - it is very unlikely that the program itself has a problem, because we have
> seen that 2 starter processes of 2 independent parallel jobs were killed
> on the same node at the same time.
> - at least within the last 2 months only parallel jobs have been affected, but
> there is no hint of an MPI problem; any help on how one can prove that no MPI
> problem exists is welcome.
> We have many more vanilla starters running than parallel ones.
> 
> This morning the same node as last week was affected. On this node
> 9 single vanilla starters are running, 2 of them for more than 47 days,
> plus 5 starters of 2 parallel jobs; only one starter of one parallel job
> was killed.
> From the ShadowLog below, one can see that several starters from several
> jobs, and not only those on node 37, have communication problems, and the
> time during which the problems occur is less than one minute. Therefore I
> would expect no problem reconnecting to the starters, and this is indeed
> true for all vanilla jobs.
> But why were the parallel starters killed so fast?
> 
> Any idea is welcome
> Harald
> 
> ShadowLog (begin 2 lines before,  end 2 lines after the minute of the
> problem):
> 02/23/17 07:10:59 (1835.0) (2412453): Job 1835.0 terminated: exited with
> status 0
> 02/23/17 07:10:59 (1835.0) (2412453): **** condor_shadow (condor_SHADOW)
> pid 2412453 EXITING WITH STATUS 115
> 02/23/17 07:16:02 (1209.3) (49060): condor_read() failed: recv(fd=4)
> returned -1, errno = 110 Connection timed out, reading 5 bytes from startd
> slot1@xxxxxxxxxxxxxxxxxxxxxxxx
> 02/23/17 07:16:02 (1209.3) (49060): condor_read(): UNEXPECTED read timeout
> after 0s during non-blocking read from startd slot1@xxxxxxxxxxxxxxxxxxxxxxx
> (desired timeout=300s)
> 02/23/17 07:16:02 (1209.3) (49060): IO: Failed to read packet header
> 02/23/17 07:16:02 (1209.3) (49060): Can no longer talk to condor_starter
> <192.168.123.37:30389>
> 02/23/17 07:16:02 (1209.3) (49060): Trying to reconnect to disconnected job
> 02/23/17 07:16:02 (1209.3) (49060): LastJobLeaseRenewal: 1487830176 Thu Feb
> 23 07:09:36 2017
> 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration: 2400 seconds
> 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration remaining: 2014
> 02/23/17 07:16:02 (1209.3) (49060): Attempting to locate disconnected starter
> 02/23/17 07:16:03 (1209.3) (49060): attempt to connect to
> <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> 02/23/17 07:16:03 (1209.3) (49060): locateStarter(): Failed to connect to
> startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> 02/23/17 07:16:03 (1209.3) (49060): JobLeaseDuration remaining: 2399
> 02/23/17 07:16:03 (1209.3) (49060): Scheduling another attempt to reconnect
> in 8 seconds
> 02/23/17 07:16:04 (1208.16) (46751): condor_read() failed: recv(fd=4)
> returned -1, errno = 110 Connection timed out, reading 5 bytes from
> starter at <192.168.123.51:49120>.
> 02/23/17 07:16:04 (1208.16) (46751): condor_read(): UNEXPECTED read timeout
> after 0s during non-blocking read from starter at <192.168.123.51:49120>
> (desired timeout=300s)
> 02/23/17 07:16:04 (1208.16) (46751): IO: Failed to read packet header
> 02/23/17 07:16:04 (1208.16) (46751): Can no longer talk to condor_starter
> <192.168.123.51:49120>
> 02/23/17 07:16:04 (1208.16) (46751): JobLeaseDuration remaining: 2014
> 02/23/17 07:16:04 (1208.16) (46751): Attempting to locate disconnected starter
> 02/23/17 07:16:05 (2143.0) (2719507): condor_read() failed:
> recv(fd=25) returned -1, errno = 110 Connection timed out, reading 5 bytes
> from startd at <192.168.123.37:30389>.
> 02/23/17 07:16:05 (2143.0) (2719507): condor_read(): UNEXPECTED read
> timeout after 0s during non-blocking read from startd at
> <192.168.123.37:30389> (desired timeout=300s)
> 02/23/17 07:16:05 (2143.0) (2719507): IO: Failed to read packet header
> 02/23/17 07:16:05 (2143.0) (2719507): Can no longer talk to condor_starter
> <192.168.123.37:30389>
> 02/23/17 07:16:05 (2143.0) (2719507): This job cannot reconnect to starter,
> so job exiting
> 02/23/17 07:16:06 (1208.16) (46751): attempt to connect to
> <192.168.123.51:29246> failed: No route to host (connect errno = 113).
> 02/23/17 07:16:06 (1208.16) (46751): locateStarter(): Failed to connect to
> startd <192.168.123.51:29246?addrs=192.168.123.51-29246>
> 02/23/17 07:16:06 (1208.16) (46751): JobLeaseDuration remaining: 2398
> 02/23/17 07:16:06 (1208.16) (46751): Scheduling another attempt to
> reconnect in 8 seconds
> 02/23/17 07:16:07 (683.9) (2270376): condor_read() failed: recv(fd=4)
> returned -1, errno = 110 Connection timed out, reading 5 bytes from
> starter at <192.168.123.37:30325>.
> 02/23/17 07:16:07 (683.9) (2270376): condor_read(): UNEXPECTED read timeout
> after 0s during non-blocking read from starter at <192.168.123.37:30325>
> (desired timeout=300s)
> 02/23/17 07:16:07 (683.9) (2270376): IO: Failed to read packet header
> 02/23/17 07:16:07 (683.9) (2270376): Can no longer talk to condor_starter
> <192.168.123.37:30325>
> 02/23/17 07:16:07 (683.9) (2270376): JobLeaseDuration remaining: 2014
> 02/23/17 07:16:07 (683.9) (2270376): Attempting to locate disconnected starter
> 02/23/17 07:16:08 (2143.0) (2719507): attempt to connect to
> <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> 02/23/17 07:16:08 (683.9) (2270376): attempt to connect to
> <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> 02/23/17 07:16:08 (2143.0) (2719507): RemoteResource::killStarter(): Could
> not send command to startd
> 02/23/17 07:16:08 (683.9) (2270376): locateStarter(): Failed to connect to
> startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> 02/23/17 07:16:08 (683.9) (2270376): JobLeaseDuration remaining: 2399
> 02/23/17 07:16:08 (683.9) (2270376): Scheduling another attempt to
> reconnect in 8 seconds
> 02/23/17 07:16:11 (1209.3) (49060): Attempting to locate disconnected starter
> 02/23/17 07:16:11 (1209.3) (49060): Found starter:
> <192.168.123.37:38618?addrs=192.168.123.37-38618>
> 02/23/17 07:16:11 (1209.3) (49060): Attempting to reconnect to starter
> <192.168.123.37:38618?addrs=192.168.123.37-38618>
> 02/23/17 07:16:11 (2143.0) (2719507): ERROR "Can no longer talk to
> condor_starter <192.168.123.37:30389>" at line 209 in file
> /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> 02/23/17 07:16:14 (1208.16) (46751): Attempting to locate disconnected starter
> 02/23/17 07:16:14 (1208.16) (46751): Found starter:
> <192.168.123.51:49120?addrs=192.168.123.51-49120>
> 02/23/17 07:16:14 (1208.16) (46751): Attempting to reconnect to starter
> <192.168.123.51:49120?addrs=192.168.123.51-49120>
> 02/23/17 07:16:15 (1208.16) (46751): Reconnect SUCCESS: connection re-
> established
> 02/23/17 07:16:16 (683.9) (2270376): Attempting to locate disconnected starter
> 02/23/17 07:16:16 (683.9) (2270376): Found starter:
> <192.168.123.37:30325?addrs=192.168.123.37-30325>
> 02/23/17 07:16:16 (683.9) (2270376): Attempting to reconnect to starter
> <192.168.123.37:30325?addrs=192.168.123.37-30325>
> 02/23/17 07:16:25 (683.9) (2270376): Reconnect SUCCESS: connection re-
> established
> 02/23/17 07:16:41 (1209.3) (49060): condor_read(): timeout reading 5 bytes
> from starter at <192.168.123.37:38618>.
> 02/23/17 07:16:41 (1209.3) (49060): IO: Failed to read packet header
> 02/23/17 07:16:41 (1209.3) (49060): Attempt to reconnect failed: Failed to
> read reply ClassAd
> 02/23/17 07:16:41 (1209.3) (49060): JobLeaseDuration remaining: 2361
> 02/23/17 07:16:41 (1209.3) (49060): Scheduling another attempt to reconnect
> in 16 seconds
> 02/23/17 07:16:57 (1209.3) (49060): Attempting to locate disconnected starter
> 02/23/17 07:16:57 (1209.3) (49060): Found starter:
> <192.168.123.37:38618?addrs=192.168.123.37-38618>
> 02/23/17 07:16:57 (1209.3) (49060): Attempting to reconnect to starter
> <192.168.123.37:38618?addrs=192.168.123.37-38618>
> 02/23/17 07:16:57 (1209.3) (49060): Reconnect SUCCESS: connection re-
> established
> 02/23/17 07:43:17 (2102.0) (2559295): Job 2102.0 terminated: killed by
> signal 6
> 02/23/17 07:43:17 (2102.0) (2559295): **** condor_shadow (condor_SHADOW)
> pid 2559295 EXITING WITH STATUS 115
> 
> On Tuesday 21 February 2017 22:41:54 Harald van Pee wrote:
> > Hi Todd,
> > 
> > thank you for your help.
> > 
> > Concerning the "no route to host": I see no ethernet port going down on any
> > machine during that time, but maybe changing to
> > /etc/host.conf:
> > order hosts,bind
> > 
> > /etc/nsswitch.conf:
> > hosts:      files dns
> > 
> > instead of the Debian default will help anyway; /etc/hosts contains the IP
> > addresses of all nodes.
> > 
> > Regards
> > Harald
> > 
> > On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
> > > On 2/21/2017 1:33 PM, Harald van Pee wrote:
> > > > It seems that OpenMPI (or MPI in general) is not used very often with
> > > > HTCondor, the information is sparse, and I got some questions about how
> > > > I managed to get it running at all. I will share all I know about
> > > > this in a new thread soon, or is there a wiki where I should put the
> > > > information?
> > > 
> > > Off-list I put Harald in touch with the folks who can put Harald's info
> > > into the Manual or the HTCondor Wiki (from the web homepage, look for
> > > the links "HOWTO recipes" and "HTCondor Wiki").
> > > 
> > > Also we did some work for upcoming HTCondor v8.6.1 release so it works
> > > properly with the latest releases of OpenMPI - for details see
> > > 
> > >    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
> > > > 
> > > > Now back to our problem:
> > > > One hint that it is related to the network (ethernet or infiniband)
> > > > is that we had one job running for 11 days without problems while we
> > > > had fewer jobs running, and we got problems within a few days after we
> > > > started 200 more jobs.
> > > > I have now found 2 independent parallel MPI jobs which share one
> > > > machine with one job each, and there are no ethernet problems seen,
> > > > neither on the scheduler machine nor on the starter node. Unfortunately
> > > > there is no error output in the jobs' error files.
> > > > It is clear that condor kills the jobs, but to me it is unclear why,
> > > > because it seems both starter processes are still running, if I
> > > > understand the logfiles correctly.
> > > 
> > > At first blush, it looks to me like the condor_shadow on the submit
> > > node could no longer contact the execute node at IP address
> > > 192.168.123.37 due to "No route to host".  The "No route to host"
> > > error comes from the operating system, not from HTCondor - you can google
> > > this error and see lots of opinions/ideas on how to troubleshoot and
> > > fix, but basically there is no route for the execute node IP address
> > > in the client's routing table... not sure why this would happen all of
> > > a sudden, maybe some interface on your submit machine is being
> > > disabled or some switch port?
> > > 
> > > regards,
> > > Todd
> > > 
> > > > Maybe one of you can find a hint in the condor log below and give me
> > > > a hint as to what happens, or what I can do to find out.
> > > > 
> > > > Best
> > > > Harald
> > > > 
> > > > ShadowLog:
> > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed:
> > > > recv(fd=12) returned -1, errno = 110 Connection timed out, reading 5
> > > > bytes from startd at <192.168.123.37:30389>.
> > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed:
> > > > recv(fd=9) returned -1, errno = 110 Connection timed out, reading 5
> > > > bytes from startd at <192.168.123.37:30389>.
> > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read
> > > > timeout after 0s during non-blocking read from startd at
> > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read
> > > > timeout after 0s during non-blocking read from startd at
> > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet
> > > > header 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read
> > > > packet header 02/19/17 03:09:44 (1744.0) (1729179): Can no longer
> > > > talk to
> > > > condor_starter <192.168.123.37:30389>
> > > > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to
> > > > condor_starter <192.168.123.37:30389>
> > > > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to
> > > > starter, so job exiting
> > > > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to
> > > > starter, so job exiting
> > > > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to
> > > > <192.168.123.37:30389> failed: No route to host (connect errno =
> > > > 113). 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to
> > > > <192.168.123.37:30389> failed: No route to host (connect errno =
> > > > 113). 02/19/17 03:09:47 (1745.0) (1729180):
> > > > RemoteResource::killStarter(): Could not send command to startd
> > > > 02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter():
> > > > Could not send command to startd
> > > > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to
> > > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to
> > > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > 
> > > > 
> > > > StarterLog of job 1745.0 on node 192.168.123.37
> > > > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
> > > > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes
> > > > to <127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
> > > > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8)
> > > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > > from <192.168.123.100:25500>.
> > > > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
> > > > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting
> > > > 2400 secs for reconnect
> > > > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful
> > > > shutdown. 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
> > > > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
> > > > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status to
> > > > shadow 02/19/17 03:10:05 (pid:751398) Last process exited, now
> > > > Starter is exiting 02/19/17 03:10:05 (pid:751398) ****
> > > > condor_starter
> > > > (condor_STARTER) pid 751398 EXITING WITH STATUS 0
> > > > 
> > > > StarterLog of job 1744.0 on node 192.168.123.37
> > > > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
> > > > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes
> > > > to <127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
> > > > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8)
> > > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > > from <192.168.123.100:34337>.
> > > > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
> > > > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting
> > > > 2400 secs for reconnect
> > > > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful
> > > > shutdown. 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
> > > > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
> > > > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status to
> > > > shadow 02/19/17 03:10:03 (pid:751399) Last process exited, now
> > > > Starter is exiting 02/19/17 03:10:03 (pid:751399) ****
> > > > condor_starter
> > > > (condor_STARTER) pid 751399 EXITING WITH STATUS 0
> > > > 
> > > > StartLog:
> > > > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
> > > > 02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy
> > > > -> Preempting/Vacating
> > > > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
> > > > 02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy
> > > > -> Preempting/Vacating
> > > > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
> > > > 02/19/17 03:10:03 slot1_11: State change: starter exited
> > > > 02/19/17 03:10:03 slot1_11: State change: No preempting claim,
> > > > returning to owner
> > > > 02/19/17 03:10:03 slot1_11: Changing state and activity:
> > > > Preempting/Vacating -
> > > > 
> > > >> Owner/Idle
> > > > 
> > > > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
> > > > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
> > > > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
> > > > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
> > > > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
> > > > 02/19/17 03:10:05 slot1_13: State change: starter exited
> > > > 02/19/17 03:10:05 slot1_13: State change: No preempting claim,
> > > > returning to owner
> > > > 02/19/17 03:10:05 slot1_13: Changing state and activity:
> > > > Preempting/Vacating -
> > > > 
> > > >> Owner/Idle
> > > > 
> > > > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
> > > > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
> > > > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
> > > > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
> > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > > (<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM);
> > > > perhaps this claim was removed already.
> > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write
> > > > 13 bytes to <192.168.123.100:20962>, fd is 8
> > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > > (<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM);
> > > > perhaps this claim was removed already.
> > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write
> > > > 13 bytes to <192.168.123.100:34792>, fd is 8
> > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > > 
> > > > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> > > >> Dear experts,
> > > >> 
> > > >> I have some questions for debugging:
> > > >> Can I avoid the restart of a job in the vanilla and/or parallel universe
> > > >> if I use Requirements = (NumJobStarts==0)
> > > >> in the condor submit description file?
> > > >> If it works, will the job stay idle or will it be removed?
> > > >> 
> > > >> I found a job in the vanilla universe started at 12/9 and restarted
> > > >> shortly before Christmas and is still running. I assume the reason was
> > > >> also network problems, but unfortunately our last condor and system
> > > >> log files are from January.
> > > >> Is there any possibility to make condor a little bit more robust
> > > >> against network problems via configuration? Just wait a little bit
> > > >> longer or make more reconnection tries?
> > > >> 
> > > >> We are working on automatic restarts of the MPI jobs and trying to use
> > > >> more frequent checkpoints, but it seems to be a lot of work, therefore
> > > >> any idea would be welcome.
> > > >> 
> > > >> Best
> > > >> Harald
> > > >> 
> > > >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> > > >>> There is one important argument why I think the problem is condor
> > > >>> related, not mpi (of course I can be wrong):
> > > >>> The condor communication goes via ethernet, and the ethernet
> > > >>> connection has a problem for several minutes.
> > > >>> The mpi communication goes via infiniband, and there is no
> > > >>> infiniband problem during this time.
> > > >>> 
> > > >>> Harald
> > > >>> 
> > > >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > > >>>> Hi Greg,
> > > >>>> 
> > > >>>> thanks for your answer.
> > > >>>> 
> > > >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > > >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > >>>>>> Hello,
> > > >>>>>> 
> > > >>>>>> we got MPI running in the parallel universe with HTCondor 8.4 using
> > > >>>>>> openmpiscript and it is working in general without any problem.
> > > >>>>> 
> > > >>>>> In general, the MPI jobs themselves cannot survive a network
> > > >>>>> outage or partition, even a temporary one.  HTCondor will
> > > >>>>> reconnect the shadow to the starters, if the problem is just
> > > >>>>> between the submit machine and the execute machines, but if the
> > > >>>>> network problem also impacts node-to-node communication, then
> > > >>>>> the job has to be aborted and restarted from scratch because of
> > > >>>>> the way MPI works.
> > > >>>> 
> > > >>>> The problem seems to be between the submit machine and one running
> > > >>>> node (not the node where mpirun was started).
> > > >>>> If you are right, it should be possible to find an error from
> > > >>>> mpirun because it lost one node, right?
> > > >>>> But it seems condor kills the job because of a shadow exception.
> > > >>>> Unfortunately we do not see the output of the stopped job because
> > > >>>> it is overwritten by the newly started one.
> > > >>>> Any suggestion on how to find out whether it is really an MPI related
> > > >>>> problem?
> > > >>>> 
> > > >>>>> If possible, we would recommend that long-running jobs that
> > > >>>>> suffer from this problem try to self-checkpoint themselves, so
> > > >>>>> that when they are restarted, they don't need to be restarted
> > > >>>>> from scratch.
> > > >>>>> 
> > > >>>>> -greg
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/