
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



If it works for your condor pool, you can have condor communicate over your infiniband network by setting NETWORK_INTERFACE appropriately. The daemons should continue to listen on all interfaces as long as BIND_ALL_INTERFACES is set to true.
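
For example, something along these lines in the condor configuration on each machine (only a sketch; the address pattern below is a placeholder for your infiniband subnet):

    # prefer the infiniband address for condor's daemon-to-daemon traffic
    NETWORK_INTERFACE = 192.168.200.*
    # keep listening on all interfaces so ethernet-only clients can still connect
    BIND_ALL_INTERFACES = True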

Assuming that both your vanilla and parallel jobs were disconnected for the same amount of time, it does sound like you've found a bug in the reconnection of parallel jobs. We will see if we can reproduce the behavior here.

Jason Patton

On Tue, Mar 21, 2017 at 12:11 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
Hello all,

now I have found the reason for the problems:
Our cisco sg200-50 switch reboots every couple of days, sometimes even every
couple of hours. I will try a firmware update or replace it, but in the 100-day
cluster that is not so easy.

From the condor side, the difference seems to be that for a parallel universe job we get:
03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter, so
job exiting
whereas for a vanilla universe job we get:
03/20/17 07:54:13 (3854.0) (1465927): Trying to reconnect to disconnected job

Is this just a feature to avoid problems with mpi, or could one configure
condor to attempt the same reconnect as in the vanilla universe, given that
mpi runs over infiniband?

Or do you suggest running the whole condor communication over infiniband?
Or should/can I add a second network?

Nevertheless, if we can solve the problem of the programs that are not
terminated, we can go into production with openmpi soon.

Best regards
Harald



On Monday 20 March 2017 18:11:47 Harald van Pee wrote:
> Hello,
>
> here my status update:
>
> - We are now sure that this problem was never seen in the vanilla
> universe; the only job in question was restarted because of a broken node.
>
> - the new options in openmpiscript from htcondor 8.6.1 make no difference.
>   In particular, excluding the ethernet interfaces does not help, as
>   Jason Patton assumed based on the openmpi documentation (see the mpirun
>   sketch at the end of this list).
>
> - It is clearly caused by minor ethernet problems because:
>   a) We see connection problems to several nodes at the same time, but all
>   connections except some from the parallel universe are reestablished.
>   b) We have identified 3 nodes which cause more problems than others;
>   if we exclude these nodes we manage to run several mpi jobs with 40
>   mpi nodes for longer than 16 days without restarts (the jobs were removed by
>   the user or finished).
>   But there is no reason to assume that these nodes have any severe problem,
>   because we see no mpi errors even with high verbosity, and on the problem
>   nodes there are vanilla jobs running for up to 101 days now.
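> (Regarding the excluded ethernet interfaces mentioned above, what we tried
> looks roughly like this; only an untested sketch, the exact options used by
> openmpiscript may differ and the interface names are placeholders:
>
>   mpirun --mca btl_tcp_if_exclude lo,eth0 --mca oob_tcp_if_exclude lo,eth0 ...
>
> )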
>
> - here is what happened today:
> Summary: Condor assumes that a starter on node 3 has a problem and sends a
> kill command. Even though condor reports that the kill command could not be
> sent, it reaches node 2, as it should, because that is the first node,
> where mpirun is running.
> It also sends SIGQUIT before SIGTERM, which I had not expected; maybe
> this is the reason why our trap handler does not work, because it expects
> only SIGTERM (see the sketch below)? In this case some mpi programs are
> still running after the job was removed from condor.
> Then, later, the starters of job 3939 on node 3 also get a kill signal and
> handle it. But this is proof that these starters were still alive and
> there was no reason to kill them, right?
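> (A sketch of the trap handler I mean, assuming our wrapper is a bash script;
> "cleanup_mpi" is just a placeholder for our own cleanup function. What we
> have now is roughly
>
>   trap 'cleanup_mpi; exit 1' TERM
>
> and it probably needs to be
>
>   trap 'cleanup_mpi; exit 1' QUIT TERM
>
> to also catch the fast-shutdown case.)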
>
> Even if there are workarounds, I think this behaviour could and should be
> improved.
> Or can we expect that something has changed with htcondor 8.6.1?
>
> Best
> Harald
>
> Here are the most relevant log messages:
>
> ShadowLog:
> 03/20/17 07:54:11 (3939.0) (1617771): condor_read() failed: recv(fd=10)
> returned -1, errno = 110 Connection timed out, reading 5 bytes from startd
> at <192.16
> 8.123.3:29143>.
> 03/20/17 07:54:11 (3939.0) (1617771): condor_read(): UNEXPECTED read
> timeout after 0s during non-blocking read from startd at
> <192.168.123.3:29143> (desired
> timeout=300s)
> 03/20/17 07:54:11 (3939.0) (1617771): IO: Failed to read packet header
> 03/20/17 07:54:11 (3939.0) (1617771): Can no longer talk to condor_starter
> <192.168.123.3:29143>
> 03/20/17 07:54:11 (3939.0) (1617771): This job cannot reconnect to starter,
> so job exiting
> 03/20/17 07:54:12 (3939.0) (1617771): attempt to connect to
> <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> 03/20/17 07:54:12 (3939.0) (1617771): RemoteResource::killStarter(): Could
> not send command to startd
> 03/20/17 07:54:15 (3939.0) (1617771): attempt to connect to
> <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> 03/20/17 07:54:15 (3939.0) (1617771): RemoteResource::killStarter(): Could
> not send command to startd
> 03/20/17 07:54:21 (3939.0) (1617771): attempt to connect to
> <192.168.123.3:29143> failed: No route to host (connect errno = 113).
> 03/20/17 07:54:21 (3939.0) (1617771): RemoteResource::killStarter(): Could
> not send command to startd
> 03/20/17 07:54:24 (3939.0) (1617771): ERROR "Can no longer talk to
> condor_starter <192.168.123.3:29143>" at line 209 in file
> /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>
> StarterLog.slot1_2 on node 2 (mpirun of job 3939.0)
> 03/20/17 07:54:11 (pid:1066066) Got SIGQUIT. Performing fast shutdown.
> 03/20/17 07:54:11 (pid:1066066) ShutdownFast all jobs.
> 03/20/17 07:54:11 (pid:1066066) Got SIGTERM. Performing graceful shutdown.
> 03/20/17 07:54:11 (pid:1066066) ShutdownGraceful all jobs.
> 03/20/17 07:54:11 (pid:1066066) Process exited, pid=1066068, status=0
> 03/20/17 07:54:24 (pid:1066066) condor_read() failed: recv(fd=8) returned
> -1, errno = 104 Connection reset by peer, reading 5 bytes from
> <192.168.123.100:18658>.
> 03/20/17 07:54:24 (pid:1066066) IO: Failed to read packet header
> 03/20/17 07:54:24 (pid:1066066) Lost connection to shadow, waiting 2400
> secs for reconnect
> 03/20/17 07:54:24 (pid:1066066) Failed to send job exit status to shadow
> 03/20/17 07:54:24 (pid:1066066) Last process exited, now Starter is exiting
> 03/20/17 07:54:24 (pid:1066066) **** condor_starter (condor_STARTER) pid
> 1066066 EXITING WITH STATUS 0
>
>
> StarterLog.slot1_2 on node 3, running job 3939.0
> 03/20/17 07:54:33 (pid:1056820) condor_read() failed: recv(fd=8) returned
> -1, errno = 104 Connection reset by peer, reading 5 bytes from
> <192.168.123.100:24154>.
> 03/20/17 07:54:33 (pid:1056820) IO: Failed to read packet header
> 03/20/17 07:54:33 (pid:1056820) Lost connection to shadow, waiting 2400
> secs for reconnect
> 03/20/17 07:54:33 (pid:1056820) Got SIGTERM. Performing graceful shutdown.
> 03/20/17 07:54:33 (pid:1056820) ShutdownGraceful all jobs.
> 03/20/17 07:54:33 (pid:1056820) Process exited, pid=1056824, status=0
> 03/20/17 07:54:33 (pid:1056820) Failed to send job exit status to shadow
> 03/20/17 07:54:33 (pid:1056820) Last process exited, now Starter is exiting
> 03/20/17 07:54:33 (pid:1056820) **** condor_starter (condor_STARTER) pid
> 1056820 EXITING WITH STATUS 0
>
> StarterLog.slot1_3 on node 3, running job 3939.0
> 03/20/17 07:54:46 (pid:1056821) condor_read() failed: recv(fd=8) returned
> -1, errno = 104 Connection reset by peer, reading 5 bytes from
> <192.168.123.100:3768>.
> 03/20/17 07:54:46 (pid:1056821) IO: Failed to read packet header
> 03/20/17 07:54:46 (pid:1056821) Lost connection to shadow, waiting 2400
> secs for reconnect
> 03/20/17 07:54:46 (pid:1056821) Got SIGTERM. Performing graceful shutdown.
> 03/20/17 07:54:46 (pid:1056821) ShutdownGraceful all jobs.
> 03/20/17 07:54:46 (pid:1056821) Process exited, pid=1056823, status=0
> 03/20/17 07:54:46 (pid:1056821) Failed to send job exit status to shadow
> 03/20/17 07:54:46 (pid:1056821) Last process exited, now Starter is exiting
> 03/20/17 07:54:46 (pid:1056821) **** condor_starter (condor_STARTER) pid
> 1056821 EXITING WITH STATUS 0
>
> On Thursday 23 February 2017 15:12:38 Harald van Pee wrote:
> > Hello,
> >
> > it happened again. What we have learned so far is:
> > - a communication problem occurs between the scheduler node and a starter node
> > - condor kills the starter process and afterwards kills the job
> > - several different nodes are affected; due to lack of statistics we can
> > neither claim that all nodes are affected nor exclude that some have more
> > problems than others.
> > - it is very unlikely that the program itself has a problem, because we have
> > seen that 2 starter processes of 2 independent parallel jobs were killed
> > on the same node at the same time.
> > - at least within the last 2 months only parallel jobs have been affected, but
> > there is no hint of an mpi problem; any hint on how one can prove that no
> > mpi problem exists is welcome.
> > We have many more vanilla starters running than parallel ones.
> >
> > This morning the same node as last week was affected. On this node
> > 9 single vanilla starters are running, 2 of them for more than 47
> > days, plus 5 starters of 2 parallel jobs, and only one starter of
> > one parallel job was killed.
> > From the ShadowLog below, one can see that several starters from several
> > jobs, and not only from node 37, have communication problems, and the time
> > during which the problem occurs is less than one minute. Therefore I would
> > expect no problem reconnecting to the starters, and
> > this is indeed the case for all vanilla jobs.
> > But why were the parallel starters killed so fast?
> >
> > Any idea is welcome
> > Harald
> >
> > ShadowLog (beginning 2 lines before and ending 2 lines after the minute of
> > the problem):
> > 02/23/17 07:10:59 (1835.0) (2412453): Job 1835.0 terminated: exited with
> > status 0
> > 02/23/17 07:10:59 (1835.0) (2412453): **** condor_shadow (condor_SHADOW)
> > pid 2412453 EXITING WITH STATUS 115
> > 02/23/17 07:16:02 (1209.3) (49060): condor_read() failed: recv(fd=4)
> > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > startd slot1@xxxxxxxxxxxxxxxxxxxxxxxx
> > 02/23/17 07:16:02 (1209.3) (49060): condor_read(): UNEXPECTED read
> > timeout after 0s during non-blocking read from startd
> > slot1@xxxxxxxxxxxxxxxxxxxxxxx (desired timeout=300s)
> > 02/23/17 07:16:02 (1209.3) (49060): IO: Failed to read packet header
> > 02/23/17 07:16:02 (1209.3) (49060): Can no longer talk to condor_starter
> > <192.168.123.37:30389>
> > 02/23/17 07:16:02 (1209.3) (49060): Trying to reconnect to disconnected job
> > 02/23/17 07:16:02 (1209.3) (49060): LastJobLeaseRenewal: 1487830176
> > Thu Feb 23 07:09:36 2017
> > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration: 2400 seconds
> > 02/23/17 07:16:02 (1209.3) (49060): JobLeaseDuration remaining: 2014
> > 02/23/17 07:16:02 (1209.3) (49060): Attempting to locate disconnected starter
> > 02/23/17 07:16:03 (1209.3) (49060): attempt to connect to
> > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > 02/23/17 07:16:03 (1209.3) (49060): locateStarter(): Failed to connect to
> > startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> > 02/23/17 07:16:03 (1209.3) (49060): JobLeaseDuration remaining: 2399
> > 02/23/17 07:16:03 (1209.3) (49060): Scheduling another attempt to
> > reconnect in 8 seconds
> > 02/23/17 07:16:04 (1208.16) (46751): condor_read() failed: recv(fd=4)
> > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > starter at <192.168.123.51:49120>.
> > 02/23/17 07:16:04 (1208.16) (46751): condor_read(): UNEXPECTED read
> > timeout after 0s during non-blocking read from starter at
> > <192.168.123.51:49120> (desired timeout=300s)
> > 02/23/17 07:16:04 (1208.16) (46751): IO: Failed to read packet header
> > 02/23/17 07:16:04 (1208.16) (46751): Can no longer talk to condor_starter
> > <192.168.123.51:49120>
> > 02/23/17 07:16:04 (1208.16) (46751): JobLeaseDuration remaining: 2014
> > 02/23/17 07:16:04 (1208.16) (46751): Attempting to locate disconnected starter
> > 02/23/17 07:16:05 (2143.0) (2719507): condor_read() failed: recv(fd=25)
> > returned -1, errno = 110 Connection timed out, reading 5
> > bytes from startd at <192.168.123.37:30389>.
> > 02/23/17 07:16:05 (2143.0) (2719507): condor_read(): UNEXPECTED read
> > timeout after 0s during non-blocking read from startd at
> > <192.168.123.37:30389> (desired timeout=300s)
> > 02/23/17 07:16:05 (2143.0) (2719507): IO: Failed to read packet header
> > 02/23/17 07:16:05 (2143.0) (2719507): Can no longer talk to
> > condor_starter <192.168.123.37:30389>
> > 02/23/17 07:16:05 (2143.0) (2719507): This job cannot reconnect to
> > starter, so job exiting
> > 02/23/17 07:16:06 (1208.16) (46751): attempt to connect to
> > <192.168.123.51:29246> failed: No route to host (connect errno = 113).
> > 02/23/17 07:16:06 (1208.16) (46751): locateStarter(): Failed to connect
> > to startd <192.168.123.51:29246?addrs=192.168.123.51-29246>
> > 02/23/17 07:16:06 (1208.16) (46751): JobLeaseDuration remaining: 2398
> > 02/23/17 07:16:06 (1208.16) (46751): Scheduling another attempt to
> > reconnect in 8 seconds
> > 02/23/17 07:16:07 (683.9) (2270376): condor_read() failed: recv(fd=4)
> > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > starter at <192.168.123.37:30325>.
> > 02/23/17 07:16:07 (683.9) (2270376): condor_read(): UNEXPECTED read
> > timeout after 0s during non-blocking read from starter at
> > <192.168.123.37:30325> (desired timeout=300s)
> > 02/23/17 07:16:07 (683.9) (2270376): IO: Failed to read packet header
> > 02/23/17 07:16:07 (683.9) (2270376): Can no longer talk to condor_starter
> > <192.168.123.37:30325>
> > 02/23/17 07:16:07 (683.9) (2270376): JobLeaseDuration remaining: 2014
> > 02/23/17 07:16:07 (683.9) (2270376): Attempting to locate disconnected starter
> > 02/23/17 07:16:08 (2143.0) (2719507): attempt to connect to
> > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > 02/23/17 07:16:08 (683.9) (2270376): attempt to connect to
> > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > 02/23/17 07:16:08 (2143.0) (2719507): RemoteResource::killStarter():
> > Could not send command to startd
> > 02/23/17 07:16:08 (683.9) (2270376): locateStarter(): Failed to connect
> > to startd <192.168.123.37:30389?addrs=192.168.123.37-30389>
> > 02/23/17 07:16:08 (683.9) (2270376): JobLeaseDuration remaining: 2399
> > 02/23/17 07:16:08 (683.9) (2270376): Scheduling another attempt to
> > reconnect in 8 seconds
> > 02/23/17 07:16:11 (1209.3) (49060): Attempting to locate disconnected starter
> > 02/23/17 07:16:11 (1209.3) (49060): Found starter:
> > <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > 02/23/17 07:16:11 (1209.3) (49060): Attempting to reconnect to starter
> > <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > 02/23/17 07:16:11 (2143.0) (2719507): ERROR "Can no longer talk to
> > condor_starter <192.168.123.37:30389>" at line 209 in file
> > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > 02/23/17 07:16:14 (1208.16) (46751): Attempting to locate disconnected starter
> > 02/23/17 07:16:14 (1208.16) (46751): Found starter:
> > <192.168.123.51:49120?addrs=192.168.123.51-49120>
> > 02/23/17 07:16:14 (1208.16) (46751): Attempting to reconnect to starter
> > <192.168.123.51:49120?addrs=192.168.123.51-49120>
> > 02/23/17 07:16:15 (1208.16) (46751): Reconnect SUCCESS: connection re-
> > established
> > 02/23/17 07:16:16 (683.9) (2270376): Attempting to locate disconnected starter
> > 02/23/17 07:16:16 (683.9) (2270376): Found starter:
> > <192.168.123.37:30325?addrs=192.168.123.37-30325>
> > 02/23/17 07:16:16 (683.9) (2270376): Attempting to reconnect to starter
> > <192.168.123.37:30325?addrs=192.168.123.37-30325>
> > 02/23/17 07:16:25 (683.9) (2270376): Reconnect SUCCESS: connection re-
> > established
> > 02/23/17 07:16:41 (1209.3) (49060): condor_read(): timeout reading 5
> > bytes from starter at <192.168.123.37:38618>.
> > 02/23/17 07:16:41 (1209.3) (49060): IO: Failed to read packet header
> > 02/23/17 07:16:41 (1209.3) (49060): Attempt to reconnect failed: Failed
> > to read reply ClassAd
> > 02/23/17 07:16:41 (1209.3) (49060): JobLeaseDuration remaining: 2361
> > 02/23/17 07:16:41 (1209.3) (49060): Scheduling another attempt to
> > reconnect in 16 seconds
> > 02/23/17 07:16:57 (1209.3) (49060): Attempting to locate disconnected starter
> > 02/23/17 07:16:57 (1209.3) (49060): Found starter:
> > <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > 02/23/17 07:16:57 (1209.3) (49060): Attempting to reconnect to starter
> > <192.168.123.37:38618?addrs=192.168.123.37-38618>
> > 02/23/17 07:16:57 (1209.3) (49060): Reconnect SUCCESS: connection re-
> > established
> > 02/23/17 07:43:17 (2102.0) (2559295): Job 2102.0 terminated: killed by
> > signal 6
> > 02/23/17 07:43:17 (2102.0) (2559295): **** condor_shadow (condor_SHADOW)
> > pid 2559295 EXITING WITH STATUS 115
> >
> > On Tuesday 21 February 2017 22:41:54 Harald van Pee wrote:
> > > Hi Todd,
> > >
> > > thank you for your help.
> > >
> > > Concerning the "no route to host": I see no ethernet port going down on
> > > any machine during that time, but maybe using
> > > /etc/host.conf:
> > > order hosts,bind
> > >
> > > /etc/nsswitch.conf:
> > > hosts:      files dns
> > >
> > > instead of the debian defaults will help anyway; /etc/hosts has the ip
> > > addresses of all nodes in it.
> > >
> > > Regards
> > > Harald
> > >
> > > On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
> > > > On 2/21/2017 1:33 PM, Harald van Pee wrote:
> > > > > It seems that openmpi (or mpi in general) is not used very often with
> > > > > htcondor, the information is sparse, and I have been asked how
> > > > > I managed to get it running at all. I will share all I
> > > > > know about this in a new thread soon, or is there a wiki where I
> > > > > should put the information?
> > > >
> > > > Off-list I put Harald in touch with the folks who can put Harald's
> > > > info into the Manual or the HTCondor Wiki (from the web homepage,
> > > > look for the links "HOWTO recipes" and "HTCondor Wiki").
> > > >
> > > > Also we did some work for the upcoming HTCondor v8.6.1 release so it
> > > > works properly with the latest releases of OpenMPI - for details see
> > > >
> > > >   https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
> > > > >
> > > > > Now back to our problem:
> > > > > One hint that it is related to the network (ethernet or
> > > > > infiniband) is that we had one job running for 11 days without
> > > > > problems while we had fewer jobs running, and we got problems within
> > > > > a few days after we started 200 more jobs.
> > > > > I have now found 2 independent parallel mpi jobs which share one
> > > > > machine, with one job each, and there are no ethernet problems seen,
> > > > > neither on the scheduler machine nor on the starter node. Unfortunately
> > > > > there is no error output in the jobs' error files.
> > > > > It is clear that condor kills the jobs, but to me it is unclear why,
> > > > > because it seems both starter processes are still running, if I
> > > > > understand the logfiles correctly.
> > > >
> > > > At first blush, it looks to me like the condor_shadow on the submit
> > > > node could no longer contact the execute node at IP address
> > > > 192.168.123.37 due to "No route to host". The "No route to host"
> > > > error comes from the operating system, not from HTCondor - you can
> > > > google this error and see lots of opinions/ideas on how to troubleshoot
> > > > and fix it, but basically there is no route for the execute node IP
> > > > address in the client's routing table... not sure why this would happen
> > > > all of a sudden; maybe some interface on your submit machine, or some
> > > > switch port, is being disabled?
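> > > >
> > > > For example, next time it happens you could run something like
> > > >
> > > >   ip route get 192.168.123.37
> > > >   ip -s link
> > > >
> > > > on the submit machine (just a quick sketch; 192.168.123.37 is the
> > > > execute node from your logs) to see whether the kernel still has a
> > > > route to that node and whether an interface went down or is dropping
> > > > packets.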
> > > >
> > > > regards,
> > > > Todd
> > > >
> > > > > Maybe one of you can find something in the condor logs below and give
> > > > > me a hint about what happens, or about what I can do to find out.
> > > > >
> > > > > Best
> > > > > Harald
> > > > >
> > > > > ShadowLog:
> > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed:
> > > > > recv(fd=12) returned -1, errno = 110 Connection timed out, reading
> > > > > 5 bytes from startd at <192.168.123.37:30389>.
> > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed:
> > > > > recv(fd=9) returned -1, errno = 110 Connection timed out, reading 5
> > > > > bytes from startd at <192.168.123.37:30389>.
> > > > > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED
> > > > > read timeout after 0s during non-blocking read from startd at
> > > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED
> > > > > read timeout after 0s during non-blocking read from startd at
> > > > > <192.168.123.37:30389> (desired timeout=300s)
> > > > > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
> > > > > 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
> > > > > 02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to
> > > > > condor_starter <192.168.123.37:30389>
> > > > > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to
> > > > > condor_starter <192.168.123.37:30389>
> > > > > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to
> > > > > starter, so job exiting
> > > > > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to
> > > > > starter, so job exiting
> > > > > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to
> > > > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > > > 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to
> > > > > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > > > > 02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter():
> > > > > Could not send command to startd
> > > > > 02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter():
> > > > > Could not send command to startd
> > > > > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to
> > > > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to
> > > > > condor_starter <192.168.123.37:30389>" at line 209 in file
> > > > > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > > > >
> > > > >
> > > > > StarterLog of job 1745.0 on node 192.168.123.37
> > > > > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
> > > > > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes
> > > > > to <127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
> > > > > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8)
> > > > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > > > from <192.168.123.100:25500>.
> > > > > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
> > > > > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting
> > > > > 2400 secs for reconnect
> > > > > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful
> > > > > shutdown.
> > > > > 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
> > > > > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
> > > > > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
> > > > > 02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is
> > > > > exiting
> > > > > 02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER)
> > > > > pid 751398 EXITING WITH STATUS 0
> > > > >
> > > > > StarterLog of job 1744.0 on node 192.168.123.37
> > > > > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
> > > > > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes
> > > > > to <127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
> > > > > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8)
> > > > > returned -1, errno = 104 Connection reset by peer, reading 5 bytes
> > > > > from <192.168.123.100:34337>.
> > > > > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
> > > > > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting
> > > > > 2400 secs for reconnect
> > > > > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful
> > > > > shutdown.
> > > > > 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
> > > > > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
> > > > > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
> > > > > 02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is
> > > > > exiting
> > > > > 02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER)
> > > > > pid 751399 EXITING WITH STATUS 0
> > > > >
> > > > > StartLog:
> > > > > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
> > > > > 02/19/17 03:09:48 slot1_11: Changing state and activity:
> > > > > Claimed/Busy -> Preempting/Vacating
> > > > > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
> > > > > 02/19/17 03:09:48 slot1_13: Changing state and activity:
> > > > > Claimed/Busy -> Preempting/Vacating
> > > > > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
> > > > > 02/19/17 03:10:03 slot1_11: State change: starter exited
> > > > > 02/19/17 03:10:03 slot1_11: State change: No preempting claim,
> > > > > returning to owner
> > > > > 02/19/17 03:10:03 slot1_11: Changing state and activity:
> > > > > Preempting/Vacating -> Owner/Idle
> > > > > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
> > > > > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
> > > > > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
> > > > > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
> > > > > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
> > > > > 02/19/17 03:10:05 slot1_13: State change: starter exited
> > > > > 02/19/17 03:10:05 slot1_13: State change: No preempting claim,
> > > > > returning to owner
> > > > > 02/19/17 03:10:05 slot1_13: Changing state and activity:
> > > > > Preempting/Vacating -> Owner/Idle
> > > > > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
> > > > > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
> > > > > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
> > > > > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
> > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > > > (<192.168.123.37:30389>#1481221329#1484#...) for 443
> > > > > (RELEASE_CLAIM); perhaps this claim was removed already.
> > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to
> > > > > write 13 bytes to <192.168.123.100:20962>, fd is 8
> > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > > > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > > > > (<192.168.123.37:30389>#1481221329#1487#...) for 443
> > > > > (RELEASE_CLAIM); perhaps this claim was removed already.
> > > > > 02/19/17 03:19:48 condor_write(): Socket closed when trying to
> > > > > write 13 bytes to <192.168.123.100:34792>, fd is 8
> > > > > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > > > >
> > > > > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> > > > >> Dear experts,
> > > > >>
> > > > >> I have some questions for debugging:
> > > > >> Can I avoid a job being restarted in the vanilla and/or parallel
> > > > >> universe if I use Requirements = (NumJobStarts==0)
> > > > >> in the condor submit description file?
> > > > >> If it works, will the job stay idle or will it be removed?
> > > > >>
> > > > >> I found a job in the vanilla universe that was started on 12/9,
> > > > >> restarted shortly before Christmas, and is still running. I assume
> > > > >> the reason was also network problems, but unfortunately our
> > > > >> oldest remaining condor and system log files are from January.
> > > > >> Is there any possibility to make condor a little bit more robust
> > > > >> against network problems via configuration? Just wait a little bit
> > > > >> longer or make more reconnection attempts?
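> > > > >> For example, is something like this in the submit description file
> > > > >> the right knob (only a sketch of what I have in mind, the value is
> > > > >> arbitrary):
> > > > >>
> > > > >>   job_lease_duration = 4800
> > > > >>
> > > > >> or is there a better way to get longer/more reconnection attempts?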
> > > > >>
> > > > >> We are working on automatic restart of the mpi jobs and are trying
> > > > >> to use more frequent checkpoints, but it seems like a lot of work,
> > > > >> so any idea would be welcome.
> > > > >>
> > > > >> Best
> > > > >> Harald
> > > > >>
> > > > >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> > > > >>> There is one important argument for why I think the problem is
> > > > >>> condor related and not mpi related (of course I could be wrong):
> > > > >>> The condor communication goes via ethernet, and the ethernet
> > > > >>> connection has a problem for several minutes.
> > > > >>> The mpi communication goes via infiniband, and there is no
> > > > >>> infiniband problem during this time.
> > > > >>>
> > > > >>> Harald
> > > > >>>
> > > > >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > > > >>>> Hi Greg,
> > > > >>>>
> > > > >>>> thanks for your answer.
> > > > >>>>
> > > > >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > > > >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > > >>>>>> Hello,
> > > > >>>>>>
> > > > >>>>>> we got mpi running in the parallel universe with htcondor 8.4
> > > > >>>>>> using openmpiscript, and in general it is working without any
> > > > >>>>>> problem.
> > > > >>>>>
> > > > >>>>> In general, the MPI jobs themselves cannot survive a network
> > > > >>>>> outage or partition, even a temporary one. HTCondor will
> > > > >>>>> reconnect the shadow to the starters, if the problem is just
> > > > >>>>> between the submit machine and the execute machines, but if the
> > > > >>>>> network problem also impacts node-to-node communication, then
> > > > >>>>> the job has to be aborted and restarted from scratch because of
> > > > >>>>> the way MPI works.
> > > > >>>>
> > > > >>>> The problem seems to be between the submit machine and one running
> > > > >>>> node (not the node where mpirun was started).
> > > > >>>> If you are right, it should be possible to find an error
> > > > >>>> from mpirun because it lost one node, right?
> > > > >>>> But it seems condor kills the job because of a shadow exception.
> > > > >>>> Unfortunately we do not see the output of the stopped job
> > > > >>>> because it is overwritten by the newly started one.
> > > > >>>> Any suggestion on how to find out if it is really an mpi related
> > > > >>>> problem?
> > > > >>>>
> > > > >>>>> If possible, we would recommend that long-running jobs that
> > > > >>>>> suffer from this problem try to self-checkpoint themselves, so
> > > > >>>>> that when they are restarted, they don't need to be restarted
> > > > >>>>> from scratch.
> > > > >>>>>
> > > > >>>>> -greg
> >
>

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/