
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Hello Jason,

On Friday 31 March 2017 19:19:34 Jason Patton wrote:
> Harald,
> 
> As far as the mpi processes still running on the other nodes when the job
> is killed, I think I might have an idea. Do you still see non-system sshds
> running on the execute nodes? 

indeed, I never checked for sshd before. The ugliest case is when the program is 
still running but the claim has already become free; in that case openmpiscript 
seems no longer to be running.

If openmpiscript is still running, the claim is not released. This happened once 
with the
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
policy as we tried to remove the job with condor_rm.
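
For anyone trying to reproduce this, a minimal parallel universe submit file with 
that knob might look like the sketch below (executable, arguments and 
machine_count are placeholders, not our real values):

universe = parallel
executable = openmpiscript
arguments = my_mpi_program
machine_count = 32
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
queue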

But I have now checked the other way around: if the programs are killed and no 
openmpiscript is running, then no user sshd was left over.
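
(For the record, this is roughly the check I ran directly on an execute node; 
the user name is a placeholder:)

# list sshd processes owned by the job user, i.e. not the system sshd
pgrep -l -u some_user -x sshd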

It could also be that the problem only occurs if we use at least a certain 
number of MPI nodes (>30).

> We may want to add another handler to make
> sure they get SIGTERM when sshd_cleanup is called.

Sounds good.
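
Something along these lines in openmpiscript might already do it (only a sketch; 
I assume sshd_cleanup can be called from a shell function there, and 
MPI_EXECUTABLE is just a placeholder for a pattern matching our real binary):

# extra handler: also TERM any of our leftover MPI ranks on this node
cleanup() {
    pkill -TERM -u "$(whoami)" -f "$MPI_EXECUTABLE" 2>/dev/null
    sshd_cleanup
}
trap cleanup TERM INT EXIT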

Harald

> 
> Jason
> 
> On Fri, Mar 31, 2017 at 7:06 AM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
> 
> wrote:
> > Hello,
> > 
> > we have now tried the different shutdown policy. What we found is that
> > there are now reconnection attempts, but they all end with an error
> > message, so this does not help.
> > What we see is that mpirun, the running program, and openmpiscript are
> > removed on node0, and openmpiscript was also removed on all other nodes,
> > but all MPI programs are still running on all other nodes. Therefore
> > mpirun most likely has no chance to kill those processes (I do not
> > understand why), and condor does not do it either.
> > 
> > Here are the error messages from the ShadowLog; I skip similar messages
> > from the other nodes:
> > 03/31/17 08:48:10 (5037.0) (3148721): condor_read() failed: recv(fd=16) returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at <192.168.123.13:19188>.
> > 03/31/17 08:48:10 (5037.0) (3148721): condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd at <192.168.123.13:19188> (desired timeout=300s)
> > 03/31/17 08:48:10 (5037.0) (3148721): IO: Failed to read packet header
> > 03/31/17 08:48:10 (5037.0) (3148721): Can no longer talk to condor_starter <192.168.123.13:19188>
> > 03/31/17 08:48:11 (5037.0) (3148721): SECMAN: failed to create session <192.168.123.2:24345>#1481207467#1730 (key already exists).
> > 03/31/17 08:48:11 (5037.0) (3148721): SECMAN: existing session <192.168.123.2:24345>#1481207467#1730:
> > 03/31/17 08:48:11 (5037.0) (3148721): SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION: failed to create security session for <192.168.123.2:24345>#1481207467#1730#..., so will fall back on security negotiation
> > 03/31/17 08:48:11 (5037.0) (3148721): SECMAN: failed to create session filetrans.<192.168.123.2:24345>#1481207467#1730 (key already exists).
> > 03/31/17 08:48:11 (5037.0) (3148721): SECMAN: existing session filetrans.<192.168.123.2:24345>#1481207467#1730:
> > 
> > <skip messages for other nodes; they always occur twice>
> > 
> > 03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
> > 03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942508 Fri Mar 31 08:41:48 2017
> > 03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
> > 03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2017
> > 03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
> > 03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942744 Fri Mar 31 08:45:44 2017
> > 03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
> > 03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2253
> > 
> > <skip>
> > 
> > 03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
> > 03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942744 Fri Mar 31 08:45:44 2017
> > 03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
> > 03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2253
> > 03/31/17 08:48:11 (5037.0) (3148721): ERROR "Assertion ERROR on (nextResourceToStart == numNodes)" at line 385 in file /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/parallelshadow.cpp
> > 
> > 
> > Harald
> > 
> > On Thursday 23 March 2017 15:48:21 Harald van Pee wrote:
> > > On Wednesday 22 March 2017 21:13:27 Jason Patton wrote:
> > > > you can try adding this to your parallel jobs' submit files:
> > > > 
> > > > +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> > > > 
> > > > The manual says that this tells condor to only consider the job
> > > > finished when all the nodes' processes have exited. What the manual
> > > > doesn't say is that this tells condor to reconnect to all of the
> > > > execute nodes in a parallel universe job if there is a network
> > > > interruption. Under the default configuration, where the job exits
> > > > only if node 0 exits, reconnection will only happen for node 0 if
> > > > there is an interruption.
> > > 
> > > Ah, that's interesting, we will try this next!
> > 
> > <snip>
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx