
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Hello, 

we have now tried the different shutdown policy. What we found is that
there are now reconnection attempts, but all of them end with an error
message, so this does not help.
What we see is that mpirun, the running program and openmpiscript are removed
on node0, and openmpiscript was also removed on all other nodes,
but all MPI programs are still running on all other nodes.
So most likely mpirun has no chance to kill the jobs (I do not understand
why) and condor does not do it either.

Here are the error messages from the ShadowLog; I have skipped similar
messages from other nodes:
03/31/17 08:48:10 (5037.0) (3148721): condor_read() failed: recv(fd=16) 
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at 
<192.168.123.13:19188>.
03/31/17 08:48:10 (5037.0) (3148721): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from startd at <192.168.123.13:19188> 
(desired timeout=300s)
03/31/17 08:48:10 (5037.0) (3148721): IO: Failed to read packet header
03/31/17 08:48:10 (5037.0) (3148721): Can no longer talk to condor_starter 
<192.168.123.13:19188>
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: failed to create session 
<192.168.123.2:24345>#1481207467#1730 (key already exists).
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: existing session 
<192.168.123.2:24345>#1481207467#1730:
03/31/17 08:48:11 (5037.0) (3148721): 
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION: failed to create security session 
for <192.168.123.2:24345>#1481207467#1730#..., so will fall back on security 
negotiation
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: failed to create session 
filetrans.<192.168.123.2:24345>#1481207467#1730 (key already exists).
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: existing session 
filetrans.<192.168.123.2:24345>#1481207467#1730:

<skipped messages for other nodes; they always occur twice>

03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942508 Fri Mar 
31 08:41:48 2017
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2017
03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942744 Fri Mar 
31 08:45:44 2017
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2253

<skip>

03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942744 Fri Mar 
31 08:45:44 2017
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2253
03/31/17 08:48:11 (5037.0) (3148721): ERROR "Assertion ERROR on 
(nextResourceToStart == numNodes)" at line 385 in file 
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/parallelshadow.cpp
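
As a cross-check on the lease values in the log above: the logged
"JobLeaseDuration remaining" figures are consistent with a simple
last-renewal plus duration minus current-time calculation. The sketch below
is only an illustration of that arithmetic, not a statement of how the shadow
computes it internally. The function name lease_remaining and the epoch value
NOW are assumptions; NOW is derived from the log timestamp 08:48:11, taking
the log's local time to be UTC+2 as implied by the line
"LastJobLeaseRenewal: 1490942508 Fri Mar 31 08:41:48 2017".

```python
def lease_remaining(last_renewal: int, lease_duration: int, now: int) -> int:
    """Seconds left before the job lease expires (negative if already expired)."""
    return last_renewal + lease_duration - now


# Assumed epoch for the log timestamp 03/31/17 08:48:11 (local time = UTC+2).
NOW = 1490942891

# LastJobLeaseRenewal values and JobLeaseDuration taken from the ShadowLog excerpt.
print(lease_remaining(1490942508, 2400, NOW))  # 2017
print(lease_remaining(1490942744, 2400, NOW))  # 2253
```

Both results match the logged values, so at the time of the assertion failure
the leases on these nodes still had more than half an hour left.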


Harald


On Thursday 23 March 2017 15:48:21 Harald van Pee wrote:
> On Wednesday 22 March 2017 21:13:27 Jason Patton wrote:
> > you can try adding this to your parallel jobs' submit files:
> >
> > +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> >
> > The manual says that this tells condor to only consider the job
> > finished when all the nodes' processes have exited. What the manual
> > doesn't say is that this tells condor to reconnect to all of the
> > execute nodes in a parallel universe job if there is a network
> > interruption. Under the default configuration, where the job exits
> > only if node 0 exits, reconnection will only happen for node 0 if
> > there is an interruption.
> 
> Ah, that's interesting, we will try this next!

<snip>