
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Harald,

As for the MPI processes still running on the other nodes when the job is killed, I think I might have an idea. Do you still see non-system sshds running on the execute nodes? We may want to add another handler to make sure they get SIGTERM when sshd_cleanup is called.

Jason

On Fri, Mar 31, 2017 at 7:06 AM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
Hello,

we have now tried the different shutdown policy. What we found is that
there are now reconnection attempts, but they all end with an error
message, so this does not help.
What we see is that mpirun, the running program, and openmpiscript are removed
on node0, and openmpiscript was also removed on all other nodes,
but all MPI programs are still running on all other nodes.
So most likely mpirun has no chance to kill the jobs (I do not understand
why), and condor does not do it either.

Here are the error messages from the ShadowLog; I have skipped similar
messages from other nodes:
03/31/17 08:48:10 (5037.0) (3148721): condor_read() failed: recv(fd=16)
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at
<192.168.123.13:19188>.
03/31/17 08:48:10 (5037.0) (3148721): condor_read(): UNEXPECTED read timeout
after 0s during non-blocking read from startd at <192.168.123.13:19188>
(desired timeout=300s)
03/31/17 08:48:10 (5037.0) (3148721): IO: Failed to read packet header
03/31/17 08:48:10 (5037.0) (3148721): Can no longer talk to condor_starter
<192.168.123.13:19188>
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: failed to create session
<192.168.123.2:24345>#1481207467#1730 (key already exists).
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: existing session
<192.168.123.2:24345>#1481207467#1730:
03/31/17 08:48:11 (5037.0) (3148721):
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION: failed to create security session
for <192.168.123.2:24345>#1481207467#1730#..., so will fall back on security
negotiation
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: failed to create session
filetrans.<192.168.123.2:24345>#1481207467#1730 (key already exists).
03/31/17 08:48:11 (5037.0) (3148721): SECMAN: existing session
filetrans.<192.168.123.2:24345>#1481207467#1730:

<skipped messages for other nodes; they always occur twice>

03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942508 Fri Mar
31 08:41:48 2017
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2017
03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942744 Fri Mar
31 08:45:44 2017
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2253

<skip>

03/31/17 08:48:11 (5037.0) (3148721): Trying to reconnect to disconnected job
03/31/17 08:48:11 (5037.0) (3148721): LastJobLeaseRenewal: 1490942744 Fri Mar
31 08:45:44 2017
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration: 2400 seconds
03/31/17 08:48:11 (5037.0) (3148721): JobLeaseDuration remaining: 2253
03/31/17 08:48:11 (5037.0) (3148721): ERROR "Assertion ERROR on
(nextResourceToStart == numNodes)" at line 385 in file
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/parallelshadow.cpp
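
The "JobLeaseDuration remaining" values in the log above are consistent with simple arithmetic: last renewal plus lease duration, minus the time of the log entry. The following is only a minimal sketch of that arithmetic, not HTCondor's own code. The function name `lease_remaining` is hypothetical, and the UTC+2 offset is an assumption inferred from the epoch values printed next to the local timestamps.

```python
from datetime import datetime, timedelta, timezone

# Assumption: ShadowLog timestamps are local time at UTC+2 (CEST), which
# matches the epoch values printed next to LastJobLeaseRenewal.
LOG_TZ = timezone(timedelta(hours=2))


def lease_remaining(last_renewal_epoch, lease_duration, log_time):
    """Seconds left on the job lease at the moment of the log entry."""
    expiry = last_renewal_epoch + lease_duration
    return expiry - int(log_time.timestamp())


# Time of the "Trying to reconnect to disconnected job" entries.
log_time = datetime(2017, 3, 31, 8, 48, 11, tzinfo=LOG_TZ)

print(lease_remaining(1490942508, 2400, log_time))  # 2017
print(lease_remaining(1490942744, 2400, log_time))  # 2253
```

Both results match the logged values, so at the time of the assertion failure every lease still had more than 30 minutes left.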


Harald


On Thursday 23 March 2017 15:48:21 Harald van Pee wrote:
> On Wednesday 22 March 2017 21:13:27 Jason Patton wrote:
> > you can try adding this to your parallel jobs' submit files:
> >
> >
> > +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> >
> >
> >
> > The manual says that this tells condor to only consider the job
> > finished when all the nodes' processes have exited. What the manual
> > doesn't say is that this tells condor to reconnect to all of the
> > execute nodes in a parallel universe job if there is a network
> > interruption. Under the default configuration, where the job exits
> > only if node 0 exits, reconnection will only happen for node 0 if
> > there is an interruption.
>
> Ah, that's interesting, we will try this next!

<snip>