Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

Hi Greg,

thanks for your answer.

On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > Hello,
> > 
> > we got MPI running in the parallel universe with HTCondor 8.4 using
> > openmpiscript, and it's working in general without any problem.
> In general, the MPI jobs themselves cannot survive a network outage or
> partition, even a temporary one.  HTCondor will reconnect the shadow to
> the starters, if the problem is just between the submit machine and the
> execute machines, but if the network problem also impacts node-to-node
> communication, then the job has to be aborted and restarted from scratch
> because of the way MPI works.

The problem seems to be between the submit machine and one running node (not
the node where mpirun was started).
If you are right, it should be possible to find an error from mpirun,
because it lost one node, right?
But it seems condor kills the job because of a shadow exception.
Unfortunately, we do not see the output of the stopped job, because it is
overwritten by the newly started one.
Any suggestions on how to find out whether it is really an MPI-related problem?
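
One possible way to keep the output of the aborted run is to have the job
write its own log under a name that changes on every (re)start. The sketch
below is only an illustration in Python. The helper names, the file name
pattern, and the assumption that the target directory lives on a filesystem
that survives the restart (for example a shared one) are all hypothetical
and not taken from our setup.

```python
import os
import socket
import time


def unique_log_path(base="job", directory="."):
    """Build a log file name that differs on every (re)start of the job.

    The name contains the host name, a timestamp and the process id, so a
    restarted job does not reuse the name of the aborted one.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    host = socket.gethostname()
    name = f"{base}.{host}.{stamp}.{os.getpid()}.log"
    return os.path.join(directory, name)


def open_run_log(base="job", directory="."):
    """Open a fresh per-run log file for writing.

    Mode "x" raises FileExistsError instead of silently overwriting an
    existing file.
    """
    return open(unique_log_path(base, directory), "x")
```

Each start then leaves its own file behind, and the host name in the file
name shows on which node that process was running.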

> If possible, we would recommend that long-running jobs that suffer from
> this problem try to self-checkpoint themselves, so that when they are
> restarted, they don't need to be restarted from scratch.
> -greg
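
A minimal sketch of what such self-checkpointing could look like, in Python
and under stated assumptions: the file name checkpoint.json, the functions
load_state, save_state and run, and the trivial "work" in the loop are
hypothetical placeholders, and the checkpoint file is assumed to be stored
somewhere that is still available after the restart. In an MPI job, one
would typically let a single rank write the file.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write the checkpoint atomically, so a kill in the middle of a
    write cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def run(n_steps, path=CHECKPOINT, every=10):
    """Resume from the last checkpoint and work until n_steps is reached."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state
```

If the job is killed and started again, run() picks up at the last saved
step, so at most the last `every` steps are repeated.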

Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx