
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



On 02/06/2017 02:40 PM, Harald van Pee wrote:
Hello,

we got MPI running in the parallel universe with HTCondor 8.4 using openmpiscript
and it's working in general without any problem.


In general, MPI jobs cannot survive a network outage or partition, even a temporary one. If the problem is only between the submit machine and the execute machines, HTCondor will reconnect the shadow to the starters. But if the network problem also affects node-to-node communication, the job has to be aborted and restarted from scratch, because of the way MPI works.

If possible, we recommend that long-running jobs that suffer from this problem checkpoint themselves, so that a restarted job can resume from its last checkpoint instead of starting from scratch.

-greg
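
The self-checkpointing idea recommended above can be illustrated with a minimal sketch. This is a single-process toy loop, not an MPI program and not anything taken from HTCondor itself: the file name, the function names (load_checkpoint, save_checkpoint, run), and the stop_after parameter that simulates an abort are all hypothetical. It also assumes the checkpoint file lives on storage that is still available when the job is restarted.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name, chosen for this sketch


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state atomically: fill a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash never leaves
    # a half-written checkpoint behind.
    os.replace(tmp, path)


def run(path, n_steps, every=10, stop_after=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    stop_after (hypothetical, for demonstration only) simulates an abort
    after that many steps of work in this invocation.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    done = 0
    while state["step"] < n_steps:
        if stop_after is not None and done >= stop_after:
            return state  # simulated abort: work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    print(run(CHECKPOINT, 100))
```

If the first invocation is cut off partway, the next one picks up from the most recent checkpoint and redoes only the steps after it. In a real MPI job the state would be larger and the ranks would need to agree on when to checkpoint, which this sketch does not attempt to show.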