Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

Date: Mon, 06 Feb 2017 15:18:08 -0600
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

On 02/06/2017 02:40 PM, Harald van Pee wrote:

Hello,

we got MPI running in the parallel universe with HTCondor 8.4 using openmpiscript,
and it's working in general without any problem.

In general, the MPI jobs themselves cannot survive a network outage or partition, even a temporary one. HTCondor will reconnect the shadow to the starters if the problem is just between the submit machine and the execute machines, but if the network problem also impacts node-to-node communication, then the job has to be aborted and restarted from scratch because of the way MPI works.

If possible, we would recommend that long-running jobs that suffer from this problem checkpoint themselves, so that when they are restarted, they don't need to start again from scratch.
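
A minimal sketch of what self-checkpointing can look like, in Python. This is only an illustration under stated assumptions: the file name checkpoint.json, the helper names, and the toy "work" loop are all hypothetical, and it uses no HTCondor or MPI API. The idea is simply that the job saves its progress periodically and, when started again, resumes from the last saved state.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical name, chosen for illustration


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write the state atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def run(n_steps, checkpoint_every=10, path=CHECKPOINT_FILE):
    """Do n_steps units of (placeholder) work, resuming from a checkpoint if present."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # placeholder for the real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

In a real MPI job the ranks would have to agree on when to checkpoint (for example, one rank writing the state after a barrier, or each rank writing its own file), and the checkpoint has to be stored somewhere that is still reachable after the restart. Those details depend on the application and the site setup and are not shown here.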


-greg

Follow-Ups:
- Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
  - From: Harald van Pee

References:
- [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
  - From: Harald van Pee

