Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

Date: Mon, 06 Feb 2017 21:40:30 +0100
From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
Subject: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

Hello,

We have MPI running in the parallel universe with HTCondor 8.4 using openmpiscript,
and it's working in general without any problem.

But from time to time we have a temporary network interruption. The jobs in
the vanilla universe (in general not MPI jobs) seem to have no problem, and
the scheduler reconnects to the startd.
But for the MPI job, and only for this kind of job, the error message in the
subject appears in the ShadowLog for one of the (in this case 50) starters of
an mpirun job.
Then, unfortunately, the complete mpirun job is restarted from the beginning
because of a shadow exception!

Is this just a bug, or can I change some configuration setting so that HTCondor
waits longer for the missing starter?

Because it's a temporary network problem on
a lot of hosts, I am pretty sure that the running starter (program) still
exists on the host whose connection was lost.

Best
Harald

Follow-Ups:
- Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
  - From: Greg Thain
