
[HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp


We have MPI running in the parallel universe with HTCondor 8.4 using
openmpiscript, and in general it works without any problems.

From time to time, however, we have temporary network interruptions. Jobs in
the vanilla universe (generally non-MPI jobs) seem to have no problem, and
the scheduler reconnects to the startd.
But for MPI jobs, and only for this kind of job, the error message in the
subject appears in the ShadowLog for one of the starters (50 in this case) of
an mpirun job.
Unfortunately, the complete mpirun job is then restarted from the beginning
because of a shadow exception!

Is this just a bug, or can I change some configuration setting so that condor
waits longer for the missing starter?

Because it is a temporary network problem affecting
a lot of hosts, I am pretty sure that the starter (program) is still
running on the host whose connection was lost.