Re: [HTCondor-users] job cannot reconnect to starter running MPI



06/06/17 19:45:06 (1115.0) (2409): Request to run on <10.1.255.217:47300> <10.1.255.217:47300> was ACCEPTED
06/06/17 19:47:52 (1115.0) (2409): Can no longer talk to condor_starter <10.1.255.217:47300>
06/06/17 19:47:52 (1115.0) (2409): This job cannot reconnect to starter, so job exiting
06/06/17 19:47:52 (1115.0) (2409): ERROR "Can no longer talk to condor_starter <10.1.255.217:47300>" at line 208 in file /slots/11/dir_17560/userdir/src/condor_shadow.V6.1/NTreceivers.cpp

So what this says is that just under three minutes into the job (the run request was accepted at 19:45:06 and contact was lost at 19:47:52), the starter either crashed or hung (or the network went away, but that seems unlikely), and the shadow doesn't know why. The next step is to look at the execute node(s) -- their startd and starter logs -- and see what was happening there around 19:47:52.
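
If it helps with lining up the logs, here is a rough Python sketch that pulls out the lines close to the moment the shadow lost contact. It assumes the execute-node logs start each entry with the same "MM/DD/YY HH:MM:SS" timestamp as the shadow log excerpt above, which may not hold if the logging format has been customized. The helper names (parse_stamp, lines_near) and the window sizes are made up for illustration, and the log files to pass in (typically the startd and starter logs on the execute node) depend on your configuration.

```python
import sys
from datetime import datetime, timedelta

# Assumption: each log entry starts with a stamp like "06/06/17 19:47:52".
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17  # length of "06/06/17 19:47:52"


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
    except ValueError:
        return None


def lines_near(lines, when,
               before=timedelta(minutes=3), after=timedelta(minutes=1)):
    """Yield lines stamped within [when - before, when + after].

    Lines without a timestamp (continuation lines) are kept only if the
    most recent stamped line was kept.
    """
    keep = False
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            keep = when - before <= stamp <= when + after
        if keep:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Time at which the shadow reported losing the starter (from the log above).
    lost_contact = parse_stamp("06/06/17 19:47:52")
    # Pass the startd and starter log paths on the command line.
    for path in sys.argv[1:]:
        print("==== " + path)
        with open(path, errors="replace") as log:
            for entry in lines_near(log, lost_contact):
                print(entry)
```

Run it on the execute node with the log paths as arguments. The default window starts three minutes before the disconnect so that it also covers job startup at 19:45:06.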

- ToddM