Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] job cannot reconnect to starter running MPI

Date: Wed, 07 Jun 2017 15:59:58 -0500 (CDT)
From: Todd L Miller <tlmiller@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] job cannot reconnect to starter running MPI

06/06/17 19:45:06 (1115.0) (2409): Request to run on <10.1.255.217:47300> <10.1.255.217:47300> was ACCEPTED
06/06/17 19:47:52 (1115.0) (2409): Can no longer talk to condor_starter <10.1.255.217:47300>
06/06/17 19:47:52 (1115.0) (2409): This job cannot reconnect to starter, so job exiting
06/06/17 19:47:52 (1115.0) (2409): ERROR "Can no longer talk to condor_starter <10.1.255.217:47300>" at line 208 in file /slots/11/dir_17560/userdir/src/condor_shadow.V6.1/NTreceivers.cpp

So what this says is that about two minutes into the job, the starter either crashed or hung (or the network went away, but that seems unlikely), and the shadow doesn't know why. At this point, it would make sense to look at the execute node(s) -- their startd and starter logs -- and see what's going on at the same time.
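
If it helps, here is a rough sketch of one way to pull out the log lines around the failure time. It is not an HTCondor tool, just a small script that assumes the execute-node logs use the same "MM/DD/YY HH:MM:SS" prefix as the shadow log above. The names parse_stamp and lines_near and the three-minute window are made up for illustration, and the log paths are whatever your configuration uses (often something like StartLog and StarterLog.slotN in the log directory, but that is an assumption about your setup).

```python
from datetime import datetime, timedelta

# Timestamp prefix as it appears in the shadow log excerpt above.
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        # "06/06/17 19:47:52" is 17 characters long.
        return datetime.strptime(line[:17], STAMP_FORMAT)
    except ValueError:
        return None


def lines_near(lines, when, window_seconds=180):
    """Yield lines stamped within window_seconds of `when` (hypothetical helper)."""
    window = timedelta(seconds=window_seconds)
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None and abs(stamp - when) <= window:
            yield line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Time at which the shadow lost contact with the starter.
    failure = datetime.strptime("06/06/17 19:47:52", STAMP_FORMAT)
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for hit in lines_near(handle, failure):
                print(f"{path}: {hit}")
```

Run it on the execute node with the startd and starter log paths as arguments. Lines without a timestamp prefix (continuation lines, for instance) are skipped, so treat the output as a pointer to where to read in the full log rather than a complete picture. It also assumes the clocks on the submit and execute nodes roughly agree.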


- ToddM
