
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Hello, 

Now I can answer some of my questions.
With
Requirements = (NumJobStarts==0)
in the condor submit description file the jobs will not restart but will stay
idle, which is what we want.
The error and output files are then not overwritten, and with condor_q -l
one gets a lot of information about where the old job was started.
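
For reference, here is a stripped-down sketch of such a submit description file
(openmpiscript is what we use, as mentioned below; the argument, machine count
and file names are only placeholders, not our real job):

  universe        = parallel
  executable      = openmpiscript
  arguments       = our_mpi_program
  machine_count   = 8
  output          = out.$(Cluster).$(Process)
  error           = err.$(Cluster).$(Process)
  log             = log.$(Cluster)
  # keep the job idle after its first start instead of letting it match again
  Requirements    = (NumJobStarts==0)
  queue

To see where the old job ran, something like this works, e.g. for job 1744.0:

  condor_q -l 1744.0 | grep -Ei 'NumJobStarts|LastRemoteHost'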

It seems that openmpi (or mpi in general) is not used very often with htcondor,
the available information is sparse, and I have been asked how I managed to get
it running at all. I will share all I know about this in a new thread soon,
or is there a wiki where I should put the information?

Now back to our problem:
One hint that it is related to the network (ethernet or infiniband) is that
we had one job running for 11 days without problems while we had fewer jobs
running, and we got problems within a few days after we started 200 more
jobs.
I have now found 2 independent parallel mpi jobs which share one machine with
one job each, and no ethernet problems are seen, neither on the scheduler
machine nor on the starter node. Unfortunately there is no error output in the
jobs' error files.
It is clear that condor kills the jobs, but it is unclear to me why, because it
seems both starter processes were still running, if I understand the logfiles
correctly.
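
One detail from the logs below that I do not fully understand yet: the starters
wait 2400 secs for a reconnect, while the shadows give up immediately with
"This job cannot reconnect to starter". If I read the manual correctly, those
2400 seconds are the job lease (JobLeaseDuration). I do not know whether a
longer lease would have helped here, but for completeness this is what I would
try (the 7200 is only an example value):

  # in the submit description file, per job:
  job_lease_duration = 7200

  # pool-wide default, checked on the submit machine:
  condor_config_val JOB_DEFAULT_LEASE_DURATION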

Maybe one of you can spot something in the condor logs below and give me a hint
about what happened, or about what I can do to find out.
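
If more detail would help, I can also raise the debug level on both sides and
wait for the next occurrence. I assume something like the following in the
local configuration (knob names as in the manual; D_NETWORK is a guess at what
is most useful here), followed by a condor_reconfig:

  # on the submit machine:
  SHADOW_DEBUG  = D_FULLDEBUG D_NETWORK
  # on the execute node:
  STARTD_DEBUG  = D_FULLDEBUG D_NETWORK
  STARTER_DEBUG = D_FULLDEBUG D_NETWORK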

Best 
Harald

ShadowLog:
02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed: recv(fd=12) 
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at 
<192.168.123.37:30389>.
02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed: recv(fd=9) 
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at 
<192.168.123.37:30389>.
02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from startd at <192.168.123.37:30389> 
(desired timeout=300s)
02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read timeout 
after 0s during non-blocking read from startd at <192.168.123.37:30389> 
(desired timeout=300s)
02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to condor_starter 
<192.168.123.37:30389>
02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to condor_starter 
<192.168.123.37:30389>
02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to starter, so 
job exiting
02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to starter, so 
job exiting
02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to 
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to 
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter(): Could not 
send command to startd
02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter(): Could not 
send command to startd
02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to 
condor_starter <192.168.123.37:30389>" at line 209 in file 
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to 
condor_starter <192.168.123.37:30389>" at line 209 in file 
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp


StarterLog of job 1745.0 on node 192.168.123.37
02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes to 
<127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8) returned -1, 
errno = 104 Connection reset by peer, reading 5 bytes from 
<192.168.123.100:25500>.
02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting 2400 secs 
for reconnect
02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful shutdown.
02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is exiting
02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER) pid 751398 
EXITING WITH STATUS 0

StarterLog of job 1744.0 on node 192.168.123.37
02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes to 
<127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8) returned -1, 
errno = 104 Connection reset by peer, reading 5 bytes from 
<192.168.123.100:34337>.
02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting 2400 secs 
for reconnect
02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful shutdown.
02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is exiting
02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER) pid 751399 
EXITING WITH STATUS 0

StartLog:
02/19/17 03:09:48 slot1_11: Called deactivate_claim()
02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy -> 
Preempting/Vacating
02/19/17 03:09:48 slot1_13: Called deactivate_claim()
02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy -> 
Preempting/Vacating
02/19/17 03:10:03 Starter pid 751399 exited with status 0
02/19/17 03:10:03 slot1_11: State change: starter exited
02/19/17 03:10:03 slot1_11: State change: No preempting claim, returning to 
owner
02/19/17 03:10:03 slot1_11: Changing state and activity: Preempting/Vacating -> Owner/Idle
02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
02/19/17 03:10:05 Starter pid 751398 exited with status 0
02/19/17 03:10:05 slot1_13: State change: starter exited
02/19/17 03:10:05 slot1_13: State change: No preempting claim, returning to 
owner
02/19/17 03:10:05 slot1_13: Changing state and activity: Preempting/Vacating -> Owner/Idle
02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
02/19/17 03:19:48 Error: can't find resource with ClaimId 
(<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM); perhaps 
this claim was removed already.
02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes 
to <192.168.123.100:20962>, fd is 8
02/19/17 03:19:48 Buf::write(): condor_write() failed
02/19/17 03:19:48 Error: can't find resource with ClaimId 
(<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM); perhaps 
this claim was removed already.
02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes 
to <192.168.123.100:34792>, fd is 8
02/19/17 03:19:48 Buf::write(): condor_write() failed



On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> Dear experts,
> 
> I have some questions for debugging:
> Can I avoid restarting of a job in vanilla and/or parallel universe if I
> use Requirements = (NumJobStarts==0)
> in the condor submit description file?
> If it works, will the job stay idle or will it be removed?
> 
> I found a job in the vanilla universe that was started on 12/9, restarted
> shortly before Christmas, and is still running. I assume the reason was also
> network problems, but unfortunately our last condor and system log files are
> from January.
> Is there any possibility to make condor a little bit more robust against
> network problems via configuration? Just wait a little bit longer or make
> more reconnection attempts?
> 
> We are working on automatic restarting of the mpi jobs and are trying to use
> more frequent checkpoints, but it seems to be a lot of work, so any idea
> would be welcome.
> 
> Best
> Harald
> 
> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> > There is one important argument why I think the problem is condor
> > related, not mpi (of course I can be wrong).
> > The condor communication goes via ethernet, and the ethernet connection
> > has a problem for several minutes.
> > The mpi communication goes via infiniband, and there is no infiniband
> > problem during this time.
> > 
> > Harald
> > 
> > On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > > Hi Greg,
> > > 
> > > thanks for your answer.
> > > 
> > > On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > > > On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > > > Hello,
> > > > > 
> > > > > we got mpi running in the parallel universe with htcondor 8.4 using
> > > > > openmpiscript and it's working in general without any problem.
> > > > 
> > > > In general, the MPI jobs themselves cannot survive a network outage
> > > > or partition, even a temporary one.  HTCondor will reconnect the
> > > > shadow to the starters, if the problem is just between the submit
> > > > machine and the execute machines, but if the network problem also
> > > > impacts node-to-node communication, then the job has to be aborted
> > > > and restarted from scratch because of the way MPI works.
> > > 
> > > The problem seems to be between the submit machine and one running node
> > > (not the node where mpirun was started).
> > > If you are right, it should be possible to see an error from mpirun
> > > because it lost one node, right?
> > > But it seems condor kills the job because of a shadow exception.
> > > Unfortunately we do not see the output of the stopped job because it is
> > > overwritten by the newly started one.
> > > Any suggestion on how to find out if it is really an mpi related problem?
> > > 
> > > > If possible, we would recommend that long-running jobs that suffer
> > > > from this problem try to self-checkpoint themselves, so that when
> > > > they are restarted, they don't need to be restarted from scratch.
> > > > 
> > > > -greg