
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



On 2/21/2017 1:33 PM, Harald van Pee wrote:
It seems that OpenMPI (or MPI in general) is not used very often with HTCondor, the available
information is sparse, and I have received some questions about how I managed to get it
running at all. I will share all I know about this in a new thread soon,
or is there a wiki where I should put the information?


Off-list I put Harald in touch with the folks who can put his info into the Manual or the HTCondor Wiki (from the web homepage, look for the links "HOWTO recipes" and "HTCondor Wiki").

Also, we did some work for the upcoming HTCondor v8.6.1 release so that it works properly with the latest releases of OpenMPI - for details see
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024

Now back to our problem:
One hint that it is related to the network (Ethernet or InfiniBand) is that we
had one job running for 11 days without problems while fewer jobs were running,
and we got problems within a few days after we started 200 more jobs.
I have now found 2 independent parallel MPI jobs which share one machine (one
job each on that machine), and there are no Ethernet problems to be seen,
neither on the scheduler machine nor on the starter node. Unfortunately there
is no error output in the jobs' error files.
It is clear that condor kills the jobs, but to me it is unclear why, because it
seems both starter processes are still running, if I understand the log files
correctly.


At first blush, it looks to me like the condor_shadow on the submit node could no longer contact the execute node at IP address 192.168.123.37 due to "No route to host". The "No route to host" error comes from the operating system, not from HTCondor - you can google this error and see lots of opinions/ideas on how to troubleshoot and fix it, but basically there is no route for the execute node's IP address in the client's routing table... I am not sure why this would happen all of a sudden; maybe some interface on your submit machine, or some switch port, is being disabled?
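As a quick sanity check (just a sketch, using the addresses from your logs below), you could run something like this on the submit machine the next time the error shows up:

  # does the kernel still have a route to the execute node?
  ip route get 192.168.123.37

  # is the node reachable at all over the Ethernet interface?
  ping -c 3 192.168.123.37

  # are all expected interfaces still up?
  ip link show

If the route or the interface is gone at that moment, the problem is below HTCondor. Note that errno 113 can also be produced by a firewall along the path rejecting the connection, so that is worth ruling out as well.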

regards,
Todd


Maybe one of you can find a hint in the condor logs below and tell me what is
happening, or what I can do to find out.

Best
Harald

ShadowLog:
02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed: recv(fd=12)
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at
<192.168.123.37:30389>.
02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed: recv(fd=9)
returned -1, errno = 110 Connection timed out, reading 5 bytes from startd at
<192.168.123.37:30389>.
02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read timeout
after 0s during non-blocking read from startd at <192.168.123.37:30389>
(desired timeout=300s)
02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read timeout
after 0s during non-blocking read from startd at <192.168.123.37:30389>
(desired timeout=300s)
02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to condor_starter
<192.168.123.37:30389>
02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to condor_starter
<192.168.123.37:30389>
02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to starter, so
job exiting
02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to starter, so
job exiting
02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to
<192.168.123.37:30389> failed: No route to host (connect errno = 113).
02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter(): Could not
send command to startd
02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter(): Could not
send command to startd
02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to
condor_starter <192.168.123.37:30389>" at line 209 in file
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to
condor_starter <192.168.123.37:30389>" at line 209 in file
/slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp


StarterLog of job 1745.0 on node 192.168.123.37
02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes to
<127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8) returned -1,
errno = 104 Connection reset by peer, reading 5 bytes from
<192.168.123.100:25500>.
02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting 2400 secs
for reconnect
02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful shutdown.
02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is exiting
02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER) pid 751398
EXITING WITH STATUS 0

StarterLog of job 1744.0 on node 192.168.123.37
02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes to
<127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8) returned -1,
errno = 104 Connection reset by peer, reading 5 bytes from
<192.168.123.100:34337>.
02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting 2400 secs
for reconnect
02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful shutdown.
02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is exiting
02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER) pid 751399
EXITING WITH STATUS 0

StartLog:
02/19/17 03:09:48 slot1_11: Called deactivate_claim()
02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
02/19/17 03:09:48 slot1_13: Called deactivate_claim()
02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy ->
Preempting/Vacating
02/19/17 03:10:03 Starter pid 751399 exited with status 0
02/19/17 03:10:03 slot1_11: State change: starter exited
02/19/17 03:10:03 slot1_11: State change: No preempting claim, returning to
owner
02/19/17 03:10:03 slot1_11: Changing state and activity: Preempting/Vacating ->
Owner/Idle
02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
02/19/17 03:10:05 Starter pid 751398 exited with status 0
02/19/17 03:10:05 slot1_13: State change: starter exited
02/19/17 03:10:05 slot1_13: State change: No preempting claim, returning to
owner
02/19/17 03:10:05 slot1_13: Changing state and activity: Preempting/Vacating ->
Owner/Idle
02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
02/19/17 03:19:48 Error: can't find resource with ClaimId
(<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM); perhaps
this claim was removed already.
02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.123.100:20962>, fd is 8
02/19/17 03:19:48 Buf::write(): condor_write() failed
02/19/17 03:19:48 Error: can't find resource with ClaimId
(<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM); perhaps
this claim was removed already.
02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13 bytes
to <192.168.123.100:34792>, fd is 8
02/19/17 03:19:48 Buf::write(): condor_write() failed



On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
Dear experts,

I have some questions for debugging:
Can I avoid the restart of a job in the vanilla and/or parallel universe if I use
Requirements = (NumJobStarts == 0)
in the condor submit description file?
If it works, will the job stay idle or will it be removed?
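To illustrate what I mean (untested, and the executable, arguments and node count are just placeholders), the submit description file would look roughly like this:

  universe      = parallel
  executable    = openmpiscript
  arguments     = my_mpi_program
  machine_count = 16
  # only match machines while the job has never been started
  requirements  = (NumJobStarts == 0)
  queue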

I found a job in the vanilla universe that was started on 12/9, was restarted shortly
before Christmas, and is still running. I assume the reason was also network
problems, but unfortunately our condor and system log files only go back to
January.
Is there any possibility to make condor a little more robust against network
problems via configuration, for example by waiting a little longer or making
more reconnection attempts?
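For example, would something like this in the submit description file help? I have not tried it, and the value is just a guess (twice what I understand to be the default of 2400 seconds):

  # give the starter more time to wait for the shadow to reconnect
  job_lease_duration = 4800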

We are working on automatic restart of the MPI jobs and are trying to use more
frequent checkpoints, but it seems to be a lot of work, so any idea
would be welcome.

Best
Harald

On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
There is one important argument why I think the problem is condor
related, not MPI (of course I can be wrong).
The condor communication goes via Ethernet, and the Ethernet connection
has a problem for several minutes.
The MPI communication goes via InfiniBand, and there is no InfiniBand
problem during this time.

Harald

On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
Hi Greg,

thanks for your answer.

On Monday 06 February 2017 22:18:08 Greg Thain wrote:
On 02/06/2017 02:40 PM, Harald van Pee wrote:
Hello,

we got MPI running in the parallel universe with HTCondor 8.4 using
openmpiscript, and in general it is working without any problem.

In general, the MPI jobs themselves cannot survive a network outage
or partition, even a temporary one.  HTCondor will reconnect the
shadow to the starters if the problem is just between the submit
machine and the execute machines, but if the network problem also
impacts node-to-node communication, then the job has to be aborted
and restarted from scratch because of the way MPI works.

The problem seems to be between the submit machine and one running node (not the
node where mpirun was started).
If you are right, it should be possible to find an error from mpirun
because it lost one node, right?
But it seems condor kills the job because of a shadow exception.
Unfortunately we do not see the output of the stopped job, because it is
overwritten by the newly started one.
Any suggestion on how to find out whether it is really an MPI-related problem?

If possible, we would recommend that long-running jobs that suffer
from this problem checkpoint themselves, so that when they are
restarted, they don't need to start over from scratch.
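For a vanilla-universe job, a minimal sketch would be to have the job write a checkpoint file periodically and ask HTCondor to carry intermediate files across an eviction (the file name here is just an example; the job itself must write it and re-read it at startup):

  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT
  transfer_output_files   = checkpoint.dat

For the parallel universe the same idea applies, but the application has to coordinate the checkpoint across its MPI ranks itself.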

-greg


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685