
Re: [Condor-users] Jobs don't run on execute machines



Finch, Ralph <rfinch@xxxxxxxxxxxx> wrote:
> 022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to
> reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx
> <136.200.32.179:4314>

The reconnection message is a red herring.  Condor is just trying
to recover from the real problem.  The question is, why did the
connection between your submit and execute computers close?

I suggest taking a few of these "disconnected" events and
correlating them with the ShadowLog on your submit computer and
the MasterLog, StartLog, and StarterLogs on the matching execute
computer.  There might be some useful clues in there.  I'm betting
the ShadowLog will just say something like "socket closed
unexpectedly."  Hopefully the execute computer will be able to
tell you why the connection closed.  Did the MasterLog report
that the Startd exited unexpectedly?  Did the Startd report that
the Starter exited unexpectedly?  Does either the Startd or the
Starter have warnings or errors in its log?  Perhaps one of
them complains about timing out while trying to contact the
submit computer.
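
If lining the timestamps up by hand gets tedious, something like
the following Python sketch could pull out the daemon log lines
written close to a given event.  It is only an illustration, not a
Condor tool: the function names are made up, and it assumes each
log line starts with a "MM/DD HH:MM:SS" timestamp like the one in
the user log event quoted above.  The year is a placeholder because
that timestamp format carries none.

```python
import re
from datetime import datetime, timedelta

# Assumption: each daemon log line begins with "MM/DD HH:MM:SS".
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2})")


def parse_stamp(text, year=2000):
    """Parse 'MM/DD HH:MM:SS'; the logs carry no year, so one is supplied."""
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d %H:%M:%S")


def lines_near(event_stamp, log_lines, window_seconds=120, year=2000):
    """Return the log lines stamped within window_seconds of event_stamp."""
    event = parse_stamp(event_stamp, year)
    window = timedelta(seconds=window_seconds)
    nearby = []
    for line in log_lines:
        match = STAMP.match(line)
        if match is None:
            continue  # continuation line or unexpected format: skip it
        if abs(parse_stamp(match.group(1), year) - event) <= window:
            nearby.append(line.rstrip("\n"))
    return nearby
```

For the event above you might call lines_near("02/03 08:11:22",
open("ShadowLog")) against a local copy of each log.  Keep in mind
that the clocks on the submit and execute computers may not agree
exactly, so a generous window is probably wise.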

To engage in wild guesswork, perhaps your submit computer is so
heavily overloaded that your shadows are unable to keep up with
the network traffic from the starters on the execute computers.
The starters eventually decide the other side is dead and hang
up.  If this is the problem, you might try configuration changes
on the submit computer: cut down on the number of jobs its startd
is willing to run simultaneously, use JOB_RENICE_INCREMENT to
decrease their priority, or both.  If the situation is bad
enough, you might need to stop running jobs on your submit node
entirely, but I would be surprised if you needed to go that far.
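
As a rough sketch, those changes might look like this in the
submit computer's local configuration file.  The values are
illustrative assumptions, not recommendations.  Only
JOB_RENICE_INCREMENT is named above; NUM_CPUS and START are the
settings I would assume for limiting or stopping jobs on that
machine, so check the manual for your Condor version before
relying on them.

```
# Run jobs that execute on this machine at a lower priority
# (a larger nice value means a lower priority).
JOB_RENICE_INCREMENT = 10

# Advertise fewer CPUs so the startd offers fewer slots here.
NUM_CPUS = 1

# Last resort: refuse to start any jobs on this machine.
# START = False
```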

I doubt that your central manager being overloaded is causing a
problem.  The most likely symptom of an overloaded central
manager is that new jobs don't get matched to execute nodes.
What you're seeing is existing jobs being interrupted.

-- 
Alan De Smet                              Condor Project Research
adesmet@xxxxxxxxxxx                http://www.cs.wisc.edu/condor/