
Re: [Condor-users] CondorMaster restarting sometime



On Wednesday 20 January 2010, Henning Fehrmann wrote:
> Hello,
>
> we are running Condor 7.4.1 and have observed that, on nodes running a
> startd, the condor_master process exits with code 0 and restarts from
> time to time. This happens on arbitrary nodes at arbitrary times. We
> have not yet been able to correlate this with a particular kind of job.
> We increased the verbosity on some nodes and collected the logs.
>
> I took the logs from around one such event (the CKPTLog, MasterLog and
> StartLog of the startd node, plus the CollectorLog of the submit host)
> and put them into a tarball:
>
> http://atlas1.atlas.aei.uni-hannover.de/~fehrmann/condor_log.tgz
>
> Unfortunately, we were too slow: log rotation had already erased the
> corresponding events in the StarterLogs.
>
> If you need the configuration or more logging please tell us.

One thing in the master's log looks suspicious: the master received a
SIGTERM and did what it's supposed to do. It's not at all clear why it's
getting the SIGTERM, however:

01/18 18:34:01 (fd:8) (pid:9465) DaemonCore: received Signal 15 (SIGTERM), raising event handle_dc_sigterm()

Earlier in the log there's the following, but I think it's a
DAEMON_OFF_PEACEFUL directed at the startd (to which the master then sends
a TERM) rather than at the master itself:

01/18 18:33:52 (fd:9) (pid:9465) Received TCP command 483 (DAEMON_OFF_PEACEFUL) from  <10.10.1.74:56227>, access level ADMINISTRATOR

I'd look around on the system and see what could be sending a TERM to the 
master.
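
As a starting point, it may help to list every SIGTERM in the MasterLog
together with the commands that arrived shortly before it, so that recurring
peers stand out. Below is a rough Python sketch of that idea, not a Condor
tool. It assumes each event sits on a single line in the actual log file (the
mail wraps them), that the lines look like the two quoted above, and that the
year must be supplied because the timestamps omit it. The function names, the
`year` parameter and the 30-second window are all made up for illustration.
Note that this cannot tell you which process sent the signal; it only shows
what the master was asked to do just beforehand.

```python
import re
from datetime import datetime, timedelta

# Timestamp at the start of a line, e.g. "01/18 18:34:01" (no year in the log).
TS = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")
# e.g. "received Signal 15 (SIGTERM)"
SIGNAL = re.compile(r"received Signal (\d+) \((\w+)\)")
# e.g. "Received TCP command 483 (DAEMON_OFF_PEACEFUL) from  <10.10.1.74:56227>"
COMMAND = re.compile(r"Received \w+ command (\d+)\s+\((\w+)\) from\s+<([^>]+)>")

# The two MasterLog lines quoted in this thread, unwrapped to one line each.
SAMPLE = [
    "01/18 18:33:52 (fd:9) (pid:9465) Received TCP command 483 "
    "(DAEMON_OFF_PEACEFUL) from  <10.10.1.74:56227>, access level ADMINISTRATOR",
    "01/18 18:34:01 (fd:8) (pid:9465) DaemonCore: received Signal 15 (SIGTERM), "
    "raising event handle_dc_sigterm()",
]


def parse(lines, year=2010):
    """Yield (time, kind, name, peer) for each signal or command line."""
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip continuation lines or unrelated formats
        when = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
        sig = SIGNAL.search(line)
        cmd = COMMAND.search(line)
        if sig:
            yield when, "signal", sig.group(2), None
        elif cmd:
            yield when, "command", cmd.group(2), cmd.group(3)


def commands_before_sigterm(lines, window_seconds=30, year=2010):
    """Return [(sigterm_time, [(cmd_time, cmd_name, peer), ...]), ...]."""
    events = list(parse(lines, year))
    window = timedelta(seconds=window_seconds)
    report = []
    for when, kind, name, _ in events:
        if kind == "signal" and name == "SIGTERM":
            # Commands logged at most `window` before this SIGTERM.
            nearby = [
                (t, n, p)
                for t, k, n, p in events
                if k == "command" and timedelta(0) <= when - t <= window
            ]
            report.append((when, nearby))
    return report


if __name__ == "__main__":
    for when, nearby in commands_before_sigterm(SAMPLE):
        print(f"SIGTERM at {when:%m/%d %H:%M:%S}")
        for t, name, peer in nearby:
            print(f"  {t:%H:%M:%S} {name} from {peer}")
```

To run it against a real log, pass the lines of the file instead of `SAMPLE`,
for example `commands_before_sigterm(open(path))` with `path` pointing at the
node's MasterLog. If the same peer address shows up before every restart,
that host is the first place I'd look.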

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences