Re: [Condor-users] schedd goes catatonic




I see an alarming amount of time passing between entries in this part of your logs:

06/28/12 16:49:09 (pid:18733) Sent ad to 1 collectors for pandey3@xxxxxxxxxxxxxxx
06/28/12 16:50:02 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:50:51 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:51:44 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:52:39 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:53:29 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:54:23 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:55:06 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:56:02 (pid:18733) Failed to start non-blocking update to unknown.

Every time this "update to unknown" message appears in your logs, it is preceded by a long gap in time (43 to 56 seconds between entries in the excerpt above).  Having so many of these in a row is probably what caused the final collapse.
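
If you want to check the rest of the log for the same pattern, something like the following might help. It is only a rough sketch: it assumes every line of interest starts with a "MM/DD/YY HH:MM:SS" timestamp like the ones above, and the function names and the log path you pass in are my own placeholders, not anything shipped with Condor.

```python
import sys
from datetime import datetime

# Message text copied from the log excerpt above.
MARKER = "Failed to start non-blocking update"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        # "06/28/12 16:49:09" is 17 characters long.
        return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
    except ValueError:
        return None


def gaps_before_marker(lines, marker=MARKER):
    """Yield (seconds since the previous timestamped line, line) for marker lines."""
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip continuation lines without a timestamp
        if prev is not None and marker in line:
            yield (ts - prev).total_seconds(), line.rstrip()
        prev = ts


if __name__ == "__main__":
    # Usage (path is whatever your SchedLog is called): python gaps.py SchedLog
    with open(sys.argv[1]) as log:
        for gap, line in gaps_before_marker(log):
            print(f"{gap:6.0f}s  {line}")
```

A gap that is consistently large before each of these messages would suggest the schedd is blocking on something just before it logs the failure.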

I'm not sure what would cause this.  Have you got something in your flocking list that isn't a valid DNS name?
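
One quick way to test that guess is to try resolving each entry yourself. The sketch below assumes the flocking list is whatever `condor_config_val FLOCK_TO` prints, and that the entries are bare hostnames separated by commas or spaces (no ports or address strings). The function name and the example hostnames are made up for illustration.

```python
import socket


def unresolvable_hosts(flock_to, resolve=socket.getaddrinfo):
    """Return the entries of a comma/space separated host list that fail to resolve."""
    bad = []
    for host in flock_to.replace(",", " ").split():
        try:
            resolve(host, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(host)
    return bad


if __name__ == "__main__":
    # Hypothetical value; paste in the real FLOCK_TO setting instead.
    flock_to = "pool-a.example.org, pool-b.example.org"
    for host in unresolvable_hosts(flock_to):
        print("does not resolve:", host)
```

Note that a name that resolves slowly could also hurt, and this sketch only reports outright failures.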

--Dan

On 6/28/12 4:29 PM, Ben Cotton wrote:
We've been seeing a problem on our XSEDE Condor frontend (running
version 7.6.7) where after a few hours the schedd seems to fall apart.
Job submissions hang, the number of running shadow processes drops to
zero or near-zero, and condor_q returns:

-- Failed to fetch ads from: <128.211.128.45:51941> : tg-condor.rcac.purdue.edu
SECMAN:2007:Failed to end classad message.

The only way to get jobs running again is to restart Condor, but since
it only takes a few hours for the schedd to keel over, we're not
seeing much job completion. Other identically configured schedds do
not exhibit this behavior. There are no obvious indications of other
system problems, and I'm at a loss for where to look next. Logs from
the host can be found at:

http://boilergrid.rcac.purdue.edu/tickets/tg-condor_schedd_deaths/

From the count of running condor_shadow processes, it appears the
problem begins manifesting itself around 16:56. Any guidance would be
greatly appreciated.


Thanks,
BC