
[Condor-users] schedd goes catatonic



We've been seeing a problem on our XSEDE Condor frontend (running
version 7.6.7) where after a few hours the schedd seems to fall apart.
Job submissions hang, the number of running shadow processes drops to
zero or near-zero, and condor_q returns:

-- Failed to fetch ads from: <128.211.128.45:51941> : tg-condor.rcac.purdue.edu
SECMAN:2007:Failed to end classad message.

The only way to get jobs running again is to restart Condor, but since
it only takes a few hours for the schedd to keel over, we're not
seeing much job completion. Other identically configured schedds do
not exhibit this behavior. There are no obvious indications of other
system problems, and I'm at a loss for where to look next. Logs from
the host can be found at:

http://boilergrid.rcac.purdue.edu/tickets/tg-condor_schedd_deaths/

From the count of running condor_shadow processes, it appears the
problem begins manifesting itself around 16:56. Any guidance would be
greatly appreciated.
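
For anyone who wants to build a similar timeline on their own host, here is a minimal sketch of how the condor_shadow count could be sampled and the onset of a collapse estimated. It is an illustration only, not the method used to collect the logs above. It assumes a Linux host with pgrep from procps available; the function names, the 60-second interval, and the 10% threshold are all hypothetical choices.

```python
import subprocess
import time
from datetime import datetime


def count_shadows():
    """Count running condor_shadow processes (assumes procps pgrep exists)."""
    # pgrep -c prints the number of matches; it exits 1 when that number is 0,
    # so the return code is deliberately not checked.
    result = subprocess.run(
        ["pgrep", "-c", "-x", "condor_shadow"],
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip() or 0)


def first_collapse(samples, fraction=0.1):
    """Return the label of the first sample whose count has fallen to at most
    `fraction` of the highest count seen so far, or None if that never happens.

    `samples` is an iterable of (label, count) pairs in time order.
    The 10% default is an arbitrary illustrative threshold.
    """
    peak = 0
    for label, count in samples:
        peak = max(peak, count)
        if peak > 0 and count <= peak * fraction:
            return label
    return None


def monitor(interval=60, iterations=10):
    """Sample the shadow count `iterations` times, `interval` seconds apart."""
    samples = []
    for i in range(iterations):
        stamp = datetime.now().strftime("%H:%M:%S")
        samples.append((stamp, count_shadows()))
        if i < iterations - 1:
            time.sleep(interval)
    return samples


# Hypothetical usage:
#   samples = monitor(interval=60, iterations=240)
#   print(first_collapse(samples))
```

Comparing the reported time against the SchedLog and ShadowLog entries around that moment narrows down where to look.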


Thanks,
BC

-- 
Ben Cotton
Systems Research Engineer
IT Research Systems
Purdue University