
[HTCondor-users] Spectacular crash of Condor 8.2.6 under heavy load



This morning, one of my users somehow managed to kill the whole
set of Condor daemons - only some scheduniv processes were still
alive.
A possible pointer to the culprit is this message in the last of
the rotated ScheddLogs (the newest log apparently never got
written):

  15-02-19_08:37:30 (pid:1629) DaemonCore: accept() failed!

repeated over and over.
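
For anyone wanting to see how dense those repeats are, here is a rough
sketch that tallies the message per minute and pid. It assumes the
timestamp layout of the single line quoted above; the function names,
the directory argument and the "SchedLog*" file pattern are made up for
illustration and may need adjusting to the actual log file names.

```python
import re
from collections import Counter
from pathlib import Path

# Matches lines like: 15-02-19_08:37:30 (pid:1629) DaemonCore: accept() failed!
# Assumption: every log line uses the timestamp layout of the quoted line.
ACCEPT_FAILED = re.compile(
    r"^(?P<minute>\d{2}-\d{2}-\d{2}_\d{2}:\d{2}):\d{2} \(pid:(?P<pid>\d+)\) "
    r"DaemonCore: accept\(\) failed!"
)


def count_accept_failures(lines):
    """Return a Counter keyed by (minute, pid) for each matching line."""
    counts = Counter()
    for line in lines:
        match = ACCEPT_FAILED.match(line)
        if match:
            counts[(match.group("minute"), int(match.group("pid")))] += 1
    return counts


def scan_log_dir(log_dir, pattern="SchedLog*"):
    """Hypothetical helper: tally matches over all files matching pattern."""
    total = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes
        with path.open(errors="replace") as handle:
            total.update(count_accept_failures(handle))
    return total
```

Calling scan_log_dir() on the saved copy of /var/log/condor and printing
the result of most_common(10) would list the busiest minutes first.
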
I have saved the whole /var/log/condor directory, copied the 
ScheddClassad file into a new one (is there anything else that
keeps track of job cluster ids?), and am about to restart the
whole thing - hoping it won't go wrong this badly again.

If anyone wants to see the logs, let me know. I also have core dumps
of the MASTER, the SHADOW and the STARTER, but none of the SCHEDD
(only an older one from December).

Cheers
- S

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}