
[Condor-users] SCHEDD died due to signal 11



Hello,

We are having problems with our dedicated scheduler. The schedd daemon dies and then restarts, which causes all jobs to start over from the beginning (we can't use checkpointing). Here is the relevant excerpt from the MasterLog:

11/7 14:24:12 The SCHEDD (pid 17884) died due to signal 11
11/7 14:24:12 Sending obituary for "/condor/sbin/condor_schedd"
11/7 14:24:12 restarting /condor/sbin/condor_schedd in 10 seconds
11/7 14:24:22 Started DaemonCore process "/condor/sbin/condor_schedd", pid and pgroup = 32506
11/7 14:24:44 The SCHEDD (pid 32506) exited with status 4
11/7 14:24:44 Sending obituary for "/condor/sbin/condor_schedd"
11/7 14:24:44 restarting /condor/sbin/condor_schedd in 11 seconds
11/7 14:24:55 Started DaemonCore process "/condor/sbin/condor_schedd", pid and pgroup = 32533

What does signal 11 mean? What are the possible reasons for this to happen?
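For reference, the signal number in the MasterLog line can be mapped to its symbolic name. On Linux, signal 11 is SIGSEGV (a segmentation fault, i.e. an invalid memory access inside the process). The sketch below is only an illustration: the regular expression is an assumption based on the single log line format quoted above, the helper names `signal_name` and `find_signal_deaths` are hypothetical, and the result depends on the platform the script runs on. It identifies the signal but says nothing about why the schedd crashed.

```python
import re
import signal

# Pattern inferred from the MasterLog excerpt above (an assumption, not an
# official description of the Condor log format).
DIED_RE = re.compile(r"The (\w+) \(pid (\d+)\) died due to signal (\d+)")


def signal_name(number):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        return "unknown"


def find_signal_deaths(lines):
    """Yield (daemon, pid, signal number, signal name) for each matching line."""
    for line in lines:
        match = DIED_RE.search(line)
        if match:
            daemon = match.group(1)
            pid = int(match.group(2))
            signum = int(match.group(3))
            yield daemon, pid, signum, signal_name(signum)


# Sample line copied from the MasterLog excerpt above.
sample = ["11/7 14:24:12 The SCHEDD (pid 17884) died due to signal 11"]
for entry in find_signal_deaths(sample):
    print(entry)
```

Lines such as "exited with status 4" do not match the pattern and are skipped, so the same function could be pointed at a whole MasterLog to list only the signal-related deaths.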

This is the email sent by Condor:

Subject: [Condor] Problem

This is an automated email from the Condor system
on machine "cluster00.itqb.unl.pt".  Do not reply.

"/condor/sbin/condor_schedd" on "cluster00.itqb.unl.pt" died due to signal 11.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
11/7 14:23:20 (pid:17884) condor_write(): Socket closed when trying to write 504 bytes to unknown source, fd is 14, errno=107
11/7 14:23:20 (pid:17884) Buf::write(): condor_write() failed
11/7 14:23:20 (pid:17884) SECMAN: failed to end classad message
11/7 14:23:20 (pid:17884) ERROR: SECMAN:2007:Failed to end classad message
11/7 14:23:20 (pid:17884) condor_write(): Socket closed when trying to write 6 bytes to unknown source, fd is 14, errno=107
11/7 14:23:20 (pid:17884) Buf::write(): condor_write() failed
11/7 14:24:03 (pid:17884)       (Can't send alive message to  )
11/7 14:24:05 (pid:17884) Sent ad to central manager for ...
11/7 14:24:05 (pid:17884) Sent ad to 1 collectors for ...
11/7 14:24:05 (pid:17884) Sent ad to central manager for ...
11/7 14:24:05 (pid:17884) Sent ad to 1 collectors for ...
11/7 14:24:05 (pid:17884) Sent ad to central manager for ...
11/7 14:24:05 (pid:17884) Sent ad to central manager for ...
11/7 14:24:05 (pid:17884) Sent ad to 1 collectors for ...
11/7 14:24:07 (pid:17884) Inserting new attribute Scheduler into non-active cluster cid=335 acid=-1
11/7 14:24:07 (pid:17884) Inserting new attribute Scheduler into non-active cluster cid=336 acid=-1
11/7 14:24:07 (pid:17884) Inserting new attribute Scheduler into non-active cluster cid=302 acid=-1
11/7 14:24:08 (pid:17884) Inserting new attribute Scheduler into non-active cluster cid=333 acid=-1
11/7 14:24:08 (pid:17884) Inserting new attribute Scheduler into non-active cluster cid=334 acid=-1
*** End of file SchedLog


Thanks in advance

Sara Campos