
[Condor-users] What would cause a schedd to stop responding to condor_q queries?




I have a schedd that continually reports "failed to fetch ads" when asked for its state with condor_q. I looked in the ScheddLog for the machine and I'm seeing lots and lots of errors. What could have happened to put the machine in such an awful state?

- Ian

2/4 14:30:14 Tables are consistent
2/4 14:30:14 condor_write(): Socket closed when trying to write buffer
2/4 14:30:14 Buf::write(): condor_write() failed
2/4 14:30:14 Can't send job eom to mgr
2/4 14:30:14 Negotiating for owner: bchan@xxxxxxxxxx
2/4 14:30:14 Checking consistency running and runnable jobs
2/4 14:30:14 Tables are consistent
2/4 14:30:14 condor_write(): Socket closed when trying to write buffer
2/4 14:30:14 Buf::write(): condor_write() failed
2/4 14:30:14 Can't send job eom to mgr
2/4 14:30:14 Shadow pid 17382 for job 11.28 exited with status 4
2/4 14:30:14 ERROR: Shadow exited with job exception code!
2/4 14:30:14 Started shadow for job 25.73 on "<137.57.176.51:4846>", (shadow pid = 18388)
2/4 14:30:21 Sent ad to central manager for bchan@xxxxxxxxxx
2/4 14:30:21 Sent ad to 1 collectors for bchan@xxxxxxxxxx
2/4 14:31:09 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/4 14:31:09 ERROR: Child pid 16415 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 17381 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 15970 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 17350 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 17293 appears hung! Killing it hard.
2/4 14:31:09 condor_write(): Socket closed when trying to write buffer
2/4 14:31:09 Buf::write(): condor_write() failed
2/4 14:31:09 AUTHENTICATE: handshake failed!
2/4 14:31:09 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake

2/4 14:31:09 Shadow pid 17381 successfully killed because it was hung.
2/4 14:31:09 Shadow pid 17381 died with signal 4
2/4 14:31:09 Started shadow for job 25.74 on "<137.57.176.42:1998>", (shadow pid = 18497)
2/4 14:31:09 condor_write(): Socket closed when trying to write buffer
2/4 14:31:09 Buf::write(): condor_write() failed
2/4 14:31:09 AUTHENTICATE: handshake failed!
2/4 14:31:09 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake

2/4 14:31:09 Shadow pid 17379 for job 11.26 exited with status 4
2/4 14:31:09 ERROR: Shadow exited with job exception code!
2/4 14:31:11 Started shadow for job 25.81 on "<137.57.176.70:3975>", (shadow pid = 18499)
2/4 14:31:12 condor_write(): Socket closed when trying to write buffer
2/4 14:31:12 Buf::write(): condor_write() failed
2/4 14:31:12 AUTHENTICATE: handshake failed!
2/4 14:31:12 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake

2/4 14:31:12 Shadow pid 17378 for job 14.1 exited with status 4
2/4 14:31:12 ERROR: Shadow exited with job exception code!
2/4 14:31:12 condor_write(): Socket closed when trying to write buffer
2/4 14:31:12 Buf::write(): condor_write() failed
2/4 14:31:12 AUTHENTICATE: handshake failed!
2/4 14:31:12 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake

2/4 14:31:12 Shadow pid 17376 for job 11.25 exited with status 4
2/4 14:31:12 ERROR: Shadow exited with job exception code!
2/4 14:31:14 Started shadow for job 25.15 on "<137.57.176.86:3838>", (shadow pid = 18501)
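
To see which messages dominate an excerpt like the one above, the lines can be tallied by message type. The following is only a minimal sketch in Python, assuming the "M/D HH:MM:SS message" layout shown in the excerpt; the name `tally_messages` and the step that collapses digits to `N` are illustrative choices, not part of Condor.

```python
import re
from collections import Counter

# Assumed layout: "M/D HH:MM:SS " timestamp prefix, as in the excerpt above.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")


def tally_messages(lines):
    """Count log messages by type (hypothetical helper, not a Condor tool).

    Runs of digits are replaced with "N" so that lines differing only in
    pid, job id, or port are grouped under one message type.
    """
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip blank lines and anything without a timestamp
        message = line[match.end():].strip()
        counts[re.sub(r"\d+", "N", message)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2/4 14:31:09 ERROR: Child pid 16415 appears hung! Killing it hard.",
        "2/4 14:31:09 ERROR: Child pid 17381 appears hung! Killing it hard.",
        "2/4 14:31:09 AUTHENTICATE: handshake failed!",
        "",
    ]
    for message, count in tally_messages(sample).most_common():
        print(count, message)
```

Reading the whole log file into `tally_messages` line by line would give a rough ranking of the error types, which may help narrow down where the trouble started.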




--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300