[HTCondor-users] Failed to start non-blocking update to <127.0.1.1:9618>



Hello.
First of all, I apologize for my "Google English".
We have a cluster with 11 nodes, and suddenly one of them, node06, stopped being seen by Condor. The node is on the network and can be accessed normally (the LDAP connection seems to be fine, because "home" and the NFS file shares load), but Condor does not consider it to be one of its nodes.
A week earlier, 2 nodes were added to the network (they were not added to Condor), and it is possible that node06 has been having problems since then, until Condor lost sight of it.
I must emphasize that I am a user of HTCondor, not the administrator: the administrator is not available at the moment, and a solution is urgently needed.
I have searched for a solution, but without success. I would appreciate your help.
Below is an extract of the most recent log files.


STARTLOG.TXT:

04/11/22 12:08:19 Now in new log file /var/log/condor/StartLog
04/11/22 12:08:19 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 12:08:19 IO: Failed to read packet header
04/11/22 12:08:19 SECMAN: no classad from server, failing
04/11/22 12:08:19 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 12:08:19 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 12:13:16 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
04/11/22 12:13:19 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 12:13:19 IO: Failed to read packet header
04/11/22 12:13:19 SECMAN: no classad from server, failing
04/11/22 12:13:19 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 12:13:19 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 12:13:19 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 12:13:19 IO: Failed to read packet header
04/11/22 12:13:19 SECMAN: no classad from server, failing
04/11/22 12:13:19 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 12:13:19 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 12:13:19 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 12:13:19 IO: Failed to read packet header
04/11/22 12:13:19 SECMAN: no classad from server, failing
04/11/22 12:13:19 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 12:13:19 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 12:13:19 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 12:13:19 IO: Failed to read packet header
04/11/22 12:13:19 SECMAN: no classad from server, failing
04/11/22 12:13:19 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 12:13:19 Failed to start non-blocking update to <127.0.1.1:9618>.
...
...
04/11/22 17:28:20 IO: Failed to read packet header
04/11/22 17:28:20 SECMAN: no classad from server, failing
04/11/22 17:28:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:28:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:28:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:28:20 IO: Failed to read packet header
04/11/22 17:28:20 SECMAN: no classad from server, failing
04/11/22 17:28:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:28:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:28:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:28:20 IO: Failed to read packet header
04/11/22 17:28:20 SECMAN: no classad from server, failing
04/11/22 17:28:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:28:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:28:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:28:20 IO: Failed to read packet header
04/11/22 17:28:20 SECMAN: no classad from server, failing
04/11/22 17:28:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:28:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:33:17 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
04/11/22 17:33:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:33:20 IO: Failed to read packet header
04/11/22 17:33:20 SECMAN: no classad from server, failing
04/11/22 17:33:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:33:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:33:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:33:20 IO: Failed to read packet header
04/11/22 17:33:20 SECMAN: no classad from server, failing
04/11/22 17:33:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:33:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:33:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:33:20 IO: Failed to read packet header
04/11/22 17:33:20 SECMAN: no classad from server, failing
04/11/22 17:33:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:33:20 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 17:33:20 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 17:33:20 IO: Failed to read packet header
04/11/22 17:33:20 SECMAN: no classad from server, failing
04/11/22 17:33:20 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 17:33:20 Failed to start non-blocking update to <127.0.1.1:9618>.
...
...
04/12/22 12:18:22 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/12/22 12:18:22 IO: Failed to read packet header
04/12/22 12:18:22 SECMAN: no classad from server, failing
04/12/22 12:18:22 ERROR: SECMAN:2007:Failed to end classad message.
04/12/22 12:18:22 Failed to start non-blocking update to <127.0.1.1:9618>.
04/12/22 12:18:22 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/12/22 12:18:22 IO: Failed to read packet header
04/12/22 12:18:22 SECMAN: no classad from server, failing
04/12/22 12:18:22 ERROR: SECMAN:2007:Failed to end classad message.
04/12/22 12:18:22 Failed to start non-blocking update to <127.0.1.1:9618>.
04/12/22 12:18:22 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/12/22 12:18:22 IO: Failed to read packet header
04/12/22 12:18:22 SECMAN: no classad from server, failing
04/12/22 12:18:22 ERROR: SECMAN:2007:Failed to end classad message.
04/12/22 12:18:22 Failed to start non-blocking update to <127.0.1.1:9618>.
04/12/22 12:18:22 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/12/22 12:18:22 IO: Failed to read packet header
04/12/22 12:18:22 SECMAN: no classad from server, failing
04/12/22 12:18:22 ERROR: SECMAN:2007:Failed to end classad message.
04/12/22 12:18:22 Failed to start non-blocking update to <127.0.1.1:9618>.
...
...
04/13/22 12:53:29 IO: Failed to read packet header
04/13/22 12:53:29 SECMAN: no classad from server, failing
04/13/22 12:53:29 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:53:29 Failed to start non-blocking update to <127.0.1.1:9618>.
04/13/22 12:53:29 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/13/22 12:53:29 IO: Failed to read packet header
04/13/22 12:53:29 SECMAN: no classad from server, failing
04/13/22 12:53:29 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:53:29 Failed to start non-blocking update to <127.0.1.1:9618>.
04/13/22 12:53:29 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/13/22 12:53:29 IO: Failed to read packet header
04/13/22 12:53:29 SECMAN: no classad from server, failing
04/13/22 12:53:29 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:53:29 Failed to start non-blocking update to <127.0.1.1:9618>.
04/13/22 12:53:29 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/13/22 12:53:29 IO: Failed to read packet header
04/13/22 12:53:29 SECMAN: no classad from server, failing
04/13/22 12:53:29 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:53:29 Failed to start non-blocking update to <127.0.1.1:9618>.

=============================================================================================

MASTERLOG.TXT:

04/11/22 04:42:57 Now in new log file /var/log/condor/MasterLog
04/11/22 04:42:57 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 04:47:57 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 04:47:57 IO: Failed to read packet header
04/11/22 04:47:57 SECMAN: no classad from server, failing
04/11/22 04:47:57 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 04:47:57 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 04:52:57 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 04:52:57 IO: Failed to read packet header
04/11/22 04:52:57 SECMAN: no classad from server, failing
04/11/22 04:52:57 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 04:52:57 Failed to start non-blocking update to <127.0.1.1:9618>.
04/11/22 04:57:57 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/11/22 04:57:57 IO: Failed to read packet header
04/11/22 04:57:57 SECMAN: no classad from server, failing
04/11/22 04:57:57 ERROR: SECMAN:2007:Failed to end classad message.
04/11/22 04:57:57 Failed to start non-blocking update to <127.0.1.1:9618>.
...
...
04/13/22 12:43:08 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/13/22 12:43:08 IO: Failed to read packet header
04/13/22 12:43:08 SECMAN: no classad from server, failing
04/13/22 12:43:08 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:43:08 Failed to start non-blocking update to <127.0.1.1:9618>.
04/13/22 12:48:08 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/13/22 12:48:08 IO: Failed to read packet header
04/13/22 12:48:08 SECMAN: no classad from server, failing
04/13/22 12:48:08 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:48:08 Failed to start non-blocking update to <127.0.1.1:9618>.
04/13/22 12:53:08 condor_read() failed: recv() 5 bytes from collector head.econets.org returned -1, timeout=20, errno=104 Connection reset by peer.
04/13/22 12:53:08 IO: Failed to read packet header
04/13/22 12:53:08 SECMAN: no classad from server, failing
04/13/22 12:53:08 ERROR: SECMAN:2007:Failed to end classad message.
04/13/22 12:53:08 Failed to start non-blocking update to <127.0.1.1:9618>.

===================================================================================

PROCLOG.TXT:

04/11/22 18:45:22 : Now in new log file /var/log/condor/ProcLog
04/11/22 18:45:22 : no methods have determined process 10983 to be in a monitored family
04/11/22 18:45:22 : ...snapshot complete
04/11/22 18:46:22 : taking a snapshot...
04/11/22 18:46:22 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/11/22 18:46:22 : ...snapshot complete
04/11/22 18:47:22 : taking a snapshot...
04/11/22 18:47:22 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/11/22 18:47:22 : ...snapshot complete
04/11/22 18:48:22 : taking a snapshot...
04/11/22 18:48:22 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/11/22 18:48:22 : ...snapshot complete
04/11/22 18:49:22 : taking a snapshot...
04/11/22 18:49:22 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/11/22 18:49:22 : ...snapshot complete
04/11/22 18:50:22 : taking a snapshot...
04/11/22 18:50:22 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/11/22 18:50:22 : ...snapshot complete
04/11/22 18:51:23 : taking a snapshot...
04/11/22 18:51:23 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/11/22 18:51:23 : process 10979 (not in monitored family) has exited
04/11/22 18:51:23 : no methods have determined process 10990 to be in a monitored family
04/11/22 18:51:23 : ...snapshot complete
...
...
04/13/22 12:53:44 : taking a snapshot...
04/13/22 12:53:44 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/13/22 12:53:44 : process 14123 (not in monitored family) has exited
04/13/22 12:53:44 : no methods have determined process 14131 to be in a monitored family
04/13/22 12:53:44 : no methods have determined process 14135 to be in a monitored family
04/13/22 12:53:44 : no methods have determined process 14136 to be in a monitored family
04/13/22 12:53:44 : ...snapshot complete
04/13/22 12:54:44 : taking a snapshot...
04/13/22 12:54:44 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/13/22 12:54:44 : process 14136 (not in monitored family) has exited
04/13/22 12:54:44 : no methods have determined process 14141 to be in a monitored family
04/13/22 12:54:44 : no methods have determined process 14142 to be in a monitored family
04/13/22 12:54:44 : no methods have determined process 14148 to be in a monitored family
04/13/22 12:54:44 : no methods have determined process 14149 to be in a monitored family
04/13/22 12:54:44 : ...snapshot complete
04/13/22 12:55:44 : taking a snapshot...
04/13/22 12:55:44 : ProcAPI: new boottime = 1649166365; old_boottime = 1649166365; /proc/stat boottime = 1649166365; /proc/uptime boottime = 1649166365
04/13/22 12:55:44 : no methods have determined process 14159 to be in a monitored family
04/13/22 12:55:44 : ...snapshot complete

==========================================================================================================

SHAREDPORTLOG.TXT:
(NOTE: Key-string was changed)

04/12/22 15:08:05 Now in new log file /var/log/condor/SharedPortLog
04/12/22 15:08:05 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:45619>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/12/22 15:08:24 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:34753>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/12/22 15:08:24 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:46251>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/12/22 15:08:24 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:42219>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/12/22 15:08:24 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:38137>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/12/22 15:13:03 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :
ForkedChildrenPeak = 0
ForkedChildrenCurrent = 0
RequestsBlocked = 10410
MyAddress = "<192.168.1.6:9618?addrs=192.168.1.6-9618+[--1]-9618&noUDP>"
SharedPortCommandSinfuls = "<192.168.1.6:9618>,<[::1]:9618>"
RequestsPendingCurrent = 0
RequestsPendingPeak = 2
RequestsSucceeded = 1089
RequestsFailed = 10410
...
...
04/13/22 12:48:08 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:39011>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/13/22 12:48:29 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:38233>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/13/22 12:48:29 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:46563>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/13/22 12:48:29 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:43055>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/13/22 12:48:29 SharedPortServer: server was busy, failed to connect collector as requested by <127.0.0.1:34009>: primary (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
04/13/22 12:53:06 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :
ForkedChildrenPeak = 0
ForkedChildrenCurrent = 0
RequestsBlocked = 11710
MyAddress = "<192.168.1.6:9618?addrs=192.168.1.6-9618+[--1]-9618&noUDP>"
SharedPortCommandSinfuls = "<192.168.1.6:9618>,<[::1]:9618>"
RequestsPendingCurrent = 0
RequestsPendingPeak = 2
RequestsSucceeded = 1225
RequestsFailed = 11710
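
One observation from the logs above: the updates are sent to <127.0.1.1:9618>, which is a loopback address, while the shared_port daemon reports this node's own address as 192.168.1.6. In case it helps with the diagnosis, here is a minimal sketch, assuming Python 3 is available on node06, for checking which addresses the collector host name from the log (head.econets.org) resolves to on that node. The function names are illustrative only and are not part of HTCondor; this only inspects name resolution and does not prove where the fault is.

```python
import ipaddress
import socket
from typing import List


def is_loopback(address: str) -> bool:
    """Return True if the IP address is a loopback address (127.0.0.0/8 or ::1)."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def resolve_addresses(hostname: str, port: int = 9618) -> List[str]:
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def report(hostname: str) -> None:
    """Print each resolved address and whether it is loopback."""
    for addr in resolve_addresses(hostname):
        kind = "loopback (this machine only)" if is_loopback(addr) else "not loopback"
        print(f"{hostname} -> {addr}: {kind}")
```

Calling report("head.econets.org") on node06 would show whether the name maps to a 127.x.x.x address there. The two addresses taken from the logs can be checked directly: is_loopback("127.0.1.1") is True and is_loopback("192.168.1.6") is False.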

--
Un abrazo,
_______________________________
Daniel L. Stuardo (Mr. Dalien)
+569 38997269