
[Condor-users] Schedd daemon problems on Windows, Condor 7.8.2



I have a Windows pool running Condor 7.8.2 with about 6 submit machines. After updating to 7.8.2, I started noticing that the central manager was not tracking the schedd machines reliably. I am also seeing errors in the schedlog files. If I restart the services on these machines, everything works fine for hours, but at some point they drop out of the pool. The schedd daemon never crashes, but the central manager can no longer identify the machines. If I reboot, they show back up, but within 12 hours the central manager drops them again.

Here is an error message I found in the schedlog for one of the submit machines.
10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
10/03/12 08:14:56 (pid:7848)   Maximum history file size is: 4000000 bytes
10/03/12 08:14:56 (pid:7848)   Number of rotated history files is: 5
10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
10/03/12 08:14:56 (pid:7848) Failed to execute C:/Condor/bin/condor_shadow.std.exe, ignoring

Another error I have seen:

10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936) returned -1, errno = 10054 , reading 5 bytes from <159.xxx.xxx.xxx:51513>.
10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
10/03/12 07:14:19 (pid:3788) Socket activated, but could not read command
10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached socket)
10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936
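
To check whether these resets cluster around the times the machines drop out of the pool, the log can be scanned for lines that report an errno. This is only a minimal sketch: it assumes every line has the "date time (pid:N) message" layout shown in the excerpts above, and the function names (find_socket_errors, errors_per_hour) are made up for illustration, not part of Condor.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts above:
#   MM/DD/YY HH:MM:SS (pid:NNNN) message text
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\(pid:(?P<pid>\d+)\) (?P<msg>.*)$"
)
# Matches "errno = 10054" with flexible spacing around "=".
ERRNO_RE = re.compile(r"errno\s*=\s*(?P<errno>\d+)")


def find_socket_errors(lines):
    """Return (date, time, pid, errno) for each log line that reports an errno."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        e = ERRNO_RE.search(m.group("msg"))
        if e:
            hits.append(
                (m.group("date"), m.group("time"),
                 int(m.group("pid")), int(e.group("errno")))
            )
    return hits


def errors_per_hour(hits):
    """Count errno reports per (date, hour) so bursts stand out."""
    return Counter((date, time[:2]) for date, time, _pid, _errno in hits)


# Sample input copied from the log excerpt in this message.
sample = [
    "10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936) "
    "returned -1, errno = 10054 , reading 5 bytes from <159.xxx.xxx.xxx:51513>.",
    "10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header",
    "10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936",
]

hits = find_socket_errors(sample)
per_hour = errors_per_hour(hits)
```

In practice the list of lines would come from reading the schedlog file on each submit machine, and the per-hour counts could then be compared against the times the central manager stops listing that schedd.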

Error 10054 via MSDN:
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.



Is anyone else seeing this, or does anyone have thoughts as to what the problem could be? I am not sure whether this is the schedd or the negotiator on the central manager. I am not seeing any errors with our network, or on the 2 VM submit machines that I can more easily monitor.


thank you!
mike