
Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2

inline below 

----- Original Message ----- 

> From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
> To: condor-users@xxxxxxxxxxx
> Sent: Thursday, October 4, 2012 7:43:37 AM
> Subject: [Condor-users] Schedd daemon problems on Windows, Condor
> 7.8.2

> I have a Windows pool via Condor 7.8.2 and I have about 6 submit
> machines. After updating to 7.8.2 I started noticing that the
> central manager was not tracking the schedd machines well.

So your entire pool is now on 7.8.2?

> I am also
> noticing errors in the schedlog files. If I restart the services on
> these machines everything works fine for hours but then at some
> point they drop out of the pool. The schedd daemon never crashes but
> the central manager cannot identify them. If I reboot, they show
> back up but within 12 hours the central manager drops them again.

> Here is an error message I found in the schedlog for one of the
> submit machines.
> 10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
> 10/03/12 08:14:56 (pid:7848) Maximum history file size is: 4000000
> bytes
> 10/03/12 08:14:56 (pid:7848) Number of rotated history files is: 5
> 10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
> 10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
> 10/03/12 08:14:56 (pid:7848) Failed to execute
> C:/Condor/bin/condor_shadow.std.exe, ignoring

^ not an issue imho, as the standard universe is unsupported on Windows iirc.

> Another error I have seen:
> 10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936)
> returned -1, errno = 10054 , reading 5 bytes from
> <159.xxx.xxx.xxx:51513>.
> 10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
> 10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
> 10/03/12 07:14:19 (pid:3788) Socket activated, but could not read
> command
> 10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached
> socket)
> 10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936

^^ This seems to point towards the issue you are seeing. Could you capture a fuller log snippet with D_FULLDEBUG enabled?
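If it helps, here is a rough sketch for pulling the relevant parts out of a large SchedLog once D_FULLDEBUG is on (usually set via SCHEDD_DEBUG = D_FULLDEBUG in the local config, followed by a condor_reconfig). It just collects the lines around each "errno = 10054" entry so the context can be posted to the list. The function name, the window sizes, and the log path in the usage comment are my own assumptions, not anything Condor ships.

```python
def reset_snippets(lines, before=20, after=10):
    """Return lists of log lines surrounding each 'errno = 10054' entry.

    `before` and `after` are arbitrary context sizes (assumptions);
    windows that overlap are merged into one snippet.
    """
    lines = [line.rstrip("\n") for line in lines]
    hits = [i for i, line in enumerate(lines) if "errno = 10054" in line]

    windows = []  # list of [start, end) index pairs
    for i in hits:
        start = max(0, i - before)
        end = min(len(lines), i + after + 1)
        if windows and start <= windows[-1][1]:
            # Overlaps or touches the previous window: extend it
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    return [lines[start:end] for start, end in windows]


# Usage sketch (the path below is hypothetical; use wherever your
# schedd actually writes its log):
#
#   with open("C:/Condor/log/SchedLog", errors="replace") as fh:
#       for snippet in reset_snippets(fh):
#           print("\n".join(snippet))
#           print("-" * 60)
```

The matching negotiator log from the central manager for the same timestamps would be useful alongside this, since the reset is reported from the schedd side only.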

> Error 10054 via MSDN:
> Connection reset by peer.
> An existing connection was forcibly closed by the remote host. This
> normally results if the peer application on the remote host is
> suddenly stopped, the host is rebooted, the host or remote network
> interface is disabled, or the remote host uses a hard close (see
> setsockopt for more information on the SO_LINGER option on the
> remote socket). This error may also result if a connection was
> broken due to keep-alive activity detecting a failure while one or
> more operations are in progress. Operations that were in progress
> fail with WSAENETRESET. Subsequent operations fail with

> Is anyone else seeing this, or does anyone have thoughts as to what
> the problem could be? I am not sure if this is the schedd or the
> negotiator on the central manager. I am not seeing any errors with
> our network or on the 2 VM submit machines that I can more easily
> monitor.

> thank you!
> mike
