
Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2



Thank you. I will work on collecting the information (will try to wait for submit machines to drop) and get back to you. Just about all my machines in the pool (~100) are 7.8.2. There are definitely a handful that are 7.6.6 but the central manager and submit machines are 7.8.2.

thanks again,
mike






From: Tim St Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 10/04/2012 08:02 AM
Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Sent by: condor-users-bounces@xxxxxxxxxxx





inline below




----- Original Message -----

> From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
> To: condor-users@xxxxxxxxxxx
> Sent: Thursday, October 4, 2012 7:43:37 AM
> Subject: [Condor-users] Schedd daemon problems on Windows, Condor
> 7.8.2

> I have a Windows pool running Condor 7.8.2 with about 6 submit
> machines. After updating to 7.8.2 I started noticing that the
> central manager was not tracking the schedd machines well.

So your entire pool is now 7.8.2?

> I am also
> noticing errors in the schedlog files. If I restart the services on
> these machines everything works fine for hours but then at some
> point they drop out of the pool. The schedd daemon never crashes but
> the central manager cannot identify them. If I reboot, they show
> back up but within 12 hours the central manager drops them again.

> Here is an error message I found in the schedlog for one of the
> submit machines.
> 10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
> 10/03/12 08:14:56 (pid:7848) Maximum history file size is: 4000000
> bytes
> 10/03/12 08:14:56 (pid:7848) Number of rotated history files is: 5
> 10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
> 10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
> 10/03/12 08:14:56 (pid:7848) Failed to execute
> C:/Condor/bin/condor_shadow.std.exe, ignoring

^ not an issue imho as std universe is unsupported on windows iirc.

> Another error I have seen:
> 10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936)
> returned -1, errno = 10054 , reading 5 bytes from
> <159.xxx.xxx.xxx:51513>.
> 10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
> 10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
> 10/03/12 07:14:19 (pid:3788) Socket activated, but could not read
> command
> 10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached
> socket)
> 10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936


^^ This seems to point towards the issue you are seeing. Could you capture a fuller log snippet with D_FULLDEBUG enabled?
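
When collecting the fuller log, it may help to pull out every connection-reset entry so the timing can be compared against when the central manager drops the machine. The following is only a sketch: it assumes each entry sits on one line in the "MM/DD/YY HH:MM:SS (pid:N) message" layout seen in the excerpts above, and the function name find_read_failures and the sample list are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the quoted excerpts: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def find_read_failures(lines):
    """Return (timestamp, pid, errno) for each condor_read() failure entry."""
    failures = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or "condor_read() failed" not in match.group(3):
            continue  # not a log entry, or not a read failure
        err = ERRNO_RE.search(match.group(3))
        failures.append(
            (match.group(1), int(match.group(2)), int(err.group(1)) if err else None)
        )
    return failures


# Sample entries taken from the quoted log, with the mail wrapping undone.
sample = [
    "10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936) "
    "returned -1, errno = 10054 , reading 5 bytes from <159.xxx.xxx.xxx:51513>.",
    "10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header",
    "10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936",
]

failures = find_read_failures(sample)
errno_counts = Counter(errno for _, _, errno in failures)
for timestamp, pid, errno in failures:
    print(timestamp, pid, errno)
```

To run it against a real log, pass an open file object instead of the sample list; the path to the schedd log depends on the local install and is not assumed here.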

> Error 10054 via MSDN:
> Connection reset by peer.
> An existing connection was forcibly closed by the remote host. This
> normally results if the peer application on the remote host is
> suddenly stopped, the host is rebooted, the host or remote network
> interface is disabled, or the remote host uses a hard close (see
> setsockopt for more information on the SO_LINGER option on the
> remote socket). This error may also result if a connection was
> broken due to keep-alive activity detecting a failure while one or
> more operations are in progress. Operations that were in progress
> fail with WSAENETRESET. Subsequent operations fail with
> WSAECONNRESET.

> Is anyone else seeing this, or does anyone have thoughts as to what the
> problem could be? I am not sure if this is the schedd or the negotiator
> on the central manager. I am not seeing any errors with our network or
> on the 2 VM submit machines that I can more easily monitor.

> thank you!
> mike

> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/