
Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2



Tim, I am still monitoring and trying to track the problem down. The schedd on my central manager wrote the following core.SCHEDD file, so maybe this will help.

Thanks for any assistance you can provide. I will continue monitoring, but I am not picking up many errors. Everything was working, and then sometime in the last 8 hours (9pm-5am) 4 of my submit machines dropped off the schedd list (condor_status -schedd). I can't find any errors in the log files during this time.

I did find this core dump (I have not seen this before, and it happened last night on the central manager).

//=====================================================
PID: 3144
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  76F613B0 01:000013B0 C:\Windows\syswow64\kernel32.dll

Registers:
EAX:01C1D8DC
EBX:00000000
ECX:01C1D8DC
EDX:00000000
ESI:00000000
EDI:01C1E370
CS:EIP:0023:76F613B0
SS:ESP:002B:04B0F6F0  EBP:04B0F850
DS:002B  ES:002B  FS:0053  GS:002B
Flags:00010202



I also noticed a couple of different errors in the schedd and master log files on my submit machines, but these errors do not seem to be thrown at the time the central manager loses track of the submit machines (this is what I have had a difficult time tracking; I have enabled all logging for now).


IP2 is a server within my organization that scans our machines. These scanning servers do not have access to the pool itself, but they can scan all hardware. I am working with my IT group to see if they can help on this front.

This stood out, but I could not find a reference in Condor and I do not have SOAP enabled:

10/09/12 06:22:36 Received HTTP GET connection from <IP2.12:36427> -- DENIED because ENABLE_WEB_SERVER=FALSE

CDRS2 MasterLog (I also see the same error in the SchedLog on the same machine):
10/09/12 06:18:21 Time stamp of running C:/Condor/bin/condor_master.exe: 1344461336
10/09/12 06:18:21 GetTimeStamp returned: 1344461336
10/09/12 06:18:21 Return from Timer handler 10 (Daemons::CheckForNewExecutable())
10/09/12 06:18:22 Calling Timer handler 6 (KillFamily::takesnapshot)
10/09/12 06:18:22 Return from Timer handler 6 (KillFamily::takesnapshot)
10/09/12 06:18:22 Calling Timer handler 7 (KillFamily::takesnapshot)
10/09/12 06:18:22 Return from Timer handler 7 (KillFamily::takesnapshot)
10/09/12 06:18:51 ACCEPT bound to <IP.52:62187> fd=540 peer=<IP2.12:33239>
10/09/12 06:18:51 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
10/09/12 06:18:51 condor_read(fd=540 <IP2.12:33239>,,size=4,timeout=1,flags=2)
10/09/12 06:18:51 condor_read(): fd=540
10/09/12 06:18:51 condor_read(): select returned 1
10/09/12 06:18:51 condor_read() failed: recv(fd=540) returned -1, errno = 10054 , reading 4 bytes from <IP2.12:33239>.
10/09/12 06:18:51 condor_read(fd=540 <IP2.12:33239>,,size=5,timeout=1,flags=0)
10/09/12 06:18:51 condor_read(): fd=540
10/09/12 06:18:51 condor_read(): select returned 1
10/09/12 06:18:51 condor_read() failed: recv(fd=540) returned -1, errno = 10054 , reading 5 bytes from <IP2.12:33239>.
10/09/12 06:18:51 IO: Failed to read packet header
10/09/12 06:18:51 Stream::get(int) failed to read padding
10/09/12 06:18:51 DaemonCore: Can't receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:18:51 CLOSE <IP.52:62187> fd=540
10/09/12 06:18:51 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0000s
10/09/12 06:19:09 Calling Timer handler 8 (Daemons::UpdateCollector())
10/09/12 06:19:09 enter Daemons::UpdateCollector
10/09/12 06:19:09 Trying to update collector <IP.145:9618>
10/09/12 06:19:09 Attempting to send update via UDP to collector CENTRALMANAGER.gs.doi.net <IP.145:9618>
10/09/12 06:19:09 SECMAN: command 2 UPDATE_MASTER_AD to collector CENTRALMANAGER.gs.doi.net from UDP port 54871 (non-blocking).
10/09/12 06:19:09 SECMAN: using session CentralManager:4444:1349777948:1721 for {<IP.145:9618>,<2>}.
10/09/12 06:19:09 SECMAN: found cached session id CentralManager:4444:1349777948:1721 for {<IP.145:9618>,<2>}.
Authentication = "YES"

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10/09/12 06:22:36 Received HTTP GET connection from <IP2.12:36427> -- DENIED because ENABLE_WEB_SERVER=FALSE
10/09/12 06:22:36 condor_read(fd=1160 <IP2.12:36427>,,size=5,timeout=1,flags=0)
10/09/12 06:22:36 condor_read(): fd=1160
10/09/12 06:22:36 condor_read(): select returned 1
10/09/12 06:22:36 IO: Incoming packet header unrecognized
10/09/12 06:22:36 Stream::get(int) failed to read padding
10/09/12 06:22:36 DaemonCore: Can't receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:22:36 CLOSE <IP.52:62187> fd=1160
10/09/12 06:22:36 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
10/09/12 06:22:36 condor_read(fd=312 <IP2.12:36425>,,size=4,timeout=1,flags=2)
10/09/12 06:22:36 condor_read(): fd=312
10/09/12 06:22:36 condor_read(): select returned 1
10/09/12 06:22:36 condor_read() failed: recv(fd=312) returned -1, errno = 10054 , reading 4 bytes from <IP2.12:36425>.
10/09/12 06:22:36 condor_read(fd=312 <IP2.12:36425>,,size=5,timeout=1,flags=0)
10/09/12 06:22:36 condor_read(): fd=312
10/09/12 06:22:36 condor_read(): select returned 1
10/09/12 06:22:36 condor_read() failed: recv(fd=312) returned -1, errno = 10054 , reading 5 bytes from <IP2.12:36425>.
10/09/12 06:22:36 IO: Failed to read packet header
10/09/12 06:22:36 Stream::get(int) failed to read padding
10/09/12 06:22:36 DaemonCore: Can't receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:22:36 CLOSE <IP.52:62187> fd=312
10/09/12 06:22:36 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0000s
10/09/12 06:23:09 ACCEPT bound to <IP.52:62187> fd=312 peer=<IP2.12:36764>
10/09/12 06:23:09 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
10/09/12 06:23:09 condor_read(fd=312 <IP2.12:36764>,,size=4,timeout=1,flags=2)
10/09/12 06:23:09 condor_read(): fd=312
10/09/12 06:23:09 condor_read(): select returned 1
10/09/12 06:23:09 condor_read() failed: recv(fd=312) returned -1, errno = 10054 , reading 4 bytes from <IP2.12:36764>.
10/09/12 06:23:09 condor_read(fd=312 <IP2.12:36764>,,size=5,timeout=1,flags=0)
10/09/12 06:23:09 condor_read(): fd=312
10/09/12 06:23:09 condor_read(): select returned 1
10/09/12 06:23:09 condor_read() failed: recv(fd=312) returned -1, errno = 10054 , reading 5 bytes from <IP2.12:36764>.
10/09/12 06:23:09 IO: Failed to read packet header
10/09/12 06:23:09 Stream::get(int) failed to read padding
10/09/12 06:23:09 DaemonCore: Can't receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:23:09 CLOSE <IP.52:62187> fd=312
10/09/12 06:23:09 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0000s
10/09/12 06:23:16 Calling Timer handler 7141 (dc_touch_log_file)
10/09/12 06:23:16 Return from Timer handler 7141 (dc_touch_log_file)
10/09/12 06:23:21 Calling Timer handler 2 (check_session_cache)
10/09/12 06:23:21 Return from Timer handler 2 (check_session_cache)
10/09/12 06:23:21 Calling Timer handler 10 (Daemons::CheckForNewExecutable())
10/09/12 06:23:21 enter Daemons::CheckForNewExecutable
10/09/12 06:23:21 Time stamp of running C:/Condor/bin/condor_master.exe: 1344461336
10/09/12 06:23:21 GetTimeStamp returned: 1344461336
10/09/12 06:23:21 Return from Timer handler 10 (Daemons::CheckForNewExecutable())
10/09/12 06:23:23 Calling Timer handler 6 (KillFamily::takesnapshot)
10/09/12 06:23:23 Return from Timer handler 6 (KillFamily::takesnapshot)
10/09/12 06:23:23 Calling Timer handler 7 (KillFamily::takesnapshot)
10/09/12 06:23:23 Return from Timer handler 7 (KillFamily::takesnapshot)
10/09/12 06:24:09 Calling Timer handler 8 (Daemons::UpdateCollector())
10/09/12 06:24:09 enter Daemons::UpdateCollector
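
To check whether the failed reads above all come from the scanning host, the log can be tallied by peer address and errno. This is only a rough sketch: the regular expression is written against the `condor_read() failed` lines as they appear in this excerpt, and the function name `tally_read_failures` is made up for illustration, not part of Condor.

```python
import re
from collections import Counter

# Matches lines such as:
#   condor_read() failed: recv(fd=540) returned -1, errno = 10054 ,
#   reading 4 bytes from <IP2.12:33239>.
# The pattern is an assumption based on the log excerpt in this message.
FAIL_RE = re.compile(
    r"condor_read\(\) failed: recv\(fd=\d+\) returned -1, "
    r"errno = (\d+)\s*, reading \d+ bytes from <([^:>]+):\d+>"
)


def tally_read_failures(lines):
    """Count failed condor_read() calls per (peer address, errno)."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            errno, peer = int(match.group(1)), match.group(2)
            counts[(peer, errno)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py C:/Condor/log/MasterLog  (path is an example)
    with open(sys.argv[1], errors="replace") as handle:
        for (peer, errno), n in tally_read_failures(handle).most_common():
            print(f"{peer}\terrno={errno}\t{n}")
```

If every failure maps to the scanner's address, these 10054 errors are probably noise from the scan rather than the cause of the schedds dropping out of the pool.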



From: Tim St Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 10/04/2012 08:02 AM
Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Sent by: condor-users-bounces@xxxxxxxxxxx





inline below




----- Original Message -----

> From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
> To: condor-users@xxxxxxxxxxx
> Sent: Thursday, October 4, 2012 7:43:37 AM
> Subject: [Condor-users] Schedd daemon problems on Windows, Condor
> 7.8.2

> I have a Windows pool via Condor 7.8.2 and I have about 6 submit
> machines. After updating to 7.8.2 I started noticing that the
> central manager was not tracking the schedd machines well.

So your entire pool is now 7.8.2 ?

> I am also
> noticing errors in the schedlog files. If I restart the services on
> these machines everything works fine for hours but then at some
> point they drop out of the pool. The schedd daemon never crashes but
> the central manager cannot identify them. If I reboot, they show
> back up but within 12 hours the central manager drops them again.

> Here is an error message I found in the schedlog for one of the
> submit machines.
> 10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
> 10/03/12 08:14:56 (pid:7848) Maximum history file size is: 4000000
> bytes
> 10/03/12 08:14:56 (pid:7848) Number of rotated history files is: 5
> 10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
> 10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
> 10/03/12 08:14:56 (pid:7848) Failed to execute
> C:/Condor/bin/condor_shadow.std.exe, ignoring

^ not an issue imho as std universe is unsupported on windows iirc.

> Another error I have seen:
> 10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936)
> returned -1, errno = 10054 , reading 5 bytes from
> <159.xxx.xxx.xxx:51513>.
> 10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
> 10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
> 10/03/12 07:14:19 (pid:3788) Socket activated, but could not read
> command
> 10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached
> socket)
> 10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936


^^ This seems to point towards the issue you are seeing. Could you capture a fuller log snapshot with D_FULLDEBUG?

> Error 10054 via MSDN:
> Connection reset by peer.
> An existing connection was forcibly closed by the remote host. This
> normally results if the peer application on the remote host is
> suddenly stopped, the host is rebooted, the host or remote network
> interface is disabled, or the remote host uses a hard close (see
> setsockopt for more information on the SO_LINGER option on the
> remote socket). This error may also result if a connection was
> broken due to keep-alive activity detecting a failure while one or
> more operations are in progress. Operations that were in progress
> fail with WSAENETRESET. Subsequent operations fail with
> WSAECONNRESET.
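
The "hard close" case in the MSDN text can be reproduced on a loopback connection. The following is a minimal sketch, not anything taken from Condor: the names `hard_close_server` and `demo_reset` are hypothetical, and the packing of the linger structure is an assumption about the platform (two unsigned shorts on Windows, two ints on most Unix systems).

```python
import socket
import struct
import sys
import threading


def hard_close_server(listener):
    """Accept one connection and close it with SO_LINGER on, timeout 0."""
    conn, _ = listener.accept()
    # Assumed struct linger layout: u_short pair on Windows, int pair elsewhere.
    fmt = "HH" if sys.platform == "win32" else "ii"
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack(fmt, 1, 0))
    conn.close()  # with linger on and timeout 0, close() resets the connection


def demo_reset():
    """Return 'reset' if the client sees a connection reset, else what it saw."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=hard_close_server, args=(listener,))
    server.start()
    client = socket.create_connection(listener.getsockname(), timeout=5)
    server.join()  # the server side has already closed by the time we read
    try:
        data = client.recv(4)
        return "clean close" if data == b"" else "data"
    except ConnectionResetError:
        # Reported as errno 10054 (WSAECONNRESET) on Windows.
        return "reset"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(demo_reset())
```

A port scanner that connects and then drops the socket abruptly would look the same from the daemon's side: the read fails with a reset before any command bytes arrive.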

> Is anyone else seeing this or have thoughts as to what the problem
> could be. I am not sure if this is the schedd or negotiator on the
> central manager. I am not seeing an errors with our network or for
> the 2 VM submit machines that can more easily monitor.

> thank you!
> mike

> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
>
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

> The archives can be found at:
>
https://lists.cs.wisc.edu/archive/condor-users/