
[Condor-users] Schedd and startd die immediately on Windows XP



Hi,

I am a new user and am currently testing Condor 7.0.0 on some Windows XP SP2 machines.

One machine is a stand-alone pool, and works fine.
Another is Central Manager (+ Submit + Execute machine) for a second pool and works fine too (at least all five daemons are running, tests didn't go further on that pool).

On a third machine, however, which is configured as a Submit + Execute machine in the second pool, the schedd and startd daemons die just after the Condor service is started, and the master daemon consumes a lot of CPU (up to 99%). The logs (below) show a select error (WSAENOTSOCK), indicating a non-socket object in one of the descriptor sets passed to select()...
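
For reference, the number in the log message (10038) is the Winsock code for WSAENOTSOCK. A minimal sketch of decoding such a line is given here; it is not part of Condor, the helper name `decode_select_error` is hypothetical, and the table covers only a handful of Winsock codes.

```python
import re

# Small subset of Winsock error codes; not a complete table.
WINSOCK_ERRORS = {
    10004: "WSAEINTR",
    10009: "WSAEBADF",
    10022: "WSAEINVAL",
    10038: "WSAENOTSOCK",
    10093: "WSANOTINITIALISED",
}

# Matches the DaemonCore message format seen in the logs in this post.
SELECT_ERROR = re.compile(r'ERROR "select, error # = (\d+)"')


def decode_select_error(log_line):
    """Return (code, name) for a select error line, or None if no match."""
    match = SELECT_ERROR.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, WINSOCK_ERRORS.get(code, "UNKNOWN")


line = (r'2/28 10:24:03 ERROR "select, error # = 10038" at line 2624 '
        r'in file ..\src\condor_daemon_core.V6\daemon_core.C')
print(decode_select_error(line))
```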

I found a thread in the ML archives with a similar error, from May 2007:

https://lists.cs.wisc.edu/archive/condor-users/2007-May/msg00110.shtml

but I didn't find the outcome of the issue...

I checked the four items suggested there:

a) what is NETWORK_INTERFACE defined to be in condor config? (if anything). Is it for an interface that does not exist?

NETWORK_INTERFACE = 192.168.0.28
(this is the IP address of my real interface, through which the other machines in the pool are reached)

b) is there some specific winsock library installed other than what came from Microsoft? perhaps some third party security or firewall vendor replaced your winsock library? Condor relies on the ability to pass an open network socket from a parent process to a child process. From the logs below it looks like this parent-child socket inheritance failed. Some winsock replacements have been known to break this. Compare your winsock DLLs to a fresh install of Windows. My guess is they are different. If so, try replacing the winsock DLLs with the original ones from Microsoft - this will likely fix Condor, but may break whatever security software felt the need to replace these system DLLs (with non-conforming buggy ones).

I have C:\WINDOWS\system32\winsock.dll,
advertised as Windows Socket 16 Bit (sic!) DLL, version 3.10.0.103, 2004-08-05, Copyright Microsoft Corp...
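
One way to do the comparison with a fresh install suggested in (b) would be to hash the DLLs on both machines. This is only a rough sketch: the list of file names is an assumption (ws2_32.dll, wsock32.dll and mswsock.dll are the usual 32-bit Winsock libraries, alongside the 16-bit winsock.dll mentioned above), and the helper names `hash_files` and `differing` are hypothetical.

```python
import hashlib
import os

# Assumed list of Winsock-related DLLs; adjust for the system at hand.
WINSOCK_DLLS = ["ws2_32.dll", "wsock32.dll", "mswsock.dll", "winsock.dll"]


def hash_files(directory, names=WINSOCK_DLLS):
    """Return {name: SHA-256 hex digest}, with None for missing files."""
    result = {}
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            result[name] = None
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in chunks so large files are not loaded at once.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        result[name] = digest.hexdigest()
    return result


def differing(suspect, reference):
    """Return the sorted names whose digests differ between two machines."""
    return sorted(n for n in suspect if suspect[n] != reference.get(n))


# Example use (run once on each machine, then compare the two results):
#   suspect = hash_files(r"C:\WINDOWS\system32")
```

A differing digest does not by itself prove that a third party replaced the file, since two machines at different patch levels would also differ; it only narrows down which files to look at.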

c) did you install Condor with the included installer? were you logged in as Administrator when you installed it? is the Condor service configured to run as LocalSystem?

Yes to all three questions.

d) what, if any, firewall rules do you have on your machine?

I have Windows Firewall activated with exceptions for condor_dagman.exe, condor_master.exe, condor_schedd.exe and condor_startd.exe (set by the installer). I also tried with the firewall deactivated, with the same results.



Thanks for helping,

Dominique Leducq.


from MasterLog:


2/28 10:23:48 ******************************************************
2/28 10:23:48 ** Condor (CONDOR_MASTER) STARTING UP
2/28 10:23:48 ** D:\condor\bin\condor_master.exe
2/28 10:23:48 ** $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
2/28 10:23:48 ** $CondorPlatform: INTEL-WINNT50 $
2/28 10:23:48 ** PID = 1244
2/28 10:23:48 ** Log last touched 2/28 09:50:37
2/28 10:23:48 ******************************************************
2/28 10:23:48 Using config source: D:\condor\condor_config
2/28 10:23:48 Using local config sources:
2/28 10:23:48    D:\condor/condor_config.local
2/28 10:23:49 DaemonCore: Command Socket at <192.168.0.28:3866>
2/28 10:23:49 Authorized application D:\condor/bin/condor_schedd.exe is now enabled in the firewall.
2/28 10:23:49 Authorized application D:\condor/bin/condor_startd.exe is now enabled in the firewall.
2/28 10:23:49 Authorized application D:\condor/bin\condor_dagman.exe is now enabled in the firewall.
2/28 10:23:49 Started DaemonCore process "D:\condor/bin/condor_schedd.exe", pid and pgroup = 5692
2/28 10:23:49 Started DaemonCore process "D:\condor/bin/condor_startd.exe", pid and pgroup = 3876
2/28 10:23:51 The SCHEDD (pid 5692) exited with status 4
2/28 10:23:51 Sending obituary for "D:\condor/bin/condor_schedd.exe"
2/28 10:23:52 restarting D:\condor/bin/condor_schedd.exe in 10 seconds
2/28 10:24:02 Started DaemonCore process "D:\condor/bin/condor_schedd.exe", pid and pgroup = 4352
2/28 10:24:04 The STARTD (pid 3876) exited with status 0
2/28 10:24:04 restarting D:\condor/bin/condor_startd.exe in 10 seconds
2/28 10:24:05 The SCHEDD (pid 4352) exited with status 4
2/28 10:24:05 Sending obituary for "D:\condor/bin/condor_schedd.exe"
2/28 10:24:06 restarting D:\condor/bin/condor_schedd.exe in 11 seconds
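
The restart loop in the MasterLog above can be summarized per daemon and exit status. The following is a minimal sketch that assumes the "exited with status" line format shown in this excerpt; the helper name `summarize_exits` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as: "The SCHEDD (pid 5692) exited with status 4"
EXIT_LINE = re.compile(r"The (\w+) \(pid (\d+)\) exited with status (\d+)")


def summarize_exits(log_text):
    """Count daemon exits in a MasterLog, keyed by (daemon, status)."""
    counts = Counter()
    for match in EXIT_LINE.finditer(log_text):
        daemon, _pid, status = match.groups()
        counts[(daemon, int(status))] += 1
    return counts


sample = """\
2/28 10:23:51 The SCHEDD (pid 5692) exited with status 4
2/28 10:24:04 The STARTD (pid 3876) exited with status 0
2/28 10:24:05 The SCHEDD (pid 4352) exited with status 4
"""
for (daemon, status), count in sorted(summarize_exits(sample).items()):
    print(daemon, "status", status, "x", count)
```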

from SchedLog:


2/28 10:23:50 (pid:5692) ******************************************************
2/28 10:23:50 (pid:5692) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
2/28 10:23:50 (pid:5692) ** D:\condor\bin\condor_schedd.exe
2/28 10:23:50 (pid:5692) ** $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
2/28 10:23:50 (pid:5692) ** $CondorPlatform: INTEL-WINNT50 $
2/28 10:23:50 (pid:5692) ** PID = 5692
2/28 10:23:50 (pid:5692) ** Log last touched 2/28 09:50:35
2/28 10:23:50 (pid:5692) ******************************************************
2/28 10:23:50 (pid:5692) Using config source: D:\condor\condor_config
2/28 10:23:50 (pid:5692) Using local config sources:
2/28 10:23:50 (pid:5692)    D:\condor/condor_config.local
2/28 10:23:50 (pid:5692) DaemonCore: Command Socket at
2/28 10:23:50 (pid:5692) History file rotation is enabled.
2/28 10:23:50 (pid:5692)   Maximum history file size is: 20971520 bytes
2/28 10:23:50 (pid:5692)   Number of rotated history files is: 2
2/28 10:23:51 (pid:5692) my_popen: CreateProcess failed
2/28 10:23:51 (pid:5692) Failed to execute D:\condor/bin/condor_shadow.std.exe, ignoring
2/28 10:23:51 (pid:5692) ERROR "select, error # = 10038" at line 2624 in file ..\src\condor_daemon_core.V6\daemon_core.C
2/28 10:24:03 (pid:4352) ******************************************************
2/28 10:24:03 (pid:4352) ** condor_schedd.exe (CONDOR_SCHEDD) STARTING UP
2/28 10:24:03 (pid:4352) ** D:\condor\bin\condor_schedd.exe
2/28 10:24:03 (pid:4352) ** $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
2/28 10:24:03 (pid:4352) ** $CondorPlatform: INTEL-WINNT50 $
2/28 10:24:03 (pid:4352) ** PID = 4352
2/28 10:24:03 (pid:4352) ** Log last touched 2/28 10:23:51
2/28 10:24:03 (pid:4352) ******************************************************
2/28 10:24:03 (pid:4352) Using config source: D:\condor\condor_config
2/28 10:24:03 (pid:4352) Using local config sources:
2/28 10:24:03 (pid:4352)    D:\condor/condor_config.local
2/28 10:24:04 (pid:4352) DaemonCore: Command Socket at
2/28 10:24:04 (pid:4352) History file rotation is enabled.
2/28 10:24:04 (pid:4352)   Maximum history file size is: 20971520 bytes
2/28 10:24:04 (pid:4352)   Number of rotated history files is: 2
2/28 10:24:05 (pid:4352) my_popen: CreateProcess failed
2/28 10:24:05 (pid:4352) Failed to execute D:\condor/bin/condor_shadow.std.exe, ignoring
2/28 10:24:05 (pid:4352) ERROR "select, error # = 10038" at line 2624 in file ..\src\condor_daemon_core.V6\daemon_core.C

from StartLog:


2/28 10:23:50 ******************************************************
2/28 10:23:50 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
2/28 10:23:50 ** D:\condor\bin\condor_startd.exe
2/28 10:23:50 ** $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
2/28 10:23:50 ** $CondorPlatform: INTEL-WINNT50 $
2/28 10:23:50 ** PID = 3876
2/28 10:23:50 ** Log last touched 2/28 09:50:34
2/28 10:23:50 ******************************************************
2/28 10:23:50 Using config source: D:\condor\condor_config
2/28 10:23:50 Using local config sources:
2/28 10:23:50    D:\condor/condor_config.local
2/28 10:23:50 DaemonCore: Command Socket at
2/28 10:23:51 MachAttributes::publish: failed to get Windows version information
2/28 10:23:51 my_popen: CreateProcess failed
2/28 10:23:51 Failed to execute D:\condor/bin/condor_starter.std.exe, ignoring
2/28 10:23:51 New machine resource allocated
2/28 10:23:56 no loadavg samples this minute, maybe thread died???
2/28 10:23:56 About to run initial benchmarks.
2/28 10:24:03 Completed initial benchmarks.
2/28 10:24:03 ERROR "select, error # = 10038" at line 2624 in file ..\src\condor_daemon_core.V6\daemon_core.C
2/28 10:24:03 Deleting Cronmgr
2/28 10:24:03 All resources are free, exiting.
2/28 10:24:03 **** condor_startd.exe (condor_STARTD) EXITING WITH STATUS 0