
Re: [Condor-users] jobs submitted from windows don't execute



The problems below (the master cannot start the schedd, and the shadow exits) are related.  Both are caused by select() error 10038 (WSAENOTSOCK), which means select() was passed something that is not a valid network socket.  I am not certain how this could happen; some guesses are below.  Thinking about it, I'd guess item (b) is the culprit.
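
For illustration only, here is a minimal Python sketch (not Condor code) of what the failure looks like at the socket API level. It calls select() once on a live socket and once on the same descriptor number after the socket has been closed. The helper name probe_select is hypothetical. On Windows the second call is expected to fail with Winsock error 10038; on Unix-like systems the analogous failure is EBADF.

```python
import errno
import select
import socket

WSAENOTSOCK = 10038  # Winsock: operation attempted on something that is not a socket


def probe_select(fd):
    """Return None if select() accepts fd, otherwise the OS error code raised.

    Hypothetical helper for illustration; it is not part of Condor.
    """
    try:
        select.select([fd], [], [], 0)  # zero timeout: poll and return at once
    except OSError as exc:
        # Windows reports the Winsock code in exc.winerror; elsewhere use errno.
        return getattr(exc, "winerror", None) or exc.errno
    return None


live = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
live_fd = live.fileno()

ok_result = probe_select(live_fd)     # a real socket: select() accepts it
live.close()
stale_result = probe_select(live_fd)  # same number, but no longer a socket

print("live socket :", ok_result)
print("stale handle:", stale_result)
```

A child process that is handed a handle number which its own Winsock layer does not recognize as a socket would see the same kind of error from select().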

a) What is NETWORK_INTERFACE defined to be in the Condor config (if anything)?  Is it set to an interface that does not exist?

b) Is there a winsock library installed other than the one that came from Microsoft?  Perhaps a third-party security or firewall vendor replaced your winsock library?  Condor relies on the ability to pass an open network socket from a parent process to a child process.  From the logs below, it looks like this parent-child socket inheritance failed.  Some winsock replacements have been known to break this.  Compare your winsock DLLs to those from a fresh install of Windows.  My guess is they are different.  If so, try replacing the winsock DLLs with the original ones from Microsoft.  This will likely fix Condor, but may break whatever security software felt the need to replace these system DLLs (with non-conforming, buggy ones).

c) Did you install Condor with the included installer?  Were you logged in as Administrator when you installed it?  Is the Condor service configured to run as LocalSystem?

d) What firewall rules, if any, do you have on your machine?
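
As a quick check for item (a), the sketch below tests whether an IP address is actually assigned to the local machine by trying to bind a TCP socket to it on an ephemeral port. This is a generic Python test, not a Condor tool; the function name can_bind is hypothetical, and the address to test would be whatever NETWORK_INTERFACE is set to in your config.

```python
import socket


def can_bind(ip):
    """Return True if a TCP socket can be bound to ip on this machine.

    Hypothetical helper for illustration; port 0 asks the OS for any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, 0))
        return True
    except OSError:
        # Typically "address not available" when no local interface has this IP.
        return False
    finally:
        sock.close()


# Loopback should always be bindable; replace it with the NETWORK_INTERFACE value.
print(can_bind("127.0.0.1"))
```

If this prints False for the configured address, the setting points at an interface the machine does not have.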

---
Todd Tannenbaum
University of Wisconsin-Madison
<-- Sent from a Palm Treo 680 phone -->

-----Original Message-----

From:  "mohammed shambakey" <shambakey1@xxxxxxxxx>
Subj:  Re: [Condor-users] jobs submitted from windows don't execute
Date:  Mon May 14, 2007 6:48 am
Size:  2K
To:  "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>

Hi,
The "schedd" daemon is already included in the daemon_list beside master,
but it still doesn't start automatically.
The schedd log shows the following error:

5/14 11:57:49 (pid:2384) Using config source: C:\condor\condor_config
5/14 11:57:49 (pid:2384) Using local config sources:
5/14 11:57:49 (pid:2384)    C:\condor/condor_config.local
5/14 11:57:49 (pid:2384) DaemonCore: Command Socket at
5/14 11:57:49 (pid:2384) History file rotation is enabled.
5/14 11:57:49 (pid:2384)   Maximum history file size is: 20971520 bytes
5/14 11:57:49 (pid:2384)   Number of rotated history files is: 2
5/14 11:57:49 (pid:2384) my_popen: CreateProcess failed
5/14 11:57:49 (pid:2384) Failed to execute
C:\condor/bin/condor_shadow.pvm.exe, ignoring
5/14 11:57:49 (pid:2384) my_popen: CreateProcess failed
5/14 11:57:49 (pid:2384) Failed to execute
C:\condor/bin/condor_shadow.std.exe, ignoring
5/14 11:57:49 (pid:2384) ERROR "select, error # = 10038" at line 2417 in
file ..\src\condor_daemon_core.V6\daemon_core.C

And the master log:

5/14 11:57:48 Using config source: C:\condor\condor_config
5/14 11:57:48 Using local config sources:
5/14 11:57:48    C:\condor/condor_config.local
5/14 11:57:48 DaemonCore: Command Socket at <192.168.100.53:1541>
5/14 11:57:48 Started DaemonCore process "C:\condor/bin/condor_schedd.exe",
pid and pgroup = 2384
5/14 11:57:49 DaemonCore: Command received via UDP from host <
192.168.100.53:1546>
5/14 11:57:49 DaemonCore: received command 60011 (DC_NOP), calling handler
(handle_nop())
5/14 11:57:49 The SCHEDD (pid 2384) exited with status 4
5/14 11:57:49 Sending obituary for "C:\condor/bin/condor_schedd.exe"
5/14 11:57:52 restarting C:\condor/bin/condor_schedd.exe in 10 seconds

The condor_schedd didn't restart, so I had to start it manually again. Now,
every time I send a job to the pool and specify in the requirements that it
be executed on a Linux machine, the job stays idle and nothing else happens.

I have tried to configure the PED by modifying boot.ini, as mentioned in
one of the messages, but it did not help.

Please help.




On 5/13/07, Matt Hope <matthew.hope@xxxxxxxxx> wrote:
>
> On 5/13/07, mohammed shambakey <shambakey1@xxxxxxxxx> wrote:
> > Hi
> > I checked the shadow log and it has the following error:
> >
> > 5/10 10:00:37 Using config source: C:\condor\condor_config
> > 5/10 10:00:37 Using local config sources:
> > 5/10 10:00:37    C:\condor/condor_config.local
> > 5/10 10:00:37 DaemonCore: Command Socket at
> > 5/10 10:00:38 Initializing a JAVA shadow for job 7.0
> > 5/10 10:00:38 (7.0) (3316): Request to run on <192.168.100.120:47348>
> was
> > ACCEPTED
> > 5/10 10:00:38 (7.0) (3316): ERROR "select, error # = 10038" at line 2417
> in
> > file ..\src\condor_daemon_core.V6\da
--- message truncated ---