
[Condor-users] Problem with Windows XP workers



Hi all,

I have a Condor master running on Mandrake Linux 10 and a number of
worker machines running Windows XP. All machines (master and workers)
run Condor 6.7.2 with the Java universe.

The workers are behind a firewall, but I have opened all the required
ports (the standard ports plus the ephemeral range 29000-40000, set as
LOWPORT and HIGHPORT respectively) and the machines can communicate. I
can't see any problems with the firewall.

When I submit a Java job from the Condor master targeted at any of these
XP workers, the job is assigned to the worker and condor_q shows the
job as running. condor_status shows the worker as Claimed and Busy.
However, the job never finishes, and after a while I get the
following:


On the submitter side the log says:

007 (090.000.000) 12/18 14:05:43 Shadow exception!

Can no longer talk to condor_starter <194.42.54.104:1034>

0 - Run Bytes Sent By Job

0 - Run Bytes Received By Job


This is repeated many times.


On the worker side the Starter log says:


12/18 14:09:16 ******************************************************

12/18 14:09:16 ** condor_starter (CONDOR_STARTER) STARTING UP

12/18 14:09:16 ** C:\Condor\bin\condor_starter.exe

12/18 14:09:16 ** $CondorVersion: 6.7.2 Oct 5 2004 $

12/18 14:09:16 ** $CondorPlatform: INTEL-WINNT40 $

12/18 14:09:16 ** PID = 3504

12/18 14:09:16 ******************************************************

12/18 14:09:16 Using config file: C:\Condor\condor_config

12/18 14:09:16 Using local config files: C:\Condor/condor_config.local

12/18 14:09:16 DaemonCore: Command Socket at <194.42.54.104:3127>

12/18 14:09:16 Setting resource limits not implemented!

12/18 14:09:16 Communicating with shadow <195.251.124.82:38235>

12/18 14:09:16 Submitting machine is "gkakaron.teilar.gr"

12/18 14:09:16 Initialized IO Proxy.

12/18 14:09:19 getpeername failed so connect must have failed

12/18 14:09:48 Connect failed for 30 seconds; returning FALSE

12/18 14:09:48 FileTransfer: Unable to connect to server <195.251.124.82:38235>

12/18 14:09:48 ERROR "Could not initiate file transfer" at line 1404 in file ..\src\condor_starter.V6.1\jic_shadow.C

12/18 14:09:48 ShutdownFast all jobs.
-------------------------------------



The submitting machine hostname is gkakaron.teilar.gr with IP address
195.251.124.82.
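
The failing step in the Starter log is a plain TCP connect from the
worker back to the shadow, so it could be tested outside Condor with
something like the sketch below. This is only a sketch under
assumptions: can_connect is a hypothetical helper name and not part of
Condor, and the shadow's port (38235 in the log above) appears to be
chosen per job, so a real check would need the port from the current
Starter log while the shadow is still running.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not part of Condor.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call, to be run on the XP worker (address taken from the log above):
# print(can_connect("195.251.124.82", 38235))
```

A False result here would point at the network path between the worker
and the submitting machine rather than at Condor itself; a True result
would only show that the port was reachable at that moment.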


Can anyone see what's wrong here?

There is one more thing, and I don't know whether it's important or not:
when I set up Condor on the Windows XP machines I didn't use the
domain administrator account but the local administrator
account.

Thanks in advance
George