
[Condor-users] Windows XP Service Pack 2



This question was asked back in November.

I am just now attempting to set up a pool of Windows XP Service Pack 2
machines and am having the same problems listed here with version
6.6.9.

I was wondering if anyone has made any headway towards solving these problems?
The system worked flawlessly until the machines were rebooted and now
nothing seems to work.

Thanks
JJ

Old Post:

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Geraint.Lloyd@xxxxxxxxxxxx
Sent: 02 November 2004 15:32
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Windows XP firewall problems with 6.6.7


We have an all-Windows Condor pool (a mixture of Win2K and XP), with
most of the nodes acting as execute-only machines, along with one
central pool manager / submitter. We have recently updated to Condor
6.6.7 following installation of Windows XP SP2 on some of the execute
nodes. We are now having problems getting jobs to run on all nodes. I
have traced this to a combination of two problems:

1) On some of the machines with XP SP2 installed, the firewall is still
blocking some connections. This happens when the machine is initially
booted and Condor starts automatically. The Condor master log on these
nodes displays lines similar to the following:

11/2 14:19:42 ******************************************************
11/2 14:19:42 ** Condor (CONDOR_MASTER) STARTING UP
11/2 14:19:42 ** C:\Condor\bin\condor_master.exe
11/2 14:19:42 ** $CondorVersion: 6.6.7 Oct 14 2004 $
11/2 14:19:42 ** $CondorPlatform: INTEL-WINNT40 $
11/2 14:19:42 ** PID = 432
11/2 14:19:42 ******************************************************
11/2 14:19:42 Using config file: C:\Condor\condor_config
11/2 14:19:42 Using local config files: C:\Condor/condor_config.local
11/2 14:19:42 DaemonCore: Command Socket at <10.1.16.136:1043>
11/2 14:19:42 WinFirewall: get_CurrentProfile failed: 0x800706d9
11/2 14:19:42 Started DaemonCore process "C:\Condor/bin/condor_startd.exe", pid and pgroup = 496

The node still appears in the pool but won't run any jobs, and the
negotiator log on the central pool manager displays errors connecting
to this machine whenever jobs are submitted.
If I stop and restart the Condor service manually at a later stage all
works fine - the master log on the node now displays

11/2 14:21:17 Authorized application C:\Condor/bin/condor_startd.exe is 
now enabled in the firewall.

-  and does not give the WinFirewall error. Jobs now run on the node 
without problems - no firewall blocking.

All the firewall settings are correct - exceptions allowed, etc. I've
tried various changes, including making the Condor service dependent on
the firewall service to ensure that it starts after this, but it hasn't
fixed the problem. Any ideas?
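For reference, the workaround I tried was along these lines, from an
administrator cmd prompt on an execute node. The service name "Condor"
and the install paths are from my setup and may differ on yours;
SharedAccess is the XP SP2 Windows Firewall / ICS service:

```shell
:: Sketch of the attempted workaround - names/paths are assumptions
:: from my installation, not a verified fix.

:: Make the Condor service start only after the Windows Firewall
:: (SharedAccess) service is up.
sc config Condor depend= SharedAccess

:: Add the Condor daemons to the firewall exception list by hand, so
:: they are allowed even if Condor's own WinFirewall call fails.
netsh firewall add allowedprogram program=C:\Condor\bin\condor_master.exe name=condor_master mode=ENABLE
netsh firewall add allowedprogram program=C:\Condor\bin\condor_startd.exe name=condor_startd mode=ENABLE
```

As noted above, the service dependency alone did not fix the boot-time
get_CurrentProfile failure for me.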

2) Running jobs on all the nodes is made far worse by a second problem.
If the negotiator fails to talk correctly to one of the nodes (e.g.
because of the firewall problem) then it gives up on that negotiation
cycle. The negotiator log displays lines such as:

11/2 14:12:12     Request 00347.00008:
11/2 14:12:12       Matched 347.8 persephone@xxxxxxxxxxxxxx <10.1.16.132:4990> preempting none <10.1.16.77:1039>
11/2 14:12:12       Successfully matched with pergola.tessella.co.uk
11/2 14:12:12     Request 00347.00009:
11/2 14:12:33 Can't connect to <10.1.16.136:1044>:0, errno = 10060
11/2 14:12:33 Will keep trying for 10 seconds...
11/2 14:12:34 Connect failed for 10 seconds; returning FALSE
11/2 14:12:34 ERROR: SECMAN:2003:TCP connection to <10.1.16.136:1044> failed

11/2 14:12:34 condor_write(): Socket closed when trying to write buffer
11/2 14:12:34 Buf::write(): condor_write() failed
11/2 14:12:34       Could not send PERMISSION
11/2 14:12:34   Error: Ignoring schedd for this cycle
11/2 14:12:34 ---------- Finished Negotiation Cycle ----------

and the scheduler log displays something like:

11/2 14:12:11 Negotiating for owner: persephone@xxxxxxxxxxxxxx
11/2 14:12:11 Checking consistency running and runnable jobs
11/2 14:12:11 Tables are consistent
11/2 14:12:32 condor_read(): timeout reading buffer.
11/2 14:12:32 Can't receive request from manager
11/2 14:12:32 DaemonCore: Command received via UDP from host <10.1.16.102:1655>
11/2 14:12:32 DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
11/2 14:12:32 condor_read(): recv() returned -1, errno = 10054, assuming failure.
11/2 14:12:32 Response problem from startd.
11/2 14:12:32 Sent RELEASE_CLAIM to startd on <10.1.16.102:1040>
11/2 14:12:32 Match record (<10.1.16.102:1040>, 347, 3) deleted

This means that all the other nodes in the pool (mostly without the
Windows firewall) that come after this error in the negotiation cycle
are ignored and don't run any jobs.

Is there any way of getting the scheduler / negotiator to ignore a
machine which it can't connect to and carry on assigning jobs to the
rest of the pool? I've tried setting NEGOTIATE_ALL_JOBS_IN_CLUSTER to
True but this doesn't help. I noticed another posting to the users list
mentioning this problem, but there were no responses. That pool was
also using a Windows central manager, so has anyone seen this outside
of Windows?
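One partial mitigation I am considering (an assumption on my part, not
a verified fix) is shortening the time the negotiator is willing to
wait on an unreachable daemon, so a dead node wastes less of each
cycle. NEGOTIATOR_TIMEOUT is the macro documented in the 6.6 manual
for the negotiator's network connection timeout; 30 seconds is the
default. A sketch of the central manager's local config:

```
# Hypothetical mitigation, not a confirmed fix: shorten the timeout
# the negotiator uses on connections to schedds/startds so a failed
# connect costs less of the negotiation cycle (default is 30).
NEGOTIATOR_TIMEOUT = 10
```

This would not stop the cycle from being abandoned, only make the
failure cheaper, so I would still like a real answer to the question
above.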

Any suggestions would be appreciated,

Thanks

Geraint Lloyd