
RE: [Condor-users] Windows XP firewall problems with 6.6.7



Hi,
  We have given up on our pool of XP Service Pack 2 machines for the time
being. The central manager is on Linux, but this has no bearing: jobs can
be channelled to any Linux machine and run fine.
  I reinstalled on a few test machines, using the 6.6.7 version, and in
the first instance, I was able to test a connection from one XP machine
to another, hard coding the machine name for this purpose.
  The following day, after the machines had restarted, no XP connections
would work at all, even between two individual machines. In earlier
testing on 6.6.1, with firewall exceptions opened, I was able to submit a
job to the XP SP2 pool, but it failed at scheduling, i.e. when connecting
to any other machine in the pool, and thus failed entirely.

  I think you give better detail of the problem than I can supply, but
the problems sound very similar. I would be interested to hear whether
anyone has actually got this working.

  The main reason I ask is the severe port and programming restrictions
being imposed at our University for virus avoidance; I need to be sure
that none of these is affecting job submission or scheduling.

Best
Kevan

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Geraint.Lloyd@xxxxxxxxxxxx
Sent: 02 November 2004 15:32
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Windows XP firewall problems with 6.6.7


We have an all-Windows (mixture of Win2K and XP) Condor pool with most of
the nodes acting as execute-only machines, along with one central pool
manager / submitter. We have recently updated to Condor 6.6.7 following
installation of Windows XP SP2 on some of the execute nodes. We are now
having problems getting jobs to run on all nodes. I have traced this to a
combination of 2 problems:

1) On some of the machines with XP SP2 installed, the firewall is still
blocking some connections. This happens when the machine is initially
booted and Condor starts automatically. The Condor master log on these
nodes displays lines similar to the following:

11/2 14:19:42 ******************************************************
11/2 14:19:42 ** Condor (CONDOR_MASTER) STARTING UP
11/2 14:19:42 ** C:\Condor\bin\condor_master.exe
11/2 14:19:42 ** $CondorVersion: 6.6.7 Oct 14 2004 $
11/2 14:19:42 ** $CondorPlatform: INTEL-WINNT40 $
11/2 14:19:42 ** PID = 432
11/2 14:19:42 ******************************************************
11/2 14:19:42 Using config file: C:\Condor\condor_config
11/2 14:19:42 Using local config files: C:\Condor/condor_config.local
11/2 14:19:42 DaemonCore: Command Socket at <10.1.16.136:1043>
11/2 14:19:42 WinFirewall: get_CurrentProfile failed: 0x800706d9
11/2 14:19:42 Started DaemonCore process
"C:\Condor/bin/condor_startd.exe", pid and pgroup = 496

The node still appears in the pool but won't run any jobs, and the
negotiator log on the central pool manager displays errors connecting to
this machine whenever jobs are submitted.
If I stop and restart the Condor service manually at a later stage all 
works fine - the master log on the node now displays 

11/2 14:21:17 Authorized application C:\Condor/bin/condor_startd.exe is 
now enabled in the firewall.

-  and does not give the WinFirewall error. Jobs now run on the node 
without problems - no firewall blocking.

All the firewall settings are correct - exceptions allowed etc. I've
tried various changes, including making the Condor service dependent on
the firewall service to ensure that it starts after the firewall, but
this hasn't fixed the problem. Any ideas?
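
To spot affected nodes without reading each master log by hand, the two log messages quoted above can be matched mechanically. The following is a minimal sketch, not part of Condor: the function name `firewall_status` is hypothetical, and the patterns are taken only from the excerpts in this message, so other Condor versions may word them differently. As far as I can tell, 0x800706d9 corresponds to the Win32 error EPT_S_NOT_REGISTERED (1753), which would be consistent with the firewall service not yet being ready when Condor starts at boot, but that reading is an assumption.

```python
import re

# Patterns copied from the master log excerpts quoted in this message.
# Assumption: other Condor versions may phrase these lines differently.
FAILED = re.compile(r"WinFirewall: get_CurrentProfile failed: (0x[0-9a-fA-F]+)")
ENABLED = re.compile(r"Authorized application (\S+) is\s+now enabled in the firewall")


def firewall_status(log_text):
    """Return (status, detail) for the most recent firewall event in log_text.

    status is 'failed', 'enabled', or 'unknown'; detail is the error code
    or the application path, or None when no event was found.
    """
    last = ("unknown", None)
    last_pos = -1
    for status, pattern in (("failed", FAILED), ("enabled", ENABLED)):
        for match in pattern.finditer(log_text):
            # The event that appears latest in the log wins.
            if match.start() > last_pos:
                last_pos = match.start()
                last = (status, match.group(1))
    return last
```

A node whose latest event is `failed` would be a candidate for the manual service restart described above.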

2) Running jobs on all the nodes is made far worse by a second problem.
If the negotiator fails to talk correctly to one of the nodes (i.e.
because of the firewall problem), it gives up on that negotiation cycle.
The negotiator log displays lines such as:

11/2 14:12:12     Request 00347.00008:
11/2 14:12:12       Matched 347.8 persephone@xxxxxxxxxxxxxx 
<10.1.16.132:4990> preempting none <10.1.16.77:1039>
11/2 14:12:12       Successfully matched with pergola.tessella.co.uk
11/2 14:12:12     Request 00347.00009:
11/2 14:12:33 Can't connect to <10.1.16.136:1044>:0, errno = 10060
11/2 14:12:33 Will keep trying for 10 seconds...
11/2 14:12:34 Connect failed for 10 seconds; returning FALSE
11/2 14:12:34 ERROR: SECMAN:2003:TCP connection to <10.1.16.136:1044> failed

11/2 14:12:34 condor_write(): Socket closed when trying to write buffer
11/2 14:12:34 Buf::write(): condor_write() failed
11/2 14:12:34       Could not send PERMISSION
11/2 14:12:34   Error: Ignoring schedd for this cycle
11/2 14:12:34 ---------- Finished Negotiation Cycle ----------

and the scheduler something like

11/2 14:12:11 Negotiating for owner: persephone@xxxxxxxxxxxxxx
11/2 14:12:11 Checking consistency running and runnable jobs
11/2 14:12:11 Tables are consistent
11/2 14:12:32 condor_read(): timeout reading buffer.
11/2 14:12:32 Can't receive request from manager
11/2 14:12:32 DaemonCore: Command received via UDP from host
<10.1.16.102:1655>
11/2 14:12:32 DaemonCore: received command 60014 (DC_INVALIDATE_KEY),
calling handler (handle_invalidate_key())
11/2 14:12:32 condor_read(): recv() returned -1, errno = 10054, assuming
failure.
11/2 14:12:32 Response problem from startd.
11/2 14:12:32 Sent RELEASE_CLAIM to startd on <10.1.16.102:1040>
11/2 14:12:32 Match record (<10.1.16.102:1040>, 347, 3) deleted

This means that all the other nodes in the pool (mostly without the
Windows firewall) that come after this error in the negotiation cycle
are ignored and don't run any jobs.
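
To see which nodes are cutting negotiation cycles short, the "Can't connect" lines in the negotiator log can be tallied by address. This is a minimal sketch with a hypothetical function name (`unreachable_hosts`), and it assumes the message format shown in the excerpt above. For reference, Winsock error 10060 is WSAETIMEDOUT (connection timed out) and 10054 is WSAECONNRESET (connection reset by peer).

```python
import re
from collections import Counter

# Format taken from the negotiator log excerpt in this message, e.g.
#   Can't connect to <10.1.16.136:1044>:0, errno = 10060
CONNECT_FAIL = re.compile(
    r"Can't connect to <(\d+\.\d+\.\d+\.\d+):(\d+)>\S*, errno = (\d+)"
)


def unreachable_hosts(log_text):
    """Count connection failures per (ip, port, errno) in a negotiator log."""
    counts = Counter()
    for ip, port, errno in CONNECT_FAIL.findall(log_text):
        counts[(ip, int(port), int(errno))] += 1
    return counts
```

Calling `unreachable_hosts(open(path).read())` on a negotiator log, where `path` is wherever that log lives on the central manager, gives a quick list of the addresses worth checking for the firewall problem.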

Is there any way of getting the scheduler / negotiator to ignore a
machine which it can't connect to and carry on assigning jobs to the rest
of the pool? I've tried setting NEGOTIATE_ALL_JOBS_IN_CLUSTER to True but
this doesn't help. I noticed another posting to the users list mentioning
this problem but there were no responses. It was also using a Windows
central manager, so has anyone seen this outside of Windows?
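
Until there is an answer from the Condor side, one stopgap is to probe each node's command port from the central manager and deal with the unreachable ones by hand. The sketch below is not a Condor feature: `can_connect` is a hypothetical helper, and the 10-second default simply mirrors the "Will keep trying for 10 seconds" message in the negotiator log. The command port changes between restarts in the logs above (1043, then 1044), so the address to probe has to come from the node's current entry in the pool.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, reset, or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("10.1.16.136", 1044)` would be expected to return False while the firewall is blocking that node as in the log above.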

Any suggestions would be appreciated,

Thanks

Geraint Lloyd

This message is confidential and may be privileged. It is intended for
the addressee(s) only. Access to this message by anyone else is
unauthorized and strictly prohibited. If you have received this message
in error, please inform the sender immediately. Please note that messages
sent or received by the Tessella e-mail system may be monitored and
stored in an information retrieval system.

TESSELLA   Geraint.Lloyd@xxxxxxxxxxxx
__/__/__/  Tessella Support Services plc
__/__/__/  3 Vineyard Chambers, ABINGDON, OX14 3PX, England
__/__/__/  Tel: (44)(0)1235-555511  Fax: (44)(0)1235-553301
           www.tessella.com    Registered in England No. 1466429