RE: [Condor-users] Windows XP firewall problems with 6.6.7
- Date: Tue, 2 Nov 2004 16:33:08 -0000
- From: "Wilding, Kevan A" <kwilding@xxxxxxxxxxx>
- Subject: RE: [Condor-users] Windows XP firewall problems with 6.6.7
We have given up on our pool of XP Service Pack 2 machines for the time
being. The central manager is on Linux, but this has no bearing: jobs can be
channelled to any Linux machine and run fine.
I reinstalled on a few test machines, using the 6.6.7 version, and in
the first instance, I was able to test a connection from one XP machine
to another, hard coding the machine name for this purpose.
The following day, after the machines had restarted, no XP connections
would work at all, even between two individual machines. In earlier
testing on 6.6.1, with firewall exceptions opened, I was able to submit a
job to the XP SP2 pool, but scheduling failed (i.e. connecting to any
other machine in the pool), and so the job failed entirely.
I think you give better detail of the problem than I can supply, but these
sound like very similar problems. I would be interested to hear whether
anyone has actually got this working.
The main reason I ask is the severe port and programming restrictions
being imposed at our University for virus avoidance; I need to ensure
that none of this is affecting any of the submissions / scheduling.
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Sent: 02 November 2004 15:32
Subject: [Condor-users] Windows XP firewall problems with 6.6.7
We have an all-Windows (mixture of Win2K and XP) Condor pool with most of
the nodes acting as execute-only machines, along with one central pool
manager / submitter. We have recently updated to Condor 6.6.7
following installation of Windows XP SP2 on some of the execute nodes. We
are now having problems getting jobs to run on all nodes. I have traced
this to a combination of 2 problems:
1) On some of the machines with XP SP2 installed, the firewall is still
blocking some connections. This happens when the machine is initially
booted and Condor starts automatically. The Condor master log on these
nodes displays lines similar to the following:
11/2 14:19:42 ******************************************************
11/2 14:19:42 ** Condor (CONDOR_MASTER) STARTING UP
11/2 14:19:42 ** C:\Condor\bin\condor_master.exe
11/2 14:19:42 ** $CondorVersion: 6.6.7 Oct 14 2004 $
11/2 14:19:42 ** $CondorPlatform: INTEL-WINNT40 $
11/2 14:19:42 ** PID = 432
11/2 14:19:42 ******************************************************
11/2 14:19:42 Using config file: C:\Condor\condor_config
11/2 14:19:42 Using local config files: C:\Condor/condor_config.local
11/2 14:19:42 DaemonCore: Command Socket at <10.1.16.136:1043>
11/2 14:19:42 WinFirewall: get_CurrentProfile failed: 0x800706d9
11/2 14:19:42 Started DaemonCore process "C:\Condor/bin/condor_startd.exe", pid and pgroup = 496
The node still appears in the pool but won't run any jobs and the
negotiator log on the central pool manager displays errors connecting to
this machine whenever jobs are submitted.
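For anyone decoding the error value in that log: 0x800706d9 is an HRESULT that wraps a Win32 error code. The short Python sketch below splits it into its fields (the helper name decode_hresult is made up for illustration and is not part of Condor). Facility 7 is FACILITY_WIN32, and code 1753 is EPT_S_NOT_REGISTERED ("There are no more endpoints available from the endpoint mapper"). That would be consistent with the firewall service not yet being reachable when Condor starts at boot, although this is an inference from the code rather than something the log confirms.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into severity, facility and code fields."""
    return {
        "failure": bool((value >> 31) & 1),   # top bit set means failure
        "facility": (value >> 16) & 0x7FF,    # 11-bit facility field
        "code": value & 0xFFFF,               # low 16 bits: the error code
    }


FACILITY_WIN32 = 7  # facility used when an HRESULT wraps a Win32 error

fields = decode_hresult(0x800706D9)
print(fields)
```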
If I stop and restart the Condor service manually at a later stage all
works fine - the master log on the node now displays
11/2 14:21:17 Authorized application C:\Condor/bin/condor_startd.exe is
now enabled in the firewall.
- and does not give the WinFirewall error. Jobs now run on the node
without problems - no firewall blocking.
All the firewall settings are correct - exceptions allowed etc. I've
tried various changes, including making the Condor service dependent on
the firewall service to ensure that it starts after this, but it hasn't
fixed the problem. Any ideas ?
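Since a manual restart clears the problem, one possible stopgap is to automate it. The following is only a sketch in Python, under several assumptions: that the master log is readable as text, that the service is registered under the name "condor", and that the three marker strings match the log lines quoted above. The function names are hypothetical. It checks whether the most recent master startup hit the WinFirewall error without a later "now enabled in the firewall" line, and offers a helper that restarts the service with the Windows net command.

```python
import subprocess

# Marker strings taken from the master log excerpts quoted above.
STARTUP_MARKER = "** Condor (CONDOR_MASTER) STARTING UP"
FAILURE_MARKER = "WinFirewall: get_CurrentProfile failed"
SUCCESS_MARKER = "now enabled in the firewall"


def needs_restart(log_text: str) -> bool:
    """Return True if the latest master startup hit the firewall error."""
    start = log_text.rfind(STARTUP_MARKER)
    if start == -1:
        return False  # no startup banner found, nothing to judge
    latest = log_text[start:]
    return FAILURE_MARKER in latest and SUCCESS_MARKER not in latest


def restart_condor(service: str = "condor") -> None:
    """Stop and start the service (service name is an assumption)."""
    subprocess.run(["net", "stop", service], check=False)
    subprocess.run(["net", "start", service], check=True)
```

A scheduled task could read the master log a few minutes after boot, call needs_restart() on its contents, and call restart_condor() only when it returns True. This works around the symptom; it does not explain why the firewall call fails at boot.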
2) Running jobs on all the nodes is made far worse by a second problem. If
the negotiator fails to talk correctly to one of the nodes (i.e. because
of the firewall problem) then it gives up on that negotiation cycle. The
negotiator log displays lines such as:
11/2 14:12:12 Request 00347.00008:
11/2 14:12:12 Matched 347.8 persephone@xxxxxxxxxxxxxx
<10.1.16.132:4990> preempting none <10.1.16.77:1039>
11/2 14:12:12 Successfully matched with pergola.tessella.co.uk
11/2 14:12:12 Request 00347.00009:
11/2 14:12:33 Can't connect to <10.1.16.136:1044>:0, errno = 10060
11/2 14:12:33 Will keep trying for 10 seconds...
11/2 14:12:34 Connect failed for 10 seconds; returning FALSE
11/2 14:12:34 ERROR: SECMAN:2003:TCP connection to <10.1.16.136:1044> failed
11/2 14:12:34 condor_write(): Socket closed when trying to write buffer
11/2 14:12:34 Buf::write(): condor_write() failed
11/2 14:12:34 Could not send PERMISSION
11/2 14:12:34 Error: Ignoring schedd for this cycle
11/2 14:12:34 ---------- Finished Negotiation Cycle ----------
and the scheduler something like
11/2 14:12:11 Negotiating for owner: persephone@xxxxxxxxxxxxxx
11/2 14:12:11 Checking consistency running and runnable jobs
11/2 14:12:11 Tables are consistent
11/2 14:12:32 condor_read(): timeout reading buffer.
11/2 14:12:32 Can't receive request from manager
11/2 14:12:32 DaemonCore: Command received via UDP from host
11/2 14:12:32 DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
11/2 14:12:32 condor_read(): recv() returned -1, errno = 10054, assuming
11/2 14:12:32 Response problem from startd.
11/2 14:12:32 Sent RELEASE_CLAIM to startd on <10.1.16.102:1040>
11/2 14:12:32 Match record (<10.1.16.102:1040>, 347, 3) deleted
This means that all the other nodes in the pool (mostly without the
Windows firewall) that come after this error in the negotiation cycle are
ignored and don't run any jobs.
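To see which nodes are cutting cycles short, the negotiator log can be scanned for the connect failures. The Python sketch below is an illustration only; the regular expression is written against the "Can't connect to" line quoted above, and the function name is hypothetical. The two errno values in these logs are standard Winsock codes: 10060 is WSAETIMEDOUT (the connection attempt timed out, as expected when a firewall silently drops packets) and 10054 is WSAECONNRESET.

```python
import re

# Standard Winsock error codes that appear in the logs above.
WINSOCK_ERRORS = {10054: "WSAECONNRESET", 10060: "WSAETIMEDOUT"}

# Matches e.g. "Can't connect to <10.1.16.136:1044>:0, errno = 10060"
CONNECT_FAILURE = re.compile(r"Can't connect to <([\d.]+):(\d+)>.*?errno = (\d+)")


def unreachable_hosts(log_text: str) -> dict:
    """Map each IP that failed a connect to its set of (port, error name)."""
    hosts = {}
    for ip, port, errno in CONNECT_FAILURE.findall(log_text):
        name = WINSOCK_ERRORS.get(int(errno), "unknown")
        hosts.setdefault(ip, set()).add((int(port), name))
    return hosts
```

Running this over the negotiator log gives a list of the addresses that are worth checking for the firewall problem described in 1).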
Is there any way of getting the scheduler / negotiator to ignore a node
which it can't connect to and carry on assigning jobs to the rest of the
pool? I've tried setting NEGOTIATE_ALL_JOBS_IN_CLUSTER to True but this
doesn't help. I noticed another posting to the users list mentioning this
problem but there were no responses. It was also using a Windows central
manager, so has anyone seen this outside of Windows?
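Until there is an answer on the negotiator side, a simple TCP probe run from the central manager can at least confirm which nodes are reachable before jobs are submitted. This is a minimal Python sketch, not a Condor feature: the function name is hypothetical, and the host and port have to be taken from a current log line such as the "Command Socket at" entry, since the ports are assigned dynamically.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

For example, can_connect("10.1.16.136", 1044) would be expected to return False while the firewall on that node is blocking the startd. Any node that fails the probe could then be stopped or restarted by hand so that it does not interrupt the negotiation cycle for the rest of the pool.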
Any suggestions would be appreciated,
__/__/__/ Tessella Support Services plc
__/__/__/ 3 Vineyard Chambers, ABINGDON, OX14 3PX, England __/__/__/
Tel: (44)(0)1235-555511 Fax: (44)(0)1235-553301
www.tessella.com Registered in England No.