
[Condor-users] Windows XP firewall problems with 6.6.7




We have an all-Windows (mixture of Win2K and XP) Condor pool, with most of the nodes acting as execute-only machines alongside one central pool manager / submitter. We recently updated to Condor 6.6.7 after installing Windows XP SP2 on some of the execute nodes, and we are now having problems getting jobs to run on all nodes. I have traced this to a combination of two problems:

1) On some of the machines with XP SP2 installed, the firewall is still blocking some connections. This happens when the machine is initially booted and Condor starts automatically. The Condor master log on these nodes displays lines similar to the following:

11/2 14:19:42 ******************************************************
11/2 14:19:42 ** Condor (CONDOR_MASTER) STARTING UP
11/2 14:19:42 ** C:\Condor\bin\condor_master.exe
11/2 14:19:42 ** $CondorVersion: 6.6.7 Oct 14 2004 $
11/2 14:19:42 ** $CondorPlatform: INTEL-WINNT40 $
11/2 14:19:42 ** PID = 432
11/2 14:19:42 ******************************************************
11/2 14:19:42 Using config file: C:\Condor\condor_config
11/2 14:19:42 Using local config files: C:\Condor/condor_config.local
11/2 14:19:42 DaemonCore: Command Socket at <10.1.16.136:1043>
11/2 14:19:42 WinFirewall: get_CurrentProfile failed: 0x800706d9
11/2 14:19:42 Started DaemonCore process "C:\Condor/bin/condor_startd.exe", pid and pgroup = 496
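
For what it's worth, 0x800706d9 looks like EPT_S_NOT_REGISTERED ("There are no more endpoints available from the endpoint mapper"), which would fit the Windows Firewall / ICS service not being up yet at the point where condor_master tries to query it. A quick check after a reboot would be to compare the state of the two services - a rough sketch, assuming the standard XP SP2 firewall service name SharedAccess and that the Condor service is installed under the name Condor:

rem assumes the firewall / ICS service is SharedAccess and the Condor service is named Condor
sc query SharedAccess
sc query Condor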

The node still appears in the pool but won't run any jobs, and the negotiator log on the central pool manager shows errors connecting to this machine whenever jobs are submitted.
If I stop and restart the Condor service manually at a later stage, everything works fine - the master log on the node now displays

11/2 14:21:17 Authorized application C:\Condor/bin/condor_startd.exe is now enabled in the firewall.

-  and does not give the WinFirewall error. Jobs now run on the node without problems - no firewall blocking.
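
For completeness, the manual restart is nothing more than bouncing the service from a command prompt (again assuming it is installed under the name Condor):

rem assumes the Condor service is installed under the name Condor
net stop Condor
net start Condor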

All the firewall settings are correct - exceptions are allowed, etc. I've tried various changes, including making the Condor service dependent on the firewall service to ensure that it starts after the firewall, but this hasn't fixed the problem. Any ideas?
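
For reference, the dependency change was roughly the following (from memory - assuming the firewall / ICS service is SharedAccess and the Condor service is Condor; sc requires the space after depend=):

rem assumes service names Condor and SharedAccess; the space after depend= is required by sc
sc config Condor depend= SharedAccess

I have also wondered about pre-authorising the daemons by hand so that the WinFirewall step at startup doesn't matter, along the lines of:

rem hypothetical manual firewall exception for the startd (path taken from the master log above)
netsh firewall add allowedprogram program="C:\Condor\bin\condor_startd.exe" name="condor_startd" mode=ENABLE

but I'd rather understand why the automatic step fails at boot time.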

2) Running jobs on all the nodes is made far worse by a second problem. If the negotiator fails to talk correctly to one of the nodes (e.g. because of the firewall problem above), it gives up on that negotiation cycle. The negotiator log displays lines such as:

11/2 14:12:12     Request 00347.00008:
11/2 14:12:12       Matched 347.8 persephone@xxxxxxxxxxxxxx <10.1.16.132:4990> preempting none <10.1.16.77:1039>
11/2 14:12:12       Successfully matched with pergola.tessella.co.uk
11/2 14:12:12     Request 00347.00009:
11/2 14:12:33 Can't connect to <10.1.16.136:1044>:0, errno = 10060
11/2 14:12:33 Will keep trying for 10 seconds...
11/2 14:12:34 Connect failed for 10 seconds; returning FALSE
11/2 14:12:34 ERROR:
SECMAN:2003:TCP connection to <10.1.16.136:1044> failed

11/2 14:12:34 condor_write(): Socket closed when trying to write buffer
11/2 14:12:34 Buf::write(): condor_write() failed
11/2 14:12:34       Could not send PERMISSION
11/2 14:12:34   Error: Ignoring schedd for this cycle
11/2 14:12:34 ---------- Finished Negotiation Cycle ----------

and the scheduler log shows something like:

11/2 14:12:11 Negotiating for owner: persephone@xxxxxxxxxxxxxx
11/2 14:12:11 Checking consistency running and runnable jobs
11/2 14:12:11 Tables are consistent
11/2 14:12:32 condor_read(): timeout reading buffer.
11/2 14:12:32 Can't receive request from manager
11/2 14:12:32 DaemonCore: Command received via UDP from host <10.1.16.102:1655>
11/2 14:12:32 DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling handler (handle_invalidate_key())
11/2 14:12:32 condor_read(): recv() returned -1, errno = 10054, assuming failure.
11/2 14:12:32 Response problem from startd.
11/2 14:12:32 Sent RELEASE_CLAIM to startd on <10.1.16.102:1040>
11/2 14:12:32 Match record (<10.1.16.102:1040>, 347, 3) deleted

This means that all the other nodes in the pool (mostly without the Windows firewall) that come after this error in the negotiation cycle are ignored and don't run any jobs.
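
As a sanity check on the submit side, something like the following should show whether the remaining jobs are being rejected for any other reason:

rem 347.9 is just the next request from the negotiator log above
condor_q -analyze 347.9

- but from the log it looks as though the requests after the failed connection are simply never considered in that cycle.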

Is there any way of getting the scheduler / negotiator to ignore a machine which it can't connect to and carry on assigning jobs to the rest of the pool? I've tried setting NEGOTIATE_ALL_JOBS_IN_CLUSTER to True but this doesn't help. I noticed another posting to the users list mentioning this problem, but there were no responses. That poster was also using a Windows central manager, so has anyone seen this outside of Windows?
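
For reference, the settings can be checked with condor_config_val and re-read with condor_reconfig; NEGOTIATOR_TIMEOUT is another knob from the 6.6 manual that looks related, although I suspect it would only shorten the stall rather than stop the schedd being dropped for the cycle:

rem run on the central manager / submit machine; NEGOTIATOR_TIMEOUT is an assumption from the 6.6 manual
condor_config_val NEGOTIATE_ALL_JOBS_IN_CLUSTER
condor_config_val NEGOTIATOR_TIMEOUT
condor_reconfig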

Any suggestions would be appreciated,

Thanks

Geraint Lloyd

TESSELLA   Geraint.Lloyd@xxxxxxxxxxxx
__/__/__/  Tessella Support Services plc
__/__/__/  3 Vineyard Chambers, ABINGDON, OX14 3PX, England
__/__/__/  Tel: (44)(0)1235-555511  Fax: (44)(0)1235-553301
                   www.tessella.com    Registered in England No. 1466429