
Re: [HTCondor-users] ERROR: SECMAN:2003:TCP connection to collector failed.



Hi Stefano,

Yes - I mean the big ol' firewall that sits in the core of our infrastructure and inspects packets going back and forth.  For the most part we don't use local firewalls on any internal kit.

When I nmap'd to port 9618 from a production Win10 client machine that wasn't working, the result I got was "filtered", the same as you are currently seeing.  When I tested from a working Win10 machine here in the IT department I got a result that said "open".

I went and talked to the firewall guys, who could see that tcp/9618 was being blocked from my production Win10 machine.  They found and fixed their config error, and then I could see that my production Win10 machine returned "open" from an nmap test to the server on 9618.  After a few minutes (looks like the clients try to talk to the server every 5 minutes), Condor came back to life on all my production Win10 fleet.  I went from around 90 machines in the grid back up to the ~1000 that I expect, once the firewall was sorted.

If you're getting "filtered" in your nmap results, I'd strongly suspect a firewall issue somewhere in the infrastructure between your clients and your server.  Maybe ask your sysadmin to have a look at the firewall logs to see if 9618 is being blocked or quietly dropped, because it really sounds like something is going on there to me.  Hope this helps.

Cheers, Craig


On 27/09/2018, at 10:00 AM, Stefano Colafranceschi <stefano.colafranceschi@xxxxxxx> wrote:

Yes, we have an identical setup! Do you mean the core firewall of your router/switch? All my software firewalls are disabled.

I tried nmap on both Linux and Windows and saw that the port is flagged as filtered (before I asked my sysadmin to open it, it was closed). As I understand it, filtered means that it is open to a certain extent (depending on the router rule, perhaps the packet inspection is blocking it). Is there a test I can use to show my sysadmin where the issue is?
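
One test that may help with the sysadmin is a plain TCP connect from the client, which tells roughly the same story as nmap. In nmap's terms, "filtered" means the probe got no answer at all (typically a firewall silently dropping packets), not that the port is partly open; "closed" means the host answered and refused the connection. The sketch below is only an illustration and not part of HTCondor; the function name probe_tcp_port and the 3-second timeout are assumptions.

```python
import socket


def probe_tcp_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port roughly the way nmap does.

    Returns "open", "closed", "filtered", or "error: ..." for anything else.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"            # three-way handshake completed
    except ConnectionRefusedError:
        return "closed"          # host replied with a reset: reachable, nothing listening
    except socket.timeout:
        return "filtered"        # no reply at all: packets are probably being dropped
    except OSError as exc:
        return f"error: {exc}"   # e.g. no route to host, name resolution failure
    finally:
        sock.close()
```

Calling probe_tcp_port("10.6.10.15", 9618) from a client that works and from one that doesn't, and comparing the two results, gives the sysadmin two concrete data points to check against the firewall logs.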


On Wed, Sep 26, 2018, 5:32 PM Craig Parker <craig.parker@xxxxxxxxx> wrote:
Hi all,
Long time lurker, first time poster - just chiming in here as I've just finished dealing with this exact same "SECMAN:2003:TCP connection to collector x.x.x.x:9618 failed" error, and it turned out to be a config error introduced onto our core firewall, blocking access on port 9618 from our Condor fleet.  Sounds like we have a similar setup to you - RHEL7 Condor server, and Windows 10 clients.

Can you run nmap / Zenmap GUI from the Windows machine(s) to see if you can reach the Condor server on tcp/9618?  Apologies if you've already covered this and I haven't seen it.  Good luck!

Cheers, Craig


On 27/09/2018, at 8:16 AM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

This file also looks ok to  me.  
 
Are these all of the config files on the Windows machine?  No condor_config.local or files in the c:\condor\config directory?
 
Now that I look more closely, this message
 
> 09/24/18 09:46:01 Query info: matched=6; skipped=4; query_time=0.000806; send_time=0.001738; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<127.0.0.1:25381>; projection={}
> 09/24/18 09:46:01 DaemonCore: Can't receive command request from 127.0.0.1 (perhaps a timeout?)
 
is a query from the condor_negotiator process on the central manager machine. It is NOT your condor_status -master query.
 
The peer of 127.0.0.1 is reasonable for a query from negotiator to collector, and the subsequent warning is a known bug - it's a minor protocol mismatch that happens *after* the negotiator query has succeeded.
 
If the collector log has no message corresponding to the condor_status -master query, then you should look for a firewall between your central manager and your Windows machine that is preventing any attempt to contact the central manager from your windows box.
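
Checking the collector log for the client's query can be done by eye, but a small script makes it repeatable. This is a rough sketch, assuming the CollectorLog format shown in this thread ("Query info: ... peer=<ip:port>"); the function names and the client address 10.0.0.42 are made up for illustration.

```python
import re

# Matches the peer=<ip:port> field of a "Query info" line and captures the IP.
PEER_RE = re.compile(r"peer=<(\d+\.\d+\.\d+\.\d+):\d+>")

# One line copied from the CollectorLog excerpt in this thread.
SAMPLE_LOG = [
    "09/24/18 09:46:01 Got QUERY_STARTD_PVT_ADS",
    "09/24/18 09:46:01 Query info: matched=4; skipped=0; query_time=0.000839; "
    "send_time=0.000619; type=MachinePrivate; requirements={true}; "
    "peer=<127.0.0.1:27363>; projection={}",
]


def query_peers(log_lines):
    """Return the set of peer IPs that appear in 'Query info' lines."""
    peers = set()
    for line in log_lines:
        if "Query info:" not in line:
            continue
        match = PEER_RE.search(line)
        if match:
            peers.add(match.group(1))
    return peers


def client_reached_collector(log_lines, client_ip):
    """True if any logged query came from client_ip."""
    return client_ip in query_peers(log_lines)


# Hypothetical Windows client address; substitute the real one.
print(client_reached_collector(SAMPLE_LOG, "10.0.0.42"))
```

If only 127.0.0.1 ever shows up as a peer, the queries in the log are local ones (such as the negotiator's), and nothing from the Windows machine is arriving.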
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Stefano Colafranceschi
Sent: Wednesday, September 26, 2018 9:02 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] ERROR: SECMAN:2003:TCP connection to collector failed.
 
Find attached the config file of condor on my windows client (which is in 10.x.x.x), any further suggestions?
 
Thanks!
 
StefanoC
 
From: John M Knoeller
Sent: Tuesday, September 25, 2018 5:19 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] ERROR: SECMAN:2003:TCP connection to collector failed.
 
This looks ok to me.
 
Your ALLOW_WRITE line is allowing everything on the 10.* subnet; that should be sufficient to give your Windows machine permission to send ads to the Collector.  (I'm assuming your Windows machine is in that subnet?)
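
As a rough illustration of what a wildcard entry such as 10.* permits, the sketch below approximates the match with shell-style globbing on the IP string. It is not HTCondor's own matching code, which accepts more entry forms than plain IP wildcards; ip_allowed is a hypothetical helper, and the example allow-list values are assumptions rather than the poster's actual config.

```python
from fnmatch import fnmatchcase


def ip_allowed(ip: str, allow_list: str) -> bool:
    """Approximate check: does ip match any wildcard entry in allow_list?

    allow_list is a comma- or space-separated string such as "10.*".
    Only plain IP wildcard patterns are handled here.
    """
    patterns = allow_list.replace(",", " ").split()
    return any(fnmatchcase(ip, pattern) for pattern in patterns)


print(ip_allowed("10.6.10.15", "10.*"))    # address inside the 10.* range
print(ip_allowed("192.168.1.5", "10.*"))   # address outside it
```

A client whose address fails this kind of check would be refused by authorization, which is a different failure from a connection that never reaches the collector at all.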
 
 
Could I also see the configuration of your Windows machine?  Perhaps the problem is there.
 
-tj
 
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Stefano Colafranceschi - Mathematical Sciences Dept
Sent: Tuesday, September 25, 2018 12:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] ERROR: SECMAN:2003:TCP connection to collector failed.
 
Thanks, find inline answer and attached config file

> On Sep 25, 2018, at 11:57 AM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> 
> I presume x.x.x.x is the correct IP for your Linux central manager machine?
yes 10.6.10.15

 
> The error in the Master log looks like it might be an authorization problem - the collector isn't allowing the Windows node to send updates.
right but I can't figure out the issue.

 
> Check the ALLOW_WRITE configuration knob in the Collector, does it permit the IP of the Windows node?
 
> At the same timestamp as the error from the master log (plus or minus a few seconds in case of clock mis-match), is there a message in the Collector log about refusing an attempt to send updates?
yes, basically the error you describe as puzzling appears at the same time as the Windows node's attempt to connect.
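
Lining up the two logs by timestamp can also be scripted. A minimal sketch, assuming both logs use the "MM/DD/YY HH:MM:SS" prefix seen in this thread; lines_near and the 5-second slack are illustrative choices, not anything HTCondor provides.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%m/%d/%y %H:%M:%S"   # e.g. "09/24/18 09:46:01"
TS_LENGTH = 17                    # length of that timestamp prefix


def lines_near(log_lines, when, slack_seconds=5):
    """Return log lines stamped within slack_seconds of the timestamp `when`."""
    target = datetime.strptime(when, TS_FORMAT)
    slack = timedelta(seconds=slack_seconds)
    hits = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # continuation line or other text without a timestamp
        if abs(stamp - target) <= slack:
            hits.append(line)
    return hits
```

Feeding it the CollectorLog lines and the time of the MasterLog error shows what, if anything, the collector recorded around that moment.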

 
> This error
 
> 09/24/18 09:46:01 Query info: matched=6; skipped=4; query_time=0.000806; send_time=0.001738; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<127.0.0.1:25381>; projection={}
> 09/24/18 09:46:01 DaemonCore: Can't receive command request from 127.0.0.1 (perhaps a timeout?)
 
> is a bit more puzzling to me.  I don't see how a request from a windows node to the collector could result in a peer address of 127.0.0.1
 
> Does the config on the Windows machine have this?
this file c:\windows\system32\drivers\etc\hosts does not contain 127.0.0.1; it contains just "10.6.10.15   mastercondor" (I added this for convenience)
 
> NETWORK_INTERFACE = 127.0.0.1
 
> If so, remove that line.
 
> If not try running
 
>    condor_config_val -write:upgrade  config.log
ok done attached
 
> and sending me the config.log file.  I'll see if I can see anything in that config that could cause the peer address to be set incorrectly.

thank you very much for your help and support!

 
> -tj
 
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Stefano Colafranceschi
> Sent: Monday, September 24, 2018 12:12 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] ERROR: SECMAN:2003:TCP connection to collector failed.
 
> Dear all,
> I am trying to have a linux (latest) htcondor running with a windows node. On Linux I can submit jobs and they get processed no problems, but I can't figure out what's wrong adding a windows machine to the pool.
 
> This is the error that I see on the MasterLog (windows client):
 
> ERROR: SECMAN:2003:TCP connection to collector x.x.x.x failed.
> Failed to start non-blocking update to <x.x.x.x:9618>.
 
> And this is the content of the Collectorlog on the linux server, just after I issued on the windows machine condor_status -master
 
> 09/24/18 09:46:01 Got QUERY_STARTD_PVT_ADS
> 09/24/18 09:46:01 Number of Active Workers 0
> 09/24/18 09:46:01 (Sending 4 ads in response to query)
> 09/24/18 09:46:01 Query info: matched=4; skipped=0; query_time=0.000839; send_time=0.000619; type=MachinePrivate; requirements={true}; peer=<127.0.0.1:27363>; projection={}
> 09/24/18 09:46:01 Number of Active Workers 0
> 09/24/18 09:46:01 (Sending 6 ads in response to query)
> 09/24/18 09:46:01 Query info: matched=6; skipped=4; query_time=0.000806; send_time=0.001738; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<127.0.0.1:25381>; projection={}
> 09/24/18 09:46:01 DaemonCore: Can't receive command request from 127.0.0.1 (perhaps a timeout?)
 
 
> p.s. I am sure both windows and Linux have 9618 port open.
 
> Thanks for any suggestions!
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/