
Re: [HTCondor-users] Edgar, your HTCondor shared port question



Hi Todd,

Thank you for following up with this.

I tried doing it as root:

[1258] root@uaf-10 /etc/condor/config.d# condor_tail -debug 7900841.0
02/23/18 12:58:28 Result of reading /etc/issue:  CentOS release 6.9 (Final)
 
02/23/18 12:58:28 Growing processor array to 64
02/23/18 12:58:28 Using IDs: 48 processors, 24 CPUs, 24 HTs
02/23/18 12:58:28 Reading condor configuration from '/etc/condor/condor_config'
02/23/18 12:58:28 Enumerating interfaces: lo 127.0.0.1 up
02/23/18 12:58:28 Enumerating interfaces: eth0 169.228.130.74 up
02/23/18 12:58:28 Enumerating interfaces: eth0:0 169.254.100.2 up
02/23/18 12:58:28 Enumerating interfaces: eth0:1 169.228.130.39 up
02/23/18 12:58:28 Initializing Directory: curr_dir = /etc/condor/config.d
02/23/18 12:58:28 WARNING: Config source is empty: /etc/condor/config.d/00personal_condor.config
02/23/18 12:58:28 Locating daemon process
02/23/18 12:58:28 IPVERIFY: checking uaf-10.t2.ucsd.edu against 169.228.130.74
02/23/18 12:58:28 IPVERIFY: matched 169.228.130.74 to 169.228.130.74
02/23/18 12:58:28 IPVERIFY: ip found is 1
02/23/18 12:58:29 Response for GET_JOB_CONNECT_INFO:
StarterIpAddr = "<169.228.131.159:7669?CCBID=169.228.130.23:9634%3faddrs%3d169.228.130.23-9634#294012&PrivNet=cabinet-2-2-21.t2.ucsd.edu&addrs=169.228.131.159-7669&noUDP>"
Result = true
ServerTime = 1519419509
CondorVersion = "$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $"

02/23/18 12:58:29 Got connect info for starter <169.228.131.159:7669?CCBID=169.228.130.23:9634%3faddrs%3d169.228.130.23-9634#294012&PrivNet=cabinet-2-2-21.t2.ucsd.edu&addrs=169.228.131.159-7669&noUDP>
02/23/18 12:58:29 Requesting GoAhead from the transfer queue manager.
02/23/18 12:58:29 Received GoAhead from the transfer queue manager.
02/23/18 12:58:29 IPVERIFY: checking glidein-collector.t2.ucsd.edu against 169.228.130.23
02/23/18 12:58:29 IPVERIFY: matched 169.228.130.23 to 169.228.130.23
02/23/18 12:58:29 IPVERIFY: ip found is 1
02/23/18 12:58:29 CCBClient: received failure message from CCB server collector 169.228.130.23:9634?addrs=169.228.130.23-9634 in response to request for reversed connection to starter at <169.228.131.159:7669>: failed to connect
02/23/18 12:58:29 Failed to reverse connect to starter at <169.228.131.159:7669> via CCB.
Failed to peek at file from starter: Failed to connect to starter
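The StarterIpAddr value above packs several pieces of routing information into one address string. As a rough illustration only, the Python sketch below splits such a string into host, port, and parameters, and separates the CCBID into the broker address and what appears to be a numeric registration id after the '#'. The function name parse_sinful and the field layout it assumes are inferred from the strings in this log. They are not taken from HTCondor's own parser.

```python
from urllib.parse import unquote


def parse_sinful(address: str) -> dict:
    """Split an address like '<ip:port?key=value&flag>' into its parts.

    Simplified sketch: it handles only the fields that appear in this log.
    """
    text = address.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("expected an address wrapped in <...>")
    hostport, _, query = text[1:-1].partition("?")
    host, _, port = hostport.rpartition(":")
    params = {}
    for item in (query.split("&") if query else []):
        key, sep, value = item.partition("=")
        # Bare flags such as 'noUDP' carry no value; record them as True.
        params[key] = unquote(value) if sep else True
    result = {"host": host, "port": int(port), "params": params}
    ccbid = params.get("CCBID")
    if isinstance(ccbid, str):
        # After decoding %3f -> '?' and %3d -> '=', the CCBID reads
        # '<broker address>#<id>'; split on the last '#'.
        broker, _, ccb_id = ccbid.rpartition("#")
        result["ccb_broker"] = broker
        result["ccb_id"] = ccb_id
    return result


if __name__ == "__main__":
    starter = ("<169.228.131.159:7669?CCBID=169.228.130.23:9634%3faddrs%3d"
               "169.228.130.23-9634#294012&PrivNet=cabinet-2-2-21.t2.ucsd.edu"
               "&addrs=169.228.131.159-7669&noUDP>")
    info = parse_sinful(starter)
    print(info["host"], info["port"])  # 169.228.131.159 7669
    print(info["ccb_broker"])          # 169.228.130.23:9634?addrs=169.228.130.23-9634
```

Run on the first address, it shows that the starter is reached through the CCB broker at 169.228.130.23:9634 (the glidein collector in the IPVERIFY lines). That is why condor_tail has to ask CCB for a reversed connection, and why the log shows that request failing.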


And I still get the same error.

Note that on this host LOCK points to /var/log/condor:

condor_config_val LOCK
/var/log/condor


But there is no daemon_sock directory in it:

[1303] root@uaf-10 /var/log/condor# ls | grep sock -i
[1303] root@uaf-10 /var/log/condor# 



Running it as the condor user does not help either:

sudo -u condor condor_tail -debug 7920298.0
02/23/18 13:05:27 Result of reading /etc/issue:  CentOS release 6.9 (Final)
 
02/23/18 13:05:27 Growing processor array to 64
02/23/18 13:05:27 Using IDs: 48 processors, 24 CPUs, 24 HTs
02/23/18 13:05:27 Reading condor configuration from '/etc/condor/condor_config'
02/23/18 13:05:27 Enumerating interfaces: lo 127.0.0.1 up
02/23/18 13:05:27 Enumerating interfaces: eth0 169.228.130.74 up
02/23/18 13:05:27 Enumerating interfaces: eth0:0 169.254.100.2 up
02/23/18 13:05:27 Enumerating interfaces: eth0:1 169.228.130.39 up
02/23/18 13:05:27 Initializing Directory: curr_dir = /etc/condor/config.d
02/23/18 13:05:27 WARNING: Config source is empty: /etc/condor/config.d/00personal_condor.config
02/23/18 13:05:27 Locating daemon process
02/23/18 13:05:27 SharedPortClient: sent connection request to local schedd for shared port id 2156870_4fd2_3
02/23/18 13:05:27 IPVERIFY: checking uaf-10.t2.ucsd.edu against 169.228.130.74
02/23/18 13:05:27 IPVERIFY: matched 169.228.130.74 to 169.228.130.74
02/23/18 13:05:27 IPVERIFY: ip found is 1
02/23/18 13:05:27 Response for GET_JOB_CONNECT_INFO:
StarterIpAddr = "<169.228.131.48:38906?CCBID=169.228.130.23:9633%3faddrs%3d169.228.130.23-9633#323077&PrivNet=cabinet-5-5-21.t2.ucsd.edu&addrs=169.228.131.48-38906&noUDP>"
Result = true
ServerTime = 1519419927
CondorVersion = "$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $"

02/23/18 13:05:27 Got connect info for starter <169.228.131.48:38906?CCBID=169.228.130.23:9633%3faddrs%3d169.228.130.23-9633#323077&PrivNet=cabinet-5-5-21.t2.ucsd.edu&addrs=169.228.131.48-38906&noUDP>
02/23/18 13:05:27 Requesting GoAhead from the transfer queue manager.
02/23/18 13:05:27 SharedPortClient: sent connection request to local schedd for shared port id 2156870_4fd2_3
02/23/18 13:05:27 Received GoAhead from the transfer queue manager.
02/23/18 13:05:27 IPVERIFY: checking glidein-collector.t2.ucsd.edu against 169.228.130.23
02/23/18 13:05:27 IPVERIFY: matched 169.228.130.23 to 169.228.130.23
02/23/18 13:05:27 IPVERIFY: ip found is 1
02/23/18 13:05:27 CCBClient: received failure message from CCB server collector 169.228.130.23:9633?addrs=169.228.130.23-9633 in response to request for reversed connection to starter at <169.228.131.48:38906>: failed to connect
02/23/18 13:05:27 Failed to reverse connect to starter at <169.228.131.48:38906> via CCB.
Failed to peek at file from starter: Failed to connect to starter
[1305] root@uaf-10 /var/log/condor# condor_tail -debug 7920298.0
02/23/18 13:05:30 Result of reading /etc/issue:  CentOS release 6.9 (Final)
 
02/23/18 13:05:30 Growing processor array to 64
02/23/18 13:05:30 Using IDs: 48 processors, 24 CPUs, 24 HTs
02/23/18 13:05:30 Reading condor configuration from '/etc/condor/condor_config'
02/23/18 13:05:30 Enumerating interfaces: lo 127.0.0.1 up
02/23/18 13:05:30 Enumerating interfaces: eth0 169.228.130.74 up
02/23/18 13:05:30 Enumerating interfaces: eth0:0 169.254.100.2 up
02/23/18 13:05:30 Enumerating interfaces: eth0:1 169.228.130.39 up
02/23/18 13:05:30 Initializing Directory: curr_dir = /etc/condor/config.d
02/23/18 13:05:30 WARNING: Config source is empty: /etc/condor/config.d/00personal_condor.config
02/23/18 13:05:30 Locating daemon process
02/23/18 13:05:30 IPVERIFY: checking uaf-10.t2.ucsd.edu against 169.228.130.74
02/23/18 13:05:30 IPVERIFY: matched 169.228.130.74 to 169.228.130.74
02/23/18 13:05:30 IPVERIFY: ip found is 1
02/23/18 13:05:30 Response for GET_JOB_CONNECT_INFO:
StarterIpAddr = "<169.228.131.48:38906?CCBID=169.228.130.23:9633%3faddrs%3d169.228.130.23-9633#323077&PrivNet=cabinet-5-5-21.t2.ucsd.edu&addrs=169.228.131.48-38906&noUDP>"
Result = true
ServerTime = 1519419930
CondorVersion = "$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $"

02/23/18 13:05:30 Got connect info for starter <169.228.131.48:38906?CCBID=169.228.130.23:9633%3faddrs%3d169.228.130.23-9633#323077&PrivNet=cabinet-5-5-21.t2.ucsd.edu&addrs=169.228.131.48-38906&noUDP>
02/23/18 13:05:30 Requesting GoAhead from the transfer queue manager.
02/23/18 13:05:30 Received GoAhead from the transfer queue manager.
02/23/18 13:05:30 IPVERIFY: checking glidein-collector.t2.ucsd.edu against 169.228.130.23
02/23/18 13:05:30 IPVERIFY: matched 169.228.130.23 to 169.228.130.23
02/23/18 13:05:30 IPVERIFY: ip found is 1
02/23/18 13:05:30 CCBClient: received failure message from CCB server collector 169.228.130.23:9633?addrs=169.228.130.23-9633 in response to request for reversed connection to starter at <169.228.131.48:38906>: failed to connect
02/23/18 13:05:30 Failed to reverse connect to starter at <169.228.131.48:38906> via CCB.
Failed to peek at file from starter: Failed to connect to starter
[1305] root@uaf-10 /var/log/condor# host 169.228.131.48
48.131.228.169.in-addr.arpa domain name pointer cabinet-5-5-21.t2.ucsd.edu.


Edgar M Fajardo Hernandez



On Feb 23, 2018, at 11:57 AM, Todd Tannenbaum <todd.tannenbaum@xxxxxxxxx> wrote:

Hi Edgar,

Apparently email to emfajardohernandez@xxxxxxxxxxxxxxxx from UW-Madison gets bounced.  So I am trying to contact you via my GMail account.

Below is the information about the bounce, followed by the answer to the question you posted on HTCondor Users.

Please follow up on HTCondor Users... or email me at tannenba@xxxxxxxxxxx... I don't ever read mail coming into my Gmail account.

regards,
Todd


This report relates to a message you sent with the following header fields:

  Message-id: <0c4e345d-8bc1-57c4-0be2-cef1adc674e0@xxxxxxxxxxx>
  Date: Fri, 23 Feb 2018 13:32:39 -0600
  From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
  To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>,
   Edgar M Fajardo Hernandez <emfajardohernandez@xxxxxxxxxxxxxxxx>
  Subject: Re: [HTCondor-users] Starter not using sharedPort when condor_tail

Your message cannot be delivered to the following recipients:

  Recipient address: emfajardohernandez@xxxxxxxxxxxxxxxx
  Reason: Remote SMTP server has rejected address
  Diagnostic code: smtp;550 #5.7.1 Your message was rejected because your sending MTA (144.92.197.222) is blocked.  If you believe this to be in error, please forward this notice to postmaster@xxxxxxxx via alternate means.
  Remote system: dns;inbound.ucsd.edu (TCP|144.92.197.222|55585|132.239.0.122|25) (iport-bcv4-out.ucsd.edu ESMTP)



Reporting-MTA: dns;smtpauth2.wiscmail.wisc.edu (tcp-daemon)
Arrival-date: Fri, 23 Feb 2018 13:32:50 -0600 (CST)

Original-recipient: rfc822;emfajardohernandez@xxxxxxxxxxxxxxxx
Final-recipient: rfc822;emfajardohernandez@xxxxxxxxxxxxxxxx
Action: failed
Status: 5.0.0 (Remote SMTP server has rejected address)
Remote-MTA: dns;inbound.ucsd.edu (TCP|144.92.197.222|55585|132.239.0.122|25)
 (iport-bcv4-out.ucsd.edu ESMTP)
Diagnostic-code: smtp;550 #5.7.1 Your message was rejected because your sending
 MTA (144.92.197.222) is blocked.  If you believe this to be in error,
 please forward this notice to postmaster@xxxxxxxx via alternate means.


Subject: Re: [HTCondor-users] Starter not using sharedPort when condor_tail
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Date: 2/23/2018 1:32 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>, Edgar M Fajardo Hernandez <emfajardohernandez@xxxxxxxxxxxxxxxx>

On 2/23/2018 12:31 PM, Edgar M Fajardo Hernandez wrote:
>
> So my question is: why is condor_starter trying to talk back to my scheduler on port 9859, which of course is not open, instead of using the shared port, which it uses for everything else (except condor_tail)?
>

Welcome to htcondor-users Edgar!!

For the detailed answer, see the HTCondor Manual for knob entry DAEMON_SOCKET_DIR ( http://tinyurl.com/y9bdc3zl )

DAEMON_SOCKET_DIR defaults to $(LOCK)/daemon_sock, which on RPM systems is /var/lock/condor/daemon_sock.
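To see how that default resolves on a particular host, here is a toy model of $(NAME) macro substitution in Python. It is a simplified sketch: the expand helper is hypothetical and ignores most of HTCondor's configuration syntax. On a real host, condor_config_val DAEMON_SOCKET_DIR should report the value actually in effect.

```python
import re

# Matches $(NAME) references only; real HTCondor syntax is far richer.
_MACRO_REF = re.compile(r"\$\((\w+)\)")


def expand(name: str, config: dict, _depth: int = 0) -> str:
    """Recursively substitute $(NAME) references in a toy config dict."""
    if _depth > 20:
        raise ValueError(f"expansion of {name!r} too deep (cycle?)")
    raw = config.get(name, "")  # undefined names become "" in this sketch
    return _MACRO_REF.sub(lambda m: expand(m.group(1), config, _depth + 1), raw)


if __name__ == "__main__":
    rpm_like = {"LOCK": "/var/lock/condor",
                "DAEMON_SOCKET_DIR": "$(LOCK)/daemon_sock"}
    this_host = dict(rpm_like, LOCK="/var/log/condor")
    print(expand("DAEMON_SOCKET_DIR", rpm_like))   # /var/lock/condor/daemon_sock
    print(expand("DAEMON_SOCKET_DIR", this_host))  # /var/log/condor/daemon_sock
```

With LOCK set to /var/log/condor, as reported earlier in this thread, the same default would resolve to /var/log/condor/daemon_sock, the directory that turned out not to exist there.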

To get command-line tools like condor_tail to work with incoming connections through shared_port, the tool will need to be able to write to the directory /var/lock/condor/daemon_sock.
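Creating a socket file in a directory requires both write and search (execute) permission on that directory. A small Python check along those lines is sketched below. describe_socket_dir is a hypothetical helper, the default path assumes the RPM layout described above, and it only inspects local permissions, not whether shared_port itself is configured correctly.

```python
import os
import stat
import sys


def describe_socket_dir(path: str) -> dict:
    """Report whether the current user could create socket files in path.

    Adding any entry to a directory needs write and search (execute)
    permission on it; this checks exactly that, nothing HTCondor-specific.
    """
    if not os.path.isdir(path):
        return {"path": path, "exists": False, "usable": False,
                "mode": None, "sticky": False}
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return {
        "path": path,
        "exists": True,
        "usable": os.access(path, os.W_OK | os.X_OK),
        "mode": oct(mode),                    # e.g. '0o1777' after chmod 1777
        "sticky": bool(mode & stat.S_ISVTX),  # the /tmp-style sticky bit
    }


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/var/lock/condor/daemon_sock"
    print(describe_socket_dir(target))
```

Mode 1777 is what option 1 below sets: read, write, and execute for everyone plus the sticky bit, which (as on /tmp) stops users from deleting or renaming entries they do not own.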

So to get it to work with shared_port and CCB, you could:

1. Set permissions on this directory the same way as /tmp (world-writable, with the sticky bit set), so any user on the host can use condor_tail with shared_port+CCB, like so:

   chmod 1777 /var/lock/condor/daemon_sock

Or

2. You could run condor_tail as user 'condor', which should already have permission to write to /var/lock/condor/daemon_sock.

Or

3. You could remove the firewall rules blocking ephemeral ports on your schedd machine, allowing condor_tail to just create its own listen socket and receive the incoming CCB connection without using shared_port.
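Option 3 relies on an ordinary reversed connection: the tool listens on a port the OS picks from the ephemeral range, and the remote side connects back to it. The Python sketch below simulates both roles on localhost with plain TCP sockets. It is not HTCondor's CCB protocol, only the connection pattern, and reverse_connect_demo is a hypothetical name. If a firewall drops inbound connections to ephemeral ports, the connect step on the "starter" side is the one that fails.

```python
import socket
import threading


def reverse_connect_demo(payload: bytes = b"tail data") -> bytes:
    """Simulate a tool that listens and a 'starter' that connects back.

    Plain TCP on 127.0.0.1 only; this is not HTCondor's CCB protocol.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        # Binding to port 0 asks the OS for a free ephemeral port,
        # the kind of port a firewall on the schedd host may block.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(5)
        port = listener.getsockname()[1]

        def starter_side() -> None:
            # The remote side initiates the TCP connection to the tool.
            with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
                conn.sendall(payload)

        worker = threading.Thread(target=starter_side)
        worker.start()
        conn, _addr = listener.accept()
        with conn:
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:  # empty read: the sender closed the connection
                    break
                chunks.append(data)
        worker.join()
    return b"".join(chunks)


if __name__ == "__main__":
    print(reverse_connect_demo())
```

Options 1 and 2 avoid needing such an inbound ephemeral port at all, because the connection arrives through the already-open shared_port and is handed to the tool via a socket in DAEMON_SOCKET_DIR.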


More details in the Manual.

Hope the above helps
Todd