
[HTCondor-users] Condor_ssh_to_job not working with Shared Port across WAN



Submitting to the correct list


Edgar M Fajardo Hernandez



Begin forwarded message:

From: Edgar M Fajardo Hernandez <emfajardohernandez@xxxxxxxxxxxxxxxx>
Subject: Condor_ssh_to_job not working with Shared Port across WAN
Date: October 10, 2018 at 1:30:35 PM PDT

Dear HTCondor Devs,

It was my understanding that with the latest HTCondor, 8.7.9, it would be possible to run condor_ssh_to_job across the WAN and into the Singularity container. However, I ran some tests over a very low-latency link (both sites are in SoCal):

With the submit host at UC Irvine and the compute node at UCSD, I ran into this:

[1324] dantrim@uclhc-1 ~$ condor_ssh_to_job -debug 856006.142
10/10/18 13:24:27 SharedPortClient: sent connection request to schedd at <192.5.19.13:9615> for shared port id 1256425_f007_4
10/10/18 13:24:27 SharedPortClient: sent connection request to local schedd for shared port id 1256425_f007_4
10/10/18 13:24:27 Response for GET_JOB_CONNECT_INFO:
StarterIpAddr = "<169.228.132.166:2574?CCBID=169.228.130.106:9647%3faddrs%3d169.228.130.106-9647+[--1]-9647#346&PrivNet=sdsc-67.t2.ucsd.edu&addrs=169.228.132.166-2574&noUDP>"
Result = true
ServerTime = 1539203067
CondorVersion = "$CondorVersion: 8.7.9 Jul 31 2018 BuildID: 446081 $"

10/10/18 13:24:27 Got connect info for starter <169.228.132.166:2574?CCBID=169.228.130.106:9647%3faddrs%3d169.228.130.106-9647+[--1]-9647#346&PrivNet=sdsc-67.t2.ucsd.edu&addrs=169.228.132.166-2574&noUDP>
10/10/18 13:24:27 No shared_port cookie available; will fall back to using on-disk $(DAEMON_SOCKET_DIR)
10/10/18 13:24:30 CCBClient: received failure message from CCB server collector 169.228.130.106:9647?addrs=169.228.130.106-9647+[--1]-9647 in response to request for reversed connection to starter at <169.228.132.166:2574>: failed to connect
10/10/18 13:24:30 Failed to reverse connect to starter at <169.228.132.166:2574> via CCB.
Failed to connect to starter
10/10/18 13:24:30 Attempting to remove /tmp/dantrim.condor_ssh_to_job_3a5314be as unknown user
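
The StarterIpAddr value above packs several fields into one address string. As a rough sketch only (the helper name parse_sinful is hypothetical and this is not HTCondor's own parser), the string can be split apart in Python to make the CCBID and PrivNet fields easier to read. It assumes the layout seen in this log: <host:port?key=value&key=value&flag>.

```python
from urllib.parse import unquote

def parse_sinful(sinful):
    """Split an address string of the form <host:port?k=v&k=v&flag>.

    Hypothetical helper for reading the log above; assumes an IPv4 host.
    """
    body = sinful.strip("<>")
    host_port, _, query = body.partition("?")
    host, _, port = host_port.rpartition(":")
    params = {}
    for item in query.split("&"):
        if not item:
            continue
        key, _, value = item.partition("=")
        # Values such as CCBID are percent-encoded (%3f is "?", %3d is "=").
        params[key] = unquote(value)
    return host, int(port), params

starter_addr = (
    "<169.228.132.166:2574?CCBID=169.228.130.106:9647%3faddrs%3d"
    "169.228.130.106-9647+[--1]-9647#346&PrivNet=sdsc-67.t2.ucsd.edu"
    "&addrs=169.228.132.166-2574&noUDP>"
)

host, port, params = parse_sinful(starter_addr)
print("starter:", host, port)
for key, value in params.items():
    print(f"  {key} = {value!r}")
```

The CCBID field names the CCB server (the collector at 169.228.130.106:9647) that the client asks for a reversed connection, which matches the CCBClient lines in the log.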

From the starter side I see:

10/10/18 13:10:00 (pid:269881) Error file: /data1/condor_local/execute/dir_3943051/glide_8CmVd1/execute/dir_269881/.condor_ssh_to_job_1/sshd.log
10/10/18 13:10:00 (pid:269881) Renice expr "0" evaluated to 0
10/10/18 13:10:00 (pid:269881) Using wrapper /data1/condor_local/execute/dir_3943051/glide_8CmVd1/condor_job_wrapper.sh to exec /usr/sbin/sshd -i -e -f /data1/condor_local/execute/dir_3943051/glide_8CmVd1/execute/dir_269881/.condor_ssh_to_job_1/sshd_config
10/10/18 13:10:00 (pid:269881) Running job as user same uid as parent: personal condor
10/10/18 13:10:00 (pid:269881) Create_Process succeeded, pid=323738
10/10/18 13:10:00 (pid:269881) Unable to write into oom_adj file for the starter: (errno=13, Permission denied)
10/10/18 13:10:00 (pid:269881) Process exited, pid=323721, status=0
10/10/18 13:10:00 (pid:269881) unhandled job exit: pid=323721, status=0
10/10/18 13:10:00 (pid:269881) Process exited, pid=323738, status=255
10/10/18 13:12:08 (pid:269881) attempt to connect to <192.5.19.13:26449> failed: No route to host (connect errno = 113).
10/10/18 13:12:08 (pid:269881) CCBListener: failed to create reversed connection for request id 1538 to <192.5.19.13:26449>: failed to connect
10/10/18 13:15:53 (pid:269881) attempt to connect to <192.5.19.13:13429> failed: No route to host (connect errno = 113).
10/10/18 13:15:53 (pid:269881) CCBListener: failed to create reversed connection for request id 1541 to <192.5.19.13:13429>: failed to connect
10/10/18 13:21:42 (pid:269881) attempt to connect to <192.5.19.13:14922> failed: Connection timed out (connect errno = 110).  Will keep trying for 300 total seconds (173 to go).
10/10/18 13:24:30 (pid:269881) attempt to connect to <192.5.19.13:27424> failed: No route to host (connect errno = 113).
10/10/18 13:24:30 (pid:269881) CCBListener: failed to create reversed connection for request id 1547 to <192.5.19.13:27424>: failed to connect
10/10/18 13:24:36 (pid:269881) attempt to connect to <192.5.19.13:14922> failed: Connection timed out (connect errno = 110).
10/10/18 13:24:36 (pid:269881) CCBListener: failed to create reversed connection for request id 1544 to <192.5.19.13:14922>: failed to connect

It seems to me that the starter is not aware that the submit host is using the shared port, since it is trying to connect back to it on ephemeral ports rather than on the shared port, 9615.
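
To make that observation easy to check, here is a minimal Python sketch that pulls the connect-back targets out of the starter log lines quoted above and compares them with the shared port. The regular expression and the function name failed_targets are my own assumptions based on the line format in this log, not an HTCondor tool.

```python
import re

# Shared port of the schedd, taken from the condor_ssh_to_job debug output.
SHARED_PORT = 9615

# Matches starter log lines such as:
#   attempt to connect to <192.5.19.13:26449> failed: No route to host (connect errno = 113).
CONNECT_FAILURE = re.compile(
    r"attempt to connect to <(?P<host>[\d.]+):(?P<port>\d+)> failed: "
    r"(?P<reason>.+?) \(connect errno = (?P<errno>\d+)\)"
)

def failed_targets(log_lines):
    """Return (host, port, errno) for every failed outbound connect in the log."""
    targets = []
    for line in log_lines:
        match = CONNECT_FAILURE.search(line)
        if match:
            targets.append(
                (match["host"], int(match["port"]), int(match["errno"]))
            )
    return targets

# Lines copied from the starter log above.
starter_log = [
    "10/10/18 13:12:08 (pid:269881) attempt to connect to <192.5.19.13:26449> failed: No route to host (connect errno = 113).",
    "10/10/18 13:12:08 (pid:269881) CCBListener: failed to create reversed connection for request id 1538 to <192.5.19.13:26449>: failed to connect",
    "10/10/18 13:15:53 (pid:269881) attempt to connect to <192.5.19.13:13429> failed: No route to host (connect errno = 113).",
    "10/10/18 13:21:42 (pid:269881) attempt to connect to <192.5.19.13:14922> failed: Connection timed out (connect errno = 110).  Will keep trying for 300 total seconds (173 to go).",
    "10/10/18 13:24:30 (pid:269881) attempt to connect to <192.5.19.13:27424> failed: No route to host (connect errno = 113).",
    "10/10/18 13:24:36 (pid:269881) attempt to connect to <192.5.19.13:14922> failed: Connection timed out (connect errno = 110).",
]

targets = failed_targets(starter_log)
ports = sorted({port for _, port, _ in targets})
print("ports tried:", ports)
print("shared port", SHARED_PORT, "tried:", SHARED_PORT in ports)
```

On the lines quoted here, none of the attempted ports is 9615; every attempt targets a different high-numbered port on 192.5.19.13.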

condor_tail shows a similar error:

[1327] dantrim@uclhc-1 ~$ condor_tail -debug 856006.142
10/10/18 13:27:22 Requesting GoAhead from the transfer queue manager.
10/10/18 13:27:22 Received GoAhead from the transfer queue manager.
10/10/18 13:27:22 CCBClient: received failure message from CCB server collector 169.228.130.106:9647?addrs=169.228.130.106-9647+[--1]-9647 in response to request for reversed connection to starter at <169.228.132.166:2574>: failed to connect
10/10/18 13:27:22 Failed to reverse connect to starter at <169.228.132.166:2574> via CCB.
Failed to peek at file from starter: Failed to connect to starter

However it works when I run it as root:

[1328] root@uclhc-1 ~# condor_tail -debug 856006.142
10/10/18 13:28:30 Requesting GoAhead from the transfer queue manager.
10/10/18 13:28:30 Received GoAhead from the transfer queue manager.
flow::Process    **** Processing entry 285000 run 337451 event 1113187971 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 290000 run 337451 event 1123488524 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 295000 run 337451 event 1130301791 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 300000 run 337451 event 1129395720 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 305000 run 337451 event 1136221416 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 310000 run 337451 event 1136345517 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 315000 run 337451 event 1149888120 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 320000 run 337451 event 1148858158 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 325000 run 337451 event 1148294223 ****
ntupler_rj_stop2l    Superflow::Process    **** Processing entry 330000 run 337451 event 1163856902 ****


The submit host is running:

 condor_version 
$CondorVersion: 8.6.12 Aug 06 2018 $
$CondorPlatform: X86_64-CentOS_6.10 $


Any ideas on what to try here?



Edgar M Fajardo Hernandez