
[HTCondor-users] Starter not using sharedPort when condor_tail



Dear HTCondor Experts,


I am trying to figure out a problem. I have a schedd (with shared port enabled) and some pilots (glideinWMS) running in the wild. I would like to be able to run condor_tail on the jobs from the schedd host. But when I do:


condor_tail -debug -stderr 7900859.0


02/22/18 11:21:47 Got connect info for starter <169.228.132.139:43297?CCBID=169.228.130.23:9654%3faddrs%3d169.228.130.23-9654#322198&PrivNet=sdsc-40.t2.ucsd.edu&addrs=169.228.132.139-43297&noUDP>
02/22/18 11:21:47 Requesting GoAhead from the transfer queue manager.
02/22/18 11:21:47 Received GoAhead from the transfer queue manager.
02/22/18 11:21:47 IPVERIFY: checking glidein-collector.t2.ucsd.edu against 169.228.130.23
02/22/18 11:21:47 IPVERIFY: matched 169.228.130.23 to 169.228.130.23
02/22/18 11:21:47 IPVERIFY: ip found is 1
02/22/18 11:21:47 CCBClient: received failure message from CCB server collector 169.228.130.23:9654?addrs=169.228.130.23-9654 in response to request for reversed connection to starter at <169.228.132.139:43297>: failed to connect
02/22/18 11:21:47 Failed to reverse connect to starter at <169.228.132.139:43297> via CCB.
Failed to peek at file from starter: Failed to connect to starter
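
As a side note, the starter address on the first debug line can be unpacked to see which CCB broker and private network name it advertises. The following is only a rough sketch: parse_sinful is a hypothetical helper (not part of HTCondor), and it assumes the address is a host:port pair followed by URL-style, percent-encoded query parameters.

```python
from urllib.parse import parse_qsl


def parse_sinful(sinful):
    """Split an address of the form <host:port?key=value&...> into parts.

    Hypothetical helper for illustration only; assumes URL-style parameters.
    """
    body = sinful.strip().lstrip("<").rstrip(">")
    hostport, _, query = body.partition("?")
    host, _, port = hostport.rpartition(":")
    # keep_blank_values keeps flags without a value, such as noUDP
    params = dict(parse_qsl(query, keep_blank_values=True))
    return host, int(port), params


# Starter address copied from the condor_tail debug output above
starter = (
    "<169.228.132.139:43297?CCBID=169.228.130.23:9654"
    "%3faddrs%3d169.228.130.23-9654#322198"
    "&PrivNet=sdsc-40.t2.ucsd.edu&addrs=169.228.132.139-43297&noUDP>"
)

host, port, params = parse_sinful(starter)
print("host:", host)
print("port:", port)
for key, value in params.items():
    print(key, "=", value)
```

The CCBID value decodes to the collector at 169.228.130.23:9654, which is the same CCB server that reports the failure a few lines later.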


If I check the starter of the pilot:

02/22/18 11:21:47 (pid:1476931) CCBClient: WARNING: trying to connect to daemon at <169.228.130.74:9615> via CCB, but this appears to be a connection from one private network to another, which is not supported by CCB.  Either that, or you have not configured the private network name to be the same in these two networks when it really should be.  Assuming the latter.
02/22/18 11:21:47 (pid:1476931) attempt to connect to <169.228.130.74:9859> failed: No route to host (connect errno = 113).
02/22/18 11:21:47 (pid:1476931) CCBListener: failed to create reversed connection for request id 29242 to <169.228.130.74:9859>: failed to connect

With the important line being:

02/22/18 11:21:47 (pid:1476931) attempt to connect to <169.228.130.74:9859> failed: No route to host (connect errno = 113).

So my question is: why is condor_starter trying to talk back to my schedd on port 9859, which of course is not open, instead of using the shared port, which it uses for everything else (except condor_tail)?

I tried setting CCB_ADDRESS on the schedd to match the starter's, but that did not help.

Any help is appreciated,


Edgar M Fajardo Hernandez

