
[HTCondor-users] CCB failing, causing jobs to stay IDLE forever



Hi,

 

I have an HTCondor cluster (CM + submit + 2 execute nodes) running version 10.0.3 on the company network.

When I try to submit jobs to it from an execute node on that same network, everything looks fine.

I also have some submit nodes running inside a Kubernetes cluster. When submitting jobs from there, they stay IDLE forever.

Before going further, I should point out that we had this working before: we accidentally installed 10.4.0 instead of 10.0.3, and it only started failing after switching to the correct version.

I don’t believe the configuration files changed along with the version, but I can’t guarantee it either.

 

In the NegotiatorLog I can see this at every cycle:

05/26/23 14:50:08 ---------- Started Negotiation Cycle ----------

05/26/23 14:50:08 Phase 1:  Obtaining ads from collector ...

05/26/23 14:50:08   Getting startd private ads ...

05/26/23 14:50:08   Getting Scheduler, Submitter and Machine ads ...

05/26/23 14:50:08   Sorting 9 ads ...

05/26/23 14:50:08 Got ads: 9 public and 8 private

05/26/23 14:50:08 Public ads include 1 submitter, 8 startd

05/26/23 14:50:08 Phase 2:  Performing accounting ...

05/26/23 14:50:08 Phase 3:  Sorting submitter ads by priority ...

05/26/23 14:50:08 Starting prefetch round; 1 potential prefetches to do.

05/26/23 14:50:08 Starting prefetch negotiation for my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.

05/26/23 14:50:08     Got NO_MORE_JOBS;  schedd has no more requests

05/26/23 14:50:08 Prefetch summary: 1 attempted, 1 successful.

05/26/23 14:50:08 Phase 4.1:  Negotiating with schedds ...

05/26/23 14:50:08   Negotiating with my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx at <172.17.0.15:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4403&PrivNet=condor-submit.kubernetes.cluster.local&addrs=172.17.0.15-9618&alias=condor-submit.kubernetes.cluster.local&noUDP&sock=schedd_63_1af6>

05/26/23 14:50:08 0 seconds so far for this submitter

05/26/23 14:50:08 0 seconds so far for this schedd

05/26/23 14:50:08     Request 00001.00000: autocluster 1 (request count 1 of 1)

05/26/23 14:50:08       Matched 1.0 my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <172.17.0.15:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4403&PrivNet=condor-submit.kubernetes.cluster.local&addrs=172.17.0.15-9618&alias=condor-submit.kubernetes.cluster.local&noUDP&sock=schedd_63_1af6> preempting none <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 14:50:08       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 14:50:08  negotiateWithGroup resources used submitterAds length 0

05/26/23 14:50:08 ---------- Finished Negotiation Cycle ----------

 

In the ScheddLog:

05/26/23 12:53:28 (pid:102) Number of Active Workers 0

05/26/23 12:53:38 (pid:102) Number of Active Workers 0

05/26/23 12:53:49 (pid:102) Number of Active Workers 0

05/26/23 12:53:59 (pid:102) Number of Active Workers 0

05/26/23 12:54:08 (pid:102) Activity on stashed negotiator socket: <10.1.65.126:9618>

05/26/23 12:54:08 (pid:102) Using negotiation protocol: NEGOTIATE

05/26/23 12:54:08 (pid:102) Negotiating for owner: my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 12:54:08 (pid:102) Finished sending rrls to negotiator

05/26/23 12:54:08 (pid:102) Finished sending RRL for my-user

05/26/23 12:54:08 (pid:102) Activity on stashed negotiator socket: <10.1.65.126:9618>

05/26/23 12:54:08 (pid:102) Using negotiation protocol: NEGOTIATE

05/26/23 12:54:08 (pid:102) Negotiating for owner: my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 12:54:08 (pid:102) SECMAN: removing lingering non-negotiated security session <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d>#1685101045#1 because it conflicts with new request

05/26/23 12:54:08 (pid:102) CCBClient: WARNING: trying to connect to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user via CCB, but this appears to be a connection from one private network to another, which is not supported by CCB.  Either that, or you have not configured the private network name to be the same in these two networks when it really should be.  Assuming the latter.

05/26/23 12:54:08 (pid:102) Negotiation ended - 1 jobs matched

05/26/23 12:54:08 (pid:102) Finished negotiating for my-user in local pool: 1 matched, 0 rejected

05/26/23 12:54:09 (pid:102) Number of Active Workers 0

05/26/23 12:54:11 (pid:102) CCBClient: received failure message from CCB server 10.1.65.126:9618?addrs=10.1.65.126-9618&alias=condor-cm.my-company.com&noUDP&sock=collector in response to (non-blocking) request for reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user: failed to connect

05/26/23 12:54:11 (pid:102) CCBClient: no more CCB servers to try for requesting reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user; giving up.

05/26/23 12:54:11 (pid:102) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user failed.

05/26/23 12:54:11 (pid:102) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user, 1.0) deleted

 

In the CollectorLog:

05/26/23 14:57:00 Got QUERY_STARTD_ADS

05/26/23 14:57:00 QueryWorker: forked new worker with id 63152 ( max 4 active 1 pending 0 )

05/26/23 14:57:00 WARNING: forward resolution of localhost6 doesn't match 127.0.0.1!

05/26/23 14:57:00 WARNING: forward resolution of localhost6.localdomain6 doesn't match 127.0.0.1!

05/26/23 14:57:00 (Sending 8 ads in response to query)

05/26/23 14:57:00 Query info: matched=8; skipped=0; query_time=0.002500; send_time=0.000840; type=Machine; requirements={true}; locate=0; limit=0; from=TOOL; peer=<127.0.0.1:38076>; projection={Activity Arch CondorLoadAvg EnteredCurrentActivity LastHeardFrom Machine Memory MyCurrentTime Name OpSys State}; filter_private_attrs=1

 

In the StartLog on the worker nodes: literally nothing, no error, no warning.

 

I suspect something is misconfigured somewhere, but I can’t find what it is. CCB seems to be the reason my jobs never start: the schedd can’t connect to the startd daemons outside of the Kubernetes cluster.

What I don’t get is that the schedd seems to connect to the CM just fine, and that one is outside of the Kubernetes cluster as well.
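In case it helps, here is a sketch of the knobs I believe are relevant on each side (illustrative only, not our actual config; the host names are the placeholders from the logs above). My understanding from the CCBClient warning is that CCB refuses to broker a connection when both endpoints advertise a private network name but the names differ, which matches what the logs show (PrivNet=condor-submit.kubernetes.cluster.local on the schedd vs. PrivNet=condor-node1.my-company.com on the startd):

```
# Check what each side currently advertises (run on the schedd pod and on an
# execute node):
#   condor_config_val PRIVATE_NETWORK_NAME PRIVATE_NETWORK_INTERFACE CCB_ADDRESS

# Sketch of what I'd expect, assuming all daemons should reach each other
# through the CM's collector acting as the CCB broker:
CCB_ADDRESS = condor-cm.my-company.com:9618

# PRIVATE_NETWORK_NAME should only be set on hosts that genuinely share a
# private network with each other; hosts with the same name attempt direct
# connections instead of going through CCB. Leaving it unset (or setting it
# differently per network) forces brokered connections.
# PRIVATE_NETWORK_NAME =
```

If anyone can confirm whether having PRIVATE_NETWORK_NAME set on only one of the two sides (or set to different values) would produce exactly this warning, that would already narrow things down.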

 

Thanks,

 

Gaëtan

