
Re: [HTCondor-users] CCB failing, causing jobs to stay IDLE forever



Hi Cole,

 

Thanks for the reply!

We found the solution in the end, and it was indeed a configuration issue.

We had mistakenly configured CCB on both nodes (the schedd and the startd), which made HTCondor treat both of them as being on a private network, when in fact only the schedd is. Removing the CCB configuration from the startd fixed it.
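
For reference, the relevant bits of the configuration now look roughly like this (simplified sketch; condor-cm.my-company.com is our central manager, whose collector also acts as the CCB server):

# On the submit node inside the Kubernetes cluster (behind NAT):
CCB_ADDRESS = $(COLLECTOR_HOST)

# On the execute nodes (directly reachable on the company network):
# CCB_ADDRESS is not set at all any more - removing it here was the fix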

 

Many thanks for your help,

 

Gaëtan

 


Gaetan Geffroy
Junior Software Engineer
Terma GmbH
T +49 6151 86005 43 (direct)
 


 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Cole Bollig via HTCondor-users
Sent: Tuesday, May 30, 2023 22:50
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] CCB failing, causing jobs to stay IDLE forever

 


Hi Gaetan,

 

I would agree that this seems like some sort of configuration issue. A good quick starting point, if you haven't done so already, is to run condor_config_val -summary. This will show all of the configuration macros that have been set or changed.
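
For example (illustrative; which knobs to look at depends on your setup):

condor_config_val -summary
condor_config_val CCB_ADDRESS
condor_config_val -dump CCB

The last one dumps every configuration variable whose name contains "CCB", which is handy for spotting a stray CCB setting on a node that should not have one.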

 

-Cole Bollig


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Gaetan Geffroy <gage@xxxxxxxxx>
Sent: Friday, May 26, 2023 8:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Giovanni Scotti <gis@xxxxxxxxx>; Ruediger Gad <ruga@xxxxxxxxx>
Subject: [HTCondor-users] CCB failing, causing jobs to stay IDLE forever

 

Hi,

 

I have an HTCondor cluster (CM + Submit + 2 Execute nodes), running version 10.0.3 on the company network.

When I try to submit jobs to it from an execute node on that same network, everything looks fine.

I also have some submit nodes running inside a Kubernetes cluster. When I try to submit jobs from there, they stay IDLE forever.

Before going further, I should mention that we had it working before, but we had accidentally installed 10.4.0 instead of 10.0.3, and it only started failing after switching to the correct version.

I don't believe the configuration files changed after the version switch, but I can't guarantee it either.

 

In the NegotiatorLog I can see this at every cycle:

05/26/23 14:50:08 ---------- Started Negotiation Cycle ----------

05/26/23 14:50:08 Phase 1:  Obtaining ads from collector ...

05/26/23 14:50:08   Getting startd private ads ...

05/26/23 14:50:08   Getting Scheduler, Submitter and Machine ads ...

05/26/23 14:50:08   Sorting 9 ads ...

05/26/23 14:50:08 Got ads: 9 public and 8 private

05/26/23 14:50:08 Public ads include 1 submitter, 8 startd

05/26/23 14:50:08 Phase 2:  Performing accounting ...

05/26/23 14:50:08 Phase 3:  Sorting submitter ads by priority ...

05/26/23 14:50:08 Starting prefetch round; 1 potential prefetches to do.

05/26/23 14:50:08 Starting prefetch negotiation for my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.

05/26/23 14:50:08     Got NO_MORE_JOBS;  schedd has no more requests

05/26/23 14:50:08 Prefetch summary: 1 attempted, 1 successful.

05/26/23 14:50:08 Phase 4.1:  Negotiating with schedds ...

05/26/23 14:50:08   Negotiating with my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx at <172.17.0.15:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4403&PrivNet=condor-submit.kubernetes.cluster.local&addrs=172.17.0.15-9618&alias=condor-submit.kubernetes.cluster.local&noUDP&sock=schedd_63_1af6>

05/26/23 14:50:08 0 seconds so far for this submitter

05/26/23 14:50:08 0 seconds so far for this schedd

05/26/23 14:50:08     Request 00001.00000: autocluster 1 (request count 1 of 1)

05/26/23 14:50:08       Matched 1.0 my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <172.17.0.15:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4403&PrivNet=condor-submit.kubernetes.cluster.local&addrs=172.17.0.15-9618&alias=condor-submit.kubernetes.cluster.local&noUDP&sock=schedd_63_1af6> preempting none <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 14:50:08       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 14:50:08  negotiateWithGroup resources used submitterAds length 0

05/26/23 14:50:08 ---------- Finished Negotiation Cycle ----------

 

In the ScheddLog:

05/26/23 12:53:28 (pid:102) Number of Active Workers 0

05/26/23 12:53:38 (pid:102) Number of Active Workers 0

05/26/23 12:53:49 (pid:102) Number of Active Workers 0

05/26/23 12:53:59 (pid:102) Number of Active Workers 0

05/26/23 12:54:08 (pid:102) Activity on stashed negotiator socket: <10.1.65.126:9618>

05/26/23 12:54:08 (pid:102) Using negotiation protocol: NEGOTIATE

05/26/23 12:54:08 (pid:102) Negotiating for owner: my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 12:54:08 (pid:102) Finished sending rrls to negotiator

05/26/23 12:54:08 (pid:102) Finished sending RRL for my-user

05/26/23 12:54:08 (pid:102) Activity on stashed negotiator socket: <10.1.65.126:9618>

05/26/23 12:54:08 (pid:102) Using negotiation protocol: NEGOTIATE

05/26/23 12:54:08 (pid:102) Negotiating for owner: my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

05/26/23 12:54:08 (pid:102) SECMAN: removing lingering non-negotiated security session <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d>#1685101045#1 because it conflicts with new request

05/26/23 12:54:08 (pid:102) CCBClient: WARNING: trying to connect to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user via CCB, but this appears to be a connection from one private network to another, which is not supported by CCB.  Either that, or you have not configured the private network name to be the same in these two networks when it really should be.  Assuming the latter.

05/26/23 12:54:08 (pid:102) Negotiation ended - 1 jobs matched

05/26/23 12:54:08 (pid:102) Finished negotiating for my-user in local pool: 1 matched, 0 rejected

05/26/23 12:54:09 (pid:102) Number of Active Workers 0

05/26/23 12:54:11 (pid:102) CCBClient: received failure message from CCB server 10.1.65.126:9618?addrs=10.1.65.126-9618&alias=condor-cm.my-company.com&noUDP&sock=collector in response to (non-blocking) request for reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user: failed to connect

05/26/23 12:54:11 (pid:102) CCBClient: no more CCB servers to try for requesting reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user; giving up.

05/26/23 12:54:11 (pid:102) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user failed.

05/26/23 12:54:11 (pid:102) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user, 1.0) deleted

 

In the CollectorLog:

05/26/23 14:57:00 Got QUERY_STARTD_ADS

05/26/23 14:57:00 QueryWorker: forked new worker with id 63152 ( max 4 active 1 pending 0 )

05/26/23 14:57:00 WARNING: forward resolution of localhost6 doesn't match 127.0.0.1!

05/26/23 14:57:00 WARNING: forward resolution of localhost6.localdomain6 doesn't match 127.0.0.1!

05/26/23 14:57:00 (Sending 8 ads in response to query)

05/26/23 14:57:00 Query info: matched=8; skipped=0; query_time=0.002500; send_time=0.000840; type=Machine; requirements={true}; locate=0; limit=0; from=TOOL; peer=<127.0.0.1:38076>; projection={Activity Arch CondorLoadAvg EnteredCurrentActivity LastHeardFrom Machine Memory MyCurrentTime Name OpSys State}; filter_private_attrs=1

 

In the StartLog on the worker nodes: literally nothing, no error, no warning.

 

I guess it is caused by something that was misconfigured somehow, but I can't find what it is. CCB seems to be the reason why my jobs never start, because the schedd can't connect to the startd daemons outside of the Kubernetes cluster.

What I don't get is that it seems to connect to the CM just fine, and that one is outside of the Kubernetes cluster as well.
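
For what it's worth, commands along these lines should show the addresses the collector advertises for the startd and for the schedd (hostnames taken from the logs above):

condor_status -l condor-node1.my-company.com | grep -i addr
condor_status -schedd -l condor-submit.kubernetes.cluster.local | grep -i addr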

 

Thanks,

 

Gaëtan


Gaetan Geffroy
Junior Software Engineer, Space

Terma GmbH
Europaarkaden II, Bratustraße 7, 64293 Darmstadt, Germany
T +49 6151 86005 43 (direct) · T +49 6151 86005-0
Terma GmbH - Sitz Darmstadt · Handelsregister Nr.: HRB 7411, Darmstadt
Geschäftsführer: Poul Vigh / Steen Vejby Sørensen
www.terma.com
LinkedIn · Twitter · Instagram · YouTube

