
[HTCondor-users] HTCondor Helm chart



I am attempting to deploy HTCondor on Kubernetes using the official Docker images by [constructing a Helm chart](https://gitlab.com/manning-ncsa/htcondor-helm-chart/). The chart is currently a work-in-progress draft that I published only to solicit feedback. (Don't worry about the password in `templates/secrets.yaml`; there is no ingress to the cluster, and I will replace the password when it matters.)

The three pods are able to start, and I can submit a job via the access point (i.e. the "submit" pod), but the job remains in an Idle state.

```
$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'cd /tmp/htcondor && condor_submit sleep.sub'
Submitting job(s).
1 job(s) submitted to cluster 1.

$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_status'
Name                                   OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@htcondor-worker-64d86c7497-f5wsp LINUX      X86_64 Unclaimed Idle      0.000 1024  0+00:00:00

               Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle

  X86_64/LINUX     1     0       0         1       0          0     0        0      0

         Total     1     0       0         1       0          0     0        0      0

$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_q'

-- Schedd: htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local : <10.42.96.15:38553?... @ 02/05/24 21:55:37
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
condor ID: 1        2/5  21:55      _      _      1      1 1.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

When I look at the CollectorLog on the Central Manager (cm):

```
$ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'less /var/log/condor/CollectorLog'
```

I see a mix of `PERMISSION DENIED` messages, read timeouts, and failed socket writes that look like network failures:

```
$ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'cat /var/log/condor/CollectorLog'
02/05/24 21:54:52 Setting maximum file descriptors to 10240.
02/05/24 21:54:52 ******************************************************
02/05/24 21:54:52 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
02/05/24 21:54:52 ** /usr/sbin/condor_collector
02/05/24 21:54:52 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(2) class=DAEMON(1)
02/05/24 21:54:52 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
02/05/24 21:54:52 ** $CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $
02/05/24 21:54:52 ** $CondorPlatform: x86_64_AlmaLinux8 $
02/05/24 21:54:52 ** PID = 32
02/05/24 21:54:52 ** Log last touched time unavailable (No such file or directory)
02/05/24 21:54:52 ******************************************************
02/05/24 21:54:52 Using config source: /etc/condor/condor_config
02/05/24 21:54:52 Using local config sources:
02/05/24 21:54:52    /etc/condor/config.d/00-htcondor-9.0.config
02/05/24 21:54:52    /etc/condor/config.d/01-env.conf
02/05/24 21:54:52    /etc/condor/config.d/01-misc.conf
02/05/24 21:54:52    /etc/condor/config.d/01-role.conf
02/05/24 21:54:52    /etc/condor/config.d/01-security.conf
02/05/24 21:54:52    /etc/condor/config.d/10-stash-plugin.conf
02/05/24 21:54:52    /etc/condor/condor_config.local
02/05/24 21:54:52 config Macros = 83, Sorted = 83, StringBytes = 2826, TablesBytes = 3084
02/05/24 21:54:52 CLASSAD_CACHING is ENABLED
02/05/24 21:54:52 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
02/05/24 21:54:52 SharedPortEndpoint: waiting for connections to named socket collector
02/05/24 21:54:52 DaemonCore: non-shared command socket at <10.42.96.16:46443?alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local>
02/05/24 21:54:52 Daemoncore: Listening at <0.0.0.0:46443> on TCP (ReliSock) and UDP (SafeSock).
02/05/24 21:54:52 DaemonCore: command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
02/05/24 21:54:52 DaemonCore: private command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
02/05/24 21:54:52 In ViewServer::Init()
02/05/24 21:54:52 In CollectorDaemon::Init()
02/05/24 21:54:52 In ViewServer::Config()
02/05/24 21:54:52 In CollectorDaemon::Config()
02/05/24 21:54:52 COLLECTOR_GETAD_OPTIONS set to fast lazy-parse (0x30)
02/05/24 21:54:52 ABSENT_REQUIREMENTS = None
02/05/24 21:54:52 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
02/05/24 21:54:52 enable: Creating stats hash table
02/05/24 21:54:52 Enabling CCB Server.
02/05/24 21:54:52 Will generate a bootstrap file.
02/05/24 21:54:52 Will generate a new certificate file.
02/05/24 21:54:53 CollectorAd  : Inserting ** "< My Pool - 10.43.222.99@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx >"
02/05/24 21:55:13 condor_read(): timeout reading 5 bytes from collector 10.43.222.99.
02/05/24 21:55:13 IO: Failed to read packet header
02/05/24 21:55:13 SECMAN: no classad from server, failing
02/05/24 21:55:13 ERROR: SECMAN:2007:Failed to end classad message.
02/05/24 21:55:13 Failed to send update to collector 10.43.222.99.
02/05/24 21:55:13 Unable to send UPDATE_COLLECTOR_AD to all configured collectors
02/05/24 21:55:13 condor_write(): Socket closed when trying to write 565 bytes to <10.42.240.0:21038>, fd is 14
02/05/24 21:55:13 Buf::write(): condor_write() failed
02/05/24 21:55:13 SECMAN: Error sending response classad to <10.42.240.0:21038>!
AuthMethods = "FS,TOKEN,PASSWORD"
Authentication = "REQUIRED"
Command = 19
ConnectSinful = "<10.43.222.99:9618>"
CryptoMethods = "AES,BLOWFISH,3DES"
ECDHPublicKey = "BILzoNAGQHRbtX38YiEBurMAY3L9pVl1DUVwY61Buf5GMX9An4MI9lzfxJgH2LUuuSfQfKJj6sXnKEay3zurtc8="
Enact = "NO"
Encryption = "REQUIRED"
Integrity = "REQUIRED"
IssuerKeys = "POOL"
NegotiatedSession = true
NewSession = "YES"
OutgoingNegotiation = "REQUIRED"
ParentUniqueID = "htcondor-cm-0:29:1707170091"
RemoteVersion = "$CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $"
ServerCommandSock = "<10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>"
ServerPid = 32
SessionDuration = "86400"
SessionLease = 3600
Subsystem = "COLLECTOR"
TrustDomain = "htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local"
02/05/24 21:55:13 condor_write(): Socket closed when trying to write 13 bytes to <10.42.96.15:33973>, fd is 13
02/05/24 21:55:13 Buf::write(): condor_write() failed
02/05/24 21:55:13 AUTHENTICATE: handshake failed!
02/05/24 21:55:13 DC_AUTHENTICATE: required authentication of 10.42.96.15 failed: AUTHENTICATE:1002:Failure performing handshake
02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.42.240.0, hostname size = 0, original ip address = 10.42.240.0
02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local >"
02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 49 (UPDATE_NEGOTIATOR_AD), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local >"
02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:13 StartdAd     : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:13 StartdPvtAd  : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:23 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
02/05/24 21:55:23 DC_AUTHENTICATE: Command not authorized, done!
...
```
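Since the denials are buried among the startup and socket messages, here is a rough sketch of a shell helper that condenses them into one line per command, access level, and source host. The function name `summarize_denials` is made up for illustration, and the pattern assumes the `PERMISSION DENIED` line layout seen in the log above; other HTCondor versions may format these lines differently.

```shell
# Sketch only: summarize PERMISSION DENIED entries from a CollectorLog on stdin.
# Prints "<count> <command name> <access level> <source host>" per distinct denial.
# Assumes the line layout from the log excerpt above; lines that deviate from it
# are passed through unchanged rather than dropped.
summarize_denials() {
    grep 'PERMISSION DENIED' \
        | sed -E 's/.*from host ([^ ]+) for command [0-9]+ \(([A-Z_]+)\), access level ([A-Z_]+):.*/\2 \3 \1/' \
        | sort \
        | uniq -c \
        | sed -E 's/^ +//'
}

# Possible usage against the cm pod (no -it, because the output is piped):
# kubectl exec -n htcondor htcondor-cm-0 -- cat /var/log/condor/CollectorLog | summarize_denials
```

Run against the excerpt above, this should report two `QUERY_STARTD_PVT_ADS` denials and one `UPDATE_NEGOTIATOR_AD` denial, all at access level NEGOTIATOR and all from 10.42.240.0.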

My custom configuration should be fully captured in `templates/configmap.yaml`, which is mounted at `/etc/condor/condor_config.local` in each container.

Any guidance on how to proceed with debugging this would be appreciated.