
Re: [HTCondor-users] HTCondor Helm chart



Thanks for working with me yesterday to get this working, Greg.

For others who may have a similar problem, I rebased my git repo for clarity and have a basic functioning example tagged "v1" at https://gitlab.com/manning-ncsa/htcondor-helm-chart/-/tree/v1.

There were several configuration options I had to change from defaults in order to get the central manager, access point, and worker communicating:


```
USE_POOL_PASSWORD=yes
ALLOW_WRITE=*
ALLOW_NEGOTIATOR=*
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, IDTOKENS, PASSWORD
```

where the pool password is generated randomly by this version of the Helm chart at deployment time and installed to the default location (`/etc/condor/passwords.d/POOL`) by the [container initialization scripts](https://github.com/htcondor/htcondor/blob/main/build/docker/services/base/update-secrets).
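
To confirm that the authorization and authentication settings above took effect, one option is to ask each pod for its effective configuration with `condor_config_val`. This is only a sketch. The helper name `show_security_config` and the `KUBECTL` override are my own illustration and not part of the chart. The `htcondor` namespace and the pod names are the ones used elsewhere in this thread.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run that only
# prints the arguments instead of contacting a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Print the effective value of each setting inside one pod.
# The helper name and argument order are illustrative assumptions.
show_security_config() {
  pod="$1"
  for knob in ALLOW_WRITE ALLOW_NEGOTIATOR SEC_DEFAULT_AUTHENTICATION_METHODS; do
    "$KUBECTL" exec -n htcondor "$pod" -- condor_config_val "$knob"
  done
}

# Example (pod names as they appear in this thread):
#   show_security_config htcondor-cm-0
#   show_security_config htcondor-submit-0
```

The worker pod is managed by a Deployment, so its name changes between rollouts and has to be looked up first (for example with `kubectl get pods -n htcondor`).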

Additionally, as I noted in the README, you need to submit jobs as user `submituser` (UID 1000) like so:

```
kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c ' \
  runuser submituser bash -c " \
    cd /tmp/sleep_test && condor_submit sleep.sub \
  "'

```
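
For readers who want to reproduce the test without the repository at hand, a minimal submit description for a sleep job could look like the sketch below. The contents are an assumption on my part and are not copied from the `sleep.sub` in the chart repository. The helper name `write_sleep_sub` is likewise hypothetical.

```shell
# Write a minimal, illustrative sleep.sub into the given directory.
# The directory must be writable by the submitting user, because the
# log, output, and error files are created relative to it.
write_sleep_sub() {
  dir="$1"
  mkdir -p "$dir"
  cat > "$dir/sleep.sub" <<'EOF'
executable = /bin/sleep
arguments  = 60
log        = sleep.log
output     = sleep.out
error      = sleep.err
queue
EOF
}

# Example: write_sleep_sub /tmp/sleep_test
```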




On 2/6/24 11:10, Daues, Gregory Edward wrote:

It can be challenging to debug a fresh setup, but I often start like so:
if a job sits in Idle after submission, I check the NegotiatorLog
to see whether there is a match. If there is a match, then I expect the two sides,
the Schedd on the schedd pod (writing SchedLog) and the Startd on the worker
(writing StartLog), to be logging information about their side of the match,
perhaps describing communication or authentication issues that may be
occurring.
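
The log checks described above can be sketched as a small helper. This is an illustration only. The helper name `tail_condor_log` and the `KUBECTL` override are hypothetical. The `htcondor` namespace, the pod names, and the `/var/log/condor` log directory are taken from the original message quoted below, and I am assuming the other daemon logs sit in that same directory.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run that only
# prints the arguments instead of contacting a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Show the last 50 lines of one HTCondor daemon log inside a pod.
tail_condor_log() {
  pod="$1"
  log_name="$2"
  "$KUBECTL" exec -n htcondor "$pod" -- tail -n 50 "/var/log/condor/$log_name"
}

# 1. Central manager: did the negotiator find a match?
#      tail_condor_log htcondor-cm-0 NegotiatorLog
# 2. Access point: what does the schedd say about the match?
#      tail_condor_log htcondor-submit-0 SchedLog
# 3. Worker: what does the startd say? (substitute the current worker pod name)
#      tail_condor_log htcondor-worker-64d86c7497-f5wsp StartLog
```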

A general item I have observed with Kubernetes setups: which user
is submitting the job and expected to run it on the worker? That is,
does a suitable user exist in both the schedd and worker pods?
It looks like the 'condor' user submitted the job below;
I am not sure whether that is a suitable user for running a job, so others would have to comment on that.

               Greg 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of T. Andrew Manning <manninga@xxxxxxxxxxxx>
Sent: Monday, February 5, 2024 4:31 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor Helm chart
 
I am attempting to deploy HTCondor on Kubernetes using the official
Docker images by [constructing a Helm
chart](https://gitlab.com/manning-ncsa/htcondor-helm-chart/). Currently
this Helm chart is a work-in-progress draft that I published only to
solicit feedback. (Don't worry about the password in
`templates/secrets.yaml`; there is no ingress to the cluster and I will
replace the password when it matters.)

The three pods are able to start, and I can submit a job via the access
point (i.e. the "submit" pod), but the job remains in an Idle state.

```
     $ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'cd /tmp/htcondor && condor_submit sleep.sub'
     Submitting job(s).
     1 job(s) submitted to cluster 1.

     $ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_status'
     Name                                   OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

     slot1@htcondor-worker-64d86c7497-f5wsp LINUX      X86_64 Unclaimed Idle      0.000 1024  0+00:00:00

                     Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle

     X86_64/LINUX     1     0       0         1       0          0     0        0      0

            Total     1     0       0         1       0          0     0        0      0

     $ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_q'

     -- Schedd: htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local : <10.42.96.15:38553?... @ 02/05/24 21:55:37
     OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
     condor ID: 1        2/5  21:55      _      _      1      1 1.0

     Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
     Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

When I look at the CollectorLog on the Central Manager (cm):

```
     $ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'less /var/log/condor/CollectorLog'
```

I see a mix of PERMISSION DENIED messages and messages about
communication timeouts and failed writes that look like network failures:

```
     $ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'cat /var/log/condor/CollectorLog'
     02/05/24 21:54:52 Setting maximum file descriptors to 10240.
     02/05/24 21:54:52 ******************************************************
     02/05/24 21:54:52 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
     02/05/24 21:54:52 ** /usr/sbin/condor_collector
     02/05/24 21:54:52 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(2) class=DAEMON(1)
     02/05/24 21:54:52 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
     02/05/24 21:54:52 ** $CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $
     02/05/24 21:54:52 ** $CondorPlatform: x86_64_AlmaLinux8 $
     02/05/24 21:54:52 ** PID = 32
     02/05/24 21:54:52 ** Log last touched time unavailable (No such file or directory)
     02/05/24 21:54:52 ******************************************************
     02/05/24 21:54:52 Using config source: /etc/condor/condor_config
     02/05/24 21:54:52 Using local config sources:
     02/05/24 21:54:52    /etc/condor/config.d/00-htcondor-9.0.config
     02/05/24 21:54:52    /etc/condor/config.d/01-env.conf
     02/05/24 21:54:52    /etc/condor/config.d/01-misc.conf
     02/05/24 21:54:52    /etc/condor/config.d/01-role.conf
     02/05/24 21:54:52    /etc/condor/config.d/01-security.conf
     02/05/24 21:54:52    /etc/condor/config.d/10-stash-plugin.conf
     02/05/24 21:54:52    /etc/condor/condor_config.local
     02/05/24 21:54:52 config Macros = 83, Sorted = 83, StringBytes = 2826, TablesBytes = 3084
     02/05/24 21:54:52 CLASSAD_CACHING is ENABLED
     02/05/24 21:54:52 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
     02/05/24 21:54:52 SharedPortEndpoint: waiting for connections to named socket collector
     02/05/24 21:54:52 DaemonCore: non-shared command socket at <10.42.96.16:46443?alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local>
     02/05/24 21:54:52 Daemoncore: Listening at <0.0.0.0:46443> on TCP (ReliSock) and UDP (SafeSock).
     02/05/24 21:54:52 DaemonCore: command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
     02/05/24 21:54:52 DaemonCore: private command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
     02/05/24 21:54:52 In ViewServer::Init()
     02/05/24 21:54:52 In CollectorDaemon::Init()
     02/05/24 21:54:52 In ViewServer::Config()
     02/05/24 21:54:52 In CollectorDaemon::Config()
     02/05/24 21:54:52 COLLECTOR_GETAD_OPTIONS set to fast lazy-parse (0x30)
     02/05/24 21:54:52 ABSENT_REQUIREMENTS = None
     02/05/24 21:54:52 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
     02/05/24 21:54:52 enable: Creating stats hash table
     02/05/24 21:54:52 Enabling CCB Server.
     02/05/24 21:54:52 Will generate a bootstrap file.
     02/05/24 21:54:52 Will generate a new certificate file.
     02/05/24 21:54:53 CollectorAd  : Inserting ** "< My Pool - 10.43.222.99@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx >"
     02/05/24 21:55:13 condor_read(): timeout reading 5 bytes from collector 10.43.222.99.
     02/05/24 21:55:13 IO: Failed to read packet header
     02/05/24 21:55:13 SECMAN: no classad from server, failing
     02/05/24 21:55:13 ERROR: SECMAN:2007:Failed to end classad message.
     02/05/24 21:55:13 Failed to send update to collector 10.43.222.99.
     02/05/24 21:55:13 Unable to send UPDATE_COLLECTOR_AD to all configured collectors
     02/05/24 21:55:13 condor_write(): Socket closed when trying to write 565 bytes to <10.42.240.0:21038>, fd is 14
     02/05/24 21:55:13 Buf::write(): condor_write() failed
     02/05/24 21:55:13 SECMAN: Error sending response classad to <10.42.240.0:21038>!
     AuthMethods = "FS,TOKEN,PASSWORD"
     Authentication = "REQUIRED"
     Command = 19
     ConnectSinful = "<10.43.222.99:9618>"
     CryptoMethods = "AES,BLOWFISH,3DES"
     ECDHPublicKey = "BILzoNAGQHRbtX38YiEBurMAY3L9pVl1DUVwY61Buf5GMX9An4MI9lzfxJgH2LUuuSfQfKJj6sXnKEay3zurtc8="
     Enact = "NO"
     Encryption = "REQUIRED"
     Integrity = "REQUIRED"
     IssuerKeys = "POOL"
     NegotiatedSession = true
     NewSession = "YES"
     OutgoingNegotiation = "REQUIRED"
     ParentUniqueID = "htcondor-cm-0:29:1707170091"
     RemoteVersion = "$CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $"
     ServerCommandSock = "<10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>"
     ServerPid = 32
     SessionDuration = "86400"
     SessionLease = 3600
     Subsystem = "COLLECTOR"
     TrustDomain = "htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local"
     02/05/24 21:55:13 condor_write(): Socket closed when trying to write 13 bytes to <10.42.96.15:33973>, fd is 13
     02/05/24 21:55:13 Buf::write(): condor_write() failed
     02/05/24 21:55:13 AUTHENTICATE: handshake failed!
     02/05/24 21:55:13 DC_AUTHENTICATE: required authentication of 10.42.96.15 failed: AUTHENTICATE:1002:Failure performing handshake
     02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
     02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.42.240.0, hostname size = 0, original ip address = 10.42.240.0
     02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
     02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
     02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local >"
     02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 49 (UPDATE_NEGOTIATOR_AD), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
     02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
     02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local >"
     02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-worker-64d86c7497-f5wsp >"
     02/05/24 21:55:13 StartdAd     : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
     02/05/24 21:55:13 StartdPvtAd  : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
     02/05/24 21:55:23 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
     02/05/24 21:55:23 DC_AUTHENTICATE: Command not authorized, done!
     ...
```
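
With a long log like this, it can help to condense the denials into a count per command and access level. The filter below is a sketch under two assumptions: each PERMISSION DENIED entry occupies a single line in the actual log file (the line breaks in this email come from mail wrapping), and the entries follow the `for command N (NAME), access level LEVEL:` shape seen above. The function name `summarize_denials` is hypothetical.

```shell
# Read a CollectorLog on stdin and print one line per
# (command name, access level) pair, most frequent first.
summarize_denials() {
  grep 'PERMISSION DENIED' |
    sed -E 's/.*for command [0-9]+ \(([A-Z_]+)\), access level ([A-Z_]+):.*/\1 \2/' |
    sort | uniq -c | sort -rn
}

# Example:
#   kubectl exec -n htcondor htcondor-cm-0 -- cat /var/log/condor/CollectorLog | summarize_denials
```

In the excerpt above every denial is at access level NEGOTIATOR, which is consistent with the `ALLOW_NEGOTIATOR` setting being one of the changes that resolved the problem.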

My custom configuration should be fully captured in
`templates/configmap.yaml` which is mounted to
`/etc/condor/condor_config.local` in each container.

Any assistance in how to proceed with debugging this would be appreciated.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://urldefense.com/v3/__https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users__;!!DZ3fjg!8qRLnBvfwHx0c_Vy3vS-BntGVdLyaBmpVPJWmngwwV9s0_6-mnATdFAu9EFv8Q-ZLLHzErD4KwMBzbVmVRI7wQ$

The archives can be found at:
https://urldefense.com/v3/__https://lists.cs.wisc.edu/archive/htcondor-users/__;!!DZ3fjg!8qRLnBvfwHx0c_Vy3vS-BntGVdLyaBmpVPJWmngwwV9s0_6-mnATdFAu9EFv8Q-ZLLHzErD4KwMBzbWew4iQWg$
