[HTCondor-users] job router debugging



I'm setting up a new LIGO OSG CE, and am having some trouble with the job routing. I've followed the debugging instructions here:

https://htcondor.com/htcondor-ce/v5/troubleshooting/debugging-tools/

...but none of the results I get match anything those instructions explain.

Here's what I've tried so far:

After getting the OSG stack configured, I ran 'condor_ce_trace ldas-osg-ce.ligo-la.caltech.edu'. It eventually timed out, and my job now sits idle in the condor-ce queue:

# condor_ce_q

-- Schedd: ldas-osg-ce.ligo-la.caltech.edu : <208.69.128.80:26489?... @ 05/06/22 13:01:15
OWNER          BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
michael.thomas ID: 7        5/5  15:04      _      _      1      1 7.0
michael.thomas ID: 8        5/6  12:36      _      _      1      1 8.0

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended


Checking the job router shows no helpful information:

# condor_ce_q -l 7.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.


Looking in the job router log shows an error when trying to commit:

05/06/22 13:03:04 SECMAN: negotiating security for command 1112.
05/06/22 13:03:04 SECMAN: sending DC_AUTHENTICATE command
05/06/22 13:03:04 SECMAN: sending following classad:
AuthMethods = "FS"
AuthMethodsList = "FS,TOKEN"
Authentication = "YES"
Command = 1112
ConnectSinful = "<10.13.5.58:9618?addrs=10.13.5.58-9618&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=schedd_594076_4d25>"
CryptoMethods = "AES"
CryptoMethodsList = "AES,BLOWFISH,3DES"
Enact = "YES"
Encryption = "YES"
Integrity = "YES"
IssuerKeys = "POOL"
MyRemoteUserName = "condor@xxxxxxxxxxxxxxxxxxxxxxxx"
OutgoingNegotiation = "PREFERRED"
RemoteVersion = "$CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $"
ServerCommandSock = "<208.69.128.80:9619?addrs=208.69.128.80-9619&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=job_router_1090171_6811>"
SessionDuration = "86400"
SessionLease = 3600
Sid = "ldas-osg-ce:594118:1651787992:4432"
Subsystem = "JOB_ROUTER"
TrackState = true
TriedAuthentication = true
TrustDomain = "ldas-condori"
UseSession = "YES"
User = "unauthenticated@unmapped"
ValidCommands = "60004,60012,60021,60052,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,521,74000,507,60007,457,60020,443,441,6,12,5,515,516,519,1111,471"
05/06/22 13:03:04 SECMAN: resume, other side is $CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $, NOT reauthenticating.
05/06/22 13:03:04 SECMAN: about to enable encryption.
05/06/22 13:03:04 CRYPTO: protocol(AES), not clearing StreamCryptoState.
05/06/22 13:03:04 SECMAN: successfully enabled encryption!
05/06/22 13:03:04 SECMAN: about to enable message authenticator with key type 3
05/06/22 13:03:04 SECMAN: because protocal is AES, not using other MAC.
05/06/22 13:03:04 SECMAN: successfully enabled message authenticator!
05/06/22 13:03:04 Getting authenticated user from cached session: unauthenticated@unmapped
05/06/22 13:03:04 SECMAN: startCommand succeeded.
05/06/22 13:03:04 Authorizing server 'unauthenticated@unmapped/10.13.5.58'.
05/06/22 13:03:04 ERROR (schedd ldas-osg-ce.ligo-la.caltech.edu at pool ldas-condor:9618) (2376.0) Failed to commit job submission
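
For anyone wading through a similarly verbose log, here is a minimal Python sketch that pulls out just the cached-session identity and the ERROR lines from an excerpt like the one above. The function name summarize_log is hypothetical, and the two message prefixes it looks for are taken only from this excerpt, so other HTCondor versions or debug levels may word them differently.

```python
import re

# Matches the "MM/DD/YY HH:MM:SS " prefix seen on each daemon log line above.
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")

# Message prefix copied from the excerpt; an assumption for other versions.
SESSION_MARKER = "Getting authenticated user from cached session: "


def summarize_log(lines):
    """Return (identities, errors) found in an iterable of log lines.

    identities: set of users reported for cached security sessions
    errors:     list of full log lines whose message starts with ERROR
    """
    identities = set()
    errors = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not TIMESTAMP.match(line):
            continue  # skip the ClassAd attribute dump (no timestamp)
        message = TIMESTAMP.sub("", line, count=1)
        if message.startswith(SESSION_MARKER):
            identities.add(message[len(SESSION_MARKER):])
        elif message.startswith("ERROR"):
            errors.append(line)
    return identities, errors


# Three lines copied from the excerpt above.
sample = [
    "05/06/22 13:03:04 Getting authenticated user from cached session: unauthenticated@unmapped",
    'User = "unauthenticated@unmapped"',
    "05/06/22 13:03:04 ERROR (schedd ldas-osg-ce.ligo-la.caltech.edu at pool ldas-condor:9618) (2376.0) Failed to commit job submission",
]
identities, errors = summarize_log(sample)
print(identities)
print(errors)
```

To run it against a real log, pass an open file object instead of the sample list; where the job router log lives depends on the site's configuration.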


But I don't see any matching error in my cluster's central manager logs, nor in the local schedd logs on the host. The fact that the authenticated user is 'unauthenticated@unmapped' is likely part of the problem, but I don't see any corresponding mapping attempt in my central collector logs.

--Mike