On May 6, 2022, at 1:16 PM, Michael Thomas <wart@xxxxxxxxxxx> wrote:
I'm setting up a new LIGO OSG CE, and am having some trouble with the job routing. I've followed the debugging instructions here:
https://htcondor.com/htcondor-ce/v5/troubleshooting/debugging-tools/
...but none of the results seem to explain the problem.
Here's what I've tried so far:
After getting the OSG stack configured, I ran 'condor_ce_trace ldas-osg-ce.ligo-la.caltech.edu'. It eventually times out, and my job sits idle in the HTCondor-CE queue:
# condor_ce_q
-- Schedd: ldas-osg-ce.ligo-la.caltech.edu : <208.69.128.80:26489?... @ 05/06/22 13:01:15
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
michael.thomas ID: 7 5/5 15:04 _ _ 1 1 7.0
michael.thomas ID: 8 5/6 12:36 _ _ 1 1 8.0
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Checking the job router shows no helpful information:
# condor_ce_q -l 7.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
Matching jobs against routes to find candidate jobs.
The job router log shows an error when it tries to commit the job submission:
05/06/22 13:03:04 SECMAN: negotiating security for command 1112.
05/06/22 13:03:04 SECMAN: sending DC_AUTHENTICATE command
05/06/22 13:03:04 SECMAN: sending following classad:
AuthMethods = "FS"
AuthMethodsList = "FS,TOKEN"
Authentication = "YES"
Command = 1112
ConnectSinful = "<10.13.5.58:9618?addrs=10.13.5.58-9618&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=schedd_594076_4d25>"
CryptoMethods = "AES"
CryptoMethodsList = "AES,BLOWFISH,3DES"
Enact = "YES"
Encryption = "YES"
Integrity = "YES"
IssuerKeys = "POOL"
MyRemoteUserName = "condor@xxxxxxxxxxxxxxxxxxxxxxxx"
OutgoingNegotiation = "PREFERRED"
RemoteVersion = "$CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $"
ServerCommandSock = "<208.69.128.80:9619?addrs=208.69.128.80-9619&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=job_router_1090171_6811>"
SessionDuration = "86400"
SessionLease = 3600
Sid = "ldas-osg-ce:594118:1651787992:4432"
Subsystem = "JOB_ROUTER"
TrackState = true
TriedAuthentication = true
TrustDomain = "ldas-condori"
UseSession = "YES"
User = "unauthenticated@unmapped"
ValidCommands = "60004,60012,60021,60052,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,521,74000,507,60007,457,60020,443,441,6,12,5,515,516,519,1111,471"
05/06/22 13:03:04 SECMAN: resume, other side is $CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $, NOT reauthenticating.
05/06/22 13:03:04 SECMAN: about to enable encryption.
05/06/22 13:03:04 CRYPTO: protocol(AES), not clearing StreamCryptoState.
05/06/22 13:03:04 SECMAN: successfully enabled encryption!
05/06/22 13:03:04 SECMAN: about to enable message authenticator with key type 3
05/06/22 13:03:04 SECMAN: because protocal is AES, not using other MAC.
05/06/22 13:03:04 SECMAN: successfully enabled message authenticator!
05/06/22 13:03:04 Getting authenticated user from cached session: unauthenticated@unmapped
05/06/22 13:03:04 SECMAN: startCommand succeeded.
05/06/22 13:03:04 Authorizing server 'unauthenticated@unmapped/10.13.5.58'.
05/06/22 13:03:04 ERROR (schedd ldas-osg-ce.ligo-la.caltech.edu at pool ldas-condor:9618) (2376.0) Failed to commit job submission
But I don't see any matching error in my cluster's central manager logs or in the local schedd logs on the host. The authenticated user showing up as 'unauthenticated@unmapped' is likely part of the problem, but I don't see a corresponding mapping attempt in my central collector logs.
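To make it easier to spot this in a long log, here is a minimal sketch in Python that pulls the 'Name = value' lines out of a SECMAN ClassAd dump and flags a session whose User is 'unauthenticated@unmapped'. The helper names parse_classad_lines and is_unmapped are hypothetical, and the sketch assumes only the line layout shown in the excerpt above; it is not part of HTCondor and does not diagnose why the mapping fails.

```python
import re

# Matches ClassAd-style lines such as: User = "unauthenticated@unmapped"
# Timestamped log lines (e.g. "05/06/22 13:03:04 SECMAN: ...") do not match.
ATTR_RE = re.compile(r'^\s*(\w+)\s*=\s*(.+?)\s*$')


def parse_classad_lines(lines):
    """Collect Name = value pairs from log lines, stripping surrounding quotes."""
    attrs = {}
    for line in lines:
        m = ATTR_RE.match(line)
        if m:
            name, value = m.groups()
            attrs[name] = value.strip('"')
    return attrs


def is_unmapped(attrs):
    """True when the session's User attribute was never mapped to an identity."""
    return attrs.get("User") == "unauthenticated@unmapped"


# A few lines copied from the job router log excerpt above.
sample = [
    '05/06/22 13:03:04 SECMAN: sending following classad:',
    'AuthMethods = "FS"',
    'AuthMethodsList = "FS,TOKEN"',
    'Subsystem = "JOB_ROUTER"',
    'TriedAuthentication = true',
    'User = "unauthenticated@unmapped"',
]

attrs = parse_classad_lines(sample)
if is_unmapped(attrs):
    print("Session for", attrs.get("Subsystem"), "is unmapped; tried", attrs.get("AuthMethodsList"))
```

Running it over the full log file instead of the sample list would only require reading the file's lines and passing them to parse_classad_lines, one ClassAd dump at a time.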
--Mike