Re: [HTCondor-users] job router debugging



The job router's job submission attempt is being rejected by the local schedd. There should be an error in the schedd's log. You can try increasing the debug level for the schedd log to get more information:

SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_FULLDEBUG D_SYSCALLS
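D_FULLDEBUG makes the schedd log grow much faster, so the rejection message can rotate out before you look for it. As a rough sketch only, a drop-in file like the one below could sit next to the debug setting on the host running the local schedd. The file name and the 100 MB size are assumptions for illustration, not values from this thread; MAX_SCHEDD_LOG is the standard MAX_<SUBSYS>_LOG knob.

```
# /etc/condor/config.d/99-schedd-debug.conf   (hypothetical file name)
# Assumption: the local (non-CE) schedd reads its config from /etc/condor/config.d.

# Let the schedd log grow to about 100 MB (value in bytes) before it rotates,
# so the verbose output around the failed submission is still there to read.
MAX_SCHEDD_LOG = 104857600
```

After changing the configuration, run condor_reconfig on that host so the schedd picks up the new settings, then retry the submission and look in the schedd log for the job ID from the job router error (2376.0 in the log quoted below).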

 - Jaime

> On May 6, 2022, at 1:16 PM, Michael Thomas <wart@xxxxxxxxxxx> wrote:
> 
> I'm setting up a new LIGO OSG CE, and am having some trouble with the job routing.  I've followed the debugging instructions here:
> 
> https://htcondor.com/htcondor-ce/v5/troubleshooting/debugging-tools/
> 
> ...but the results I get don't seem to be explained there.
> 
> Here's what I've tried so far:
> 
> After getting the OSG stack configured, I try to run 'condor_ce_trace ldas-osg-ce.ligo-la.caltech.edu'.  This eventually times out and my job sits idle in the condor-ce queue:
> 
> # condor_ce_q
> 
> -- Schedd: ldas-osg-ce.ligo-la.caltech.edu : <208.69.128.80:26489?... @ 05/06/22 13:01:15
> OWNER          BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
> michael.thomas ID: 7        5/5  15:04      _      _      1      1 7.0
> michael.thomas ID: 8        5/6  12:36      _      _      1      1 8.0
> 
> Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
> Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
> 
> 
> Checking the job router shows no helpful information:
> 
> # condor_ce_q -l 7.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
> 
> Matching jobs against routes to find candidate jobs.
> 
> 
> Looking in the job router log shows an error when trying to commit:
> 
> 05/06/22 13:03:04 SECMAN: negotiating security for command 1112.
> 05/06/22 13:03:04 SECMAN: sending DC_AUTHENTICATE command
> 05/06/22 13:03:04 SECMAN: sending following classad:
> AuthMethods = "FS"
> AuthMethodsList = "FS,TOKEN"
> Authentication = "YES"
> Command = 1112
> ConnectSinful = "<10.13.5.58:9618?addrs=10.13.5.58-9618&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=schedd_594076_4d25>"
> CryptoMethods = "AES"
> CryptoMethodsList = "AES,BLOWFISH,3DES"
> Enact = "YES"
> Encryption = "YES"
> Integrity = "YES"
> IssuerKeys = "POOL"
> MyRemoteUserName = "condor@xxxxxxxxxxxxxxxxxxxxxxxx"
> OutgoingNegotiation = "PREFERRED"
> RemoteVersion = "$CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $"
> ServerCommandSock = "<208.69.128.80:9619?addrs=208.69.128.80-9619&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=job_router_1090171_6811>"
> SessionDuration = "86400"
> SessionLease = 3600
> Sid = "ldas-osg-ce:594118:1651787992:4432"
> Subsystem = "JOB_ROUTER"
> TrackState = true
> TriedAuthentication = true
> TrustDomain = "ldas-condori"
> UseSession = "YES"
> User = "unauthenticated@unmapped"
> ValidCommands = "60004,60012,60021,60052,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,521,74000,507,60007,457,60020,443,441,6,12,5,515,516,519,1111,471"
> 05/06/22 13:03:04 SECMAN: resume, other side is $CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $, NOT reauthenticating.
> 05/06/22 13:03:04 SECMAN: about to enable encryption.
> 05/06/22 13:03:04 CRYPTO: protocol(AES), not clearing StreamCryptoState.
> 05/06/22 13:03:04 SECMAN: successfully enabled encryption!
> 05/06/22 13:03:04 SECMAN: about to enable message authenticator with key type 3
> 05/06/22 13:03:04 SECMAN: because protocal is AES, not using other MAC.
> 05/06/22 13:03:04 SECMAN: successfully enabled message authenticator!
> 05/06/22 13:03:04 Getting authenticated user from cached session: unauthenticated@unmapped
> 05/06/22 13:03:04 SECMAN: startCommand succeeded.
> 05/06/22 13:03:04 Authorizing server 'unauthenticated@unmapped/10.13.5.58'.
> 05/06/22 13:03:04 ERROR (schedd ldas-osg-ce.ligo-la.caltech.edu at pool ldas-condor:9618) (2376.0) Failed to commit job submission
> 
> 
> But I don't see any matching error in my cluster's central manager logs, nor in the local schedd logs on the host.  The authenticated user showing up as 'unauthenticated@unmapped' seems likely to be part of the problem, but I don't see a corresponding mapping attempt in my central collector logs.
> 
> --Mike