
Re: [HTCondor-users] job router debugging



Hi Jaime,

Thanks, that certainly helped generate more logs, but still didn't turn up anything I could make sense of.

Logs are here:

https://drive.google.com/file/d/1BptGdZeRuUdAR7Dw-Aafs0ipve5Jmwkz/view?usp=sharing

I notice that the final "request" generated an 'errno = 22', which sys/errno.h tells me is an Invalid Argument error (EINVAL), but I'll be darned if I know which argument is meant or which context it comes from.
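For anyone following along, the errno lookup can be reproduced without digging through sys/errno.h. This is a minimal Python sketch, assuming a Linux host (errno numbers are platform-defined; the helper name describe_errno is made up for illustration):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno on this platform."""
    # errno.errorcode maps numeric codes to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable message.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    print(describe_errno(22))
```

This only names the error; it does not say which call in the schedd returned it, so that question remains open.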

--Mike

On 5/10/22 12:04, Jaime Frey wrote:
The job router's job submission attempt is being rejected by the local schedd. There should be an error in the schedd's log. You can try increasing the debug level for the schedd log to get more information:

SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_FULLDEBUG D_SYSCALLS
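With the extra debug flags the schedd log grows quickly, so a small filter can help narrow it down. The following is a rough Python sketch, not an HTCondor tool: it assumes log lines start with the "MM/DD/YY HH:MM:SS" timestamp seen in the job router excerpts in this thread, and the keyword list and the function name suspicious_lines are illustrative guesses only.

```python
import re
from typing import Iterable, Iterator, Optional

# Timestamp prefix as it appears in the log excerpts in this thread.
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")

# Illustrative keywords (an assumption, not an official HTCondor list).
KEYWORDS = ("ERROR", "errno", "Failed")


def suspicious_lines(lines: Iterable[str],
                     job_id: Optional[str] = None) -> Iterator[str]:
    """Yield timestamped lines that contain a keyword or the given job id."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not TIMESTAMP.match(line):
            continue  # skip ClassAd dump lines and other continuation lines
        if any(word in line for word in KEYWORDS):
            yield line
        elif job_id is not None and f"({job_id})" in line:
            yield line
```

To use it, open whatever file SCHEDD_LOG points to on the host and pass the file object in, for example `for line in suspicious_lines(open(path), job_id="2376.0"): print(line)`, where `path` is the log location on your system.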

  - Jaime

On May 6, 2022, at 1:16 PM, Michael Thomas <wart@xxxxxxxxxxx> wrote:

I'm setting up a new LIGO OSG CE, and am having some trouble with the job routing.  I've followed the debugging instructions here:

https://htcondor.com/htcondor-ce/v5/troubleshooting/debugging-tools/

...but I don't seem to get any results that would explain this.

Here's what I've tried so far:

After getting the OSG stack configured, I try to run 'condor_ce_trace ldas-osg-ce.ligo-la.caltech.edu'.  This eventually times out and my job sits idle in the condor-ce queue:

# condor_ce_q

-- Schedd: ldas-osg-ce.ligo-la.caltech.edu : <208.69.128.80:26489?... @ 05/06/22 13:01:15
OWNER          BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
michael.thomas ID: 7        5/5  15:04      _      _      1      1 7.0
michael.thomas ID: 8        5/6  12:36      _      _      1      1 8.0

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended


Checking the job router shows no helpful information:

# condor_ce_q -l 7.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.


Looking in the job router log shows an error when trying to commit:

05/06/22 13:03:04 SECMAN: negotiating security for command 1112.
05/06/22 13:03:04 SECMAN: sending DC_AUTHENTICATE command
05/06/22 13:03:04 SECMAN: sending following classad:
AuthMethods = "FS"
AuthMethodsList = "FS,TOKEN"
Authentication = "YES"
Command = 1112
ConnectSinful = "<10.13.5.58:9618?addrs=10.13.5.58-9618&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=schedd_594076_4d25>"
CryptoMethods = "AES"
CryptoMethodsList = "AES,BLOWFISH,3DES"
Enact = "YES"
Encryption = "YES"
Integrity = "YES"
IssuerKeys = "POOL"
MyRemoteUserName = "condor@xxxxxxxxxxxxxxxxxxxxxxxx"
OutgoingNegotiation = "PREFERRED"
RemoteVersion = "$CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $"
ServerCommandSock = "<208.69.128.80:9619?addrs=208.69.128.80-9619&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=job_router_1090171_6811>"
SessionDuration = "86400"
SessionLease = 3600
Sid = "ldas-osg-ce:594118:1651787992:4432"
Subsystem = "JOB_ROUTER"
TrackState = true
TriedAuthentication = true
TrustDomain = "ldas-condori"
UseSession = "YES"
User = "unauthenticated@unmapped"
ValidCommands = "60004,60012,60021,60052,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,521,74000,507,60007,457,60020,443,441,6,12,5,515,516,519,1111,471"
05/06/22 13:03:04 SECMAN: resume, other side is $CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $, NOT reauthenticating.
05/06/22 13:03:04 SECMAN: about to enable encryption.
05/06/22 13:03:04 CRYPTO: protocol(AES), not clearing StreamCryptoState.
05/06/22 13:03:04 SECMAN: successfully enabled encryption!
05/06/22 13:03:04 SECMAN: about to enable message authenticator with key type 3
05/06/22 13:03:04 SECMAN: because protocal is AES, not using other MAC.
05/06/22 13:03:04 SECMAN: successfully enabled message authenticator!
05/06/22 13:03:04 Getting authenticated user from cached session: unauthenticated@unmapped
05/06/22 13:03:04 SECMAN: startCommand succeeded.
05/06/22 13:03:04 Authorizing server 'unauthenticated@unmapped/10.13.5.58'.
05/06/22 13:03:04 ERROR (schedd ldas-osg-ce.ligo-la.caltech.edu at pool ldas-condor:9618) (2376.0) Failed to commit job submission


But I don't see any matching error in my cluster's central manager logs, nor in the local schedd logs on the host.  The authenticated user showing up as 'unauthenticated@unmapped' seems likely to be part of the problem, but I don't see a matching mapping happening in my central collector logs.
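To make the 'unauthenticated@unmapped' observation easier to check across many sessions, the SECMAN dump can be picked apart mechanically. This is a hedged Python sketch that only handles the simple `Name = value` lines shown in the excerpt above; it is not a real ClassAd parser, and the names parse_dumped_ad and is_unmapped are hypothetical helpers.

```python
import re

# One attribute per line, e.g.:  User = "unauthenticated@unmapped"
ATTR = re.compile(r"^(\w+)\s*=\s*(.*)$")


def parse_dumped_ad(lines):
    """Collect Name = value pairs from a SECMAN dump into a dict of strings."""
    ad = {}
    for raw in lines:
        match = ATTR.match(raw.strip())
        if not match:
            continue  # timestamped log lines do not match and are skipped
        key, value = match.group(1), match.group(2).strip()
        # Strip one pair of surrounding double quotes, if present.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        ad[key] = value
    return ad


def is_unmapped(ad):
    """True if the session's User is the unmapped placeholder identity."""
    return ad.get("User") == "unauthenticated@unmapped"
```

Feeding it the lines between "sending following classad:" and the next timestamped line gives a dict where `ad["User"]`, `ad["AuthMethods"]`, and `ad["AuthMethodsList"]` can be compared side by side. It only reports what the dump says and does not explain why the mapping is missing.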

--Mike
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

