
Re: [HTCondor-users] job router debugging



I think a submit requirement is failing. When this happens, the schedd sends a detailed error message to the client, but doesn't write anything to its log. condor_submit prints the error message, but the job router ignores it and just reports that the submission failed.
Can you tell me what, if any, submit requirements you have configured?
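
For reference, submit requirements live in the configuration of the schedd that receives the submission (here, the local schedd the job router submits to). They are declared through SUBMIT_REQUIREMENT_NAMES, with one SUBMIT_REQUIREMENT_<Name> expression per entry. A minimal sketch of what such an entry typically looks like follows; the name MaxMemory and the 8192 MB limit are made up for illustration and are not taken from this site's configuration:

```text
# Hypothetical example only: the requirement name and threshold are assumptions.
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) MaxMemory
# A job is rejected at submit time when this expression does not evaluate to true.
SUBMIT_REQUIREMENT_MaxMemory = RequestMemory <= 8192
# Message sent back to the submitting client when the requirement fails.
SUBMIT_REQUIREMENT_MaxMemory_REASON = "Jobs may not request more than 8 GB of memory."
```

Running condor_config_val SUBMIT_REQUIREMENT_NAMES on the local schedd's host should list whichever requirements are actually configured.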

We'll need to fix the schedd and job router to be more chatty about these failures.

 - Jaime

> On May 10, 2022, at 4:16 PM, Michael Thomas <wart@xxxxxxxxxxx> wrote:
> 
> Hi Jaime,
> 
> Thanks, that certainly helped generate more logs, but still didn't turn up anything I could make sense of.
> 
> Logs are here:
> 
> https://drive.google.com/file/d/1BptGdZeRuUdAR7Dw-Aafs0ipve5Jmwkz/view?usp=sharing
> 
> I notice that the final "request" generated an 'errno = 22', which sys/errno.h tells me is an Invalid Argument error, but I'll be darned if I know which argument it refers to or in which context it was raised.
> 
> --Mike
> 
> On 5/10/22 12:04, Jaime Frey wrote:
>> The job router's job submission attempt is being rejected by the local schedd. There should be an error in the schedd's log. You can try increasing the debug level for the schedd log to get more information:
>> SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_FULLDEBUG D_SYSCALLS
>>  - Jaime
>>> On May 6, 2022, at 1:16 PM, Michael Thomas <wart@xxxxxxxxxxx> wrote:
>>> 
>>> I'm setting up a new LIGO OSG CE, and am having some trouble with the job routing.  I've followed the debugging instructions here:
>>> 
>>> https://htcondor.com/htcondor-ce/v5/troubleshooting/debugging-tools/
>>> 
>>> ...but I don't seem to get any results that those instructions would explain.
>>> 
>>> Here's what I've tried so far:
>>> 
>>> After getting the OSG stack configured, I try to run 'condor_ce_trace ldas-osg-ce.ligo-la.caltech.edu'.  This eventually times out and my job sits idle in the condor-ce queue:
>>> 
>>> # condor_ce_q
>>> 
>>> -- Schedd: ldas-osg-ce.ligo-la.caltech.edu : <208.69.128.80:26489?... @ 05/06/22 13:01:15
>>> OWNER          BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
>>> michael.thomas ID: 7        5/5  15:04      _      _      1      1 7.0
>>> michael.thomas ID: 8        5/6  12:36      _      _      1      1 8.0
>>> 
>>> Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
>>> Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
>>> 
>>> 
>>> Checking the job router shows no helpful information:
>>> 
>>> # condor_ce_q -l 7.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
>>> 
>>> Matching jobs against routes to find candidate jobs.
>>> 
>>> 
>>> Looking in the job router log shows an error when trying to commit:
>>> 
>>> 05/06/22 13:03:04 SECMAN: negotiating security for command 1112.
>>> 05/06/22 13:03:04 SECMAN: sending DC_AUTHENTICATE command
>>> 05/06/22 13:03:04 SECMAN: sending following classad:
>>> AuthMethods = "FS"
>>> AuthMethodsList = "FS,TOKEN"
>>> Authentication = "YES"
>>> Command = 1112
>>> ConnectSinful = "<10.13.5.58:9618?addrs=10.13.5.58-9618&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=schedd_594076_4d25>"
>>> CryptoMethods = "AES"
>>> CryptoMethodsList = "AES,BLOWFISH,3DES"
>>> Enact = "YES"
>>> Encryption = "YES"
>>> Integrity = "YES"
>>> IssuerKeys = "POOL"
>>> MyRemoteUserName = "condor@xxxxxxxxxxxxxxxxxxxxxxxx"
>>> OutgoingNegotiation = "PREFERRED"
>>> RemoteVersion = "$CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $"
>>> ServerCommandSock = "<208.69.128.80:9619?addrs=208.69.128.80-9619&alias=ldas-osg-ce.ligo-la.caltech.edu&noUDP&sock=job_router_1090171_6811>"
>>> SessionDuration = "86400"
>>> SessionLease = 3600
>>> Sid = "ldas-osg-ce:594118:1651787992:4432"
>>> Subsystem = "JOB_ROUTER"
>>> TrackState = true
>>> TriedAuthentication = true
>>> TrustDomain = "ldas-condori"
>>> UseSession = "YES"
>>> User = "unauthenticated@unmapped"
>>> ValidCommands = "60004,60012,60021,60052,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,521,74000,507,60007,457,60020,443,441,6,12,5,515,516,519,1111,471"
>>> 05/06/22 13:03:04 SECMAN: resume, other side is $CondorVersion: 9.0.12 Apr 19 2022 BuildID: 583935 PackageID: 9.0.12-1 $, NOT reauthenticating.
>>> 05/06/22 13:03:04 SECMAN: about to enable encryption.
>>> 05/06/22 13:03:04 CRYPTO: protocol(AES), not clearing StreamCryptoState.
>>> 05/06/22 13:03:04 SECMAN: successfully enabled encryption!
>>> 05/06/22 13:03:04 SECMAN: about to enable message authenticator with key type 3
>>> 05/06/22 13:03:04 SECMAN: because protocal is AES, not using other MAC.
>>> 05/06/22 13:03:04 SECMAN: successfully enabled message authenticator!
>>> 05/06/22 13:03:04 Getting authenticated user from cached session: unauthenticated@unmapped
>>> 05/06/22 13:03:04 SECMAN: startCommand succeeded.
>>> 05/06/22 13:03:04 Authorizing server 'unauthenticated@unmapped/10.13.5.58'.
>>> 05/06/22 13:03:04 ERROR (schedd ldas-osg-ce.ligo-la.caltech.edu at pool ldas-condor:9618) (2376.0) Failed to commit job submission
>>> 
>>> 
>>> But I don't see any matching error in my cluster's central manager logs or in the local schedd logs on the host.  The authenticated user showing up as 'unauthenticated@unmapped' seems likely to be part of the problem, but I don't see a matching mapping happening in my central collector logs.
>>> 
>>> --Mike
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>> 
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 