
Re: [HTCondor-users] dagman fails to authenticate to schedd and update its job ClassAds



Thanks, I agree that it looks like the same issue.

I see that the bug tracker indicates "Fixed Version: v080811" and "Last Change: 2020-Jul-27 08:25". I don't know what the update was, or how much to read into the "Fixed Version" (is it coded and planned for release? just a target?). But fingers crossed.
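
In case it helps anyone compare before and after an upgrade, here is a minimal sketch for tallying how often the Kerberos failure quoted below shows up in a DAGMan log. It assumes the log lines keep the "MM/DD/YY HH:MM:SS message" layout seen in the quoted output; the function name count_auth_failures is my own and not part of HTCondor.

```python
import re
from collections import Counter

# Assumption: each log line starts with "MM/DD/YY HH:MM:SS " as in the quoted
# output, and the failure is reported with this exact Kerberos message text.
AUTH_FAILURE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} .*Failed to authenticate using KERBEROS"
)


def count_auth_failures(lines):
    """Return a Counter mapping 'MM/DD/YY HH:MM' to the failures seen in that minute."""
    per_minute = Counter()
    for line in lines:
        match = AUTH_FAILURE.match(line)
        if match:
            per_minute[match.group(1)] += 1
    return per_minute
```

Passing it an open file handle for the DAG's .dagman.out file (any iterable of lines works) gives a per-minute count, which should make it easy to see whether the failures cluster around node completions or follow DAGMAN_QUEUE_UPDATE_INTERVAL.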

On 7/29/20, 7:54 PM, "Oliver Freyermuth" <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

    Dear Jacob,
    
    this seems to be the same issue we mentioned earlier on this list[0]. It's already tracked in this issue:
     https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6970
    So it is a (tracked) bug, with no known workaround as of now, but at least it does not completely prevent operation of DAGMAN with Kerberos :-). 
    
    Cheers,
    	Oliver
    
    [0] https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-January/msg00012.shtml
        Note that the mail starts with a crash issue that was temporarily seen as consequence of this problem in early 8.8 releases[1],
        but then describes the same issue you see. 
    [1] https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6917
    
    On 29.07.20 at 23:36, Rundall, Jacob D wrote:
    > It appears that condor_dagman is having trouble authenticating to the schedd:
    > 
    > 07/29/20 15:56:09 AUTH_ERROR: Generic preauthentication failure
    > 
    > 07/29/20 15:56:09 SECMAN: required authentication with schedd at <141.142.181.239:9618> failed, so aborting command QMGMT_WRITE_CMD.
    > 
    > 07/29/20 15:56:09 WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)
    > 
    >  
    > 
    > This output occurs when running a very simple diamond DAG, for instance. I see it ~4 times right away, and then occasionally later on. With DAGMAN_QUEUE_UPDATE_INTERVAL set to the default of 300, it pretty much only recurs at the end of the DAG's run. When I shorten DAGMAN_QUEUE_UPDATE_INTERVAL to 10, it recurs more frequently (not exactly every 10 seconds, but roughly each time a node in the DAG completes).
    > 
    >  
    > 
    > BTW, we noticed this issue because dagman job ClassAds are seemingly not being updated, i.e., the DAG_ attributes are not getting added as listed here:
    > 
    > https://htcondor.readthedocs.io/en/stable/users-manual/dagman-applications.html#status-information-for-the-dag-in-a-classad
    > 
    >  
    > 
    > And we're suspicious that these authentication errors may point to the underlying reason.
    > 
    >  
    > 
    > Does anyone have any input towards this issue/these issues? Thanks!
    > 
    > 