
Re: [HTCondor-users] dagman fails to authenticate to schedd and update its job ClassAds



Dear Jacob,

On 30.07.20 at 17:14, Rundall, Jacob D wrote:
> Thanks, I agree that it looks like the same issue.
>
> I see that the bug tracker indicates "Fixed Version: v080811" and "Last Change: 2020-Jul-27 08:25". I don't know what the update was, or how much to read into the "Fixed Version" (is it coded and planned for release? just a target?). But fingers crossed.

I think this change was the mass migration of not-yet-fixed issues after 8.8.10 was tagged (you will usually see a check-in linked or a comment from review if an issue is addressed and the code change has become part of the development branch).
So I'd translate it as "target", but the devs can of course explain better.

Cheers,
	Oliver


On 7/29/20, 7:54 PM, "Oliver Freyermuth" <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

     Dear Jacob,

     this seems to be the same issue we mentioned earlier on this list[0]. It's already tracked in this issue:
       https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6970
     So it is a (tracked) bug, with no known workaround as of now, but at least it does not completely prevent operation of DAGMAN with Kerberos :-).

     Cheers,
     	Oliver

     [0] https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-January/msg00012.shtml
         Note that the mail starts with a crash issue that was temporarily seen as a consequence of this problem in early 8.8 releases[1],
         but then describes the same issue you see.
     [1] https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6917

     On 29.07.20 at 23:36, Rundall, Jacob D wrote:
     > It appears that condor_dagman is having trouble authenticating to the schedd:
     >
     > 07/29/20 15:56:09 AUTH_ERROR: Generic preauthentication failure
     >
     > 07/29/20 15:56:09 SECMAN: required authentication with schedd at <141.142.181.239:9618> failed, so aborting command QMGMT_WRITE_CMD.
     >
     > 07/29/20 15:56:09 WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)
     >
     >
     >
     > This output occurs when running a very simple diamond DAG, for instance. I see it ~4 times right away, and then occasionally later on. With DAGMAN_QUEUE_UPDATE_INTERVAL set to the default of 300 it pretty much only reoccurs at the end of the DAG's run. When I shorten DAGMAN_QUEUE_UPDATE_INTERVAL to 10 it reoccurs more frequently (not exactly every 10 seconds, but roughly each time a node in the DAG completes).
     >
     >
     >
     > BTW, we noticed this issue because dagman job ClassAds are seemingly not being updated, i.e., the DAG_ attributes are not getting added as listed here:
     >
     > https://htcondor.readthedocs.io/en/stable/users-manual/dagman-applications.html#status-information-for-the-dag-in-a-classad
     >
     >
     >
     > And we're suspicious that these authentication errors may point to the underlying reason.
     >
     >
     >
     > Does anyone have any input towards this issue/these issues? Thanks!
     >
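
For anyone who wants to check how often this symptom shows up in their own logs, the failure lines quoted above can be tallied with a short script. This is only a rough sketch: the function name `count_auth_failures` and the labels in `MARKERS` are made up for illustration, the marker substrings and the timestamp layout are copied from the log excerpt quoted above, and real log files may well contain other variants of these messages.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the quoted excerpt: MM/DD/YY HH:MM:SS
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings copied from the quoted log lines; the keys are hypothetical labels.
MARKERS = {
    "auth_error": "AUTH_ERROR:",
    "secman_abort": "SECMAN: required authentication with schedd",
    "qmgmt_connect_failed": "failed to connect to queue manager",
}


def count_auth_failures(lines):
    """Count marker hits and collect timestamps of failed queue connections."""
    counts = Counter()
    failed_at = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines without the expected timestamp prefix
        stamp, message = match.groups()
        for label, marker in MARKERS.items():
            if marker in message:
                counts[label] += 1
                if label == "qmgmt_connect_failed":
                    failed_at.append(stamp)
    return counts, failed_at


# The three lines from the report above, used here as sample input.
SAMPLE = [
    "07/29/20 15:56:09 AUTH_ERROR: Generic preauthentication failure",
    "07/29/20 15:56:09 SECMAN: required authentication with schedd at "
    "<141.142.181.239:9618> failed, so aborting command QMGMT_WRITE_CMD.",
    "07/29/20 15:56:09 WARNING: failed to connect to queue manager "
    "(AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS)",
]

if __name__ == "__main__":
    counts, failed_at = count_auth_failures(SAMPLE)
    for label, n in sorted(counts.items()):
        print(f"{label}: {n}")
    print("failed queue connections at:", ", ".join(failed_at))
```

To run it against a real log, pass an open file object instead of `SAMPLE`; comparing the collected timestamps with node completion times would show whether the failures track the DAGMAN_QUEUE_UPDATE_INTERVAL setting or node completions.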

