
Re: [HTCondor-users] condor_submit_dag fails with authentication failure



Judging by your findings and the fact that setting DAGMAN_USE_DIRECT_SUBMIT = false in the config file works around the problem, it appears to be an issue with the transition from DAGMan submitting jobs by shelling out to condor_submit to submitting them directly to the schedd, but I can't say with certainty.

 I will do some digging around. It may take me a little while, as I am a newer team member, but in the meantime can you send me the nodes.log file?

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Marco van Zwetselaar <zwets@xxxxxxxxxx>
Sent: Friday, April 8, 2022 3:16 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_submit_dag fails with authentication failure
 
For those running into this issue, the workaround is to set config variable

    DAGMAN_USE_DIRECT_SUBMIT = False

This reverts to the pre-9.7.0 behaviour where DAGMan submits through condor_submit.

Marco


On 08/04/2022 12:34, Marco van Zwetselaar wrote:
Dear all,

I have tracked this down to what looks like a bug related to this in the 9.7.0 Changelog:

- DAGMan submits jobs directly (does not shell out to condor_submit)

The process can't open /etc/krb5.keytab because it is running as the submitting user, not the daemon, whereas it wants to authenticate to the local schedd as daemon (and hence needs the host credentials from the keytab).
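One way to confirm the keytab permission problem described above is to check, as the submitting user, whether the file is readable at all. The following is a minimal sketch using the POSIX access() call; the helper name can_read_file is made up for illustration and is not part of HTCondor.

```cpp
#include <string>
#include <unistd.h>   // access(), R_OK (POSIX)

// Hypothetical helper, not an HTCondor function.
// Returns true if the calling process may read `path`.
// Note: access() checks against the real (not effective) UID/GID.
bool can_read_file(const std::string& path) {
    return access(path.c_str(), R_OK) == 0;
}
```

Calling can_read_file("/etc/krb5.keytab") as an unprivileged user would be expected to return false on a setup where the keytab is readable by root only, which would be consistent with the failure seen here.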

This is the code in src/condor_io/condor_auth_kerberos (comments elided) where the process appears to take the wrong turn:

    int Condor_Auth_Kerberos :: authenticate(const char * /* remoteHost */, CondorError* /* errstack */, bool /*non_blocking*/)
    {
        if ( mySock_->isClient() )
        {
            if (init_kerberos_context() && init_server_info())
            {
                if (isDaemon() || get_mySubSystem()->isDaemon() ) {
                    status = init_daemon();
                } else {
                    status = init_user();
                }

The process goes into init_daemon(), whereas for regular job submissions it goes into init_user().
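To make the branch in the excerpt above easier to reason about, here is a toy model of the condition. It is only a sketch: KrbInit and choose_init are hypothetical names, and the two booleans stand in for isDaemon() and get_mySubSystem()->isDaemon() without reproducing how HTCondor actually computes them.

```cpp
// Hypothetical stand-ins for the two outcomes in the excerpt;
// these are not HTCondor types or functions.
enum class KrbInit { Daemon, User };

// Mirrors the quoted condition: either flag being true selects the daemon path.
KrbInit choose_init(bool auth_object_is_daemon, bool subsystem_is_daemon) {
    if (auth_object_is_daemon || subsystem_is_daemon) {
        return KrbInit::Daemon;   // corresponds to init_daemon(): host credentials from the keytab
    }
    return KrbInit::User;         // corresponds to init_user(): the submitting user's credentials
}
```

In this model a single true flag is enough to take the daemon path, so a process running as the submitting user still ends up needing the keytab if its subsystem reports itself as a daemon.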

I'm happy to work on fixing this (I have some familiarity with that file, winning the bug of the week in 2020 for a 16-year-old one[1]) but it seems to be a bit of a deeper issue: what _should_ the code be running as?

Cheers,
Marco

[1] https://github.com/htcondor/htcondor/pull/99


On 07/04/2022 04:33, Marco van Zwetselaar wrote:
Dear all,

Submitting jobs on my cluster works fine, but submitted DAGs fail with what looks like an authentication failure.

I've tested with the simplest possible DAG ("JOB test test.job") and the simplest possible job ("executable = /bin/hostname"), submitted by the same user, on the same machine. The job goes well, but the DAG fails with this in the logs:

In test.dag.dagman.out:

04/07/22 03:14:20 Submitting HTCondor Node test job(s)...
04/07/22 03:14:20 Submitting node test from file test.job using direct job submission
04/07/22 03:14:20 AUTH_ERROR: Generic preauthentication failure
04/07/22 03:14:20 SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:14:20 Can't connect to queue manager: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS


And the SchedLog has:

04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: authentication of <x.x.x.x:43467> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS


Our authentication is all Kerberos (FreeIPA) and works well across the cluster. The x.x.x.x is the IP of the local machine. The user's credentials are OK: 'condor_submit test.job' right after the failure runs fine.

When I run with debug logging, I see this in the lead up to the above:

04/07/22 03:46:03 (D_SECURITY) SECMAN: new session, doing initial authentication.
04/07/22 03:46:03 (fd:7) (pid:287909) (D_SECURITY) SECMAN: Auth methods: KERBEROS
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: setting timeout for <x.x.x.x:9618?addrs=x.x.x.x-9618&alias=crick.my.domain&noUDP&sock=schedd_968_8bc4> to 20.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = 'KERBEROS')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 64)
04/07/22 03:46:03 (D_SECURITY) KERBEROS: get remote server principal for "host/crick.my.domain"
04/07/22 03:46:03 (D_SECURITY) KERBEROS: krb5_unparse_name: host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) KERBEROS: no user yet determined, will grab up to slash
04/07/22 03:46:03 (D_SECURITY) KERBEROS: picked user: host
04/07/22 03:46:03 (D_SECURITY) KERBEROS: remapping 'host' to 'condor'
04/07/22 03:46:03 (D_SECURITY) unable to open map file (null), errno 22
04/07/22 03:46:03 (D_SECURITY) Client is condor@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) init_daemon: client principal is 'host/crick.my.domain@xxxxxxxxx'
04/07/22 03:46:03 (D_SECURITY) init_daemon: Using default keytab FILE:/etc/krb5.keytab
04/07/22 03:46:03 (D_SECURITY) init_daemon: Trying to get tgt credential for service host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:632
04/07/22 03:46:03 (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:634
04/07/22 03:46:03 (D_ALWAYS) AUTH_ERROR: Generic preauthentication failure
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: method 64 (KERBEROS) failed.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = '')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: sending (methods == 0) to server
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 0)
04/07/22 03:46:03 (D_ALWAYS) SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:46:03 (fd:6) (pid:287909) (D_ALWAYS) WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)

I'm not sure what the mechanics are supposed to be, but it looks like the local machine (rather than the user) credentials are being used to authenticate with the local schedd, and this somehow doesn't work? Could it be that this code is running as non-root so the krb5.keytab is inaccessible?

Cheers
Marco


--
KCRI
Marco van Zwetselaar
Bioinformatician
Kilimanjaro Clinical Research Institute
P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania