
Re: [HTCondor-users] condor_submit_dag fails with authentication failure



Thanks Cole,

The nodes.log file is 0 bytes. Essentially what happens is that the "condor_dagman" job sits in the queue and is "running" but does not move to the Idle, Run, or Done columns. I eventually have to condor_rm it, because all it is doing is retrying every 5s to queue the node job, which fails with an AUTH_ERROR. The auth error is caused by a file permissions problem (which shows up as the "Generic preauthentication failure" in the log).

I'm quite certain that the issue is due to the direct submit change, and more specifically to Condor_Auth_Kerberos::authenticate thinking that it is running as a daemon, whereas it is running as a user. This happens at condor_auth_kerberos:284:

    if (isDaemon() || get_mySubSystem()->isDaemon() )

The isDaemon() is defined by 'get_my_uid() == 0' (condor_auth.cpp:59) which is false as the process is running as the submitter, not as root. However, get_mySubSystem()->isDaemon() evaluates to true for SUBSYSTEM_CLASS_DAEMON (condor_utils/subsystem_info.h:165).

Consequently, the authenticate() call incorrectly enters init_daemon() which then at condor_auth_kerberos:633 fails with a permissions error trying to open the /etc/krb5.keytab file that holds the key to the host credentials (and is therefore only root-readable).
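
A quick way to confirm the permissions part of this diagnosis is to check whether the submitting user can read the keytab at all. The following is a minimal sketch, not HTCondor code: the helper name can_read is hypothetical, and it relies only on the POSIX access() call.

```cpp
#include <string>
#include <unistd.h>   // access(), R_OK (POSIX)

// Hypothetical helper, not part of HTCondor.
// Returns true if the calling process is allowed to read `path`.
// Note that access() checks against the real uid/gid of the process.
bool can_read(const std::string& path) {
    return ::access(path.c_str(), R_OK) == 0;
}
```

In the setup described above, can_read("/etc/krb5.keytab") would be expected to return false when called as the submitting user, and true when called as root.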

Thanks for digging into this. The various revisions and comments ("hack for now") surrounding daemon vs user and server vs client in the authentication code suggest a long history of organic growth :-) and you may need to start with some conceptual clean-up.

Relevant for reproducing the issue: I use Kerberos (managed through FreeIPA) and only Kerberos for authentication, so this is the whole security configuration:

# One ring to rule them all: Kerberos
SEC_DEFAULT_AUTHENTICATION_METHODS = KERBEROS
SEC_DEFAULT_AUTHENTICATION=required
SEC_DEFAULT_INTEGRITY=required

All KERBEROS_* config variables are left at their defaults. This works perfectly in a standard Kerberos setup. Let me know if I can be of further assistance.

Kind regards,
Marco


On 09/04/2022 00:07, Cole Bollig via HTCondor-users wrote:
Judging by your findings and the fact that setting DAGMAN_USE_DIRECT_SUBMIT = false in the config file works around the problem, it appears to be an issue with the transition from DAGs submitting jobs by shelling out to condor_submit to direct job submission to the schedd, but I can't say with certainty.

I will do some digging around. It may take me a little bit as I am a newer team member, but in the meantime, can you send the nodes.log file?

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Marco van Zwetselaar <zwets@xxxxxxxxxx>
Sent: Friday, April 8, 2022 3:16 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_submit_dag fails with authentication failure

For those running into this issue, the workaround is to set config variable

    DAGMAN_USE_DIRECT_SUBMIT = False

This reverts to the pre-9.7.0 behaviour where DAGMan submits through condor_submit.

Marco


On 08/04/2022 12:34, Marco van Zwetselaar wrote:
Dear all,

I have tracked this down to what looks like a bug related to this in the 9.7.0 Changelog:

- DAGMan submits jobs directly (does not shell out to condor_submit)

The process can't open /etc/krb5.keytab because it is running as the submitting user, not the daemon, whereas it wants to authenticate to the local schedd as daemon (and hence needs the host credentials from the keytab).

This is the code in src/condor_io/condor_auth_kerberos (comments elided) where the process appears to take the wrong turn:

    int Condor_Auth_Kerberos :: authenticate(const char * /* remoteHost */, CondorError* /* errstack */, bool /*non_blocking*/)
    {
        if ( mySock_->isClient() )
        {
            if (init_kerberos_context() && init_server_info())
            {
                if (isDaemon() || get_mySubSystem()->isDaemon() ) {
                    status = init_daemon();
                } else {
                    status = init_user();
                }

The process goes into init_daemon(), whereas for regular job submissions it goes into init_user().
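
To make the branch easier to reason about, here is a small stand-alone model of that condition. It is a sketch under assumptions, not the HTCondor implementation: ProcessInfo, SubsystemClass and chosen_init are hypothetical names, and isDaemon() is modelled simply as "uid is 0", as described elsewhere in this thread.

```cpp
#include <string>

// Hypothetical stand-in for the subsystem class; only the
// daemon / non-daemon distinction matters for this model.
enum class SubsystemClass { Daemon, Client };

// Simplified view of the process that attempts to authenticate.
struct ProcessInfo {
    unsigned uid;              // uid the process runs as (0 = root)
    SubsystemClass subsystem;  // class of the subsystem it belongs to
};

// Models the quoted condition:
//   isDaemon() || get_mySubSystem()->isDaemon()
// with isDaemon() assumed to mean "uid == 0".
std::string chosen_init(const ProcessInfo& p) {
    const bool is_root = (p.uid == 0);
    const bool daemon_subsystem = (p.subsystem == SubsystemClass::Daemon);
    if (is_root || daemon_subsystem) {
        return "init_daemon";
    }
    return "init_user";
}
```

With this model, a non-root process in a client-class subsystem gets "init_user", while a non-root process in a daemon-class subsystem gets "init_daemon", which matches the behaviour seen with direct submission from DAGMan.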

I'm happy to work on fixing this (I have some familiarity with that file, winning the bug of the week in 2020 for a 16-year-old one[1]) but it seems to be a bit of a deeper issue: what _should_ the code be running as?

Cheers,
Marco

[1] https://github.com/htcondor/htcondor/pull/99


On 07/04/2022 04:33, Marco van Zwetselaar wrote:
Dear all,

Submitting jobs on my cluster works fine, but submitted DAGs fail with what looks like an authentication failure.

I've tested with the simplest possible DAG ("JOB test test.job") and the simplest possible job ("executable = /bin/hostname"), submitted by the same user, on the same machine. The job goes well, but the DAG fails with this in the logs:

In test.dag.dagman.out:

04/07/22 03:14:20 Submitting HTCondor Node test job(s)...
04/07/22 03:14:20 Submitting node test from file test.job using direct job submission
04/07/22 03:14:20 AUTH_ERROR: Generic preauthentication failure
04/07/22 03:14:20 SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:14:20 Can't connect to queue manager: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS


And the SchedLog has:

04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: authentication of <x.x.x.x:43467> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS


Our authentication is all Kerberos (FreeIPA) and works well across the cluster. The x.x.x.x is the IP of the local machine. The user's credentials are OK: 'condor_submit test.job' right after the failure runs fine.

When I run with debug logging, I see this in the lead up to the above:

04/07/22 03:46:03 (D_SECURITY) SECMAN: new session, doing initial authentication.
04/07/22 03:46:03 (fd:7) (pid:287909) (D_SECURITY) SECMAN: Auth methods: KERBEROS
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: setting timeout for <x.x.x.x:9618?addrs=x.x.x.x-9618&alias=crick.my.domain&noUDP&sock=schedd_968_8bc4> to 20.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = 'KERBEROS')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 64)
04/07/22 03:46:03 (D_SECURITY) KERBEROS: get remote server principal for "host/crick.my.domain"
04/07/22 03:46:03 (D_SECURITY) KERBEROS: krb5_unparse_name: host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) KERBEROS: no user yet determined, will grab up to slash
04/07/22 03:46:03 (D_SECURITY) KERBEROS: picked user: host
04/07/22 03:46:03 (D_SECURITY) KERBEROS: remapping 'host' to 'condor'
04/07/22 03:46:03 (D_SECURITY) unable to open map file (null), errno 22
04/07/22 03:46:03 (D_SECURITY) Client is condor@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) init_daemon: client principal is 'host/crick.my.domain@xxxxxxxxx'
04/07/22 03:46:03 (D_SECURITY) init_daemon: Using default keytab FILE:/etc/krb5.keytab
04/07/22 03:46:03 (D_SECURITY) init_daemon: Trying to get tgt credential for service host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:632
04/07/22 03:46:03 (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:634
04/07/22 03:46:03 (D_ALWAYS) AUTH_ERROR: Generic preauthentication failure
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: method 64 (KERBEROS) failed.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = '')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: sending (methods == 0) to server
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 0)
04/07/22 03:46:03 (D_ALWAYS) SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:46:03 (fd:6) (pid:287909) (D_ALWAYS) WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)

I'm not sure what the mechanics are supposed to be, but it looks like the local machine (rather than the user) credentials are being used to authenticate with the local schedd, and this somehow doesn't work? Could it be that this code is running as non-root so the krb5.keytab is inaccessible?
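
For readers puzzled by the "will grab up to slash" and "picked user: host" lines in the debug log, the step can be illustrated as follows. This is only a sketch of what the log messages suggest, not the actual HTCondor code; the function name primary_component is hypothetical.

```cpp
#include <string>

// Hypothetical helper, not part of HTCondor.
// Returns the part of a Kerberos principal before the first '/',
// e.g. "host" for "host/crick.my.domain@REALM".
// If there is no '/', the whole string is returned unchanged.
std::string primary_component(const std::string& principal) {
    const std::string::size_type slash = principal.find('/');
    return principal.substr(0, slash);  // npos means "to the end"
}
```

Applied to the host principal from the log, this yields "host", which the log then shows being remapped to "condor".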

Cheers
Marco


--
KCRI
Marco van Zwetselaar
Bioinformatician
Kilimanjaro Clinical Research Institute
P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania


