[HTCondor-users] condor_submit_dag fails with authentication failure



Dear all,

Submitting jobs on my cluster works fine, but submitted DAGs fail with what looks like an authentication failure.

I've tested with the simplest possible DAG ("JOB test test.job") and the simplest possible job ("executable = /bin/hostname"), both submitted by the same user on the same machine. The plain job runs fine, but the DAG fails with the following in the logs:

In test.dag.dagman.out:

04/07/22 03:14:20 Submitting HTCondor Node test job(s)...
04/07/22 03:14:20 Submitting node test from file test.job using direct job submission
04/07/22 03:14:20 AUTH_ERROR: Generic preauthentication failure
04/07/22 03:14:20 SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:14:20 Can't connect to queue manager: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS


And the SchedLog has:

04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: authentication of <x.x.x.x:43467> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS


Our authentication is all Kerberos (FreeIPA) and works well across the cluster. The x.x.x.x is the IP of the local machine. The user's credentials are OK: 'condor_submit test.job' right after the failure runs fine.
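
For concreteness, the reproduction described above amounts to roughly the following. This is a sketch rather than a copy of the actual files: the "queue" line is an assumption (a submit file needs one to enqueue a job), and the two submit commands are left as comments so the snippet only writes the files.

```shell
# Write the minimal submit file (the "queue" line is assumed, not quoted above).
cat > test.job <<'EOF'
executable = /bin/hostname
queue
EOF

# Write the minimal DAG: a single node named "test" that uses test.job.
cat > test.dag <<'EOF'
JOB test test.job
EOF

# Then, by hand, as the same user on the same machine:
#   condor_submit test.job        # succeeds
#   condor_submit_dag test.dag    # node submission fails with AUTH_ERROR
```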

When I run with debug logging, I see this in the lead-up to the above:

04/07/22 03:46:03 (D_SECURITY) SECMAN: new session, doing initial authentication.
04/07/22 03:46:03 (fd:7) (pid:287909) (D_SECURITY) SECMAN: Auth methods: KERBEROS
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: setting timeout for <x.x.x.x:9618?addrs=x.x.x.x-9618&alias=crick.my.domain&noUDP&sock=schedd_968_8bc4> to 20.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = 'KERBEROS')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 64)
04/07/22 03:46:03 (D_SECURITY) KERBEROS: get remote server principal for "host/crick.my.domain"
04/07/22 03:46:03 (D_SECURITY) KERBEROS: krb5_unparse_name: host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) KERBEROS: no user yet determined, will grab up to slash
04/07/22 03:46:03 (D_SECURITY) KERBEROS: picked user: host
04/07/22 03:46:03 (D_SECURITY) KERBEROS: remapping 'host' to 'condor'
04/07/22 03:46:03 (D_SECURITY) unable to open map file (null), errno 22
04/07/22 03:46:03 (D_SECURITY) Client is condor@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) init_daemon: client principal is 'host/crick.my.domain@xxxxxxxxx'
04/07/22 03:46:03 (D_SECURITY) init_daemon: Using default keytab FILE:/etc/krb5.keytab
04/07/22 03:46:03 (D_SECURITY) init_daemon: Trying to get tgt credential for service host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:632
04/07/22 03:46:03 (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:634
04/07/22 03:46:03 (D_ALWAYS) AUTH_ERROR: Generic preauthentication failure
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: method 64 (KERBEROS) failed.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = '')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: sending (methods == 0) to server
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 0)
04/07/22 03:46:03 (D_ALWAYS) SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:46:03 (fd:6) (pid:287909) (D_ALWAYS) WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)

I'm not sure what the mechanics are supposed to be, but it looks like the local machine's credentials (host/crick.my.domain), rather than the user's, are being used to authenticate with the local schedd, and that this fails. Could it be that this code is running as non-root, so that /etc/krb5.keytab is inaccessible?
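
One rough way to probe the keytab hypothesis is to check, as the submitting user, whether the default keytab named in the debug log is readable at all. This is only a sketch: it tests file permissions for whoever runs it, which may not match the identity DAGMan runs under, and it says nothing about whether the keytab contents are valid.

```shell
# Default keytab path as reported by init_daemon in the debug log above.
keytab=/etc/krb5.keytab

# Report whether the current user can read the file.
if [ -r "$keytab" ]; then
    echo "readable by $(id -un): $keytab"
else
    echo "not readable (or missing) for $(id -un): $keytab"
fi

# If it is readable, its entries can be listed by hand with:
#   klist -k "$keytab"
```

If the file turns out to be unreadable for the user, that would at least be consistent with the failure occurring right after the "Using default keytab" line.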

Cheers
Marco


--
KCRI
Marco van Zwetselaar
Bioinformatician
Kilimanjaro Clinical Research Institute
P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania