
Re: [HTCondor-users] Issues with condor_dagman



Dear Tim,

many thanks for the quick help!
I had in fact checked exactly the URL you provided, but apparently too early: "stable" already contained 8.8, while "previous" still contained only 8.4.
Good to find the 8.6 packages back so we can perform a "downgrade" early next week :-).

We'll also test out whether things change with 8.8 (may take a while since we need to adapt puppet-htcondor first).

Also, congratulations on the new repository structure with version numbers!
It's much appreciated and will allow us to keep automatic updates active and sleep even better at night :-).

All the best for investigating the issue and have a nice weekend (in case you need more info, just let me know),
Oliver

On 05.01.19 at 04:02, Tim Theisen wrote:
I am sorry that you are having difficulty. The trusty 8.6.13 versions are available in our previous repository.

Here is a short web page for you:

https://research.cs.wisc.edu/htcondor/instructions/ubuntu/14/previous/

So, you can get the Trusty Tahr versions if you need them.

Hope this helps in the interim. We will look at your issues.

...Tim

P.S. The Debian 8 (jessie) and Ubuntu 14 (Trusty Tahr) repositories are still managed by an older tool chain (reprepro). All the newer Debian and Ubuntu release repositories are managed by aptly and contain the version number.

On 1/4/19 5:15 PM, Oliver Freyermuth wrote:
Dear HTCondor experts,

we are still on 8.6.13, but have encountered several issues (auth issues, segfaults) with condor_dagman. We are using KERBEROS auth
for both users and condor daemons. Everything apart from DAGMAN works well.

Until yesterday, we had been using the Ubuntu Trusty (14) build of 8.6.13 on Ubuntu 18.04, and every time we used condor_dagman, we got this in the logs:
-------------------------------------------------------------------------------------------
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: Client iscondor@xxxxxxxxxxx
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: KERBEROS: Server principal ishost/condor-cm1.example.com@xxxxxxxxx
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: init_daemon: client principal is 'host/cip001.example.com@xxxxxxxxx'
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: init_daemon: Using default keytabFILE:/etc/krb5.keytab
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: init_daemon: Trying to get tgt credential for servicehost/condor-cm1.example.com@xxxxxxxxx
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: AUTH_ERROR: Permission denied
Jan 04 18:22:26 cip001 condor_scheduniv_exec.1.0[1358119]: AUTHENTICATE: method 64 (KERBEROS) failed.
-------------------------------------------------------------------------------------------
This apparently only affected the QMGMT commands with which (to my understanding) DAGMAN modifies the TOTAL of the queue and other things.
So DAGMAN was still usable just fine, but the TOTAL shown by condor_q was off.

Last night, with the release of 8.8, all Trusty builds of 8.6.13 appear to have been purged from the repos.
So we have finally upgraded to the official build of 8.6.13 for Bionic from the HTCondor repos.

Now, however, condor_dagman reproducibly segfaults after:
-------------------------------------------------------------------------------------------
Jan 04 18:07:49 cip000 condor_scheduniv_exec.36.0[20684]: init_daemon: Trying to get tgt credential for servicehost/cip000.example.com@xxxxxxxxx
-------------------------------------------------------------------------------------------
and hence is fully unusable.
Downgrading to 8.6 trusty packages is not possible anymore (they are gone from repos).

I investigated what's happening with D_SECURITY:
-------------------------------------------------------------------------------------------
condor_schedd[49872]: Will return to DC because authentication is incomplete.
condor_scheduniv_exec.41.0[49914]: HANDSHAKE: server replied (method = 64)
condor_scheduniv_exec.41.0[49914]: KERBEROS: krb5_unparse_name:host/cip000.example.com@xxxxxxxxx
condor_scheduniv_exec.41.0[49914]: KERBEROS: no user yet determined, will grab up to slash
condor_scheduniv_exec.41.0[49914]: KERBEROS: picked user: host
condor_scheduniv_exec.41.0[49914]: KERBEROS: remapping 'host' to 'condor'
condor_scheduniv_exec.41.0[49914]: Client iscondor@xxxxxxxxxxx
condor_scheduniv_exec.41.0[49914]: KERBEROS: Server principal ishost/cip000.example.com@xxxxxxxxx
condor_scheduniv_exec.41.0[49914]: init_daemon: client principal is 'host/cip000.example.com@xxxxxxxxx'
condor_scheduniv_exec.41.0[49914]: init_daemon: Using default keytabFILE:/etc/krb5.keytab
condor_scheduniv_exec.41.0[49914]: init_daemon: Trying to get tgt credential for servicehost/cip000.example.com@xxxxxxxxx
condor_schedd[49872]: KERBEROS: entered authenticate_continue, state==100
condor_schedd[49872]: KERBEROS: leaving authenticate_continue, state==100, return=0
condor_schedd[49872]: AUTHENTICATE: method -1 (KERBEROS) failed.
condor_schedd[49872]: HANDSHAKE: in handshake(my_methods = 'KERBEROS,SSL')
condor_schedd[49872]: HANDSHAKE: handshake() - i am the server
condor_schedd[49872]: AUTHENTICATE: handshake failed!
condor_schedd[49872]: Authentication was a FAILURE.
condor_schedd[49872]: DC_AUTHENTICATE: authentication of <IP_ADDR:26711> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
condor_schedd[49872]: DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using KERBEROS
condor_procd[49868]: PROC_FAMILY_KILL_FAMILY
condor_procd[49868]: taking a snapshot...
condor_procd[49868]: process 49914 (of family 49914) has exited
condor_procd[49868]: ...snapshot complete
condor_procd[49868]: sending signal 9 to family with root 49914
condor_schedd[49872]: scheduler universe job (41.0) pid 49914 died with signal 11 (Segmentation fault)
-------------------------------------------------------------------------------------------

So the SEGFAULT happens directly after "Trying to get tgt credential". It seems it actually happens when trying to generate the error message ("Permission denied").

Several things are strange here.
- Why does DAGMAN try to authenticate as a DAEMON in the first place? It naturally cannot do that, since it runs as a normal user rather than as root and hence cannot (and should not) read /etc/krb5.keytab.
   For inter-daemon authentication, we have configured SSL auth as a fallback. Of course, the certs are not user-readable either, so this would also fail
   (if it ever got that far, which it no longer can because of the segfault).
- Why does it crash only after changing from a Trusty build of 8.6.13 to a Bionic build of 8.6.13 (we are on 18.04, i.e. Bionic)?

For some more information, I installed debug symbols and attached a debugger.
Here's a full trace:
-------------------------------------------------------------------------------------------
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fd5b1264cfa in unix_sig_coredump (signum=11, s_info=<optimized out>) at ./src/condor_daemon_core.V6/daemon_core_main.cpp:765
#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007fd5b11ece35 in Condor_Auth_Kerberos::init_daemon (this=this@entry=0x5577f1280eb0) at ./src/condor_io/condor_auth_kerberos.cpp:649
#5  0x00007fd5b11ee710 in Condor_Auth_Kerberos::authenticate (this=0x5577f1280eb0) at ./src/condor_io/condor_auth_kerberos.cpp:284
#6  0x00007fd5b11e71f9 in Authentication::authenticate_continue (this=this@entry=0x5577f127c640, errstack=errstack@entry=0x7ffc0e5c5f20, non_blocking=<optimized out>) at ./src/condor_io/authentication.cpp:321
#7  0x00007fd5b11e7a9a in Authentication::authenticate_inner (this=this@entry=0x5577f127c640,
     hostAddr=hostAddr@entry=0x5577f1298660 "<IP_ADDR:9618?addrs=IP_ADDR-9618+[IP_ADDR]-9618&noUDP&sock=6382_1982_3>", auth_methods=auth_methods@entry=0x5577f129c910 "KERBEROS,SSL",
     errstack=errstack@entry=0x7ffc0e5c5f20, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false) at ./src/condor_io/authentication.cpp:162
#8  0x00007fd5b11e7b64 in Authentication::authenticate (this=this@entry=0x5577f127c640, hostAddr=0x5577f1298660 "<IP_ADDR:9618?addrs=IP_ADDR+[IP_ADDR]-9618&noUDP&sock=6382_1982_3>",
     auth_methods=auth_methods@entry=0x5577f129c910 "KERBEROS,SSL", errstack=errstack@entry=0x7ffc0e5c5f20, timeout=timeout@entry=20, non_blocking=non_blocking@entry=false) at ./src/condor_io/authentication.cpp:116
#9  0x00007fd5b11e7bae in Authentication::authenticate (this=this@entry=0x5577f127c640, hostAddr=<optimized out>, key=@0x5577f12a8a60: 0x0, auth_methods=auth_methods@entry=0x5577f129c910 "KERBEROS,SSL",
     errstack=errstack@entry=0x7ffc0e5c5f20, timeout=timeout@entry=20, non_blocking=false) at ./src/condor_io/authentication.cpp:104
#10 0x00007fd5b120a7de in ReliSock::perform_authenticate (this=0x5577f1291c90, with_key=with_key@entry=true, key=@0x5577f12a8a60: 0x0, methods=0x5577f129c910 "KERBEROS,SSL", errstack=0x7ffc0e5c5f20, auth_timeout=20, non_blocking=false,
     method_used=0x0) at ./src/condor_io/reli_sock.cpp:1185
#11 0x00007fd5b120a864 in ReliSock::authenticate (this=<optimized out>, key=<optimized out>, methods=<optimized out>, errstack=<optimized out>, auth_timeout=<optimized out>, non_blocking=<optimized out>, method_used=0x0)
     at ./src/condor_io/reli_sock.cpp:1242
#12 0x00007fd5b1201195 in SecManStartCommand::authenticate_inner (this=0x5577f12a87a0) at ./src/condor_io/condor_secman.cpp:1920
#13 0x00007fd5b1205465 in SecManStartCommand::startCommand_inner (this=this@entry=0x5577f12a87a0) at ./src/condor_io/condor_secman.cpp:1295
#14 0x00007fd5b1205732 in SecManStartCommand::startCommand (this=this@entry=0x5577f12a87a0) at ./src/condor_io/condor_secman.cpp:1227
#15 0x00007fd5b1206ece in SecMan::startCommand (this=<optimized out>, cmd=cmd@entry=-1322673917, sock=<optimized out>, raw_protocol=<optimized out>, errstack=<optimized out>, subcmd=<optimized out>, callback_fn=0x0, misc_data=0x0,
     nonblocking=false, cmd_description=0x0, sec_session_id_hint=0x0) at ./src/condor_io/condor_secman.cpp:1119
#16 0x00007fd5b121f43d in Daemon::startCommand (cmd=cmd@entry=-1322673917, sock=<optimized out>, timeout=timeout@entry=0, errstack=errstack@entry=0x7ffc0e5c5f20, subcmd=subcmd@entry=32725, callback_fn=<optimized out>,
     misc_data=<optimized out>, nonblocking=<optimized out>, cmd_description=<optimized out>, sec_man=<optimized out>, raw_protocol=false, sec_session_id=0x0) at ./src/condor_daemon_client/daemon.cpp:559
#17 0x00007fd5b1223f9e in Daemon::startCommand (this=this@entry=0x7ffc0e5c5cc0, cmd=-1322673917, cmd@entry=1112, st=st@entry=Stream::reli_sock, sock=sock@entry=0x7ffc0e5c5c48, timeout=timeout@entry=0,
     errstack=errstack@entry=0x7ffc0e5c5f20, subcmd=<optimized out>, callback_fn=<optimized out>, misc_data=<optimized out>, nonblocking=<optimized out>, cmd_description=<optimized out>, raw_protocol=<optimized out>,
     sec_session_id=<optimized out>) at ./src/condor_daemon_client/daemon.cpp:629
#18 0x00007fd5b122417b in Daemon::startCommand (this=this@entry=0x7ffc0e5c5cc0, cmd=cmd@entry=1112, st=st@entry=Stream::reli_sock, timeout=timeout@entry=0, errstack=errstack@entry=0x7ffc0e5c5f20,
     cmd_description=cmd_description@entry=0x0, raw_protocol=false, sec_session_id=0x0) at ./src/condor_daemon_client/daemon.cpp:685
#19 0x00007fd5b127d0d4 in ConnectQ (qmgr_location=<optimized out>, timeout=timeout@entry=0, read_only=read_only@entry=false, errstack=errstack@entry=0x7ffc0e5c5f20, effective_owner=effective_owner@entry=0x0,
     schedd_version_str=<optimized out>, schedd_version_str@entry=0x5577f1284320 "$CondorVersion: 8.6.13 Oct 30 2018 BuildID: Debian-8.6.13-1 Debian-8.6.13-1 $") at ./src/condor_schedd.V6/qmgr_lib_support.cpp:85
#20 0x00005577f06af45a in DagmanClassad::OpenConnection (this=this@entry=0x5577f127c2a0) at ./src/condor_dagman/dagman_classad.cpp:219
#21 0x00005577f06af5ae in DagmanClassad::InitializeMetrics (this=this@entry=0x5577f127c2a0) at ./src/condor_dagman/dagman_classad.cpp:191
#22 0x00005577f06afcfb in DagmanClassad::DagmanClassad (this=0x5577f127c2a0, DAGManJobId=...) at ./src/condor_dagman/dagman_classad.cpp:54
#23 0x00005577f06b2aca in main_init (argc=17, argv=0x7ffc0e5c6560) at ./src/condor_dagman/dagman_main.cpp:663
#24 0x00007fd5b126860c in dc_main (argc=17, argv=<optimized out>) at ./src/condor_daemon_core.V6/daemon_core_main.cpp:2746
#25 0x00007fd5ac5efb97 in __libc_start_main (main=0x5577f06a2520 <main(int, char**)>, argc=22, argv=0x7ffc0e5c6538, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc0e5c6528)
     at ../csu/libc-start.c:310
#26 0x00005577f06a273a in _start ()
-------------------------------------------------------------------------------------------

I jumped into Condor_Auth_Kerberos::init_daemon and found:
- The crash appears to happen in Condor_Auth_Kerberos::init_daemon at:
    dprintf(D_ALWAYS, "AUTH_ERROR: %s\n", (*error_message_ptr)(code));
   (I'm not fully sure since this is the official build with optimizations, so it may also be further down in the cleanup section).
- "code" is correctly at value 13 (permission denied). Sadly, error_message_ptr is optimized out :-(.
   libcomerr2 is installed (as expected).
- DAGMAN believes it is a DAEMON:
    (gdb) p *mySubSystem
     $4 = {m_Name = 0x5577f126b220 "DAGMAN", m_TempName = 0x0, m_NameValid = true, m_Type = SUBSYSTEM_TYPE_DAGMAN, m_TypeName = 0x7fd5b12bb929 "DAGMAN", m_Class = SUBSYSTEM_CLASS_DAEMON, m_Info = 0x5577f126b240,
      m_InfoTable = 0x5577f126b040, m_ClassName = 0x7fd5b128bc4e "DAEMON", m_LocalName = 0x0}
   This explains why Condor_Auth_Kerberos tries to elevate privileges and access /etc/krb5.keytab. Since condor_dagman is started by the user, that can never work.
   DAGMAN could submit jobs just fine, though, since it can access the user's Kerberos token. With the Trusty build, that worked very well.
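
Frame #3 of the trace sits at address 0x0, which would be consistent with a call through a function pointer that was never resolved. The following is a minimal sketch of guarding such a call, not HTCondor's actual code: error_message_fn, describe_error and fake_error_message are hypothetical names introduced only for illustration, and the real cause of the crash may lie elsewhere.

```cpp
#include <string>

// Hypothetical type for a dynamically resolved, com_err-style
// error-message function (an assumption, not the HTCondor declaration).
using error_message_fn = const char *(*)(long);

// Return printable text for an error code. If the function pointer was
// never resolved (nullptr), fall back to the numeric code instead of
// calling through it, which would jump to address 0 and crash.
std::string describe_error(error_message_fn fn, long code) {
    if (fn == nullptr) {
        return "error code " + std::to_string(code) + " (no message function loaded)";
    }
    const char *msg = fn(code);
    return msg != nullptr ? std::string(msg) : std::string("unknown error");
}

// Hypothetical stand-in for a successfully resolved message function.
const char *fake_error_message(long code) {
    return code == 13 ? "Permission denied" : "Unknown code";
}
```

Calling describe_error(fake_error_message, 13) yields the message text, while describe_error(nullptr, 13) yields the numeric fallback rather than dereferencing a null pointer.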

Any help is greatly appreciated. We'll try to upgrade to 8.8 in the near future, but this part of the code was not touched as far as I can see,
and I do conceptually wonder how DAGMAN is supposed to authenticate as a DAEMON when run with user privileges.

Cheers and thanks for any assistance!
	Oliver


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736



