Re: [HTCondor-users] HTCondor 9.0.0 master getting SIGABRT during token request on RHEL/CentOS 8



Dear all,

To supplement Rene's email, I would like to share our current assumptions and findings.

It looks like the CentOS 8 packages are built with -D _GLIBCXX_ASSERTIONS; is that correct? If so, the code in [1] appears to be broken:

* Line 101 only reserves contiguous memory for the vector; it does not change the vector's size.
* When -D _GLIBCXX_ASSERTIONS is in effect, operator[] checks the index against the vector's size on the access in line 102.
* Since the size is still zero, that check fails and the process aborts with SIGABRT.
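
The difference between reserving and sizing a vector can be sketched as follows. This is only a minimal illustration, not the HTCondor code: fill_name is a hypothetical stand-in for any C-style call that writes into a caller-supplied buffer, and the two helper functions are invented for this example. The faulting access itself (indexing a vector whose size is zero) is deliberately left out, since it is undefined behaviour without the assertions and an abort with them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a C-style API that fills a caller-supplied
// buffer with a NUL-terminated name. This is NOT an HTCondor function.
int fill_name(char *buf, std::size_t len) {
    const std::string name = "worker-node";
    if (len <= name.size()) {
        return -1;                      // buffer too small for name + NUL
    }
    name.copy(buf, name.size());        // copy the characters
    buf[name.size()] = '\0';            // terminate the string
    return 0;
}

// reserve() only grows the capacity; size() stays at 0, so buf[0] would
// be out of bounds (and is caught when _GLIBCXX_ASSERTIONS is defined).
std::size_t size_after_reserve(std::size_t n) {
    std::vector<char> buf;
    buf.reserve(n);
    return buf.size();
}

// One possible fix: construct (or resize) the vector to n elements, so
// that every index in [0, n) refers to a valid element.
std::string name_via_sized_buffer(std::size_t n) {
    std::vector<char> buf(n, '\0');     // size() == n
    if (fill_name(buf.data(), buf.size()) != 0) {
        return std::string();           // report failure as an empty string
    }
    return std::string(buf.data());
}
```

With these helpers, size_after_reserve(256) returns 0, while name_via_sized_buffer(256) returns the name written by the stand-in function. Using buf.data() instead of &buf[0] also avoids the checked operator[] when only a pointer to the storage is needed.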

Best regards,
Manuel

[1] https://github.com/htcondor/htcondor/blob/397ce7a3488d7b4e41168b0d039b19468138eeea/src/condor_utils/token_utils.cpp#L101-L102

Dr. Manuel Giffels, Karlsruhe Institute of Technology (KIT), Steinbuch Centre for Computing (SCC)
Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen
Phone: +49 721 608 28636, Email: Manuel.Giffels@xxxxxxx

> On 27.04.2021 at 11:30, Caspart, René (SCC) <rene.caspart@xxxxxxx> wrote:
> 
> Dear all,
> 
> When upgrading one of our systems running RHEL 8 to HTCondor 9.0.0
> (previously we were running 8.9.11 without any problems) we encountered
> the condor_master terminating after receiving a SIGABRT. Based on the
> logs [1] this seems to be related to using token authentication and
> condor trying to request a token from the host running the collector.
> The machine where we saw this behavior is a worker node running a STARTD
> and having access to a token with ADVERTISE_STARTD permissions.
> 
> We were able to reproduce this behavior on a test machine (here we were
> only able to use CentOS 8 not RHEL 8) and were able to trace it down to
> the following backtrace [2], which points to [3] as the place in
> HTCondor where the abort is triggered.
> 
> Since this does not seem to be related to the specific setup of our hosts,
> has anyone encountered a similar issue?
> 
> Thanks,
> Rene
> 
> 
> [1]
> 04/26/21 12:15:00 (pid:1293) (D_SECURITY) Trying token request to remote
> host cloud-htcondor.gridka.de for user (default).
> Caught signal 6: si_code=4294967290, si_pid=1293, si_uid=232883,
> si_addr=0x50D
> Stack dump for process 1293 at timestamp 1619432100 (13 frames)
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(dprintf_dump_stack+0x28)[0x147ceb85baf8]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6d)[0x147ceba8686d]
> /lib64/libpthread.so.0(+0x12dd0)[0x147ce9928dd0]
> /lib64/libc.so.6(gsignal+0x10f)[0x147ce958b70f]
> /lib64/libc.so.6(abort+0x127)[0x147ce9575b25]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_ZN8htcondor18generate_client_idB5cxx11Ev+0x87)[0x147ceb998157]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(+0x3d10b8)[0x147ceba8d0b8]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(+0x3d18af)[0x147ceba8d8af]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_ZN12TimerManager7TimeoutEPiPd+0x3a3)[0x147cebaa1f13]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_ZN10DaemonCore6DriverEv+0x788)[0x147ceba72ed8]
> /hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_Z7dc_mainiPPc+0x1890)[0x147ceba8b4d0]
> /lib64/libc.so.6(__libc_start_main+0xf3)[0x147ce95776a3]
> condor_master(_start+0x2e)[0x558f85cedb4e]
> 
> [2]
> #0  0x00007ffff557499f in raise () from /usr/lib64/libc.so.6
> #1  0x00007ffff555ecf5 in abort () from /usr/lib64/libc.so.6
> #2  0x00007ffff7979157 in std::__replacement_assert
> (__condition=0x7ffff7a965a8 "__builtin_expect(__n < this->size(),
> true)", __function=<synthetic pointer>, __line=932,
>     __file=0x7ffff7a965d8 "/usr/include/c++/8/bits/stl_vector.h") at
> /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2391
> #3  std::vector<char, std::allocator<char> >::operator[] (__n=0,
> this=<synthetic pointer>) at /usr/include/c++/8/bits/stl_vector.h:932
> #4  htcondor::generate_client_id[abi:cxx11]() () at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_utils/token_utils.cpp:102
> #5  0x00007ffff7a6e0b8 in (anonymous
> namespace)::TokenRequest::tryTokenRequest (req=...) at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:462
> #6  0x00007ffff7a6e8af in (anonymous
> namespace)::TokenRequest::tryTokenRequests () at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:422
> #7  0x00007ffff7a82f13 in TimerManager::Timeout (this=0x55555579f290,
> pNumFired=pNumFired@entry=0x7fffffffdbf4,
> pruntime=pruntime@entry=0x7fffffffdbf8)
>     at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/timer_manager.cpp:473
> #8  0x00007ffff7a53ed8 in DaemonCore::Driver (this=0x5555557a03b0) at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core.cpp:3513
> #9  0x00007ffff7a6c4d0 in dc_main (argc=1, argv=<optimized out>) at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:4386
> #10 0x00007ffff5560873 in __libc_start_main () from /usr/lib64/libc.so.6
> #11 0x0000555555560b4e in _start () at
> /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_utils/dc_service.h:70
> 
> [3]
> https://github.com/htcondor/htcondor/blob/V9_0_0/src/condor_utils/token_utils.cpp#L102
> 
> -- 
> Karlsruher Institut für Technologie (KIT)
> Steinbuch Centre for Computing (SCC)
> 
> Dr. René Caspart
> 
> Hermann-von-Helmholtz-Platz 1 
> 76344 Eggenstein-Leopoldshafen, Germany
> Phone: +49 721 608-25631
> E-mail: Rene.Caspart@xxxxxxx
> 
> 
> Registered office:
> Kaiserstraße 12, 76131 Karlsruhe
> 
> 
> 
> KIT - The Research University in the Helmholtz Association
> 
> 
