
Re: [HTCondor-users] HTCondor 9.0.0 condor_starter segfaulting during x509 proxy update



Hi Zach,

Thanks a lot for opening the issue and for your suggestion. I tested setting SEC_DEFAULT_CRYPTO_METHODS = BLOWFISH only on the submit node, and that does not seem to be sufficient. Setting it on both the worker node and the submit node does the trick.
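
Since the workaround has to be in place on both the submit node and the worker nodes, it can help to confirm on each host which value is actually in effect. The following is a minimal sketch that asks condor_config_val for SEC_DEFAULT_CRYPTO_METHODS. The function names and the injectable `run` parameter are illustrative, and treating a non-zero exit status as "not defined" is an assumption about condor_config_val's behaviour.

```python
import subprocess


def crypto_methods(run=subprocess.run):
    """Return the configured SEC_DEFAULT_CRYPTO_METHODS as a list of names.

    An empty list means the knob is not set locally (assumed to be signalled
    by a non-zero exit status), so HTCondor's built-in default applies.
    """
    result = run(
        ["condor_config_val", "SEC_DEFAULT_CRYPTO_METHODS"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [m.strip().upper() for m in result.stdout.split(",") if m.strip()]


def aes_disabled(methods):
    """True only if a method list is set explicitly and does not contain AES."""
    return bool(methods) and "AES" not in methods
```

Running `aes_disabled(crypto_methods())` on a submit node and on a worker node should return True on both once the workaround is configured everywhere.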

Cheers,
Rene

On 30.04.21 at 01:21, Zach Miller via HTCondor-users wrote:

Hello again,

I was able to reproduce this issue locally exactly as you described.

I've created a ticket for this if you want to follow the issue:

    https://opensciencegrid.atlassian.net/browse/HTCONDOR-456

Thanks again for the detailed report and let me know if the workaround does the trick for you for now.

Cheers,
-zach

-----Original Message-----
From: Zach Miller <zmiller@xxxxxxxxxxx>
Date: Thursday, April 29, 2021 at 3:24 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor 9.0.0 condor_starter segfaulting during x509 proxy update

Hi Rene,

This certainly seems like a bug. We will look into it and also try to reproduce it. Thank you for the report.

In the meantime, you may be able to work around it by disabling AES (which was added in 9.0.0 and which I suspect is related to what you are seeing). In your condor_config set:

    SEC_DEFAULT_CRYPTO_METHODS = BLOWFISH

I think setting it just on the submit node should be sufficient, but it won't hurt to set it everywhere. Let me know if that helps and I will let you know what we find once we know more.

Cheers,
-zach

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Caspart, René (SCC) <rene.caspart@xxxxxxx>
Date: Thursday, April 29, 2021 at 8:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor 9.0.0 condor_starter segfaulting during x509 proxy update

Dear all,

After updating to HTCondor 9.0.0 we are experiencing problems on our worker nodes (running SL7). The condor_starter encounters a segmentation fault after ~1h of runtime. In the StarterLog.slot1_X nothing is logged around that time until the start of the next condor_starter. In the StartLog I only see the report about the segmentation fault [1]. In addition, a core dump is created. The backtrace from the core dump is [2].

To me it seems that this is related to the update of the X509 proxy for the job. I tried submitting a job without a userproxy, which so far does not trigger the problem. Other than the update of the HTCondor version, nothing changed about the setup or the jobs we are submitting.

Has anyone experienced similar issues? Please let me know if any additional information would be useful to debug this issue.

Thanks,
Rene

[1]
StartLog
04/29/21 13:01:35 (pid:4025) (D_ALWAYS|D_FAILURE) Starter pid 3337908 died on signal 11 (signal 11 (Segmentation fault))
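
To check how often this happens on a given worker node, the StartLog can be scanned for lines like the one in [1]. The following is a rough sketch: the regular expression is modelled only on that single quoted line, so other HTCondor versions or debug settings may need a different pattern, and the function name `starter_crashes` is made up for illustration.

```python
import re

# Pattern modelled on the one StartLog line quoted in [1]; this is an
# assumption about the format, not a documented HTCondor log grammar.
DIED_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\(pid:(?P<startd_pid>\d+)\) .*?"
    r"Starter pid (?P<starter_pid>\d+) died on signal (?P<signal>\d+)"
)


def starter_crashes(lines):
    """Yield (timestamp, starter_pid, signal) for each starter killed by a signal."""
    for line in lines:
        match = DIED_RE.search(line)
        if match:
            timestamp = f"{match['date']} {match['time']}"
            yield timestamp, int(match["starter_pid"]), int(match["signal"])
```

It takes any iterable of lines, so an open StartLog file object can be passed in directly; entries with signal 11 are the segmentation faults.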

 

 

 

[2]
#0  0x00007fa4d9d7f657 in kill () from /usr/lib64/libc.so.6
#1  0x00007fa4dc520e64 in unix_sig_coredump (signum=11, s_info=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core_main.cpp:1355
#2  <signal handler called>
#3  0x00007fa4db0a9802 in EVP_DigestFinal_ex () from /usr/lib64/libcrypto.so.10
#4  0x00007fa4dc4aee1d in ReliSock::SndMsg::snd_packet (this=this@entry=0x55caf89bd470, peer_description=0x55caf89bd35c "<[2a00:139c:5:1dc:0:43:1:8c]:18245>", _sock=_sock@entry=19, end=end@entry=1, _timeout=_timeout@entry=10) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:1199
#5  0x00007fa4dc4af538 in ReliSock::end_of_message_internal (this=this@entry=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:564
#6  0x00007fa4dc4af5fc in ReliSock::end_of_message (this=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:546
#7  0x00007fa4dc4753de in relisock_gsi_put (arg=0x55caf89bd170, buf=0x55caf89f99a0, size=667) at /usr/src/debug/condor-9.0.0/src/condor_io/cedar_no_ckpt.cpp:943
#8  0x00007fa4dc3c4fc3 in x509_receive_delegation (destination_file=destination_file@entry=0x55caf89bdc40 "proxy.tmp", recv_data_func=recv_data_func@entry=0x7fa4dc4752c0 <relisock_gsi_get(void*, void**, unsigned long*)>, recv_data_ptr=recv_data_ptr@entry=0x55caf89bd170, send_data_func=send_data_func@entry=0x7fa4dc4753b0 <relisock_gsi_put(void*, void*, unsigned long)>, send_data_ptr=send_data_ptr@entry=0x55caf89bd170, state_ptr=state_ptr@entry=0x7ffc6dbd1738) at /usr/src/debug/condor-9.0.0/src/condor_utils/globus_utils.cpp:1676
#9  0x00007fa4dc47673e in ReliSock::get_x509_delegation (this=this@entry=0x55caf89bd170, destination=0x55caf89bdc40 "proxy.tmp", flush_buffers=flush_buffers@entry=false, state_ptr=state_ptr@entry=0x0) at /usr/src/debug/condor-9.0.0/src/condor_io/cedar_no_ckpt.cpp:757
#10 0x000055caf678fe84 in updateX509Proxy (path=0x55caf8993ac5 "proxy", rsock=0x55caf89bd170, cmd=500) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/jic_shadow.cpp:1791
#11 JICShadow::updateX509Proxy (this=0x55caf898c4f0, cmd=500, s=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/jic_shadow.cpp:1896
#12 0x000055caf676e65b in Starter::updateX509Proxy (this=<optimized out>, cmd=<optimized out>, s=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/starter.cpp:3723
#13 0x00007fa4dc50b46a in DaemonCore::CallCommandHandler (this=0x55caf897d0e0, req=500, stream=0x55caf89bd170, delete_stream=delete_stream@entry=false, check_payload=check_payload@entry=true, time_spent_on_sec=0.000277000014, time_spent_waiting_for_payload=time_spent_waiting_for_payload@entry=0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4468
#14 0x00007fa4dc4fa19a in DaemonCommandProtocol::ExecCommand (this=0x55caf89b35f0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:1810
#15 0x00007fa4dc4fd385 in DaemonCommandProtocol::doProtocol (this=this@entry=0x55caf89b35f0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:176
#16 0x00007fa4dc4fd485 in DaemonCommandProtocol::SocketCallback (this=this@entry=0x55caf89b35f0, stream=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:239
#17 0x00007fa4dc50c850 in DaemonCore::CallSocketHandler_worker (this=0x55caf897d0e0, i=3, default_to_HandleCommand=<optimized out>, asock=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4235
#18 0x00007fa4dc50c8ed in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x55caf89b2d00) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4194
#19 0x00007fa4dc345dd5 in CondorThreads::pool_add (routine=routine@entry=0x7fa4dc50c8d0 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>, arg=arg@entry=0x55caf89b2d00, tid=<optimized out>, descrip=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_utils/condor_threads.cpp:1109
#20 0x00007fa4dc508617 in DaemonCore::CallSocketHandler (this=this@entry=0x55caf897d0e0, i=@0x7ffc6dbd1c60: 3, default_to_HandleCommand=default_to_HandleCommand@entry=true) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4182
#21 0x00007fa4dc51130e in DaemonCore::Driver (this=0x55caf897d0e0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4019
#22 0x00007fa4dc525d12 in dc_main (argc=2, argv=0x7ffc6dbd2550) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core_main.cpp:4386
#23 0x00007fa4d9d6b555 in __libc_start_main () from /usr/lib64/libc.so.6
#24 0x000055caf676d961 in _start ()

 

 

 

--
Karlsruher Institut für Technologie (KIT)
Steinbuch Centre for Computing (SCC)

Dr. René Caspart

Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
Phone: +49 721 608-25631
E-mail: Rene.Caspart@xxxxxxx

Registered office:
Kaiserstraße 12, 76131 Karlsruhe

KIT – The Research University in the Helmholtz Association


