
Re: [HTCondor-users] HTCondor 8.6.13 stack dump



On 11/14/18 2:50 AM, Carles Acosta wrote:
Hello all,

We have recently updated our worker nodes (WNs) to CentOS 7 and HTCondor 8.6.13. On our WNs with more slots (48), we are seeing the condor_master crash from time to time with this error:


At Wisconsin, we run some machines with 80 cores, so I don't think this problem is caused solely by a large number of cores. Would it be possible for you to send me (off list) the entire MasterLog, so we can see what is going on?


-greg


Caught signal 6: si_code=0, si_pid=1, si_uid=0, si_addr=0x1
Stack dump for process 23309 at timestamp 1542183984 (16 frames)
/usr/lib64/libcondor_utils_8_6_13.so(dprintf_dump_stack+0x24)[0x7ff576c8b2e4]
/usr/lib64/libcondor_utils_8_6_13.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x69)[0x7ff576e15db9]
/usr/lib64/libpthread.so.0(+0xf6d0)[0x7ff5753666d0]
/usr/lib64/libpthread.so.0(read+0x10)[0x7ff5753657e0]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN10DaemonCore9Read_PipeEiPvi+0x6c)[0x7ff576de538c]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN15ProcFamilyProxy11start_procdEv+0x728)[0x7ff576d12e78]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN15ProcFamilyProxyC1EPKc+0x1f8)[0x7ff576d13748]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN19ProcFamilyInterface6createEPKc+0x5e)[0x7ff576c7c4ee]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN10DaemonCore14Create_ProcessEPKcRK7ArgList10priv_stateiiiPK3EnvS1_P10FamilyInfoPP6StreamPiSE_iP10__sigset_tiPmSE_S1_P8MyStringP15FilesystemRemapl+0x19e8)[0x7ff576dff188]
/usr/sbin/condor_master(_ZN6daemon9RealStartEv+0xca3)[0x410303]
/usr/sbin/condor_master(_ZN7Daemons15StartDaemonHereEP6daemon+0x26)[0x410e06]
/usr/sbin/condor_master(_ZN7Daemons15StartAllDaemonsEv+0x68)[0x410e78]
/usr/sbin/condor_master(_Z9main_initiPPc+0x6ff)[0x41583f]
/usr/lib64/libcondor_utils_8_6_13.so(_Z7dc_mainiPPc+0x138d)[0x7ff576e1942d]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff574fac445]
/usr/sbin/condor_master[0x40a6af]
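
For readers working through a dump like the one above, here is a minimal sketch in Python that pulls the signal details and the per-frame symbol names out of the text. The function names and regular expressions are hypothetical helpers written for this example, not part of HTCondor. It assumes the log format shown above and Linux signal numbering, where si_code=0 is SI_USER, meaning the signal was sent by another process (si_pid, running as si_uid) through a kill()-style call. It only reports who sent the signal; it does not identify why.

```python
import re
import signal

# Matches the "Caught signal" header line of the dump shown above.
SIGNAL_RE = re.compile(
    r"Caught signal (?P<signo>\d+): si_code=(?P<code>-?\d+), "
    r"si_pid=(?P<pid>\d+), si_uid=(?P<uid>\d+)"
)

# Matches one frame: object(symbol+offset)[address]; the (...) part is optional.
FRAME_RE = re.compile(
    r"^(?P<obj>[^(\[]+)"
    r"(?:\((?P<sym>[^+)]*)\+(?P<off>0x[0-9a-f]+)\))?"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def describe_signal(line):
    """Return the signal name and sender details, or None if no match."""
    m = SIGNAL_RE.search(line)
    if m is None:
        return None
    return {
        "signal": signal.Signals(int(m["signo"])).name,
        # Assumption: Linux, where si_code == 0 is SI_USER (sent via kill()).
        "sent_by_process": int(m["code"]) == 0,
        "sender_pid": int(m["pid"]),
        "sender_uid": int(m["uid"]),
    }


def frame_symbols(lines):
    """Return (object, symbol) pairs; symbol is None when the frame has none."""
    frames = []
    for line in lines:
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m["obj"], m["sym"] or None))
    return frames
```

The mangled C++ names returned by frame_symbols (those starting with _Z) can then be passed to a demangler such as c++filt to make the call chain easier to read.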

The /var/log/condor directory has a lot of core.XXXX dump files.

This is not happening on the rest of our machines, which have fewer slots. The WNs can run jobs for several days to a week without any issue before the crash. Any ideas? Increasing the stack size for the condor user with ulimit does not solve the issue.

Cheers,

Carles



--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/