Re: [HTCondor-users] HTCondor 8.6.13 stack dump

On 11/14/2018 2:50 AM, Carles Acosta wrote:
Hello all,

We recently updated our worker nodes (WNs) to CentOS 7 and HTCondor 8.6.13. On our larger WNs, the ones with 48 slots, condor_master crashes from time to time with this error:

Caught signal 6: si_code=0, si_pid=1, si_uid=0, si_addr=0x1

What's your hardware? E.g., we have a couple of 48-core Supermicros that shipped with "balanced fan plan" set in the BIOS and no thermal sensors under the memory banks on the motherboard. Guess what happened when they got going.

We have about 10 older servers of varying vintage that have been running CentOS 7 pretty much since it came out (tracking CR), with the current stable HTCondor. No condor crashes that I can recall, nor any signal 6s on any of them.