Re: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...
- Date: Wed, 19 Oct 2016 09:56:49 +0100
- From: Antonio Dorta <adorta@xxxxxx>
- Subject: Re: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...
Thanks for replying!
I wouldn't say it happens that rarely in our case. I checked about one
year of logs from our ~200-machine pool and found 9038 lines
like: "Starter pid XXXXX died on signal 11 (signal 11 (Segmentation
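For anyone who wants to reproduce a count like this on their own pool, here is a minimal sketch. It assumes only the message fragment quoted above ("Starter pid <pid> died on signal <n>"); the function names and the log path in the usage note are hypothetical, and the rest of the log line format is not assumed.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern built only from the message fragment quoted in this thread.
# Anything before or after it on the line (timestamps, etc.) is ignored.
CRASH_RE = re.compile(r"Starter pid (\d+) died on signal (\d+)")


def count_starter_crashes(lines: Iterable[str]) -> Counter:
    """Count lines reporting a starter death, grouped by signal number."""
    counts: Counter = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def count_in_files(paths: Iterable[str]) -> Counter:
    """Sum the per-signal counts over several log files (hypothetical helper)."""
    total: Counter = Counter()
    for path in paths:
        # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan.
        with open(path, errors="replace") as handle:
            total.update(count_starter_crashes(handle))
    return total
```

Usage would be something like `count_in_files(glob.glob("/path/to/collected/logs/*"))`, where the path is a placeholder for wherever the logs from the pool have been gathered. Grouping by signal makes it easy to separate segmentation faults (signal 11) from starters killed for other reasons.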
Thanks once again for your help.
Quoting Jaime Frey <jfrey@xxxxxxxxxxx>:
On Oct 18, 2016, at 3:51 AM, Antonio Dorta
Yeah, you're right, there is quite a lot of information in StarterLog.slotX...
It seems it was trying to communicate to another machine and then it died...
This looks like some sloppy cleanup code when the starter exits in
certain error conditions. It lost its network connection to the shadow
daemon on the submit machine during file transfer. It was then told
to exit, which precipitated jumbled cleanup code that included
trying to tell the shadow it was exiting (over a dead connection). I
haven't traced the exact cause of the crash, but the code involved
has several problems that need to be fixed.
The startd should recover from the crash without any assistance. If
these crashes are only happening rarely, then I wouldn't worry about
them. We will work on a fix, though.
Thanks and regards,
UW-Madison HTCondor Project
Servicios Informáticos Específicos (SIE)
Investigación y Enseñanza
Instituto de Astrofísica de Canarias (IAC)
C/ Vía Láctea, s/n. 38205 - La Laguna, Santa Cruz de Tenerife
Despacho: 1124. Tfno: 922 60 5278. email: adorta@xxxxxx
Supercomputing at IAC: