Re: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On Oct 19, 2016, at 3:56 AM, Antonio Dorta <adorta@xxxxxx> wrote:

Hi!

Thanks for replying!

I wouldn't say it happens that rarely in our case. I checked about a year of logs from our ~200-machine pool and found 9038 lines like: "Starter pid XXXXX died on signal 11 (signal 11 (Segmentation fault))".
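For anyone wanting to reproduce a count like this, here is a minimal sketch in Python. It assumes the logs are plain text and that the message appears exactly as quoted above; the function name `count_starter_crashes` and the log paths given on the command line are hypothetical, and the rest of each log line (timestamp and so on) is ignored.

```python
import re
import sys
from collections import Counter

# Pattern based only on the message quoted above. Any prefix on the
# line (timestamp, etc.) is assumed to exist but is not parsed.
CRASH_RE = re.compile(r"Starter pid (\d+) died on signal (\d+)")


def count_starter_crashes(lines):
    """Return a Counter mapping signal number -> number of matching lines."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical paths): python count_crashes.py <log file> [<log file> ...]
    total = Counter()
    for path in sys.argv[1:]:
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as handle:
            total += count_starter_crashes(handle)
    for signal_number, count in sorted(total.items()):
        print(f"signal {signal_number}: {count}")
```

Grouping by signal number makes it easy to see whether the segmentation faults (signal 11) dominate or whether starters are also dying for other reasons.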

Thanks once again for your help.

Best regards,

Quoting Jaime Frey <jfrey@xxxxxxxxxxx>:

On Oct 18, 2016, at 3:51 AM, Antonio Dorta <adorta@xxxxxx> wrote:

Yeah, you're right, there is quite a lot of information in StarterLog.slotX...

It seems it was trying to communicate with another machine and then it died...

This looks like some sloppy cleanup code when the starter exits in certain error conditions. It lost its network connection to the shadow daemon on the submit machine during file transfer. It was then told to exit, which triggered jumbled cleanup code that included trying to tell the shadow it was exiting (over a dead connection). I haven't traced the exact cause of the crash, but the code involved has several problems that need to be fixed.

The startd should recover from the crash without any assistance. If these crashes are only happening rarely, then I wouldn't worry about them. We will work on a fix, though.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project

--
Antonio Dorta
Servicios Informáticos Específicos (SIE)
Investigación y Enseñanza
Instituto de Astrofísica de Canarias (IAC)
C/ Vía Láctea, s/n. 38205 - La Laguna, Santa Cruz de Tenerife
Despacho: 1124. Tfno: 922 60 5278. email: adorta@xxxxxx
Supercomputing at IAC: http://www.iac.es/sieinvens/SINFIN/Main/supercomputing.php
----------------------------------------------------------------
ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Proteccion de Datos, acceda a http://www.iac.es/disclaimer.php
WARNING: For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

