I've made a ticket in our bug-tracking system:
I expect to have a fix in place for our next release.
On Oct 19, 2016, at 3:56 AM, Antonio Dorta <adorta@xxxxxx
I wouldn't say it happens so rarely in our case. I've checked our logs covering about 1 year in a ~200-machine pool, and there were 9038 lines like: "Starter pid XXXXX died on signal 11 (signal 11 (Segmentation fault))".
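A count like this could be reproduced with a short script. The following is only a minimal sketch: the function name `count_starter_segfaults` is hypothetical, and it assumes the log files are passed on the command line and that each crash is reported on a single line in the form quoted above.

```python
import re
import sys
from typing import Iterable

# Matches the message quoted above; the pid varies from line to line.
# The trailing \b keeps "signal 11" from matching e.g. "signal 112".
SEGFAULT_RE = re.compile(r"Starter pid \d+ died on signal 11\b")


def count_starter_segfaults(lines: Iterable[str]) -> int:
    """Count log lines reporting a starter killed by signal 11."""
    return sum(1 for line in lines if SEGFAULT_RE.search(line))


if __name__ == "__main__":
    # Log file paths are taken from the command line (an assumption;
    # where the logs live depends on the local configuration).
    total = 0
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            total += count_starter_segfaults(fh)
    print(total)
```

Run over the collected log files of every machine in the pool, the printed total is the number of starter segfaults recorded in the period those logs cover.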
Thanks once again for your help.
Jaime Frey <jfrey@xxxxxxxxxxx>:
On Oct 18, 2016, at 3:51 AM, Antonio Dorta <adorta@xxxxxx<mailto:adorta@xxxxxx>> wrote:
Informáticos Específicos (SIE)
de Astrofísica de Canarias (IAC)
Vía Láctea, s/n. 38205 - La Laguna, Santa Cruz de Tenerife
1124. Tfno: 922 60 5278. email: adorta@xxxxxx
at IAC: http://www.iac.es/sieinvens/SINFIN/Main/supercomputing.php
For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en
Yeah, you're right, there is quite a lot of information in StarterLog.slotX...
It seems it was trying to communicate with another machine and then it died...
This looks like some sloppy cleanup code when the starter exits in certain error conditions. It lost its network connection to the shadow daemon on the submit machine during file transfer. It was then told to exit, which precipitated jumbled cleanup code that
included trying to tell the shadow it was exiting (over a dead connection). I haven't traced the exact cause of the crash, but the code involved has several problems that need to be fixed.
The startd should recover from the crash without any assistance. If these crashes are only happening rarely, then I wouldn't worry about them. We will work on a fix, though.
Thanks and regards,
UW-Madison HTCondor Project