
Re: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...



On Oct 18, 2016, at 3:51 AM, Antonio Dorta <adorta@xxxxxx> wrote:

Yeah, you're right, there is pretty much information in StarterLog.slotX...

It seems it was trying to communicate to another machine and then it died...

This looks like some sloppy cleanup code when the starter exits in certain error conditions. It lost its network connection to the shadow daemon on the submit machine during file transfer. It was then told to exit, which precipitated jumbled cleanup code that included trying to tell the shadow it was exiting (over a dead connection). I haven't traced the exact cause of the crash, but the code involved has several problems that need to be fixed.

The startd should recover from the crash without any assistance. If these crashes are only happening rarely, then I wouldn't worry about them. We will work on a fix, though.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project