
Re: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...



> On Oct 17, 2016, at 8:28 AM, Antonio Dorta <adorta@xxxxxx> wrote:
> 
> Hi!
> 
> After running journalctl I see errors like the following:
> 
> Oct 15 14:34:46 vial kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [condor_starter:685]
> Oct 15 14:34:46 vial kernel: Modules linked in: bnep bluetooth fuse nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptabl
> Oct 15 14:34:46 vial kernel:  drm_kms_helper e1000e drm serio_raw ptp pps_core video vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE)
> Oct 15 14:34:46 vial kernel: CPU: 0 PID: 685 Comm: condor_starter Tainted: G        W  OEL  4.1.5-100.fc21.x86_64 #1
> Oct 15 14:34:46 vial kernel: Hardware name: ASUS All Series/Q87M-E, BIOS 1303 10/17/2014
> Oct 15 14:34:46 vial kernel: task: ffff8801e7b3b160 ti: ffff880103d84000 task.ti: ffff880103d84000
> Oct 15 14:34:46 vial kernel: RIP: 0010:[<ffffffff813b6535>]  [<ffffffff813b6535>] copy_user_enhanced_fast_string+0x5/0x10
> Oct 15 14:34:46 vial kernel: RSP: 0018:ffff880103d87c00  EFLAGS: 00010286
> Oct 15 14:34:46 vial kernel: RAX: 00007ffdddbed000 RBX: ffffea00031cccc0 RCX: 0000000000000760
> Oct 15 14:34:46 vial kernel: RDX: 0000000000001000 RSI: 00007ffdddbed8a0 RDI: ffff8800c73338a0
> Oct 15 14:34:46 vial kernel: RBP: ffff880103d87c38 R08: 0000000000001000 R09: ffff88020f908000
> Oct 15 14:34:46 vial kernel: R10: ffff880103d879b8 R11: ffffea00031cccc0 R12: ffff8802156174e0
> Oct 15 14:34:46 vial kernel: R13: ffffea00031cccc0 R14: 00000000a2bb9665 R15: ffff880103d87b78
> Oct 15 14:34:46 vial kernel: FS:  00007f1d4908db80(0000) GS:ffff88021fa00000(0000) knlGS:0000000000000000
> Oct 15 14:34:46 vial kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 15 14:34:46 vial kernel: CR2: 00007f1d490b8000 CR3: 000000020640b000 CR4: 00000000001406f0
> Oct 15 14:34:46 vial kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct 15 14:34:46 vial kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Oct 15 14:34:46 vial kernel: Stack:
> Oct 15 14:34:46 vial kernel:  ffffffff813bc0ca 0000000000010286 0000000000001000 00000000000d9000
> Oct 15 14:34:46 vial kernel:  ffff880103d87e68 0000000000000000 ffff880211d8f448 ffff880103d87ce8
> Oct 15 14:34:46 vial kernel:  ffffffff811aa1d6 ffff880103d87ca8 ffffffff8124a254 ffff880103d87c78
> Oct 15 14:34:46 vial kernel: Call Trace:
> Oct 15 14:34:46 vial kernel:  [<ffffffff813bc0ca>] ? iov_iter_copy_from_user_atomic+0x8a/0x210
> Oct 15 14:34:46 vial kernel:  [<ffffffff811aa1d6>] generic_perform_write+0xe6/0x1e0
> Oct 15 14:34:46 vial kernel:  [<ffffffff8124a254>] ? mntput+0x24/0x40
> Oct 15 14:34:46 vial kernel:  [<ffffffff811ac738>] __generic_file_write_iter+0x188/0x1d0
> Oct 15 14:34:46 vial kernel:  [<ffffffff8109e3dd>] ? get_task_mm+0x1d/0x50
> Oct 15 14:34:46 vial kernel:  [<ffffffff812b0ec5>] ext4_file_write_iter+0x255/0x4c0
> Oct 15 14:34:46 vial kernel:  [<ffffffff81297574>] ? proc_single_show+0x54/0xa0
> Oct 15 14:34:46 vial kernel:  [<ffffffff81798936>] ? mutex_lock+0x16/0x40
> Oct 15 14:34:46 vial kernel:  [<ffffffff8124dd7d>] ? seq_read+0xbd/0x3d0
> Oct 15 14:34:46 vial kernel:  [<ffffffff81269b3c>] ? fsnotify+0x3ac/0x580
> Oct 15 14:34:46 vial kernel:  [<ffffffff81227f21>] __vfs_write+0xd1/0x110
> Oct 15 14:34:46 vial kernel:  [<ffffffff812285f9>] vfs_write+0xa9/0x1b0
> Oct 15 14:34:46 vial kernel:  [<ffffffff81798936>] ? mutex_lock+0x16/0x40
> Oct 15 14:34:46 vial kernel:  [<ffffffff812294b5>] SyS_write+0x55/0xd0
> Oct 15 14:34:46 vial kernel:  [<ffffffff8179a8ee>] system_call_fastpath+0x12/0x71
> Oct 15 14:34:46 vial kernel: Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00
> 
> 
> If I check the HTCondor StarterLog file, I see only the following related event, with little information:
> 
> 10/15/16 14:35:53 Starter pid 32644 died on signal 11 (signal 11 (Segmentation fault))
> 10/15/16 14:35:53 slot1: State change: starter exited
> 
> 
> 
> Do you know what the problem is and how it can be fixed?
> I'm running HTCondor on Fedora 21 with the latest stable version of HTCondor, 8.4.9 (it was updated last week, although this problem also happened with previous versions).


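The lines you show are from the HTCondor StartLog (the startd's log), not the StarterLog. The StarterLog.slotX file for the affected slot (StarterLog.slot1, going by the "slot1" in your excerpt) should have a full stack trace at the point of the crash. That would be helpful in determining what happened.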
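
If it helps, here is a rough Python sketch for pulling the relevant part out of the starter logs. It is only an illustration: the marker strings in CRASH_PATTERN and the helper names find_crash_context and scan_log_dir are my own assumptions, not anything HTCondor defines, so adjust the pattern to whatever your log actually contains.

```python
import re
from pathlib import Path

# Assumption: the crash output can be located by one of these marker strings.
# They are guesses for illustration; edit them to match your own log.
CRASH_PATTERN = re.compile(r"Stack dump|signal 11|Segmentation fault", re.IGNORECASE)


def find_crash_context(lines, before=5, after=30):
    """Return a list of (line_number, context_lines) for each matching line.

    line_number is 1-based; context_lines holds up to `before` lines ahead of
    the match, the match itself, and up to `after` lines following it.
    """
    hits = []
    for i, line in enumerate(lines):
        if CRASH_PATTERN.search(line):
            start = max(0, i - before)
            hits.append((i + 1, lines[start:i + after + 1]))
    return hits


def scan_log_dir(log_dir):
    """Print context around crash markers in every StarterLog* file in log_dir."""
    for path in sorted(Path(log_dir).glob("StarterLog*")):
        lines = path.read_text(errors="replace").splitlines()
        for lineno, context in find_crash_context(lines):
            print(f"--- {path.name}:{lineno} ---")
            print("\n".join(context))
```

You would call scan_log_dir() with the directory reported by "condor_config_val LOG" (often /var/log/condor on packaged installs, but check your own configuration) and post the output for the time of the crash.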

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project