
[HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...



Hi!

After running journalctl I see errors like the following:

Oct 15 14:34:46 vial kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [condor_starter:685]
Oct 15 14:34:46 vial kernel: Modules linked in: bnep bluetooth fuse nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptabl
Oct 15 14:34:46 vial kernel: drm_kms_helper e1000e drm serio_raw ptp pps_core video vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE)
Oct 15 14:34:46 vial kernel: CPU: 0 PID: 685 Comm: condor_starter Tainted: G W OEL 4.1.5-100.fc21.x86_64 #1
Oct 15 14:34:46 vial kernel: Hardware name: ASUS All Series/Q87M-E, BIOS 1303 10/17/2014
Oct 15 14:34:46 vial kernel: task: ffff8801e7b3b160 ti: ffff880103d84000 task.ti: ffff880103d84000
Oct 15 14:34:46 vial kernel: RIP: 0010:[<ffffffff813b6535>] [<ffffffff813b6535>] copy_user_enhanced_fast_string+0x5/0x10
Oct 15 14:34:46 vial kernel: RSP: 0018:ffff880103d87c00  EFLAGS: 00010286
Oct 15 14:34:46 vial kernel: RAX: 00007ffdddbed000 RBX: ffffea00031cccc0 RCX: 0000000000000760
Oct 15 14:34:46 vial kernel: RDX: 0000000000001000 RSI: 00007ffdddbed8a0 RDI: ffff8800c73338a0
Oct 15 14:34:46 vial kernel: RBP: ffff880103d87c38 R08: 0000000000001000 R09: ffff88020f908000
Oct 15 14:34:46 vial kernel: R10: ffff880103d879b8 R11: ffffea00031cccc0 R12: ffff8802156174e0
Oct 15 14:34:46 vial kernel: R13: ffffea00031cccc0 R14: 00000000a2bb9665 R15: ffff880103d87b78
Oct 15 14:34:46 vial kernel: FS: 00007f1d4908db80(0000) GS:ffff88021fa00000(0000) knlGS:0000000000000000
Oct 15 14:34:46 vial kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 14:34:46 vial kernel: CR2: 00007f1d490b8000 CR3: 000000020640b000 CR4: 00000000001406f0
Oct 15 14:34:46 vial kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 15 14:34:46 vial kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 15 14:34:46 vial kernel: Stack:
Oct 15 14:34:46 vial kernel: ffffffff813bc0ca 0000000000010286 0000000000001000 00000000000d9000
Oct 15 14:34:46 vial kernel: ffff880103d87e68 0000000000000000 ffff880211d8f448 ffff880103d87ce8
Oct 15 14:34:46 vial kernel: ffffffff811aa1d6 ffff880103d87ca8 ffffffff8124a254 ffff880103d87c78
Oct 15 14:34:46 vial kernel: Call Trace:
Oct 15 14:34:46 vial kernel: [<ffffffff813bc0ca>] ? iov_iter_copy_from_user_atomic+0x8a/0x210
Oct 15 14:34:46 vial kernel: [<ffffffff811aa1d6>] generic_perform_write+0xe6/0x1e0
Oct 15 14:34:46 vial kernel:  [<ffffffff8124a254>] ? mntput+0x24/0x40
Oct 15 14:34:46 vial kernel: [<ffffffff811ac738>] __generic_file_write_iter+0x188/0x1d0
Oct 15 14:34:46 vial kernel:  [<ffffffff8109e3dd>] ? get_task_mm+0x1d/0x50
Oct 15 14:34:46 vial kernel: [<ffffffff812b0ec5>] ext4_file_write_iter+0x255/0x4c0
Oct 15 14:34:46 vial kernel: [<ffffffff81297574>] ? proc_single_show+0x54/0xa0
Oct 15 14:34:46 vial kernel:  [<ffffffff81798936>] ? mutex_lock+0x16/0x40
Oct 15 14:34:46 vial kernel:  [<ffffffff8124dd7d>] ? seq_read+0xbd/0x3d0
Oct 15 14:34:46 vial kernel:  [<ffffffff81269b3c>] ? fsnotify+0x3ac/0x580
Oct 15 14:34:46 vial kernel:  [<ffffffff81227f21>] __vfs_write+0xd1/0x110
Oct 15 14:34:46 vial kernel:  [<ffffffff812285f9>] vfs_write+0xa9/0x1b0
Oct 15 14:34:46 vial kernel:  [<ffffffff81798936>] ? mutex_lock+0x16/0x40
Oct 15 14:34:46 vial kernel:  [<ffffffff812294b5>] SyS_write+0x55/0xd0
Oct 15 14:34:46 vial kernel: [<ffffffff8179a8ee>] system_call_fastpath+0x12/0x71
Oct 15 14:34:46 vial kernel: Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00


When I check the HTCondor StarterLog file, I only see the following related event, with little information:

10/15/16 14:35:53 Starter pid 32644 died on signal 11 (signal 11 (Segmentation fault))
10/15/16 14:35:53 slot1: State change: starter exited
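
To check whether the starter crashes consistently follow the kernel soft lockups, the two logs can be matched by timestamp. The following is only a minimal sketch: the function names and the 120-second window are hypothetical choices, the regular expressions are based solely on the two line formats quoted above, and the year has to be supplied by hand because the journalctl lines do not include one.

```python
import re
from datetime import datetime

# Matches the kernel line quoted above, e.g.
# "Oct 15 14:34:46 vial kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [condor_starter:685]"
LOCKUP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"NMI watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Matches the HTCondor log line quoted above, e.g.
# "10/15/16 14:35:53 Starter pid 32644 died on signal 11 (...)"
DIED_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Starter pid (?P<pid>\d+) died on signal (?P<sig>\d+)"
)


def parse_lockups(lines, year):
    """Return (time, command, pid, stuck_seconds) for each soft lockup line."""
    events = []
    for line in lines:
        m = LOCKUP_RE.match(line)
        if m:
            # The journal timestamp has no year, so the caller supplies it.
            when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((when, m["comm"], int(m["pid"]), int(m["secs"])))
    return events


def parse_starter_deaths(lines):
    """Return (time, pid, signal) for each 'Starter pid ... died' line."""
    deaths = []
    for line in lines:
        m = DIED_RE.match(line)
        if m:
            when = datetime.strptime(m["ts"], "%m/%d/%y %H:%M:%S")
            deaths.append((when, int(m["pid"]), int(m["sig"])))
    return deaths


def correlate(lockups, deaths, window_s=120):
    """Pair each starter death with lockups in the preceding window_s seconds."""
    pairs = []
    for d_when, d_pid, sig in deaths:
        for l_when, comm, l_pid, _secs in lockups:
            delta = (d_when - l_when).total_seconds()
            if 0 <= delta <= window_s:
                pairs.append((l_when, comm, l_pid, d_when, d_pid, sig, delta))
    return pairs
```

Matching is done on time only: in the excerpts above the kernel reports PID 685 while the HTCondor log reports starter pid 32644, so the PIDs cannot be used as a join key. A time match would show that the two events occur together, not that one causes the other.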



Do you know what the problem is and how it can be fixed?
I'm running HTCondor on Fedora 21 Linux with the latest stable version of HTCondor, 8.4.9 (it was updated last week, although this problem also occurred with previous versions).

Thank you very much,




--
Antonio Dorta
Servicios Informáticos Específicos (SIE)
Investigación y Enseñanza
Instituto de Astrofísica de Canarias (IAC)
C/ Vía Láctea, s/n. 38205 - La Laguna, Santa Cruz de Tenerife
Despacho: 1124. Tfno: 922 60 5278. email: adorta@xxxxxx
Supercomputing at IAC: http://www.iac.es/sieinvens/SINFIN/Main/supercomputing.php