
Re: [HTCondor-users] condor_starter... kernel: NMI watchdog: BUG: soft lockup - CPU stuck for...



Hi!

Yeah, you're right, there is quite a lot of information in StarterLog.slotX...

It seems it was trying to communicate with another machine and then it died...

Thanks!


10/15/16 14:51:38 (pid:1045) ******************************************************
10/15/16 14:51:38 (pid:1045) ** condor_starter (CONDOR_STARTER) STARTING UP
10/15/16 14:51:38 (pid:1045) ** /usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/condor_starter
10/15/16 14:51:38 (pid:1045) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/15/16 14:51:38 (pid:1045) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/15/16 14:51:38 (pid:1045) ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $
10/15/16 14:51:38 (pid:1045) ** $CondorPlatform: x86_64_RedHat7 $
10/15/16 14:51:38 (pid:1045) ** PID = 1045
10/15/16 14:51:38 (pid:1045) ** Log last touched 10/15 14:35:53
10/15/16 14:51:38 (pid:1045) ******************************************************
10/15/16 14:51:38 (pid:1045) Using config source: /home/condor/condor_config
10/15/16 14:51:38 (pid:1045) Using local config sources:
10/15/16 14:51:38 (pid:1045)    /home/condor/local.vial/condor_config.local
10/15/16 14:51:38 (pid:1045)    /home/condor/condor_config_time_restriction.local
10/15/16 14:51:38 (pid:1045)    /home/condor/local.vial/time_restrict_condor_config.local
10/15/16 14:51:38 (pid:1045)    /home/condor/condor_config.X86_64.LINUX
10/15/16 14:51:38 (pid:1045) config Macros = 134, Sorted = 133, StringBytes = 5454, TablesBytes = 4896
10/15/16 14:51:38 (pid:1045) CLASSAD_CACHING is OFF
10/15/16 14:51:38 (pid:1045) Daemon Log is logging: D_ALWAYS D_ERROR
10/15/16 14:51:38 (pid:1045) SharedPortEndpoint: waiting for connections to named socket 17272_6fb7_29
10/15/16 14:51:38 (pid:1045) DaemonCore: command socket at <161.72.201.15:9618?addrs=161.72.201.15-9618&noUDP&sock=17272_6fb7_29>
10/15/16 14:51:38 (pid:1045) DaemonCore: private command socket at <161.72.201.15:9618?addrs=161.72.201.15-9618&noUDP&sock=17272_6fb7_29>
10/15/16 14:51:38 (pid:1045) Communicating with shadow <161.72.202.2:9618?addrs=161.72.202.2-9618&noUDP&sock=956_56a0_20304>
10/15/16 14:51:38 (pid:1045) Submitting machine is "ibero.ll.iac.es"
10/15/16 14:51:38 (pid:1045) setting the orig job name in starter
10/15/16 14:51:38 (pid:1045) setting the orig job iwd in starter
10/15/16 14:51:38 (pid:1045) Chirp config summary: IO false, Updates false, Delayed updates true.
10/15/16 14:51:38 (pid:1045) Initialized IO Proxy.
10/15/16 14:51:38 (pid:1045) Done setting resource limits
10/15/16 15:21:51 (pid:1045) Suspending all jobs.
10/15/16 15:35:47 (pid:1045) Connection to shadow may be lost, will test by sending whoami request.
10/15/16 15:35:47 (pid:1045) condor_write(): Socket closed when trying to write 21 bytes to <161.72.202.2:54087>, fd is 16, errno=104 Connection reset by peer
10/15/16 15:35:47 (pid:1045) Buf::write(): condor_write() failed
10/15/16 15:35:47 (pid:1045) i/o error result is 0, errno is 104
10/15/16 15:35:47 (pid:1045) Lost connection to shadow, waiting 2400 secs for reconnect
10/15/16 15:35:47 (pid:1045) Continuing all jobs.
10/15/16 15:35:47 (pid:1045) Got SIGTERM. Performing graceful shutdown.
10/15/16 15:35:47 (pid:1045) ShutdownGraceful all jobs.
10/15/16 15:35:47 (pid:1048) condor_write(): Socket closed when trying to write 13 bytes to daemon at <161.72.202.2:9618>, fd is 13, errno=104 Connection reset by peer
10/15/16 15:35:47 (pid:1045) **** condor_starter (condor_STARTER) pid 1045 EXITING WITH STATUS 0
10/15/16 15:35:47 (pid:1045) ERROR "Assertion ERROR on (daemonCore)" at line 3823 in file /slots/02/dir_1864183/userdir/src/condor_utils/file_transfer.cpp
10/15/16 15:35:47 (pid:1048) Buf::write(): condor_write() failed
Stack dump for process 1045 at timestamp 1476542147 (30 frames)
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(dprintf_dump_stack+0x72)[0x7f1e40995852]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_Z18linux_sig_coredumpi+0x24)[0x7f1e40af0f44]
/lib64/libpthread.so.0(+0x100d0)[0x7f1e3f31c0d0]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libclassad.so.7(_ZNK7classad7ClassAd13LookupInScopeERKSsRPNS_8ExprTreeERNS_9EvalStateE+0xcc)[0x7f1e403d93dc]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libclassad.so.7(_ZNK7classad7ClassAd12EvaluateAttrERKSsRNS_5ValueE+0x3e)[0x7f1e403d987e]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libclassad.so.7(_ZNK7classad7ClassAd18EvaluateAttrStringERKSsRSs+0x2c)[0x7f1e403d9d2c]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(+0x1448cc)[0x7f1e4098a8cc]
10/15/16 15:35:47 (pid:1048) DoReceiveTransferGoAhead: failed to send alive_interval
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_Z11_putClassAdP6StreamRN7classad7ClassAdEi+0x2bf)[0x7f1e4098be2f]
condor_starter(REMOTE_CONDOR_ulog+0x5b)[0x43a41b]
condor_starter(REMOTE_CONDOR_ulog_error+0x87)[0x43e007]
condor_starter(_ZN9JICShadow18notifyStarterErrorEPKcbii+0x49)[0x4271b9]
condor_starter(exception_cleanup+0x33)[0x43e1f3]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_EXCEPT_+0x126)[0x7f1e40975296]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(+0x181feb)[0x7f1e409c7feb]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_ZN12FileTransfer10stopServerEv+0x16)[0x7f1e409c8006]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_ZN12FileTransferD1Ev+0x1f2)[0x7f1e409c8912]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_ZN12FileTransferD0Ev+0x9)[0x7f1e409c8a29]
condor_starter(_ZN9JICShadowD2Ev+0x77)[0x425937]
condor_starter(_ZN9JICShadowD0Ev+0x9)[0x425ac9]
condor_starter(_ZN8CStarterD2Ev+0x40)[0x455140]
/lib64/libc.so.6(+0x39392)[0x7f1e3ef88392]
/lib64/libc.so.6(+0x393e5)[0x7f1e3ef883e5]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(__wrap_exit+0x65)[0x7f1e40ac22d5]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_Z7DC_ExitiPKc+0x215)[0x7f1e40af1ac5]
condor_starter[0x4563f5]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_Z17handle_dc_sigtermP7Servicei+0x97)[0x7f1e40af1337]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_ZN10DaemonCore6DriverEv+0x851)[0x7f1e40ad4761]
/usr/pkg/condor/condor-8.4.9-x86_64_RedHat7-stripped/sbin/../lib/libcondor_utils_8_4_9.so(_Z7dc_mainiPPc+0x13a4)[0x7f1e40af4544]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7f1e3ef6efe0]
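
To make a dump like the one above easier to read, here is a minimal Python sketch that pulls the module and symbol out of each frame. It assumes frames have the form `module(symbol+0xOFFSET)[0xADDRESS]` or `module[0xADDRESS]`, as they do in this log. The names `FRAME_RE` and `stack_frames` are my own, not part of HTCondor. Log entries from other pids that land in the middle of the dump are skipped.

```python
import re

# Matches frames such as
#   /path/libfoo.so(symbol+0x72)[0x7f1e40995852]
#   /lib64/libpthread.so.0(+0x100d0)[0x7f1e3f31c0d0]
#   condor_starter[0x4563f5]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"                      # binary or shared library
    r"(?:\((?P<symbol>[^+)]*)\+0x[0-9a-f]+\))?"    # optional (symbol+offset)
    r"\[0x(?P<addr>[0-9a-f]+)\]$"                  # absolute address
)


def stack_frames(log_text):
    """Return (module basename, symbol or None) for each frame after a dump header."""
    frames = []
    in_dump = False
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Stack dump for process"):
            in_dump = True
            continue
        if not in_dump:
            continue
        match = FRAME_RE.match(line)
        if match:
            module = match.group("module").rsplit("/", 1)[-1]
            # An empty symbol (stripped binary) is reported as None.
            frames.append((module, match.group("symbol") or None))
        # Anything else, e.g. an interleaved log entry, is ignored.
    return frames


sample = """\
Stack dump for process 1045 at timestamp 1476542147 (30 frames)
/lib64/libpthread.so.0(+0x100d0)[0x7f1e3f31c0d0]
10/15/16 15:35:47 (pid:1048) DoReceiveTransferGoAhead: failed to send alive_interval
condor_starter(REMOTE_CONDOR_ulog+0x5b)[0x43a41b]
condor_starter[0x4563f5]
"""

for module, symbol in stack_frames(sample):
    print(module, symbol)
```

The symbols are still mangled C++ names. Piping them through `c++filt` turns, for example, `_ZN12FileTransfer10stopServerEv` into `FileTransfer::stopServer()`.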



Quoting Jaime Frey <jfrey@xxxxxxxxxxx>:

On Oct 17, 2016, at 8:28 AM, Antonio Dorta <adorta@xxxxxx> wrote:

Hi!

After running journalctl I can see errors like the following:

Oct 15 14:34:46 vial kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [condor_starter:685]
Oct 15 14:34:46 vial kernel: Modules linked in: bnep bluetooth fuse nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptabl
Oct 15 14:34:46 vial kernel: drm_kms_helper e1000e drm serio_raw ptp pps_core video vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE)
Oct 15 14:34:46 vial kernel: CPU: 0 PID: 685 Comm: condor_starter Tainted: G W OEL 4.1.5-100.fc21.x86_64 #1
Oct 15 14:34:46 vial kernel: Hardware name: ASUS All Series/Q87M-E, BIOS 1303 10/17/2014
Oct 15 14:34:46 vial kernel: task: ffff8801e7b3b160 ti: ffff880103d84000 task.ti: ffff880103d84000
Oct 15 14:34:46 vial kernel: RIP: 0010:[<ffffffff813b6535>] [<ffffffff813b6535>] copy_user_enhanced_fast_string+0x5/0x10
Oct 15 14:34:46 vial kernel: RSP: 0018:ffff880103d87c00  EFLAGS: 00010286
Oct 15 14:34:46 vial kernel: RAX: 00007ffdddbed000 RBX: ffffea00031cccc0 RCX: 0000000000000760
Oct 15 14:34:46 vial kernel: RDX: 0000000000001000 RSI: 00007ffdddbed8a0 RDI: ffff8800c73338a0
Oct 15 14:34:46 vial kernel: RBP: ffff880103d87c38 R08: 0000000000001000 R09: ffff88020f908000
Oct 15 14:34:46 vial kernel: R10: ffff880103d879b8 R11: ffffea00031cccc0 R12: ffff8802156174e0
Oct 15 14:34:46 vial kernel: R13: ffffea00031cccc0 R14: 00000000a2bb9665 R15: ffff880103d87b78
Oct 15 14:34:46 vial kernel: FS: 00007f1d4908db80(0000) GS:ffff88021fa00000(0000) knlGS:0000000000000000
Oct 15 14:34:46 vial kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 14:34:46 vial kernel: CR2: 00007f1d490b8000 CR3: 000000020640b000 CR4: 00000000001406f0
Oct 15 14:34:46 vial kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 15 14:34:46 vial kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 15 14:34:46 vial kernel: Stack:
Oct 15 14:34:46 vial kernel: ffffffff813bc0ca 0000000000010286 0000000000001000 00000000000d9000
Oct 15 14:34:46 vial kernel: ffff880103d87e68 0000000000000000 ffff880211d8f448 ffff880103d87ce8
Oct 15 14:34:46 vial kernel: ffffffff811aa1d6 ffff880103d87ca8 ffffffff8124a254 ffff880103d87c78
Oct 15 14:34:46 vial kernel: Call Trace:
Oct 15 14:34:46 vial kernel:  [<ffffffff813bc0ca>] ? iov_iter_copy_from_user_atomic+0x8a/0x210
Oct 15 14:34:46 vial kernel:  [<ffffffff811aa1d6>] generic_perform_write+0xe6/0x1e0
Oct 15 14:34:46 vial kernel:  [<ffffffff8124a254>] ? mntput+0x24/0x40
Oct 15 14:34:46 vial kernel: [<ffffffff811ac738>] __generic_file_write_iter+0x188/0x1d0
Oct 15 14:34:46 vial kernel:  [<ffffffff8109e3dd>] ? get_task_mm+0x1d/0x50
Oct 15 14:34:46 vial kernel:  [<ffffffff812b0ec5>] ext4_file_write_iter+0x255/0x4c0
Oct 15 14:34:46 vial kernel:  [<ffffffff81297574>] ? proc_single_show+0x54/0xa0
Oct 15 14:34:46 vial kernel:  [<ffffffff81798936>] ? mutex_lock+0x16/0x40
Oct 15 14:34:46 vial kernel:  [<ffffffff8124dd7d>] ? seq_read+0xbd/0x3d0
Oct 15 14:34:46 vial kernel:  [<ffffffff81269b3c>] ? fsnotify+0x3ac/0x580
Oct 15 14:34:46 vial kernel:  [<ffffffff81227f21>] __vfs_write+0xd1/0x110
Oct 15 14:34:46 vial kernel:  [<ffffffff812285f9>] vfs_write+0xa9/0x1b0
Oct 15 14:34:46 vial kernel:  [<ffffffff81798936>] ? mutex_lock+0x16/0x40
Oct 15 14:34:46 vial kernel:  [<ffffffff812294b5>] SyS_write+0x55/0xd0
Oct 15 14:34:46 vial kernel:  [<ffffffff8179a8ee>] system_call_fastpath+0x12/0x71
Oct 15 14:34:46 vial kernel: Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00


If I check the HTCondor StarterLog file, I can only see the following related event, with little information:

10/15/16 14:35:53 Starter pid 32644 died on signal 11 (signal 11 (Segmentation fault))
10/15/16 14:35:53 slot1: State change: starter exited



Do you know what the problem is and how it can be fixed?
I'm running HTCondor on Fedora 21 Linux with the latest stable version, HTCondor 8.4.9 (it was updated last week, although this problem also happened with previous versions).


The lines you show are from the HTCondor StartLog. The StarterLog.slotX file should have a full stack trace at the point of the crash. That would be helpful in determining what happened.
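
As a rough sketch of how one might find that trace, the Python snippet below lists the starter logs in a directory that contain a stack dump. It assumes the logs are named `StarterLog*`, sit in a single directory (typically the one reported by `condor_config_val LOG`), and mark a crash with a line starting `Stack dump for process`. The function name `logs_with_stack_dump` is purely illustrative.

```python
from pathlib import Path


def logs_with_stack_dump(log_dir, pattern="StarterLog*"):
    """Return the names of starter logs in log_dir that contain a stack dump header."""
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds undecodable bytes.
        text = path.read_text(errors="replace")
        if "Stack dump for process" in text:
            hits.append(path.name)
    return hits
```

For example, `logs_with_stack_dump("/var/log/condor")` would return something like `["StarterLog.slot1"]` if only that slot's starter had crashed. The directory here is an assumption, so substitute your own LOG directory.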

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project



--
Antonio Dorta
Servicios Informáticos Específicos (SIE)
Investigación y Enseñanza
Instituto de Astrofísica de Canarias (IAC)
C/ Vía Láctea, s/n. 38205 - La Laguna, Santa Cruz de Tenerife
Despacho: 1124. Tfno: 922 60 5278. email: adorta@xxxxxx
Supercomputing at IAC: http://www.iac.es/sieinvens/SINFIN/Main/supercomputing.php
----------------------------------------------------------------
WARNING: For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en