Mailing List Archives
Public Access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [HTCondor-users] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [condor_procd:3044]
- Date: Thu, 27 May 2021 19:52:49 +0200
- From: Gianmauro Cuccuru <gmauro@xxxxxxxxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [condor_procd:3044]
Hi Todd,
I can add more details:
centos@vgcnbwc-worker-c125m425-4622:~$ condor_version
$CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $
$CondorPlatform: x86_64_CentOS8 $\
centos@vgcnbwc-worker-c125m425-4622:~$ uname -a
Linux vgcnbwc-worker-c125m425-4622.novalocal
4.18.0-240.22.1.el8_3.x86_64 #1 SMP Thu Apr 8 19:01:30 UTC 2021 x86_64
x86_64 x86_64 GNU/Linux
centos@vgcnbwc-worker-c125m425-9515:~$ cat /etc/redhat-release
CentOS Linux release 8.3.2011
We run it as root, and we are using cgroups.
This problem happens mostly on our nodes with the highest core counts
(128 logical cores, 400GB)
The worker nodes are VMs spanned from our OpenStack cloud.
Thank you
On 27.05.21 17:14, Todd Tannenbaum wrote:
On 5/27/2021 7:42 AM, Gianmauro Cuccuru wrote:
Hi!
It's happening quiet often to see the message below in several worker
nodes, and the jobs running there suddenly slow down.
if I restart the node, everything returns to normal for a while.
Any suggestions on what I could investigate?
Thank you in advance,
Gianmauro
Hi Gianmauro,
Could you let us know a) what version of HTCondor you are using (output
from condor_version), and b) what Linux distro and kernel you are
using? (output from 'uname -a') ? Also, are you running HTCondor as
root, or as a normal user (as would be the case with pilot jobs or
glideins)?
The condor_procd monitors the monitors the processes running on the
system so that HTCondor can do things like report total CPU utilization
for multi-process jobs, clean-up any orphaned processes left behind when
a job completes. To do this, it periodically scans files in /proc. We
have noticed that occasionally reading from /proc hangs and/or times out
in some versions of the Linux kernel. This is, imho, seems like a Linux
kernel bug.... simply reading files from a directory (/proc) from user
space should not cause part of the kernel to lockup. Failing to
successfully read information from /proc can cause rare problems for
HTCondor including restarting the HTCondor daemons, an issue which we
addressed in HTCondor v8.9.12. (see ticket
https://opensciencegrid.atlassian.net/browse/HTCONDOR-33)
Just a brainstorm idea: I am now wondering if we should more
aggressively avoid this apparent Linux kernel issue by having the
condor_procd probe for process status information less often if, for
instance, HTCondor is running as root and thus can rely more on cgroups
to do job management....
regards
Todd
[Thu May 27 14:19:32 2021] watchdog: BUG: soft lockup - CPU#93 stuck
for 22s! [condor_procd:3159]
[Thu May 27 14:19:32 2021] Modules linked in: nfsv3 nfs_acl nfs lockd
grace fscache nf_conntrack_netlink xt_addrtype br_netfilter
xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4
nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
overlay sunrpc kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel joydev pcspkr virtio_balloon cirrus drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 fuse
ip_tables xfs libcrc32c ata_generic ata_piix virtio_net libata
net_failover crc32c_intel virtio_blk failover serio_raw
[Thu May 27 14:19:32 2021] CPU: 93 PID: 3159 Comm: condor_procd Kdump:
loaded Tainted: GÂÂÂÂÂÂÂÂÂÂÂÂ LÂÂ --------- -Â -
4.18.0-240.22.1.el8_3.x86_64 #1
[Thu May 27 14:19:32 2021] Hardware name: RDO OpenStack Compute, BIOS
1.11.0-2.el7 04/01/2014
[Thu May 27 14:19:32 2021] RIP:
0010:mem_cgroup_node_nr_lru_pages+0x89/0x100
[Thu May 27 14:19:32 2021] Code: 44 00 00 bf ff ff ff ff 4a 8d 2c fd
00 00 00 00 45 31 ed eb 19 48 63 c7 48 8b 93 50 02 00 00 48 8b 04 c5
60 e8 15 91 48 01 ea <4c> 03 2c 10 4c 89 f6 e8 7b 3a 5f 00 3b 05 49 8d
15 01 89 c7 72 d5
[Thu May 27 14:19:32 2021] RSP: 0018:ffffa8c5cfe03df8 EFLAGS: 00000206
ORIG_RAX: ffffffffffffff13
[Thu May 27 14:19:32 2021] RAX: ffff9d1e15380000 RBX: ffff9d1a6565a000
RCX: 000000000000001e
[Thu May 27 14:19:32 2021] RDX: 00002ba797216728 RSI: 0000000000000000
RDI: 000000000000001e
[Thu May 27 14:19:32 2021] RBP: 0000000000000010 R08: 0000000000001000
R09: 0000000000000000
[Thu May 27 14:19:32 2021] R10: ffff9d1def767000 R11: ffff9d1def766090
R12: 000000000000000c
[Thu May 27 14:19:32 2021] R13: ffffffffffffa0ff R14: ffffffff91415bc0
R15: 0000000000000002
[Thu May 27 14:19:32 2021] FS:Â 00007fef44e6d400(0000)
GS:ffff9d1e16340000(0000) knlGS:0000000000000000
[Thu May 27 14:19:32 2021] CS:Â 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[Thu May 27 14:19:32 2021] CR2: 00007f93b52a70ec CR3: 0000006685326000
CR4: 0000000000340ee0
[Thu May 27 14:19:32 2021] Call Trace:
[Thu May 27 14:19:32 2021]Â memcg_numa_stat_show+0x16d/0x1e0
[Thu May 27 14:19:32 2021]Â seq_read+0x153/0x410
[Thu May 27 14:19:32 2021]Â vfs_read+0x91/0x140
[Thu May 27 14:19:32 2021]Â ksys_read+0x4f/0xb0
[Thu May 27 14:19:32 2021]Â do_syscall_64+0x5b/0x1a0
[Thu May 27 14:19:32 2021] entry_SYSCALL_64_after_hwframe+0x65/0xca
[Thu May 27 14:19:32 2021] RIP: 0033:0x7fef437d2805
[Thu May 27 14:19:32 2021] Code: fe ff ff 50 48 8d 3d 7a cb 09 00 e8
65 01 02 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 55 4d 2d 00 8b 00 85
c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4
55 48 89 f5 53 89
[Thu May 27 14:19:32 2021] RSP: 002b:00007fff180bf4d8 EFLAGS: 00000246
ORIG_RAX: 0000000000000000
[Thu May 27 14:19:32 2021] RAX: ffffffffffffffda RBX: 000000000164f170
RCX: 00007fef437d2805
[Thu May 27 14:19:32 2021] RDX: 0000000000001000 RSI: 0000000001bf0c60
RDI: 000000000000000a
[Thu May 27 14:19:32 2021] RBP: 0000000000000d68 R08: 0000000000000080
R09: 00007fef43aa2c00
[Thu May 27 14:19:32 2021] R10: 0000000000000001 R11: 0000000000000246
R12: 00007fef43a9e880
[Thu May 27 14:19:32 2021] R13: 00007fef43a9f3c0 R14: 0000000000000d68
R15: 00007fef43a9e880
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Todd Tannenbaum<tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
Calendar:https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685
--
Gianmauro Cuccuru
UseGalaxy.eu
Bioinformatics Group
Department of Computer Science
Albert-Ludwigs-University Freiburg
Georges-KÃhler-Allee 106
79110 Freiburg, Germany