Mailing List Archives
Public Access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [HTCondor-users] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [condor_procd:3044]
- Date: Thu, 27 May 2021 10:14:46 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [condor_procd:3044]
On 5/27/2021 7:42 AM, Gianmauro Cuccuru
wrote:
Hi!
It's happening quiet often to see the message below in several
worker nodes, and the jobs running there suddenly slow down.
if I restart the node, everything returns to normal for a while.
Any suggestions on what I could investigate?
Thank you in advance,
Gianmauro
Hi Gianmauro,
Could you let us know a) what version of HTCondor you are using
(output from condor_version), and b) what Linux distro and kernel
you are using? (output from 'uname -a') ? Also, are you running
HTCondor as root, or as a normal user (as would be the case with
pilot jobs or glideins)?
The condor_procd monitors the monitors the processes running on the
system so that HTCondor can do things like report total CPU
utilization for multi-process jobs, clean-up any orphaned processes
left behind when a job completes. To do this, it periodically scans
files in /proc. We have noticed that occasionally reading from
/proc hangs and/or times out in some versions of the Linux kernel.
This is, imho, seems like a Linux kernel bug.... simply reading
files from a directory (/proc) from user space should not cause part
of the kernel to lockup. Failing to successfully read information
from /proc can cause rare problems for HTCondor including restarting
the HTCondor daemons, an issue which we addressed in HTCondor
v8.9.12. (see ticket
https://opensciencegrid.atlassian.net/browse/HTCONDOR-33)
Just a brainstorm idea: I am now wondering if we should more
aggressively avoid this apparent Linux kernel issue by having the
condor_procd probe for process status information less often if, for
instance, HTCondor is running as root and thus can rely more on
cgroups to do job management....
regards
Todd
[Thu May 27 14:19:32 2021] watchdog: BUG: soft lockup - CPU#93
stuck for 22s! [condor_procd:3159]
[Thu May 27 14:19:32 2021] Modules linked in: nfsv3 nfs_acl nfs
lockd grace fscache nf_conntrack_netlink xt_addrtype br_netfilter
xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4
nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp
llc overlay sunrpc kvm_amd ccp kvm irqbypass crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon
cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops drm i2c_piix4 fuse ip_tables xfs libcrc32c ata_generic
ata_piix virtio_net libata net_failover crc32c_intel virtio_blk
failover serio_raw
[Thu May 27 14:19:32 2021] CPU: 93 PID: 3159 Comm: condor_procd
Kdump: loaded Tainted: G L --------- - -
4.18.0-240.22.1.el8_3.x86_64 #1
[Thu May 27 14:19:32 2021] Hardware name: RDO OpenStack Compute,
BIOS 1.11.0-2.el7 04/01/2014
[Thu May 27 14:19:32 2021] RIP:
0010:mem_cgroup_node_nr_lru_pages+0x89/0x100
[Thu May 27 14:19:32 2021] Code: 44 00 00 bf ff ff ff ff 4a 8d 2c
fd 00 00 00 00 45 31 ed eb 19 48 63 c7 48 8b 93 50 02 00 00 48 8b
04 c5 60 e8 15 91 48 01 ea <4c> 03 2c 10 4c 89 f6 e8 7b 3a
5f 00 3b 05 49 8d 15 01 89 c7 72 d5
[Thu May 27 14:19:32 2021] RSP: 0018:ffffa8c5cfe03df8 EFLAGS:
00000206 ORIG_RAX: ffffffffffffff13
[Thu May 27 14:19:32 2021] RAX: ffff9d1e15380000 RBX:
ffff9d1a6565a000 RCX: 000000000000001e
[Thu May 27 14:19:32 2021] RDX: 00002ba797216728 RSI:
0000000000000000 RDI: 000000000000001e
[Thu May 27 14:19:32 2021] RBP: 0000000000000010 R08:
0000000000001000 R09: 0000000000000000
[Thu May 27 14:19:32 2021] R10: ffff9d1def767000 R11:
ffff9d1def766090 R12: 000000000000000c
[Thu May 27 14:19:32 2021] R13: ffffffffffffa0ff R14:
ffffffff91415bc0 R15: 0000000000000002
[Thu May 27 14:19:32 2021] FS: 00007fef44e6d400(0000)
GS:ffff9d1e16340000(0000) knlGS:0000000000000000
[Thu May 27 14:19:32 2021] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[Thu May 27 14:19:32 2021] CR2: 00007f93b52a70ec CR3:
0000006685326000 CR4: 0000000000340ee0
[Thu May 27 14:19:32 2021] Call Trace:
[Thu May 27 14:19:32 2021] memcg_numa_stat_show+0x16d/0x1e0
[Thu May 27 14:19:32 2021] seq_read+0x153/0x410
[Thu May 27 14:19:32 2021] vfs_read+0x91/0x140
[Thu May 27 14:19:32 2021] ksys_read+0x4f/0xb0
[Thu May 27 14:19:32 2021] do_syscall_64+0x5b/0x1a0
[Thu May 27 14:19:32 2021]
entry_SYSCALL_64_after_hwframe+0x65/0xca
[Thu May 27 14:19:32 2021] RIP: 0033:0x7fef437d2805
[Thu May 27 14:19:32 2021] Code: fe ff ff 50 48 8d 3d 7a cb 09 00
e8 65 01 02 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 55 4d 2d 00 8b
00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66
90 41 54 49 89 d4 55 48 89 f5 53 89
[Thu May 27 14:19:32 2021] RSP: 002b:00007fff180bf4d8 EFLAGS:
00000246 ORIG_RAX: 0000000000000000
[Thu May 27 14:19:32 2021] RAX: ffffffffffffffda RBX:
000000000164f170 RCX: 00007fef437d2805
[Thu May 27 14:19:32 2021] RDX: 0000000000001000 RSI:
0000000001bf0c60 RDI: 000000000000000a
[Thu May 27 14:19:32 2021] RBP: 0000000000000d68 R08:
0000000000000080 R09: 00007fef43aa2c00
[Thu May 27 14:19:32 2021] R10: 0000000000000001 R11:
0000000000000246 R12: 00007fef43a9e880
[Thu May 27 14:19:32 2021] R13: 00007fef43a9f3c0 R14:
0000000000000d68 R15: 00007fef43a9e880
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685