
Re: [HTCondor-users] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [condor_procd:3044]



I believe this case is also running in a VM, which may exacerbate the NUMA problem.

-greg

On 5/27/21 11:05 AM, tpdownes@xxxxxxxxx wrote:
Just to +1 this, I recall seeing errors that smelled very much like this when I tried to use numad 6 or 7 years back.

Tom

On Thu, May 27, 2021 at 10:54 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
Hi all,

On 27.05.21 at 17:14, Todd Tannenbaum wrote:
> On 5/27/2021 7:42 AM, Gianmauro Cuccuru wrote:
>> Hi!
>>
>> We are quite often seeing the message below on several worker nodes, and the jobs running there suddenly slow down.
>> If I restart the node, everything returns to normal for a while.
>>
>> Any suggestions on what I could investigate?
>>
>> Thank you in advance,
>> Gianmauro
>
> Hi Gianmauro,
>
> Could you let us know a) what version of HTCondor you are using (output from condor_version), and b) what Linux distro and kernel you are using (output from 'uname -a')? Also, are you running HTCondor as root, or as a normal user (as would be the case with pilot jobs or glideins)?

To add to this: we saw something very similar just a week ago. In our case, this was on CentOS 7 (3.10.0-1160.25.1.el7.x86_64, but RHEL kernels are heavily patched).
I took some time to track this down, and for us it did not manifest itself via condor_procd first (we run it as root and use cgroups); instead, "ps aux" became so slow that our health check almost froze
(>30 seconds for a "ps aux" call), and our Zabbix monitoring became flaky as well. HTCondor itself was unfazed ;-).

Checking why this only started to happen on some of our nodes, and only after years of operation without the issue,
I found the stalls occurred mostly when reading "/proc/<pid>/cmdline". These reads seem to block especially when a process is being moved from one CPU socket to another,
i.e. when NUMA migration takes place[0]. I presume this is because processes can edit their own cmdline at runtime,
so the kernel probably takes a "block the read until the migration is done" approach.
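If someone wants to check whether they are affected by the same thing, the slow reads are easy to localize with a quick script like the one below (my own illustration, not HTCondor code); it simply times every read of /proc/<pid>/cmdline and prints the slowest ones:

#!/usr/bin/env python3
# Times each read of /proc/<pid>/cmdline; on an affected node single reads
# can take seconds, which is what makes "ps aux"-style scans crawl.
import os
import time

timings = []
for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    start = time.monotonic()
    try:
        with open(f"/proc/{entry}/cmdline", "rb") as f:
            f.read()
    except OSError:
        continue  # the process may have exited between listdir() and open()
    timings.append((time.monotonic() - start, entry))

# Print the ten slowest reads.
for elapsed, pid in sorted(timings, reverse=True)[:10]:
    print(f"pid {pid}: {elapsed * 1000:.1f} ms")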

With this in mind, I could confirm it only happened on our nodes with the highest core counts (128 logical cores)
and the largest node distance (numactl -H reports 32 between the sockets). It also correlated cleanly with user jobs constantly allocating and deallocating memory
(after contacting the user, I found these are essentially data reformatting jobs which sadly use a Python library that does not reuse memory but reallocates each time).
Needless to say, this ate a lot of memory bandwidth on the node.
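To illustrate the allocation pattern (hypothetical code, not the user's actual job): allocating a fresh large array for every chunk keeps the kernel busy mapping and unmapping pages, and thus keeps the NUMA balancer migrating them, while reusing one preallocated buffer lets the pages settle:

# Hypothetical illustration of the two allocation patterns (not the user's code).
import numpy as np

N = 2_000_000  # ~16 MB of float64 per array; the real jobs were much larger

def process_reallocating(chunks):
    # Every iteration allocates (and later frees) a new result array, so pages
    # are constantly mapped, touched and unmapped, which is exactly the churn
    # that keeps the kernel's NUMA balancer migrating pages between nodes.
    for chunk in chunks:
        result = chunk * 2.0          # new allocation each time
        result.tofile("/dev/null")

def process_reusing(chunks):
    # One output buffer is allocated up front and reused; once placed, its
    # pages stay put and the memory bandwidth goes to the actual computation.
    out = np.empty(N)
    for chunk in chunks:
        np.multiply(chunk, 2.0, out=out)
        out.tofile("/dev/null")

if __name__ == "__main__":
    data = [np.random.rand(N) for _ in range(3)]
    process_reallocating(data)
    process_reusing(data)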

We are not using "numad" yet, so it was apparently the kernel's automatic NUMA balancing doing this. Unsurprisingly, it also slowed the processes down significantly.
Since these are a rare job species for us, I have not yet tried out solutions. Possibilities that come to mind would be to set CPU affinities (a small sketch of that idea follows below), to check whether numad works better,
or to reduce the balancing frequency. Or, of course, to change the allocation scheme of the jobs, if possible, or at least to direct them to nodes with a smaller maximum NUMA node distance,
to increase throughput.
So at least in our case, it helped us find an inefficiency ;-).
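As a rough idea of the CPU-affinity option (just a sketch; the choice of node and PID would of course be up to the batch system or a wrapper, and the path assumes Linux sysfs): one can read a node's CPU list from sysfs and restrict a process to it.

# Sketch: pin a process to the CPUs of one NUMA node (assumes Linux sysfs).
import os

def node_cpus(node):
    # Parse /sys/devices/system/node/node<N>/cpulist, e.g. "0-15,64-79".
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_node(pid, node):
    # Restrict the scheduler to that node's CPUs; with the default local
    # allocation policy, memory then tends to come from that node as well,
    # so pages no longer get bounced between sockets.
    os.sched_setaffinity(pid, node_cpus(node))

if __name__ == "__main__":
    pin_to_node(os.getpid(), 0)
    print("now allowed on CPUs:", sorted(os.sched_getaffinity(os.getpid())))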

> The condor_procd monitors the processes running on the system so that HTCondor can do things like report total CPU utilization for multi-process jobs and clean up any orphaned processes left behind when a job completes. To do this, it periodically scans files in /proc. We have noticed that occasionally reading from /proc hangs and/or times out in some versions of the Linux kernel. This, imho, seems like a Linux kernel bug... simply reading files from a directory (/proc) in user space should not cause part of the kernel to lock up. Failing to successfully read information from /proc can cause rare problems for HTCondor, including restarting the HTCondor daemons, an issue which we addressed in HTCondor v8.9.12 (see ticket https://opensciencegrid.atlassian.net/browse/HTCONDOR-33)
>
> Just a brainstorm idea: I am now wondering if we should more aggressively avoid this apparent Linux kernel issue by having the condor_procd probe for process status information less often if, for instance, HTCondor is running as root and thus can rely more on cgroups to do job management....

I think this is a good idea :-). Probably the kernel could also do better, but I am not sure which semantics it has to guarantee (maybe blocking is indeed required?).
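For what it's worth, conceptually such a probe is just a walk over /proc that sums utime+stime from each /proc/<pid>/stat, and those are exactly the kind of reads that stall on an affected node. A rough sketch (my illustration, not the actual condor_procd code):

# Rough sketch of a periodic /proc scan accumulating per-process CPU time.
import os

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second

def cpu_seconds():
    # Return {pid: total CPU seconds} for all currently visible processes.
    usage = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                # Split after the "(comm)" field, which may contain spaces.
                fields = f.read().rsplit(")", 1)[1].split()
        except OSError:
            continue  # the process exited while we were scanning
        # Relative to the fields after (comm): [11] is utime, [12] is stime.
        usage[int(entry)] = (int(fields[11]) + int(fields[12])) / CLK_TCK
    return usage

if __name__ == "__main__":
    for pid, secs in sorted(cpu_seconds().items(), key=lambda kv: -kv[1])[:5]:
        print(f"pid {pid}: {secs:.1f} s of CPU")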

Cheers,
    Oliver

[0] https://unix.stackexchange.com/q/451673


>
> regards
> Todd
>
>
>
>>
>>
>> [Thu May 27 14:19:32 2021] watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [condor_procd:3159]
>> [Thu May 27 14:19:32 2021] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache nf_conntrack_netlink xt_addrtype br_netfilter xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc overlay sunrpc kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 fuse ip_tables xfs libcrc32c ata_generic ata_piix virtio_net libata net_failover crc32c_intel virtio_blk failover serio_raw
>> [Thu May 27 14:19:32 2021] CPU: 93 PID: 3159 Comm: condor_procd Kdump: loaded Tainted: G             L   --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
>> [Thu May 27 14:19:32 2021] Hardware name: RDO OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
>> [Thu May 27 14:19:32 2021] RIP: 0010:mem_cgroup_node_nr_lru_pages+0x89/0x100
>> [Thu May 27 14:19:32 2021] Code: 44 00 00 bf ff ff ff ff 4a 8d 2c fd 00 00 00 00 45 31 ed eb 19 48 63 c7 48 8b 93 50 02 00 00 48 8b 04 c5 60 e8 15 91 48 01 ea <4c> 03 2c 10 4c 89 f6 e8 7b 3a 5f 00 3b 05 49 8d 15 01 89 c7 72 d5
>> [Thu May 27 14:19:32 2021] RSP: 0018:ffffa8c5cfe03df8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
>> [Thu May 27 14:19:32 2021] RAX: ffff9d1e15380000 RBX: ffff9d1a6565a000 RCX: 000000000000001e
>> [Thu May 27 14:19:32 2021] RDX: 00002ba797216728 RSI: 0000000000000000 RDI: 000000000000001e
>> [Thu May 27 14:19:32 2021] RBP: 0000000000000010 R08: 0000000000001000 R09: 0000000000000000
>> [Thu May 27 14:19:32 2021] R10: ffff9d1def767000 R11: ffff9d1def766090 R12: 000000000000000c
>> [Thu May 27 14:19:32 2021] R13: ffffffffffffa0ff R14: ffffffff91415bc0 R15: 0000000000000002
>> [Thu May 27 14:19:32 2021] FS:  00007fef44e6d400(0000) GS:ffff9d1e16340000(0000) knlGS:0000000000000000
>> [Thu May 27 14:19:32 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [Thu May 27 14:19:32 2021] CR2: 00007f93b52a70ec CR3: 0000006685326000 CR4: 0000000000340ee0
>> [Thu May 27 14:19:32 2021] Call Trace:
>> [Thu May 27 14:19:32 2021]  memcg_numa_stat_show+0x16d/0x1e0
>> [Thu May 27 14:19:32 2021]  seq_read+0x153/0x410
>> [Thu May 27 14:19:32 2021]  vfs_read+0x91/0x140
>> [Thu May 27 14:19:32 2021]  ksys_read+0x4f/0xb0
>> [Thu May 27 14:19:32 2021]  do_syscall_64+0x5b/0x1a0
>> [Thu May 27 14:19:32 2021] entry_SYSCALL_64_after_hwframe+0x65/0xca
>> [Thu May 27 14:19:32 2021] RIP: 0033:0x7fef437d2805
>> [Thu May 27 14:19:32 2021] Code: fe ff ff 50 48 8d 3d 7a cb 09 00 e8 65 01 02 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 55 4d 2d 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
>> [Thu May 27 14:19:32 2021] RSP: 002b:00007fff180bf4d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [Thu May 27 14:19:32 2021] RAX: ffffffffffffffda RBX: 000000000164f170 RCX: 00007fef437d2805
>> [Thu May 27 14:19:32 2021] RDX: 0000000000001000 RSI: 0000000001bf0c60 RDI: 000000000000000a
>> [Thu May 27 14:19:32 2021] RBP: 0000000000000d68 R08: 0000000000000080 R09: 00007fef43aa2c00
>> [Thu May 27 14:19:32 2021] R10: 0000000000000001 R11: 0000000000000246 R12: 00007fef43a9e880
>> [Thu May 27 14:19:32 2021] R13: 00007fef43a9f3c0 R14: 0000000000000d68 R15: 00007fef43a9e880
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
> Center for High Throughput Computing     Department of Computer Sciences
> Calendar: https://tinyurl.com/yd55mtgd   1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                    Madison, WI 53706-1685
>
>
>


--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
