Re: [HTCondor-users] spontaneous reboots after enabling cgroups



Hi Jason,

If memory serves, the RHEL 6.4 kernel can crash when attempting to freeze a set of SIGSTOP'd processes.  I don't know whether this has been fixed in the upstream kernel, though.

Two workarounds come to mind:

1) Unmount the freezer controller.  HTCondor should simply not use controllers that are not available.
2) Set SUSPEND=FALSE in the worker node configuration, so that jobs are never suspended.
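
For workaround 1, it may help to confirm whether the freezer controller is actually mounted before and after the change. The following is only a rough sketch, assuming cgroup v1 hierarchies show up in /proc/mounts with the controller names in the options column; the function name and the /cgroup/... paths in the sample are made up for illustration and are not taken from HTCondor.

```python
def freezer_mount_points(mounts_text):
    """Return mount points of cgroup hierarchies that include the freezer controller.

    mounts_text is expected to look like the contents of /proc/mounts.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        # Controllers attached to a hierarchy are listed among the mount options.
        if "freezer" in fields[3].split(","):
            points.append(fields[1])
    return points


# Hypothetical sample data; on a real node, read /proc/mounts instead.
sample = (
    "cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0\n"
    "cgroup /cgroup/memory cgroup rw,relatime,memory 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)
print(freezer_mount_points(sample))
```

On a worker node, passing open("/proc/mounts").read() to the function should return an empty list once the freezer controller has been unmounted.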

Hope this helps,

Brian

On Jun 26, 2013, at 7:31 PM, Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx> wrote:

> I have a pool of machines running CentOS 6.4, Kernel 2.6.32-358, and HTCondor 7.9.4.
> 
> Today, in order to try to stop jobs which underestimate their memory usage from making the machines swap a lot and get slow, I enabled cgroups and set
> 
> CGROUP_MEMORY_LIMIT_POLICY = soft
> RESERVED_MEMORY = 1024
> 
> The idea was to make sure there was always at least 1G of physical memory available for system and interactive processes. This worked as intended, and the thrashing problems went away, but now I'm seeing machines randomly reboot, without any error messages in the system logs.
> 
> In the one machine where I have kdump enabled, the error below was in vcore-dmesg.txt from the crash dump.
> 
> 
> <2>kernel BUG at kernel/cgroup_freezer.c:247!
> <4>invalid opcode: 0000 [#1] SMP
> <4>last sysfs file: /sys/devices/virtual/block/dm-0/uevent
> <4>CPU 1
> <4>Modules linked in: fuse nfsd exportfs gfs2 nfs lockd fscache auth_rpcgss nfs_acl bnx2fc fcoe
> libfcoe libfc scsi_transport_fc scsi_tgt dlm configfs 8021q garp stp llc sunrpc ipt_REJECT
> nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
> xt_state nf_conntrack ip6table_filter ip6_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr
> iscsi_tcp sg dcdbas k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4
> mbcache jbd2 sd_mod crc_t10dif ixgbe igb dca ptp pps_core ata_generic pata_acpi pata_atiixp ahci
> dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3
> mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
> <4>
> <4>Pid: 3618, comm: condor_procd Not tainted 2.6.32-358.11.1.el6.x86_64 #1 Dell Inc. PowerEdge C6105 /0MVKG0
> <4>RIP: 0010:[<ffffffff810ca64b>] [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0
> <4>RSP: 0018:ffff880803183d98  EFLAGS: 00010097
> <4>RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8800378e3e18
> <4>RDX: 0000000000000000 RSI: ffff880803183da8 RDI: ffff88055242d000
> <4>RBP: ffff880803183de8 R08: ffff88080527c318 R09: 0000000000000000
> <4>R10: 00000000ffffffff R11: 0000000000000246 R12: ffff88055242d000
> <4>R13: ffff880803183da8 R14: 0000000000000000 R15: 0000000000000002
> <4>FS:  00007f2e19ca0b40(0000) GS:ffff88002c240000(0000) knlGS:0000000000000000
> <4>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>CR2: 00007f2e1b4c7000 CR3: 0000000819747000 CR4: 00000000000007e0
> <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>Process condor_procd (pid: 3618, threadinfo ffff880803182000, task ffff8808042c2ae0)
> <4>Stack:
> <4> 00007f2e1b4c7000 ffff8808197cbb80 0000000000000000 ffff8800378e3e18
> <4><d> ffff88055242d000 ffff88055242d000 00000000ffffffed 0000000000000000
> <4><d> ffff8808197cbb80 ffff8808197cbba4 ffff880803183e38 ffffffff810ca6fd
> <4>Call Trace:
> <4> [<ffffffff810ca6fd>] freezer_write+0x8d/0x1a0
> <4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
> <4> [<ffffffff810ca670>] ? freezer_write+0x0/0x1a0
> <4> [<ffffffff810c59df>] cgroup_file_write+0x16f/0x320
> <4> [<ffffffff8114a8da>] ? do_mmap_pgoff+0x33a/0x380
> <4> [<ffffffff811810d8>] vfs_write+0xb8/0x1a0
> <4> [<ffffffff811819d1>] sys_write+0x51/0x90
> <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> <4>Code: 1f 45 85 f6 75 44 4c 89 ee 4c 89 e7 e8 af 9f ff ff 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f
> c9 c3 41 83 ff 01 74 12 41 39 de 74 db <0f> 0b 0f 1f 00 eb fb 66 0f 1f 44 00 00 41 39 de 75 c9 48 8b 45
> <1>RIP  [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0
> <4> RSP <ffff880803183d98>
> 
> 
> 
> Has anyone seen this before? Does anyone know of a solution? Is anyone successfully using cgroups with HTCondor under CentOS 6.4?
> 
> Thanks
> 
> - Jason
