Linux kernel parameters: overcommit
Source code reference: Linux 5.4
Nodes covered:
/proc/sys/vm/overcommit_memory
/proc/sys/vm/overcommit_kbytes
/proc/sys/vm/overcommit_ratio
/proc/sys/vm/admin_reserve_kbytes
/proc/sys/vm/user_reserve_kbytes
1. overcommit
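Memory overcommit means that the operating system promises processes more memory than is actually available. A conservative operating system forbids overcommit: it hands out only what it has, and further requests fail once that is gone. This wastes memory, because a process usually uses less than it requests. Suppose a process malloc()s 200MB but only ever touches 100MB. Under UNIX/Linux, physical pages are allocated at the moment memory is first used, not at the moment it is requested, so the untouched 100MB is never backed by physical pages. Without overcommit, that 100MB would still be counted as taken and would sit idle.
One concept to keep in mind:
Overcommit concerns memory requests, not memory allocation. Allocation happens only when the memory is actually used.
Linux allows memory overcommit: a request can succeed without physical memory being set aside for it, but when the pages are later allocated, a shortage of memory can lead to an OOM condition.
2. Overview of the nodes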
2.1 overcommit_memory
overcommit_memory
=================

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/vm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.
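This value controls memory overcommit and takes one of three values:
- 0 (GUESS): the default. The kernel makes a heuristic estimate and rejects only obviously excessive requests. The exact criterion is in the code in Section 3.
- 1 (ALWAYS): overcommit is allowed; no request is refused on accounting grounds.
- 2 (NEVER): overcommit is forbidden. The kernel computes a commit limit and rejects requests that would exceed it. The user_reserve_kbytes node only takes effect under this policy.
The defaults are defined in mm/util.c, together with the defaults of the other nodes discussed below: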
int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
int sysctl_overcommit_ratio __read_mostly = 50;
unsigned long sysctl_overcommit_kbytes __read_mostly;
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
See Section 3 for how these values are used.
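To make the three policy values concrete, here is a minimal userspace sketch that reads an integer node and maps it to a policy name. The function names overcommit_policy_name() and read_int_sysctl() are hypothetical helpers invented for this illustration; they are not kernel or libc APIs.

```c
#include <stdio.h>

/* Policy values as named in the article (0 = GUESS, 1 = ALWAYS, 2 = NEVER). */
enum overcommit_policy {
    POLICY_GUESS  = 0,
    POLICY_ALWAYS = 1,
    POLICY_NEVER  = 2
};

/* Map a policy value to a readable name; unknown values yield "unknown". */
const char *overcommit_policy_name(int value)
{
    switch (value) {
    case POLICY_GUESS:  return "guess";
    case POLICY_ALWAYS: return "always";
    case POLICY_NEVER:  return "never";
    default:            return "unknown";
    }
}

/*
 * Hypothetical helper: read one integer from a procfs node.
 * Returns 0 and stores the value in *out on success, -1 on any failure
 * (node missing, not readable, or not an integer).
 */
int read_int_sysctl(const char *path, int *out)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%d", out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

On a Linux system, calling read_int_sysctl("/proc/sys/vm/overcommit_memory", &v) and then overcommit_policy_name(v) should report the active policy.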
2.2 overcommit_kbytes
When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).
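- Only effective when overcommit_memory is set to 2. The committed address space may not exceed swap plus this amount of physical RAM.
- The default is 0; see the code in Section 2.1.
- Mutually exclusive with overcommit_ratio: only one of the two is in effect, and setting one resets the other to 0.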
int overcommit_ratio_handler(struct ctl_table *table, int write,
                             void __user *buffer, size_t *lenp,
                             loff_t *ppos)
{
    int ret;

    ret = proc_dointvec(table, write, buffer, lenp, ppos);
    if (ret == 0 && write)
        sysctl_overcommit_kbytes = 0;
    return ret;
}

int overcommit_kbytes_handler(struct ctl_table *table, int write,
                              void __user *buffer, size_t *lenp,
                              loff_t *ppos)
{
    int ret;

    ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
    if (ret == 0 && write)
        sysctl_overcommit_ratio = 0;
    return ret;
}
When either node is written through sysctl, its handler resets the other one to 0.
Both values are only consulted when overcommit_memory is 2 (never); see Section 3.
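The mutual exclusion can be modeled outside the kernel in a few lines. The sketch below is only an illustration of the "writing one clears the other" behavior; struct commit_knobs, set_ratio() and set_kbytes() are hypothetical names, not kernel symbols.

```c
/* Hypothetical userspace stand-in for the two sysctl variables. */
struct commit_knobs {
    int           ratio;   /* models overcommit_ratio, in percent */
    unsigned long kbytes;  /* models overcommit_kbytes, in kB */
};

/* Models a successful write to overcommit_ratio: kbytes is cleared. */
void set_ratio(struct commit_knobs *k, int ratio)
{
    k->ratio = ratio;
    k->kbytes = 0;
}

/* Models a successful write to overcommit_kbytes: ratio is cleared. */
void set_kbytes(struct commit_knobs *k, unsigned long kbytes)
{
    k->kbytes = kbytes;
    k->ratio = 0;
}
```

Starting from the defaults (ratio 50, kbytes 0), a call to set_kbytes() leaves ratio at 0, and a later set_ratio() leaves kbytes at 0, which matches what the handlers do on a successful write.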
2.3 overcommit_ratio
overcommit_ratio
================

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM. See above.
- Like overcommit_kbytes, only effective when overcommit_memory is 2. Only one of the two values is in effect at a time; see the code in Section 2.2.
- The default is 50.
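The "swap plus this amount / this percentage of physical RAM" rule from the two documentation excerpts can be written as a small calculation. This is a sketch under stated assumptions, not the kernel's vm_commit_limit() itself: commit_limit_pages() is a hypothetical name, all sizes are passed in explicitly, and huge pages (which, to the best of my knowledge, the kernel excludes from the RAM term) are ignored.

```c
/*
 * Hypothetical model of the commit limit used in "never" mode.
 * ram_pages and swap_pages are in pages; overcommit_kbytes is in kB.
 * Assumes page_shift >= 10 and a non-negative overcommit_ratio.
 */
unsigned long commit_limit_pages(unsigned long ram_pages,
                                 unsigned long swap_pages,
                                 unsigned long overcommit_kbytes,
                                 int overcommit_ratio,
                                 unsigned int page_shift)
{
    unsigned long allowed;

    if (overcommit_kbytes)
        /* A non-zero kbytes value wins: convert kB to pages. */
        allowed = overcommit_kbytes >> (page_shift - 10);
    else
        /* Otherwise take the given percentage of physical RAM. */
        allowed = ram_pages * (unsigned long)overcommit_ratio / 100;

    /* Swap is always added on top. */
    return allowed + swap_pages;
}
```

For example, with 4KB pages (page_shift 12), 4GB of RAM (1048576 pages), 2GB of swap (524288 pages) and the default ratio of 50, the limit works out to 524288 + 524288 = 1048576 pages, or 4GB.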
2.4 user_reserve_kbytes
user_reserve_kbytes
===================

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.
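- This node keeps a user from starting a single memory-hogging process that leaves no room to recover (for example, to kill the hog).
- When overcommit_memory is set to never, the kernel reserves min(total_vm of the process * 3%, user_reserve_kbytes); the code implements the 3% as total_vm / 32.
- If the node is not set explicitly, its value defaults to min(free pages * 3%, 128MB).
- If the node is set to 0, a single process may allocate all free memory minus admin_reserve_kbytes.
The following code is where user_reserve_kbytes is used; Section 3 covers it in more detail: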
mm/util.c
/*
 * Don't let a single process grow so big a user can't recover
 */
if (mm) {
    long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);

    allowed -= min_t(long, mm->total_vm / 32, reserve);
}
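The arithmetic in that snippet is easy to check by hand. The following sketch reproduces it in userspace; user_reserve_pages() is a hypothetical name, and the page shift is passed in as a parameter (assumed to be at least 10) rather than taken from the kernel's PAGE_SHIFT.

```c
/*
 * Hypothetical userspace copy of the user-reserve arithmetic.
 * total_vm_pages is the process size in pages; the node value is in kB.
 * Returns the number of pages subtracted from the commit limit.
 */
long user_reserve_pages(long total_vm_pages,
                        unsigned long user_reserve_kbytes,
                        unsigned int page_shift)
{
    /* Convert the node value from kB to pages. */
    long reserve = (long)(user_reserve_kbytes >> (page_shift - 10));
    /* total_vm / 32 is the code's approximation of 3%. */
    long share = total_vm_pages / 32;

    return share < reserve ? share : reserve;
}
```

With 4KB pages and the 128MB default (131072 kB, i.e. 32768 pages), a 1GB process (262144 pages) reserves 8192 pages (32MB), while an 8GB process is capped at the full 32768 pages.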
2.5 admin_reserve_kbytes
admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.
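- If the node is not set explicitly, its value defaults to min(free pages * 3%, 8MB).
- The node reserves enough memory for an administrator to log in and kill a process. The 8MB default is sized for the default overcommit mode 0 (guess); systems running in never mode should raise it. In the 5.4 code shown in Section 3, the reserve is subtracted only on the never path.
- Minimum useful reserve: sshd/login + bash (or another shell) + top/ps/kill/...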
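The two sizing rules quoted from the documentation (sum of RSS for guess; max VSZ plus sum of RSS for never) can be expressed as a short calculation. This is only a sketch: struct proc_size and both function names are hypothetical, and the caller is assumed to supply RSS and VSZ figures measured on the target system.

```c
#include <stddef.h>

/* Hypothetical record for one recovery tool (sshd, a shell, top, ...). */
struct proc_size {
    unsigned long rss_kb;  /* resident set size in kB */
    unsigned long vsz_kb;  /* virtual size in kB */
};

/* 'guess' mode rule: sum of the resident set sizes. */
unsigned long admin_reserve_guess_kb(const struct proc_size *p, size_t n)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += p[i].rss_kb;
    return sum;
}

/* 'never' mode rule: largest virtual size plus the sum of the RSS values. */
unsigned long admin_reserve_never_kb(const struct proc_size *p, size_t n)
{
    unsigned long max_vsz = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (p[i].vsz_kb > max_vsz)
            max_vsz = p[i].vsz_kb;
    return max_vsz + admin_reserve_guess_kb(p, n);
}
```

The result is in kB and could be written to /proc/sys/vm/admin_reserve_kbytes after rounding up to leave some headroom.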
3. How the overcommit check is triggered
The core logic lives in __vm_enough_memory():
mm/util.c
int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
{
    long allowed;
    ...
    /*
     * Sometimes we want to use more memory than we have
     */
    if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
        return 0;

    if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
        if (pages > totalram_pages() + total_swap_pages)
            goto error;
        return 0;
    }

    allowed = vm_commit_limit();
    /*
     * Reserve some for root
     */
    if (!cap_sys_admin)
        allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);

    /*
     * Don't let a single process grow so big a user can't recover
     */
    if (mm) {
        long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);

        allowed -= min_t(long, mm->total_vm / 32, reserve);
    }

    if (percpu_counter_read_positive(&vm_committed_as) < allowed)
        return 0;
error:
    vm_unacct_memory(pages);

    return -ENOMEM;
}
This function is the heart of overcommit handling. For example, mmap checks memory availability through it (see mm/mmap.c):
if (accountable_mapping(file, vm_flags)) {
    charged = len >> PAGE_SHIFT;
    if (security_vm_enough_memory_mm(mm, charged))
        return -ENOMEM;
    vm_flags |= VM_ACCOUNT;
}
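fork performs the same overcommit check; see fork.c, which is not covered further here.
Both paths end up in __vm_enough_memory(). Looking back at that function, it breaks down as follows:
- overcommit_memory is 1 (always): pass immediately; overcommit is allowed.
- overcommit_memory is 0 (guess, the default): fail only if the requested pages exceed total RAM plus total swap.
- Otherwise, take the overcommit_memory 2 (never) path:
- call vm_commit_limit() to get the commit limit and store it in allowed;
- subtract the admin reserve from allowed (unless the caller has cap_sys_admin); this is where admin_reserve_kbytes takes effect;
- subtract the user reserve from allowed; this is where user_reserve_kbytes takes effect;
- compare vm_committed_as against allowed: if it is below allowed, the request succeeds; otherwise it fails with -ENOMEM.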
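The steps above can be sketched as a userspace model that takes every input explicitly. This is not the kernel function: struct vm_model and model_enough_memory() are hypothetical names, the commit limit is passed in instead of being computed, and the user reserve is always applied (the kernel skips it when mm is NULL). It also assumes the committed counter does not yet include the new request and therefore adds pages before comparing; the part of the kernel function elided by "..." in the article is assumed to do that accounting up front.

```c
#include <errno.h>

/* Policy values as named in the article. */
enum { MODE_GUESS = 0, MODE_ALWAYS = 1, MODE_NEVER = 2 };

/* Hypothetical snapshot of the inputs to the check; all sizes in pages. */
struct vm_model {
    int  mode;           /* models overcommit_memory */
    long total_ram;
    long total_swap;
    long commit_limit;   /* stand-in for the vm_commit_limit() result */
    long committed;      /* assumed NOT to include the new request */
    long admin_reserve;  /* admin_reserve_kbytes converted to pages */
    long user_reserve;   /* user_reserve_kbytes converted to pages */
};

/* Returns 0 if the request would be accepted, -ENOMEM otherwise. */
int model_enough_memory(const struct vm_model *vm, long pages,
                        long total_vm, int cap_sys_admin)
{
    long allowed, share;

    /* always: never refuse. */
    if (vm->mode == MODE_ALWAYS)
        return 0;

    /* guess: refuse only requests larger than RAM plus swap. */
    if (vm->mode == MODE_GUESS)
        return pages > vm->total_ram + vm->total_swap ? -ENOMEM : 0;

    /* never: start from the commit limit and subtract the reserves. */
    allowed = vm->commit_limit;
    if (!cap_sys_admin)
        allowed -= vm->admin_reserve;

    share = total_vm / 32;
    allowed -= share < vm->user_reserve ? share : vm->user_reserve;

    return vm->committed + pages < allowed ? 0 : -ENOMEM;
}
```

As an illustration with made-up numbers: with a commit limit of 524288 pages, 500000 pages already committed, an admin reserve of 2048 pages and a 262144-page process (user share 8192 pages), a 15000-page request is refused for an ordinary user (515000 is not below 514048) but accepted for a caller with cap_sys_admin (515000 is below 516096).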
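As a supplement, /proc/meminfo exposes two related values, CommitLimit and Committed_AS:
- CommitLimit is the value computed by vm_commit_limit(), that is, allowed before the reserves are subtracted.
- Committed_AS is the amount of memory already committed, kept in the variable vm_committed_as.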
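To compare the two values on a running system, one option is to parse the corresponding lines of /proc/meminfo. The helper below is a minimal sketch: parse_meminfo_kb() is a hypothetical name, and it assumes the usual "Key:   value kB" line layout.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one /proc/meminfo line of the form
 * "Key:   12345 kB". Returns 0 and stores the number in *out_kb if the
 * line starts with key followed by ':', and -1 otherwise.
 */
int parse_meminfo_kb(const char *line, const char *key, unsigned long *out_kb)
{
    size_t klen = strlen(key);

    /* The key must match exactly and be followed by a colon. */
    if (strncmp(line, key, klen) != 0 || line[klen] != ':')
        return -1;

    /* %lu skips the padding spaces before the number. */
    return sscanf(line + klen + 1, "%lu", out_kb) == 1 ? 0 : -1;
}
```

A caller could read /proc/meminfo line by line with fgets(), feed each line to this helper with the keys "CommitLimit" and "Committed_AS", and report how much headroom remains between the two.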
