Linux kernel parameter: lowmem_reserve_ratio

Source code based on: Linux 5.4

This article covers the node /proc/sys/vm/lowmem_reserve_ratio.

0. Official description

lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

    When the kernel allocates memory it may involve multiple zones: the allocation first tries the first zone in the zonelist and, on failure, falls back to the next lower zone ("lower" here refers only to the zone's position in the address space; in practice the low-address zones are the scarcer resource). Consider a scenario where an application maps memory in HIGHMEM and pins it with mlock(). If the HIGH zone cannot satisfy the allocation, the kernel falls back to the Normal zone. The problem: requests "demoted" from HIGH to Normal can exhaust the Normal zone, and because of mlock() those pages cannot be reclaimed. The end result is a Normal zone with no memory left. On an architecture like i386, the linear region the kernel can directly address is exactly the Normal zone, so the kernel may stop working properly even while the HIGH zone still holds plenty of reclaimable memory.
    To handle this, when the Normal zone sees an allocation request coming from the HIGH class, it can use lowmem_reserve to declare: you may use my memory, but a reserve of pages (the lowmem_reserve entry indexed by the requesting zone class) must be kept for my own use.
    Likewise, when allocation from Normal fails, the allocator tries the DMA zone in the zonelist, and the DMA zone's lowmem_reserve entries limit requests coming from the HIGHMEM and Normal classes.

    For example, a common single-node machine today has three zones: DMA, DMA32 and NORMAL. DMA and DMA32 are the low zones and are small; on a 96 GB machine the two together hold only about 1 GB. NORMAL is relatively the high zone (there is usually no HIGH zone nowadays) and is large (>90 GB). Low memory has special uses, e.g. some DMA transfers can only be served from the DMA zone. The goal is therefore to use high memory whenever possible instead of low memory, while preventing allocations that overflow from the high zone from grabbing the scarce low memory.
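The protection is enforced at watermark-check time: when an allocation whose highest usable zone has index classzone_idx falls back to a lower zone, that zone's lowmem_reserve[classzone_idx] is added on top of the watermark before the free-page check. A minimal userspace sketch of this check, modeled loosely on the kernel's __zone_watermark_ok() (the struct and all numbers here are hypothetical, not kernel code):

```c
#include <stdbool.h>

#define MAX_NR_ZONES 3	/* hypothetical node: DMA32, NORMAL, MOVABLE */

struct zone_sim {
	unsigned long free_pages;
	unsigned long wmark_min;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/*
 * Modeled loosely on __zone_watermark_ok(): an allocation whose
 * highest usable zone has index classzone_idx may take 2^order pages
 * from this zone only if the remaining free pages would still exceed
 * wmark_min plus the reserve held against that zone class.
 */
static bool zone_watermark_ok_sim(const struct zone_sim *z,
				  unsigned int order, int classzone_idx)
{
	unsigned long request = 1UL << order;

	if (z->free_pages < request)
		return false;
	return z->free_pages - request >
	       z->wmark_min + z->lowmem_reserve[classzone_idx];
}
```

For a Normal zone with free_pages = 700, wmark_min = 100 and lowmem_reserve = {0, 0, 800}, an order-0 request targeting Normal itself passes, while the same request falling back from the Movable class fails, because 800 extra pages are held back for Normal's own use.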

1. Source code analysis

1.1 Configuring each zone's protection at initialization

This part is shared with the article Linux kernel parameter: min_free_kbytes.

During initialization the kernel calls init_per_zone_wmark_min() to set up each zone's memory watermarks; it also sets each zone's lowmem_reserve values. This article focuses on how lowmem_reserve is configured.

/mm/page_alloc.c

int __meminit init_per_zone_wmark_min(void)
{
	unsigned long lowmem_kbytes;
	...
	setup_per_zone_wmarks();
	...
	setup_per_zone_lowmem_reserve();
	...
	return 0;
}

For details on watermark configuration, see: Linux kernel parameter: min_free_kbytes

The function of interest here is setup_per_zone_lowmem_reserve():

static void setup_per_zone_lowmem_reserve(void)
{
	struct pglist_data *pgdat;
	enum zone_type j, idx;

	for_each_online_pgdat(pgdat) {
		for (j = 0; j < MAX_NR_ZONES; j++) {
			struct zone *zone = pgdat->node_zones + j;
			unsigned long managed_pages = zone_managed_pages(zone);

			zone->lowmem_reserve[j] = 0;

			idx = j;
			while (idx) {
				struct zone *lower_zone;

				idx--;
				lower_zone = pgdat->node_zones + idx;

				if (sysctl_lowmem_reserve_ratio[idx] < 1) {
					sysctl_lowmem_reserve_ratio[idx] = 0;
					lower_zone->lowmem_reserve[j] = 0;
				} else {
					lower_zone->lowmem_reserve[j] =
						managed_pages / sysctl_lowmem_reserve_ratio[idx];
				}
				managed_pages += zone_managed_pages(lower_zone);
			}
		}
	}

	/* update totalreserve_pages */
	calculate_totalreserve_pages();
}

The logic above is analyzed below. Users can also change the node /proc/sys/vm/lowmem_reserve_ratio dynamically via sysctl:

/kernel/sysctl.c

	{
		.procname	= "lowmem_reserve_ratio",
		.data		= &sysctl_lowmem_reserve_ratio,
		.maxlen		= sizeof(sysctl_lowmem_reserve_ratio),
		.mode		= 0644,
		.proc_handler	= lowmem_reserve_ratio_sysctl_handler,
	},

Whether during initialization or later via sysctl, setup_per_zone_lowmem_reserve() is eventually called to configure each zone's protection:

/mm/page_alloc.c

int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
	void __user *buffer, size_t *length, loff_t *ppos)
{
	proc_dointvec_minmax(table, write, buffer, length, ppos);

	setup_per_zone_lowmem_reserve();
	return 0;
}

1.2 setup_per_zone_lowmem_reserve

The function is not long, but the logic is a bit involved, so let's analyze it step by step.

step 1. Initial value of sysctl_lowmem_reserve_ratio

int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
#ifdef CONFIG_ZONE_DMA
	[ZONE_DMA] = 256,
#endif
#ifdef CONFIG_ZONE_DMA32
	[ZONE_DMA32] = 256,
#endif
	[ZONE_NORMAL] = 32,
#ifdef CONFIG_HIGHMEM
	[ZONE_HIGHMEM] = 0,
#endif
	[ZONE_MOVABLE] = 0,
};

The CONFIG options determine how many zones the system has. For example, if the system only has ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE, then MAX_NR_ZONES is 3. See mmzone.h for details:

/include/linux/mmzone.h

enum zone_type {
	...
	__MAX_NR_ZONES
};

step 2. Calculation rules

The logic may look complex; essentially it uses the managed pages of the higher zones to compute each lower zone's lowmem_reserve.

For zone[i], the loop computes the lowmem_reserve entries of every zone below i (lower in memory address).

When i is 2:

first, zone[2]->lowmem_reserve[2] = 0;

then, in the while loop:

zone[1]->lowmem_reserve[2] = zone[2]->managed / reserve_ratio[1];

zone[0]->lowmem_reserve[2] = (zone[2]->managed + zone[1]->managed) / reserve_ratio[0];

When i is 1:

first, zone[1]->lowmem_reserve[1] = 0;

then, in the while loop:

zone[0]->lowmem_reserve[1] = zone[1]->managed / reserve_ratio[0];

When i is 0:

first, zone[0]->lowmem_reserve[0] = 0;

and the while loop is not entered.

In summary, the rule is roughly:

When (i < j):
    zone[i]->protection[j] = (total sum of managed_pages from zone[i+1] to zone[j] on the node) / lowmem_reserve_ratio[i];
When (i = j):
    zone[i]->protection[j] = 0; (a zone need not be protected from its own class)
When (i > j):
    zone[i]->protection[j] is never used (it simply stays 0).

This can be verified from the results: lowmem_reserve ultimately shows up as each zone's "protection" line in /proc/zoneinfo:

Node 0, zone    DMA32
  pages free     44525
        min      1044
        low      5924
        high     6267
        spanned  524288
        present  524288
        managed  428797
        protection: (0, 946, 946)
Node 0, zone   Normal
  pages free     7378
        min      4685
        low      7440
        high     7633
        spanned  262144
        present  262144
        managed  242196
        protection: (0, 0, 0)
Node 0, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0)

Suppose Node 0 on this platform has the three zones ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE. Then the size MAX_NR_ZONES of the sysctl_lowmem_reserve_ratio array is 3, and the lowmem_reserve array in each zone also has 3 entries.

For ZONE_DMA32, i is 0:

  • when j is 0: i = j, so lowmem_reserve[0] = 0;
  • when j is 1: i < j, so lowmem_reserve[1] = Normal managed / ratio[0] = 242196 / 256 = 946;
  • when j is 2: i < j, so lowmem_reserve[2] = (Normal managed + Movable managed) / ratio[0] = (242196 + 0) / 256 = 946.

For ZONE_NORMAL, i is 1:

  • when j is 0: i > j, so lowmem_reserve[0] = 0;
  • when j is 1: i = j, so lowmem_reserve[1] = 0;
  • when j is 2: i < j, so lowmem_reserve[2] = Movable managed / ratio[1] = 0 / 32 = 0.
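The walk-through above can be cross-checked with a small userspace re-implementation of the setup_per_zone_lowmem_reserve() loop (the function name and zone layout here are illustrative, not kernel code); feeding it the managed values and default ratios from the /proc/zoneinfo sample reproduces the protection lines:

```c
#define NR_ZONES 3	/* hypothetical node: DMA32, NORMAL, MOVABLE */

/*
 * Userspace re-implementation of the nested loops in
 * setup_per_zone_lowmem_reserve(): for each zone class j, walk
 * downward through the lower zones and charge each a fraction
 * (1/ratio) of the managed pages sitting above it.
 */
static void compute_lowmem_reserve(const unsigned long managed[NR_ZONES],
				   const int ratio[NR_ZONES],
				   unsigned long reserve[NR_ZONES][NR_ZONES])
{
	for (int i = 0; i < NR_ZONES; i++)
		for (int j = 0; j < NR_ZONES; j++)
			reserve[i][j] = 0;	/* entries with i >= j stay 0 */

	for (int j = 0; j < NR_ZONES; j++) {
		unsigned long managed_pages = managed[j];

		for (int idx = j - 1; idx >= 0; idx--) {
			/* a ratio < 1 disables protection in this lower zone */
			if (ratio[idx] >= 1)
				reserve[idx][j] = managed_pages / ratio[idx];
			managed_pages += managed[idx];
		}
	}
}
```

With managed = {428797, 242196, 0} and ratio = {256, 32, 0}, the result is (0, 946, 946) for DMA32 and (0, 0, 0) for Normal and Movable, matching the zoneinfo output above.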

1.3 calculate_totalreserve_pages

This updates totalreserve_pages, an estimate of the memory the system needs for normal operation. The value is used during overcommit accounting to decide whether a given allocation should be allowed.

static void calculate_totalreserve_pages(void)
{
	struct pglist_data *pgdat;
	unsigned long reserve_pages = 0;
	enum zone_type i, j;

	/* iterate over every node */
	for_each_online_pgdat(pgdat) {
		pgdat->totalreserve_pages = 0;

		/* iterate over every zone */
		for (i = 0; i < MAX_NR_ZONES; i++) {
			struct zone *zone = pgdat->node_zones + i;
			long max = 0;
			unsigned long managed_pages = zone_managed_pages(zone);

			/* Find valid and maximum lowmem_reserve in the zone */
			/* i.e. the largest amount this zone reserves for higher zone classes */
			for (j = i; j < MAX_NR_ZONES; j++) {
				if (zone->lowmem_reserve[j] > max)
					max = zone->lowmem_reserve[j];
			}

			/* we treat the high watermark as reserved pages. */
			/* each zone's high watermark plus its reserve form the running reserve */
			max += high_wmark_pages(zone);

			/* the reserve is capped at the zone's managed pages */
			if (max > managed_pages)
				max = managed_pages;

			pgdat->totalreserve_pages += max;

			/* the final total is the sum of the reserves of all zones */
			reserve_pages += max;
		}
	}

	totalreserve_pages = reserve_pages;
}
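The accounting can be sketched in userspace for a single node (the struct and helper below are illustrative, not kernel code), using the zoneinfo sample above:

```c
#define NR_ZONES 3	/* hypothetical node: DMA32, NORMAL, MOVABLE */

struct zone_sim {
	unsigned long managed;
	unsigned long wmark_high;
	unsigned long lowmem_reserve[NR_ZONES];
};

/*
 * Same logic as calculate_totalreserve_pages() for one node: per
 * zone, take the largest lowmem_reserve entry, add the high
 * watermark, cap the sum at the zone's managed pages, and add it
 * all up across zones.
 */
static unsigned long total_reserve_pages(const struct zone_sim z[NR_ZONES])
{
	unsigned long total = 0;

	for (int i = 0; i < NR_ZONES; i++) {
		unsigned long max = 0;

		for (int j = i; j < NR_ZONES; j++)
			if (z[i].lowmem_reserve[j] > max)
				max = z[i].lowmem_reserve[j];

		max += z[i].wmark_high;
		if (max > z[i].managed)
			max = z[i].managed;
		total += max;
	}
	return total;
}
```

For the sample node (DMA32: high 6267, max reserve 946; Normal: high 7633, reserve 0; Movable: all 0), this gives 7213 + 7633 + 0 = 14846 pages.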

2. Summary

The node /proc/sys/vm/lowmem_reserve_ratio is an array: each entry controls how much memory a zone reserves against allocations belonging to higher zone classes. Since the reserve is computed as managed pages divided by the ratio, decreasing a zone's lowmem_reserve_ratio entry makes that zone reserve a bit more memory for itself.

References:

Linux 内核参数:min_free_kbytes

