Farseerfc的小窝 - 关于 swap 的一些补充(farseerfc,2020-10-06)<p>上周翻译完 <a class="reference external" href="//farseerfc.me/zhs/in-defence-of-swap.html">【译】替 swap 辩护:常见的误解</a>
之后很多朋友们似乎还有些疑问和误解,于是写篇后续澄清一下。事先声明我不是内核开发者,
这里说的只是我的理解,
<a class="reference external" href="https://www.kernel.org/doc/gorman/html/understand/understand005.html">基于内核文档中关于物理内存的描述</a>
,新的内核代码的具体行为可能和我的理解有所出入,欢迎踊跃讨论。</p>
<div class="panel panel-default">
<div class="panel-heading">
<a class="reference external" href="https://www.youtube.com/watch?v=7aONIVSXiJ8">Introduction to Memory Management in Linux</a></div>
<div class="panel-body">
<div align="left" class="youtube embed-responsive embed-responsive-16by9"><iframe allow="fullscreen" class="embed-responsive-item" frameborder="0" src="https://www.youtube.com/embed/7aONIVSXiJ8"></iframe></div></div>
</div>
<div class="section" id="id3">
<h2>误解1: swap 是虚拟内存,虚拟内存肯定比物理内存慢嘛</h2>
<p>这种误解进一步的结论通常是:「使用虚拟内存肯定会减慢系统运行时性能,如果物理内存足够为什么还要用虚拟的?」
这种误解把虚拟内存和交换区的实现方式,类比成了「虚拟磁盘」或者「虚拟机」那样的东西,
也隐含着「先用物理内存,用完了之后再用虚拟内存」、也即下面「误解3」的理解。</p>
<p>首先,交换区(swap) <strong>不是</strong> 虚拟内存。操作系统中说「物理内存」还是「虚拟内存」的时候,指的是程序代码
寻址时使用的内存地址方式:使用物理地址空间时是在访问物理内存,使用虚拟地址空间时是在访问虚拟内存。
现代操作系统在大部分情况下都在使用虚拟地址空间寻址, <strong>包括</strong> 在执行内核代码的时候。</p>
<p>并且,交换区 <strong>不是</strong> 实现虚拟内存的方式。操作系统使用内存管理单元(MMU,Memory Management
Unit)做虚拟内存地址到物理内存地址的地址翻译,现代架构下 MMU 通常是 CPU
的一部分,配有它专用的一小块存储区叫做地址转换旁路缓存(TLB,Translation Lookaside Buffer),
只有在 TLB 中没有相关地址翻译信息的时候 MMU 才会以缺页中断的形式调用操作系统内核帮忙。
除了 TLB 信息不足的时候,大部分情况下使用虚拟内存都是硬件直接实现的地址翻译,没有软件模拟开销。
实现虚拟内存不需要用到交换区,交换区只是操作系统实现虚拟内存后能提供的一个附加功能,
即便没有交换区,操作系统大部分时候也在用虚拟内存,包括在大部分内核代码中。</p>
</div>
<div class="section" id="id4">
<h2>误解2: 但是没有交换区的话,虚拟内存地址都有物理内存对应嘛</h2>
<p>很多朋友也理解上述操作系统实现虚拟内存的方式,但是仍然会有疑问:「我知道虚拟内存和交换区的区别,
但是没有交换区的话,虚拟内存地址都有物理内存对应,不用交换区的话就不会遇到读虚拟内存需要读写磁盘
导致的卡顿了嘛」。</p>
<p>这种理解也是错的,禁用交换区的时候,也会有一部分分配给程序的虚拟内存不对应物理内存,
比如使用 <code class="code">
mmap</code>
调用实现内存映射文件的时候。实际上即便是使用 <code class="code">
read/write</code>
读写文件, Linux 内核中(可能现代操作系统内核都)在底下是用和 <code class="code">
mmap</code>
相同的机制建立文件
到虚拟地址空间的地址映射,然后实际读写到虚拟地址时靠缺页中断把文件内容载入页面缓存(page cache
)。内核加载可执行程序和动态链接库的方式也是通过内存映射文件。甚至可以进一步说,
用户空间的虚拟内存地址范围内,除了匿名页之外,其它虚拟地址都是文件后备(backed by file
),而匿名页通过交换区作为文件后备。上篇文章中提到的别的类型的内存,比如共享内存页面(shm
)是被一个内存中的虚拟文件系统后备的,这一点有些套娃先暂且不提。于是事实是无论有没有交换区,
缺页的时候总会有磁盘读写从慢速存储加载到物理内存,这进一步引出上篇文章中对于交换区和页面缓存这两者的讨论。</p>
</div>
<div class="section" id="id5">
<h2>误解3: 不是内存快用完的时候才会交换的么?</h2>
<p>简短的答案可以说「是」,但是内核理解的「内存快用完」和你理解的很可能不同;
也可以说「不是」,因为就算按照内核的定义,「内存快用完」时内核的行为是去回收内存,
而回收内存时具体做什么,由一套复杂的启发式算法决定。实际上真到内存快满的时候根本来不及做
swap ,内核会优先尝试丢弃 page cache 甚至丢弃 vfs cache (dentry cache / inode cache)
这类不需要磁盘I/O就能更快获得可用内存的手段。</p>
<p>深究这些内核机制之前,我在思考为什么很多朋友会问出这样的问题。可能大部分这么问的人,学过编程,
稍微学过基本的操作系统原理,在脑海里对内核分配页面留着这样一种印象(C伪代码):</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="c1">//////////////////// userspace space ////////////////</span></span>
<span class="code-line"><span class="kt">void</span><span class="o">*</span> <span class="nf">malloc</span><span class="p">(</span><span class="kt">int</span> <span class="n">size</span><span class="p">){</span></span>
<span class="code-line"> <span class="kt">void</span><span class="o">*</span> <span class="n">pages</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(...);</span> <span class="c1">// 从内核分配内存页</span></span>
<span class="code-line"> <span class="k">return</span> <span class="n">alloc_from_page</span><span class="p">(</span><span class="n">pages</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span> <span class="c1">// 从拿到的内存页细分</span></span>
<span class="code-line"><span class="p">}</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="c1">//////////////////// kernel space //////////////////</span></span>
<span class="code-line"><span class="kt">void</span> <span class="o">*</span> <span class="n">SYSCALL</span> <span class="nf">do_mmap</span><span class="p">(...){</span></span>
<span class="code-line"> <span class="c1">//...</span></span>
<span class="code-line"> <span class="k">return</span> <span class="n">kmalloc_pages</span><span class="p">(</span><span class="n">nr_page</span><span class="p">);</span></span>
<span class="code-line"><span class="p">}</span></span>
<span class="code-line"></span>
<span class="code-line"><span class="kt">void</span><span class="o">*</span> <span class="nf">kmalloc_pages</span><span class="p">(</span><span class="kt">int</span> <span class="n">size</span><span class="p">){</span></span>
<span class="code-line"> <span class="k">while</span> <span class="p">(</span> <span class="n">available_mem</span> <span class="o"><</span> <span class="n">size</span> <span class="p">)</span> <span class="p">{</span></span>
<span class="code-line"> <span class="c1">// 可用内存不够了!尝试搞点内存</span></span>
<span class="code-line"> <span class="n">page_frame_info</span><span class="o">*</span> <span class="n">least_accessed</span> <span class="o">=</span> <span class="n">lru_pop_page_frame</span><span class="p">();</span> <span class="c1">// 找出最少访问的页面</span></span>
<span class="code-line"> <span class="k">switch</span> <span class="p">(</span> <span class="n">least_accessed</span> <span class="o">-></span> <span class="n">pf_type</span> <span class="p">){</span></span>
<span class="code-line"> <span class="k">case</span> <span class="nl">PAGE_CACHE</span><span class="p">:</span> <span class="n">drop_page_cache</span><span class="p">(</span><span class="n">least_accessed</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="c1">// 丢弃文件缓存</span></span>
<span class="code-line"> <span class="k">case</span> <span class="nl">SWAP</span><span class="p">:</span> <span class="n">swap_out</span><span class="p">(</span><span class="n">least_accessed</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="c1">// <- 写磁盘,所以系统卡了!</span></span>
<span class="code-line"> <span class="c1">// ... 别的方式回收 least_accessed</span></span>
<span class="code-line"> <span class="p">}</span></span>
<span class="code-line"> <span class="n">append_free_page</span><span class="p">(</span><span class="n">free_page_list</span><span class="p">,</span> <span class="n">least_accessed</span><span class="p">);</span> <span class="c1">// 回收到的页面加入可用列表</span></span>
<span class="code-line"> <span class="n">available_mem</span> <span class="o">+=</span> <span class="n">least_accessed</span> <span class="o">-></span> <span class="n">size</span><span class="p">;</span></span>
<span class="code-line"> <span class="p">}</span></span>
<span class="code-line"> <span class="c1">// 搞到内存了!返回给程序</span></span>
<span class="code-line"> <span class="n">available_mem</span> <span class="o">-=</span> <span class="n">size</span><span class="p">;</span></span>
<span class="code-line"> <span class="kt">void</span> <span class="o">*</span> <span class="n">phy_addr</span> <span class="o">=</span> <span class="n">take_from_free_list</span><span class="p">(</span><span class="n">free_page_list</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span></span>
<span class="code-line"> <span class="k">return</span> <span class="n">assign_virtual_addr</span><span class="p">(</span><span class="n">phy_addr</span><span class="p">);</span></span>
<span class="code-line"><span class="p">}</span></span>
</pre></div>
<p>这种逻辑隐含三层 <strong>错误的</strong> 假设:</p>
<ol class="arabic simple">
<li>分配物理内存是发生在从内核分配内存的时候的,比如 <code class="code">
malloc/mmap</code>
的时候。</li>
<li>内存回收是发生在进程请求内存分配的上下文里的,换句话说进程在等内核的内存回收返回内存,
不回收到内存,进程就得不到内存。</li>
<li>交换出内存到 swap 是发生在内存回收的时候的,会阻塞内核的内存回收,进而阻塞程序的内存分配。</li>
</ol>
<p>这种把内核代码当作「具有特权的库函数调用」的看法,可能很易于理解,
甚至早期可能的确有操作系统的内核是这么实现的,但是很可惜现代操作系统都不是这么做的。
上面三层假设的错误之处在于:</p>
<ol class="arabic simple">
<li>在程序请求内存的时候,比如 <code class="code">
malloc/mmap</code>
的时候,内核只做虚拟地址分配,
记录下某段虚拟地址空间对这个程序是可以合法访问的,但是不实际分配物理内存给程序。
在程序第一次访问到虚拟地址的时候,才会实际分配物理内存。这种叫 <strong>惰性分配(lazy allocation)</strong> 。</li>
<li>在内核感受到内存分配压力之后,早在内核内存用尽之前,内核就会在后台慢慢扫描并回收内存页。
内存回收通常不发生在内存分配的时候,除非在内存非常短缺的情况下,后台内存回收来不及满足当前分配请求,
才会发生 <strong>直接回收(direct reclamation)</strong> 。</li>
<li>同样除了直接回收的情况,大部分正常情况下换出页面是内存管理子系统调用 DMA 在后台慢慢做的,
交换页面出去不会阻塞内核的内存回收,更不会阻塞程序做内存分配(malloc
)和使用内存(实际访问惰性分配的内存页)。</li>
</ol>
<p>也就是说,现代操作系统内核是高度并行化的设计,内存分配方方面面需要消耗计算资源或者 I/O
带宽的场景,都会尽量并行化,最大程度利用好计算机所有组件(CPU/MMU/DMA/IO)的吞吐率,
不到紧要关头需要直接回收的场合,就不会阻塞程序的正常执行流程。</p>
</div>
<div class="section" id="id6">
<h2>惰性分配有什么好处?</h2>
<p>或许会有人问:「我让你分配内存,你给我分配了个虚拟的,到用的时候还要做很多事情才能给我,这不是骗人嘛」,
或者会有人担心惰性分配会对性能造成负面影响。</p>
<p>实际情况是,程序从分配虚拟内存,到真正「用的时候」,这之间有一段时间间隔,可以留给内核做准备。
程序可能一下子分配一大片内存地址,然后再在执行过程中解析数据慢慢往地址范围内写东西。
程序分配虚拟内存的速率可以是「突发」的,比如一个系统调用中分配 1GiB 大小,而实际写入数据的速率会被
CPU 执行速度等因素限制,不会短期内突然写入很多页面。
这个速率差造成的时间间隔内,内核可以完成很多后台工作,比如回收内存,
比如把回收到的别的进程用过的内存页面初始化为全0,这部分后台工作可以和程序的执行过程并行,
从而当程序实际用到内存的时候,需要的准备工作已经做完了,大部分场景下可以直接分配物理内存出来。</p>
<p>如果程序要做实时响应,想避免因为惰性分配造成的性能不稳定,可以使用 <code class="code">
mlock/mlockall</code>
将得到的虚拟内存锁定在物理内存中,锁的过程中内核会做物理内存分配。不过要区分「性能不稳定」和「低性能」,
预先分配内存可以避免实际使用内存时分配物理页面的额外开销,但是会拖慢整体吞吐率,所以要谨慎使用。</p>
<p>很多程序分配了很大一片地址空间,但是实际并不会用完这些地址,直到程序执行结束这些虚拟地址也一直
处于没有对应物理地址的情况。惰性分配可以避免为这些情况浪费物理内存页面,使得很多程序可以无忧无虑地
随意分配内存地址而不用担心性能损失。这种分配方式也叫「超额分配(overcommit)」。飞机票有超售,
VPS 提供商划分虚拟机有超售,操作系统管理内存也同样有这种现象,合理使用超额分配能改善整体系统效率。</p>
<p>内核要高效地做到惰性分配而不影响程序执行效率的前提之一,在于程序真的用到内存的时候,
内核能不做太多操作就立刻分配出来,也就是说内核需要时时刻刻在手上留有一部分空页,
满足程序执行时内存分配的需要。换句话说,内核需要早在物理内存用尽之前,就开始回收内存。</p>
</div>
<div class="section" id="id7">
<h2>那么内核什么时候会开始回收内存?</h2>
<p>首先一些背景知识:物理内存地址空间并不都是平等的,因为一些地址范围可以做
<a class="reference external" href="https://en.wikipedia.org/wiki/Direct_memory_access">DMA</a> 而另一些不能,以及
<a class="reference external" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a>
等硬件环境倾向于让 CPU 访问其所在 NUMA 节点的内存范围。在 32bit
系统上内核的虚拟地址空间还有低端内存和高端内存的区分,它们会倾向于使用不同属性的物理内存,到
64bit 系统上已经没有了这种限制。</p>
<p>硬件限制了内存分配的自由度,于是内核把物理内存空间分成多个 Zone
,每个 Zone 内各自管理可用内存, Zone 内的内存页之间是相互平等的。</p>
<div class="panel panel-default">
<div class="panel-heading">
zone 内水位线</div>
<div class="panel-body">
<img alt="ditaa diagram" class="ditaa img-responsive" src="//farseerfc.me/uml/c995609e.png"/>
</div>
</div>
<p>一个 Zone 内的页面分配情况可以用右图描绘。
除了已用内存页,剩下的就是空闲页(free pages),空闲页范围中有三条水位线(watermark
)评估当前内存压力情况,分别是高位(high)、低位(low)、最小位(min)。</p>
<p>当内存分配使得空闲页水位低于低位线,内核会唤醒 <code class="code">
kswapd</code>
后台线程, <code class="code">
kswapd</code>
负责扫描物理页面的使用情况并挑选一部分页面做回收,直到可用页面数量恢复到水位线高位(high)以上。
如果 <code class="code">
kswapd</code>
回收内存的速度慢于程序执行实际分配内存的速度,
可用空闲页数量可能进一步下降,降至低于最小水位(min)之后,内核会让内存分配进入
<strong>直接回收(direct reclamation)</strong> 模式,在直接回收模式下,程序分配某个物理页的请求(
第一次访问某个已分配虚拟页面的时候)会导致在进程上下文中阻塞式地调用内存回收代码。</p>
<p>除了内核在后台回收内存,进程也可以主动释放内存,比如有程序退出的时候就会释放一大片内存页,
所以可用页面数量可能会升至水位线高位以上。有太多可用页面浪费资源对整体系统运行效率也不是好事,
所以系统会积极缓存文件读写,所有 page cache 都留在内存中,直到可用页面降至低水位以下触发
<code class="code">
kswapd</code>
开始工作。</p>
<p>设置最小水位线(min)的原因在于,内核内部和硬件也会突然请求大量内存,比如网卡收到数据包的时候,
而且这类请求往往不能等待内存回收,于是预留出最小水位线以下的内存专门给内核内部和硬件使用。</p>
<p>设置高低两个控制 <code class="code">
kswapd</code>
开关的水位线是基于控制理论。唤醒 <code class="code">
kswapd</code>
扫描内存页面本身有一定计算开销,于是每次唤醒它干活的话就让它多做一些活( high - low
),避免频繁多次唤醒。</p>
<p>因为有这些水位线,系统中根据程序请求内存的「速率」,整个系统的内存分配在宏观的一段时间内可能处于以下几种状态:</p>
<ol class="arabic simple">
<li><strong>不回收:</strong> 系统中的程序申请内存速度很慢,或者程序主动释放内存的速度很快,
(比如程序执行时间很短,不怎么进行文件读写就马上退出,)此时可用页面数量可能一直处于低水位线以上,
内核不会主动回收内存,所有文件读写都会以页面缓存的形式留在物理内存中。</li>
<li><strong>后台回收:</strong> 系统中的程序在缓慢申请内存,比如做文件读写,
比如分配并使用匿名页面。系统会时不时地唤醒 <code class="code">
kswapd</code>
在后台做内存回收,
不会干扰到程序的执行效率。</li>
<li><strong>直接回收:</strong> 如果程序申请内存的速度快于 <code class="code">
kswapd</code>
后台回收内存的速度,
空闲内存最终会跌破最小水位线,随后的内存申请会进入直接回收的代码路径,从而极大限制内存分配速度。
在直接分配和后台回收的同时作用下,空闲内存可能会时不时回到最小水位线以上,
但是如果程序继续申请内存,空闲内存量就会在最小水位线附近上下徘徊。</li>
<li><strong>杀进程回收:</strong> 甚至直接分配和后台回收的同时作用也不足以拖慢程序分配内存的速度的时候,
最终空闲内存会完全用完,此时触发 OOM 杀手干活杀进程。</li>
</ol>
<p>系统状态处于 <strong>1. 不回收</strong> 的时候,表明物理内存相对于当前负载绰绰有余,比如系统刚刚启动的时候。
理想上应该让系统长期处于 <strong>2. 后台回收</strong> 的状态,此时最大化利用缓存的效率而又不会因为内存回收
减缓程序执行速度。如果系统引导后长期处于 <strong>1. 不回收</strong> 的状态下,那么说明没有充分利用空闲内存做
文件缓存,有些 unix 服务比如 <a class="reference external" href="https://wiki.archlinux.org/index.php/preload">preload</a> 可用来提前填充文件缓存。</p>
<p>如果系统频繁进入 <strong>3. 直接回收</strong> 的状态,表明在这种工作负载下系统需要减慢一些内存分配速度,
让 <code class="code">
kswapd</code>
有足够时间回收内存。就如前一篇翻译中 Chris
所述,频繁进入这种状态也不一定代表「内存不足」,可能表示内存分配处于非常高效的利用状态下,
系统充分利用慢速的磁盘带宽,为快速的内存缓存提供足够的可用空间。
<strong>直接回收</strong> 是否对进程负载有负面影响要看具体负载的特性。
此时选择禁用 swap 并不能降低磁盘I/O,反而可能缩短 <strong>2. 后台回收</strong> 状态能持续的时间,
导致更快进入 <strong>4. 杀进程回收</strong> 的极端状态。</p>
<p>当然如果系统长期处于 <strong>直接回收</strong> 的状态的话,则说明内存总量不足,需要考虑增加物理内存,
或者减少系统负载了。如果系统进入 <strong>4. 杀进程回收</strong> 的状态,不光用户空间的进程会受影响,
还可能导致内核态的内存分配受影响,产生网络丢包之类的结果。</p>
</div>
<div class="section" id="id8">
<h2>微调内存管理水位线</h2>
<p>可以看一下运行中的系统中每个 Zone 的水位线在哪儿。比如我手上这个 16GiB 的系统中:</p>
<div class="highlight"><pre><span class="code-line"><span></span>$ cat /proc/zoneinfo</span>
<span class="code-line">Node <span class="m">0</span>, zone DMA</span>
<span class="code-line"> pages free <span class="m">3459</span></span>
<span class="code-line"> min <span class="m">16</span></span>
<span class="code-line"> low <span class="m">20</span></span>
<span class="code-line"> high <span class="m">24</span></span>
<span class="code-line"> spanned <span class="m">4095</span></span>
<span class="code-line"> present <span class="m">3997</span></span>
<span class="code-line"> managed <span class="m">3975</span></span>
<span class="code-line">Node <span class="m">0</span>, zone DMA32</span>
<span class="code-line"> pages free <span class="m">225265</span></span>
<span class="code-line"> min <span class="m">3140</span></span>
<span class="code-line"> low <span class="m">3925</span></span>
<span class="code-line"> high <span class="m">4710</span></span>
<span class="code-line"> spanned <span class="m">1044480</span></span>
<span class="code-line"> present <span class="m">780044</span></span>
<span class="code-line"> managed <span class="m">763629</span></span>
<span class="code-line">Node <span class="m">0</span>, zone Normal</span>
<span class="code-line"> pages free <span class="m">300413</span></span>
<span class="code-line"> min <span class="m">13739</span></span>
<span class="code-line"> low <span class="m">17173</span></span>
<span class="code-line"> high <span class="m">20607</span></span>
<span class="code-line"> spanned <span class="m">3407872</span></span>
<span class="code-line"> present <span class="m">3407872</span></span>
<span class="code-line"> managed <span class="m">3328410</span></span>
</pre></div>
<p>因为不是 NUMA 系统,所以只有一个 NUMA node,其中根据 DMA 类型共有 3 个 Zone 分别叫 DMA,
DMA32, Normal 。三个 Zone 的物理地址范围(spanned)加起来大概有
<span class="math">\(4095+1044480+3407872\)</span> 大约 17GiB 的地址空间,而实际可访问的地址范围(present
)加起来有 <span class="math">\(3997+780044+3407872\)</span> 大约 16GiB 的可访问物理内存。</p>
<p>其中空闲页面有 <span class="math">\(3459+225265+300413\)</span> 大约 2GiB ,三条水位线分别在:
<span class="math">\(\texttt{high} = 24+4710+20607 = 98\texttt{MiB}\)</span> ,
<span class="math">\(\texttt{low} = 20+3925+17173 = 82\texttt{MiB}\)</span> ,
<span class="math">\(\texttt{min} = 16+3140+13739 = 66\texttt{MiB}\)</span> 的位置。</p>
<p>具体这些水位线的确定方式基于几个 sysctl 。首先 min 基于 <code class="code">
vm.min_free_kbytes</code>
默认是基于内核低端内存量的平方根算的值,并限制到最大 64MiB 再加点余量,比如我这台机器上
<code class="code">
vm.min_free_kbytes = 67584</code>
,于是 min 水位线在这个位置。
其它两个水位线基于这个计算,在 min 基础上增加总内存量的 <code class="code">
vm.watermark_scale_factor / 10000</code>
比例(在小内存的系统上还有额外考虑),默认 <code class="code">
vm.watermark_scale_factor = 10</code>
在大内存系统上意味着 low 比 min 高 0.1% , high 比 low 高 0.1% 。</p>
<p>可以手动设置这些值,以更早触发内存回收,比如将 <code class="code">
vm.watermark_scale_factor</code>
设为 100:</p>
<div class="highlight"><pre><span class="code-line"><span></span>$ <span class="nb">echo</span> <span class="m">100</span> <span class="p">|</span> sudo tee /proc/sys/vm/watermark_scale_factor</span>
<span class="code-line">$ cat /proc/zoneinfo</span>
<span class="code-line">Node <span class="m">0</span>, zone DMA</span>
<span class="code-line"> pages free <span class="m">3459</span></span>
<span class="code-line"> min <span class="m">16</span></span>
<span class="code-line"> low <span class="m">55</span></span>
<span class="code-line"> high <span class="m">94</span></span>
<span class="code-line"> spanned <span class="m">4095</span></span>
<span class="code-line"> present <span class="m">3997</span></span>
<span class="code-line"> managed <span class="m">3975</span></span>
<span class="code-line"> Node <span class="m">0</span>, zone DMA32</span>
<span class="code-line"> pages free <span class="m">101987</span></span>
<span class="code-line"> min <span class="m">3149</span></span>
<span class="code-line"> low <span class="m">10785</span></span>
<span class="code-line"> high <span class="m">18421</span></span>
<span class="code-line"> spanned <span class="m">1044480</span></span>
<span class="code-line"> present <span class="m">780044</span></span>
<span class="code-line"> managed <span class="m">763629</span></span>
<span class="code-line"> Node <span class="m">0</span>, zone Normal</span>
<span class="code-line"> pages free <span class="m">61987</span></span>
<span class="code-line"> min <span class="m">13729</span></span>
<span class="code-line"> low <span class="m">47013</span></span>
<span class="code-line"> high <span class="m">80297</span></span>
<span class="code-line"> spanned <span class="m">3407872</span></span>
<span class="code-line"> present <span class="m">3407872</span></span>
<span class="code-line"> managed <span class="m">3328410</span></span>
</pre></div>
<p>得到的三条水位线分别在 <span class="math">\(\texttt{min} = 16+3149+13729 = 66\texttt{MiB}\)</span>
, <span class="math">\(\texttt{low} = 55+10785+47013 = 226\texttt{MiB}\)</span>
, <span class="math">\(\texttt{high} = 94+18421+80297 = 386\texttt{MiB}\)</span> ,
从而 low 和 high 分别比 min 提高 160MiB 也就是内存总量的 1% 左右。</p>
<p>在 swap 放在 HDD 的系统中,因为换页出去的速度较慢,除了上篇文章说的降低
<code class="code">
vm.swappiness</code>
之外,还可以适当提高 <code class="code">
vm.watermark_scale_factor</code>
让内核更早开始回收内存,这虽然会稍微降低缓存命中率,但是另一方面可以在进入直接回收模式之前
有更多时间做后台换页,也将有助于改善系统整体流畅度。</p>
</div>
<div class="section" id="id9">
<h2>只有 0.1% ,这不就是说内存快用完的时候么?</h2>
<p>所以之前的「误解3」我说答案可以说「是」或者「不是」,但是无论回答是或不是,都代表了认为「swap
就是额外的慢速内存」的错误看法。当有人强调「swap 是内存快用完的时候才交换」的时候,
隐含地是在把系统总体的内存分配看作一个静态的划分过程:打个比方,这就像在说,我的系统里存储空间有快速的
128GiB SSD 和慢速的 1TiB HDD ,同样内存有快速的 16GiB RAM 和慢速的 16GiB swap 。
这种静态划分的类比是错误的看待方式,因为系统回收内存进而做页面交换的方式是动态平衡的过程,
需要考虑到「时间」和「速率」而非单纯看「容量」。</p>
<p>假设 swap 所在的存储设备可以支持 5MiB/s 的吞吐率( HDD 上可能更慢, SSD
上可能更快,这里需要关注数量级),相比之下 DDR3 大概有 10GiB/s 的吞吐率,DDR4 大概有 20GiB/s
,无论多快的 SSD 也远达不到这样的吞吐(可能 Intel Optane 这样的
<a class="reference external" href="https://lwn.net/Articles/717953/">DAX</a> 设备会改变这里的状况)。从而把 swap
当作慢速内存的视角来看的话,加权平均的速率是非常悲观的,「 16G 的 DDR3 + 16G 的 swap 会有
<span class="math">\(\frac{16 \times 10 \times 1024 + 16 \times 5}{16+16} = 5 \texttt{GiB/s}\)</span>
的吞吐?所以开 swap 导致系统速度降了一半?」显然不能这样看待。</p>
<p>动态的看待方式是, swap 设备能提供 5MiB/s 的吞吐,这意味着:如果我们能把未来 10
分钟内不会访问到的页面换出到 swap ,那么就相当于有
<span class="math">\(10 \times 60 \texttt{s} \times 5 \texttt{MiB/s} = 3000 \texttt{MiB}\)</span>
的额外内存,用来放那 10 分钟内可能会访问到的页面缓存。
10 分钟只是随口说的一段时间,可以换成 10 秒或者 10 小时,重要的是只要页面交换发生在后台,
不阻塞前台程序的执行,那么 swap 设备提供的额外吞吐率相当于一段时间内提供了更大的物理内存,
总是能提升页面缓存的命中,从而改善系统性能。</p>
<p>当然系统内核不能预知「未来 10 分钟内需要的页面」,只能根据历史上访问内存的情况预估之后可能会访问的情况,
估算不准的情况下,比如最近10分钟内用过的页面缓存在之后10分钟内不再被使用的时候,
为了把最近这10分钟内访问过的页面留在物理内存中,可能会把之后10分钟内要用到的匿名页面换出到了交换设备上。
于是会有下面的情况:</p>
</div>
<div class="section" id="swap-swap">
<h2>但是我开了 swap 之后,一旦复制大文件,系统就变卡,不开 swap 不会这样的</h2>
<p>大概电脑用户都经历过这种现象,不限于 Linux 用户,包括 macOS 和 Windows 上也是。
在文件管理器中复制了几个大文件之后,切换到别的程序系统就极其卡顿,复制已经结束之后的一段时间也会如此。
复制的过程中系统交换区的使用率在上涨,复制结束后下降,显然 swap 在其中有重要因素,并且禁用
swap 或者调低 swappiness 之后就不会这样了。于是网上大量流传着解释这一现象,并进一步建议禁用
swap 或者调低 swappiness 的文章。我相信不少关心系统性能调优的人看过这篇「
<a class="reference external" href="https://rudd-o.com/linux-and-free-software/tales-from-responsivenessland-why-linux-feels-slow-and-how-to-fix-that">Tales from responsivenessland: why Linux feels slow, and how to fix that</a>
」或是它的转载、翻译,用中文搜索的话还能找到更多错误解释 swappiness 目的的文章,比如
<a class="reference external" href="http://blog.itpub.net/29371470/viewspace-1250975">这篇将 swappiness 解释成是控制内存和交换区比例的参数</a> 。</p>
<p>除去那些有技术上谬误的文章,这些网文中描述的现象是有道理的,不单纯是以讹传讹。
桌面环境中内存分配策略的不确定性和服务器环境中很不一样,复制、下载、解压大文件等导致一段时间内
大量占用页面缓存,以至于把操作结束后需要的页面撵出物理内存,无论是交换出去的方式还是以丢弃页面缓存的方式,
都会导致桌面响应性降低。</p>
<p>不过就像前文 Chris 所述,这种现象其实并不能通过禁止 swap 的方式缓解:禁止 swap 或者调整
swappiness 让系统尽量避免 swap 只影响回收匿名页面的策略,不影响系统回收页面的时机,
也不能避免系统丢弃将要使用的页面缓存而导致的卡顿。</p>
<p>以前在 Linux 上也没有什么好方法能避免这种现象。 macOS 转用 APFS 作为默认文件系统之后,
从文件管理器(Finder)复制文件默认用 file clone 快速完成,这操作不实际复制文件数据,
一个隐含优势是不需要读入文件内容,从而不会导致大量页面缓存失效。 Linux 上同样可以用支持
reflink 的文件系统比如 btrfs 或者开了 reflink=1 的 xfs 达到类似的效果。
不过 reflink 也只能拯救复制文件的情况,不能改善解压文件、下载文件、计算文件校验等
一次性处理大文件对内存产生压力的情况。</p>
<p>好在最近几年 Linux 有了 cgroup ,允许更细粒度地调整系统资源分配。进一步现在我们有了 cgroup
v2 ,前面 Chris 的文章也有提到 cgroup v2 的 <code class="code">
memory.low</code>
可以某种程度上建议内存子系统
尽量避免回收某些 cgroup 进程的内存。</p>
<p>于是有了 cgroup 之后,另一种思路是把复制文件等大量使用内存而之后又不需要保留页面缓存的程序单独放入
cgroup 内限制它的内存用量,用一点点复制文件时的性能损失换来整体系统的响应流畅度。</p>
<div class="panel panel-default">
<div class="panel-heading">
关于 cgroup v1 和 v2</div>
<div class="panel-body">
<p>稍微跑题说一下 cgroup v2 相对于 v1 带来的优势。这方面优势在
<a class="reference external" href="https://www.youtube.com/watch?v=ikZ8_mRotT4">Chris Down 另一个关于 cgroup v2 演讲</a>
中有提到。老 cgroup v1 按控制器区分 cgroup 层级,从而内存控制器所限制的东西和 IO
控制器所限制的东西是独立的。在内核角度来看,页面写回(page writeback)和交换(swap)正是
夹在内存控制器和IO控制器管理的边界上,从而用 v1 的 cgroup 难以同时管理。 v2
通过统一控制器层级解决了这方面限制。具体见下面 Chris Down 的演讲。</p>
<div align="left" class="youtube embed-responsive embed-responsive-16by9"><iframe allow="fullscreen" class="embed-responsive-item" frameborder="0" src="https://www.youtube.com/embed/ikZ8_mRotT4"></iframe></div></div>
</div>
</div>
<div class="section" id="cgroup-v2">
<h2>用 cgroup v2 限制进程的内存分配</h2>
<p>实际上有了 cgroup v2 之后,还有更多控制内存分配的方案。 <a class="reference external" href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory">cgroup v2 的内存控制器</a>
可以对某个 cgroup 设置这些阈值:</p>
<ul class="simple">
<li><strong>memory.min</strong> : 最小内存限制。内存用量低于此值后系统不会回收内存。</li>
<li><strong>memory.low</strong> : 低内存水位。内存用量低于此值后系统会尽量避免回收内存。</li>
<li><strong>memory.high</strong> : 高内存水位。内存用量高于此值后系统会积极回收内存,并且会对内存分配节流(throttle)。</li>
<li><strong>memory.max</strong> : 最大内存限制。内存用量高于此值后系统会对内存分配请求返回 ENOMEM,或者在 cgroup 内触发 OOM 。</li>
</ul>
<p>可见这些设定值可以当作 per-cgroup 的内存分配水位线,作用于某一部分进程而非整个系统。
针对交换区使用情况也可设置这些阈值:</p>
<ul class="simple">
<li><strong>memory.swap.high</strong> : 高交换区水位,交换区用量高于此值后会对交换区分配节流。</li>
<li><strong>memory.swap.max</strong> : 最大交换区限制,交换区用量高于此值后不再会发生匿名页交换。</li>
</ul>
<p>到达这些 cgroup 设定阈值的时候,还可以设置内核回调的处理程序,从用户空间做一些程序相关的操作。</p>
<p>Linux 有了 cgroup v2 之后,就可以通过对某些程序设置内存用量限制,避免它们产生的页面请求把别的
程序所需的页面挤出物理内存。使用 systemd 的系统中,首先需要
<a class="reference external" href="https://wiki.archlinux.org/index.php/Cgroups#Switching_to_cgroups_v2">启用 cgroup v2</a>
,在内核引导参数中加上 <code class="code">
systemd.unified_cgroup_hierarchy=1</code>
。然后开启用户权限代理:</p>
<div class="highlight"><pre><span class="code-line"><span></span><span class="c1"># systemctl edit user@1000.service</span></span>
<span class="code-line"><span class="o">[</span>Service<span class="o">]</span></span>
<span class="code-line"><span class="nv">Delegate</span><span class="o">=</span>yes</span>
</pre></div>
<p>然后可以定义用户会话的 slice (slice 是 systemd 术语,用来映射 cgroup ),比如创建一个叫
<code class="code">
limit-mem</code>
的 slice :</p>
<div class="highlight"><pre><span class="code-line"><span></span>$ cat ~/.config/systemd/user/limit-mem.slice</span>
<span class="code-line"><span class="o">[</span>Slice<span class="o">]</span></span>
<span class="code-line"><span class="nv">MemoryHigh</span><span class="o">=</span>3G</span>
<span class="code-line"><span class="nv">MemoryMax</span><span class="o">=</span>4G</span>
<span class="code-line"><span class="nv">MemorySwapMax</span><span class="o">=</span>2G</span>
</pre></div>
<p>然后可以用 systemd-run 限制在某个 slice 中打开一个 shell:</p>
<div class="highlight"><pre><span class="code-line"><span></span>$ systemd-run --user --slice<span class="o">=</span>limit-mem.slice --shell</span>
</pre></div>
<p>或者定义一个 shell alias 用来限制任意命令:</p>
<div class="highlight"><pre><span class="code-line"><span></span>$ <span class="nb">type</span> limit-mem</span>
<span class="code-line">limit-mem is an <span class="nb">alias</span> <span class="k">for</span> /usr/bin/time systemd-run --user --pty --same-dir --wait --collect --slice<span class="o">=</span>limit-mem.slice</span>
<span class="code-line">$ limit-mem cp some-large-file dest/</span>
</pre></div>
<p>实际用法有很多,可以参考 systemd 文档
<a class="reference external" href="http://www.jinbuguo.com/systemd/systemd.resource-control.html">man systemd.resource-control</a>
, <a class="reference external" href="/links.html#xuanwo">xuanwo</a> 也 <a class="reference external" href="https://xuanwo.io/2018/10/30/tips-of-systemd/">有篇博客介绍过 systemd 下资源限制</a>
, <a class="reference external" href="/links.html#lilydjwg">lilydjwg</a> 也 <a class="reference external" href="https://blog.lilydjwg.me/2019/3/2/use-cgroups-to-limit-memory-usage-for-specific-processes.214196.html">写过用 cgroup 限制进程内存的用法</a>
和 <a class="reference external" href="https://blog.lilydjwg.me/2020/5/11/priority-and-nice-value-in-linux.215304.html">用 cgroup 之后对 CPU 调度的影响</a>
。</p>
</div>
<div class="section" id="id12">
<h2>未来展望</h2>
<p>最近新版的 GNOME 和 KDE 已经开始为桌面环境下用户程序的进程创建 systemd scope 了,
可以通过 <code class="code">
systemd-cgls</code>
观察到,每个通过桌面文件(.desktop)开启的用户空间程序
都有一个独立的、名字类似 <code class="code">
app-APPNAME-HASH.scope</code>
的 systemd scope 。
有了这些 scope 之后,事实上用户程序的资源分配某种程度上已经相互独立,
不过默认没有对用户程序施加多少限制。</p>
<p>今后可以展望,桌面环境可以提供用户友好的方式,对这些桌面程序施加公平性的限制,
不光是内存分配的大小限制,在 CPU 和 IO 占用方面也能更公平。
值得一提的是传统的 ext4/xfs/f2fs 之类的文件系统虽然支持
<a class="reference external" href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#writeback">cgroup writeback 节流</a>
,但是因为它们有额外的 journaling 写入,难以单独针对某些 cgroup 限制 IO
写入带宽(对文件系统元数据的写入难以统计到具体某组进程)。
而 btrfs 通过 CoW 避免了 journaling ,
<a class="reference external" href="https://facebookmicrosites.github.io/btrfs/docs/btrfs-facebook.html#io-control-with-cgroup2">在这方面有更好的支持</a>
。相信不远的将来,复制大文件之类的普通操作不再需要手动加以限制,
就能避免单个程序占用太多资源影响别的程序。</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>【译】替 swap 辩护:常见的误解(farseerfc,2020-09-30)<p>这篇翻译自 Chris Down 的博文
<a class="reference external" href="https://chrisdown.name/2018/01/02/in-defence-of-swap.html">In defence of swap: common misconceptions</a>
。 <a class="reference external" href="https://github.com/cdown/chrisdown.name/blob/master/LICENSE">原文的协议</a>
是 <a class="reference external" href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>
,本文翻译同样也使用 <a class="reference external" href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> 。其中加入了一些我自己的理解作为旁注,所有译注都在侧边栏中。</p>
<p>翻译这篇文章是因为经常看到朋友们(包括有经验的程序员和 Linux 管理员)对 swap 和 swappiness
有诸多误解,而这篇文章正好澄清了这些误解,也讲清楚了 Linux 中这两者的本质。值得一提的是本文讨论的
swap 针对 Linux 内核,在别的系统包括 macOS/WinNT 或者 Unix 系统中的交换文件可能有不一样的行为,
需要不同的调优方式。比如在 <a class="reference external" href="https://www.freebsd.org/doc/handbook/bsdinstall-partitioning.html#configtuning-initial">FreeBSD handbook</a>
中明确建议了 swap 分区通常应该是两倍物理内存大小,这一点建议对 FreeBSD 系内核的内存管理可能非常合理,
而不一定适合 Linux 内核,FreeBSD 和 Linux 有不同的内存管理方式尤其是 swap 和 page cache 和
buffer cache 的处理方式有诸多不同。</p>
<p>经常有朋友在系统卡顿之后查看内存使用状况,观察到大量 swap 占用,于是觉得卡顿来源于 swap
。就像文中所述,相关不蕴含因果:产生内存颠簸之后的确会造成大量 swap 占用,也会造成系统卡顿,
但 swap 不是导致卡顿的原因;关掉 swap 或者调低 swappiness 并不能阻止卡顿,只会把 swap
造成的 I/O 转化为重新加载文件缓存造成的 I/O 。</p>
<p>以下是原文翻译:</p>
<hr class="docutils"/>
<p>这篇文章也有 <a class="reference external" href="https://chrisdown.name/ja/2018/01/02/in-defence-of-swap.html">日文</a>
和 <a class="reference external" href="https://softdroid.net/v-zashchitu-svopa-rasprostranennye-zabluzhdeniya">俄文</a>
翻译。</p>
<a aria-controls="a01b6caa" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#a01b6caa" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="a01b6caa">
<p>tl;dr:</p>
<ol class="arabic simple">
<li>Having swap is a reasonably important part of a well functioning system.
Without it, sane memory management becomes harder to achieve.</li>
<li>Swap is not generally about getting emergency memory, it's about making
memory reclamation egalitarian and efficient. In fact, using it as
"emergency memory" is generally actively harmful.</li>
<li>Disabling swap does not prevent disk I/O from becoming a problem
under memory contention, it simply shifts the disk I/O thrashing from
anonymous pages to file pages. Not only may this be less efficient,
as we have a smaller pool of pages to select from for reclaim, but it
may also contribute to getting into this high contention state in
the first place.</li>
<li>The swapper on kernels before 4.0 has a lot of pitfalls,
and has contributed to a lot of people's negative perceptions about
swap due to its overeagerness to swap out pages. On kernels >4.0,
the situation is significantly better.</li>
<li>On SSDs, swapping out anonymous pages and reclaiming file pages are
essentially equivalent in terms of performance/latency.
On older spinning disks, swap reads are slower due to random reads,
so a lower <code class="code">
vm.swappiness</code>
setting makes sense there
(read on for more about <code class="code">
vm.swappiness</code>
).</li>
<li>Disabling swap doesn't prevent pathological behaviour at near-OOM,
although it's true that having swap may prolong it. Whether the
system global OOM killer is invoked with or without swap, or was invoked
sooner or later, the result is the same: you are left with a system in an
unpredictable state. Having no swap doesn't avoid this.</li>
<li>You can achieve better swap behaviour under memory pressure and prevent
thrashing using <code class="code">
memory.low</code>
and friends in cgroup v2.</li>
</ol>
</blockquote>
<p>太长不看:</p>
<ol class="arabic simple">
<li>对维持系统的正常运作而言,swap 是相当重要的组成部分。没有它的话会更难实现合理的内存管理。</li>
<li>swap 的目的通常并不是用作紧急内存,它的目的在于让内存回收能更平等和高效。
事实上把它当作「紧急内存」来用的想法通常是有害的。</li>
<li>禁用 swap 并不能避免内存压力下磁盘I/O造成的性能问题,这么做只是把磁盘I/O颠簸的范围从
匿名页面转移到文件页面。这不仅可能更低效,因为系统可回收页面的选择范围更小了,
而且还可能正是促成系统陷入这种高度内存竞争状态的原因之一。</li>
<li>内核 4.0 版本之前的交换进程(swapper)有一些问题,导致很多人对 swap 有负面印象,
因为它太急于(overeagerness)把页面交换出去。在 4.0 之后的内核上这种情况已经改善了很多。</li>
<li>在 SSD 上,交换出匿名页面的开销和回收文件页面的开销基本上在性能/延迟方面没有区别。
在老式的磁盘上,读取交换文件因为属于随机访问读取所以会更慢,于是设置较低的 <code class="code">
vm.swappiness</code>
可能比较合理(继续读下面关于 <code class="code">
vm.swappiness</code>
的描述)。</li>
<li>禁用 swap 并不能避免接近 OOM 状态时的病态行为,尽管的确,有 swap
时这种状态持续的时间可能会延长。无论系统全局的 OOM 杀手是在有还是没有 swap
的情况下被调用,无论它被调用得更早还是更晚,结果都是一样的:整个系统留在了一种不可预知的状态下。
没有 swap 也不能避免这一点。</li>
<li>可以用 cgroup v2 的 <code class="code">
memory.low</code>
相关机制来改善内存压力下 swap 的行为并且
避免发生颠簸。</li>
</ol>
<hr class="docutils"/>
<a aria-controls="053c68ad" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#053c68ad" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="053c68ad">
As part of my work improving kernel memory management and cgroup v2,
I've been talking to a lot of engineers about attitudes towards memory
management, especially around application behaviour under pressure and
operating system heuristics used under the hood for memory management.</blockquote>
<p>作为我改善内核内存管理和 cgroup v2 工作的一部分,我和很多工程师讨论过对待内存管理的态度,
尤其是内存压力下应用程序的行为,以及操作系统在底层内存管理中使用的启发式决策逻辑。</p>
<a aria-controls="8dedecec" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#8dedecec" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="8dedecec">
A repeated topic in these discussions has been swap.
Swap is a hotly contested and poorly understood topic,
even by those who have been working with Linux for many years.
Many see it as useless or actively harmful: a relic of a time where
memory was scarce, and disks were a necessary evil to provide much-needed
space for paging. This is a statement that I still see being batted
around with relative frequency in recent years, and I've had many
discussions with colleagues, friends, and industry peers to help them
understand why swap is still a useful concept on modern computers with
significantly more physical memory available than in the past.</blockquote>
<p>在这种讨论中经常重复的话题是交换区(swap)。交换区的话题是非常有争议而且很少被理解的话题,甚至包括那些在
Linux 上工作过多年的人也是如此。很多人觉得它没什么用甚至是有害的:它是历史遗迹——在那个内存紧缺的
年代,用磁盘为换页提供急需的空间是一种必要之恶,而 swap 正是从那时遗留下来的。
最近几年我还经常能以一定频度看到这种论调,然后我和很多同事、朋友、业界同行们讨论过很多次,
帮他们理解为什么在现代计算机系统中交换区仍是有用的概念,即便现在的电脑中物理内存已经远多于过去。</p>
<a aria-controls="b1c86306" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#b1c86306" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="b1c86306">
There's also a lot of misunderstanding about the purpose of swap –
many people just see it as a kind of "slow extra memory" for use in emergencies,
but don't understand how it can contribute during normal load to the healthy
operation of an operating system as a whole.</blockquote>
<p>围绕交换区的目的还有很多误解——很多人只把它看作某种应对紧急情况的「慢速额外内存」,
但是没能理解它在正常负载下如何为整个操作系统的健康运作做出贡献。</p>
<a aria-controls="3603c7fa" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#3603c7fa" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="3603c7fa">
Many of us have heard most of the usual tropes about memory:
" <a class="reference external" href="https://www.linuxatemyram.com/">Linux uses too much memory</a> ",
" <a class="reference external" href="https://superuser.com/a/111510">swap should be double your physical memory size</a>
", and the like. While these are either trivial to dispel,
or discussion around them has become more nuanced in recent years,
the myth of "useless" swap is much more grounded in heuristics and
arcana rather than something that can be explained by simple analogy,
and requires somewhat more understanding of memory management to reason about.</blockquote>
<p>我们很多人也听说过描述内存时所用的常见说法: 「 <a class="reference external" href="https://www.linuxatemyram.com/">Linux 用了太多内存</a>
」,「 <a class="reference external" href="https://superuser.com/a/111510">swap 应该设为物理内存的两倍大小</a> 」,或者类似的说法。
这些误解要么很容易破除,要么围绕它们的讨论在最近几年已经变得更加细致;但是关于交换区「无用」
的传言,更多地根植于启发式经验和晦涩的内部机制,不是一两个简单类比就能解释清楚的,并且要探讨它,
先得对内存管理有更多一些的理解。</p>
<a aria-controls="ff09462c" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#ff09462c" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="ff09462c">
This post is mostly aimed at those who administrate Linux systems and
are interested in hearing the counterpoints to running with
undersized/no swap or running with vm.swappiness set to 0.</blockquote>
<p>本文主要面向那些管理 Linux 系统,并且有兴趣了解针对「让系统运行在交换区过小或没有交换区的状态」或者「把
<code class="code">
vm.swappiness</code>
设为 0 」这些做法的反对观点的读者。</p>
<div class="section" id="id6">
<h2><a class="toc-backref" href="#id23">背景</a></h2>
<a aria-controls="283c5b39" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#283c5b39" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="283c5b39">
It's hard to talk about why having swap and swapping out pages are good
things in normal operation without a shared understanding of some of
the basic underlying mechanisms at play in Linux memory management,
so let's make sure we're on the same page.</blockquote>
<p>如果没有基本理解 Linux 内存管理的底层机制是如何运作的,就很难讨论为什么需要交换区以及交换出页面
对正常运行的系统为什么是件好事,所以我们先确保大家有讨论的基础。</p>
<div class="section" id="id7">
<h3><a class="toc-backref" href="#id24">内存的类型</a></h3>
<a aria-controls="b93d3e73" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#b93d3e73" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="b93d3e73">
There are many different types of memory in Linux, and each type has its
own properties. Understanding the nuances of these is key to understanding
why swap is important.</blockquote>
<p>Linux 中内存分为好几种类型,每种都有各自的属性。想理解为什么交换区很重要的关键一点在于理解这些的细微区别。</p>
<a aria-controls="ea5c81cf" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#ea5c81cf" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="ea5c81cf">
For example, there are <strong>pages ("blocks" of memory, typically 4k)</strong>
responsible for holding the code for each process being run on your computer.
There are also pages responsible for caching data and metadata related to
files accessed by those programs in order to speed up future access.
These are part of the <strong>page cache</strong> , and I will refer to them as file memory.</blockquote>
<p>比如说,有种 <strong>页面(「整块」的内存,通常 4K)</strong> 是用来存放电脑里每个程序运行时各自的代码的。
也有页面用来保存这些程序所需要读取的文件数据和元数据的缓存,以便加速随后的文件读写。
这些内存页面构成 <strong>页面缓存(page cache)</strong>,后文中我称它们为文件内存。</p>
<a aria-controls="a8e582f2" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#a8e582f2" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="a8e582f2">
There are also pages which are responsible for the memory allocations
made inside that code, for example, when new memory that has been allocated
with <code class="code">
malloc</code>
is written to, or when using <code class="code">
mmap</code>
's
<code class="code">
MAP_ANONYMOUS</code>
flag. These are "anonymous" pages –
so called because they are not backed by anything –
and I will refer to them as anon memory.</blockquote>
<p>还有一些页面是在代码执行过程中做的内存分配得到的,比如说,当代码调用 <code class="code">
malloc</code>
能分配到新内存区,或者使用 <code class="code">
mmap</code>
的 <code class="code">
MAP_ANONYMOUS</code>
标志分配的内存。
这些是「匿名(anonymous)」页面——之所以这么称呼它们,是因为它们没有任何东西作后备——
后文中我称它们为匿名内存。</p>
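<p>用 Python 的 <code class="code">mmap</code> 模块可以直观地看到匿名内存是怎么来的。下面是一个示意性的小例子(给 <code class="code">mmap</code> 传 -1 作为文件描述符,即对应 C 里带 <code class="code">MAP_ANONYMOUS</code> 标志的 <code class="code">mmap(2)</code>):</p>

```python
import mmap

# 示意:分配一块匿名内存,相当于 C 里 mmap(2) 带 MAP_ANONYMOUS 标志。
# fileno 传 -1 表示没有文件作后备:内容只存在于内存里(有 swap 时才有处可去)。
anon = mmap.mmap(-1, 4 * mmap.PAGESIZE)
anon[:5] = b"hello"
print(anon[:5])      # b'hello'
anon.close()
```

<p>这块内存只要进程还在用,就必须留在物理内存里——除非系统配置了交换区,内核才有地方安置它。</p>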
<a aria-controls="38d36a1d" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#38d36a1d" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="38d36a1d">
There are other types of memory too –
shared memory, slab memory, kernel stack memory, buffers, and the like –
but anonymous memory and file memory are the most well known and
easy to understand ones, so I will use these in my examples,
although they apply equally to these types too.</blockquote>
<p>还有其它类型的内存——共享内存、slab内存、内核栈内存、文件缓冲区(buffers),这种的——
但是匿名内存和文件内存是最知名也最好理解的,所以后面的例子里我会用这两个说明,
虽然后面的说明也同样适用于别的这些内存类型。</p>
</div>
<div class="section" id="id8">
<h3><a class="toc-backref" href="#id25">可回收/不可回收内存</a></h3>
<a aria-controls="896d577c" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#896d577c" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="896d577c">
One of the most fundamental questions when thinking about a particular type
of memory is whether it is able to be reclaimed or not.
"Reclaim" here means that the system can, without losing data,
purge pages of that type from physical memory.</blockquote>
<p>考虑某种内存的类型时,一个非常基础的问题是这种内存是否能被回收。
「回收(Reclaim)」在这里是指系统可以,在不丢失数据的前提下,从物理内存中释放这种内存的页面。</p>
<a aria-controls="c8915ce3" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#c8915ce3" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="c8915ce3">
For some page types, this is typically fairly trivial. For example,
in the case of clean (unmodified) page cache memory,
we're simply caching something that we have on disk for performance,
so we can drop the page without having to do any special operations.</blockquote>
<p>对一些内存类型而言,是否可回收通常可以直接判断。比如对于那些干净(未修改)的页面缓存内存,
我们只是为了性能在用它们缓存一些磁盘上现有的数据,所以我们可以直接扔掉这些页面,
不需要做什么特殊的操作。</p>
<a aria-controls="543b6f9b" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#543b6f9b" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="543b6f9b">
For some page types, this is possible, but not trivial. For example,
in the case of dirty (modified) page cache memory, we can't just drop the page,
because the disk doesn't have our modifications yet.
As such we either need to deny reclamation or first get our changes back to
disk before we can drop this memory.</blockquote>
<p>对有些内存类型而言,回收是可能的,但没那么直接。比如对脏(修改过)的页面缓存内存,
我们不能直接扔掉这些页面,因为磁盘上还没有写入我们所做的修改。这种情况下,我们要么拒绝回收,
要么得先把这些修改写回磁盘,然后才能扔掉这些内存。</p>
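<p>脏页写回的过程可以用一个小实验示意:把文件映射进内存、修改它,再用 <code class="code">mmap.flush</code>(底层就是 <code class="code">msync(2)</code>)把脏页写回磁盘。以下只是一个演示用的草图:</p>

```python
import mmap, os, tempfile

# 示意:文件页被修改后成为「脏页」,必须先写回磁盘才能安全回收;
# mmap.flush 对应的就是把脏页写回文件的 msync(2)。
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * mmap.PAGESIZE)          # 准备一页文件内容
with mmap.mmap(fd, mmap.PAGESIZE) as m:
    m[:5] = b"dirty"                        # 修改页面缓存:这一页变脏
    m.flush()                               # 写回磁盘:这一页重新变「干净」,可以直接回收
with open(path, "rb") as f:
    data = f.read(5)
print(data)                                 # b'dirty'
os.close(fd)
os.remove(path)
```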
<a aria-controls="5111480d" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#5111480d" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="5111480d">
For some page types, this is not possible. For example,
in the case of the anonymous pages mentioned previously,
they only exist in memory and in no other backing store,
so they have to be kept there.</blockquote>
<p>对还有些内存类型而言,是不能回收的。比如前面提到的匿名页面,它们只存在于内存中,没有任何后备存储,
所以它们必须留在内存里。</p>
</div>
</div>
<div class="section" id="id9">
<h2><a class="toc-backref" href="#id26">说到交换区的本质</a></h2>
<a aria-controls="99551744" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#99551744" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="99551744">
<p>If you look for descriptions of the purpose of swap on Linux,
you'll inevitably find many people talking about it as if it is merely
an extension of the physical RAM for use in emergencies. For example,
here is a random post I got as one of the top results from typing
"what is swap" in Google:</p>
<blockquote>
Swap is essentially emergency memory; a space set aside for times
when your system temporarily needs more physical memory than you
have available in RAM. It's considered "bad" in the sense that
it's slow and inefficient, and if your system constantly needs
to use swap then it obviously doesn't have enough memory. […]
If you have enough RAM to handle all of your needs, and don't
expect to ever max it out, then you should be perfectly safe
running without a swap space.</blockquote>
</blockquote>
<p>如果你去搜 Linux 上交换区的目的的描述,肯定会找到很多人说交换区只是在紧急时用来扩展物理内存的机制。
比如下面这段是我在 google 中输入「什么是 swap」 从前排结果中随机找到的一篇:</p>
<blockquote>
交换区本质上是紧急内存;是为了应对你的系统临时所需内存多于你现有物理内存的情况,专门分出的一块额外空间。
大家觉得交换区「不好」是因为它又慢又低效,并且如果你的系统一直需要使用交换区那说明它明显没有足够的内存。
[……]如果你有足够内存覆盖所有你需要的情况,而且你觉得肯定不会用满内存,那么完全可以不用交换区
安全地运行系统。</blockquote>
<a aria-controls="fe4c8bc1" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#fe4c8bc1" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="fe4c8bc1">
To be clear, I don't blame the poster of this comment at all for the content
of their post – this is accepted as "common knowledge" by a lot of
Linux sysadmins and is probably one of the most likely things that you will
hear from one if you ask them to talk about swap. It is unfortunately also,
however, a misunderstanding of the purpose and use of swap, especially on
modern systems.</blockquote>
<p>事先说明,我完全不想因为这段话的内容责怪它的作者——这些内容被很多 Linux 系统管理员当作「常识」,
而且如果你请他们谈谈交换区,你很可能就会听到这样的回答。但是很不幸,
这种认识是对交换区的目的和用法的一种误解,尤其在现代系统上。</p>
<a aria-controls="77407587" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#77407587" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="77407587">
Above, I talked about reclamation for anonymous pages being "not possible",
as anonymous pages by their nature have no backing store to fall back to
when being purged from memory – as such, their reclamation would result in
complete data loss for those pages. What if we could create such a
store for these pages, though?</blockquote>
<p>前文中我说过回收匿名页面是「不可能的」,因为匿名内存的特性决定了,把它们从内存中清除掉之后,
没有别的存储区域能作为它们的后备——因此,回收它们将导致这些页面的数据彻底丢失。但是,
如果我们能为这种页面创建一种后备存储呢?</p>
<a aria-controls="75be861f" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#75be861f" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="75be861f">
Well, this is precisely what swap is for. Swap is a storage area for these
seemingly "unreclaimable" pages that allows us to page them out to
a storage device on demand. This means that they can now be considered as
equally eligible for reclaim as their more trivially reclaimable friends,
like clean file pages, allowing more efficient use of available physical memory.</blockquote>
<p>嗯,这正是交换区存在的意义。交换区就是为这些看似「不可回收」的页面准备的一块存储空间,让我们可以在需要时
把它们换出到存储设备上。这意味着它们从此可以和那些更容易回收的同类——比如干净的文件页面——
平等地作为回收的候选,从而更有效地利用可用的物理内存。</p>
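<p>想看看系统当前配置了哪些交换设备以及各自的用量,可以解析 <code class="code">/proc/swaps</code>。下面的解析逻辑只是一个草图,其中的示例文本是假设的;真实系统上把 <code class="code">sample</code> 换成 <code class="code">open("/proc/swaps").read()</code> 即可:</p>

```python
# 示意:解析 /proc/swaps 的内容,列出交换设备与用量。
# sample 是假设的示例文本,仅用于演示解析逻辑。
sample = """\
Filename                                Type            Size    Used    Priority
/dev/sda2                               partition       8388604 123456  -2
/swapfile                               file            2097148 0       -3
"""

def parse_swaps(text):
    devices = []
    for line in text.splitlines()[1:]:              # 跳过表头
        name, kind, size, used, prio = line.split()
        devices.append({"name": name, "type": kind,
                        "size_kb": int(size), "used_kb": int(used)})
    return devices

for dev in parse_swaps(sample):
    print(dev["name"], dev["used_kb"], "KiB in use")
```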
<a aria-controls="aa3951ca" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#aa3951ca" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="aa3951ca">
<strong>Swap is primarily a mechanism for equality of reclamation,</strong>
<strong>not for emergency "extra memory". Swap is not what makes your application</strong>
<strong>slow – entering overall memory contention is what makes your application slow.</strong></blockquote>
<p><strong>交换区主要是为了平等的回收机制,而不是为了紧急情况的「额外内存」。使用交换区不会让你的程序变慢——</strong>
<strong>进入内存竞争的状态才是让程序变慢的元凶。</strong></p>
<a aria-controls="526740fe" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#526740fe" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="526740fe">
So in what situations under this "equality of reclamation"
scenario would we legitimately choose to reclaim anonymous pages?
Here are, abstractly, some not uncommon scenarios:</blockquote>
<p>那么在这种「平等回收」的机制下,在哪些场景中选择回收匿名页面是合理的呢?
抽象地说,比如在下述这些并不罕见的场景中:</p>
<a aria-controls="a12ad6df" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#a12ad6df" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="a12ad6df">
<ol class="arabic simple">
<li>During initialisation, a long-running program may allocate and
use many pages. These pages may also be used as part of shutdown/cleanup,
but are not needed once the program is "started" (in an
application-specific sense). This is fairly common for daemons which
have significant dependencies to initialise.</li>
<li>During the program's normal operation, we may allocate memory which is
only used rarely. It may make more sense for overall system performance
to require a <strong>major fault</strong> to page these in from disk on demand,
instead using the memory for something else that's more important.</li>
</ol>
</blockquote>
<ol class="arabic simple">
<li>程序初始化的时候,那些长期运行的程序可能要分配和使用很多页面。这些页面可能在最后关闭/清理
的阶段还会用到,但是在程序「启动」完成之后(这里的「启动」取决于具体程序)就不再访问了。
对那些要初始化不少依赖库的后台守护进程来说,这种情况很常见。</li>
<li>在程序的正常运行过程中,我们可能分配一些很少使用的内存。与其让这些内存页一直留在内存中,
不如在真正用到的时候才通过 <strong>主缺页异常(major fault)</strong> 按需从磁盘换入,
把空出的内存留给更重要的用途,这对整体系统性能而言可能更划算。</li>
</ol>
<div class="panel panel-default">
<div class="panel-heading">
<a class="reference external" href="https://www.youtube.com/watch?v=ikZ8_mRotT4">cgroupv2: Linux's new unified control group hierarchy (QCON London 2017)</a></div>
<div class="panel-body">
<div align="left" class="youtube embed-responsive embed-responsive-16by9"><iframe allow="fullscreen" class="embed-responsive-item" frameborder="0" src="https://www.youtube.com/embed/ikZ8_mRotT4"></iframe></div></div>
</div>
</div>
<div class="section" id="id10">
<h2><a class="toc-backref" href="#id27">考察有无交换区时会发生什么</a></h2>
<a aria-controls="fc974b28" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#fc974b28" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="fc974b28">
Let's look at typical situations, and how they perform with and without
swap present. I talk about metrics around "memory contention" in my
<a class="reference external" href="https://www.youtube.com/watch?v=ikZ8_mRotT4">talk on cgroup v2</a> .</blockquote>
<p>我们来看一些典型场景,考察有无交换区时系统分别会如何表现。
在我的 <a class="reference external" href="https://www.youtube.com/watch?v=ikZ8_mRotT4">关于 cgroup v2 的演讲</a>
中探讨过「内存竞争」的指标。</p>
<div class="section" id="id11">
<h3><a class="toc-backref" href="#id28">在无/低内存竞争的状态下</a></h3>
<a aria-controls="a577bbc2" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#a577bbc2" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="a577bbc2">
<ul class="simple">
<li><strong>With swap:</strong> We can choose to swap out rarely-used anonymous memory that
may only be used during a small part of the process lifecycle,
allowing us to use this memory to improve cache hit rate,
or do other optimisations.</li>
<li><strong>Without swap</strong> We cannot swap out rarely-used anonymous memory,
as it's locked in memory. While this may not immediately
present as a problem, on some workloads this may represent
a non-trivial drop in performance due to stale,
anonymous pages taking space away from more important use.</li>
</ul>
</blockquote>
<ul class="simple">
<li><strong>有交换区:</strong> 我们可以选择换出那些只有在进程生存期内很小一部分时间会访问的匿名内存,
这允许我们空出更多内存空间用来提升缓存命中率,或者做别的优化。</li>
<li><strong>无交换区:</strong> 我们不能换出那些很少使用的匿名内存,因为它们被锁在了内存中。虽然这不一定立刻表现为问题,
但是在一些工作负载下,这些陈旧的匿名页面挤占了更重要用途的空间,
可能造成不容忽视的性能下降。</li>
</ul>
<div class="panel panel-default">
<div class="panel-heading">
译注:关于 <strong>内存热度</strong> 和 <strong>内存颠簸(thrash)</strong></div>
<div class="panel-body">
<p>讨论内核中内存管理的时候经常会说到内存页的 <strong>冷热</strong> 程度。这里冷热是指历史上内存页被访问到的频度,
内存管理的经验在说,历史上在近期频繁访问的 <strong>热</strong> 内存,在未来也可能被频繁访问,
从而应该留在物理内存里;反之历史上不那么频繁访问的 <strong>冷</strong> 内存,在未来也可能很少被用到,
从而可以考虑交换到磁盘或者丢弃文件缓存。</p>
<p><strong>颠簸(thrash)</strong> 这个词在文中出现多次但是似乎没有详细介绍,实际计算机科学专业的课程中应该有讲过。
一段时间内,让程序继续运行所需的热内存总量被称作程序的工作集(workset),估算工作集大小,
换句话说判断进程分配的内存页中哪些属于 <strong>热</strong> 内存哪些属于 <strong>冷</strong> 内存,是内核中
内存管理的最重要的工作。当分配给程序的内存大于工作集的时候,程序可以不需要等待I/O全速运行;
而当分配给程序的内存不足以放下整个工作集的时候,意味着程序每执行一小段就需要等待换页或者等待
磁盘缓存读入所需内存页,产生这种情况的时候,从用户角度来看可以观察到程序肉眼可见的「卡顿」。
当系统中所有程序都内存不足的时候,整个系统都处于颠簸的状态下,响应速度直接降至磁盘I/O的带宽。
如本文所说,禁用交换区并不能防止颠簸,只是从等待换页变成了等待文件缓存,
给程序分配超过工作集大小的内存才能防止颠簸。</p>
</div>
</div>
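<div class="panel panel-default">
<div class="panel-heading">
译注补充:用一个极简的模拟体会工作集与颠簸</div>
<div class="panel-body">
<p>上面说的工作集与颠簸,可以用一个教学用的 LRU 缺页模拟来体会:当缓存容量(类比可用物理内存)低于工作集大小时,缺页次数会急剧上升。以下只是一个示意草图,并非内核真实的回收算法:</p>

```python
from collections import OrderedDict

def count_faults(refs, capacity):
    """用 LRU 近似内核的页面回收,统计一串页面访问引发的缺页次数。"""
    cache, faults = OrderedDict(), 0
    for page in refs:
        if page in cache:
            cache.move_to_end(page)        # 命中:标记为「热」
        else:
            faults += 1                    # 缺页:需要换入(磁盘 I/O)
            if len(cache) >= capacity:
                cache.popitem(last=False)  # 回收最「冷」的页面
            cache[page] = True
    return faults

# 工作集是 4 个页面,循环访问 100 轮
refs = [0, 1, 2, 3] * 100
print(count_faults(refs, capacity=4))   # 4:只有首轮缺页
print(count_faults(refs, capacity=3))   # 400:每次访问都缺页——颠簸
```

<p>容量只差一页,缺页次数就从 4 暴涨到 400:这正是「分配的内存放不下工作集」时系统肉眼可见卡顿的来源。</p>
</div>
</div>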
</div>
<div class="section" id="id12">
<h3><a class="toc-backref" href="#id29">在中/高内存竞争的状态下</a></h3>
<a aria-controls="266f0b15" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#266f0b15" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="266f0b15">
<ul class="simple">
<li><strong>With swap:</strong> All memory types have an equal possibility of being reclaimed.
This means we have more chance of being able to reclaim pages
successfully – that is, we can reclaim pages that are not quickly
faulted back in again (thrashing).</li>
<li><strong>Without swap</strong> Anonymous pages are locked into memory as they have nowhere to go.
The chance of successful long-term page reclamation is lower,
as we have only some types of memory eligible to be reclaimed
at all. The risk of page thrashing is higher. The casual
reader might think that this would still be better as it might
avoid having to do disk I/O, but this isn't true –
we simply transfer the disk I/O of swapping to dropping
hot page caches and dropping code segments we need soon.</li>
</ul>
</blockquote>
<ul class="simple">
<li><strong>有交换区:</strong> 所有内存类型都有平等的被回收的可能性。这意味着我们回收页面有更高的成功率——
成功回收的意思是说被回收的那些页面不会在近期内被缺页异常换回内存中(颠簸)。</li>
<li><strong>无交换区:</strong> 匿名内存因为无处可去所以被锁在内存中。长期来看,页面回收的成功率变低了,因为我们总体上
能回收的页面种类更少了。发生页面颠簸的风险也更高了。粗心的读者可能觉得这样反而更好,
因为这能避免磁盘I/O,但事实并非如此——我们只是把换页造成的磁盘I/O,转嫁成了扔掉热的页面缓存
和扔掉很快就要用到的代码段所造成的磁盘I/O。</li>
</ul>
</div>
<div class="section" id="id13">
<h3><a class="toc-backref" href="#id30">在临时内存占用高峰时</a></h3>
<a aria-controls="bf276ac9" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#bf276ac9" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="bf276ac9">
<ul class="simple">
<li><strong>With swap:</strong> We're more resilient to temporary spikes, but in cases of
severe memory starvation, the period from memory thrashing beginning
to the OOM killer may be prolonged. We have more visibility into the
instigators of memory pressure and can act on them more reasonably,
and can perform a controlled intervention.</li>
<li><strong>Without swap</strong> The OOM killer is triggered more quickly as anonymous
pages are locked into memory and cannot be reclaimed. We're more likely to
thrash on memory, but the time between thrashing and OOMing is reduced.
Depending on your application, this may be better or worse. For example,
a queue-based application may desire this quick transfer from thrashing
to killing. That said, this is still too late to be really useful –
the OOM killer is only invoked at moments of severe starvation,
and relying on this method for such behaviour would be better replaced
with more opportunistic killing of processes as memory contention
is reached in the first place.</li>
</ul>
</blockquote>
<ul class="simple">
<li><strong>有交换区:</strong> 我们对内存使用激增的情况更有抵抗力,但是在严重的内存不足的情况下,
从开始发生内存颠簸到 OOM 杀手开始工作的时间会被延长。内存压力造成的问题更容易观察到,
从而可能更有效地应对,或者更有机会可控地干预。</li>
<li><strong>无交换区:</strong> 因为匿名内存被锁在内存中不能被回收,所以 OOM 杀手会被更早触发。
发生内存颠簸的可能性更大,但是从开始颠簸到 OOM 出面解决的时间间隔缩短了。
取决于你的程序,这可能更好也可能更糟。比如说,基于队列的程序可能更希望这种从颠簸到杀进程的转换尽快发生。
即便如此,等到 OOM 出手通常还是太迟了,起不了什么作用——只有在系统极度内存紧缺的情况下才会请出
OOM 杀手;与其依赖这种行为,不如在刚进入内存竞争状态时就更主动地杀掉进程。</li>
</ul>
</div>
<div class="section" id="id14">
<h3><a class="toc-backref" href="#id31">好吧,所以我需要系统交换区,但是我该怎么为每个程序微调它的行为?</a></h3>
<a aria-controls="7f9321c9" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#7f9321c9" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="7f9321c9">
You didn't think you'd get through this entire post without me plugging cgroup v2, did you? ;-)</blockquote>
<p>你肯定想到了我写这篇文章一定会在哪儿插点 cgroup v2 的安利吧? ;-)</p>
<a aria-controls="3fcbadba" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#3fcbadba" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="3fcbadba">
Obviously, it's hard for a generic heuristic algorithm to be right all the time,
so it's important for you to be able to give guidance to the kernel.
Historically the only tuning you could do was at the system level,
using <code class="code">
vm.swappiness</code>
. This has two problems: <code class="code">
vm.swappiness</code>
is incredibly hard to reason about because it only feeds in as
a small part of a larger heuristic system, and it also is system-wide
instead of being granular to a smaller set of processes.</blockquote>
<p>显然,要设计一种对所有情况都有效的启发算法会非常难,所以给内核提一些指引就很重要。
历史上我们只能在整个系统层面做这方面微调,通过 <code class="code">
vm.swappiness</code>
。这有两方面问题:
<code class="code">
vm.swappiness</code>
的效果极难推断,因为它只是作为一个小输入,喂给一个大得多的启发式系统;
并且它是一个系统级别的设置,没法针对一小部分进程微调。</p>
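<p>查看当前 <code class="code">vm.swappiness</code> 的值可以直接读 procfs。下面的草图假设在 Linux 上运行,读取失败时退回内核默认值 60:</p>

```python
# 示意:vm.swappiness 是全系统唯一的旋钮,这是读取当前值的一种方式。
# 假设运行在 Linux 上;读不到时退回内核默认值 60。
def read_swappiness(path="/proc/sys/vm/swappiness"):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError:
        return 60   # 内核默认值

value = read_swappiness()
print(value)
```

<p>注意 5.8 以后的内核允许 0 到 200 的取值;修改它需要 root 权限,写 <code class="code">/proc/sys/vm/swappiness</code> 或者用 sysctl 均可。</p>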
<a aria-controls="dc0990d3" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#dc0990d3" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="dc0990d3">
You can also use <code class="code">
mlock</code>
to lock pages into memory, but this requires
either modifying program code, fun with <code class="code">
LD_PRELOAD</code>
, or doing
horrible things with a debugger at runtime.
In VM-based languages this also doesn't work very well, since you
generally have no control over allocation and end up having to
<code class="code">
mlockall</code>
, which has no precision towards the pages
you actually care about.</blockquote>
<p>你可以用 <code class="code">
mlock</code>
把页面锁在内存里,但这要么需要修改程序代码,要么得折腾
<code class="code">
LD_PRELOAD</code>
,要么在运行期用调试器做一些吓人的操作。对基于虚拟机的语言来说,这种方案也不能
很好地工作,因为你通常无法控制内存分配,最后只能用上 <code class="code">
mlockall</code>
,而它无法精确指定你真正关心的页面。</p>
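<p>用 ctypes 调用 <code class="code">mlock(2)</code> 大概长这样。这只是一个示意草图,假设在 Linux + glibc 环境;普通用户受 RLIMIT_MEMLOCK 限制,调用可能失败:</p>

```python
import ctypes, mmap

# 示意:没有源代码时,也可以从 Python 通过 ctypes 调用 mlock(2) 锁住一页内存。
# 假设是 Linux + glibc 环境;普通用户受 RLIMIT_MEMLOCK 限制,超限时 mlock 返回 -1。
libc = ctypes.CDLL(None, use_errno=True)
buf = mmap.mmap(-1, mmap.PAGESIZE)                       # 一页匿名内存
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))  # 取得这页的地址
ret = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE))
if ret == 0:
    # 成功:这一页被钉在物理内存里,不再参与回收/交换
    libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE))
else:
    print("mlock failed, errno =", ctypes.get_errno())
buf.close()
```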
<a aria-controls="ee392737" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#ee392737" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="ee392737">
cgroup v2 has a tunable per-cgroup in the form of <code class="code">
memory.low</code>
, which allows us to tell the kernel to prefer other applications for
reclaim below a certain threshold of memory used. This allows us to not
prevent the kernel from swapping out parts of our application,
but prefer to reclaim from other applications under memory contention.
Under normal conditions, the kernel's swap logic is generally pretty good,
and allowing it to swap out pages opportunistically generally increases
system performance. Swap thrash under heavy memory contention is not ideal,
but it's more a property of simply running out of memory entirely than
a problem with the swapper. In these situations, you typically want to
fail fast by self-killing non-critical processes when memory pressure
starts to build up.</blockquote>
<p>cgroup v2 提供了一套可以每个 cgroup 微调的 <code class="code">
memory.low</code>
,允许我们告诉内核说当使用的内存低于一定阈值之后优先回收别的程序的内存。这可以让我们不强硬禁止内核
换出程序的一部分内存,但是当发生内存竞争的时候让内核优先回收别的程序的内存。在正常条件下,
内核的交换逻辑通常还是不错的,允许它有条件地换出一部分页面通常可以改善系统性能。在内存竞争的时候
发生交换颠簸虽然不理想,但是这更多地是单纯因为整体内存不够了,而不是因为交换进程(swapper)导致的问题。
在这种场景下,你通常希望在内存压力开始积累的时候,通过主动杀掉一些非关键进程的方式来快速止损(fail fast)。</p>
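<p>往某个 cgroup 的 <code class="code">memory.low</code> 里写阈值,可以封装成下面这样的草图。其中的路径和 cgroup 名字都是假设的;真实系统上需要 root 权限,并且 cgroup v2 通常挂载在 /sys/fs/cgroup:</p>

```python
import os, tempfile

# 示意:为 cgroup v2 中某个 cgroup 设置 memory.low 的最小封装。
# 真实路径形如 /sys/fs/cgroup/myapp(名字是假设的),需要 root 权限。
def set_memory_low(cgroup_dir, bytes_low):
    """低于 bytes_low 时,内核会优先回收别的 cgroup 的内存。"""
    with open(os.path.join(cgroup_dir, "memory.low"), "w") as f:
        f.write(str(bytes_low))

# 在临时目录里演练一下写入格式(真实环境换成实际的 cgroup 目录):
demo = tempfile.mkdtemp()
set_memory_low(demo, 512 * 1024 * 1024)        # 保护 512 MiB
with open(os.path.join(demo, "memory.low")) as f:
    print(f.read())                            # 536870912
```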
<a aria-controls="d789770a" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#d789770a" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="d789770a">
You can not simply rely on the OOM killer for this. The OOM killer is
only invoked in situations of dire failure when we've already entered
a state where the system is severely unhealthy and may well have been
so for a while. You need to opportunistically handle the situation yourself
before ever thinking about the OOM killer.</blockquote>
<p>你不能单纯依赖 OOM 杀手来达成这个目标。OOM 杀手只在事态已经非常危急时才会出动,那时系统已经处于极度不健康的
状态,而且很可能已经这样持续了一阵子。你需要赶在考虑 OOM 杀手之前,自己抓住时机处理这种情况。</p>
<a aria-controls="fe0515e4" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#fe0515e4" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="fe0515e4">
Determination of memory pressure is somewhat difficult using traditional
Linux memory counters, though. We have some things which seem somewhat related,
but are merely tangential – memory usage, page scans, etc – and from these
metrics alone it's very hard to tell an efficient memory configuration
from one that's trending towards memory contention. There is a group of us
at Facebook, spearheaded by <a class="reference external" href="https://patchwork.kernel.org/project/LKML/list/?submitter=45">Johannes</a>
, working on developing new metrics that expose memory pressure more easily
that should help with this in future. If you're interested in hearing more
about this,
<a class="reference external" href="https://youtu.be/ikZ8_mRotT4?t=2145">I go into detail about one metric being considered in my talk on cgroup v2</a>.</blockquote>
<p>不过,用传统的 Linux 内存统计数据还是挺难判断内存压力的。我们有一些看起来相关的系统指标——
内存用量、页面扫描之类——但它们都只是间接相关,单纯从这些指标很难判断系统是处于高效的内存利用状态还是
在滑向内存竞争状态。我们在 Facebook 有个团队,由
<a class="reference external" href="https://patchwork.kernel.org/project/LKML/list/?submitter=45">Johannes</a>
牵头,努力开发一些能评价内存压力的新指标,希望能在今后改善目前的现状。
如果你对这方面感兴趣, <a class="reference external" href="https://youtu.be/ikZ8_mRotT4?t=2145">在我的 cgroup v2 的演讲中介绍到一个被提议的指标</a>
。</p>
</div>
</div>
<div class="section" id="id16">
<h2><a class="toc-backref" href="#id32">调优</a></h2>
<div class="section" id="id17">
<h3><a class="toc-backref" href="#id33">那么,我需要多少交换空间?</a></h3>
<a aria-controls="5b5c883c" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#5b5c883c" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="5b5c883c">
In general, the minimum amount of swap space required for optimal
memory management depends on the number of anonymous pages pinned into
memory that are rarely reaccessed by an application, and the value of
reclaiming those anonymous pages. The latter is mostly a question of
which pages are no longer purged to make way for these infrequently
accessed anonymous pages.</blockquote>
<p>通常而言,最优内存管理所需的最小交换空间,取决于程序固定在内存中而又很少被再次访问的匿名页面的数量,
以及回收这些匿名页面换来的价值。后者大体上是在问:有了交换空间之后,哪些页面可以不再为了给这些
很少访问的匿名页面腾地方而被清出去。</p>
<a aria-controls="f070dec0" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#f070dec0" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="f070dec0">
If you have a bunch of disk space and a recent (4.0+) kernel,
more swap is almost always better than less. In older kernels <code class="code">
kswapd</code>
,
one of the kernel processes responsible for managing swap, was historically
very overeager to swap out memory aggressively the more swap you had.
In recent times, swapping behaviour when a large amount of swap space is
available has been significantly improved. If you're running kernel 4.0+,
having a larger swap on a modern kernel should not result in overzealous
swapping. As such, if you have the space, having a swap size of a few GB
keeps your options open on modern kernels.</blockquote>
<p>如果你有足够大的磁盘空间和比较新的内核版本(4.0+),越大的交换空间基本上总是越好的。
更老的内核上 <code class="code">
kswapd</code>
, 内核中负责管理交换的内核线程之一,在历史上倾向于交换空间越大就越急切地把内存换出去。
最近一段时间,可用交换空间很大时的交换行为已经改善了很多。如果运行的是 4.0+ 的内核,
即便有很大的交换区,现代内核也不会过分激进地做交换。因此,如果你有足够的磁盘容量,
在现代内核上留出几个 GB 的交换空间能让你有更多余地。</p>
<a aria-controls="d981b13d" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#d981b13d" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="d981b13d">
If you're more constrained with disk space, then the answer really
depends on the tradeoffs you have to make, and the nature of the environment.
Ideally you should have enough swap to make your system operate optimally
at normal and peak (memory) load. What I'd recommend is setting up a few
testing systems with 2-3GB of swap or more, and monitoring what happens
over the course of a week or so under varying (memory) load conditions.
As long as you haven't encountered severe memory starvation during that week
– in which case the test will not have been very useful – you will probably
end up with some number of MB of swap occupied. As such, it's probably worth
having at least that much swap available, in addition to a little buffer for
changing workloads. <code class="code">
atop</code>
in logging mode can also show you which applications
are having their pages swapped out in the <code class="code">
SWAPSZ</code>
column, so if you don't
already use it on your servers to log historic server state you probably
want to set it up on these test machines with logging mode as part of this
experiment. This also tells you when your application started swapping out
pages, which you can tie to log events or other key data.</blockquote>
<p>如果你的磁盘空间有限,那么答案更多取决于你愿意做的取舍,以及运行的环境。理想上应该有足够的交换空间
能高效应对正常负载和高峰(内存)负载。我建议先用 2-3GB 或者更多的交换空间搭个测试环境,
然后监视在不同(内存)负载条件下持续一周左右的情况。只要那一周里没有发生过严重的内存不足——
发生了的话这次测试也就没什么参考价值了——测试结束时交换区大概会有若干 MB 的占用。
据此,你的可用交换空间至少应该有那么大,再多留一点余量以应对负载变化。用日志模式跑 <code class="code">
atop</code>
可以在 <code class="code">
SWAPSZ</code>
栏显示程序的页面被交换出去的情况,所以如果你还没用它记录服务器历史日志的话
,这次测试中可以试试在测试机上用它记录日志。这也会告诉你什么时候你的程序开始换出页面,你可以用这个
对照事件日志或者别的关键数据。</p>
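<p>译注:补充一个简单的示意脚本(函数名为译者虚构),演示如何从 <code class="code">/proc/meminfo</code> 计算当前交换区占用量,可以配合 atop 的日志在测试那一周里观察占用峰值:</p>

```python
# 译者补充的示意:解析 /proc/meminfo 文本,算出当前交换区占用量。
def swap_used_kib(meminfo_text: str) -> int:
    """返回 SwapTotal - SwapFree,单位 KiB。"""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            # 每行形如 "SwapTotal:       1048572 kB"
            fields[key] = int(rest.split()[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# 用法:swap_used_kib(open("/proc/meminfo").read())
```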
<a aria-controls="c81cfdce" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#c81cfdce" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="c81cfdce">
Another thing worth considering is the nature of the swap medium.
Swap reads tend to be highly random, since we can't reliably predict
which pages will be refaulted and when. On an SSD this doesn't matter much,
but on spinning disks, random I/O is extremely expensive since it requires
physical movement to achieve. On the other hand, refaulting of file pages
is likely less random, since files related to the operation of a single
application at runtime tend to be less fragmented. This might mean that on
a spinning disk you may want to bias more towards reclaiming file pages
instead of swapping out anonymous pages, but again, you need to test and
evaluate how this balances out for your workload.</blockquote>
<p>另一点值得考虑的是交换空间所在存储设备的媒介。读取交换区倾向于很随机,因为我们不能可靠预测什么时候
什么页面会被再次访问。在 SSD 上这不是什么问题,但是在传统磁盘上,随机 I/O 操作会很昂贵,
因为需要物理动作寻道。另一方面,重新加载文件缓存可能不那么随机,因为单一程序在运行期的文件读操作
一般不会太碎片化。这可能意味着在传统磁盘上你会想更偏向回收文件页面而不是换出匿名页面,但还是那句话,
你需要自己测试评估在你的工作负载下如何取得平衡。</p>
<div class="panel panel-default">
<div class="panel-heading">
译注:关于休眠到磁盘时的交换空间大小</div>
<div class="panel-body">
原文这里建议交换空间至少是物理内存大小,我觉得实际上不需要。休眠到磁盘的时候内核会写回并丢弃
所有有文件作后备的可回收页面,交换区只需要能放下那些没有文件后备的页面就可以了。
如果去掉文件缓存页面之后剩下的已用物理内存总量能完整放入交换区中,就可以正常休眠。
对于桌面浏览器这种内存大户,通常有很多缓存页可以在休眠的时候丢弃。</div>
</div>
<a aria-controls="34173e8b" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#34173e8b" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="34173e8b">
For laptop/desktop users who want to hibernate to swap, this also needs to
be taken into account – in this case your swap file should be at least
your physical RAM size.</blockquote>
<p>对于想要休眠到交换区的笔记本/桌面用户,这一点也需要考虑——这种情况下你的交换文件应该至少和物理内存一样大。</p>
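<p>译注:按照上面译注的思路,可以用 <code class="code">/proc/meminfo</code> 粗略估算休眠所需的交换空间。下面是译者补充的示意(只是粗略估计:Shmem 虽计入 Cached 但没有文件后备、休眠时仍需写入交换区,所以要加回来;实际能否休眠以内核行为为准):</p>

```python
# 译者补充的粗略估算:已用内存减去可直接写回并丢弃的文件缓存。
def hibernate_swap_estimate_kib(meminfo_text: str) -> int:
    """估算休眠大约需要的交换空间(KiB)。
    Shmem(tmpfs 等)虽计入 Cached,但没有文件后备、仍需写入交换区,故加回。"""
    f = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            f[key] = int(rest.split()[0])
    used = f["MemTotal"] - f["MemFree"]
    droppable = f["Buffers"] + f["Cached"] - f.get("Shmem", 0)
    return max(used - droppable, 0)


# 用法:hibernate_swap_estimate_kib(open("/proc/meminfo").read())
```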
</div>
<div class="section" id="swappiness">
<h3><a class="toc-backref" href="#id34">我的 swappiness 应该如何设置?</a></h3>
<a aria-controls="02421253" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#02421253" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="02421253">
First, it's important to understand what <code class="code">
vm.swappiness</code>
does.
<code class="code">
vm.swappiness</code>
is a sysctl that biases memory reclaim either towards
reclamation of anonymous pages, or towards file pages. It does this using two
different attributes: <code class="code">
file_prio</code>
(our willingness to reclaim file pages)
and <code class="code">
anon_prio</code>
(our willingness to reclaim anonymous pages).
<code class="code">
vm.swappiness</code>
 plays into this, as it becomes the default value for
<code class="code">
anon_prio</code>
, and it also is subtracted from the default value of 200
for <code class="code">
file_prio</code>
, which means for a value of <code class="code">
vm.swappiness = 50</code>
,
the outcome is that <code class="code">
anon_prio</code>
is 50, and <code class="code">
file_prio</code>
is 150
(the exact numbers don't matter as much as their relative weight compared to the other).</blockquote>
<p>首先很重要的一点是,要理解 <code class="code">
vm.swappiness</code>
是做什么的。
<code class="code">
vm.swappiness</code>
是一个 sysctl 用来控制在内存回收的时候,是优先回收匿名页面,
还是优先回收文件页面。内存回收的时候用两个属性: <code class="code">
file_prio</code>
(回收文件页面的倾向)
和 <code class="code">
anon_prio</code>
(回收匿名页面的倾向)。 <code class="code">
vm.swappiness</code>
同时影响这两个值:
它本身就是 <code class="code">
anon_prio</code>
 的默认值,而 <code class="code">
file_prio</code>
 的默认值则是 200 减去它。
意味着如果我们设置 <code class="code">
vm.swappiness = 50</code>
那么结果是 <code class="code">
anon_prio</code>
是 50,
<code class="code">
file_prio</code>
是 150 (这里数值本身不是很重要,重要的是两者之间的权重比)。</p>
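<p>译注:上面描述的权重关系可以用几行代码示意(这只是按文中公式的演算,并非内核源码):</p>

```python
def reclaim_priorities(swappiness: int) -> tuple[int, int]:
    """按文中描述计算 (anon_prio, file_prio):
    anon_prio = swappiness,file_prio = 200 - swappiness。
    重要的是两者的相对权重,而非绝对数值。"""
    assert 0 <= swappiness <= 200
    return swappiness, 200 - swappiness


# 例如 vm.swappiness = 50 时:
# reclaim_priorities(50) 返回 (50, 150),
# 即内核回收文件页面的倾向是回收匿名页面的 3 倍
```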
<div class="panel panel-default">
<div class="panel-heading">
译注:关于 SSD 上的 swappiness</div>
<div class="panel-body">
<p>原文这里说 SSD 上 swap 和 drop page cache 差不多开销所以 <code class="code">
vm.swappiness = 100</code>
。我觉得实际上要考虑 swap out 的时候会产生写入操作,而 drop page cache 可能不需要写入(
要看页面是否是脏页)。如果负载本身对I/O带宽比较敏感,稍微调低 swappiness 可能对性能更好,
内核的默认值 60 是个不错的默认值。以及桌面用户可能对性能不那么关心,反而更关心 SSD
的写入寿命,虽然说 SSD 写入寿命一般也足够桌面用户,不过调低 swappiness
可能也能减少一部分不必要的写入(因为写回脏页是必然会发生的,而写 swap 可以避免)。
当然太低的 swappiness 会对性能有负面影响(因为太多匿名页面留在物理内存里而降低了缓存命中率)
,这里的权衡也需要根据具体负载做测试。</p>
<p>另外澄清一点误解, swap 分区还是 swap 文件对系统运行时的性能而言没有差别。或许有人会觉得
swap 文件要经过文件系统所以会有性能损失,在译文之前译者说过 Linux 的内存管理子系统基本上独立于文件系统。
实际上 Linux 上的 swapon 在设置 swap 文件作为交换空间的时候会读取一次文件系统元数据,
确定 swap 文件在磁盘上的地址范围,随后运行的过程中做交换就和文件系统无关了。关于 swap
空间是否连续的影响,因为 swap 读写基本是页面单位的随机读写,所以即便连续的 swap 空间(swap
分区)也并不能改善 swap 的性能。稀疏文件的地址范围本身不连续,写入稀疏文件的空洞需要
文件系统分配磁盘空间,所以在 Linux 上交换文件不能是稀疏文件。只要不是稀疏文件,
连续的文件内地址范围在磁盘上是否连续(是否有文件碎片)基本不影响能否 swapon 或者使用 swap 时的性能。</p>
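<p>译注:补充一段示意代码,用 Linux 的 <code class="code">SEEK_HOLE</code> 接口粗略检查一个文件在 EOF 之前是否有空洞(即是否是稀疏文件);具体行为依赖文件系统支持,仅作演示:</p>

```python
import os


def has_hole(path: str) -> bool:
    """若文件在 EOF 之前存在空洞(稀疏区域)则返回 True。
    依赖文件系统对 SEEK_HOLE 的支持;不支持时内核退化为
    「整个文件都是数据、EOF 处有隐式空洞」,此时总返回 False。"""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # SEEK_HOLE 返回第一个空洞的偏移;非稀疏文件只会返回 EOF 处的隐式空洞
        return os.lseek(fd, 0, os.SEEK_HOLE) < size
    finally:
        os.close(fd)
```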
</div>
</div>
<a aria-controls="b2897dd9" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#b2897dd9" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="b2897dd9">
This means that, in general, <code class="code">
vm.swappiness</code>
<strong>is simply a ratio of how</strong>
<strong>costly reclaiming and refaulting anonymous memory is compared to file memory</strong>
<strong>for your hardware and workload</strong>. The lower the value, the more you tell the
kernel that infrequently accessed anonymous pages are expensive to swap out
and in on your hardware. The higher the value, the more you tell the kernel
that the cost of swapping anonymous pages and file pages is similar on your
hardware. The memory management subsystem will still try to mostly decide
whether it swaps file or anonymous pages based on how hot the memory is,
but swappiness tips the cost calculation either more towards swapping or
more towards dropping filesystem caches when it could go either way.
On SSDs these are basically as expensive as each other, so setting
<code class="code">
vm.swappiness = 100</code>
(full equality) may work well.
On spinning disks, swapping may be significantly more expensive since
swapping in generally requires random reads, so you may want to
bias more towards a lower value.</blockquote>
<p>这意味着,通常来说 <code class="code">
vm.swappiness</code>
<strong>只是一个比例,用来衡量在你的硬件和工作负载下,</strong>
<strong>回收和换回匿名内存还是文件内存哪种更昂贵</strong> 。设定的值越低,你就是在告诉内核说换出那些不常访问的
匿名页面在你的硬件上开销越昂贵;设定的值越高,你就是在告诉内核说在你的硬件上交换匿名页和
文件缓存的开销越接近。内存管理子系统大体上仍然会根据内存的访问热度自行决定回收文件页面还是匿名页面,
只不过在两种方式皆可的时候,swappiness 会在开销权衡中左右天平,更偏向做交换还是丢弃文件系统缓存。
在 SSD 上这两种方式的开销基本相同,所以设成
<code class="code">
vm.swappiness = 100</code>
(同等比重)可能工作得不错。在传统磁盘上,交换页面可能会更昂贵,
因为通常需要随机读取,所以你可能想要设低一些。</p>
<a aria-controls="916d89d2" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#916d89d2" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="916d89d2">
The reality is that most people don't really have a feeling about which
their hardware demands, so it's non-trivial to tune this value based on
instinct alone – this is something that you need to test using different
values. You can also spend time evaluating the memory composition of your
system and core applications and their behaviour under mild memory reclamation.</blockquote>
<p>现实是大部分人并不清楚他们的硬件到底需要什么,所以仅凭直觉调整这个值并不容易——
你需要用不同的取值做测试。你也可以花时间评估一下系统的内存构成,以及核心应用在轻度内存回收下的行为表现。</p>
<a aria-controls="1716806c" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#1716806c" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="1716806c">
When talking about <code class="code">
vm.swappiness</code>
, an extremely important change to
consider from recent(ish) times is
<a class="reference external" href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=fe35004fbf9eaf67482b074a2e032abb9c89b1dd">this change to vmscan by Satoru Moriya in 2012</a>
, which changes the way that <code class="code">
vm.swappiness = 0</code>
is handled
quite significantly.</blockquote>
<p>讨论 <code class="code">
vm.swappiness</code>
的时候,一个需要考虑的极为重要的改动是(相对)近期的
<a class="reference external" href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=fe35004fbf9eaf67482b074a2e032abb9c89b1dd">2012 年左右 Satoru Moriya 对 vmscan 行为的修改</a>
,它显著改变了代码对 <code class="code">
vm.swappiness = 0</code>
这个值的处理方式。</p>
<a aria-controls="0d73ddd8" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#0d73ddd8" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="0d73ddd8">
Essentially, the patch makes it so that we are extremely biased against
scanning (and thus reclaiming) any anonymous pages at all with
<code class="code">
vm.swappiness = 0</code>
, unless we are already encountering severe
memory contention. As mentioned previously in this post, that's generally
not what you want, since this prevents equality of reclamation prior to
extreme memory pressure occurring, which may actually lead to this
extreme memory pressure in the first place. <code class="code">
vm.swappiness = 1</code>
is the lowest you can go without invoking the special casing for
anonymous page scanning implemented in that patch.</blockquote>
<p>基本上来说这个补丁让我们在 <code class="code">
vm.swappiness = 0</code>
的时候会极度避免扫描(进而回收)匿名页面,
除非我们已经在经历严重的内存竞争。就如本文前面所述,这种行为基本上不会是你想要的,
因为这会导致在出现极端内存压力之前无法公平地回收内存,而这反而可能正是最初引发极端内存压力的原因。
想要避免这个补丁中对扫描匿名页面的特殊行为的话, <code class="code">
vm.swappiness = 1</code>
是你能设置的最低值。</p>
<a aria-controls="8283fab5" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#8283fab5" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="8283fab5">
The kernel default here is <code class="code">
vm.swappiness = 60</code>
. This value is
generally not too bad for most workloads, but it's hard to have a
general default that suits all workloads. As such, a valuable extension
to the tuning mentioned in the "how much swap do I need" section above
would be to test these systems with differing values for <code class="code">
vm.swappiness</code>
, and monitor your application and system metrics under heavy (memory) load.
Some time in the near future, once we have a decent implementation of
<a class="reference external" href="https://youtu.be/ikZ8_mRotT4?t=2145">refault detection</a> in the kernel,
you'll also be able to determine this somewhat workload-agnostically by
looking at cgroup v2's page refaulting metrics.</blockquote>
<p>内核在这里设置的默认值是 <code class="code">
vm.swappiness = 60</code>
。这个值对大部分工作负载来说都不会太坏,
但是很难有一个默认值能符合所有种类的工作负载。因此,对上面「 <a class="reference internal" href="#id17">那么,我需要多少交换空间?</a>
」那段讨论的一点重要扩展可以说,在测试系统中可以尝试使用不同的 <code class="code">
vm.swappiness</code>
,然后监视你的程序和系统在重(内存)负载下的性能指标。在未来某天,如果我们在内核中有了合理的
<a class="reference external" href="https://youtu.be/ikZ8_mRotT4?t=2145">重新缺页检测(refault detection)</a> 的实现,你也将能通过 cgroup v2 的页面重新缺页
指标,以与具体负载无关的方式确定这个值。</p>
<div class="panel panel-default">
<div class="panel-heading">
<a class="reference external" href="https://www.youtube.com/watch?v=beefUhRH5lU">SREcon19 Asia/Pacific - Linux Memory Management at Scale: Under the Hood</a></div>
<div class="panel-body">
<div align="left" class="youtube embed-responsive embed-responsive-16by9"><iframe allow="fullscreen" class="embed-responsive-item" frameborder="0" src="https://www.youtube.com/embed/beefUhRH5lU"></iframe></div></div>
</div>
</div>
<div class="section" id="id19">
<h3><a class="toc-backref" href="#id35">2019年07月更新:内核 4.20+ 中的内存压力指标</a></h3>
<a aria-controls="2cbb410f" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#2cbb410f" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="2cbb410f">
The refault metrics mentioned as in development earlier are now in the
kernel from 4.20 onwards and can be enabled with <code class="code">
CONFIG_PSI=y</code>
. See my talk at SREcon at around the 25:05 mark:</blockquote>
<p>前文提到的还在开发中的重新缺页(refault)指标,已经进入 4.20 及以上版本的内核,可以通过
<code class="code">
CONFIG_PSI=y</code>
 开启。详情参见我在 SREcon 演讲中大约 25:05 处的部分。</p>
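<p>译注:开启 PSI 之后可以从 <code class="code">/proc/pressure/memory</code> 读取内存压力。下面补充一个解析示意(字段名是 PSI 的真实输出格式,示例数值是假设的):</p>

```python
# 译者补充的示意:解析 /proc/pressure/memory 的两行输出,形如
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
def parse_psi_memory(text: str) -> dict:
    """返回 {"some": {...}, "full": {...}};avg* 为百分比,total 为微秒。"""
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return result


# 用法:parse_psi_memory(open("/proc/pressure/memory").read())
```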
</div>
</div>
<div class="section" id="id20">
<h2><a class="toc-backref" href="#id36">结论</a></h2>
<a aria-controls="01142bf6" aria-expanded="false" class="translate-collapse-btn" data-toggle="collapse" href="#01142bf6" role="button">
<span class="badge badge-pill badge-light"><i class="fa fa-language"></i> </span></a><blockquote class="collapse" id="01142bf6">
<ul class="simple">
<li>Swap is a useful tool to allow equality of reclamation of memory pages,
but its purpose is frequently misunderstood, leading to its negative
perception across the industry. If you use swap in the spirit intended,
though – as a method of increasing equality of reclamation – you'll
find that it's a useful tool instead of a hindrance.</li>
<li>Disabling swap does not prevent disk I/O from becoming a problem under
memory contention, it simply shifts the disk I/O thrashing from anonymous
pages to file pages. Not only may this be less efficient, as we have
a smaller pool of pages to select from for reclaim, but it may also
contribute to getting into this high contention state in the first place.</li>
<li>Swap can make a system slower to OOM kill, since it provides another,
slower source of memory to thrash on in out of memory situations – the
OOM killer is only used by the kernel as a last resort, after things have
already become monumentally screwed. The solutions here depend on your system:<ul>
<li>You can opportunistically change the system workload depending on
cgroup-local or global memory pressure. This prevents getting into these
situations in the first place, but solid memory pressure metrics are
lacking throughout the history of Unix. Hopefully this should be
better soon with the addition of refault detection.</li>
<li>You can bias reclaiming (and thus swapping) away from certain processes
per-cgroup using memory.low, allowing you to protect critical daemons
without disabling swap entirely.</li>
</ul>
</li>
</ul>
</blockquote>
<ul class="simple">
<li>交换区是让内存页面得到公平回收的有用工具,但它的目的经常被人误解,导致了它在业内的负面声誉。如果
你是按照原本的目的使用交换区的话——作为增加内存回收公平性的方式——你会发现它是很有效的工具而不是阻碍。</li>
<li>禁用交换区并不能防止内存竞争时磁盘I/O成为问题,它只不过把匿名页面的磁盘I/O换成了文件页面的
磁盘I/O。这不仅可能更低效,因为回收时可供选择的页面范围更小了,而且它本身也可能是促成进入这种
高度内存竞争状态的原因之一。</li>
<li>有交换区会让系统在内存不足时更慢地走到 OOM 杀手那一步,因为它提供了另一种更慢的内存来源,
可以在缺乏内存的场景下持续颠簸——内核只把 OOM 杀手当作最后手段,在一切已经被搞得一团糟之后才会动用。
解决方案取决于你的系统:<ul>
<li>你可以根据每个 cgroup 的或者系统全局的内存压力,预先调整系统负载。这能防止我们最初进入内存竞争
的状态,但是 Unix 的历史中一直缺乏可靠的内存压力检测方式。希望不久之后在有了
<a class="reference external" href="https://youtu.be/ikZ8_mRotT4?t=2145">重新缺页检测(refault detection)</a> 这样的指标之后能改善这一点。</li>
<li>你可以使用 <code class="code">
memory.low</code>
让内核不倾向于回收(进而交换)特定一些 cgroup 中的进程,
允许你在不禁用交换区的前提下保护关键后台服务。</li>
</ul>
</li>
</ul>
<hr class="docutils"/>
<p>感谢在撰写本文时 <a class="reference external" href="https://github.com/rahulg">Rahul</a> ,
<a class="reference external" href="https://github.com/htejun">Tejun</a> 和
<a class="reference external" href="https://patchwork.kernel.org/project/LKML/list/?submitter=45">Johannes</a>
提供的诸多建议和反馈。</p>
</div>