Implement Linux memory policy, youki-dev/youki #3230 by n4mlz
Add support for Linux memory policy, opencontainers/runtime-spec #1282 by askervin

set_mempolicy(2)

$ man 2 set_mempolicy
long set_mempolicy(int mode, const unsigned long *nodemask, unsigned long maxnode);

This system call sets the NUMA memory policy of the calling thread: it controls which NUMA node(s) the thread's memory is allocated from.
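
The `nodemask` argument is a bit mask with one bit per node ID, stored as an array of `unsigned long`, and `maxnode` tells the kernel how many bits of it to look at. A minimal sketch of building such a mask from a list of node IDs, assuming a 64-bit `unsigned long`; `build_nodemask` is a made-up helper name, and the exact `maxnode` value a caller should pass is worth checking against the man page:

```python
BITS_PER_WORD = 64  # assumption: unsigned long is 64 bits wide


def build_nodemask(nodes):
    """Return (words, nbits): the mask as a list of 64-bit words, and its size in bits."""
    nodes = sorted(set(nodes))
    if not nodes:
        return [], 0  # empty mask, as used with MPOL_DEFAULT / MPOL_LOCAL
    if nodes[0] < 0:
        raise ValueError("node IDs must be non-negative")
    n_words = nodes[-1] // BITS_PER_WORD + 1
    words = [0] * n_words
    for node in nodes:
        # Bit (node % 64) of word (node // 64) stands for this node.
        words[node // BITS_PER_WORD] |= 1 << (node % BITS_PER_WORD)
    return words, n_words * BITS_PER_WORD


words, nbits = build_nodemask([0, 1])
print([bin(w) for w in words], nbits)  # ['0b11'] 64
```

Nodes 0 and 1 set the two lowest bits of the first word; a node ID of 64 or more would spill into a second word.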

NUMA
+-------------------------------------------------------------+
|                    Multi-Socket Server                      |
+-----------------------------+-------------------------------+
|         Node 0              |           Node 1              |
|  +---------+  +---------+   |   +---------+  +---------+    |
|  |  CPU 0  |  |  CPU 1  |   |   |  CPU 2  |  |  CPU 3  |    |
|  +----+----+  +----+----+   |   +----+----+  +----+----+    |
|       |            |        |        |            |         |
|       +------+-----+        |        +------+-----+         |
|              |              |               |               |
|       +------+------+       |        +------+------+        |
|       |  Memory 0   |<------+------->|  Memory 1   |        |
|       |   (local)   | slow  |        |   (local)   |        |
|       +-------------+       |        +-------------+        |
+-----------------------------+-------------------------------+

Modes

Mode                      Behavior
MPOL_DEFAULT              System default: prefer memory local to the allocating CPU
MPOL_LOCAL                Allocate on the node local to the CPU that triggers the allocation
MPOL_PREFERRED            Prefer the specified node; use other nodes if it is unavailable
MPOL_BIND                 Use only the specified nodes
MPOL_INTERLEAVE           Alternate allocations between the specified nodes
MPOL_PREFERRED_MANY       Prefer a set of nodes; use other nodes if they are unavailable
MPOL_WEIGHTED_INTERLEAVE  Interleave across the specified nodes in proportion to per-node weights

MPOL_BIND
{
  "memoryPolicy": {
    "mode": "MPOL_BIND",
    "nodes": "0,1"
  }
}
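
The `nodes` value is a string, so the runtime has to parse it before it can build a node mask. A rough sketch, assuming the string uses a cpuset-style list syntax (comma-separated IDs and inclusive ranges such as `0-3,7`); `parse_node_list` is a hypothetical helper, not youki's actual function:

```python
def _to_id(text):
    """Convert one decimal node ID, rejecting signs, blanks and other junk."""
    text = text.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"invalid node ID: {text!r}")
    return int(text)


def parse_node_list(spec):
    """Parse a list such as "0,1" or "0-3,7" into a sorted list of node IDs."""
    if not spec.strip():
        return []  # empty list, e.g. for MPOL_DEFAULT / MPOL_LOCAL
    nodes = set()
    for part in spec.split(","):
        low, dash, high = part.partition("-")
        start = _to_id(low)
        end = _to_id(high) if dash else start
        if start > end:
            raise ValueError(f"descending range: {part.strip()!r}")
        nodes.update(range(start, end + 1))
    return sorted(nodes)


print(parse_node_list("0,1"))    # [0, 1]
print(parse_node_list("0-3,7"))  # [0, 1, 2, 3, 7]
```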

Memory allocation behavior:

  • Node 0: Allowed
  • Node 1: Allowed
  • Node 2: Not allowed → once Nodes 0 and 1 are exhausted, allocation triggers OOM instead of spilling over to Node 2

MPOL_INTERLEAVE
{
  "memoryPolicy": {
    "mode": "MPOL_INTERLEAVE",
    "nodes": "0,1"
  }
}

Memory allocation behavior:

  • Page 1 → Node 0
  • Page 2 → Node 1
  • Page 3 → Node 0
  • Page 4 → Node 1
  • ... (alternating allocation)

Use case: spreading a large dataset across several nodes so that the memory bandwidth of all of them is used
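
The alternating pattern above can be written as a small toy model. This only illustrates the round-robin idea under the assumption that every page goes strictly to the next node in the list; it is not how the kernel tracks interleaving internally, and `interleave` is a name invented for this sketch:

```python
from collections import Counter
from itertools import cycle


def interleave(nodes, n_pages):
    """Toy model: assign pages 1..n_pages to the given nodes round-robin."""
    if not nodes:
        raise ValueError("MPOL_INTERLEAVE needs at least one node")
    # zip() stops when the page range is exhausted; cycle() repeats the node list.
    return list(zip(range(1, n_pages + 1), cycle(nodes)))


placement = interleave([0, 1], 4)
print(placement)  # [(1, 0), (2, 1), (3, 0), (4, 1)]
print(Counter(node for _, node in placement))  # Counter({0: 2, 1: 2})
```

Each node ends up with the same number of pages, which is what lets both memory controllers serve the dataset in parallel.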

MPOL_PREFERRED
{
  "memoryPolicy": {
    "mode": "MPOL_PREFERRED",
    "nodes": "0"
  }
}

Memory allocation behavior:

  • First attempts allocation on Node 0
  • If Node 0 is full, falls back to another node (here Node 1)
    • Unlike BIND, exhausting the preferred node does not result in an error
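
To make the difference from BIND concrete, here is a toy allocator. It is a deliberately simplified model: free memory is a per-node page counter, fallback picks the lowest-numbered node with free pages (the kernel actually uses node distance), and OOM is modelled as an exception. All names are invented for this sketch:

```python
class OutOfMemory(Exception):
    """Stands in for the OOM condition in this toy model."""


def allocate_page(mode, nodes, free_pages):
    """Take one page from free_pages (dict: node -> free page count) and return the node used."""
    # Both modes try the nodes named in the policy first.
    for node in nodes:
        if free_pages.get(node, 0) > 0:
            free_pages[node] -= 1
            return node
    # Only PREFERRED may fall back to nodes outside the policy.
    if mode == "MPOL_PREFERRED":
        for node in sorted(free_pages):
            if free_pages[node] > 0:
                free_pages[node] -= 1
                return node
    raise OutOfMemory(f"{mode}: no free page on an allowed node")


print(allocate_page("MPOL_PREFERRED", [0], {0: 0, 1: 2}))  # 1
try:
    allocate_page("MPOL_BIND", [0], {0: 0, 1: 2})
except OutOfMemory as err:
    print("OOM:", err)
```

With Node 0 full, PREFERRED quietly lands on Node 1, while BIND fails even though Node 1 still has free pages.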

Flags

Flag                   Effect
MPOL_F_STATIC_NODES    Keeps the node numbers as given, even when the cpuset's allowed nodes change
MPOL_F_RELATIVE_NODES  Interprets node numbers as positions relative to the nodes allowed by the cpuset
MPOL_F_NUMA_BALANCING  Applicable only to BIND. The kernel monitors access patterns and automatically migrates pages
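
At the syscall level the flags are not a separate argument: they are ORed into the `mode` argument of set_mempolicy(2). A small sketch of that encoding; the numeric values are the ones from the kernel's `linux/mempolicy.h` uapi header as far as I know, so check them against your own headers, and `encode_mode` is a made-up helper:

```python
# Mode numbers (enum order in linux/mempolicy.h).
MODES = {
    "MPOL_DEFAULT": 0,
    "MPOL_PREFERRED": 1,
    "MPOL_BIND": 2,
    "MPOL_INTERLEAVE": 3,
    "MPOL_LOCAL": 4,
    "MPOL_PREFERRED_MANY": 5,
    "MPOL_WEIGHTED_INTERLEAVE": 6,
}

# Mode flags live in the high bits of the same integer.
FLAGS = {
    "MPOL_F_STATIC_NODES": 1 << 15,
    "MPOL_F_RELATIVE_NODES": 1 << 14,
    "MPOL_F_NUMA_BALANCING": 1 << 13,
}


def encode_mode(mode, flags=()):
    """Combine a mode name and flag names into the integer passed as `mode`."""
    value = MODES[mode]
    for flag in flags:
        value |= FLAGS[flag]
    return value


print(hex(encode_mode("MPOL_BIND", ["MPOL_F_STATIC_NODES"])))  # 0x8002
```

Unknown mode or flag names raise `KeyError` here; whether a given combination is legal is a separate question, covered by the validation rules.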

Validation

Validation is quite challenging :P

  • MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND
  • MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES cannot be used simultaneously
  • DEFAULT: nodes must be empty, flags must be empty
  • LOCAL: nodes must be empty, flags must be empty
  • PREFERRED + empty nodes: STATIC/RELATIVE flags are prohibited
  • BIND: nodes must have one or more entries
  • INTERLEAVE: nodes must have one or more entries
  • PREFERRED_MANY: nodes must have one or more entries
  • WEIGHTED_INTERLEAVE: nodes must have one or more entries
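
The rules above translate fairly directly into a checker. The following is a sketch of that translation only, not the code from either PR; it assumes `nodes` has already been parsed into a list of IDs and `flags` is a list of flag names, and `validate_memory_policy` is a hypothetical name:

```python
NODE_FLAGS = {"MPOL_F_STATIC_NODES", "MPOL_F_RELATIVE_NODES"}
NO_NODES_NO_FLAGS = {"MPOL_DEFAULT", "MPOL_LOCAL"}
NEEDS_NODES = {
    "MPOL_BIND",
    "MPOL_INTERLEAVE",
    "MPOL_PREFERRED_MANY",
    "MPOL_WEIGHTED_INTERLEAVE",
}
KNOWN_MODES = NO_NODES_NO_FLAGS | NEEDS_NODES | {"MPOL_PREFERRED"}


def validate_memory_policy(mode, nodes, flags):
    """Return a list of error messages; an empty list means the policy passed."""
    if mode not in KNOWN_MODES:
        return [f"unknown mode {mode}"]
    flags = set(flags)
    errors = []
    if "MPOL_F_NUMA_BALANCING" in flags and mode != "MPOL_BIND":
        errors.append("MPOL_F_NUMA_BALANCING requires MPOL_BIND")
    if NODE_FLAGS <= flags:
        errors.append("STATIC_NODES and RELATIVE_NODES are mutually exclusive")
    if mode in NO_NODES_NO_FLAGS:
        if nodes:
            errors.append(f"{mode}: nodes must be empty")
        if flags:
            errors.append(f"{mode}: flags must be empty")
    if mode == "MPOL_PREFERRED" and not nodes and flags & NODE_FLAGS:
        errors.append("MPOL_PREFERRED with empty nodes: STATIC/RELATIVE not allowed")
    if mode in NEEDS_NODES and not nodes:
        errors.append(f"{mode}: nodes must have one or more entries")
    return errors


print(validate_memory_policy("MPOL_BIND", [0, 1], []))  # []
print(validate_memory_policy("MPOL_INTERLEAVE", [], []))
```

Collecting every violation instead of stopping at the first one is a design choice of this sketch; it makes a rejected config easier to fix in one pass.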