NUMA emulation (emulating Non-Uniform Memory Access)

NUMA emulation is a software technique that makes the operating system treat a single physical server as several independent nodes, each with its own memory. A processor in one node is presented as having fast access to its own memory and slower access to memory belonging to other nodes, even though the physical latency is the same across all nodes. This allows application behavior to be tested without expensive NUMA hardware.

This technology is in demand when developing and debugging large enterprise systems. Engineers test how databases and hypervisors distribute threads and memory pages before deploying on real multiprocessor servers. Emulation helps to detect thread-to-core affinity errors and incorrect data placement during continuous integration. The method is also used in educational environments to demonstrate the impact of memory topology on performance without buying specialized hardware.

Typical problems

The main challenge is false computational slowdown. Emulated node boundaries add placement and migration overhead that has no counterpart in the underlying hardware, and this overhead can mask real synchronization costs. At the same time, the operating system scheduler may isolate tasks too aggressively, causing unnecessary page migration and a drop in throughput. Incorrectly set remote access weights can also trigger false monitoring alarms, where a test configuration is mistaken for a critical hardware failure in a production environment.

How it works

The mechanism is based on modifying the ACPI tables handed to the operating system kernel during boot. Instead of describing a single physical domain with uniform memory, the firmware or hypervisor generates an SRAT (System Resource Affinity Table) in which each processor or group of cores is declared to belong to a different proximity domain. A separate range of physical addresses is assigned to each domain, even though all memory physically sits behind a single memory controller and has identical latency.
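As a rough illustration of this carving, the Python sketch below splits one physical address range into per-domain ranges and maps CPUs to domains. It models the idea only and does not produce a real SRAT; the names MemoryAffinity, split_into_domains, and cpu_domains, as well as the 2 MiB alignment, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

MIB = 1024 ** 2


@dataclass(frozen=True)
class MemoryAffinity:
    """One SRAT-like memory entry: an address range tied to a proximity domain."""
    proximity_domain: int
    base: int
    length: int


def split_into_domains(base: int, size: int, nodes: int,
                       align: int = 2 * MIB) -> List[MemoryAffinity]:
    """Split [base, base + size) into `nodes` contiguous ranges, one per domain."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    # Round each chunk down to the alignment (hypothetical 2 MiB default).
    chunk = (size // nodes) // align * align
    if chunk == 0:
        raise ValueError("memory range too small for the requested node count")
    entries = []
    for domain in range(nodes):
        start = base + domain * chunk
        # The last domain absorbs whatever remains after rounding.
        length = chunk if domain < nodes - 1 else size - domain * chunk
        entries.append(MemoryAffinity(domain, start, length))
    return entries


def cpu_domains(cpus: int, nodes: int) -> Dict[int, int]:
    """Assign consecutive CPUs to domains in equally sized groups."""
    per_node = -(-cpus // nodes)  # ceiling division
    return {cpu: cpu // per_node for cpu in range(cpus)}
```

For instance, split_into_domains(0, 8 * 1024**3, 4) describes an 8 GiB range as four 2 GiB domains, and cpu_domains(8, 4) places two CPUs in each of them.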

Given such a table, the Linux or Windows kernel activates its internal NUMA balancing subsystems. Automatic thread-to-node affinity mechanisms and page placement policies are enabled. The key component of emulation is adjusting the SLIT (System Locality Information Table) values, which specify the relative access distance for each pair of nodes. By setting a value of 20 or 30 for a remote node instead of 10 for a local one, the administrator makes the scheduler believe that accessing foreign memory takes two or three times as long.
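The distance convention can be sketched in Python as follows. The sketch builds a SLIT-style matrix with 10 on the diagonal and a uniform remote distance elsewhere, then derives the relative cost the kernel would infer. The function names build_slit and relative_cost are hypothetical and exist only for this example.

```python
from typing import List

LOCAL_DISTANCE = 10  # SLIT convention: distance from a node to itself


def build_slit(nodes: int, remote_distance: int = 20) -> List[List[int]]:
    """Return a nodes x nodes distance matrix with a uniform remote distance."""
    if remote_distance <= LOCAL_DISTANCE:
        raise ValueError("remote distance must exceed the local distance")
    return [
        [LOCAL_DISTANCE if src == dst else remote_distance
         for dst in range(nodes)]
        for src in range(nodes)
    ]


def relative_cost(slit: List[List[int]], src: int, dst: int) -> float:
    """Cost of reaching dst from src, relative to a local access from src."""
    return slit[src][dst] / slit[src][src]
```

With build_slit(2, 30), relative_cost(slit, 0, 1) evaluates to 3.0, meaning the kernel is told that remote memory is three times as far away as local memory, regardless of the actual latency.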

After boot, the operating system behaves as if the server really consisted of several physical platforms. The process scheduler minimizes thread migration between nodes, and the page allocator by default allocates memory from the local bank. If an application allocates a buffer on one processor but actively works with it on another, performance counters record remote access events. This allows developers to track suboptimal scenarios: unbalanced data distribution, unnecessary cache line contention, and excessive structure migration in multithreaded programs. NUMA emulation thus reproduces topological inequality among nodes on fully symmetric hardware, so that placement behavior can be profiled before deployment on real multiprocessor systems.
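The "allocate on one processor, work on another" scenario can be modeled with a toy first-touch simulation in Python. This is a simplified sketch, not kernel behavior: it assumes a page is placed on the node of the CPU that touches it first and never migrates, and the function name first_touch_placement is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def first_touch_placement(
    touches: Iterable[Tuple[int, int]],
    cpu_to_node: Dict[int, int],
) -> Tuple[Dict[int, int], Counter]:
    """Replay (cpu, page) touches in order and count local vs remote accesses.

    Returns the page -> node placement and a Counter with the keys
    "local" and "remote".
    """
    page_node: Dict[int, int] = {}
    stats: Counter = Counter()
    for cpu, page in touches:
        node = cpu_to_node[cpu]
        # The first CPU to touch a page decides where the page lives.
        home = page_node.setdefault(page, node)
        stats["local" if home == node else "remote"] += 1
    return page_node, stats


# CPU 0 (node 0) initializes four pages; CPU 2 (node 1) then reads them
# three times each, so every one of those reads is a remote access.
cpu_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
init = [(0, page) for page in range(4)]
work = [(2, page) for page in range(4)] * 3
placement, stats = first_touch_placement(init + work, cpu_to_node)
```

In this run all four pages land on node 0, and the counter shows 4 local and 12 remote accesses, which is the pattern the performance counters would flag.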

NUMA functionality

  1. Configuring virtual node parameters. The hypervisor allows setting the maximum number of virtual processors that can simultaneously belong to one virtual NUMA node. This value limits the node width, preventing excessive logical domain sprawl, and should match the physical hardware topology to maximize memory bandwidth.
  2. Memory limiting per virtual node. The administrator can specify the maximum amount of RAM in megabytes allocated to a single virtual NUMA node. Setting this limit ensures that virtual machine memory consumption does not lead to uncontrolled crossing of hardware node boundaries, preserving data locality.
  3. Controlling the number of nodes per socket. The emulation functionality provides a parameter that determines the maximum number of virtual NUMA nodes allowed within one physical socket. Tuning this parameter allows manual fragmentation of the processor socket representation, adapting the virtual topology to the requirements of a specific application.
  4. NUMA spanning mode. The NUMA Spanning option allows a virtual machine to allocate memory from multiple physical NUMA nodes if the resources of a single node are insufficient. When spanning is enabled, an individual virtual node can use both local and remote memory. Disabling the option forces each virtual node to be pinned to memory from exactly one physical domain.
  5. Virtual topology initialization. On first boot of a virtual machine that has vNUMA enabled, the hypervisor creates a virtual topology based on the host’s physical NUMA structure. The initialized topology is fixed and does not change during migrations or reboots unless the virtual machine’s virtual CPU count configuration is changed.
  6. Ignoring socket and core configuration. Settings for the number of virtual sockets and cores per socket do not affect the virtual NUMA topology. The optimal structure is determined automatically based on the server’s physical architecture. The corespersocket parameter is used exclusively to present processors to the guest OS for license compliance purposes.
  7. Automatic activation threshold. By default, virtual NUMA topology is automatically activated only for virtual machines with more than eight virtual CPUs. This threshold is set because for machines with a small number of cores, the scheduler can typically place all vCPUs within a single physical node without needing to expose the topology.
  8. Forced enablement for small VMs. It is possible to activate vNUMA for virtual machines with eight or fewer processors via the advanced numa.vcpu.min parameter. This parameter sets the minimum number of vCPUs at which the guest OS receives NUMA topology information, so that it can optimize thread placement.
  9. Topology reset on hot CPU add. Enabling CPU Hot Add deactivates virtual NUMA for a given machine. This results in the guest OS seeing a single flat NUMA node regardless of the real physical architecture. Deactivation occurs because of the inability to dynamically change ACPI tables without rebooting the guest system.
  10. Configuring abstract distance. In OS-level NUMA emulation mechanisms (e.g., numa=fake), the abstract distance parameter numa_emulation.adistance is available. By specifying different distance values for fake nodes, you can manually control their placement across memory tiers, creating a topology with varying access latency.
  11. Strict memory binding. To achieve deterministic performance, strict memory binding mode is used. In this configuration, all memory pages of the virtual machine are allocated strictly from a specified set of physical NUMA nodes. If free memory runs out on the target node, the allocation operation fails, excluding hidden fallback to remote memory.
  12. Virtual processor pinning. Emulation is effective only when vCPUs are pinned to physical cores of the target node. For this, a cputune section is introduced in the configuration, where each virtual thread is assigned a dedicated physical core. Strict pinning prevents the scheduler from migrating vCPUs to other nodes and avoids the processor cache invalidation that such migration causes.
  13. Emulator thread isolation. In addition to vCPUs, the threads of the QEMU device emulator itself must be pinned using the emulatorpin directive. Without such affinity, I/O and timer helper threads may land on cores reserved for guest computations. The resulting contention preempts guest vCPUs and causes unpredictable network latency.
  14. Static vCPU allocation. When using strict binding, vCPU placement must be switched to static mode. This forces the hypervisor to reserve physical cores at virtual machine start time, preventing the creation of shared pools. Only in this configuration is a one-to-one mapping between guest and host processors guaranteed.
  15. Reserving housekeeping cores. When planning NUMA emulation, housekeeping cores (typically CPU 0-3) must be reserved for host system needs and interrupt handling. These cores must not be included in the isolated CPU pool given to guest vCPUs, because doing so leads to scheduling conflicts and performance degradation.
  16. Verifying memory placement. After the virtual machine starts, actual memory consumption per node must be checked using the numastat utility. This tool shows the distribution of QEMU process memory across NUMA domains, allowing you to confirm that no allocations have leaked to remote nodes because of configuration errors.
  17. Handling out-of-memory errors. When configuring strict memory placement rules, the administrator must guarantee sufficient free RAM on each target node. If a node lacks free memory, starting a machine with strict binding either fails outright or leads the host OOM killer to terminate the QEMU process.
  18. Topology symmetry on Hyper-V. When emulating NUMA on Hyper-V, only symmetric virtual topologies can be created. It is impossible to assign unequal amounts of memory or processor cores among different virtual nodes of the same machine. The hypervisor does not support asymmetric resource distribution within vNUMA.
  19. Compatibility with Dynamic Memory. On Hyper-V, vNUMA and Dynamic Memory cannot be used at the same time. Virtual NUMA topology configuration requires static reservation of memory at startup. A virtual machine with Dynamic Memory enabled is presented with a single virtual NUMA node regardless of its vNUMA settings.
  20. Requesting topology via metadata. In cloud environments such as OpenStack, a request to enable vNUMA and specify the number of nodes is passed via the flavor extra spec hw:numa_nodes or the corresponding image property hw_numa_nodes. The hypervisor driver parses this metadata and, subject to hardware constraints, constructs the corresponding virtual topology before launching the instance.
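The pinning-related items in the list above (strict memory binding, vCPU pinning, emulator thread isolation, static placement, and housekeeping cores) can be tied together in a short Python sketch that generates a libvirt-style XML fragment. The element names vcpu, cputune, vcpupin, emulatorpin, and numatune follow the libvirt domain XML format; the function build_pinning, its validation rules, and the core numbers used in the usage note are assumptions for illustration, not a recommended layout.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def cpuset(ids: Iterable[int]) -> str:
    """Format CPU or node ids as a comma-separated cpuset string."""
    return ",".join(str(i) for i in sorted(ids))


def build_pinning(vcpu_to_pcpu: Dict[int, int],
                  emulator_cpus: Iterable[int],
                  mem_nodes: Iterable[int],
                  housekeeping: Iterable[int]) -> ET.Element:
    """Build a <domain> fragment with static vCPUs, pinning, and strict memory."""
    guest_cpus = set(vcpu_to_pcpu.values())
    emulator_cpus = set(emulator_cpus)
    if len(guest_cpus) != len(vcpu_to_pcpu):
        raise ValueError("each vCPU needs its own dedicated physical core")
    if guest_cpus & set(housekeeping):
        raise ValueError("guest vCPUs must not use housekeeping cores")
    if guest_cpus & emulator_cpus:
        raise ValueError("emulator threads must not share guest cores")

    domain = ET.Element("domain")
    vcpu = ET.SubElement(domain, "vcpu", placement="static")
    vcpu.text = str(len(vcpu_to_pcpu))

    cputune = ET.SubElement(domain, "cputune")
    for guest, host in sorted(vcpu_to_pcpu.items()):
        ET.SubElement(cputune, "vcpupin", vcpu=str(guest), cpuset=str(host))
    ET.SubElement(cputune, "emulatorpin", cpuset=cpuset(emulator_cpus))

    numatune = ET.SubElement(domain, "numatune")
    # mode="strict": allocation fails instead of falling back to other nodes.
    ET.SubElement(numatune, "memory", mode="strict", nodeset=cpuset(mem_nodes))
    return domain
```

Calling ET.tostring(build_pinning({0: 4, 1: 5}, [0, 1], [0], range(4)), encoding="unicode") yields a fragment that pins two vCPUs to cores 4 and 5, keeps emulator threads on cores 0 and 1, and binds memory strictly to node 0. The fragment would still need to be merged into a complete domain definition.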

Comparisons

  • NUMA emulation vs CPU Pinning. Hypervisor NUMA emulation creates a virtual topology that hides the real distance to memory, while pinning rigidly fixes vCPUs to specific cores without faking latencies. The first method is flexible and allows VM migration while preserving internal geometry but adds a layer of indirection; the second provides maximum determinism at the cost of a fully static configuration and no memory ballooning at the guest level.
  • NUMA emulation vs Virtual NUMA. Although the terms often overlap, vNUMA is a pass-through of host physical topology to the guest, whereas NUMA emulation can artificially equalize or distort that topology, for example presenting two real nodes as four. Emulation is indispensable when the number of host nodes and VM configuration do not match, but vNUMA provides the best cache transparency and minimizes double LLC misses when boundaries align correctly.
  • NUMA emulation vs Memory Interleaving. Interleaving mode turns all memory into a UMA pool, removing the concept of proximity, while emulation deliberately constructs a node topology, even if synthetic. Interleaving wins on legacy workloads unaware of NUMA by spreading remote access penalties evenly across all accesses, but for modern applications with data locality, emulation is preferable because it allows the guest to make informed decisions about page placement.
  • NUMA emulation vs Sub-NUMA Clustering. SNC splits a real socket into logical LLC domains to reduce intra-chip latency, while NUMA emulation defines boundaries at the level of whole virtual sockets. Their interaction is complex: enabling SNC on a host with emulated topology may unexpectedly expose four nodes to the guest instead of two, requiring manual alignment of the emulated mask to prevent spurious guest OS thread migrations.
  • NUMA emulation vs Auto-NUMA Balancing. Emulation provides a static view of the world, whereas Auto-NUMA Balancing in Linux dynamically moves pages and tasks following access patterns. When using emulation, the guest balancer blindly trusts the faked distances, which is dangerous under overcommit: auto-balancing may create excessive migration activity inside the VM, not understanding that heavily overcommitted physical silicon lies behind the virtual nodes.
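The contrast with memory interleaving drawn in the comparisons above can be made concrete with a small Python model. It compares round-robin page placement with local placement and reports the share of pages that are remote to a given node. This is a toy model with hypothetical function names; it ignores access frequency, caching, and migration.

```python
from typing import List


def interleave_placement(pages: int, nodes: int) -> List[int]:
    """Place pages round-robin across nodes, as an interleave policy would."""
    return [page % nodes for page in range(pages)]


def local_placement(pages: int, node: int) -> List[int]:
    """Place every page on a single node, as a local policy would."""
    return [node] * pages


def remote_fraction(placement: List[int], accessing_node: int) -> float:
    """Share of pages that live on a node other than the accessing one."""
    if not placement:
        return 0.0
    remote = sum(1 for node in placement if node != accessing_node)
    return remote / len(placement)
```

With four nodes, remote_fraction(interleave_placement(8, 4), 0) is 0.75 from every node's point of view, while remote_fraction(local_placement(8, 0), 0) is 0.0 for the owning node and 1.0 for any other node. Interleaving makes the cost uniform; a topology lets a NUMA-aware guest aim for the 0.0 case.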

OS and driver support

In Linux, NUMA emulation for non-x86 architectures such as arm64 is implemented as a separate module, numa_emulation.c, in the drivers/base/ subsystem and is configured via CONFIG_NUMA_FAKE. It creates fake NUMA nodes by partitioning the physical memory of one real node into several pseudo-nodes, faking the distance table (SLIT) and the CPU-to-node mappings during early kernel initialization.

Security

Because emulated nodes do not correspond to physical memory controller boundaries, data from different virtual machines or processes may actually reside in the same hardware domain. Isolation methods based on NUMA node affinity are therefore ineffective. To prevent side-channel attacks such as DRAMA on systems with a single real node, strict page interleaving or manual data placement must be used, since hardware channel separation is absent.

Logging

Statistics on emulated nodes are available via standard interfaces such as numastat, perf, and the CPU to DRAM Requests to Target Node event, which show counters for local and remote memory requests to each fabricated node. However, these counters reflect not the real topology but the layout imposed by the emulator, allowing profiling of application NUMA behavior without corresponding hardware.
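As one way to read such statistics on Linux, the Python sketch below parses the per-node counter file exposed at /sys/devices/system/node/node<N>/numastat and computes the share of allocations made on behalf of tasks running on another node. Note that these particular counters track page allocations rather than individual memory requests. The helper names and the sample numbers are hypothetical.

```python
from typing import Dict


def parse_node_numastat(text: str) -> Dict[str, int]:
    """Parse 'name value' lines from a per-node numastat file."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def remote_share(counters: Dict[str, int]) -> float:
    """Fraction of this node's allocations made for tasks on other nodes."""
    total = counters["local_node"] + counters["other_node"]
    return counters["other_node"] / total if total else 0.0


# Hypothetical sample in the sysfs format; on a real system, read
# /sys/devices/system/node/node0/numastat instead.
SAMPLE = """\
numa_hit 900
numa_miss 100
numa_foreign 0
interleave_hit 0
local_node 750
other_node 250
"""
counters = parse_node_numastat(SAMPLE)
share = remote_share(counters)
```

For the sample above the remote share is 0.25. On an emulated topology the figure describes the fabricated layout, not physical distance.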

Limitations

The key limitation is the inability to reproduce the real difference in latency and bandwidth between local and remote memory: the emulator only enforces page placement policies, but physically all accesses happen at the same speed. In addition, the code duplicates parts of the x86 implementation, so fixes must be ported manually and the two implementations may behave inconsistently.

History and development

Initially, the NUMA emulation subsystem was tightly coupled to the x86 architecture, where it was used to test the scheduler and page migration in the schednuma patches. In 2024, it was extracted into a generic architecture layer to cover ARM servers. This continued the evolution from proprietary aggregator hypervisors such as ScaleMP and TidalScale to a native in-kernel solution for containers and virtualization.