What is CXL Memory (PCIe-attached memory expansion with coherency)

CXL Memory lets additional RAM be attached to a server over a standard high-speed interface. The CPU addresses this attached memory much as it does its own, so capacity can be added flexibly without replacing the main modules, at the cost of higher latency.

This technology is in high demand in data centers and cloud infrastructures, where large RAM capacity is critical for virtualization, in-memory databases, and complex analytics. CXL Memory enables pooling memory across servers, allocating it dynamically to specific workloads, and reducing total cost of ownership by removing the need to over-provision expensive DRAM in every node.

In practice, the main limitation is significantly higher latency than local memory, which makes the technology unsuitable for latency-sensitive computations. There are also challenges in ensuring fault tolerance within shared pools and in managing heat dissipation on intermediate switches. Furthermore, pooling under CXL 2.0 and later requires hardware support at the switch level, and the software ecosystem for memory orchestration is not yet fully mature.

How CXL Memory works

CXL runs over the PCIe 5.0 and 6.0 physical layer and adds three specialized logical protocols: CXL.io for device discovery, configuration, and initialization; CXL.cache, which lets a device coherently cache host memory; and the key protocol, CXL.mem. CXL.mem allows the host to access device memory directly through the physical address space using a home agent mechanism. The memory controller in the CPU or chipset intercepts requests addressed to the CXL device range and packs them into memory protocol packets. The Flex Bus physical layer then transmits them over standard PCIe lanes to the CXL controller on the target device, which converts the transactions into commands for the installed DRAM modules. Coherency is maintained by tracking cache line states in the home agent, which resolves conflicts between requests from different cores and external accelerators, so the CPU always gets the most recent data, as it would with local memory. Additional management logic in switches allows memory from many devices to be combined into a single pool and logical partitions to be reallocated among servers without rebooting.
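The address-range interception described above can be illustrated with a toy decoder. This is a minimal sketch, not how any real CPU implements it: the window names, base addresses, and sizes are hypothetical, and real platforms perform this decode in hardware.

```python
from dataclasses import dataclass

CACHE_LINE = 64   # bytes; memory requests are modelled at cache-line granularity
GIB = 1 << 30


@dataclass(frozen=True)
class Window:
    name: str   # hypothetical label for the target behind this address range
    base: int   # first host physical address covered by the window
    size: int   # window length in bytes

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Hypothetical layout: 64 GiB of local DRAM followed by a 128 GiB CXL window.
WINDOWS = [
    Window("local-dram", 0, 64 * GIB),
    Window("cxl-mem0", 64 * GIB, 128 * GIB),
]


def route(addr: int) -> tuple[str, int]:
    """Return (target name, cache-line-aligned offset inside that target)."""
    for window in WINDOWS:
        if window.contains(addr):
            offset = addr - window.base
            return window.name, offset - (offset % CACHE_LINE)
    raise ValueError(f"address {addr:#x} is not backed by any window")
```

A request that falls in the second window would be forwarded to the CXL device with the device-relative offset, while everything in the first window stays on the local memory controller.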

CXL Memory functionality

CXL.mem subprotocol and its semantics. The core of CXL Memory is the CXL.mem protocol, which runs over the standard PCIe 5.0/6.0 physical layer. This subprotocol gives the host coherent, byte-addressable access to device memory. Unlike traditional block devices or network disks, CXL.mem lets the CPU issue load/store instructions directly against the address space of the attached module, bypassing software I/O stacks.
CPU-less programming model on the device. CXL Memory devices function as headless nodes (NUMA nodes without their own compute cores). This is a fundamental difference from CXL.cache accelerators. The platform discovers such a device as an asymmetric node, where the memory controller on the expansion board processes requests to DRAM or non-volatile memory without requiring an OS to run on the device itself.
Bandwidth and capacity expansion. A key function is adding bandwidth by connecting additional DRAM channels in parallel through the CXL interface. This overcomes the limit on DIMM slots per CPU socket. The system aggregates the bandwidth of host local memory and attached CXL memory, so aggregate bandwidth can, ideally, grow close to linearly with the number of expansion modules.
Heterogeneous memory interleaving. Platform hardware and BIOS can implement heterogeneous interleave. This function combines host local DRAM and CXL memory into a single logical address space within a NUMA node. Cache-line or page-level interleaving allows hot host memory and cold expansion memory to be used as a single pool, hiding latency differences from non-critical workloads.
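The address arithmetic behind simple round-robin interleaving can be sketched as follows. This is an illustration only: the 256-byte granularity and the assignment of target 0 to a local DDR channel and target 1 to a CXL device are assumptions, and real platforms configure interleave in firmware and hardware decoders.

```python
def interleave_target(addr: int, ways: int, granularity: int = 256) -> tuple[int, int]:
    """Map a host address to (target index, offset within that target).

    Consecutive `granularity`-sized chunks rotate across `ways` targets.
    """
    chunk = addr // granularity                 # which chunk the address falls in
    target = chunk % ways                       # round-robin target selection
    local_offset = (chunk // ways) * granularity + addr % granularity
    return target, local_offset
```

With two ways, address 0 lands on target 0, address 256 on target 1, and address 512 returns to target 0 at offset 256, so a sequential scan alternates between the two memories.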
Memory tiering. The Linux kernel can use CXL memory as a higher-latency lower tier in the memory hierarchy. A page placement mechanism (e.g., TPP, Transparent Page Placement) uses access statistics (hot/cold page tracking) to move infrequently used cold pages from host DRAM to CXL memory. This frees expensive fast memory for latency-sensitive tasks without forcing writes to swap partitions or block devices.
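The cold-page demotion idea can be sketched with a toy policy. This is not the algorithm used by TPP or the kernel; the function name and the "demote the least-accessed pages until DRAM fits" rule are assumptions made purely for illustration.

```python
def plan_demotions(access_counts: dict[int, int], dram_capacity_pages: int) -> list[int]:
    """Pick pages to move from DRAM to the CXL tier, coldest first.

    access_counts maps page ID -> accesses seen in the last sampling window.
    """
    excess = len(access_counts) - dram_capacity_pages
    if excess <= 0:
        return []  # everything fits in the fast tier; nothing to demote
    # Sort by access count, breaking ties by page ID for a deterministic result.
    coldest_first = sorted(access_counts, key=lambda page: (access_counts[page], page))
    return coldest_first[:excess]
```

For example, with four resident pages and room for only two in DRAM, the two pages with the fewest recorded accesses are selected for migration to the slower tier.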
Device-side access profiling. Advanced architectures like NeoMem use hardware monitoring blocks in the CXL controller (NeoProf). This feature collects hot page statistics directly on the memory device side without CPU load. Offloading profiling reduces migration overhead and improves accuracy in identifying data working sets suitable for moving between tiers.
Persistent memory support (CXL-PMEM). The CXL.mem protocol specifies byte-addressable access to non-volatile memory. CXL-PMEM devices function as a direct extension of non-volatile space, allowing applications to use flush instructions and write to persistent regions via direct mapping (DAX) without a page cache. This provides data persistence across power failures with latency an order of magnitude lower than traditional NVMe SSDs.
Dynamic Capacity subsystem (DCD). The CXL 3.x specification introduces dynamic capacity, allowing memory to be added or removed without node reboot. The Fabric Manager API manages extents – sets of physical address space blocks on the device. The host accepts or releases extents in response to manager events, creating or removing corresponding DAX devices on the fly.
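The accept/release bookkeeping for extents can be modelled with a small sketch. The class and method names here are hypothetical and do not correspond to the Fabric Manager API or to any kernel interface; the model only shows that capacity is the sum of the extents the host currently holds.

```python
from dataclasses import dataclass, field

MIB = 1 << 20


@dataclass(frozen=True)
class Extent:
    start: int    # offset of the block in the device address space
    length: int   # size of the block in bytes

    @property
    def end(self) -> int:
        return self.start + self.length


@dataclass
class DynamicCapacityRegion:
    extents: set = field(default_factory=set)

    def accept(self, ext: Extent) -> None:
        """Take ownership of an offered extent, rejecting overlaps."""
        for held in self.extents:
            if ext.start < held.end and held.start < ext.end:
                raise ValueError("extent overlaps one already accepted")
        self.extents.add(ext)

    def release(self, ext: Extent) -> None:
        """Give an extent back; raises KeyError if it was never accepted."""
        self.extents.remove(ext)

    @property
    def capacity(self) -> int:
        return sum(e.length for e in self.extents)
```

A region in this model starts with zero capacity and grows or shrinks only as extents are accepted or released.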
Shared memory mode. The CXL 3.1 architecture implements the Global Integrated Memory (GIM) concept for direct host-host and host-device interaction. In a switch topology with Port-Based Routing (PBR), multiple hosts can have coherent access to the same memory region on a CXL device. This is critical for fault-tolerant clusters and distributed data locking systems.
Direct peer-to-peer transactions (P2P CXL.mem). The CXL 3.1 specification allows direct data transfer between CXL.mem devices via a PBR-capable switch. A CXL-enabled accelerator or network card can directly read from or write to expansion memory, bypassing host system memory and CPU. This reduces load on CPU memory controllers and eliminates unnecessary DMA copy operations in I/O buffers.
Advanced error handling and RAS. The CXL Memory RAS (Reliability, Availability, Serviceability) subsystem provides error detection and correction with extended metadata up to 32 bits per cache line. This includes failure signaling via AER (Advanced Error Reporting) and support for MEFN (Memory Error Firmware Notification). The kernel handler isolates faulty pages or disables the device without panicking the entire system if the recovery policy allows.
Configuration via CXL Fixed Memory Window Structures (CFMWS). Platform firmware describes the system physical address windows allocated for CXL in CFMWS entries of the ACPI CEDT table. Each CFMWS entry maps a host address region to one or more CXL host bridges and informs the NUMA topology. A single device can be mapped into multiple independent windows with different attributes (volatile/persistent), creating separate NUMA nodes for different memory classes on one physical module.
Multilevel switch routing. CXL switches use Port-Based Routing (PBR) for CXL.mem transactions in disaggregated pools. Unlike routing that follows the device hierarchy by address, PBR directs packets by port identifier, as defined in Fabric Manager tables. This allows many-to-many multiport configurations in which one host can reach any memory module in a rack through multiple cascaded switches.
Sparse DAX regions abstraction. To support dynamic capacity, Linux has introduced sparse DAX regions. Unlike static areas, such a region can have zero initial size and expand as extents are added. This function allows a resource orchestrator to allocate virtual address space for future memory expansion without reserving physical pages until capacity is actually connected.
Hot surprise removal fault tolerance. The CXL software model anticipates sudden memory device failure (surprise removal). When a CXL.mem device is disconnected without prior OS notification, the RAS subsystem emulates hot removal of the NUMA node. The driver disables error handling registers and marks pages as unavailable, preventing cascading machine checks on subsequent accesses to the vanished address space.
Quality of Service (QoS) management. CXL Memory hardware controllers provide QoS control for performance isolation in multi-tenant environments. Channel busyness and request queue monitoring mechanisms allow setting minimum and maximum bandwidth for each virtual channel. This ensures background tasks on one host do not violate latency targets for latency-critical applications on another.
Home agent coherency. CXL memory coherency is implemented via a home agent in the CPU responsible for a specific address space. When a CXL.mem device services a request, the home agent tracks cache line states in private caches of all cores. On host write to CXL memory, the agent initiates snoop requests to invalidate stale copies, guaranteeing the architectural coherency model.
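The invalidate-on-write behaviour can be illustrated with a heavily simplified directory model. This is a sketch under strong assumptions: it tracks only which caches hold a line, ignores the actual coherence states and message types, and the class and cache names are hypothetical.

```python
class HomeAgent:
    """Toy directory: records which caches hold a copy of each cache line."""

    def __init__(self) -> None:
        self.sharers: dict[int, set[str]] = {}

    def read(self, line: int, requester: str) -> None:
        # A read adds the requester to the set of caches sharing the line.
        self.sharers.setdefault(line, set()).add(requester)

    def write(self, line: int, requester: str) -> set[str]:
        """Grant exclusive ownership; return the caches that must be snooped."""
        to_invalidate = self.sharers.get(line, set()) - {requester}
        self.sharers[line] = {requester}   # stale copies are dropped
        return to_invalidate
```

If two cores have read the same line and one of them then writes it, the model reports the other core as the snoop target and leaves the writer as the sole holder.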
Large memory page support. The Linux CXL Memory subsystem supports large pages (2 MB transparent huge pages, and 2 MB or 1 GB pages via hugetlbfs). When CXL regions are mapped to userspace via DAX, address translation can use huge-page entries in the TLB. This significantly reduces TLB misses when working with massive datasets, partly compensating for the higher access latency of CXL memory.
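The effect of page size on TLB pressure follows from simple arithmetic, sketched below. The 64 GiB dataset size is an arbitrary example, and the helper name is hypothetical.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def translations_needed(dataset_bytes: int, page_bytes: int) -> int:
    """Number of distinct page translations needed to cover the dataset."""
    return -(-dataset_bytes // page_bytes)   # ceiling division


dataset = 64 * GIB
for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{label:>5} pages: {translations_needed(dataset, size):,} translations")
```

Covering 64 GiB takes 16,777,216 translations with 4 KiB pages, 32,768 with 2 MiB pages, and 64 with 1 GiB pages, which is why huge pages keep far more of a large working set within TLB reach.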
Telemetry and device monitoring. The CXL specification defines telemetry registers accessible via PCIe DVSEC (Designated Vendor-Specific Extended Capability). These registers provide real-time data on module temperature, power consumption, ECC correction counters, and cell wear (for persistent memory). Platform management agents (BMC) read telemetry over a side channel without interfering with main-band CXL.mem traffic.

Comparisons

CXL Memory vs RDMA. CXL provides load/store semantics and hardware-level cache coherency, while RDMA moves data through explicit, software-issued messages. With CXL, existing code needs no refactoring and latency is lower, because the attached memory looks effectively local to the CPU, whereas RDMA requires explicit API calls for every data exchange.
CXL Memory vs NVMe-oF. CXL targets memory expansion with byte addressing and minimal latency, while NVMe-oF operates at the block I/O level for SSDs. CXL creates a single pool of coherent DRAM for CPU workloads, whereas NVMe-oF accelerates access to flash storage over a network while retaining the classic block-based storage model.
CXL Memory vs HBM. HBM provides extreme bandwidth (TB/s) through integration with GPU/CPU but is limited in capacity and costly. CXL Memory, by contrast, trades peak speed for capacity scaling via cheaper DDR memory. CXL acts as an elastic buffer for data that does not fit into expensive, small HBM.
CXL Memory vs Persistent Memory (legacy Optane). Intel Optane combined persistence with near-DRAM performance but was discontinued. CXL Memory revives the concept with a universal interface for different media types, although most current devices rely on volatile DDR memory. The main difference is one of emphasis: CXL focuses on capacity and coherent sharing across hosts, with persistence across power failures as an optional device class rather than the defining feature.
CXL Memory vs GPU Direct Storage (GDS). GDS creates a direct data path between SSD and GPU video memory, bypassing the CPU. CXL, in turn, is intended to let a CXL-capable accelerator coherently access host and expansion DRAM, extending the memory available beyond its on-board capacity. While GDS accelerates loading of large datasets, CXL virtualizes memory itself for heterogeneous computing without data copying.

OS and driver support

Operating systems, particularly Linux, implement CXL memory support through a dedicated kernel subsystem that manages the device lifecycle and binds devices into the system topology. The cxl_pci driver originally covered basic memory expansion, which needs no attach-dependent operations. For accelerators that require coherent access, the cxl_memdev_attach mechanism was introduced: it lets the calling driver supply a routine that runs in the cxl_mem_probe context, only after the port has attached successfully. A CXL link loss is handled like a PCIe link loss event: the driver is unregistered and the device returns to PCIe-only mode, from which it may later reconnect.
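On a running Linux system, memory-only NUMA nodes can be spotted through sysfs. The sketch below assumes the standard `/sys/devices/system/node/nodeN/cpulist` files; the function name is hypothetical, and a CPU-less node is only a hint, since other memory types (for example HBM or persistent memory) can also appear as nodes without CPUs.

```python
import re
from pathlib import Path


def cpuless_nodes(sysfs_root: str = "/sys/devices/system/node") -> list[int]:
    """Return the IDs of NUMA nodes that expose memory but list no CPUs."""
    nodes = []
    for entry in Path(sysfs_root).glob("node[0-9]*"):
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        # An empty cpulist means no CPUs are attached to this node.
        if match and cpulist.is_file() and cpulist.read_text().strip() == "":
            nodes.append(int(match.group(1)))
    return sorted(nodes)


if __name__ == "__main__":
    print("CPU-less NUMA nodes:", cpuless_nodes())
```

The optional `sysfs_root` argument exists so the function can be pointed at a copied or mocked directory tree when no CXL hardware is available.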

Security

CXL memory security is built on hardware link encryption and attestation. Starting with version 2.0, the specification introduced the Integrity and Data Encryption (IDE) mechanism to protect data integrity and confidentiality during transmission over the physical link. Version 3.1 added the TEE Security Protocol (TSP), providing access control and virtual machine isolation in multi-tenant cloud environments. Version 3.2 then strengthened protection for coherent transactions with cache invalidation (HDM-DB) and extended IDE to authentication procedures, using SPDM secured messages and a hardware root of trust for key exchange.

Logging

Logging in the CXL ecosystem is implemented through integration with the Linux kernel RAS subsystem and the rasdaemon daemon. Following the CXL 3.0 and 3.1 specifications, drivers capture structured event records such as DRAM errors or memory module failures, and decode the memory-type-specific fields together with the updated validity flags and component identifiers. The ras-mc-ctl utility then reports detailed information about media health, temperature, and corrected error counters from the SQLite database in which the events are stored, giving administrators structured historical output for predictive failure analysis.

Limitations

Despite its scaling advantages, CXL memory has fundamental physical limitations that make it impractical as a page cache. Each CXL link consumes PCIe lanes, and over PCIe Gen5 each lane has a theoretical rate of only 32 Gbit/s (about 4 GB/s). Because of framing overhead and the low efficiency of transmitting large packets (only about 3 cache lines per frame), actual bandwidth drops to roughly 3 GB/s per lane. In aggregate, this is only a modest gain over native DDR5 controllers, and every access pays an added latency of at least 10 ns, in practice often considerably more. CXL usage is therefore largely limited to low-priority anonymous memory or swapping tasks.
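The bandwidth comparison can be made concrete with a short calculation using the per-lane figures quoted above (32 Gbit/s raw, roughly 3 GB/s effective). The DDR5-4800 channel (4800 MT/s on a 64-bit bus) is an assumed reference point for comparison and does not come from the text.

```python
RAW_GBIT_PER_LANE = 32          # PCIe Gen5 signalling rate per lane, from the text
EFFECTIVE_GBYTE_PER_LANE = 3.0  # approximate usable CXL.mem rate per lane, from the text


def raw_gbytes(lanes: int) -> float:
    """Theoretical link bandwidth in GB/s (8 bits per byte)."""
    return lanes * RAW_GBIT_PER_LANE / 8


def effective_gbytes(lanes: int) -> float:
    """Usable bandwidth in GB/s after protocol overhead."""
    return lanes * EFFECTIVE_GBYTE_PER_LANE


def ddr5_channel_gbytes(mt_per_s: int = 4800, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one DDR5 channel in GB/s (assumed DDR5-4800, 64-bit bus)."""
    return mt_per_s * bus_bytes / 1000


for lanes in (8, 16):
    print(f"x{lanes}: raw {raw_gbytes(lanes):.1f} GB/s, "
          f"effective ~{effective_gbytes(lanes):.1f} GB/s")
print(f"one DDR5-4800 channel: {ddr5_channel_gbytes():.1f} GB/s peak")
```

Under these assumptions a full x16 link delivers about 48 GB/s of usable bandwidth, only slightly more than a single DDR5-4800 channel at 38.4 GB/s, while occupying sixteen PCIe lanes.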

History and development

CXL has evolved from simple point-to-point connections to multidimensional fabrics. Versions 1.0/1.1 established three device types and cache-coherent access over PCIe 5.0. Version 2.0 introduced switching for memory pooling and basic IDE encryption. Version 3.0 moved to PCIe 6.0, doubling bandwidth, and introduced multilevel non-tree topologies and peer-to-peer access. The more recent specifications 3.1 and 3.2 focus on port-based routing (PBR) fabrics, global integrated memory (GIM), and enhanced monitoring mechanisms such as the CXL hot-page monitoring unit (CHMU). Together, these changes transform the technology from local memory expansion into a foundation for large-scale heterogeneous computing.