What is VMCS (Virtual Machine Control Structure)

VMCS (Virtual Machine Control Structure) is a region of physical memory, at most 4 KB in size, that the processor uses to manage transitions between VMX root mode (the hypervisor) and VMX non-root mode (the guest). It stores guest and host processor state together with control fields and bitmap addresses that determine which guest events cause an exit, so the hardware saves and restores state on every transition while the hypervisor only has to handle the event that caused the exit.

VMCS is a fundamental element of Intel VT-x hardware virtualization and is used by every hypervisor that relies on VT-x, including KVM, Hyper-V, VMware ESXi, and Xen, regardless of whether it is classified as type 1 or type 2. Without it, running unmodified operating systems in isolated domains on x86 requires software techniques such as binary translation. Each virtual processor receives its own VMCS, which governs its behavior in the guest environment and records the cause of each synchronous or asynchronous exit to the virtual machine monitor.


The main challenges when working with VMCS are invalid guest or host state and inconsistent control fields programmed by the hypervisor, which cause VM entry to fail. Migration errors occur when a VMCS is moved between logical processors without first executing VMCLEAR on the original one, because the processor may cache VMCS data internally and the structure can be active on only one logical processor at a time. Misconfigured I/O bitmaps, exception bitmaps, or secondary execution controls lead to a flood of exits to the hypervisor, crippling guest performance.

How VMCS works

The operating principle is a repeating cycle of VM entries and VM exits carried out by the processor. The hypervisor first executes VMCLEAR on the VMCS region to initialize it, then executes VMPTRLD, whose memory operand holds the 64-bit physical address of the region, making it the current VMCS for that logical processor. Then, using VMWRITE instructions, it fills in the writable field groups: the guest-state area, the host-state area, the VM-execution control fields, the VM-exit control fields, and the VM-entry control fields. The sixth group, the VM-exit information fields, is written by the processor and is normally read-only.

The key point is configuring the VM-execution control fields, which determine which guest actions cause an immediate exit. After the guest is started via VMLAUNCH or VMRESUME, the processor loads the guest state from the VMCS and begins execution in VMX non-root mode, where the guest may still run at ring 0 but sensitive operations are subject to the VMCS controls. When an interceptable event occurs, such as a physical interrupt, execution of the HLT instruction, or access to a monitored port, the processor records the exit reason in the VM-exit information fields, saves the guest state that the VMCS tracks (RIP, RSP, RFLAGS, control and segment registers, and selected MSRs) in the guest-state area, and loads the host state from the host-state area. General-purpose registers other than RSP are not saved by hardware and must be preserved by the hypervisor. The guest instruction pointer is saved in the guest RIP field, and further detail about the exit is provided by the exit qualification and related fields, which describe, for example, the port or address involved and the nature of the access.


In its exit handler, the hypervisor reads the fields via VMREAD, emulates the necessary hardware or handles the exception, modifies the guest state if needed, and returns control using the VMRESUME instruction. The processor then switches back to non-root mode and resumes execution at the guest RIP stored in the VMCS. Thus, VMCS eliminates the need for binary translation, replacing it with hardware traps, which substantially reduces virtualization overhead.
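
The entry and exit cycle can be sketched as a small state machine. The following is a user-space Python model only: it executes no VMX instructions, and the class `VmcsModel` and the function `run_cycle` are hypothetical names introduced for illustration. It assumes the launch-state rules of the architecture: VMLAUNCH needs a current VMCS in the clear state, VMRESUME needs one in the launched state.

```python
class VmcsModel:
    """Toy model of VMCS launch state; it executes no VMX instructions."""

    def __init__(self):
        self.launch_state = None  # undefined until VMCLEAR initializes the region
        self.current = False      # True once VMPTRLD has made this the current VMCS

    def vmclear(self):
        # VMCLEAR resets the launch state and makes the VMCS non-current.
        self.launch_state = "clear"
        self.current = False

    def vmptrld(self):
        self.current = True

    def vmlaunch(self):
        if not self.current or self.launch_state != "clear":
            raise RuntimeError("VMLAUNCH requires a current VMCS in the clear state")
        self.launch_state = "launched"

    def vmresume(self):
        if not self.current or self.launch_state != "launched":
            raise RuntimeError("VMRESUME requires a current VMCS in the launched state")


def run_cycle(vmcs, exits):
    """Enter the guest once per simulated exit; return the entry instructions used."""
    used = []
    vmcs.vmclear()
    vmcs.vmptrld()
    for i, _reason in enumerate(exits):
        if i == 0:
            vmcs.vmlaunch()
            used.append("VMLAUNCH")
        else:
            vmcs.vmresume()
            used.append("VMRESUME")
        # A real hypervisor would now VMREAD the exit reason and handle it.
    return used
```

The model shows why the first entry uses VMLAUNCH and every later one uses VMRESUME until the VMCS is cleared again.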

VMCS functionality

VMCS format and pointers. A VMCS region is a data structure in memory, up to 4 KB in size, aligned to a 4 KB boundary. Bits 30:0 of its first 4 bytes contain the VMCS revision identifier, which must match the value reported by the IA32_VMX_BASIC MSR; bit 31 marks a shadow VMCS. Fields are not accessed by dereferencing pointers into the region but through the VMREAD and VMWRITE instructions, which lets the processor choose its own internal layout and cache the state.
Guest state area. This area holds the architectural state of the virtual processor that hardware saves on VM exit and loads on VM entry. It includes RIP, RSP, and RFLAGS, the control registers CR0, CR3, and CR4, and DR7. Other general-purpose registers such as RAX, RBX, and RBP are not part of the VMCS and must be saved and restored by the hypervisor. Segment registers are saved along with their hidden parts: base, limit, and access rights.
Host state area. This section defines the context to which control is returned on a VM exit. It holds the CR3 value that switches to the hypervisor address space (along with CR0 and CR4), the stack pointer (RSP), and the entry point (RIP). Host segment selectors, including CS, DS, and SS, define the execution environment of the virtual machine monitor after an intercept.
Pin-based control fields. These flags control the processor's response to asynchronous events. The External-interrupt exiting bit, when set, makes external interrupts cause a VM exit instead of being delivered through the guest IDT. The NMI exiting bit works similarly for non-maskable interrupts, ensuring they are intercepted by the hypervisor.
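
As an illustration of the region header, here is a minimal Python sketch that builds a zeroed 4 KB buffer with the revision identifier in its first four bytes. It is a model of the memory layout only; the function names are hypothetical, and on real hardware the identifier comes from reading the IA32_VMX_BASIC MSR in ring 0.

```python
import struct

VMCS_REGION_SIZE = 4096


def revision_id_from_vmx_basic(vmx_basic):
    """Bits 30:0 of IA32_VMX_BASIC hold the VMCS revision identifier."""
    return vmx_basic & 0x7FFF_FFFF


def init_vmcs_region(vmx_basic, shadow=False):
    """Return a zeroed 4 KB buffer with the revision identifier in bytes 0..3."""
    header = revision_id_from_vmx_basic(vmx_basic)
    if shadow:
        header |= 1 << 31  # bit 31 of the first dword marks a shadow VMCS
    region = bytearray(VMCS_REGION_SIZE)
    struct.pack_into("<I", region, 0, header)  # little-endian 32-bit store
    return region


def is_valid_vmcs_address(phys_addr):
    """A VMCS pointer must be aligned to a 4 KB boundary."""
    return phys_addr % VMCS_REGION_SIZE == 0
```

The rest of the buffer stays zeroed because the remaining layout is implementation-specific and is populated only through VMWRITE.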
Primary processor-based controls. This bit field defines the execution policy for synchronous instructions and events. The RDTSC exiting flag allows intercepting reads of the timestamp counter, while HLT exiting intercepts the halt instruction. The INVLPG exiting bit controls the response to TLB entry invalidation, which is critical for maintaining shadow page table coherence.
Secondary processor-based controls. These extend the primary controls with features such as EPT (Extended Page Tables) and VPID, and take effect only when the "activate secondary controls" bit of the primary controls is set. The Enable EPT bit enables hardware nested paging, eliminating the need for shadow page tables. The Enable VPID bit tags TLB entries with a virtual-processor identifier so that the TLB need not be flushed on every VM entry and VM exit.
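
A hedged sketch of how such control words are typically composed: the bit positions below follow the Intel SDM numbering, while `adjust_controls` is a hypothetical helper. It relies on a detail not spelled out above, namely that each control field has a VMX capability MSR whose low 32 bits list the bits that must be 1 and whose high 32 bits list the bits that may be 1.

```python
# Primary processor-based VM-execution control bits (Intel SDM numbering).
HLT_EXITING = 1 << 7
INVLPG_EXITING = 1 << 9
RDTSC_EXITING = 1 << 12
ACTIVATE_SECONDARY_CONTROLS = 1 << 31

# Secondary processor-based VM-execution control bits.
ENABLE_EPT = 1 << 1
ENABLE_VPID = 1 << 5


def adjust_controls(desired, capability_msr):
    """Fit a desired control value to what the CPU reports it supports.

    Low 32 bits of the capability MSR: bits that must be 1.
    High 32 bits of the capability MSR: bits that are allowed to be 1.
    """
    must_be_one = capability_msr & 0xFFFF_FFFF
    may_be_one = capability_msr >> 32
    return (desired | must_be_one) & may_be_one
```

A hypervisor would then compare the adjusted value with the desired one to detect features, such as EPT or VPID, that the processor does not offer.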
Exception bitmap. A 32-bit field in which each bit corresponds to one exception vector (0 through 31). If the bit for #GP (General Protection, vector 13) or #UD (Invalid Opcode, vector 6) is set, the occurrence of that exception in the guest causes an exit. The page-fault exception (#PF, vector 14) is controlled not only by this field but also by the page-fault error-code mask and match fields, which filter by error code.
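
The exit decision for exceptions can be modeled in a few lines of Python. This is a sketch with hypothetical function names; the page-fault rule it implements is that a #PF exits when the result of comparing the masked error code with the match value agrees with bit 14 of the bitmap.

```python
UD_VECTOR, GP_VECTOR, PF_VECTOR = 6, 13, 14


def exception_bitmap(*vectors):
    """Build a 32-bit exception bitmap with one bit per intercepted vector."""
    bitmap = 0
    for vector in vectors:
        if not 0 <= vector < 32:
            raise ValueError("exception vectors are 0..31")
        bitmap |= 1 << vector
    return bitmap


def exception_causes_exit(bitmap, vector, pf_error_code=0, pfec_mask=0, pfec_match=0):
    """Model whether an exception raised in the guest causes a VM exit."""
    bit_set = bool((bitmap >> vector) & 1)
    if vector != PF_VECTOR:
        return bit_set
    # Page faults: the error-code filter can confirm or invert the bitmap bit.
    matches = (pf_error_code & pfec_mask) == pfec_match
    return bit_set == matches
```

With the mask and match both left at zero, every error code matches, so bit 14 alone decides whether page faults exit.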
I/O bitmaps. Two 4-kilobyte structures whose physical addresses are stored in the VMCS and which are consulted when the "use I/O bitmaps" control is set. Bitmap A covers ports 0000h–7FFFh, bitmap B covers ports 8000h–FFFFh. If the bit for a port is set, IN, OUT, INS, and OUTS instructions that access it cause a VM exit. This allows the hypervisor to selectively emulate critical or nonexistent I/O devices.
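
To make the port-to-bit mapping concrete, here is a small Python model of the two bitmaps. The helper names are hypothetical; the sketch assumes one bit per port, least significant bit first within each byte, and treats an access that wraps past port FFFFh as exiting.

```python
IO_BITMAP_SIZE = 4096  # bytes per bitmap; 32768 ports each


def new_io_bitmaps():
    """Return (bitmap A, bitmap B), both with no ports intercepted."""
    return bytearray(IO_BITMAP_SIZE), bytearray(IO_BITMAP_SIZE)


def _locate(port):
    """Map a port number to (bitmap index, byte offset, bit position)."""
    index = port & 0x7FFF
    return (0 if port < 0x8000 else 1), index // 8, index % 8


def intercept_port(bitmaps, port):
    """Set the bit for a port so that guest accesses to it cause a VM exit."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("x86 I/O ports are 0x0000..0xFFFF")
    which, byte, bit = _locate(port)
    bitmaps[which][byte] |= 1 << bit


def io_causes_exit(bitmaps, port, size=1):
    """A multi-byte access exits if any port it touches is intercepted."""
    for p in range(port, port + size):
        if p > 0xFFFF:
            return True  # access wraps around the 16-bit port space
        which, byte, bit = _locate(p)
        if (bitmaps[which][byte] >> bit) & 1:
            return True
    return False
```

For example, intercepting port 3F8h sets bit 0 of byte 127 in bitmap A, and a two-byte access starting at 3F7h then also exits.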
CR0/CR4 masks and shadows. These fields divide ownership of individual control-register bits between host and guest. The guest/host mask selects the bits owned by the hypervisor, and the read shadow supplies the values returned for those bits when the guest reads the register. A guest attempt to set a host-owned bit to a value different from its read-shadow bit causes a VM exit, preventing the guest from, for example, disabling paging or protection behind the hypervisor's back.
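
The mask and shadow logic reduces to two bitwise expressions, sketched here in Python with hypothetical function names. The model covers MOV to and from the control register only; instructions such as CLTS and LMSW have their own rules and are not modeled.

```python
CR0_PE = 1 << 0   # protection enable
CR0_TS = 1 << 3   # task switched
CR0_NE = 1 << 5   # numeric error
CR0_PG = 1 << 31  # paging


def guest_cr_read(actual, mask, read_shadow):
    """Host-owned bits come from the read shadow, the rest from the real register."""
    return (actual & ~mask) | (read_shadow & mask)


def guest_cr_write_exits(new_value, mask, read_shadow):
    """A write exits if any host-owned bit differs from its read-shadow bit."""
    return ((new_value ^ read_shadow) & mask) != 0
```

With PE and PG host-owned, a guest write that clears PG exits, while a write that only toggles the guest-owned TS bit does not.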
Timestamp counter control. The TSC offset field contains a 64-bit signed offset. When guest software executes RDTSC with RDTSC exiting disabled and "use TSC offsetting" enabled, the returned value is the real TSC plus the offset. This gives a migrated or paused virtual machine a continuous, monotonic view of time, allowing the hypervisor to hide the host's real timestamps and compensate for delays.
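
The offset arithmetic can be illustrated with a short Python sketch. The function names are hypothetical, TSC scaling is assumed to be off, and the signed offset is represented as a 64-bit two's-complement value so that the addition wraps the way hardware arithmetic does.

```python
MASK64 = (1 << 64) - 1


def guest_tsc(host_tsc, tsc_offset):
    """Value the guest sees from RDTSC: host TSC plus offset, modulo 2**64."""
    return (host_tsc + tsc_offset) & MASK64


def offset_for_restore(guest_tsc_at_save, host_tsc_at_restore):
    """Offset that makes the guest's TSC continue from where it stopped."""
    return (guest_tsc_at_save - host_tsc_at_restore) & MASK64
```

If a guest was saved at TSC value 100 and restored on a host whose TSC reads 1000, the offset is the two's-complement encoding of -900, and the guest continues counting from 100.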
CR3-target values. A set of four 64-bit values and a count of how many of them are valid. If the guest executes MOV to CR3 while CR3-load exiting is enabled and the value being written matches one of the valid target values, no VM exit occurs. This is an optimization for guest operating systems that frequently switch between a few known address spaces, sparing VMM intervention on those context switches.
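
A minimal Python model of that check, with a hypothetical function name: only the first `target_count` entries of the target list are considered valid.

```python
def mov_to_cr3_exits(value, cr3_load_exiting, targets, target_count):
    """Model whether a guest MOV to CR3 causes a VM exit."""
    if not cr3_load_exiting:
        return False  # the control is off, so CR3 loads never exit
    if not 0 <= target_count <= len(targets):
        raise ValueError("target_count exceeds the number of target fields")
    # Only the first target_count values are valid comparison targets.
    return value not in targets[:target_count]
```

Setting the count to zero therefore makes every MOV to CR3 exit while the control is enabled.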
Virtual APIC control. The virtual-APIC page address points to a 4-kilobyte memory area in which the processor maintains a virtual copy of APIC registers, including the TPR. If "use TPR shadow" is set, MOV operations to and from CR8 are handled via this page without an exit. The TPR threshold causes a VM exit if the guest lowers its task priority below the specified level, giving the hypervisor a chance to deliver a pending interrupt that the lower priority now permits.
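
The threshold comparison can be sketched as follows. This Python model uses hypothetical function names and assumes TPR shadowing without virtual-interrupt delivery: CR8 bits 3:0 map to bits 7:4 of the virtual TPR, and an exit occurs when the 4-bit threshold exceeds that priority class.

```python
def vtpr_after_cr8_write(cr8_value):
    """CR8[3:0] becomes VTPR[7:4]; the low four VTPR bits are cleared."""
    return (cr8_value & 0xF) << 4


def tpr_below_threshold(vtpr, tpr_threshold):
    """True if the write should cause a 'TPR below threshold' VM exit."""
    return (tpr_threshold & 0xF) > ((vtpr >> 4) & 0xF)
```

With a threshold of 5, a guest that writes 3 to CR8 triggers the exit, while writing 5 or 8 does not.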
VM-entry control fields. Define the transition from hypervisor to guest. The IA-32e mode guest bit determines whether the guest runs in IA-32e mode after entry; its value is loaded into IA32_EFER.LMA. The VM-entry interruption-information field allows the hypervisor to inject an event, such as an NMI, an external interrupt, or an exception, which is delivered to the guest before its first instruction executes.
VM-exit control fields. Contain flags that determine how processor state is saved and loaded on return to the host. The Host address-space size bit selects whether the host runs in 64-bit mode after the exit. The VM-exit MSR-store and MSR-load address fields point to lists used to save guest values and load host values of model-specific registers (e.g., STAR, LSTAR) at the moment of exit.
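
Event injection comes down to encoding a 32-bit value. The Python sketch below uses a hypothetical helper name and the field layout from the Intel SDM: vector in bits 7:0, event type in bits 10:8, the deliver-error-code flag in bit 11, and the valid flag in bit 31.

```python
TYPE_EXTERNAL_INTERRUPT = 0
TYPE_NMI = 2
TYPE_HARDWARE_EXCEPTION = 3


def entry_interruption_info(vector, event_type, deliver_error_code=False):
    """Encode a VM-entry interruption-information value with the valid bit set."""
    info = (vector & 0xFF) | ((event_type & 0x7) << 8) | (1 << 31)
    if deliver_error_code:
        info |= 1 << 11  # an error code is taken from a separate VMCS field
    return info
```

Injecting an NMI (vector 2, type 2) yields 0x80000202; injecting #GP (vector 13) with an error code yields 0x80000B0D.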
Exit information fields. A read-only section filled by hardware on a VM exit. The Exit reason field encodes the cause of the transition: port access, exception, triple fault, external interrupt, and so on. The Exit qualification field provides reason-specific details, such as the port number, access size, and direction for an I/O instruction, or the faulting linear address for a page fault, needed to handle the exit.
VMCS caching. To improve performance, the processor may keep VMCS data in internal storage, hiding the storage format from the hypervisor. VMREAD and VMWRITE always operate on the current data wherever the processor holds it, whereas the in-memory image is not guaranteed to be up to date. The VMCLEAR instruction must be used before a region is first used, and before it is moved to another logical processor or reinitialized, to flush the cached state back to physical memory and reset the launch state.
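
As an illustration of how a handler might interpret these fields, the following Python sketch decodes an exit reason and the exit qualification of an I/O instruction. The function names are hypothetical and the table lists only a handful of the basic exit reasons defined in the Intel SDM.

```python
BASIC_EXIT_REASONS = {
    0: "exception or NMI",
    1: "external interrupt",
    2: "triple fault",
    10: "CPUID",
    12: "HLT",
    30: "I/O instruction",
    48: "EPT violation",
}


def decode_exit_reason(raw):
    """Split the exit reason into its basic reason and the VM-entry failure flag."""
    basic = raw & 0xFFFF
    return {
        "basic": basic,
        "name": BASIC_EXIT_REASONS.get(basic, "other"),
        "entry_failure": bool((raw >> 31) & 1),
    }


def decode_io_qualification(q):
    """Decode the exit qualification reported for an I/O instruction exit."""
    return {
        "size": (q & 0x7) + 1,                      # 1, 2, or 4 bytes
        "direction": "in" if (q >> 3) & 1 else "out",
        "string": bool((q >> 4) & 1),               # INS/OUTS
        "rep": bool((q >> 5) & 1),                  # REP prefix present
        "port": (q >> 16) & 0xFFFF,
    }
```

A qualification of 0x0CF8000B, for instance, decodes as a 4-byte IN from port CF8h.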
I/O bitmaps structure. The addresses of bitmaps A and B must be aligned to a 4096-byte boundary. Each bit represents a port: a bit set to 1 causes an exit when the guest accesses that port. The map resolution is one bit per port, providing full coverage of the x86 architecture 64-kilobyte I/O address space for hardware isolation.
Linked and shadow VMCS. The architecture supports shadow VMCSs, referenced through the VMCS link pointer. When the "VMCS shadowing" control is enabled, VMREAD and VMWRITE instructions executed by the guest access the shadow VMCS directly instead of causing a VM exit. This is used for nested virtualization, where a guest hypervisor manages its own virtual machines.
TSC counter mechanism. The processor adds the TSC offset in hardware during RDTSC or RDTSCP execution, without a VM exit. This allows the guest to read virtualized time with near-native performance, as the operation does not cause an expensive transition to VMX root mode.
APIC page address. The physical address of the 4-kilobyte virtual-APIC page is used by the processor to access the virtual copy of the local APIC registers. Accesses to the TPR via CR8 are emulated in hardware, and in combination with posted-interrupt support, interrupts can be delivered directly to a running guest without an exit.
MSR save fields. The MSR address fields point to lists of MSRs that are stored on exit or loaded on entry and exit automatically; the recommended maximum list length, reported in the IA32_VMX_MISC MSR, is at least 512 entries. Each entry occupies 16 bytes: the lower 4 bytes store the MSR index, the next 4 are reserved, and the upper 8 contain the value. This allows switching of registers that have no dedicated VMCS field, such as STAR, LSTAR, and KERNEL_GS_BASE; the SYSENTER MSRs and the FS/GS bases have their own guest- and host-state fields, as does EFER when the corresponding entry and exit controls are used.
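
The 16-byte entry layout maps directly onto a packed structure. The Python sketch below uses hypothetical helper names; the two MSR indices are the architectural values for IA32_STAR and IA32_LSTAR, and the values stored in the usage example are made up.

```python
import struct

# One list entry: 32-bit MSR index, 32 reserved bits, 64-bit value (little-endian).
MSR_ENTRY = struct.Struct("<IIQ")

IA32_STAR = 0xC0000081
IA32_LSTAR = 0xC0000082


def pack_msr_list(entries):
    """Serialize (index, value) pairs into the in-memory MSR list format."""
    buf = bytearray()
    for index, value in entries:
        buf += MSR_ENTRY.pack(index, 0, value)  # reserved dword must be zero
    return bytes(buf)


def unpack_msr_list(buf):
    """Parse a raw MSR list back into (index, value) pairs."""
    return [(index, value) for index, _reserved, value in MSR_ENTRY.iter_unpack(buf)]
```

A list built this way would be placed in memory whose physical address goes into the matching MSR-load or MSR-store address field, with the entry count written to the matching count field.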

Comparisons

VMCS vs VMCB (AMD-V). VMCS is the Intel VT-x control structure, while VMCB is its counterpart in AMD-V technology. Both store guest state and control which events cause an exit from guest mode, but they have different internal formats and are used differently: VMCS fields are accessed through VMREAD/VMWRITE and the guest is entered with VMLAUNCH or VMRESUME, whereas the VMCB is ordinary memory that the hypervisor reads and writes directly and passes to VMRUN. Despite this incompatibility, their functional purpose is identical: to provide transparent entry to and exit from the guest environment with minimal overhead.
VMCS vs ARM64 EL2 registers. Unlike x86 with its dedicated VMCS structure, ARM64 has no single virtualization control block. The vCPU context is held in system registers at the EL2 level and in the software KVM vcpu structure. Where VT-x saves and loads state through the VMCS automatically on mode switches, ARM relies on the hypervisor to save and restore registers in software, reducing hardware complexity at the cost of more work for software.
VMCS vs kvm_vcpu (KVM software). VMCS serves exclusively as a hardware container for context switching and controlling sensitive guest instructions. The software kvm_vcpu structure in the Linux kernel is much broader: it includes request queues, device emulation states, timers, and, through its VMX-specific wrapper, a reference to the VMCS. Thus, VMCS solves the hardware switching task, while kvm_vcpu manages the full lifecycle of the virtual processor in the host system.
VMCS vs nested VMCS (L1 vs L2). With nested virtualization, the host hypervisor (L0) uses a regular VMCS to run the L1 guest hypervisor, while the VMCS that L1 prepares for its own L2 guest is a virtualized or shadow structure that L0 must interpret. L0 typically combines L1's settings with its own into a separate VMCS that actually runs L2, and it reflects L2's exits back to L1 when L1 asked to intercept them, creating cascading VM-exit handling that preserves isolation between the levels.
VMCS vs shadow page tables. VMCS includes fields for configuring hardware address translation (EPT), which is an alternative to software shadow page tables. With shadow page tables, the hypervisor intercepts guest page table changes for synchronization, generating frequent VM exits. Hardware EPT support via VMCS eliminates these exits, replacing the synchronization work with a hardware walk of both the guest page tables and the EPT on a TLB miss.

OS and driver support

VMCS serves as the foundation for hypervisor-guest OS interaction at the processor level. Support is implemented through direct programming of the structure in kernel mode (Ring 0). After enabling VMX operation with VMXON, the driver allocates a 4 KB-aligned page of physical memory, writes the revision identifier into it, executes VMCLEAR on it, and then executes VMPTRLD to make the structure current on the logical processor. VMCS fields are then configured with VMWRITE (and read back with VMREAD), filling the host and guest state areas (CR3, RIP, RSP, segment registers) as well as the control fields (exception bitmap, I/O bitmap addresses, MSR lists). After initialization, the driver launches the guest with the VMLAUNCH instruction, and on each subsequent event (such as an intercepted exception or external interrupt), the processor automatically saves the guest state in the VMCS and restores the host state, passing control to the hypervisor handler.

Security

VMCS provides isolation and control over the guest environment by letting the hypervisor choose precisely which conditions force an exit (VM exit). Security policy is expressed by setting bits in the execution control fields: for example, the Descriptor-table exiting bit forces the processor to leave guest mode when the guest accesses the GDTR, IDTR, LDTR, or TR registers, allowing a monitor to block attacks that hook or relocate interrupt and descriptor tables. For memory protection, Extended Page Tables (EPT) confine the guest to the physical pages assigned to it, and Mode-based Execute Control for EPT separates execute permission for user-mode and supervisor-mode code, blocking execution of unauthorized code in kernel space.

Logging

Logging in the context of VMCS relies on the hardware recording the cause of each exit and its associated information every time the guest exits. The processor writes this information into read-only fields of the structure: on a VM exit, a numeric reason code (interrupt, exception, CPUID execution, EPT violation, etc.) is placed in the Exit reason field, and reason-specific details, such as the faulting linear address for a page fault or the type of access for an EPT violation, are placed in the Exit qualification field. The hypervisor, having taken control, reads these fields using the VMREAD instruction, decodes them according to Intel documentation, and can write them to the system log, making it possible to determine precisely which instruction stopped the virtual machine and why.
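
A hedged example of turning raw exit fields into a log line, here for an EPT violation. The function name and message format are hypothetical; the bit meanings assumed are those of the Intel SDM: bits 0 to 2 of the qualification give the access type (read, write, instruction fetch) and bits 3 to 5 give the permissions the EPT entry granted.

```python
def describe_ept_violation(qualification, guest_physical):
    """Format an EPT-violation exit qualification as a human-readable log line."""
    access = [
        name
        for bit, name in ((0, "read"), (1, "write"), (2, "fetch"))
        if (qualification >> bit) & 1
    ]
    permissions = "".join(
        flag if (qualification >> bit) & 1 else "-"
        for bit, flag in ((3, "r"), (4, "w"), (5, "x"))
    )
    attempted = "+".join(access) or "none"
    return (
        f"EPT violation: {attempted} at GPA {guest_physical:#x}, "
        f"page permissions {permissions}"
    )
```

For a write to a read-only page at guest-physical address 0x1000, the qualification 0xA produces "EPT violation: write at GPA 0x1000, page permissions r--".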

Limitations

The main technical limitation of VMCS is its binding to a logical processor: a VMCS can be active on only one logical processor at a time, so the hypervisor must allocate and initialize a separate VMCS for each virtual processor of the guest and execute VMCLEAR before moving one to another core. The hardware imposes strict requirements on memory alignment (4 KB) and the revision identifier; attempting to load an incompatible or misaligned structure via VMPTRLD makes the instruction fail. Furthermore, nested virtualization without hardware VMCS shadowing requires every VMREAD and VMWRITE issued by the nested hypervisor to trap to the host hypervisor for emulation, which drastically slows performance.

History and evolution

The evolution of VMCS is inextricably linked to the evolution of Intel VT-x hardware virtualization, starting in 2005 when the structure first appeared in Pentium 4 processors to solve the problem of the 17 sensitive x86 instructions that did not trap when executed without privilege. Early versions included only basic guest and host state and control fields, but with the release of the Nehalem architecture in 2008, the structure was expanded with fields for activating Extended Page Tables (EPT) and VPID, accelerating address translation and context switching. In 2013, Haswell added VMCS Shadowing, allowing the processor to handle a nested hypervisor's VMREAD and VMWRITE accesses in hardware without heavy software emulation. Subsequent generations introduced MBEC and accelerated APIC handling, steadily adding control bits and fields to improve security and performance for modern cloud infrastructures.