VXLAN offload (VXLAN hardware acceleration)

VXLAN offload moves the encapsulation and decapsulation of virtual network tunnels from the server CPU to the network interface card. Without it, the CPU spends cycles adding or removing outer headers for every packet of virtual machine traffic. Offload frees the processor for computation, while the NIC hardware processes tunneled traffic at close to line rate.

VXLAN offload is used in large data centers where thousands of virtual machines communicate over logically isolated networks. It matters most for cloud platforms and for hosts running heavily loaded software switches such as Open vSwitch. It is also common in telecommunications NFV systems and hyperconverged infrastructures, where near bare-metal network performance must coexist with the flexibility of virtualization.

When offload is absent or misconfigured, typical symptoms are abnormally high CPU load during network traffic processing and throughput that falls well short of the link rate, in some cases below 10 Gbit/s. Fragmentation problems also appear, because encapsulation increases the frame size. Diagnostics are harder because tools such as tcpdump capture packets before offload processing on the sender and after it on the receiver, so the capture may not match what actually travels on the wire.

How it works

The principle is to move header processing from the operating system's network stack and driver into the hardware logic of the network adapter. The stack builds an inner frame that belongs to a virtual machine and passes it to the NIC through a transmit queue. With offload enabled, the driver hands the adapter not a complete packet but the inner frame together with metadata: the virtual network identifier (VNI), the outer source and destination addresses, and the UDP ports. The NIC's hardware engine then builds the packet itself: it prepends the outer MAC header, the outer IP header, and the UDP header with destination port 4789, followed by the VXLAN header that carries the VNI, and places the inner frame after the VXLAN header. Finally the adapter computes the checksums for the outer headers, in some designs for IP and UDP in a single pass, and transmits the finished frame.

On the receiving side the process is mirrored. The NIC inspects the incoming stream, recognizes packets addressed to UDP port 4789, checks that the VNI matches a configured virtual network, strips the outer Ethernet, IP, UDP, and VXLAN headers in hardware, and places the original frame directly into the memory buffer allocated for the target virtual machine, bypassing the hypervisor network stack for that traffic.
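The header layout described above can be illustrated in software. The following is a minimal Python sketch of only the 8-byte VXLAN header (flags, reserved bits, 24-bit VNI); the outer Ethernet, IP, and UDP headers are omitted, and the function names vxlan_encap and vxlan_decap are hypothetical, chosen for illustration. A real NIC performs the equivalent steps in its hardware pipeline.

```python
import struct

VXLAN_PORT = 4789             # UDP destination port named in the text
VXLAN_FLAG_VNI_VALID = 0x08   # "I" flag: the VNI field is valid


def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header to an inner Ethernet frame."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits.
    # Word 2: VNI (24 bits) + 8 reserved bits.
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(udp_payload: bytes, allowed_vnis: set) -> tuple:
    """Validate the VXLAN header and return (vni, inner_frame)."""
    if len(udp_payload) < 8:
        raise ValueError("truncated VXLAN header")
    word1, word2 = struct.unpack("!II", udp_payload[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    vni = word2 >> 8
    if vni not in allowed_vnis:
        raise ValueError(f"VNI {vni} is not configured on this endpoint")
    return vni, udp_payload[8:]
```

The receive function mirrors the transmit one: it checks the flag and the VNI against a configured set before handing back the original frame, which is the same order of checks the text attributes to the adapter.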

Functionality

  1. How VXLAN offload works. Hardware VXLAN encapsulation offload moves the addition and removal of outer headers from the operating system kernel to the network adapter, so the host CPU no longer touches them. This reduces latency and frees CPU cycles for application workloads.
  2. Standardized network identifier. The VXLAN Network Identifier (VNI) is a 24-bit field, which allows about 16.7 million (2^24) isolated Layer 2 segments over a Layer 3 network. The NIC offload engine handles this identifier in hardware, providing wire-rate switching between virtual machines without hypervisor involvement.
  3. Outer header parsing. The network card chip performs Outer Ethernet, Outer IP, and Outer UDP header parsing on hardware parsers. When offload is activated, the adapter independently checks the correctness of the outer IP packet checksums and the destination port match against the standard VXLAN port 4789.
  4. Packet mapping and hardware tables. To engage offload, the driver programs the network card tables with rules that map inner traffic headers to specific VXLAN tunnels. The entries contain the VNI, the MAC addresses of remote VTEP nodes, and the IP addresses for encapsulation, allowing the card to make decisions without slow software FIB lookups.
  5. Neighbor table management. Adapters that support full offload store neighbor tables in dedicated memory. The card independently tracks remote VTEP availability through address resolution protocols, offloading the kernel stack from generating ARP requests and processing ARP replies for tunnel endpoints.
  6. Hardware processing of Geneve‑like options. Modern adapters implement flexible parsing not only for fixed VXLAN but also for arbitrary TLV headers. Flexible Match Tables allow the parser to be reconfigured to handle network virtualization protocols with variable‑length fields at port speed.
  7. Stateless offload mechanism. With basic offload, the hardware computes inner packet checksums, including the TCP/UDP pseudo-header, and performs large send offload (LSO) and large receive offload (LRO). The card also recalculates the checksums of the encapsulating headers transparently, with no driver intervention in the data path.
  8. Stateful Offload and connection setup mechanism. Next‑generation accelerators take over the full lifecycle of tunnels, including establishing and terminating overlay network connections. The adapter virtualizes physical queues, isolating traffic of different VNIs into dedicated hardware contexts without performance penalty.
  9. Hardware eSwitch switching. The virtual switch built into the NIC performs hairpin switching between virtual functions located on the same physical host. If two VMs in different VNIs communicate through an external gateway, the card hardware‑encapsulates the frame, sends it to the external network, and receives it back without raising traffic to the hypervisor.
  10. BUM traffic multicast offload. Broadcast, unknown unicast, and multicast traffic requires packet replication to multiple VTEPs. The hardware replication function allows the network card to clone the packet according to a specified multicast group, preserving the PCIe bus bandwidth.
  11. Security policy filtering offload. Smart network cards integrate a hardware firewall between VXLAN tunnels. Access list rules are applied to traffic of a specific VNI before encapsulation or after decapsulation at data‑transfer speed, providing micro‑segmentation with zero CPU overhead.
  12. VNI‑based Quality of Service management. Network adapters are capable of classifying outgoing traffic based on the VXLAN identifier and applying individual bandwidth limiting or prioritization policies to different segments. The hardware queue scheduler guarantees minimum latency for real‑time traffic inside the overlay.
  13. Asymmetric offload support. The hardware logic allows a configuration in which receive traffic is fully offloaded to the hardware engine while transmission stays under OS driver control. This approach is used during debugging or when the chip cannot handle tunnel headers that are generated dynamically in software.
  14. Interaction with the RoCEv2 protocol. When RDMA and VXLAN are used together, encapsulation offload for RoCEv2 packets becomes critical. Specialized adapters recognize RDMA packets by their transport header and wrap them in VXLAN without losing direct memory access semantics, bypassing the network stack bottleneck.
  15. Overlay packet fragmentation handling. An outgoing packet after encapsulation may exceed the physical link MTU. Advanced offload performs hardware IP fragmentation of the outer header or, conversely, reassembles fragmented encapsulated datagrams on the receiving side before passing data to the host system buffers.
  16. Receive Side Scaling for VXLAN. To distribute load evenly across multicore systems, the network card computes a hash over the encapsulated traffic rather than over the outer headers alone. The hardware RSS engine parses the inner IP addresses and ports, so that all packets of one overlay connection are processed by a single CPU core.
  17. Hardware diagnostics and VXLAN‑aware counters. Statistics of dropped packets and errors are maintained per VNI. NIC hardware counters allow the administrator to precisely determine on which virtual segment a queue overflow or bandwidth limit exceedance occurred without running heavyweight traffic analyzers.
  18. PCI Express bus load reduction. Without hardware offload, the hypervisor is forced to transfer data between memory and the adapter multiple times to process headers. The offload engine eliminates unnecessary DMA data moves, delivering payload directly to the guest machine memory.
  19. Encapsulation offload for IPv6 underlay. Modern adapters natively encapsulate VXLAN packets into IPv6 datagrams in hardware. The adapter parsers handle IPv6 extension headers, including mobility headers, while preserving checksum offload and segmentation for the UDP transport.
  20. Offload capability negotiation. The driver and network adapter firmware exchange capability structures during initialization. The system enables offload only for those tunnel parameter combinations that are guaranteed to be supported by the microcode version, preventing data corruption due to incorrect hardware interpretation of non‑standard ports.
  21. Future of programmable offload. Development is moving toward the use of programmable P4 pipelines and SmartNICs. Such architectures allow custom overlay encapsulation formats and custom processing functions to be programmed, executed at 100/400GE speeds without modifying host system application software.
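
The inner-header hashing described in item 16 can be sketched as follows. This is an illustrative Python model, not a NIC algorithm: real adapters typically use a keyed Toeplitz hash with an indirection table, whereas this sketch substitutes CRC32 for brevity. The function name inner_flow_queue and its parameters are hypothetical.

```python
import ipaddress
import struct
import zlib


def inner_flow_queue(src_ip: str, dst_ip: str, src_port: int,
                     dst_port: int, proto: int, num_queues: int) -> int:
    """Map an inner 5-tuple to a receive queue index.

    CRC32 stands in for the keyed hash a real RSS engine would use.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    return zlib.crc32(key) % num_queues
```

Because the key is built from the inner addresses and ports, every packet of one overlay connection yields the same queue index, while different connections between the same pair of VTEPs can spread across cores. Hashing only the outer headers would distinguish such flows by the outer UDP source port at best.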

Comparisons

  • VXLAN offload vs VLAN offload: switching on 802.1Q tags has been supported in hardware by virtually every network chip for two decades, but VLAN IDs allow at most 4094 networks. VXLAN removes that limit at the cost of significantly more complex encapsulation, routing, and decapsulation logic in the ASIC, which for a long time was available only in expensive switches based on Trident or Jericho silicon; without such hardware, scaling beyond 4094 networks required software gateways.
  • VXLAN offload vs software VXLAN: terminating VXLAN on the host CPU with ovs-vswitchd or DPDK adds unpredictable delays and can consume up to 30% of CPU resources at 100 Gbit/s, while hardware offload on a SmartNIC moves Geneve/VXLAN encapsulation and checksumming into the NIC's hardware flow pipeline, freeing those cores for user applications and bringing latency down to the microsecond range.
  • VXLAN offload vs NVGRE offload: both use a similar L2-over-L3 encapsulation with a 24-bit network identifier, but NVGRE carries it in the key field of the GRE header (IP protocol 47), whereas VXLAN runs over UDP with destination port 4789. VXLAN offload is more widespread because the UDP source port adds entropy that improves ECMP load balancing, and because Broadcom and Barefoot Tofino chips support it broadly, while the NVGRE ecosystem remains limited.
  • VXLAN offload vs Geneve offload: Geneve has an extensible header with variable-length TLV options, unlike the fixed VXLAN header. This complicates hardware parsing and calls for a flexible parser such as those in Barefoot Tofino chips or programmable SmartNICs, whereas VXLAN's deterministic structure is simple to implement in the fixed pipelines of most ASICs, trading metadata flexibility for lower cost and power consumption.
  • VXLAN offload vs Geneve with NSH offload: combining Geneve with a Network Service Header (NSH) turns the overlay into a full service chain in which each hop adds metadata. This is critical for the 5G User Plane Function but requires significantly more expensive FPGAs or ASICs such as the Intel N3000, whereas classic VXLAN offload remains the choice for mass-market data centers, where the priority is simple L2 connectivity without service routing at the underlay level.
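
The parsing difference between the fixed VXLAN header and the variable-length Geneve header can be shown with a short Python sketch. It assumes the Geneve layout in which the low 6 bits of the first byte give the option length in 4-byte units after an 8-byte base header; the function names are hypothetical.

```python
VXLAN_HEADER_LEN = 8    # fixed: the inner frame always starts at offset 8
GENEVE_BASE_LEN = 8     # fixed part of the Geneve header


def geneve_header_len(first_byte: int) -> int:
    """Return the total Geneve header length in bytes.

    The first byte holds a 2-bit version and a 6-bit option length,
    the latter expressed in multiples of 4 bytes.
    """
    opt_len_words = first_byte & 0x3F
    return GENEVE_BASE_LEN + 4 * opt_len_words


def inner_frame_offset(tunnel_payload: bytes, is_geneve: bool) -> int:
    """Offset of the inner Ethernet frame inside the UDP payload."""
    if not is_geneve:
        return VXLAN_HEADER_LEN
    return geneve_header_len(tunnel_payload[0])
```

For VXLAN the offset is a constant, which suits a fixed-function pipeline. For Geneve the parser has to read a field first and then skip a data-dependent number of bytes, which is the step that calls for a flexible parser.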

OS and driver support

Hardware acceleration of VXLAN depends on support in the OS kernel and in the NIC driver. The driver must let the kernel tell it which UDP ports carry VXLAN: older Linux kernels used the ndo_add_vxlan_port and ndo_del_vxlan_port callbacks, which were later replaced by ndo_udp_tunnel_add and ndo_udp_tunnel_del and then by the udp_tunnel_nic_info infrastructure. With the port known, the NIC can recognize traffic on UDP port 4789 and parse the VXLAN header in hardware. On Linux, the tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation features (the NETIF_F_GSO_UDP_TUNNEL and NETIF_F_GSO_UDP_TUNNEL_CSUM flags) and the rx-udp_tunnel-port-offload feature (NETIF_F_RX_UDP_TUNNEL_PORT) are toggled with the ethtool utility; once they are on, the stack hands large unsegmented tunnel packets to the NIC queue and the hardware segments them. Proprietary implementations such as NVIDIA ASAP² require a specific firmware set and the DOCA libraries, which add an abstraction layer between hardware and software and expose a unified API for programming flow tables in the ASIC.
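
Whether the tunnel features are enabled can be checked by parsing the output of ethtool -k. The Python sketch below assumes the usual "name: on" or "name: off [fixed]" line format and the feature names reported by recent Linux kernels; the helper names parse_features and tunnel_offload_status are hypothetical, and the exact feature set depends on the driver.

```python
import re

# Feature names as printed by `ethtool -k`; availability is driver-dependent.
TUNNEL_FEATURES = (
    "tx-udp_tnl-segmentation",
    "tx-udp_tnl-csum-segmentation",
    "rx-udp_tunnel-port-offload",
)

_FEATURE_LINE = re.compile(r"\s*([\w-]+):\s+(on|off)\b")


def parse_features(ethtool_k_output: str) -> dict:
    """Turn `ethtool -k <interface>` text into {feature_name: enabled}."""
    features = {}
    for line in ethtool_k_output.splitlines():
        match = _FEATURE_LINE.match(line)
        if match:
            features[match.group(1)] = match.group(2) == "on"
    return features


def tunnel_offload_status(ethtool_k_output: str) -> dict:
    """Report each tunnel feature as True, False, or None if not listed."""
    features = parse_features(ethtool_k_output)
    return {name: features.get(name) for name in TUNNEL_FEATURES}
```

In practice the input text could come from subprocess.run(["ethtool", "-k", interface], capture_output=True, text=True).stdout, where interface is the name of the NIC under test.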

Security

The VXLAN protocol has no built-in authentication or encryption. With hardware offload, packets can be decapsulated in the ASIC pipeline before they reach the host OS Netfilter rules, which creates a risk of accepting traffic from unauthorized VTEPs. Confidentiality can be protected in hardware with MACsec, which works at Layer 2 and encrypts the frame payload before it is wrapped in VXLAN, or with IPsec in transport mode between VTEPs, where the NIC's inline crypto engine encrypts the UDP/VXLAN payload, including the original frame, after encapsulation and before transmission, while the outer IP header stays in the clear. AES-NI is a CPU instruction set and plays no part in NIC-side encryption. Administrators must strictly control trust domain boundaries: CVE-2023-28842, reported against Moby's encrypted overlay networks, shows how an attacker can send plain VXLAN datagrams that are accepted without any check that they belong to an encrypted overlay network.

Logging

Offload is diagnosed mainly through the adapter's hardware counters: the command ethtool -S <interface> | grep -E "err|drop|fail" prints the counters that record drops and failures. Counter names are driver-specific; names along the lines of rx_vxlan_errors or tx_encapsulation_failures point directly at hardware problems with tunnel processing. In the Linux kernel, packets discarded because of checksum errors or incorrectly assembled segments can be observed through the skb:kfree_skb tracepoint, which is available through perf trace and the kernel tracing subsystem, so monitoring scripts can follow the state of the offload pipeline in real time. Programmable DPUs can log operations in P4 pipelines and export per-packet processing state to external collectors over the sFlow protocol without loading the host CPU.
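
The counter check can also be scripted. The following Python sketch mirrors the grep filter above on the text printed by ethtool -S, assuming the usual "name: value" line format, and keeps only non-zero counters. The function name suspicious_counters is hypothetical, and any counter names in sample input are placeholders, since real names vary by driver.

```python
import re

# Same pattern as the grep filter in the text.
SUSPECT = re.compile(r"err|drop|fail")


def suspicious_counters(ethtool_s_output: str) -> dict:
    """Return non-zero counters whose names suggest errors or drops."""
    result = {}
    for line in ethtool_s_output.splitlines():
        # Split on the last colon so prefixed names stay intact.
        name, sep, value = line.rpartition(":")
        name, value = name.strip(), value.strip()
        if sep and value.isdigit() and SUSPECT.search(name):
            count = int(value)
            if count > 0:
                result[name] = count
    return result
```

Running this periodically and comparing successive results shows which counters are still growing, which is usually more informative than their absolute values.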

Limitations

Most ASICs cannot route between different VXLAN Network Identifiers within a single chip, that is, they lack Integrated Routing and Bridging (IRB) for overlays, so the CPU has to take over whenever traffic must be routed between VXLAN segments. Offload is also often incompatible with additional header processing: adding or removing VLAN tags inside a VXLAN tunnel, IGMP snooping for multicast traffic, or MLAG on bridge interfaces almost always disables hardware acceleration and returns packets to the slow software path. Fragmentation settings matter as well: a don't-fragment flag set on the inner packet is inherited by the outer IP header, and if the underlay path MTU is smaller than the encapsulated packet, the hardware engine cannot fragment it and drops the packet with an error.
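
The MTU arithmetic behind the fragmentation problem is simple enough to sketch. The Python example below assumes standard header sizes with no IP options or extension headers; the function names are hypothetical.

```python
INNER_ETHERNET = 14   # inner MAC header carried inside the tunnel
VLAN_TAG = 4          # optional 802.1Q tag on the inner frame
VXLAN = 8
UDP = 8
OUTER_IPV4 = 20       # without IP options
OUTER_IPV6 = 40       # without extension headers


def required_underlay_mtu(inner_mtu: int, ipv6_underlay: bool = False,
                          inner_vlan: bool = False) -> int:
    """Smallest underlay IP MTU that carries the overlay unfragmented."""
    inner_frame = inner_mtu + INNER_ETHERNET + (VLAN_TAG if inner_vlan else 0)
    outer_ip = OUTER_IPV6 if ipv6_underlay else OUTER_IPV4
    return inner_frame + VXLAN + UDP + outer_ip


def fits_without_fragmentation(inner_mtu: int, path_mtu: int, **kwargs) -> bool:
    """True if an encapsulated full-size packet fits the underlay path MTU."""
    return required_underlay_mtu(inner_mtu, **kwargs) <= path_mtu
```

With a 1500-byte inner MTU the underlay needs at least 1550 bytes over IPv4 and 1570 over IPv6, so operators either raise the underlay MTU or lower the overlay MTU to 1450 or less.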

History and development

The evolution began with the first SmartNICs, which had fixed FPGA firmware and could compute checksums in hardware only for non-encapsulated traffic; as VXLAN grew popular in data centers, manufacturers added static header parsers. The next stage was the emergence of full-fledged DPUs and IPUs, in which a dedicated ARM cluster with its own DDR5 memory services the network stack: the Open vSwitch software switch runs directly on the card's processor, and its software data plane is compiled into ASIC rules on the fly through an SDK. Frameworks such as NVIDIA DOCA and the Intel Infrastructure Programmer Development Kit shifted development from low-level register programming to APIs, so that complex processing of VXLAN headers and GBP options can be described in high-level languages and network policies can migrate from the host OS to the hardware.