POWER10 (Multithreaded scaling with hardware AI acceleration)

POWER10 is a RISC processor from IBM that implements Power ISA 3.1 and is built to process very large volumes of enterprise data. Imagine a chip that combines powerful cores, high memory bandwidth, and on-die acceleration for artificial intelligence workloads, so that a single server can consolidate work otherwise spread across many ordinary machines while using less energy and producing less heat.

The processor is the foundation of IBM Power enterprise servers such as the E1050, which are oriented toward hybrid cloud computing. Its key application environment is mission-critical SAP HANA and Oracle Database deployments in large banks, where delays of even milliseconds are unacceptable. Insurance companies use POWER10 for mathematical risk modeling, and logistics operators use it for real-time routing. It is also used with Red Hat OpenShift to consolidate hundreds of containerized microservices in minimal physical space.

Typical problems

Operating POWER10 often reveals high power consumption under peak load across all cores, which requires careful tuning of the cooling system. Legacy binaries compiled for x86 do not run natively on POWER: the code must be recompiled for the Power ISA (ppc64le on Linux) or run under emulation, and emulation degrades performance, especially in virtualized environments. Administrators also encounter the noisy neighbor effect between logical partitions (LPARs), where an aggressive workload in one partition temporarily limits the memory throughput available to the others.
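
A deployment script can detect such an architecture mismatch before it tries to start a binary. The following is a minimal sketch in Python; the function names and the sets of machine strings are illustrative assumptions, not an exhaustive list of what an operating system may report.

```python
import platform

# Machine strings commonly reported for the two architecture families.
# These sets are illustrative assumptions, not a complete list.
POWER_MACHINES = {"ppc64le", "ppc64"}
X86_MACHINES = {"x86_64", "amd64", "i686", "i386"}


def classify_machine(machine: str) -> str:
    """Map a platform.machine() style string to a coarse architecture family."""
    name = machine.lower()
    if name in POWER_MACHINES:
        return "power"
    if name in X86_MACHINES:
        return "x86"
    return "other"


def can_run_natively(binary_arch: str, host_machine: str) -> bool:
    """A binary runs natively only when both sides belong to the same known family."""
    binary_family = classify_machine(binary_arch)
    host_family = classify_machine(host_machine)
    return binary_family != "other" and binary_family == host_family


# Example: check whether the current host belongs to the POWER family.
host_is_power = classify_machine(platform.machine()) == "power"
```

On a Linux on Power host, can_run_natively("x86_64", platform.machine()) returns False, which signals that the workload needs a recompiled build or an emulation layer.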

How POWER10 works

At the core of POWER10 is a decision to forgo clock frequency increases in favor of densely placing up to 15 SMT8 cores on a chip manufactured on a 7 nm process. The main innovation is the integration of the Matrix-Multiply Assist facility (MMA, marketed as the Matrix Math Accelerator) into each core. It performs INT8 and BFLOAT16 matrix computations on operands held in registers, so intermediate sums do not travel through RAM. Intel Xeon Scalable processors of the same period accelerate AI inference with the AVX-512 VNNI vector instructions (DL Boost), and a dedicated matrix unit, AMX, arrived only in a later Xeon generation. Unlike AMD EPYC, which focuses on increasing the number of general-purpose x86 cores, the POWER10 architecture offers Memory Inception technology: a cluster of physical servers forms a unified memory pool with shared addressing, which IBM has described as scaling to the petabyte range. When an application on node A requests data absent from local RAM, the PowerAXON interface retrieves it directly from the memory of node B over the SMP interconnect protocol, with no software-managed copy.

Mainstream ARM server processors such as Ampere Altra currently offer no comparable coherence across servers: memory sharing between machines there relies on standard network protocols such as RoCE, which slows the processing of large analytical models and the consolidation of virtualized environments across a data center.
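
The idea of a memory pool with shared addressing can be illustrated with a toy model. The sketch below is not IBM's implementation: the class name, the flat layout of the address range, and the node sizes are assumptions chosen only to show how a global address resolves to an owning node and a local offset.

```python
from bisect import bisect_right
from itertools import accumulate


class PooledAddressSpace:
    """Toy model: several nodes contribute RAM to one flat address range."""

    def __init__(self, node_sizes):
        # node_sizes: number of bytes contributed by each node, in node order.
        self.node_sizes = list(node_sizes)
        # ends[i] is the first global address past the range owned by node i.
        self.ends = list(accumulate(self.node_sizes))

    def locate(self, address):
        """Return (owning_node, local_offset) for a global address."""
        if not 0 <= address < self.ends[-1]:
            raise ValueError("address outside the pooled range")
        node = bisect_right(self.ends, address)
        start = self.ends[node] - self.node_sizes[node]
        return node, address - start

    def is_remote(self, requesting_node, address):
        """True when the address lives in another node's memory."""
        owner, _ = self.locate(address)
        return owner != requesting_node


# Three hypothetical nodes contributing 100, 200 and 50 bytes.
pool = PooledAddressSpace([100, 200, 50])
owner, offset = pool.locate(150)  # falls inside the second node's range
```

In this model a request from node 0 for address 150 is remote, because the address belongs to node 1 at local offset 50. In real hardware the interconnect performs the equivalent lookup, so the application only sees an ordinary load.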

POWER10 functionality

  1. Multilevel branch prediction. The unit uses a combined scheme of local and global predictors with a correlated counter in an associative-access hash table. The history depth reaches 256 entries, minimizing pipeline flushes on deep speculative paths.
  2. Hardware predictive L1 cache prefetch. The mechanism analyzes access patterns and detects regular strides from effective address offsets. The logic issues prefetches before the processor core demands the data, masking the access latency of the lower levels of the memory hierarchy without the overhead of software hints.
  3. Pipelined fixed-point data processing. Execution units implement superscalar issue of up to eight operations per cycle. Operands pass through a non-blocking scheduler with dynamic register renaming to eliminate WAR and WAW false dependencies in tight computational loops.
  4. Matrix acceleration unit with 512-bit accumulators. The matrix extensions take packed INT8 and BF16 operands from the 128-bit vector registers and combine them in fused multiply-accumulate mode into 512-bit accumulator registers. Each instruction performs a small outer-product update, the building block of the matrix products characteristic of neural network convolution kernels.
  5. Hardware decimal arithmetic implementation. The integrated DFU module operates on packed binary-coded decimal numbers of variable length directly, bypassing software conversion to binary format. This radically reduces execution delays for COBOL-like computations and operations with monetary data types.
  6. Multicore coherence architecture. The cache coherence protocol is based on a multilevel directory that tracks MESI-like modification states. SMP connectivity of up to sixteen physical sockets operates over dedicated buses with duplex snoop request transmission.
  7. Dynamic decode width management. The instruction grouping scheme adapts the stream of eight slots to the current level of intracore parallelism. The dispatcher packs independent operations into issue blocks, balancing port load to avoid structural hazards.
  8. Hardware transactional memory. POWER8 and POWER9 let a core execute atomic critical sections speculatively without acquiring classical locks, rolling the state back to a checkpoint on a conflict over a tracked cache line. Power ISA 3.1 removed this facility, so on POWER10 such code must fall back to conventional locks or load-reserve and store-conditional sequences.
  9. POWER10 hierarchical interrupt controller. The router distributes external signals to core priority queues, bypassing the outdated off-chip interface. The mechanism is capable of direct delivery without hypervisor involvement, reducing jitter when processing low-latency I/O traffic.
  10. Transparent memory compression technology. An inline compressor at the RAM controller boundary compacts memory pages before they are written to the memory modules. On a subsequent read, the unit decompresses the data in hardware without generating exceptions. This increases effective memory capacity and throughput without modifying application code.
  11. Second-level translation lookaside buffer. The hierarchical TLB caches virtual address translations with a total coverage of several gigabytes. On a first-level miss, a hit in the second-level structure avoids a lengthy radix tree walk and returns the physical frame number within a few cycles.
  12. On-chip cryptographic acceleration. The subsystem implements AES-GCM symmetric encryption and SHA-3 hashing on hardware pipelines. Streaming processing operates on data held in the cache without spilling intermediate state to memory, which helps keep key material inside the secure execution environment.
  13. Event-Based Branch facility. The mechanism lets an unprivileged program register a handler address for selected events, such as a performance counter overflow. When the event fires, the core redirects execution to the handler as a lightweight branch rather than a full interrupt through the operating system, which removes polling loops from monitoring code.
  14. Power autoscaling module. A distributed network of on-chip activity sensors collects switching telemetry with nanosecond granularity. The controller varies the voltage and frequency of independent domains, keeping the chip within its thermal budget without triggering throttling.
  15. OpenCAPI and OMI link acceleration. The interface relieves the processor cores of managing serialized packet transactions with near-memory devices. The controller operates on abstract command descriptors in shared memory, giving FPGA accelerators direct access to a coherent address space.
  16. Register file with error detection and recovery. The architectural state storage array is augmented with check bits that correct single-bit errors. A background scrubbing routine scans the array through otherwise idle ports and detects accumulating soft errors before they become uncorrectable.
  17. Turbo mode based on neural network load prediction. A predictive model trained on task launch history forecasts phases of high utilization. The power logic raises the frequency and power limits before peak demand actually arrives, so single-thread workloads reach full speed tens of microseconds sooner.
  18. Hardware isolation of virtual machines. The memory access controller assigns protection keys by logical partition identifier. An attempt to read a region tagged for another partition is blocked at the system bus switch without raising an exception to the core, which helps prevent leaks through cache side channels.
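
The multiply-accumulate pattern used by the matrix unit in the list above can be modeled in a few lines. The sketch below is a simplified illustration in Python, not the exact semantics of any Power ISA instruction: it ignores signedness rules, saturation, and register layout, and the function names are hypothetical. Two 4x4 grids of 8-bit integers stand in for two 128-bit vector registers, and a 4x4 grid of 32-bit sums stands in for one 512-bit accumulator.

```python
def int8_rank4_update(acc, x, y):
    """Accumulate acc[i][j] += sum over k of x[i][k] * y[j][k] for 4x4 inputs.

    x and y each model a 128-bit register viewed as a 4x4 grid of 8-bit values;
    acc models a 512-bit accumulator viewed as a 4x4 grid of 32-bit sums.
    """
    for i in range(4):
        for j in range(4):
            acc[i][j] += sum(x[i][k] * y[j][k] for k in range(4))
    return acc


def matmul_via_rank4(a, b_t):
    """Multiply a 4xK matrix a by a Kx4 matrix given transposed as b_t (4xK).

    K is consumed four elements at a time, one accumulator update per chunk,
    so the running sums stay in the accumulator for the whole product.
    """
    k_total = len(a[0])
    if k_total % 4 != 0:
        raise ValueError("inner dimension must be a multiple of 4")
    acc = [[0] * 4 for _ in range(4)]
    for k0 in range(0, k_total, 4):
        x = [row[k0:k0 + 4] for row in a]
        y = [row[k0:k0 + 4] for row in b_t]
        int8_rank4_update(acc, x, y)
    return acc


# A 4x8 matrix of ones times an 8x4 matrix of twos.
ones = [[1] * 8 for _ in range(4)]
twos_t = [[2] * 8 for _ in range(4)]
product = matmul_via_rank4(ones, twos_t)
```

The point of the design is that the sixteen partial sums are updated in place across the whole inner dimension, so only the final 4x4 tile has to be written back to memory.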

Comparisons

  • POWER10 vs POWER9 in core microarchitecture organization. POWER10 implements a significantly wider superscalar core with more execution units. IBM cites an average per-core performance gain of about 30% over POWER9 together with a substantial improvement in performance per watt. Engineers implemented an improved branch predictor and deeper scheduling queues, which reduce pipeline stalls on irregular enterprise-class workloads.
  • POWER10 vs AMD EPYC Milan in the memory subsystem. POWER10 attaches memory through the Open Memory Interface (OMI), a serial link to buffer chips on the memory modules, for which IBM quotes peak bandwidth of up to about a terabyte per second per chip, whereas EPYC Milan relies on eight classic DDR4 channels with lower aggregate bandwidth. The IBM approach allows memory capacity and throughput to scale independently of the media type, at the cost of a small latency penalty introduced by the buffer.
  • POWER10 vs Intel Xeon Ice Lake in hardware security support. Transparent memory encryption in POWER10 is implemented in hardware, requires no application changes, and adds negligible overhead, unlike Intel SGX, which protects only the code and data placed inside enclaves and requires applications to be adapted. Encryption of the full addressable space protects data from physical attacks and insider threats at every stage of the container or virtual machine lifecycle.
  • POWER10 vs Fujitsu A64FX in vector computations. The Matrix-Multiply Assist instructions in POWER10 cover precisions from INT4 and INT8 up to FP64, with the low-precision formats aimed at artificial intelligence inference, while A64FX emphasizes high-precision FP64 computation via the Scalable Vector Extension. The IBM design targets higher operation density per watt for industrial deployment of deep learning models.
  • POWER10 vs SPARC M8 in multithreaded efficiency. POWER10 offers configurable Simultaneous Multithreading with up to eight threads per core, and the number of active threads can be adapted to the character of the workload. SPARC M8 also runs up to eight hardware threads per core. On POWER10 this elasticity lets the operating system use core resources more efficiently under mixed scenarios and avoid degrading single-thread performance.
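
The memory bandwidth comparison above rests on simple arithmetic: theoretical peak bandwidth is the number of channels times the transfer rate times the bytes moved per transfer. The sketch below works this out in Python. The figure of eight DDR4-3200 channels for EPYC Milan is the published configuration, while the sixteen DDR4-3200 channels behind the OMI buffers on POWER10 are an assumption used for illustration. The function name is hypothetical.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak in GB/s: channels x MT/s x bytes per transfer / 1000."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# Eight DDR4-3200 channels with a 64-bit (8-byte) data path each.
epyc_milan = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)

# Assumed: sixteen DDR4-3200 channels reached through OMI buffer chips.
power10_ddr4 = peak_bandwidth_gb_s(channels=16, mega_transfers_per_s=3200)

ratio = power10_ddr4 / epyc_milan
```

These are upper bounds on DRAM-side bandwidth. Sustained throughput in real workloads is lower on both platforms, and the raw signaling rate of the OMI links themselves is a separate and higher figure.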

OS support

POWER10 implements hardware workload isolation through the built-in PowerVM hypervisor, which partitions physical processor cores and memory directly between logical partitions without emulation. Driver support in AIX, IBM i, and Linux on Power builds on a unified Enhanced Error Handling (EEH) event model, in which the platform isolates a malfunctioning PCIe 5.0 endpoint and recovers it without rebooting the host.

Security

The processor microarchitecture includes transparent memory encryption based on 256-bit AES-XTS. Crypto engines are integrated directly into the memory controllers, so the entire DRAM can be encrypted without application changes and, according to IBM, without measurable performance impact. In addition, the Ultravisor, a hardware-enforced security layer, isolates the trusted computing base of protected virtual machines so that even a privileged operating system cannot interfere with them.

Tracing and accounting

The processor provides a built-in Performance Monitoring Unit with six performance monitor counters, capable of capturing cache hits, branch predictions, pipeline stalls, and OMI link bandwidth utilization with minimal software overhead. Sampled results are stored in a hardware ring buffer with cycle-accurate timestamps.
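
Raw counter readings are usually converted into derived metrics such as instructions per cycle (IPC). The sketch below shows that arithmetic in Python, including the case where a counter wraps around between two samples. The 32-bit counter width and the function names are assumptions made for this illustration.

```python
COUNTER_BITS = 32  # assumed counter width for this sketch


def counter_delta(before, after, bits=COUNTER_BITS):
    """Difference between two readings of a counter that wraps at 2**bits."""
    return (after - before) % (1 << bits)


def instructions_per_cycle(instr_before, instr_after, cycles_before, cycles_after):
    """IPC over a sampling interval, from two snapshots of each counter."""
    cycles = counter_delta(cycles_before, cycles_after)
    if cycles == 0:
        raise ValueError("no cycles elapsed between samples")
    return counter_delta(instr_before, instr_after) / cycles


# The cycle counter wrapped between the two samples: 0xFFFFFF38 -> 0x00000000
# is 200 cycles, during which 300 instructions completed.
ipc = instructions_per_cycle(1000, 1300, 0xFFFFFF38, 0x00000000)
```

The modulo in counter_delta is what keeps the result correct across a single wraparound. Sampling often enough that a counter cannot wrap twice within one interval is the caller's responsibility.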

Hardware limitations

POWER10 attaches memory only through the OMI bus to IBM's buffered DDIMM modules, initially based on DDR4, so standard industry DIMM modules are not supported. System scalability tops out at sixteen physical processor sockets, and SMP coherence protocol latency grows as more chips join a single NUMA domain.

Development history

POWER10 is the commercial result of a multi-year research program initiated in 2015 with IBM Research publications on the internal AXI chiplet interconnect and PowerAXON modules. It moved industrial production for the first time to the Samsung 7 nm EUV process and introduced a matrix math accelerator, defined in Power ISA 3.1 as an evolution of the vector extensions of Power ISA 3.0. The first systems, E1080 servers, were delivered to customers in September 2021.