ARM (Energy efficient execution of processor instructions)

ARM (originally Acorn RISC Machine, later Advanced RISC Machines) is a family of processor architectures designed with low power consumption as a primary goal. Unlike x86 desktop chips, it uses a simplified instruction set with fixed-length encodings. This lets devices run longer on battery and generate less heat, which has made ARM the standard for mobile electronics.

The ARM architecture dominates smartphones, tablets, and smartwatches thanks to its power efficiency. It underpins single-board computers (Raspberry Pi), smart home systems, automotive controllers, and medical implants. In the server segment, ARM is gaining popularity in cloud computing (AWS Graviton), and it is now used in laptops (Apple Silicon, Snapdragon X Elite) to extend battery run time.

Typical ARM problems

The main limitation is incompatibility with software compiled for the x86 architecture, which has to run under emulation or binary translation, usually at a performance cost. The ARM ecosystem is historically fragmented: chip manufacturers customize cores and surrounding hardware, which makes universal operating system images harder to create than in the PC world. At high clock speeds, energy efficiency drops sharply, so ARM designs rarely reach the peak frequencies of x86 processors while staying within their thermal budget.

How ARM works

ARM is based on the RISC (Reduced Instruction Set Computing) philosophy. The key difference from CISC architectures such as x86 is fixed-length instruction encoding: every instruction is 32 bits wide in the classic ARM (A32) set and in AArch64, while Thumb uses 16-bit encodings and Thumb-2 mixes 16-bit and 32-bit ones. Fixed-width encodings let the decoder find instruction boundaries and extract fields with simple hardware, without the variable-length parsing and microcode sequencing that x86 decoders need. Almost all operations follow the register-to-register (load/store) principle: data is first loaded from memory into registers by a separate load instruction, computation happens on registers, and the result is written back by a separate store instruction. This keeps the pipeline simple and saves transistors.
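
As a minimal sketch of why fixed-width encodings are cheap to decode, the C fragment below extracts the fields of one A32 data-processing instruction (register operand shifted by an immediate) with plain shifts and masks. The type and function names (a32_dp_fields, a32_decode_dp, bits) are illustrative assumptions, not part of any ARM toolchain, and only this one instruction form is covered.

```c
#include <stdint.h>

/* Fields of an A32 data-processing instruction whose second operand is a
   register shifted by an immediate. Names are illustrative only. */
typedef struct {
    unsigned cond;       /* bits 31:28, condition code                   */
    unsigned opcode;     /* bits 24:21, operation (0x4 = ADD)            */
    unsigned set_flags;  /* bit 20, S bit: update the status flags       */
    unsigned rn;         /* bits 19:16, first operand register           */
    unsigned rd;         /* bits 15:12, destination register             */
    unsigned shift_imm;  /* bits 11:7, shift amount                      */
    unsigned shift_type; /* bits 6:5, 0 = LSL, 1 = LSR, 2 = ASR, 3 = ROR */
    unsigned rm;         /* bits 3:0, second operand register            */
} a32_dp_fields;

/* Extract bits hi..lo (inclusive) of a 32-bit word; field width < 32. */
static uint32_t bits(uint32_t word, unsigned hi, unsigned lo) {
    return (word >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

/* Every field sits at a fixed position, so decoding needs no length
   detection and no sequential parsing. */
static a32_dp_fields a32_decode_dp(uint32_t insn) {
    a32_dp_fields f;
    f.cond       = bits(insn, 31, 28);
    f.opcode     = bits(insn, 24, 21);
    f.set_flags  = bits(insn, 20, 20);
    f.rn         = bits(insn, 19, 16);
    f.rd         = bits(insn, 15, 12);
    f.shift_imm  = bits(insn, 11, 7);
    f.shift_type = bits(insn, 6, 5);
    f.rm         = bits(insn, 3, 0);
    return f;
}
```

For example, decoding the word 0xE0810102 (ADD R0, R1, R2, LSL #2) gives cond 0xE (always), opcode 0x4, rn 1, rd 0, shift_imm 2, shift_type 0, and rm 2.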

Energy efficiency also comes from a sophisticated clocking and power management system. In x86, a single complex instruction may be decoded into several micro-operations that are then scheduled out of order, whereas classic in-order ARM cores issue simple instructions at roughly one per cycle. This keeps control logic small and avoids spending energy on reordering. High-performance ARM cores have since become more complex: superscalar and out-of-order execution appeared in the Cortex-A line and is pushed furthest in the flagship Cortex-X cores, which compete with powerful desktop chips. A distinctive feature of the 32-bit architecture is conditional execution in ARM state and IT-block predication in the Thumb-2 instruction set: many instructions either execute or are skipped depending on status flags, which avoids pipeline flushes on short branches. x86 offers only a few conditional instructions (such as CMOVcc) and relies mainly on branch prediction; AArch64 dropped general predication, and modern ARM cores likewise rely on sophisticated branch predictors.
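
The condition check that decides whether a conditionally executed instruction takes effect can be sketched in C as follows. The flag positions and condition encodings follow the 32-bit ARM architecture; the function name condition_passed and the FLAG_* macros are illustrative, and encoding 0xF (reserved for unconditional instructions in later architecture versions) is simply treated as "always" in this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Positions of the condition flags in the CPSR. */
#define FLAG_N (1u << 31) /* negative */
#define FLAG_Z (1u << 30) /* zero     */
#define FLAG_C (1u << 29) /* carry    */
#define FLAG_V (1u << 28) /* overflow */

/* Returns true when an instruction with the given 4-bit condition field
   would take effect for the given flag state. */
static bool condition_passed(unsigned cond, uint32_t cpsr) {
    bool n = (cpsr & FLAG_N) != 0;
    bool z = (cpsr & FLAG_Z) != 0;
    bool c = (cpsr & FLAG_C) != 0;
    bool v = (cpsr & FLAG_V) != 0;

    switch (cond & 0xFu) {
    case 0x0: return z;            /* EQ: equal                     */
    case 0x1: return !z;           /* NE: not equal                 */
    case 0x2: return c;            /* CS/HS: unsigned higher or same */
    case 0x3: return !c;           /* CC/LO: unsigned lower         */
    case 0x4: return n;            /* MI: negative                  */
    case 0x5: return !n;           /* PL: positive or zero          */
    case 0x6: return v;            /* VS: overflow                  */
    case 0x7: return !v;           /* VC: no overflow               */
    case 0x8: return c && !z;      /* HI: unsigned higher           */
    case 0x9: return !c || z;      /* LS: unsigned lower or same    */
    case 0xA: return n == v;       /* GE: signed greater or equal   */
    case 0xB: return n != v;       /* LT: signed less than          */
    case 0xC: return !z && n == v; /* GT: signed greater than       */
    case 0xD: return z || n != v;  /* LE: signed less or equal      */
    default:  return true;         /* AL (0xE); 0xF simplified      */
    }
}
```

An instruction such as ADDEQ carries condition 0x0, so it takes effect only when a preceding comparison has set the Z flag; otherwise it passes through the pipeline without changing any state.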

ARM functionality

  1. Register file and operating modes. A classic 32-bit ARM core contains 31 general-purpose registers, not all visible at once. The current mode determines the active set of 16 registers (R0 to R15, where R13 is the stack pointer, R14 the link register, and R15 the program counter), plus the CPSR status register. In exception modes, some registers are replaced by banked copies (at minimum SP and LR; FIQ mode also banks R8 to R12), so a handler can start without first saving those registers to the stack.
  2. Conditional execution of instructions. In the A32 instruction set, almost every instruction carries a 4-bit condition field in its top bits (31:28). The condition code is checked against the N, Z, C, and V flags in the CPSR. If the condition is false, the instruction has no architectural effect and behaves like a NOP. This mechanism replaces short forward branches and avoids the pipeline flushes that mispredicted branches cause.
  3. Built-in barrel shifter. The second operand of a data-processing instruction can pass through a hardware shift unit. The shifter performs logical shifts, arithmetic shifts, or rotations as part of the same instruction. This allows address expressions and scaled operands to be computed without additional instructions; for example, a multiplication by a power of two folds into the instruction that uses the result.
  4. Thumb code compression state. The 16-bit Thumb instruction set provides high code density while using the same core resources. In early implementations such as ARM7TDMI, Thumb instructions are expanded into their 32-bit equivalents at the decode stage. Switching between ARM and Thumb states is controlled by the least significant bit of the target address in the BX branch instruction and does not require a processor mode change.
  5. Advanced SIMD (NEON). In ARMv7, the NEON unit uses a 256-byte register file, viewable as thirty-two 64-bit (D) or sixteen 128-bit (Q) registers and shared with the VFP unit. It supports integer arithmetic, including saturating operations, and packed single-precision floating point. A single 128-bit NEON instruction processes up to 16 8-bit operands in parallel, accelerating codecs and digital signal filtering.
  6. VFPv3 floating point. The Vector Floating Point unit complements NEON with IEEE 754 double-precision arithmetic. Implementations typically pipeline multiply-accumulate operations. The unit is tightly coupled to the integer core, receives instructions from the same instruction stream, and uses the register file it shares with NEON, separate from the integer registers, so floating-point loads and computations do not compete for integer registers.
  8. Cache memory hierarchy. The cache controller implements a multi-level structure with a Harvard arrangement (separate instruction and data caches) at the L1 level. Prefetchers monitor access patterns and speculatively fill lines before the core requests them. On cores that support cache lockdown, locking ways lets critical code or data be pinned in the cache so it is never evicted, which gives deterministic latencies in hard real-time systems.
  9. System coprocessor CP15. In the 32-bit architecture, virtual memory management, cache configuration, and translation table setup are controlled through CP15. The MCR and MRC instructions give access to core identification registers, the Translation Table Base Registers (TTBR), and domain access control. Writing CP15 registers reconfigures the memory subsystem without resetting the core, for example by enabling the MMU or, on cores that have one, the Memory Protection Unit.
  10. Power management. Dynamic voltage and frequency scaling is implemented through clock frequency control and separate voltage domains. The core automatically clock-gates inactive functional blocks. In retention mode, power to most of the core logic is removed while state is preserved in retention flip-flops, minimizing leakage current in nanometer process technologies.
  11. TrustZone security separation. The hardware extension creates two worlds: secure and normal. Monitor mode switches context between them, and the resources of the secure side are isolated in hardware from the non-secure operating system. Bus transactions carry a Non-Secure (NS) bit, allowing memory and peripherals to enforce access control at the hardware level without software overhead.
  12. Precise interrupt handling. The ARM core captures the processor state at the moment of an exception so that its cause and return address are unambiguously identified. The exception vector table assigns a fixed entry to each exception type; on Cortex-M, the NVIC extends this with a separate vector per interrupt source. The Fast Interrupt (FIQ) has its own banked registers (R8 to R14), so its handler can start work without first saving those registers.
  13. CoreSight hardware debug. The debug and trace subsystem is integrated alongside the core and is accessed through a JTAG or Serial Wire Debug interface. The Embedded Trace Macrocell generates a compressed stream of information about executed instructions in real time. Data trace, where implemented, allows reads and writes of selected variables to be tracked without stopping the core, which is critically important for analyzing hard-to-reproduce race conditions between threads in multitasking operating systems.
  14. Memory Management Unit (MMU). The module performs page-based translation of virtual addresses and, with the Virtualization Extensions, two-stage translation for hypervisor configurations. The Translation Lookaside Buffer (TLB) caches recent entries, and hardware table walks automatically load descriptors on a miss. Page attributes define the caching policy and access rights with a granularity of 1 MB sections or 4 KB small pages.
  15. Exclusive access mechanism. The LDREX and STREX instructions provide atomic memory updates in multiprocessor systems. The exclusive monitor tracks the address accessed by LDREX, marking it as reserved. If another observer writes to that location before the STREX, the store is not performed and STREX returns a nonzero status in its result register, so the program retries the sequence. This is the basis of spinlocks and lock-free algorithms.
  16. Large Physical Address Extension (LPAE). Extending the physical address space to 40 bits in the ARMv7 architecture overcomes the 4 GB limit. Page descriptors become 64-bit entries, allowing up to 1 TB (2^40 bytes) of physical memory to be addressed. The MMU translates the 32-bit virtual addresses of tasks into the extended physical space without any change to user code.
  17. big.LITTLE technology. The heterogeneous computing subsystem combines high-performance and energy-efficient cores that share the same instruction set architecture but differ in microarchitecture. A cache-coherent interconnect lets the operating system migrate threads between core types, typically in well under a millisecond. Each cluster usually has its own L2 cache, kept coherent by the interconnect, so migration is transparent to software, and the Generic Timer gives all cores a common time base.
  18. Interrupt virtualization (GICv2). With the Virtualization Extensions, the Generic Interrupt Controller routes physical interrupts to the hypervisor, which distributes them to guest operating systems. The hypervisor injects virtual interrupts through list registers, and the guest acknowledges and completes them through a virtual CPU interface without trapping to the hypervisor. Maintenance interrupts and saving and restoring the list registers support context switching of virtual machines.
  19. Platform Security Architecture (PSA). The framework defines how a hardware-backed root of trust is isolated. Secure boot verifies firmware integrity along a chain that starts from an immutable bootloader. Peripherals are assigned to the secure or non-secure side when the firmware is built, which prevents the non-secure side from controlling critical sensors or the reset system through driver vulnerabilities.
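
The barrel shifter from item 3 can be modelled in a few lines of C. This is a simplified sketch: it covers immediate shift amounts from 0 to 31 and treats an amount of 0 as "no shift", so the special A32 encodings (such as LSR #32 or RRX) are not modelled. The names shift_type, barrel_shift, and add_shifted are illustrative.

```c
#include <stdint.h>

/* Shift kinds selectable for the second operand. */
typedef enum {
    SHIFT_LSL = 0, /* logical shift left     */
    SHIFT_LSR = 1, /* logical shift right    */
    SHIFT_ASR = 2, /* arithmetic shift right */
    SHIFT_ROR = 3  /* rotate right           */
} shift_type;

/* Apply a shift of 0..31 bits to a 32-bit operand. */
static uint32_t barrel_shift(uint32_t value, shift_type type, unsigned amount) {
    amount &= 31u;
    if (amount == 0u) {
        return value; /* simplification: special encodings not modelled */
    }
    switch (type) {
    case SHIFT_LSL:
        return value << amount;
    case SHIFT_LSR:
        return value >> amount;
    case SHIFT_ASR: {
        /* Replicate the sign bit into the vacated upper bits. */
        uint32_t fill = (value & 0x80000000u) ? ~(0xFFFFFFFFu >> amount) : 0u;
        return (value >> amount) | fill;
    }
    case SHIFT_ROR:
        return (value >> amount) | (value << (32u - amount));
    }
    return value;
}

/* Models ADD Rd, Rn, Rm, <shift> #amount: the shift and the addition
   belong to one instruction. */
static uint32_t add_shifted(uint32_t rn, uint32_t rm, shift_type type,
                            unsigned amount) {
    return rn + barrel_shift(rm, type, amount);
}
```

With a base address of 0x1000 and an index of 3, add_shifted(0x1000, 3, SHIFT_LSL, 2) yields 0x100C, the address of element 3 in an array of 4-byte values, which corresponds to a single ADD with a shifted operand.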

Comparisons

  • ARM vs x86. The ARM architecture implements the RISC philosophy with fixed-length instructions and a focus on energy efficiency, which tends to lower heat dissipation. In contrast, x86 is based on CISC principles: variable-length instructions, some of which combine memory access and computation, are decoded into micro-operations, with microcode handling the most complex ones. This decoding logic increases die area and peak power consumption.
  • Cortex-M vs Cortex-A. Cortex-M processor cores are designed for deterministic processing of real-time microcontroller tasks and support only the Thumb instruction set (Thumb and Thumb-2 encodings). Cortex-A cores, targeted at high-level operating systems, are equipped with a Memory Management Unit (MMU) and, in most designs, superscalar execution pipelines, which gives them far greater computational potential.
  • ARMv8-A AArch64 vs AArch32. The AArch64 execution state introduces 64-bit addressing and an expanded register file of 31 general-purpose 64-bit registers, and it drops predicated execution for most instructions, keeping only a few conditional ones such as CSEL. The AArch32 state retains backward compatibility with the classic 32-bit ARM architecture; on cores that implement both, the execution state can change only on exception entry or return, which lets a 64-bit operating system run legacy 32-bit code.
  • Dynamic prediction vs Static branch prediction. Modern ARM cores employ complex two level adaptive predictors with global branch history tables to minimize pipeline stalls. In contrast, the simplest energy efficient cores often use a static prediction method, where backward loop branches are considered taken, which reduces hardware cost at the expense of prediction accuracy.
  • ARM big.LITTLE vs Intel Hybrid Architecture. big.LITTLE provides heterogeneous computing through clusters of high-performance and energy-efficient cores, with the operating system scheduler moving threads between them based on load; each cluster can have its own clock and voltage domain. Intel's hybrid architecture likewise places cores of different classes in one coherent domain and adds Thread Director, hardware that monitors running threads and gives the OS scheduler hints, favoring performance cores for demanding single-threaded work.
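
The contrast between static and dynamic branch prediction described above can be sketched in C. The static rule is "backward taken, forward not taken"; the dynamic part is a small gshare-style predictor (global history XORed with the branch address, indexing a table of 2-bit saturating counters). This is a textbook simplification chosen for illustration and is not the design of any particular ARM core; all names and the table size are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static rule: a branch to a lower or equal address is assumed to close
   a loop and is predicted taken; forward branches are predicted not taken. */
static bool predict_taken_static(uint32_t branch_pc, uint32_t target_pc) {
    return target_pc <= branch_pc;
}

#define PHT_BITS 10u                /* illustrative table size: 1024 entries */
#define PHT_SIZE (1u << PHT_BITS)
#define PHT_MASK (PHT_SIZE - 1u)

typedef struct {
    uint8_t  counters[PHT_SIZE];    /* 2-bit saturating counters, 0..3 */
    uint32_t history;               /* outcomes of recent branches     */
} gshare_predictor;

/* Combine the branch address with the global history to pick a counter. */
static uint32_t gshare_index(const gshare_predictor *p, uint32_t pc) {
    return ((pc >> 2) ^ p->history) & PHT_MASK;
}

/* Counter values 2 and 3 mean "predict taken". */
static bool gshare_predict(const gshare_predictor *p, uint32_t pc) {
    return p->counters[gshare_index(p, pc)] >= 2u;
}

/* Train the selected counter with the real outcome, then shift the
   outcome into the global history. */
static void gshare_update(gshare_predictor *p, uint32_t pc, bool taken) {
    uint8_t *counter = &p->counters[gshare_index(p, pc)];
    if (taken) {
        if (*counter < 3u) { ++*counter; }
    } else {
        if (*counter > 0u) { --*counter; }
    }
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & PHT_MASK;
}

/* Convenience helper: feed the same outcome for one branch n times. */
static void gshare_train(gshare_predictor *p, uint32_t pc, bool taken,
                         unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        gshare_update(p, pc, taken);
    }
}
```

The static rule needs no storage at all, which is why it suits the smallest cores, while the table-based predictor has to be trained before it predicts a repeatedly taken branch correctly.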

OS and driver support

Support relies on compliance with Arm's Base System Architecture specifications and related platform design documents. Hardware is described to the operating system through ACPI tables or a Device Tree, so the kernel can bind drivers at boot without being built for a specific board. Power, performance, clock, and sensor management goes through the System Control and Management Interface (SCMI), which translates high-level requests from OS drivers into low-level operations carried out by a system control processor. For heterogeneous compute units, a framework of shared buffers and remote procedure calls is used, so that code runs on accelerators in an address space shared with the main processor.

Security

Multi-level isolation is built on the TrustZone hardware extension, which presents two virtual processors on a single physical core and partitions resources by means of a non-secure signal on the system bus. The secure monitor manages world switching, which is triggered by secure interrupts or by the Secure Monitor Call (SMC) instruction. The integrity of the boot chain is guaranteed by firmware authentication rooted in immutable mask ROM code, with each stage verifying the digital signature of the next bootloader. At run time, control-flow protection hinders exploitation of memory vulnerabilities: the Pointer Authentication extension (ARMv8.3-A) computes a cryptographic authentication code for return addresses and indirect branch targets, stores it in unused upper pointer bits, and verifies it before use.

Logging and debug

Tracing is built around the Embedded Trace Macrocell (ETM), which records the stream of executed instructions and timestamps without interfering with CPU operation. Software and operating system events are emitted through the Instrumentation Trace Macrocell (ITM). CoreSight components merge these streams with system trace data from the bus fabric; on microcontrollers, the Serial Wire Viewer can export trace over a single high-speed pin (SWO). For monitoring without stopping the core, the CoreSight infrastructure lets cross triggers be configured between processors and hardware blocks, so that a write to a specified memory range or a performance counter reaching a threshold triggers capture of trace into a circular buffer in embedded RAM.

Architecture limitations

A fundamental constraint of the architecture is that the load/store model is combined with a weakly ordered memory model: the core and memory system may reorder memory accesses, so developers must place explicit barriers (DMB, DSB, ISB) where ordering matters. Omitting them, or misusing the exclusive load and store instructions, leads to hard-to-catch race conditions in drivers. Historically, the lack of a unified standard for interrupt controllers and timers among silicon manufacturers fragmented low-level system code. Execution determinism is limited by the dynamic branch predictor and multi-level caches; although hardware keeps caches coherent within multiprocessor clusters (for example over AMBA CHI), software must still perform cache maintenance operations for agents outside the coherency domain.
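
In portable code, barriers and exclusive accesses are usually expressed through C11 atomics instead of hand-written assembly. The sketch below is a minimal spinlock with acquire and release ordering guarding a counter; on 32-bit ARM, compilers typically lower such operations to LDREX/STREX retry loops with DMB barriers, although the exact instructions depend on the target and compiler. The names demo_spinlock, guarded_counter, and the demo_* functions are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative spinlock built on the C11 atomic_flag type. */
typedef struct {
    atomic_flag held;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock. Acquire ordering keeps later accesses from
   being reordered before the point where the lock is obtained. */
static bool demo_trylock(demo_spinlock *lock) {
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Spin until the lock is taken. */
static void demo_lock(demo_spinlock *lock) {
    while (!demo_trylock(lock)) {
        /* busy-wait; a real implementation would add a pause or yield */
    }
}

/* Release ordering makes all writes performed inside the critical
   section visible before the lock is observed as free. */
static void demo_unlock(demo_spinlock *lock) {
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/* A plain (non-atomic) counter protected by the lock. */
typedef struct {
    demo_spinlock lock;
    long value;
} guarded_counter;

/* Increment under the lock and return the new value. */
static long guarded_increment(guarded_counter *counter) {
    demo_lock(&counter->lock);
    long updated = ++counter->value;
    demo_unlock(&counter->lock);
    return updated;
}
```

Without the acquire and release orderings, a weakly ordered core could make the counter update visible outside the critical section, which is exactly the kind of race that is hard to reproduce on hardware.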

History and development

The concept originated at Acorn Computers in 1983 as a reaction to the excessive complexity of CISC processors, leading to the 32-bit ARM1 chip (first silicon in 1985) with a minimalist design of about 25 thousand transistors and no microcode. Commercially, ARM moved to an intellectual property licensing model, allowing third-party companies to embed processor cores into their own system-on-chip designs with custom peripherals. The transition to the 64-bit AArch64 state in ARMv8 renewed the programming model: it provides thirty-one general-purpose registers, removes predicated execution of most instructions, and introduces a new instruction set with a fixed length of four bytes. It also laid the foundation for the Scalable Vector Extension (SVE), whose vector-length-agnostic registers target high-performance computing and machine learning.