AArch64 (64-bit processor architecture with fixed instruction length)

AArch64 is the 64-bit execution state of the Arm architecture, introduced with ARMv8-A. It uses a fixed-length instruction set (A64) with 64-bit registers and addresses, and gives programs a far larger virtual address space than the older 32-bit versions allow.

The architecture underpins virtually all modern smartphones, tablets, and single-board computers such as the Raspberry Pi. It is used in server solutions from cloud providers, including AWS Graviton, as well as in the latest laptops and desktop systems with Apple M series and Snapdragon X Elite processors.

Typical AArch64 problems

Legacy 32-bit software cannot run in the AArch64 state directly: it needs either a core that also implements AArch32 or an emulation layer. Ported code can fail in non-obvious ways because of differences in the memory model and in data alignment rules. Pointers that are twice as wide also increase memory footprint and cache pressure, which can cost performance and power in workloads where 32-bit addressing would suffice.

How AArch64 works

AArch64 (the 64-bit execution state of ARMv8-A and later) follows the reduced instruction set computing (RISC) model. Every A64 instruction occupies exactly 32 bits, which simplifies fetch and decode but rules out encoding a large constant in a single instruction word. The processor provides thirty-one 64-bit general-purpose registers and a separate set of thirty-two 128-bit registers shared by floating-point and Advanced SIMD (NEON) operations.

Unlike the legacy A32 instruction set of AArch32 (ARMv7), where almost every instruction could be executed conditionally, A64 expresses conditional control flow mainly through explicit compare and branch instructions plus a small group of conditional select and compare instructions. Removing the condition field from each opcode frees encoding space and reduces decoder load.

Exception handling has been radically reworked. Instead of banked register sets for the different processor modes, AArch64 defines four exception levels, EL0–EL3, each with its own stack pointer. On exception entry the hardware saves the return address in ELR_ELx and the processor state in SPSR_ELx, then branches to a fixed offset in the vector table, which keeps entry into the kernel, hypervisor, or secure monitor short.

The virtual memory system is based on multi-level page tables with a translation granule of 4, 16, or 64 KB. The baseline memory management unit supports virtual addresses of up to 48 bits, a compromise between addressable range and the cost of storing and walking the tables.

Compared to x86-64, AArch64 has a weakly ordered memory model: the hardware may reorder loads and stores to different addresses more aggressively. Code that shares data between cores therefore needs explicit barriers or acquire/release instructions, and compilers and programmers are responsible for emitting them, whereas x86-64 provides a significant portion of these ordering guarantees in hardware.
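
The page-table structure described above can be made concrete with a small sketch. It assumes the 4 KB granule and a 48-bit virtual address, where each of four lookup levels consumes 9 address bits and the low 12 bits select a byte within the page. The function name split_virtual_address and the field names l0 to l3 are illustrative and not taken from any Arm document or library.

```python
PAGE_SHIFT = 12   # 4 KB granule: 12 bits of in-page offset
INDEX_BITS = 9    # 512 eight-byte descriptors fit in one 4 KB table
LEVELS = 4        # lookup levels 0..3 cover 4 * 9 + 12 = 48 bits


def split_virtual_address(va: int) -> dict:
    """Split a 48-bit virtual address into table indices and page offset."""
    if not 0 <= va < (1 << 48):
        raise ValueError("expected a 48-bit virtual address")
    fields = {"offset": va & ((1 << PAGE_SHIFT) - 1)}
    for level in range(LEVELS):
        # Level 0 uses the highest index bits (47:39), level 3 the lowest (20:12).
        shift = PAGE_SHIFT + INDEX_BITS * (LEVELS - 1 - level)
        fields[f"l{level}"] = (va >> shift) & ((1 << INDEX_BITS) - 1)
    return fields


parts = split_virtual_address(0x7FFF_FFFF_F123)
print(parts)
```

With the 16 KB or 64 KB granule the offset and index widths change, so the constants above would need adjusting. The sketch only shows how an address is carved up; it does not model descriptors, permissions, or the TLB.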

AArch64 functionality

  1. General-purpose register file. The architecture provides thirty-one 64-bit general-purpose registers named X0–X30. The lower 32 bits of each register are accessible via the W0–W30 notation. A write to a W register zeroes the upper 32 bits of the corresponding X register, which avoids a false dependency on the register's previous contents.
  2. Zero register XZR/WZR. Instructions reading XZR return zero, and writes to it are discarded. This provides a constant zero, and a way to discard a result, without occupying a general-purpose register. XZR shares register encoding 31 with the stack pointer SP; which of the two is meant depends on the instruction.
  3. Fixed-length instruction decoding. All instructions are exactly 32 bits in size and 4-byte aligned. Encoding is regular: a small field in the upper bits selects the major instruction group (data processing, loads and stores, branches, SIMD and floating point), and there is no per-instruction condition field. Instruction boundaries are known without pre-decoding, which gives predictable fetch throughput and keeps wide decoders simple.
  4. Program counter model. The program counter PC is not a general-purpose register and cannot be the destination of arithmetic instructions. Code obtains PC-relative addresses with ADR and ADRP or with PC-relative literal loads, and changes the PC only through branch, call, return, and exception instructions. This eliminates a class of errors associated with unintended writes to the program counter.
  5. Saturating and non-saturating arithmetic. The base integer instruction set has no saturating operations and no general predicated execution. Saturating arithmetic is provided by the NEON SIMD instructions instead, which keeps the integer core orthogonal. Scalar integer operations wrap on overflow; saturating behaviour in scalar code has to come from compiler-generated checks or intrinsic functions.
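  6. Base plus offset addressing modes. Loads and stores address memory relative to a base register (an X register or SP). The offset can be a 12-bit unsigned immediate scaled by the access size, a 9-bit signed unscaled immediate, or an index register that is optionally extended and shifted by the access size. Pre-indexed and post-indexed forms, which write the updated address back to the base register, are available for both single-register and paired loads and stores. Loads can also be PC-relative (literal).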
  7. Frame pointer and linkage. Register X29 is reserved by the calling convention as the frame pointer (FP), and X30 serves as the link register (LR). The BL instruction writes the return address to LR as part of the branch. On function entry, a typical prologue saves the FP/LR pair onto the stack with a pre-indexed STP (for example, stp x29, x30, [sp, #-16]!) and points FP at the saved pair, forming a chain of frame records that a debugger or unwinder can walk.
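  8. Virtual memory and address translation. The memory management unit uses multi-level page tables, supporting 4 KB, 16 KB, and 64 KB granule sizes. A virtual-to-physical translation walks up to four levels of tables in the baseline architecture, and the result is cached in the TLB. Under virtualization, a second translation stage controlled by the hypervisor is applied on top. The baseline virtual address space is 48 bits; optional extensions (FEAT_LVA, and FEAT_LPA2 for the smaller granules) extend it to 52 bits.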
  9. Memory barrier model. The architecture implements a weak memory ordering model, so ordering must be requested explicitly. DMB orders memory accesses as observed by other agents, DSB additionally waits for outstanding memory accesses to complete, and ISB flushes the pipeline so that later instructions are fetched again after a context change. LDAR/STLR instructions provide acquire/release semantics on individual accesses without a full barrier.
  10. Exclusive access and monitors. Exclusive access is used for lock-free synchronization. The LDXR instruction loads a value and marks the address in the core's exclusive monitor, and STXR stores the result only if the monitor is still set. STXR writes a status value to a register (0 on success), which lets software build an atomic retry loop.
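  11. Call stack and stack alignment. Each exception level has its own stack pointer (SP_EL0 to SP_EL3); at EL1 and above, software selects between SP_EL0 and the level's own SP_ELx. The base procedure call standard (AAPCS64) defines no red zone below the stack pointer. Code must keep SP 16-byte aligned at calls, and when SP alignment checking is enabled, any load or store that uses a misaligned SP as its base address raises an SP alignment fault.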
  12. Exception level numbering. Execution is divided into the hierarchy EL0 (applications), EL1 (OS kernel), EL2 (hypervisor), and EL3 (Secure Monitor). An exception (synchronous, IRQ, FIQ, or SError) is taken to the same or a higher exception level, never to a lower one; the ERET instruction returns to the same or a lower level. The vector table base address for each level is configured in VBAR_ELx.
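  13. Generic Timer. The architecture defines a system counter and a set of per-core timers, with separate physical and virtual timers and additional timers for the hypervisor and the secure world. The counter is read via CNTPCT_EL0 (physical) or CNTVCT_EL0 (virtual), and each timer has a comparator register such as CNTP_CVAL_EL0. At EL2 the hypervisor applies an offset (CNTVOFF_EL2) to the virtual counter, so guest operating systems see a consistent view of time across migration without modification.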
  14. GIC interrupt organization. The Generic Interrupt Controller supports 16 software-generated interrupts (SGIs) for inter-processor signalling, private peripheral interrupts (PPIs) local to each core, and shared peripheral interrupts (SPIs). The core receives them through two interrupt signals: IRQ for standard handling and FIQ, which is typically reserved for secure or low-latency handlers.
  15. Address constant generation. Since a 64-bit literal does not fit in a 32-bit instruction, constants are built in software. MOVZ writes a 16-bit immediate at one of four 16-bit positions and clears the rest of the register, while MOVK replaces one 16-bit field and leaves the other bits untouched, so the compiler or assembler can synthesize any 64-bit value in at most four instructions. For symbol addresses, the linker fills in the immediates through relocations, most commonly on an ADRP and ADD pair.
  16. SIMD/FP bank set. There are 32 vector registers, V0–V31, each 128 bits wide. The same registers hold scalar floating-point values: the lower 64 bits are addressable as D-registers, the lower 32 bits as S-registers, and the lower 16 bits as H-registers. Sharing one register file eliminates redundant transfers between scalar and vector computations in multimedia algorithms.
  17. Vector per-element operations. NEON performs uniform operations on packed 8-, 16-, 32-, and 64-bit integer elements, as well as on half-, single-, and double-precision floating-point values. Saturating arithmetic is available only in this SIMD instruction subset, which also provides widening (long) multiply variants.
  19. Indirect table addressing. The TBL and TBX instructions implement arbitrary byte permutations and lookups. The first source operand is a table of one to four consecutive vector registers, and the second provides the byte indices. When an index is out of range, TBL writes zero to that result byte, whereas TBX leaves the existing destination byte unchanged.
  20. Core cryptographic primitives. The optional AES, SHA-1, and SHA-256 extensions are encoded in the NEON instruction space and operate on the V0–V31 registers. AESE performs the AddRoundKey, SubBytes, and ShiftRows steps of an AES round, and a separate AESMC instruction performs MixColumns. PMULL provides carry-less (polynomial) multiplication for GCM mode. Instruction latency depends on the implementation.
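  21. Feature identification. To identify the presence of extensions, the processor provides ID registers such as ID_AA64ISAR0_EL1 and ID_AA64PFR0_EL1. Privileged software reads their feature bit fields before using CRC instructions, LSE atomic extensions, or half-precision operations; user-space code usually obtains the same information from the operating system (on Linux, for example, through hardware capability flags). This lets one binary run correctly on cores with different feature sets.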
  22. IEEE 754 floating-point handling. The FPU supports denormalized numbers, quiet/signaling NaNs, and four rounding modes, selected dynamically through the RMode field of the FPCR register. Cumulative (sticky) exception flags in the FPSR register, such as IXC for an inexact result, allow error checking to be deferred instead of analyzing status after every instruction.
  23. Branch tracing. The Embedded Trace Macrocell unit generates a stream of packets, optionally timestamped, that records indirect branches and exceptions. Filtering by address range and core context keeps the trace within the bandwidth of the debug port, so execution timing can be analyzed in real time on cores running at gigahertz clock rates.
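
The TBL and TBX lookups from the list above differ only in how they treat an out-of-range index, which a short behavioural model makes visible. This is a sketch of the semantics on byte strings, not of the instruction encodings; the function names tbl and tbx are chosen for illustration, and the model accepts a table of any length, whereas the hardware table is 16, 32, 48, or 64 bytes long.

```python
def tbl(table: bytes, indices: bytes) -> bytes:
    """Model TBL: an out-of-range index produces a zero byte."""
    return bytes(table[i] if i < len(table) else 0 for i in indices)


def tbx(dest: bytes, table: bytes, indices: bytes) -> bytes:
    """Model TBX: an out-of-range index keeps the existing destination byte."""
    if len(dest) != len(indices):
        raise ValueError("destination and index vectors must have equal length")
    return bytes(table[i] if i < len(table) else d
                 for d, i in zip(dest, indices))


# One 16-byte table register holding 0x10..0x1F. The indices reverse it,
# except for the last lane, whose index 0xFF is out of range.
table = bytes(range(0x10, 0x20))
indices = bytes([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0xFF])
old_dest = bytes([0xAA] * 16)

print(tbl(table, indices).hex())
print(tbx(old_dest, table, indices).hex())
```

The two printed vectors agree in the first fifteen lanes and differ in the last one: zero for TBL, the old destination byte for TBX. That difference is what makes TBX suitable for chaining lookups over tables larger than four registers.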

Comparisons

  • AArch64 vs x86-64. AArch64 is a load/store RISC design with a fixed instruction length of 32 bits, whereas x86-64 uses variable-length instructions of 1 to 15 bytes. This gives ARM processors more predictable decoding and simplifies the implementation of wide superscalar pipelines, while the variable length of x86-64 creates complexity at the fetch stage but allows complex operations to be encoded more compactly.
  • AArch64 vs IA-64 (Itanium). AArch64 is a conventional RISC instruction set that leaves instruction scheduling to the implementation, with cores ranging from simple in-order designs to wide out-of-order ones, unlike the statically scheduled EPIC approach of IA-64. The IA-64 decision to shift the burden of parallelism detection and software pipelining of loops onto the compiler proved excessively complex in practice, whereas dynamic scheduling in hardware, as used by high-performance AArch64 cores, has proven versatile across a wide range of tasks.
  • AArch64 (A64) vs AArch32 (A32). The A64 mode offers 31 general-purpose registers of 64-bit width versus 15 registers in the 32-bit A32 mode (R0–R14, with the program counter exposed as R15), considerably reducing the frequency of stack accesses. A64 removed conditional execution, a hallmark of A32, from almost all instructions and replaced it with compact conditional selects (CSEL and related instructions), which frees encoding space and simplifies high-performance implementations.
  • AArch64 vs RISC-V (RV64). AArch64 is a mature proprietary architecture licensed by Arm, with a vast ecosystem and strong binary compatibility across implementations, while RV64 is an open standard with a modular structure and the ability to customize the instruction set. AArch64 defines instructions for cryptography acceleration, atomic read-modify-write memory operations, and speculation barriers within a single versioned specification, whereas similar capabilities in RISC-V are placed in optional extensions, which provides flexibility but may lead to platform fragmentation.
  • AArch64 SIMD (NEON) vs x86-64 SIMD (AVX-512). The NEON subsystem operates on 32 registers of fixed 128-bit width and targets media and signal-processing workloads on power-constrained devices, while AVX-512 in x86-64 uses 32 registers of 512 bits for massive vector computations. The narrower NEON units avoid the clock frequency reductions that some x86-64 processors apply when wide AVX-512 units are active, at the cost of lower peak throughput in high-performance double-precision computing.

OS and driver support

AArch64 structures execution around an exception model with four levels (EL0–EL3): applications run at EL0, the operating system kernel at EL1, and the hypervisor at EL2, which gives clear hardware privilege isolation. Platform hardware is described to the OS through Device Tree or ACPI, and the core is configured through architecturally defined system registers, so the OS can abstract from the specific platform. Drivers access peripherals through mappings with Device memory attributes and must respect the weakly ordered memory model, using explicit synchronization barriers (DMB, DSB) wherever the order of accesses matters.

Security

Security is based on hardware separation of contexts. The TrustZone extension divides execution into Secure and Non-secure worlds, and switches between them are mediated by the Secure Monitor at EL3 in response to a Secure Monitor Call (SMC). Pointer Authentication (PAC), added in ARMv8.3, computes a cryptographic signature for a pointer with the PACIA/PACIB instructions, stores it in the pointer's unused upper bits, and verifies it with AUTIA/AUTIB before the pointer is used, which hinders return-oriented programming attacks. The Memory Tagging Extension (MTE), added in ARMv8.5, assigns 4-bit tags to 16-byte granules of memory and checks them against a tag carried in the pointer, detecting spatial and temporal pointer errors.
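
How memory tagging catches both kinds of error can be shown with a toy model. It assumes 16-byte tag granules and a 4-bit tag held in pointer bits 59:56, as in MTE; the class TaggedMemory and its methods are hypothetical names for this sketch and do not correspond to any real API. A failed check would raise a tag check fault on hardware, whereas here it simply returns False.

```python
GRANULE = 16      # MTE tags memory in 16-byte granules
TAG_SHIFT = 56    # the pointer's tag occupies bits 59:56
TAG_MASK = 0xF


class TaggedMemory:
    """Toy model: one 4-bit allocation tag per 16-byte granule."""

    def __init__(self, size: int):
        self.tags = [0] * (size // GRANULE)

    def set_tag(self, addr: int, length: int, tag: int) -> int:
        """Tag the granules covering [addr, addr + length); return a tagged pointer."""
        first = addr // GRANULE
        last = (addr + length + GRANULE - 1) // GRANULE
        for granule in range(first, last):
            self.tags[granule] = tag & TAG_MASK
        return addr | ((tag & TAG_MASK) << TAG_SHIFT)

    def check(self, pointer: int) -> bool:
        """Return True when the pointer's tag matches the granule's tag."""
        addr = pointer & ((1 << TAG_SHIFT) - 1)
        pointer_tag = (pointer >> TAG_SHIFT) & TAG_MASK
        return self.tags[addr // GRANULE] == pointer_tag


mem = TaggedMemory(256)
p = mem.set_tag(32, 32, tag=0x5)   # "allocate" 32 bytes at address 32
in_bounds = mem.check(p + 16)      # still inside the allocation
overflow = mem.check(p + 32)       # one granule past the end (spatial error)
mem.set_tag(32, 32, tag=0x9)       # "free": retag the region
use_after_free = mem.check(p)      # stale pointer (temporal error)
print(in_bounds, overflow, use_after_free)  # True False False
```

With only 16 possible tags, two unrelated allocations can receive the same tag by chance, so detection in a scheme like this is probabilistic rather than guaranteed.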

Logging

Execution tracing in AArch64 is provided by the Embedded Trace Macrocell (ETM), which is configured by writing to its registers and generates a stream of packets containing instruction addresses, timestamps, and process context. Self-Hosted Trace allows debugging without external hardware: the OS kernel records the execution trace into a dedicated ring buffer in system memory via the ETR block. Separately, the Performance Monitoring Unit (PMU) counts microarchitectural events such as cache misses and mispredicted branches in hardware and can generate an interrupt on counter overflow.

Limitations

The primary limitation for developers is the weakly ordered memory model: code that shares data between threads must use barrier instructions or acquire/release operations wherever ordering matters, which complicates porting code from strongly ordered architectures such as x86. Backward compatibility with 32-bit code is not guaranteed, because support for AArch32 (A32/T32) is optional and requires dedicated hardware in the core; where it is absent, programs must be completely rebuilt for 64-bit pointers and the LP64 data model. Finally, because instructions are fixed at 32 bits, large constants and long address offsets need multi-instruction sequences or literal pools, which increases code size.
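
The code-size cost of fixed 32-bit instructions can be made concrete by counting the instructions needed to load a constant. The sketch below builds a MOVZ/MOVK sequence for a 64-bit value and then interprets it to confirm the result. The helper names mov_sequence and execute are illustrative, the target register x0 is arbitrary, and the strategy is deliberately naive: real compilers also use MOVN, bitmask immediates, and PC-relative loads where those are shorter.

```python
def mov_sequence(value: int) -> list:
    """Return a MOVZ/MOVK sequence that loads a 64-bit constant into x0."""
    if not 0 <= value < (1 << 64):
        raise ValueError("expected an unsigned 64-bit value")
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return ["movz x0, #0x0"]
    seq = []
    for n, (shift, chunk) in enumerate(nonzero):
        # The first instruction zeroes the other fields; the rest keep them.
        op = "movz" if n == 0 else "movk"
        seq.append(f"{op} x0, #{chunk:#x}, lsl #{shift}")
    return seq


def execute(seq: list) -> int:
    """Interpret the sequence above to check that it rebuilds the constant."""
    x0 = 0
    for line in seq:
        op, rest = line.split(" ", 1)
        operands = [part.strip() for part in rest.split(",")]
        imm = int(operands[1].lstrip("#"), 16)
        shift = int(operands[2].split("#")[1]) if len(operands) > 2 else 0
        if op == "movz":
            x0 = imm << shift
        else:  # movk: replace one 16-bit field, keep the others
            x0 = (x0 & ~(0xFFFF << shift)) | (imm << shift)
    return x0


for constant in (0x1234, 0x0000_FFFF_0000_1234, 0x1234_5678_9ABC_DEF0):
    seq = mov_sequence(constant)
    assert execute(seq) == constant
    print(f"{constant:#018x}: {len(seq)} instruction(s), {4 * len(seq)} bytes")
```

A constant with all four 16-bit fields populated costs four instructions, or 16 bytes, compared with a single instruction carrying an 8-byte immediate on a variable-length encoding.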

History and development

AArch64 was introduced as part of ARMv8, announced in 2011, as a radical revision of the architecture that abandoned conditional execution of almost all instructions and introduced a flat 64-bit virtual address space. Version ARMv8.1 added atomic memory operations (LSE), accelerating thread synchronization. Version ARMv9 expanded the architecture with the Realm Management Extension for confidential computing and the second-generation Scalable Vector Extension (SVE2), which brings variable-length vectorization to workloads beyond HPC.