An execution unit (EU) is the part of the processor that physically executes instructions: it adds numbers, shifts bits, and accesses memory. After receiving a prepared instruction from the decoder, the EU performs the computation and stores the result. Simply put, it is the calculator inside the core that directly changes the system state.
The Execution Unit is used in all types of central and graphics processors, microcontrollers, and digital signal processors (DSPs). In modern superscalar architectures (Intel Core, AMD Ryzen, ARM Cortex-A), several heterogeneous EUs are engaged simultaneously: arithmetic-logic units for integer operations, floating-point units (FPU) for real-number calculations, and SIMD modules for vector data processing in multimedia and scientific tasks.
The key challenge is execution unit idle time caused by data dependencies: a dependent instruction cannot begin until the EU has computed the result of the previous one. Long chains of memory operations add further delays because RAM access is slow. Speculative execution can leave microarchitectural side effects that attacks such as Spectre exploit. Overheating and the resulting throttling also reduce EU efficiency under peak loads by limiting the clock frequency.
How the Execution Unit works
On receiving a decoded command, the Execution Unit reads operands from the register file, passes them through the functional path specified by the micro-operation (adder, multiplier, shift unit), and commits the result to the target register or updates the status flags. This pipelined process resembles an assembly line: in a simple scalar core, the EU processes each instruction in one or several cycles, while the scheduling logic monitors data readiness. The fundamental difference from a related module, the Instruction Scheduler, is that the scheduler only distributes execution-ready micro-operations to free ports and performs no computations itself. Compared with the Decode Unit, the EU does not interpret the operation code or generate control signals; it carries out an operation that has already been decoded. In superscalar processors, the scheduler dynamically feeds several EUs of different types in parallel, whereas the EUs themselves remain narrowly specialized executors with no logic for deciding execution order. This approach, supported by out-of-order execution, fills cycles with useful work and hides the latency of slow blocks, whether division operations or data loads from cache memory.
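The read, compute, and write-back steps described above can be illustrated with a minimal sketch. The class `ToyEU`, its eight-entry register file, the 32-bit width, and the three-flag set are all assumptions made for illustration; the carry flag here is simplified to "bits were lost outside 32 bits", which is not how a real shifter defines it.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath


@dataclass
class ToyEU:
    """Hypothetical single-issue execution unit with a tiny register file."""
    regs: list = field(default_factory=lambda: [0] * 8)
    flags: dict = field(default_factory=lambda: {"ZF": 0, "CF": 0, "SF": 0})

    def execute(self, op, dst, src1, src2):
        # 1. Read operands from the register file.
        a, b = self.regs[src1], self.regs[src2]
        # 2. Route them through the functional path the micro-op selects.
        if op == "add":
            wide = a + b
        elif op == "sub":
            wide = a - b
        elif op == "shl":
            wide = a << (b & 31)
        else:
            raise ValueError(f"unsupported micro-op: {op}")
        result = wide & MASK32
        # 3. Commit the result and update the status flags.
        self.regs[dst] = result
        self.flags["ZF"] = int(result == 0)
        # Simplified carry: set whenever the exact result did not fit in 32 bits.
        self.flags["CF"] = int(wide != result)
        self.flags["SF"] = result >> 31
        return result
```

Note that the sketch contains no decoding and no ordering decisions: it receives an already selected operation and register indices, which mirrors the division of labor between the decoder, the scheduler, and the EU.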
Execution Unit functionality
- Micro-operation intake and dispatch. The execution unit receives already decoded micro-operations (uOps) from the queue as input, not the original CISC instructions. The scheduler analyzes operand readiness and the availability of specific ports. When all conditions are met, the uOp is dispatched to a free execution port for immediate processing.
- Register file and renaming. The EU works exclusively with the physical register file (PRF) rather than with the logical names of architectural registers. Before execution, the rename stage uses a register alias table (RAT) to map each architectural register to a physical one. This eliminates false WAR and WAW dependencies, allowing instructions to be executed speculatively and out of order.
- Data readiness scheduling. The Reservation Station constantly monitors the Common Data Bus. As soon as the needed operand is computed and broadcast with the matching tag, the station captures the value directly from the bus, without waiting for it to be written to and re-read from the register file. This is the wake-up mechanism that immediately turns a waiting uOp into one ready for dispatch to execution.
- Speculative execution. The execution unit computes instructions without waiting for preceding branches to resolve. Results are marked as speculative and held in the Reorder Buffer (ROB). If a branch turns out to be mispredicted, the hardware discards the speculative results, clears the pipeline, and restores the register state from a checkpoint.
- Arithmetic-logic operations (ALU). Integer clusters inside the EU process basic operations of addition, subtraction, bitwise shifts, and logical functions. Modern microarchitectures use several homogeneous ALUs capable of delivering a result in one cycle. This is critically important for computing effective addresses during memory load/store operations.
- Integer multiplication and division. High-performance EUs contain dedicated hardware multipliers, typically with a latency of 3–4 cycles. Division is implemented iteratively on a separate unit, for example with SRT or Newton-Raphson algorithms; because the divider is not fully pipelined, it takes significantly more cycles. The scheduler must account for the port being blocked for that long.
- Vector extensions (SIMD). Specialized EU ports execute packed data processing instructions such as MMX, SSE, and AVX. Each operation splits a wide register (256-bit or 512-bit in recent extensions) into parallel lanes. Arithmetic on floating-point vectors, saturating integer additions, and masked element permutations are executed here at peak throughput.
- Floating-point computations (FPU). The module processes scalar and packed real numbers according to the IEEE 754 standard. The hardware logic includes an FMA (Fused Multiply-Add) unit, which performs multiplication followed by addition with a single rounding at the end, without intermediate rounding. The FPU usually has a deep pipeline (latency of 4–8 cycles) and, in many designs, a register file separate from the integer one.
- Address Generation Unit (AGU). The AGU is part of the EU and is responsible for computing virtual addresses in memory operation instructions. The block adds the base register, scaled index, and displacement. Speculative execution of the AGU allows precomputing the load address before bound checking is completed, speeding up access to the L1 data cache.
- Memory access logic (Load/Store). Load/store ports transform the uOp into requests to the memory subsystem. The load unit can forward data directly from the store buffer (Store-to-Load Forwarding) if the address matches that of a pending store, bypassing the cache. This reduces latency and matters most for code that reads a value shortly after writing it.
- Reorder Buffer (ROB). Although the ROB is often treated as a logically separate block, it is tightly coupled to the EU. EU results are held in the ROB until the instruction retires. Status bits in the ROB record exceptions and speculation failures, triggering the pipeline flush mechanism.
- Predicate registers and masking. In architectures with explicit predicated execution, the EU computes the predicate condition and writes it directly to the corresponding predicate register. Conditional move logic (CMOV) and AVX-512 masking are implemented inside the execution units themselves, which makes it possible to disable writes to individual vector elements without expensive branching.
- Flag operation processing. The execution unit computes the flags register (EFLAGS/RFLAGS) in parallel with the main result. Generation of the zero, carry, overflow, and sign states is integrated into the ALU path. To resolve flag dependencies without serializing the code, the core keeps a flag mapping table, so that flags are tracked much like renamed registers.
- Per-cycle micro-operation issue. In a superscalar core, the dispatcher selects several ready uOps and sends them to free execution ports within a single cycle. Because the ports are asymmetric, the dispatcher has to account for their different capabilities: if two multiplications are ready but only one multiplier port is available, the second uOp stays in the Reservation Station.
- Hardware zero values (Zero Idiom). A special EU circuit recognizes register zeroing idioms (e.g., XOR reg, reg). Instead of actually passing through the ALU, the result is directly marked in the ROB as zero without physical computation. This removes the dependency on the previous register value, breaking the chain of false dependencies and saving energy.
- Micro-assert and exception handling. When an arithmetic overflow or an illegal FPU operation occurs, the EU does not interrupt the pipeline immediately. The exception is marked in the micro-operation status and deferred until the retirement stage. This guarantees a precise interrupt, allowing the error to be handled only when the instruction becomes architecturally visible.
- Execution unit pipelining. The pipeline depth of the FPU or integer divider directly influences the EU design. Until a result has emerged from the pipeline, the scheduler cannot launch an instruction that depends on it. To compensate, the EU implements a bypass network that forwards the result straight back to the inputs, saving the cycles that would otherwise be spent waiting for the register file.
- Bypass Network. This is a forwarding network inside the EU that connects the outputs of the execution units directly to their inputs and to the scheduler queues. An ALU result that has not yet been written to the register file is multiplexed into the dependent uOp. Without this network, latency would grow by the cycles needed to write the result to the register file and read it back.
- Level 1 cache memory interface. Through the AGU and load ports, the EU is directly connected to the L1 Data Cache banks. On a miss, the unit allocates a fill buffer and does not stall while the request goes to L2: the pipeline is freed for instructions that do not depend on the missing data, so the cache keeps servicing further accesses in a non-blocking manner.
- Power management. A modern EU allows clock gating of idle SIMD modules or integer clusters at the microarchitectural level. Power gating is applied if the dispatcher sees no flow of corresponding instructions within the execution window, reducing static current leakage without software intervention.
- Tracing and profiling. Built-in Performance Monitoring Units (PMU) are connected to the EU to measure pipeline stalls, port utilization, and forwarding delays. This non-invasive hardware monitoring lets a profiler pinpoint heavy straight-line code sections that are bottlenecked by the throughput of the arithmetic or logic blocks.
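The wake-up and port-limited issue steps from the list above (a tag broadcast on the common data bus, then selection of ready uOps for free ports) can be sketched as follows. This is a simplified model under stated assumptions: the names `UOp` and `ReservationStation`, the port kinds "alu" and "mul", and the oldest-first selection policy are illustrative choices, not a description of any specific microarchitecture.

```python
from dataclasses import dataclass, field


@dataclass
class UOp:
    name: str
    port_kind: str                                 # hypothetical port class, e.g. "alu" or "mul"
    dst_tag: int                                   # tag of the result this uOp will produce
    waiting: set = field(default_factory=set)      # source tags not yet available
    operands: dict = field(default_factory=dict)   # tag -> value captured from the bus

    def ready(self):
        return not self.waiting


class ReservationStation:
    def __init__(self, ports):
        self.ports = dict(ports)   # port kind -> issue slots per cycle
        self.entries = []          # kept in program (age) order

    def insert(self, uop):
        self.entries.append(uop)

    def broadcast(self, tag, value):
        """Wake-up: a completed result and its tag appear on the common data bus."""
        for uop in self.entries:
            if tag in uop.waiting:
                uop.waiting.discard(tag)
                uop.operands[tag] = value   # captured straight off the bus

    def issue(self):
        """Select for one cycle: oldest ready uOps first, limited by free ports."""
        free = dict(self.ports)
        issued = []
        for uop in list(self.entries):
            if uop.ready() and free.get(uop.port_kind, 0) > 0:
                free[uop.port_kind] -= 1
                issued.append(uop)
                self.entries.remove(uop)
        return issued
```

With one "mul" port and two multiplications woken by the same broadcast, the first call to `issue()` returns only the older multiplication, and the second one leaves the station a cycle later, which is the port-asymmetry case described in the list.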
Comparisons
- EU vs ALU. Unlike the arithmetic-logic unit, which performs only mathematical and bitwise operations, the execution unit is a broader concept. The EU covers the full execution stage of an instruction: it accepts decoded micro-operations, accesses registers, sets status flags, and coordinates data transfer between the ALU, FPU, and the bus, whereas the ALU remains merely the computational core within this pipeline.
- EU vs Control Unit. The control unit fetches instructions from memory and generates synchronizing signals for all processor modules, setting the order of operation. The execution unit, in contrast, receives ready-made control vectors and is focused solely on data transformation. If the Control Unit conducts the orchestra, the EU directly plays the part, physically altering the contents of registers and memory in accordance with micro-operations.
- EU vs FPU. The specialized floating-point unit is oriented toward processing real numbers and complex mathematical functions. The execution unit, which integrates the integer ALU and interacts with the FPU, is responsible for overall program execution. In classic designs with a separate coprocessor, when a floating-point instruction is encountered, the EU isolates the operands and passes them to the coprocessor, then loads the result back into the register file to continue the linear execution of the algorithm; in modern cores the FPU is simply one of the functional units behind the execution ports.
- EU vs SIMD Engine. Vector extensions process multiple data elements with a single instruction by exploiting data-level parallelism. The classic execution unit acts in a scalar manner, modifying one value per operation. In modern architectures, the EU has evolved to incorporate SIMD modules as functional components, which allows it to dispatch the instruction stream uniformly and switch between processing modes without stalling the pipeline.
- EU vs AGU. The Address Generation Unit specializes exclusively in computing effective memory addresses and array indexing for load and store operations. The execution unit, on the contrary, is responsible for the data transformation itself, not for navigating the address space. They function in tandem: the AGU computes the target pointer, and the EU performs the arithmetic or logical operation on the value fetched from the generated address.
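The AGU and EU tandem from the last comparison can be shown in a few lines. The sketch below assumes x86-style addressing (base + index × scale + displacement, with scale limited to 1, 2, 4, or 8) and models memory as a plain dictionary; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit address and data width


def agu_effective_address(base, index, scale, disp):
    """AGU step: base + index * scale + displacement, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK64


def load_and_add(memory, base, index, scale, disp, addend):
    """The AGU computes the pointer; the ALU then transforms the loaded value."""
    addr = agu_effective_address(base, index, scale, disp)
    value = memory[addr]               # load port: fetch from the generated address
    return (value + addend) & MASK64   # ALU: the actual data transformation
```

For example, with `memory = {0x1000 + 3 * 8 + 16: 40}`, the call `load_and_add(memory, 0x1000, 3, 8, 16, 2)` first resolves the address and then adds 2 to the loaded value 40.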
OS and driver support
The execution unit interacts with the operating system through execution contexts and interrupt vector tables. Drivers register handlers for hardware events in these tables, and during a thread switch the task scheduler changes the EU state by saving and restoring the register file, stack pointers, and condition flags.
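As a rough illustration of the save and restore step, the sketch below models a context as a snapshot of registers, stack pointer, and flags. `CpuState`, `switch_context`, and the thread-id keyed dictionary are hypothetical names for this example only; a real kernel does this in assembly and saves considerably more state.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CpuState:
    """Assumed architectural state visible to a thread."""
    regs: list = field(default_factory=lambda: [0] * 16)
    sp: int = 0
    flags: int = 0


def switch_context(cpu, saved_contexts, old_tid, new_tid):
    """Save the running thread's state, then load the next thread's state."""
    saved_contexts[old_tid] = copy.deepcopy(cpu)   # snapshot of the outgoing thread
    nxt = saved_contexts[new_tid]
    cpu.regs = list(nxt.regs)                      # restore the register file
    cpu.sp = nxt.sp                                # restore the stack pointer
    cpu.flags = nxt.flags                          # restore the condition flags
```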
Security
Protection at the execution unit level is implemented through hardware access-rights checks based on segment descriptors and page tables. The hardware compares the privilege level of the current task with the rights required by each memory access and each privileged instruction, which prevents user code from executing privileged commands and keeps the address spaces of processes isolated.
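A heavily simplified sketch of such a page-level check is shown below. It uses the x86 page-table entry bit positions for Present, Read/Write, and User/Supervisor, and treats privilege level 3 as user mode; it deliberately ignores segment checks, execute-disable, and the rules that let supervisor code write to read-only pages, so it is an illustration rather than the full architectural algorithm.

```python
PTE_PRESENT = 1 << 0    # page is mapped
PTE_WRITABLE = 1 << 1   # writes are allowed
PTE_USER = 1 << 2       # user-mode accesses are allowed


def access_allowed(pte, cpl, is_write):
    """Return True if an access at privilege level `cpl` passes the page check."""
    if not (pte & PTE_PRESENT):
        return False                   # would raise a page fault: not present
    if cpl == 3 and not (pte & PTE_USER):
        return False                   # user code touching a supervisor page
    if is_write and not (pte & PTE_WRITABLE):
        return False                   # simplified: read-only for every level
    return True
```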
Logging
Instruction flow tracing in the EU relies on built-in debug registers and the trace (trap) flag in the status register. When the flag is set, the processor generates a debug exception after each instruction, allowing an external debugger to read the branch address, register contents, and operands for step-by-step analysis.
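The single-step behavior can be mimicked with a small interpreter loop. The sketch assumes the x86 position of the trap flag (bit 8 of EFLAGS); the program representation, `run_with_trace`, and the `on_debug` callback standing in for the debug exception handler are hypothetical.

```python
TF_BIT = 1 << 8  # trap flag position in x86 EFLAGS


def run_with_trace(program, regs, flags, on_debug):
    """Execute (dst, fn) steps; with TF set, report state after every step."""
    for pc, (dst, fn) in enumerate(program):
        regs[dst] = fn(regs)           # the "execution" of one instruction
        if flags & TF_BIT:
            on_debug(pc, dict(regs))   # stands in for the debug exception
    return regs


# Usage: collect a trace of a two-instruction toy program.
trace_log = []
toy_program = [("a", lambda r: 5), ("b", lambda r: r["a"] + 1)]
run_with_trace(toy_program, {"a": 0, "b": 0}, TF_BIT,
               lambda pc, snapshot: trace_log.append((pc, snapshot)))
```

With the flag cleared, the same loop runs to completion without invoking the callback, which corresponds to normal, untraced execution.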
Parallelism limitations
A scalar execution unit can process at most one instruction per cycle, which creates a fundamental throughput limitation. Only the introduction of pipelining, with its division into fetch, decode, and execution stages, allows the delays of sequential processing to be partially overlapped.
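The effect of this overlap can be estimated with the textbook cycle-count model, sketched below under idealized assumptions: every stage takes exactly one cycle and there are no stalls, hazards, or branch mispredictions. The function names are illustrative.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then retire one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1
```

For 100 instructions on a three-stage pipeline (fetch, decode, execute), the model gives 300 cycles without overlap and 102 cycles with it. Throughput still approaches, but never exceeds, one instruction per cycle, which is the limitation stated above.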
History and development
The evolution of the EU can be traced from an elementary arithmetic-logic device with a fixed set of micro-commands to superscalar implementations with out-of-order execution and speculative instruction execution, where the unit dynamically reorders the instruction stream and predicts branches for maximum functional unit utilization.