The ALU (arithmetic logic unit) is the computational core of the processor, performing its integer arithmetic and logic operations. Imagine a fast calculator inside the chip: it receives two numbers and an operation code, adds, subtracts, or compares them, typically within a single clock cycle, and delivers the result. Without this circuit, the computer could not perform a single calculation.
The arithmetic-logic unit is used in central and graphics processors, microcontrollers, digital signal processors, and specialized integrated circuits. It is a mandatory functional block of any computational core, from the simplest industrial controllers to the superscalar cores of server chips. The ALU is also embedded in programmable logic arrays and systems-on-chip, enabling numerical data processing in telecommunications equipment, automotive control units, and embedded real-time systems.
When designing an ALU, the main challenge is the trade-off between clock frequency and the carry propagation delay within adders, which limits maximum performance. Power consumption and heat dissipation are also a concern because of continuous transistor switching, especially in mobile devices. Another challenge is correctly handling exceptional conditions such as integer overflow and division by zero, which requires hardware generation of status flags and exception signals for the processor’s control logic.
How the ALU works
The operating principle of the ALU is based on multi-channel combinational circuits: the input operands are fed to functional blocks that perform several types of operations in parallel, and an output multiplexer routes to the output the result of the one operation designated by the control code. Lacking a dedicated multiplier, a classic ALU performs multiplication as a series of shifts and additions or delegates it to an external hardware block. Addition is implemented through a cascade of full adders, usually with a carry-lookahead chain, which is faster than the ripple-carry propagation typical of simpler circuits. Logical operations, unlike arithmetic ones with their inter-bit connections, are performed bitwise and independently, so their circuit implementation is significantly simpler and requires no carry propagation mechanism. Floating-point units, which are often confused with the ALU, process the mantissa and exponent separately, whereas a classic ALU works only with integer data or fractional data in fixed-point format, giving exact and fast calculations without result normalization.
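The "compute everything, then select" principle described above can be sketched in a few lines of Python. This is only an illustrative model: the 8-bit width, the opcode names (`OP_ADD`, `OP_AND`, `OP_OR`, `OP_XOR`), and the function `alu` are assumptions made for the example, not the encoding of any real processor.

```python
# Minimal behavioural model of an ALU output multiplexer.
# WIDTH and the opcode values are hypothetical, chosen for illustration.
WIDTH = 8
MASK = (1 << WIDTH) - 1

OP_ADD, OP_AND, OP_OR, OP_XOR = range(4)


def alu(op: int, a: int, b: int) -> int:
    """Evaluate every functional block, then select one result by opcode."""
    a &= MASK
    b &= MASK
    # In hardware these blocks all work in parallel on the same inputs.
    candidates = {
        OP_ADD: (a + b) & MASK,  # adder output, truncated to the word width
        OP_AND: a & b,           # bitwise blocks need no carry chain
        OP_OR: a | b,
        OP_XOR: a ^ b,
    }
    # The dictionary lookup plays the role of the output multiplexer.
    return candidates[op]
```

For example, `alu(OP_ADD, 200, 100)` wraps around to 44 in this 8-bit model, because the carry out of the top bit is discarded here; a fuller model would report it as a flag.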
ALU functionality
- Operand structure and bit width. The ALU operates on binary words of fixed length, usually a power of two such as 8, 16, 32, or 64 bits. The bit width determines the maximum size of integer data processed in one clock cycle. Inputs A and B accept operands, and output Y carries the result of the operation on the full word.
- Operation selector decoding. The operation code arrives at the selector input and is decoded to activate the required functional node. The multi-bit code determines whether addition, logical shift, or bitwise conjunction will be performed. The decoder strobes the corresponding paths, minimizing parasitic activity of unused logic.
- Implementation of a carry-lookahead adder. The basic arithmetic block is the full adder, organized either as a ripple-carry cascade or according to a parallel carry scheme. To reduce delay, a carry-lookahead scheme computes generate and propagate signals at the bit-group level, which is critically important for high-frequency, multi-bit ALUs.
- Two’s complement arithmetic. Negative numbers are represented in two’s complement, which allows the same adder to be used for both addition and subtraction. The subtraction operation is implemented by inverting the bits of operand B and adding one at the carry input, transforming the equation into A + (–B).
- Carry and borrow flag. The Carry flag flip-flop captures the carry out of the adder’s most significant bit. During addition, it signals an overflow of the unsigned range. During subtraction, the convention depends on the architecture: where the raw adder carry is stored (as on ARM), the inverted state of the flag acts as a borrow, while other designs (such as x86) store the borrow directly. Either way, the flag lets the processor implement multi-word integer arithmetic correctly.
- Overflow flag. The Overflow flag indicates that the sign of the result is wrong for signed arithmetic, a condition the Carry flag alone does not reveal. The generation logic examines the carries into and out of the sign bit: the exclusive OR of these two signals indicates that the result lies outside the range representable in two’s complement.
- Zero and negative result flags. The Zero flag is set when all bits of the output bus are zero, which is implemented by a multi-input NOR gate. The Negative or Sign flag duplicates the state of the result’s most significant bit, providing a fast hardware response to the sign without reading the entire register.
- Bitwise logical operations. Combinational logic implements a basis of Boolean functions over pairs of bits. AND, OR, and XOR operations are performed independently for each bit without inter-bit propagation. The hardware simplicity allows results to be obtained with a minimal delay of only one or two gate levels.
- Inversion operation. The unary NOT operation is implemented as a special case of bitwise logic or by feeding a constant to the second operand. In hardware, it is a set of inverters on each bit of the bus. It is often schematically included in the logic operations block or the operand preparation path.
- Arithmetic shifts. The arithmetic shift right operation preserves the value of the sign bit, copying it into the vacated high-order positions. This provides the equivalent of division by a power of two in two’s complement, with the result rounded toward negative infinity rather than toward zero (for example, -3 shifted right by one gives -2). A left shift by one position is equivalent to multiplication by two, provided no significant bits are lost.
- Logical and cyclic shifts. Logical shift fills the vacated bits with zeros, regardless of the sign. Cyclic shift loops the shifted-out bit to the input of the vacated position. The ALU often includes the carry bit in the rotation chain to organize extended shifts over multi-byte structures.
- Bit set and clear function. Masking is performed through logical AND to reset a group of bits to zero or OR to force bits to one. The ALU executes this as a standard logical operation, receiving the mask from an operand register, which is critical for working with peripheral control registers.
- Comparator implementation. Comparison of numbers reduces, at the circuit level, to subtracting the operands without saving the result and analyzing only the flags. The Zero flag signals equality, and for signed data the exclusive OR of the Negative and Overflow flags indicates that the first operand is less than the second; combined with the Zero flag, this also yields the greater-than relationships.
- Single-bit shift block. In addition to the multi-bit barrel shifter, the ALU may include a simple single-step shifter. It is used in microprogram loops for multiplication and division, where at each iteration the adder accumulates a partial product, and the single-step shift adjusts the operand position.
- Increment and decrement. Increasing or decreasing an operand by one can be implemented in two ways. A dedicated incrementer built from a chain of half adders avoids the full adder and reduces delay. Alternatively, the main ALU adder is reused: the constant 1 is fed to its second input (or to the carry input) while a multiplexer blocks the second operand.
- Operand pass-through function. Transparent translation of operand A or B to output Y without transformation is implemented by routing through the output multiplexers of the adder or logic block. This allows the ALU to be used as a transit path when transferring data between registers without modification.
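- Multi-cycle multiplication. An ALU without a hardware multiplier implements multiplication through sequential shifts and additions. The multiplier’s least significant bit controls whether the multiplicand is added to the accumulator via the adder, after which the accumulator and multiplier are shifted right together by one position on each cycle. For unsigned operands the adder’s carry is shifted into the top of the accumulator; signed schemes such as Booth’s algorithm use an arithmetic shift instead.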
- Division with remainder restoration. The iterative division algorithm uses subtraction of the divisor from the partial remainder on the adder. Sign analysis determines the restoration step and the quotient bit. The ALU engages the adder and shift block on each cycle, sequentially generating the bits of the result.
- Operand preprocessing path. The ALU input multiplexers invert operand B or zero it out to implement subtraction and pass-through. The preprocessing logic also allows feeding the constant 1 to the adder for increment, implementing the function table without duplicating computational blocks.
- Generation and propagation of conditional signals. The ALU output section contains multiplexers controlled by the combinational logic of the flags. The circuit generates not only the result data bus but also produces conditional branch signals to control the execution flow, without requiring an explicit read of the status register.
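Several of the items above (subtraction through the inverted operand B, the four status flags, and flag-based comparison) can be tied together in one small behavioural model. This is a sketch under stated assumptions, not a description of any particular chip: the 8-bit width, the flag names `C`, `V`, `Z`, `N`, and the helper names `add_with_flags`, `sub_with_flags`, and `signed_less_than` are hypothetical, and the Carry flag here stores the raw adder carry (so it is the inverse of borrow after a subtraction).

```python
# Behavioural model of an adder with status flags (hypothetical 8-bit ALU).
WIDTH = 8
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def add_with_flags(a: int, b: int, carry_in: int = 0):
    """Return (result, flags) for a + b + carry_in on WIDTH-bit words."""
    a &= MASK
    b &= MASK
    # Carry INTO the sign bit: add only the lower WIDTH-1 bits.
    low = (a & (SIGN - 1)) + (b & (SIGN - 1)) + carry_in
    carry_into_sign = low >> (WIDTH - 1)
    # Carry OUT of the sign bit: add the full words.
    total = a + b + carry_in
    carry_out = total >> WIDTH
    result = total & MASK
    flags = {
        "C": carry_out,                    # unsigned overflow on addition
        "V": carry_into_sign ^ carry_out,  # signed (two's complement) overflow
        "Z": int(result == 0),             # NOR of all result bits
        "N": result >> (WIDTH - 1),        # copy of the sign bit
    }
    return result, flags


def sub_with_flags(a: int, b: int):
    """A - B computed as A + NOT(B) + 1 on the same adder."""
    return add_with_flags(a, ~b & MASK, carry_in=1)


def signed_less_than(a: int, b: int) -> bool:
    """Compare by subtracting and inspecting flags only: less-than is N XOR V."""
    _, flags = sub_with_flags(a, b)
    return bool(flags["N"] ^ flags["V"])
```

In this model, adding 0x7F and 0x01 sets V but not C (the signed range is exceeded, the unsigned range is not), whereas adding 0xFF and 0x01 sets C and Z but not V. The comparison of 0x80 (that is, -128) with 0x01 is reported as less-than even though the raw subtraction result 0x7F is positive, which is exactly why the Overflow flag has to be combined with the Negative flag.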
Comparisons
- ALU vs FPU. The arithmetic-logic unit operates on integer data and bit masks, while the floating-point unit typically implements the IEEE 754 standard in hardware. The ALU usually completes an addition in one clock cycle, whereas the FPU must align exponents and normalize the mantissa, which increases latency. Furthermore, the FPU usually has its own registers and does not compete with the integer pipeline, which enables instruction-level parallelism.
- ALU vs Hardware multiplier. A classic ALU performs multiplication via microcode through iterative shifts and additions, consuming dozens of cycles. A dedicated multiplier uses an array of full adders and a Wallace tree, forming the product in a minimal number of cycles. In modern cores, the multiplier is often integrated into the ALU path but logically remains a specialized high-speed module with a rigid interconnection topology.
- ALU vs Shift circuit (Barrel Shifter). An ALU with only a single-bit shifter performs logical and arithmetic shifts, but a variable shift amount costs one cycle per bit position. The barrel shifter is built on a cascade of multiplexers and shifts a word by an arbitrary number of bits in a single operation. Unlike the multi-cycle ALU loop, it switches input lines combinationally, greatly accelerating scaling and operand alignment in the processor pipeline.
- ALU vs AGU (Address Generation Unit). The ALU performs arithmetic transformations without considering the context of segmentation and paging. The address generation unit, in contrast, calculates linear addresses with indexing, scaling, and base registers. The AGU functionally overlaps with the ALU adder but works autonomously, allowing parallel address generation for memory fetch and data processing in the main pipeline.
- ALU vs SIMD block. A scalar ALU processes a pair of single operands, whereas a SIMD register is split into a vector of short integers. SIMD arithmetic, like NEON or AVX, implements the same addition and multiplication logic but with parallel processing of four, eight, or sixteen elements simultaneously, which multiplicatively boosts peak performance in digital signal processing tasks without increasing clock frequency.
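The iterative shift-and-add multiplication that the comparison with a hardware multiplier refers to can be sketched as follows. The function name `shift_add_multiply` and the default 8-bit width are assumptions for illustration, and the sketch covers unsigned operands only; it models one adder use and one single-bit shift per loop iteration, which is why the classic scheme needs as many cycles as the word has bits.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, width: int = 8) -> int:
    """Unsigned multiply using only an adder and a one-bit right shift.

    Returns the full 2*width-bit product. Hypothetical helper for illustration.
    """
    mask = (1 << width) - 1
    multiplicand &= mask
    multiplier &= mask
    acc = 0  # upper half of the product; may briefly hold width+1 bits
    for _ in range(width):
        # The multiplier's least significant bit gates the addition.
        if multiplier & 1:
            acc += multiplicand
        # Shift the pair (acc : multiplier) right by one position:
        # the bit leaving acc enters the top of the multiplier register.
        multiplier = (multiplier >> 1) | ((acc & 1) << (width - 1))
        acc >>= 1
    # Upper half is in acc, lower half has accumulated in the multiplier register.
    return (acc << width) | multiplier
```

With 8-bit operands the loop always runs eight iterations regardless of the values, for instance `shift_add_multiply(13, 11)` yields 143. A Wallace-tree multiplier produces the same product by summing all partial products in a combinational tree instead of one per cycle.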
Hardware abstraction for the OS and drivers
The operating system does not interact with the ALU directly through input/output commands; it observes it through the flag register and the processor’s exception mechanism. When an exception occurs (for example, division by zero), the control unit raises a synchronous exception rather than an external or non-maskable interrupt, records the cause in a status register or exception vector, and passes control to the kernel handler, which reads the faulting context that was saved on entry.
Hardware vulnerabilities and access control
The security of the execution pipeline is implemented through the separation of instruction execution privilege levels: the protection circuit detects attempts by unprivileged code to change system mode flags or to execute privileged instructions such as I/O port access, after which the control logic, not the ALU itself, raises a general protection exception, helping to prevent compromise of the kernel or hypervisor.
Event capture in the debug path
Logging works through the processor’s trace buffer, where the implementation provides one: each traced ALU operation is accompanied by a record written to the trace queue, containing the values of the source operands, the result, a timestamp, and the state of the carry, overflow, and sign flags, which allows the debugger to reconstruct the exact sequence of computations without stopping the pipeline.
Structural and precision limits
Limitations are determined by the bit width of internal data buses and the physical depth of the carry-lookahead adder: the maximum operand size is fixed by the register file width, and integer results that exceed it are reported through the carry and overflow flags. Floating-point operations, which are handled by the FPU rather than the integer ALU, are limited by mantissa precision; when those boundaries are exceeded, the FPU signals a precision or exponent overflow exception.
Evolution from discrete boards to SIMD blocks
Development began with transistor modules built from discrete components. The functions were then integrated into the microprocessor chip as a single execution block with microprogram control. Modern implementations include vector pipelines, where one control signal simultaneously launches the processing of number arrays in, for example, eight parallel 32-bit ALUs with saturating arithmetic.
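As a rough illustration of the vector arrangement mentioned above, the following sketch models eight signed 32-bit lanes that all execute the same saturating addition. The names `LANES`, `saturating_add_lane`, and `simd_saturating_add` are hypothetical, and the lanes are evaluated one after another in software, whereas the hardware evaluates them simultaneously.

```python
# Software model of eight parallel 32-bit lanes with saturating addition.
# All names are illustrative; no real instruction set is being modelled.
LANES = 8
LANE_BITS = 32
INT_MAX = (1 << (LANE_BITS - 1)) - 1
INT_MIN = -(1 << (LANE_BITS - 1))


def saturating_add_lane(x: int, y: int) -> int:
    """Signed add that clamps to the lane's range instead of wrapping around."""
    return max(INT_MIN, min(INT_MAX, x + y))


def simd_saturating_add(a: list, b: list) -> list:
    """Apply the same operation to every lane, as one control signal would."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError(f"expected {LANES} lanes per operand")
    return [saturating_add_lane(x, y) for x, y in zip(a, b)]
```

Saturation matters in signal processing because a clamped sample distorts far less than one that wraps from the largest positive value to a large negative one.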