Control Unit (CU): managing the execution of processor machine instructions

The control unit (CU) does not perform calculations itself. It decodes program instructions and sends precisely timed control signals to the arithmetic logic unit (ALU), memory, and registers, telling each of them what to do in every clock cycle.

The control unit is an integral part of any central processor, from microcontrollers in household appliances and automotive electronic control units to high-performance server chips. In graphics processors, the thread dispatcher implements similar control logic. The CU is also used in specialized processors, such as digital signal processors and processor cores implemented in programmable logic devices, where deterministic instruction execution is required.

Typical problems

Physical degradation of transistors or voltage surges in the decode logic can cause opcodes to be misinterpreted, leading to system hangs or the execution of incorrect instructions. Branch mispredictions force pipeline flushes, which reduce performance. In early hardwired implementations, a critical problem was that errors in the control logic could not be fixed without replacing the chip itself.

How the CU works

The control unit works cyclically, passing through the fetch, decode, and execute stages. After receiving a machine word from memory, the CU isolates the opcode and uses a decoder to convert it into the corresponding set of micro-operations. In hardwired designs, control signals are formed by combinational gate circuits; microprogrammed CUs instead use a control memory that stores sequences of microinstructions, resembling an interpreter embedded directly in silicon.

The CU differs from neighboring blocks in what it controls. A direct memory access controller autonomously transfers blocks of data without involving the core, whereas the CU manages the instruction flow of the core itself. A bus arbiter resolves access conflicts, whereas the control unit generates strictly ordered timing signals. The arithmetic logic unit only passively performs the requested operation on data, whereas the CU dictates which operation to perform, where to take the operands from, and where to place the result. This separation of responsibilities makes modular architectures possible, in which the control logic can be revised independently of the data paths.
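The fetch, decode, and execute cycle and the split between control and data path can be illustrated with a toy model. This is a minimal sketch, not a description of any real processor: the 8-bit instruction format, the opcode values, the `CONTROL_MEMORY` table, and the function names are all assumptions made for the example.

```python
# Hypothetical 8-bit instruction word: upper 4 bits = opcode, lower 4 bits = immediate.
LOAD, ADD, SUB, HALT = 0x1, 0x2, 0x3, 0xF

# Toy control memory: opcode -> sequence of micro-operations.
# Each micro-operation is (alu_op, write_acc), i.e. the control signals for one cycle.
CONTROL_MEMORY = {
    LOAD: [("pass_b", True)],
    ADD: [("add", True)],
    SUB: [("sub", True)],
    HALT: [],
}


def alu(op, a, b):
    """Passive data path: computes whatever operation the CU selects."""
    if op == "add":
        return (a + b) & 0xFF
    if op == "sub":
        return (a - b) & 0xFF
    if op == "pass_b":
        return b
    raise ValueError(f"unknown ALU operation: {op}")


def run(program):
    """Control loop: fetch, decode, and execute until HALT; return the accumulator."""
    pc, acc = 0, 0
    while True:
        word = program[pc]                      # fetch the machine word
        pc += 1                                 # point at the next instruction
        opcode, imm = word >> 4, word & 0x0F    # decode: isolate the opcode
        if opcode not in CONTROL_MEMORY:
            raise ValueError(f"illegal opcode {opcode:#x} at address {pc - 1}")
        if opcode == HALT:
            return acc
        for alu_op, write_acc in CONTROL_MEMORY[opcode]:  # execute micro-operations
            result = alu(alu_op, acc, imm)
            if write_acc:
                acc = result


print(run([0x15, 0x23, 0x31, 0xF0]))  # LOAD 5; ADD 3; SUB 1; HALT -> prints 7
```

Note that `alu` never decides anything on its own: the operation, the operands, and the destination all come from the control loop, which mirrors the division of labor described above.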

CU functionality

  1. Instruction decoding. The control unit receives the binary opcode from the instruction register and converts it into a set of logic signals. The opcode decoder activates a unique output line corresponding to a specific machine command, initializing a microprogram sequence for its execution in the arithmetic logic unit.
  2. State machine control. The CU functions as a synchronous finite state machine that changes state on each clock edge according to its current state and its inputs. Transitions between the fetch, decode, and execute states are determined by control flags and status signals; in a hardwired design this state machine is built directly from gates and flip-flops rather than read from control memory.
  3. Generation of micro-operation sequences. To execute a complex instruction, the CU initiates a chain of atomic micro-operations. Write enable signals to the register file, bus multiplexer control, and ALU mode setting are issued in strictly defined cycles synchronized with the edges of the system clock signal.
  4. Instruction fetch by program counter. The CU places the address from the program counter on the memory address bus and activates the read signal. After receiving the instruction word via the data bus, the CU writes it to the instruction register and immediately increments the program counter so that it points to the next machine instruction in the stream.
  5. Interrupt and exception handling. Upon receiving a hardware interrupt request, the CU suspends the normal pipeline, saves the context of the current task on the stack, and vector-loads the interrupt handler address into the program counter. Masking and prioritization of signals are performed by the built-in arbitration logic of the control unit.
  6. Indirect addressing control. If the operand field contains a pointer to the effective address, the CU initiates an additional bus cycle to fetch the actual address from memory. For multi-level indirection, the CU extends the instruction with as many indirect-fetch cycles as are needed to resolve the pointer chain.
  7. Branch prediction logic. Modern CUs contain a branch prediction automaton that analyzes the address of a conditional branch instruction. At the decoding stage, a predicted address for the next fetch is generated, allowing speculative filling of the pipeline before the actual branch condition is evaluated in the execution path.
  8. Synchronous pipeline control. The CU inserts bubbles into the processor pipeline when it detects data or control hazards. An interlock circuit on the pipeline registers holds back dependent instructions until the conflict is resolved, keeping the architectural register state consistent.
  9. Control word formation. At each cycle, the CU assembles a wide control word whose bits directly control the multiplexers of the data paths. Each bit is responsible for a specific gate or selector, forming a unique configuration of operand transfer routes between the register file, ALU, and cache memory.
  10. Arbitration of access to shared ports. The CU resolves conflicts that arise when several pipeline stages simultaneously request the same write port of the register file. The CU arbiter gives priority to exception completion or to the write-back of execution-stage results and holds off the competing lower-priority write.
  11. Interface with the memory management unit. During virtual-to-physical address translation, the CU initiates a TLB lookup. In case of a TLB miss, the control unit freezes the pipeline, launches a hardware page table walk, and waits for the translation completion signal to re-execute the problematic instruction without losing machine state.
  12. Microprogram sequencing. In architectures with microcode, the CU contains a sequencer that retrieves a sequence of control words from the microcode ROM. The address of the next microinstruction is computed from the condition codes and the branch fields of the current microinstruction, which implements complex multi-cycle instructions without increasing the complexity of the hardwired logic.
  13. Data dependency handling. The CU implements a result forwarding scheme, comparing destination register numbers at the execution stage with source register numbers at the decoding stage. Activation of bypass multiplexers allows the ALU result to be sent back to its input, bypassing the wait for write-back to the register file.
  14. Synchronization with slow external devices. When accessing memory with large delays, the CU stretches the wait phase, inserting additional wait cycles. The memory ready signal is used for asynchronous write or read confirmation, allowing the processor to exchange data with peripherals at arbitrary response speeds.
  15. System exception monitoring. The CU continuously checks address boundaries, data alignment, and attempts to execute privileged instructions in user mode. Upon condition violation, the block generates an exception, flushes the pipeline, and transfers control to the error handling vector, writing the cause of the failure into a special cause register.
  16. Implementation of power-saving modes. Upon a halt command, the CU disables clock signals from unused functional modules, preserving the automaton state. The control unit waits for a wake-up signal from an external interrupt, after which it synchronously restarts the clocking system and restores the program flow without disrupting the logical integrity of computations.
  17. Dynamic instruction dispatching. An out-of-order control unit analyzes a window of ready micro-operations and sends them to free execution units. The hardware scheduler inside the CU checks the busy bits of the reservation stations and the actual availability of operands, ignoring the original program order of instructions to maximize instruction-level parallelism (ILP).
  18. Initialization and reset of architectural state. Upon hardware reset, the CU forces the program counter to the reset vector, clears the control registers, and returns the pipeline state machines to their initial state. The reset signal is typically asserted asynchronously to the clock and held until the power supply and clock generator have stabilized.
  19. Breakpoint support. The integrated address comparator within the CU continuously verifies the current program counter value against the debug register. Upon an exact address match and fulfillment of additional operand size conditions, the block generates a debug exception before the actual execution of the marked instruction to transfer control to the debugger.
  20. Interaction with the cache coherence mechanism. Upon detecting a shared cache line modification operation by another processor core, the CU suspends speculative execution. The control unit analyzes snooping protocol signals, invalidates the affected micro-operations in the scheduler, and initiates a re-fetch of code or data to maintain memory correctness.
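The hazard handling in items 8 and 13 can be sketched as a single comparison step. This is an illustrative model under stated assumptions: the `Instr` record, the `hazard_control` function, and the rule that a dependency on a `"load"` in the execute stage requires a bubble (because its data is not yet available for bypassing) are choices made for the example, not details taken from a particular processor.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str                      # hypothetical mnemonic, e.g. "add" or "load"
    dest: Optional[int] = None   # destination register number, if any
    srcs: Tuple[int, ...] = ()   # source register numbers


def hazard_control(decode: Instr, execute: Optional[Instr]) -> dict:
    """Compare the execute-stage destination with the decode-stage sources.

    Returns the control signals for this cycle: whether to insert a bubble,
    and which source operands (by position) should take the bypassed result.
    """
    no_action = {"stall": False, "forward": ()}
    if execute is None or execute.dest is None:
        return no_action
    matches = tuple(i for i, reg in enumerate(decode.srcs) if reg == execute.dest)
    if not matches:
        return no_action
    if execute.op == "load":
        # Assumption: load data arrives too late to bypass, so stall one cycle.
        return {"stall": True, "forward": ()}
    return {"stall": False, "forward": matches}


producer = Instr("add", dest=3, srcs=(1, 2))
consumer = Instr("sub", dest=4, srcs=(3, 1))
print(hazard_control(consumer, producer))  # {'stall': False, 'forward': (0,)}
```

In hardware the same decision is made by register-number comparators driving the bypass multiplexers and the interlock signal; the dictionary here only stands in for those control lines.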

Comparisons

  • Control Unit vs Instruction Decoder. The Control Unit interprets the opcode and generates a sequence of control signals for the entire computational path, whereas the Instruction Decoder performs only the initial conversion of the command bit field into a set of static enable lines. The CU possesses temporal logic and knowledge of the microarchitectural context, while the decoder is a purely combinational circuit lacking mechanisms for multi-cycle sequencing and wait state handling.
  • Control Unit vs Microprogram Sequencer. In a microprogrammed design, the CU delegates the formation of the detailed signal sequence to the Microprogram Sequencer, which addresses the control memory and handles branching based on status flags. The top-level CU supplies the starting address of the microprogram and accepts the completion signal, while the sequencer performs the routine work of stepping through microinstructions, offloading the main control finite state machine.
  • Control Unit vs Branch Predictor. The task of the CU boils down to the synchronous execution of the prescribed instruction stream, whereas the Branch Predictor speculatively predicts the direction of a conditional branch based on history, minimizing pipeline stalls. The controller uses the ready prediction result for the immediate fetch of the next command and, upon detecting an error, initiates a hardware rollback, not participating in the statistical analysis of branching patterns.
  • Control Unit vs Scheduler (Dispatch Unit). In superscalar architectures, the CU handles fetch and decoding, after which the Scheduler analyzes data dependencies and the availability of execution units, reordering the queue of micro-operations ready for execution. The Control Unit is responsible for maintaining the architectural sequence of exceptions, while the scheduler is responsible for out-of-order execution and maximum utilization of arithmetic-logic blocks without violating the logical integrity of the program.
  • Control Unit vs Memory Management Unit. The control unit generates logical addresses and manages the instruction flow, while the MMU translates virtual addresses into physical ones, checks access rights, and detects page faults. The CU is unaware of where data actually resides in RAM. On a translation miss it stalls the pipeline until the MMU signals that the translation is ready; on a page fault it raises an exception so that the operating system can bring in the page, after which the faulting instruction is re-executed.
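The division of labor between the CU and the branch predictor can be made concrete with a small model. The sketch below assumes a table of 2-bit saturating counters, which is one common textbook scheme and not necessarily what any given processor uses; the class name, table size, and `count_flushes` helper are hypothetical.

```python
class TwoBitPredictor:
    """Predictor side: keeps per-branch history as 2-bit saturating counters."""

    def __init__(self, entries=16):
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        # Assumption: every entry starts at 1 (weakly not taken).
        self.counters = [1] * entries

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_flushes(predictor, pc, outcomes):
    """CU side: consume each prediction and count rollbacks on mispredictions."""
    flushes = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            flushes += 1          # the CU would flush the speculative work here
        predictor.update(pc, taken)
    return flushes


# A loop branch at a hypothetical address: taken nine times, then falls through.
loop_outcomes = [True] * 9 + [False]
print(count_flushes(TwoBitPredictor(), 0x40, loop_outcomes))  # prints 2
```

The control side in `count_flushes` never looks at the counters; it only consumes the prediction and reacts when it turns out to be wrong, which is the separation described in the comparison above.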

OS and driver support

A modern control unit provides hardware support for operating systems through interrupt vectoring and fast context-switching mechanisms: on a timer tick or an external exception, the hardware automatically saves, and later restores, the essential processor state (how much of the register file this covers depends on the architecture). For interaction with drivers, the system exposes I/O registers mapped into the memory address space and hardware command queues, which allow a driver to hand over chains of descriptors atomically without the central processor constantly polling the device for readiness.
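Interrupt vectoring with masking and prioritization can be sketched as follows. This is a simplified model under explicit assumptions: lower IRQ numbers have higher priority, only the program counter and flags are saved, and nested interrupts are disabled on entry. The `CpuState` class and both function names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CpuState:
    pc: int = 0
    flags: int = 0
    interrupts_enabled: bool = True
    stack: list = field(default_factory=list)


def take_interrupt(cpu, pending, mask, vector_table):
    """Vector to the highest-priority unmasked request, if any.

    `pending` and `mask` are sets of IRQ numbers; `vector_table` maps an IRQ
    number to its handler address. Returns the IRQ taken, or None.
    """
    if not cpu.interrupts_enabled:
        return None
    candidates = pending - mask
    if not candidates:
        return None
    irq = min(candidates)                   # assumption: lower number wins
    cpu.stack.append((cpu.pc, cpu.flags))   # save minimal context
    cpu.interrupts_enabled = False          # no nesting in this sketch
    cpu.pc = vector_table[irq]              # load the handler address
    pending.discard(irq)
    return irq


def return_from_interrupt(cpu):
    """Restore the saved context and re-enable interrupts."""
    cpu.pc, cpu.flags = cpu.stack.pop()
    cpu.interrupts_enabled = True


cpu = CpuState(pc=0x100)
vectors = {1: 0x10, 3: 0x30, 5: 0x50}
print(take_interrupt(cpu, {1, 3, 5}, {1}, vectors), hex(cpu.pc))  # 3 0x30
```

Real architectures differ in what is saved automatically and where; the point of the sketch is only the order of operations: select, save, vector, and later restore.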

Security

Security at the control logic level is supported by hardware memory bounds checking through shadow segment registers: before instruction execution starts, the CU microcode writes the address ranges permitted for the current process into them. In addition, control-flow finite state machines check return addresses against a shadow return stack and indirect branch targets against pre-computed labels. As a result, an attempt to exploit a buffer overflow vulnerability raises an immediate hardware exception for a control-flow integrity violation before the malicious payload is executed.
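The shadow return stack check can be modeled in a few lines. This is a conceptual sketch only: the `ShadowStack` class, its method names, and the `ControlFlowViolation` exception are hypothetical, and a real implementation keeps the shadow copy in protected storage that ordinary stores cannot reach.

```python
class ControlFlowViolation(Exception):
    """Stands in for the hardware exception raised on a mismatch."""


class ShadowStack:
    def __init__(self):
        self._returns = []   # protected copy of return addresses

    def on_call(self, return_address):
        """On a call, record the return address alongside the normal stack."""
        self._returns.append(return_address)

    def on_return(self, address_from_stack):
        """On a return, compare the address popped from the normal stack."""
        if not self._returns:
            raise ControlFlowViolation("return without a matching call")
        expected = self._returns.pop()
        if address_from_stack != expected:
            raise ControlFlowViolation(
                f"return to {address_from_stack:#x}, expected {expected:#x}"
            )
        return expected


shadow = ShadowStack()
shadow.on_call(0x400)
shadow.on_return(0x400)        # matches: execution continues normally

shadow.on_call(0x400)
try:
    shadow.on_return(0xDEAD)   # overwritten return address on the normal stack
except ControlFlowViolation as exc:
    print("blocked:", exc)
```

Because the comparison happens at the return itself, the corrupted address is rejected before control ever reaches it.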

Logging

Non-intrusive logging in the control unit is implemented through a hardware trace module embedded directly in the decode pipeline. The CU can generate packets containing timestamps, branch codes, and the contents of selected registers, and it sends this stream through a dedicated trace buffer to system memory using direct memory access. This makes it possible to record the execution history of machine instructions without generating exceptions and without modifying the operating system code.
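A trace buffer of this kind can be pictured as a fixed-size queue of packets. The sketch below is an assumption-laden illustration: the packet fields, the `TraceBuffer` class, and the policy of overwriting the oldest packet when the buffer is full are choices made for the example, and `drain` merely stands in for the transfer to system memory.

```python
from collections import deque
from typing import NamedTuple


class TracePacket(NamedTuple):
    cycle: int    # timestamp in clock cycles
    pc: int       # address of the branch instruction
    taken: bool   # branch outcome


class TraceBuffer:
    def __init__(self, capacity=4):
        # Assumption: when full, the oldest packet is silently overwritten.
        self._packets = deque(maxlen=capacity)

    def record_branch(self, cycle, pc, taken):
        """Called by the control logic; never interrupts the traced program."""
        self._packets.append(TracePacket(cycle, pc, taken))

    def drain(self):
        """Hand all buffered packets over and empty the buffer."""
        packets = list(self._packets)
        self._packets.clear()
        return packets


trace = TraceBuffer(capacity=2)
trace.record_branch(10, 0x40, True)
trace.record_branch(14, 0x48, False)
trace.record_branch(19, 0x40, True)     # overwrites the packet from cycle 10
print([p.cycle for p in trace.drain()])  # [14, 19]
```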

Limitations

The key limitation of a microprogrammed CU is the finite depth and throughput of the control memory. The complexity of microprograms is bounded by the size of the microcode ROM, and extending the instruction set requires physically replacing or reflashing the control store. A further bottleneck is the sequential nature of microinstruction fetching: for frequently repeated simple operations, control signals are issued more slowly than by a hardwired finite state machine.

Historical evolution

Control units have evolved from hardwired logic on discrete components, where the instruction execution algorithm was fixed irreversibly in the wiring, through Maurice Wilkes’s concept of microprogramming, in which control memory was a read-only matrix (diodes in the original proposal, ferrite cores in EDSAC 2), to modern hybrid architectures. In these, simple instructions are decoded by hardware finite state machines in one cycle, while complex multi-cycle operations are carried out by routines from a high-speed microcode store that can be updated with microcode patches on a running processor.