Thumb-2 (Hybrid 32 and 16-bit instruction set)

Thumb-2 (an extension of ARM's compressed Thumb instruction set) combines short 16-bit instructions, which save memory, with full 32-bit instructions for complex operations in a single stream. The processor decodes both lengths without changing execution state, achieving code density close to that of pure Thumb and performance close to that of ARM mode.

This instruction set is used in almost all modern microcontrollers and embedded systems based on the ARM Cortex-M architecture, which executes Thumb code only. ARMv7-based Cortex-A and Cortex-R processors also support it, alongside the classic 32-bit ARM instruction set. It has become the standard for programming Internet of Things devices, automotive control units, medical devices, and industrial controllers. Developers typically rely on Thumb-2 when writing firmware in C/C++: the compiler mixes instruction formats automatically to produce a compact and fast binary, which is critical when code must fit in Flash memory of limited size.

The main difficulties when working with Thumb-2 concern branch targets and instruction boundaries. Instructions are halfword-aligned, and bit 0 of an interworking branch target (BX, BLX) selects the instruction set state rather than addressing a byte. On a Thumb-only core such as Cortex-M, a target with bit 0 clear therefore causes a UsageFault (INVSTATE), not an alignment fault. In addition, dense instruction packing complicates static disassembly, because a decoder that starts at the wrong halfword misreads what follows, and the mixing of formats can make manual analysis of memory dumps harder. Finally, when a 16-bit branch cannot encode the required offset, the assembler must select the 32-bit encoding, and for targets beyond that range the linker inserts a veneer (an additional branch), which slightly increases code size.
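The offset limits behind the last point can be sketched as follows. This is an illustrative model, not an assembler: the function name pick_branch_size and the RANGES table are hypothetical, and the limits are the byte ranges of the immediate branch encodings as commonly documented for Thumb-2, measured from the PC value the branch sees (its own address plus 4).

```python
# Offset ranges in bytes for immediate branches, relative to the PC value
# seen by the branch (address of the branch instruction + 4).
# Keys are (conditional?, encoding size in bytes). Illustrative table.
RANGES = {
    (True, 2): (-256, 254),             # B<cond>    16-bit, 8-bit immediate
    (False, 2): (-2048, 2046),          # B          16-bit, 11-bit immediate
    (True, 4): (-1048576, 1048574),     # B<cond>.W  32-bit, about +/-1 MB
    (False, 4): (-16777216, 16777214),  # B.W        32-bit, about +/-16 MB
}


def pick_branch_size(branch_addr, target_addr, conditional=False):
    """Return 2 or 4 (bytes) for the smallest encoding that reaches the
    target, or None if no single branch instruction can reach it."""
    if target_addr % 2:
        raise ValueError("instruction addresses are halfword-aligned")
    offset = target_addr - (branch_addr + 4)
    for size in (2, 4):
        lo, hi = RANGES[(conditional, size)]
        if lo <= offset <= hi:
            return size
    return None  # out of range: a veneer or indirect branch would be needed


if __name__ == "__main__":
    print(pick_branch_size(0x08000000, 0x08000100))                    # near target
    print(pick_branch_size(0x08000000, 0x08000104, conditional=True))  # just too far for 16-bit
    print(pick_branch_size(0x08000000, 0x0A000000))                    # beyond B.W
```

A result of None corresponds to the case where the toolchain has to bridge the distance with extra code.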

How Thumb-2 works

Thumb-2 removes the need to alternate between the ARM and Thumb core states, which had been required since Thumb was introduced in ARMv4T, by giving the Thumb state a variable-length decoder. Starting with ARMv6T2 and continuing in ARMv7, the processor reads the instruction stream one halfword at a time and inspects the top five bits of the first halfword: the values 0b11101, 0b11110, and 0b11111 mark the first half of a 32-bit instruction, and any other value marks a complete 16-bit instruction. Original Thumb lacked flexibility because most of its instructions could reach only registers r0–r7 and only branches could be conditional. Thumb-2 adds 32-bit encodings that reach the high registers, the IT instruction for conditional execution, bit-field operations, multiply-accumulate, and coprocessor access. Compared with the fixed 32-bit length of the original ARM instruction set, the shorter average instruction length reduces the energy spent on instruction fetch and makes better use of fetch bandwidth. The classic approach required a BX or BLX instruction to change the instruction set, which added overhead to calls between ARM and Thumb functions. With Thumb-2 the compiler and assembler mix 16- and 32-bit encodings freely within a single function, no switch instruction is needed, and more instructions fit into each cache line. The result is fewer instruction cache misses while retaining nearly all the functionality of the classic ARM instruction set.
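The length rule can be sketched in a few lines. This is a minimal model under stated assumptions: the function names are hypothetical, the input is a little-endian byte string that starts on an instruction boundary, and the sample bytes are hand-assembled encodings of MOVS r0, #1, MOV.W r0, #1, and BX LR.

```python
import struct


def thumb_insn_length(first_halfword):
    """Return the instruction size in bytes, given its first halfword.

    A first halfword whose top five bits are 0b11101, 0b11110 or 0b11111
    starts a 32-bit instruction; anything else is a 16-bit instruction.
    """
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(code):
    """Split little-endian Thumb-2 code into (offset, size, halfwords)
    tuples. Decoding must begin on a real instruction boundary."""
    result = []
    pos = 0
    while pos + 2 <= len(code):
        (hw1,) = struct.unpack_from("<H", code, pos)
        size = thumb_insn_length(hw1)
        if pos + size > len(code):
            break  # truncated 32-bit instruction at the end of the buffer
        halfwords = struct.unpack_from("<%dH" % (size // 2), code, pos)
        result.append((pos, size, halfwords))
        pos += size
    return result


if __name__ == "__main__":
    # MOVS r0, #1 (0x2001); MOV.W r0, #1 (0xF04F 0x0001); BX LR (0x4770)
    sample = bytes.fromhex("01204ff001007047")
    for offset, size, halfwords in split_instructions(sample):
        print(offset, size, [hex(h) for h in halfwords])
```

Only the first halfword is inspected; the second halfword of a 32-bit instruction carries no length marker of its own.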

Thumb-2 functionality

  1. Instruction Encoding Modes. Thumb-2 combines the original 16-bit Thumb instructions with new 32-bit Thumb instructions in a single stream without mode switching. The 32-bit encodings provide nearly the same functionality as ARM instructions but use a different bit layout. The processor decodes variable-length instructions directly, eliminating the overhead of core state changes.
  2. Prefix Range Extension. A 32-bit Thumb-2 instruction consists of two consecutive halfwords. The top five bits of the first halfword (0b11101, 0b11110, or 0b11111) identify it as the start of a 32-bit instruction; the remaining bits of both halfwords carry register numbers, immediate values, and, for conditional branches, the condition code. This widens the operand and offset ranges that the original Thumb set could encode.
  3. Orthogonality with ARM Registers. Unlike classic Thumb, whose instructions mostly reach only r0–r7, Thumb-2 gives most data processing operations access to the high registers r8–r12 and LR, with restrictions on the use of SP and PC. This eliminates the bottleneck associated with register shortage during intensive computations.
  4. Conditional Execution via IT Block. The If-Then (IT) instruction makes up to four subsequent instructions conditional. It encodes a base condition and a 4-bit mask that marks each following instruction as Then (condition true) or Else (condition false), allowing compact implementation of short conditional constructs without branches and pipeline flushes.
  5. Extended Multiply Operations. The set includes 32×32 multiply instructions with a 64-bit result (UMULL, SMULL), multiply-accumulate (MLA, MLS, UMLAL, SMLAL), and, on cores with the DSP extension, dual 16-bit multiply operations. The long multiply-accumulate forms add the product to a 64-bit accumulator held in a register pair, which significantly accelerates digital signal processing.
  6. Hardware Integer Division. Unsigned and signed division instructions (UDIV, SDIV) are available in ARMv7-M and ARMv7-R and are optional in ARMv7-A. On Cortex-M3 a division takes roughly 2 to 12 cycles, terminating early depending on the operands. This is significantly faster than a software library routine, although the exact execution time is data-dependent.
  7. Bit Field Manipulation. Bit field extract (UBFX, SBFX), insert (BFI), and clear (BFC) instructions are implemented. They operate on an arbitrary contiguous run of bits in a register, with optional sign extension on extract, in a single instruction instead of a shift-and-mask sequence.
  8. Improved Shifted Addressing. 32-bit load and store instructions support an index register shifted left by 0 to 3 bits. Addressing of the form [base, index, LSL #n] forms pointers to array elements of 1, 2, 4, or 8 bytes without additional address calculation instructions.
  9. Compact ROM Jump Tables. The TBB and TBH instructions perform a table branch using byte or halfword offsets. Each loads an unsigned offset from a table in memory, doubles it, and adds it to the PC, efficiently implementing dense switch-case constructs with forward branches.
  10. Count Leading Zeros Instruction. The CLZ command determines the number of leading zeros in a register. It is used in number normalization algorithms, priority encoding, and floating-point emulation, replacing several iterative instructions.
  11. Byte and Bit Order Reversal. REV reverses the byte order of a word, REV16 reverses the bytes within each halfword, and RBIT reverses all 32 bits of a register. Hardware support accelerates endianness conversion in network code and bit-reversal steps in algorithms such as CRC and FFT.
  12. Q-Format Saturating Arithmetic. The SSAT and USAT instructions clamp a value to a signed or unsigned range of a specified bit width, saturating instead of wrapping on overflow and setting the Q flag when saturation occurs. This is the basis of media data processing without conditional checks and branches.
  13. Parallel Arithmetic Instructions. On cores that implement the DSP extension (for example Cortex-M4, but not Cortex-M3), SIMD instructions within general-purpose registers operate on packed 8- and 16-bit data: addition, subtraction, selection, and packing. A single instruction processes up to four 8-bit or two 16-bit elements.
  14. Memory Barriers and Synchronization. DMB, DSB, and ISB instructions are introduced for strict ordering of memory accesses and pipeline synchronization. This ensures the correct operation of multitasking systems, DMA transfers, and self-modifying code.
  15. Non-modifying Shift and Combine. Instructions such as PKHBT and PKHTB combine halfwords from two registers, shifting the second operand, to form a new word in one instruction. They belong to the DSP extension and accelerate pixel data packing and descriptor construction.
  16. Exceptions with Integrity Control. SVC (formerly SWI) and BKPT have compact 16-bit encodings for supervisor calls and breakpoints. Each carries an 8-bit immediate that the handler or debugger can read back from the instruction to identify the service number or breakpoint, which simplifies embedded software debugging.
  17. Atomic Memory Modification. LDREX and STREX use an exclusive monitor to implement semaphores and spin-locks. The store succeeds only if no other observer has written the location since the exclusive load, so a retry loop provides atomic read-modify-write without bus locking, which scales well on multi-core systems.
  18. Coprocessor Interface. Thumb-2 retains access to coprocessors via 32-bit MCR and MRC encodings. On ARMv7-A and ARMv7-R cores, code can access the CP15 system registers and the VFP coprocessor directly from the compact stream without entering the ARM state.
  19. Suppression of Unused Upper Bytes. For ROM-oriented systems, code density is critical. Thumb-2 places 32-bit instructions on halfword boundaries without padding. Memory savings relative to ARM mode are commonly quoted at up to about 30%, with performance close to that of ARM code.
  20. Execution Time Determinism. Early Thumb cores such as ARM7TDMI expanded each Thumb instruction into its ARM equivalent during decode, whereas the Cortex-M3 decodes Thumb-2 natively. Most data processing instructions execute in a single cycle and cycle counts are documented, but loads, stores, taken branches, and divisions take a variable number of cycles, which must be accounted for in hard real-time verification.
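Several of the instructions listed above have simple arithmetic definitions, which the following sketch models on 32-bit values. The Python function names mirror the mnemonics but are otherwise hypothetical, and the models cover only the result value: flags such as Q, operand restrictions, and encodings are not modeled.

```python
MASK32 = 0xFFFFFFFF


def ubfx(value, lsb, width):
    """Unsigned bit field extract: `width` bits starting at bit `lsb`."""
    return (value >> lsb) & ((1 << width) - 1)


def sbfx(value, lsb, width):
    """Signed bit field extract; result returned as a 32-bit pattern."""
    field = ubfx(value, lsb, width)
    sign = 1 << (width - 1)
    return ((field ^ sign) - sign) & MASK32


def bfi(dest, src, lsb, width):
    """Bit field insert: copy the low `width` bits of src into dest at lsb."""
    mask = ((1 << width) - 1) << lsb
    return (dest & ~mask & MASK32) | ((src << lsb) & mask)


def clz(value):
    """Count leading zeros of a 32-bit value (32 for zero)."""
    return 32 - (value & MASK32).bit_length()


def rev(value):
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((value & MASK32).to_bytes(4, "little"), "big")


def rbit(value):
    """Reverse the bit order of a 32-bit word."""
    return int(format(value & MASK32, "032b")[::-1], 2)


def ssat(value, bits):
    """Clamp a signed integer to the signed range of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def usat(value, bits):
    """Clamp a signed integer to the unsigned range of `bits` bits."""
    return max(0, min((1 << bits) - 1, value))


def tbb_target(pc, table, index):
    """TBB: branch target is PC plus twice the unsigned byte table[index]."""
    return pc + 2 * table[index]


if __name__ == "__main__":
    print(hex(ubfx(0xABCD1234, 8, 8)), hex(rev(0x12345678)), clz(1))
    print(ssat(200, 8), usat(300, 8), hex(tbb_target(0x1004, bytes([2, 5, 9]), 1)))
```

Each function stands in for one instruction; written in C without these instructions, the same operations would typically compile to sequences of shifts, masks, compares, or loops.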

Comparisons

  • Thumb-2 vs ARM (Original 32-bit Instruction Set). Thumb-2 provides code density close to that of the original Thumb, with performance close to that of full ARM code. This is achieved by mixing 16- and 32-bit instructions in a single stream without mode switching, which eliminates interworking overhead and allows the compiler to balance code size against execution speed.
  • Thumb-2 vs RISC-V Compressed Extension (RVC). RVC adds 16-bit encodings of frequently used instructions to the base set and, like Thumb-2, mixes them freely with 32-bit instructions without any mode switch; the length is determined from the two lowest bits of the first 16-bit parcel. Each RVC instruction is a short form of a single base instruction, so it improves density but adds no new functionality. Thumb-2 additionally offers IT-block predication, whereas RISC-V has no comparable conditional execution and relies on branches.
  • Thumb-2 vs MIPS16e. MIPS16e uses a compressed instruction mode, requiring processor switching via special jump instructions, which creates delays. Thumb-2 functions in a single unified stream without decode mode change, using the upper bits to determine instruction length, which minimizes fetch delays and simplifies pipeline design while maintaining access to the full register set.
  • Thumb-2 vs x86 Variable-Length Encoding. x86 instructions may be anywhere from one to fifteen bytes long, which complicates fast decoding. Thumb-2 allows only two lengths, and 32-bit instructions are unambiguously identified by the top bits of the first halfword. This makes instruction boundaries cheap to find, which helps reduce decode power and simplifies parallel decoding in superscalar microarchitectures.
  • Thumb-2 vs ARMv8 AArch64 Fixed-Width Encoding. The fixed 32-bit length of AArch64 simplifies the pipeline but sacrifices code density. Thumb-2, by contrast, produces a more compact binary thanks to its variable length (roughly 30% on average by some estimates, depending on compiler and workload), which is critically important for embedded systems with limited flash memory. However, AArch64 benefits from simpler instruction fetch and decode due to the constant instruction length.

OS and driver support

Thumb-2 can be used in privileged code, so kernels and drivers are typically built entirely from the mixed instruction stream. The instruction set state on exception entry is set by hardware, not by the operating system. On Cortex-M, which executes only Thumb code, the vector table holds handler addresses rather than instructions; bit 0 of each address must be set, and the processor loads it into the EPSR.T bit on entry. On ARMv7-A and ARMv7-R, the SCTLR.TE bit selects whether exceptions are taken in ARM or Thumb state, and the vector table holds instructions in 4-byte slots. In both cases the kernel can use compact 16-bit instructions for fast context saving and 32-bit instructions where critical sections need them, without ARM/Thumb mode switching. Drivers gain access to atomic operations through LDREX/STREX instructions, which the compiler generates directly in the mixed instruction stream.
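On Cortex-M parts specifically, the vector table holds handler addresses rather than instructions, and each handler address needs bit 0 set to indicate Thumb state. A firmware image can be sanity-checked for this with a short script. The sketch below is an assumption-laden illustration: check_vector_table is a hypothetical helper, the table is assumed to start at offset 0 of the image, and entry 0 is treated as the initial stack pointer.

```python
import struct


def check_vector_table(image, num_entries=16):
    """Return the indices of handler entries whose Thumb bit (bit 0) is clear.

    Assumes a Cortex-M style table at offset 0 of `image`: entry 0 is the
    initial stack pointer, later entries are handler addresses, and a
    value of zero marks a reserved or unused slot.
    """
    words = struct.unpack_from("<%dI" % num_entries, image, 0)
    bad = []
    for index, word in enumerate(words):
        if index == 0 or word == 0:
            continue  # stack pointer or reserved slot, not a code address
        if word & 1 == 0:
            bad.append(index)
    return bad


if __name__ == "__main__":
    # Made-up example values: stack pointer, two valid handlers, one bad one.
    entries = [0x20002000, 0x08000101, 0x08000111, 0x08000120] + [0] * 12
    image = struct.pack("<16I", *entries)
    print(check_vector_table(image))
```

Toolchains normally set bit 0 automatically for function symbols, so a non-empty result usually points at a hand-written table or a mislabeled assembly symbol.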

Security

The encoding makes instruction length unambiguous when decoding starts at a known instruction boundary: the top five bits of the first halfword determine whether a second halfword follows. This does not by itself prevent a jump into the middle of a 32-bit instruction, because the second halfword is not reserved and may decode as a valid 16-bit instruction; protection against such control-flow attacks comes from other mechanisms such as memory protection and privilege separation. The interworking branch instructions BX and BLX interpret the least significant bit of the target address as the instruction set state rather than as part of the address, so the resulting PC is always halfword-aligned. On Thumb-only cores, a target with bit 0 clear raises a fault instead of silently executing code in the wrong state.
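The handling of the least significant bit by BX and BLX can be modeled as below. This is a simplified sketch, not a description of any particular core: bx_write_pc is a hypothetical name, the fault on Thumb-only cores is represented by a Python exception, and the architecturally unpredictable case is simply rejected.

```python
def bx_write_pc(target, thumb_only=True):
    """Model how BX/BLX with a register operand interpret a target address.

    Returns (state, pc). Bit 0 selects the instruction set state and is
    removed from the address that is loaded into the PC.
    """
    if target & 1:
        return ("Thumb", target & ~1)  # halfword-aligned Thumb address
    if thumb_only:
        # Thumb-only cores (e.g. Cortex-M) cannot enter ARM state; the
        # attempt ends in a fault, modeled here as an exception.
        raise RuntimeError("invalid state: target has bit 0 clear")
    if target & 2:
        raise ValueError("unpredictable: ARM target is not word-aligned")
    return ("ARM", target)


if __name__ == "__main__":
    print(bx_write_pc(0x08000101))
    print(bx_write_pc(0x00008000, thumb_only=False))
```

The same convention explains why function pointers to Thumb code appear as odd values in memory dumps.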

Logging

Execution trace is provided by the Embedded Trace Macrocell (ETM). When working with Thumb-2, the ETM tracks instruction execution, distinguishing 16-bit from 32-bit instructions, and emits a compact stream of trace packets that records the outcome of direct branches and the targets of indirect branches. Trace tools combine this stream with the program image to reconstruct the complete execution history. Cycle counts and context identifiers in the stream can be correlated with specific mixed instruction sequences, and accuracy does not depend on code density.

Limitations

A fundamental limitation is that Thumb instructions, with the exception of conditional branches, have no condition field: both the 16-bit and the 32-bit encodings give up the 4-bit condition code carried by ARM instructions. Conditional execution of any other instruction therefore requires an IT block, which covers at most four instructions, or an ordinary conditional branch around the code. In addition, most 16-bit instructions can address only the low registers r0–r7. Reaching R8-R12 requires either a 32-bit encoding or one of the few 16-bit instructions that accept high registers (such as MOV, ADD, and CMP), which costs code density.
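In Thumb-2, conditional execution of non-branch instructions is expressed through the IT instruction described in the functionality list. How its 16-bit encoding expands into per-instruction conditions can be sketched as follows. The function decode_it is hypothetical, and the sketch assumes the commonly documented layout: 0xBF in the top byte, a 4-bit base condition, and a 4-bit mask whose lowest set bit terminates the block.

```python
COND_NAMES = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL"]


def decode_it(halfword):
    """Decode a 16-bit IT instruction into a list of condition names,
    one per instruction in the block (1 to 4 entries)."""
    firstcond = (halfword >> 4) & 0xF
    mask = halfword & 0xF
    if (halfword >> 8) != 0xBF or mask == 0 or firstcond == 0xF:
        raise ValueError("not a valid IT instruction")
    # The lowest set bit of the mask ends the block: mask 1000 -> one
    # instruction, x100 -> two, xy10 -> three, xyz1 -> four.
    lowest_set = (mask & -mask).bit_length() - 1
    count = 4 - lowest_set
    conditions = [firstcond]
    for i in range(1, count):
        # Each mask bit replaces bit 0 of the base condition: equal to
        # firstcond[0] means Then, the opposite value means Else.
        bit = (mask >> (4 - i)) & 1
        conditions.append((firstcond & ~1) | bit)
    return [COND_NAMES[c] for c in conditions]


if __name__ == "__main__":
    print(decode_it(0xBF08))  # IT EQ
    print(decode_it(0xBF0C))  # ITE EQ
    print(decode_it(0xBFA7))  # ITTEE GE
```

The four-instruction ceiling follows directly from the mask width: three Then/Else bits plus one terminating bit.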

History and development

The technology was created as an evolution of the 16-bit Thumb instruction set, bringing it close to functional parity with the 32-bit ARM instruction set. It was introduced with the ARMv6T2 architecture in the ARM1156T2-S processor, where the length of each instruction is determined by decoding the upper five bits of its first halfword, allowing 16- and 32-bit instructions to be mixed freely without a mode declaration. ARMv7 then made Thumb-2 a standard part of every profile: Cortex-M processors (ARMv7-M) execute Thumb code only and have no ARM state, while Cortex-A and Cortex-R processors (ARMv7-A/R) support Thumb-2 alongside the 32-bit ARM instruction set. Later extensions added digital signal processing and SIMD operations while maintaining backward compatibility for existing Thumb code.