GEMM (Matrix multiplication using the row times column method)

GEMM (Matrix Multiplication) performs the multiplication of two matrices, computing each element of the result as the scalar product of an entire row of the first matrix and an entire column of the second. This is a basic linear algebra operation, where to obtain the value at position (i, j) we elementwise multiply row i from the left matrix by column j from the right matrix and sum the results.

GEMM is the computational foundation of deep learning, providing the forward and backward pass in fully connected and convolutional neural networks. The procedure underlies scientific simulations, 3D graphics rendering, signal processing algorithms, and computer vision. Libraries such as BLAS optimize GEMM for central and graphics processors, since this operation is critically important for high-performance computing.

The main difficulty is the low arithmetic intensity in a naive implementation, which hits the memory bandwidth bottleneck rather than the processor speed limit. Direct access to columns causes cache misses, since data in memory is stored row by row. An incorrect nested loop order slows down computations by tens of times. Careful balancing of block loading is required to avoid compute unit idle time and minimize thread synchronization overhead.

How GEMM works

Unlike the elementwise Hadamard product, which only requires matching dimensions and multiplies elements pairwise at identical positions, GEMM imposes a strict requirement: the number of columns of the first matrix must equal the number of rows of the second. Compared to the Kronecker product, which without restrictions creates a block matrix of gigantic size, GEMM generates a compact result by collapsing the common K dimension. The computational pattern looks like this: the resulting matrix C of size M×N is obtained from A (M×K) and B (K×N). For each of the M rows of matrix A and each of the N columns of matrix B, K multiplications and K−1 additions are launched.

Modern optimized libraries (for example, cuBLAS or oneAPI Math Kernel Library) never perform this with a trivial triple loop. Instead, matrices are sliced into tiles (micro-matrices) that are placed into fast shared GPU memory or L1/L2 cache of the central processor. Each tile is processed by a micro-kernel that computes a small outer product, thanks to which data is reused many times and the number of accesses to slow global memory drops radically. Additionally, SIMD instructions (FMA) are employed for simultaneous multiplication and addition of vectors, and pipelining masks memory access latencies, allowing performance close to the theoretical peak of arithmetic operations per second on specific hardware.

GEMM functionality

  1. Basic GEMM signature. The GEMM function performs the operation C ← α·A·B + β·C, where A, B, and C are matrices, and α and β are scalars. Matrix A has size M×K, matrix B has size K×N, matrix C has size M×N. This form of notation covers multiplication and accumulation in one operation, eliminating intermediate memory allocation for the temporary result A·B.
  2. Semantics of parameters m, n, k. Parameter m defines the number of rows of matrices A and C. Parameter n specifies the number of columns of matrices B and C. Parameter k is the dimension of the inner extent: the number of columns of matrix A and the number of rows of matrix B. Each index runs a range strictly corresponding to the logical size in the rows × columns operation.
  3. Transposing operands. Parameters transA and transB control the interpretation of input matrices. With transA = ‘N’ matrix A is not transposed, with ‘T’ — Aᵀ is used. Similarly transB affects B. The combination of flags allows reusing data without physical copying, substituting Aᵀ·B, A·Bᵀ or Aᵀ·Bᵀ into the same computational scheme.
  4. Leading dimensions storage convention. Parameters ldA, ldB, and ldC specify the step between consecutive rows or columns in memory (leading dimension). For column-major storage ldA ≥ m (with transA=’N’), for row-major — ldA ≥ k. The difference between the logical size and ld allows processing submatrices within larger arrays without extracting data.
  5. Column-major storage format. Most BLAS implementations accept matrices in column-major order. Elements of one column are located contiguously in memory. With this format, the operation C = A·B interprets the first index as a row and the second as a column, which directly affects loop unrolling and cache traversal order choice.
  6. Scalar accumulation. The term β·C is added to the result elementwise after computing α·A·B or simultaneously with it. Parameter β=0 initiates pure multiplication with a zero background, β=1 implements accumulation. The hardware implementation often performs accumulation in double-precision registers regardless of the input data type.
  7. Scalar (Converting a multidimensional tensor into a single number)
  8. Microarchitectural tiling strategy. Efficient GEMM splits m, n, k into blocks (tiles) that fit into the cache hierarchy. The block of the k-dimension is reused many times. The output block C of dimension m×n is fully held in registers or L1 cache. This approach minimizes accesses to external memory and maximizes data reuse.
  9. Packing block A into a buffer. A fragment of matrix A is packed into a contiguous buffer optimized for sequential reading. During packing, rows or columns are permuted according to the access pattern of the vector instruction set. The goal is to provide the microarchitectural kernel with a continuous stream of operands without memory bank conflicts.
  10. Packing block B with replication. A block of matrix B is packed with an orientation toward wide broadcasting of one element to several computational units. A format is often used where one column of B is duplicated for parallel processing of several rows of block A. This technique allows loading element B once and applying it to an entire vector A.
  11. Macro-kernel of the inner loop. The macro-kernel, or outer kernel, operates on prepared packed buffers. It iterates over the k panel, calling the micro-kernel that computes a small block C of dimension m_r × n_r. The boundaries m_r and n_r are chosen so that the C panel block fits entirely in the available register file of a specific processor core.
  12. Micro-kernel and register layout. The micro-kernel implements the elementary multiplication of fragments A (m_r × 1) and B (1 × n_r), accumulating the result in a temporary matrix C_reg[m_r][n_r] mapped to vector registers. The loop over k in the micro-kernel is unrolled with a factor multiple of the SIMD vector length, supporting pipeline filling.
  13. Vectorization and SIMD instructions. A single FMA (Fused Multiply-Add) instruction performs c = a·b + c on vectors with widths from 128 to 512 bits. Inside the micro-kernel, element A is broadcast across the entire width of the vector register, then multiplied by a row of panel B. Such a scheme requires data alignment and adherence to load stride.
  14. Caching hierarchy during the accumulation stage. Block C after computation remains resident in L2 or L1 cache until all iterations over the k dimension are completed. Writing back to main memory happens once. This turns GEMM from a memory-cubic algorithm into a quadratic one in terms of DRAM traffic.
  15. DRAM (Storage and Byte-addressing of Data)
  16. Accounting for matrix rectangularity. Narrow or elongated matrices are processed by panel decompositions. When n ≪ m the algorithm splits the m-dimension into several vertical panels. When m, n are comparable to k, a block size is chosen that balances the load of computational units, preventing idle time.
  17. Double buffering scheme. During computations on the current block C, the next block A and B is preloaded asynchronously, overlapping memory latency with computations. Double buffering is implemented by explicit pointer switching to packed buffers, which masks delays even in the absence of a hardware prefetch mechanism.
  18. Precision and mixed types. Input matrices A and B can be supplied in FP16 or BF16 format, while accumulation in C_reg is conducted in FP32. When writing the final result C, type conversion is performed. In GPU tensor cores, INT8 and INT4 formats with a 32-bit accumulator are supported, requiring separate quantization.
  19. Handling edge cases. If the sizes m, n, or k are not multiples of the micro-kernel size, fractional processing of boundaries is performed. Residual rows and columns are processed by special slow-path kernels or by a generic kernel with SIMD write masking. The efficiency of such a path is lower, so a decomposition into blocks is chosen that minimizes residuals.
  20. Computation quality invariants. Summation in GEMM must be performed deterministically. Implementations accumulate C_reg in FP32 or FP64, after which scaling is applied once. During parallelization, the reduction order of parts C is fixed to guarantee the bitwise reproducibility of the result across repeated runs with identical decomposition parameters.
  21. Parallel decomposition over m and n. Multithreaded GEMM divides the index space among threads over the m (rows) and n (columns) dimensions, eliminating write races on C. The k dimension is not parallelized due to the need for accumulator synchronization. Each thread processes its own rectangular region of the final matrix.
  22. BLAS Level 3 interface. The function sgemm/dgemm/cgemm/zgemm is the reference implementation of GEMM for single/double precision and complex numbers. The signature is strictly standardized: gemm(transA, transB, m, n, k, alpha, A, ldA, B, ldB, beta, C, ldC). Portability between libraries (OpenBLAS, MKL, BLIS) is ensured precisely by this contract.

Comparisons

  • GEMM vs GEMV. The GEMM operation performs multiplication of two matrices, forming the resulting matrix as the sum of products of rows of the first matrix and columns of the second. The GEMV function represents a special case of matrix-vector multiplication, where one of the operands degenerates into a one-dimensional array. The computational complexity of GEMM is O(n³), whereas GEMV is limited to O(n²), which makes the matrix-vector product less demanding on memory bandwidth, but also lacking the intensive data reuse characteristic of block GEMM algorithms.
  • GEMM vs SYMM. Unlike the universal GEMM, which accepts general matrices, the SYMM operation is optimized for Hermitian or symmetric data representation. While GEMM processes all elements, SYMM uses the symmetry property to access only the upper or lower triangle, reducing the number of operations by almost half. This specialization imposes increased requirements on the correctness of matrix filling, but allows achieving substantial savings of computational resources when solving problems with quadratic forms in nuclear physics and structural analysis.
  • GEMM vs TRMM. The TRMM function performs multiplication of a triangular matrix by a dense rectangular matrix, while GEMM operates on two dense matrices. Using the triangular form in TRMM allows ignoring zero elements below or above the main diagonal, reducing the number of required multiplication and addition operations. In the context of solving systems of linear equations by forward substitution, TRMM is more efficient than GEMM, since it algorithmically eliminates redundant computations inevitable when converting a triangular matrix to a full dense format.
  • GEMM vs GEMM3M. Standard GEMM performs multiplication exclusively over the field of real numbers, while the GEMM3M modification is adapted for processing complex matrices using three real multiplications instead of four. Applying this algebraic trick reduces the number of expensive floating-point operations by approximately 25%, which is critically important for digital signal processing tasks. However, GEMM3M requires more additions and extra caution in controlling numerical stability under limited bit-width conditions.
  • GEMM vs Batched GEMM. A single GEMM operation is optimized for maximum performance on large matrices that fill cache memory and computational pipelines. The batched mode Batched GEMM, on the contrary, is designed for simultaneous multiplication of many small matrices of independent tasks. While GEMM minimizes delays through deep parallelism within one kernel, Batched GEMM solves the problem of launching a large number of kernels to prevent graphics processor idle time, which is fundamentally important when processing graph neural networks and in spectral clustering tasks.

Execution environment support

The interaction of the GEMM operation with the operating system is realized through a layer of user-mode kernel-mode drivers (KMD) and user-mode drivers (UMD), where the math library compiler generates adaptive machine code for a specific processor revision identifier (CPUID) during the first call, and the allocation of large contiguous blocks of virtual memory is carried out via the mmap system call with the MAP_HUGETLB flag to minimize TLB misses during row-wise matrix processing.

Computation integrity and memory protection

To prevent data corruption when multiplying rows by columns, a buffer bounds checking mechanism is applied at the hardware level through Intel MPX extensions or software checks of leading dimension (LD) sizes, where the LD parameter defines the step between the beginnings of adjacent rows in memory, and if the actual size of the allocated area does not match the declared matrix dimensions, the driver immediately returns a STATUS_ACCESS_VIOLATION exception without executing the dangerous BLAS kernel code.

Hierarchical state logging

The GEMM library tracing system records each stage of the computation pipeline in a kernel ring buffer: at the micro-kernel level, cache line load metrics are recorded when multiplying a row vector by a column block, at the macro-kernel level, parameters of the tiled decomposition of the m and n dimension loops are logged, and user callbacks allow exporting aggregated gigaflops data to the ETW subsystem without stopping the command pipeline.

Hardware and architectural limitations

The speed of multiplying rows of the first matrix by columns of the second is rigidly limited by the peak bandwidth of the memory subsystem (STREAM benchmark), therefore the implementation of fused multiply-add instructions via VEX encoding prefixes of AVX2 requires alignment of column start addresses to a 32-byte boundary, and when the matrix dimension exceeds the L2 cache associativity threshold, a drop in vector ALU utilization is inevitable due to the latency of fetching data from RAM.

Evolution of the algorithmic base

The development of GEMM began with a naive triple Fortran loop, then moved to the Goto block algorithm with packing row panels into a contiguous buffer to eliminate cache bank conflicts, and in modern implementations a dynamic runtime microarchitecture selection is applied based on discriminant performance analysis, where for narrow matrices a specialized micro-kernel is activated that computes the product of sparse rows and dense columns without branching in the loop body.