Tensor (Multidimensional container for numerical data)

A tensor is a mathematical object that generalizes scalars, vectors, and matrices to an arbitrary number of dimensions. If a scalar is a point (zero dimensions), a vector is a line (one dimension), and a matrix is a table (two dimensions), then a tensor extends the same idea to three-dimensional cubes and higher-order arrays while preserving the structural organization of the data.

Tensors serve as the fundamental data structure in machine learning and deep neural networks, where model weights and input signals are represented as multidimensional arrays. In computational physics, they describe the stress-strain state of materials. Computer vision processes video as four-dimensional tensors (time, height, width, color channels), and general relativity describes gravity through the spacetime curvature tensor.

The main challenge is the high computational cost of convolution and multiplication operations, which in practice often calls for specialized hardware such as graphics processors. Mismatched dimensions lead to shape incompatibility errors, and when the number of features grows large, the curse of dimensionality makes the data extremely sparse. Tensor serialization is also critical: a tensor saved without axis metadata becomes a nameless array of numbers and loses its semantic connection to the original features.

How a Tensor works

Unlike a matrix, which always remains a two-dimensional object indexed by rows and columns, a tensor treats the dimension (axis) as a fundamental abstraction. In the mathematical and physical sense, the property that distinguishes a tensor from a simple nested list of numbers is the coordinate transformation rule: when the basis of a vector space changes, the tensor components transform according to a strict covariant or contravariant law. This guarantees that the physical or geometric entity being described remains invariant with respect to the choice of coordinate system, a property an ordinary array does not possess. Whereas matrix multiplication pairs only the columns of the first operand with the rows of the second, tensor computations use contraction along arbitrary axes (tensor contraction), which sums the products of elements along one or more shared dimensions.
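
Contraction along arbitrary axes can be illustrated with a minimal NumPy sketch. NumPy, the array names, and the shapes are assumptions chosen for illustration; the loop version is included only to show which products are summed.

```python
import numpy as np

# Illustrative shapes only: a has axes (i, j, k), b has axes (k, j, m).
a = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
b = np.arange(4 * 3 * 5, dtype=np.float64).reshape(4, 3, 5)

# Contract over both shared axes j and k at once. Ordinary matrix
# multiplication could not express this pairing directly.
c_einsum = np.einsum("ijk,kjm->im", a, b)

# The same contraction with tensordot: axes (1, 2) of a are paired with
# axes (1, 0) of b, that is, j with j and k with k.
c_tensordot = np.tensordot(a, b, axes=([1, 2], [1, 0]))

# Reference version with explicit loops.
c_loops = np.zeros((2, 5))
for i in range(2):
    for m in range(5):
        for j in range(3):
            for k in range(4):
                c_loops[i, m] += a[i, j, k] * b[k, j, m]
```

All three results have shape (2, 5) and agree numerically; the einsum string makes the pairing of axes explicit, which is why it is often preferred for readability.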

In software libraries such as PyTorch or TensorFlow, a dense tensor is typically implemented as a contiguous block of memory with attached shape and stride metadata. Element access computes an offset from the multi-index rather than dereferencing nested pointers. This gives high performance because data is densely packed into processor cache lines and operations parallelize efficiently. Where classical linear algebra focuses on two-dimensional structures, tensor decompositions, including the canonical polyadic decomposition and the Tucker decomposition, approximate high-order arrays with a small number of factors. For tensors with low-rank structure this can compress a model substantially, in favorable cases by orders of magnitude, with little loss of accuracy.
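
The offset computation can be sketched as follows. This is a minimal illustration using NumPy (an assumption; the paragraph names PyTorch and TensorFlow), and `flat_offset` is a hypothetical helper, not a library function.

```python
import numpy as np

def flat_offset(index, strides_in_elements):
    """Offset of a multi-index in the flat buffer: sum of index[d] * stride[d]."""
    return sum(i * s for i, s in zip(index, strides_in_elements))

t = np.arange(24, dtype=np.int32).reshape(2, 3, 4)    # contiguous, C-order

# NumPy reports strides in bytes; divide by the item size to count elements.
strides = tuple(s // t.itemsize for s in t.strides)   # (12, 4, 1)

flat = t.ravel()                                      # the underlying flat data
idx = (1, 2, 3)
offset = flat_offset(idx, strides)                    # 1*12 + 2*4 + 3*1 = 23
```

Reading `flat[offset]` yields the same element as `t[idx]`, which is all that multidimensional indexing amounts to for a dense tensor.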

Tensor functionality

  1. A Multidimensional Array as a Fundamental Data Structure. A tensor in the computational context is a container for numerical data organized in an N-dimensional regular grid. Unlike the physical concept of a tensor, the emphasis here is placed exclusively on storage geometry and operations on axes, not on coordinate transformation laws.
  2. Shape and Dimensionality Attributes. The shape is a tuple of integers defining the array size along each axis. The rank, or dimensionality (ndim), equals the length of this tuple; it counts axes and should not be confused with the rank of a matrix in linear algebra. A scalar has rank 0, a vector has rank 1, a matrix has rank 2. The total number of tensor elements is the product of all shape components.
  3. Linear Memory Storage Strategy. Physically, a tensor is placed in a one-dimensional block of memory. To map a multidimensional index to a flat offset, row-major (C-order) or column-major (Fortran-order) schemes are used. Strides define how far to move in that block to advance one step along a given dimension (counted in bytes in NumPy and in elements in PyTorch), which allows slice operations to be implemented without copying data.
  4. Reshape Operation Without Changing Data. The Reshape function returns a new view of the same memory buffer with a different axis geometry, provided the total number of elements matches. The key constraint is that the original element order in memory is preserved. If the existing strides cannot express the new shape, as often happens after a transpose, a copy is created instead.
  5. Universal Transpose Function. Permute or Transpose rearranges tensor axes according to a given index permutation. The operation does not physically move data; it only reorders the shape and stride tuples. This allows changing the axis interpretation at negligible cost, for example converting an image from CHW to HWC layout.
  6. Adding and Removing Degenerate Axes. Unsqueeze introduces an axis of length 1 at a specified position, increasing the rank. Squeeze, conversely, removes axes of length 1 (all of them by default, or only the one specified). These operations are critical for aligning dimensions for broadcasting when connecting neural network layers, and they do so without allocating new buffers.
  7. The Broadcasting Mechanism. Broadcasting enables element-wise operations on tensors of different shapes. The shape of the lower-rank tensor is virtually padded with unit axes on the left, after which dimensions are compared pairwise. Two dimensions are compatible if they are equal or one of them equals one. The expanded data is not materialized in memory: an axis of length one is simply re-read for every position along the matching axis.
  8. Slicing and Advanced Indexing. A basic slice creates a sub-view of the original tensor by manipulating the memory offset and strides. Advanced indexing (integer array indexing) produces a new copy of the data, since the selected positions are arbitrary and cannot be described by a regular stride.
  9. Element-wise Arithmetic Computations. Addition, multiplication, division, and subtraction operators are applied pairwise to tensor elements with broadcasting taken into account. The compute kernel is vectorized: instead of processing one scalar per instruction, SIMD processor instructions operate on several elements at once, which increases throughput.
  10. Convolution as a Reduction Operation. The convolution function slides a small kernel over input feature maps. At each position, element-wise multiplication and summation (reduce sum) are performed, and the results are collected into an output tensor. The implementation is often reduced to matrix multiplication via the im2col transformation in order to use high-performance GEMM kernels.
  11. Batch Matrix Multiplication. The Batch MatMul function operates on three-dimensional arrays of shapes (batch, M, K) and (batch, K, N) and produces a result of shape (batch, M, N). It multiplies corresponding matrix pairs from the batch independently. This extends matrix multiplication to stacks of matrices and serves as the computational core of the attention mechanism in transformer architectures.
  12. Aggregation Along a Specified Axis. The Reduce function (sum, mean, max) collapses the specified dimension, decreasing the tensor rank. Reducing with dimension preservation (keepdims=True) leaves an axis of size 1 in place, so that the result broadcasts correctly against the original tensor, for example when normalizing by statistics or propagating gradients.
  13. Einstein Summation. The Einsum function implements the Einstein summation convention. A string such as 'bhn,bhm->bnm' names the axes of each operand and of the result; indices missing from the output (here h) are summed over. This expresses complex operations (transposition, batch product, diagonal extraction) as a single compact instruction that the library can optimize, often avoiding explicit temporary arrays.
  14. Splitting and Concatenating Arrays. Split divides a tensor into a list of subtensors along a specified axis. In libraries with view semantics, such as NumPy, the pieces are views into the original buffer with shifted data pointers rather than independent copies. Concatenate joins a sequence of tensors into one; it requires allocating new memory for the resulting contiguous block and checks that all dimensions except the concatenation axis match.
  15. Logical Indexing via Mask. Selection with a boolean mask of the same shape returns a one-dimensional tensor of the values at True positions. Masking is effective for filtering anomalies or selecting elements above a threshold, although it destroys the original axis structure and produces a flat vector of results.
  16. Data Typing and Type Casting. Each tensor is homogeneous and bound to a specific dtype (float32, int8, bfloat16). Type casting trades precision against performance and memory. Mixed-precision training typically keeps a master copy of the weights in float32 while performing most arithmetic in float16, with dynamic loss scaling to preserve small-magnitude gradients.
  17. View and Copy Semantics. A view creates a new tensor that shares data with the original; modifying one changes the other. A copy creates a deep clone with an independent buffer. The contiguous function repacks a non-contiguous tensor into dense C-order, which is often required for passing data to low-level BLAS libraries.
  18. Working with Sparse Tensors. A sparse tensor stores only the non-zero elements and their indices (COO, CSR formats). Because every stored value also carries its indices, coordinate storage pays off only at low fill rates, below roughly 10% as a rule of thumb; the exact break-even point depends on the index width and the number of dimensions. Arithmetic on such structures is implemented by specialized kernels that avoid multiplication by zero.
  19. Usage in Backpropagation. A tensor is extended with a grad attribute that accumulates the gradient of the loss function. The computation graph is recorded during the forward pass from the connections between tensors. Calling backward() traverses this graph in reverse topological order, applying the chain rule to each operation to compute partial derivatives.
  20. Managing Device Placement. A tensor has a device property indicating where its data resides (CPU, GPU, TPU). Transfer between devices copies the data over an interconnect such as the PCIe bus; it is synchronous by default in many frameworks, although asynchronous copies are possible, for example from pinned host memory. Some frameworks also use lazy computation: an operation is not executed until its result is explicitly requested, which gives the runtime room to optimize the graph and its placement.
  21. Memory Model and Gradient Stability. The detach() operation returns a tensor excluded from the computation graph. This is useful when a value must be used without contributing gradients, for example when updating embeddings manually or computing metrics. The clamp() function restricts values to a range, which can help keep activations or gradients bounded and stabilize optimizer convergence.
  22. Binary Array Serialization. A serialization protocol converts a tensor into a flat byte stream with a header (data type, shape, byte order). Formats such as NPY and SafeTensors store the raw data in a layout that can be memory-mapped, so slices of a model can be loaded from disk on demand through the virtual memory interface, with only a small header to parse.
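
Several of the operations listed above can be tried out in a few lines. The following is a minimal sketch in NumPy (an assumption; PyTorch and TensorFlow expose comparable operations under slightly different names), with array names and shapes chosen purely for illustration.

```python
import numpy as np

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Shape, rank, and element count.
rank, count = x.ndim, x.size                     # 3 axes, 2 * 3 * 4 elements

# Reshape of a contiguous array is a view; transpose only permutes metadata.
r = x.reshape(6, 4)
p = x.transpose(2, 0, 1)                         # shape (4, 2, 3)

# Flattening the transposed view cannot be expressed with strides, so it copies.
q = p.reshape(-1)

# Unsqueeze (a new axis of length 1) and squeeze.
col = np.arange(3, dtype=np.float32)[:, np.newaxis]   # shape (3, 1)
row = np.squeeze(col)                                  # shape (3,)

# Broadcasting: (2, 3, 4) * (3, 1) -> (2, 3, 4) without materializing col.
scaled = x * col

# A basic slice is a view; integer-array (advanced) indexing is a copy.
s = x[:, 1:, ::2]                                # shape (2, 2, 2)
a = x[[0, 1], [2, 0]]                            # picks x[0, 2] and x[1, 0]

# Reduction with keepdims=True keeps an axis of length 1 for broadcasting.
mean = x.mean(axis=-1, keepdims=True)            # shape (2, 3, 1)
centered = x - mean

# A boolean mask flattens the selection to one dimension.
big = x[x > 20]                                  # values 21, 22, 23

# Batch matmul and the equivalent einsum.
w = np.ones((2, 4, 5), dtype=np.float32)
bmm = x @ w                                      # shape (2, 3, 5)
bmm_einsum = np.einsum("bmk,bkn->bmn", x, w)
```

`np.shares_memory(x, r)` and `np.shares_memory(x, s)` report shared buffers for the reshape and the basic slice, while the same check on `q` and `a` reports independent copies, which is a convenient way to verify view and copy behavior in practice.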

Comparisons

  • Tensor vs Matrix. A tensor is a multidimensional array of arbitrary dimensionality, whereas a matrix is strictly two-dimensional. This allows a tensor to encode multi-axis data, such as time series with numerous features or video streams, without artificially flattening dimensions, thereby preserving the spatial and temporal correlations that are critically important for deep learning.
  • Tensor vs Vector. A vector is a one-dimensional special case of a tensor characterized by a single axis. The transition from vector to tensor means an expansion of the model’s descriptive capacity: if a vector describes an object with a set of independent scalar features, a tensor is capable of capturing complex multimodal interactions by distributing information across several axes, such as image width, height, and channels.
  • Tensor vs Scalar. A scalar is a zero-rank tensor, devoid of axes and representing a single numerical value. While a scalar captures only a magnitude, such as an intensity or a norm, a tensor structure preserves directional relationships and multicomponent states. Scalars still play a central role in tensor computation: a loss function reduces a tensor to a single number, and that number is the quantity differentiated during training.
  • Tensor vs Multi-dimensional List (nested Python lists). Unlike native nested lists, which are interpreted as recursive structures with pointers, a dense tensor in specialized libraries is allocated in a contiguous memory block. This enables operation vectorization and efficient CPU cache utilization, and also allows running operations on graphics accelerators without the overhead of interpreting reference hierarchies.
  • Tensor vs Dataframe (tabular structure). A dataframe is oriented towards heterogeneous columnar data with explicit string metadata and semantic indices, while a tensor is homogeneous and anonymous along its axes. Tensor processing is designed for computation graphs and automatic differentiation over numerical arrays, whereas a dataframe is optimized for relational operations of filtering, grouping, and aggregation of mixed data types.

OS and driver support

A tensor as a software abstraction is implemented through a layered architecture. A high-level interface (e.g., NumPy or PyTorch) translates operations into calls to low-level libraries specific to each operating system. For accelerators, vendor compute stacks and their drivers (CUDA for NVIDIA, ROCm for AMD, oneAPI for Intel) compile these calls into machine code and manage memory allocation, thread synchronization, and kernel execution on the graphics processor. On systems without a discrete accelerator, execution falls back to the vector instructions of the central processor (AVX-512, NEON), selected through dynamic dispatch mechanisms.

Security of access to multidimensional data

Bounds and type checking in tensor operations can be enforced by embedding symbolic predicates into the computation graph. These predicates verify shapes and indices at the tracing or compilation stage, which rules out buffer overflows and corruption of adjacent memory regions for statically known dimensions without adding runtime overhead. For operations whose dimensions cannot be determined statically, lightweight assertions are embedded into the compiled code; on detecting an out-of-bounds access they raise a deterministic error that preserves the call stack.
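
The idea of checking shapes and indices before any data is touched can be sketched in plain Python. The functions `matmul_result_shape` and `check_index` are hypothetical helpers written for this illustration; real frameworks perform comparable checks inside their tracers and compilers, whose internals are not shown here.

```python
def matmul_result_shape(lhs_shape, rhs_shape):
    """Validate (M, K) @ (K, N) using shape metadata only and return (M, N)."""
    if len(lhs_shape) != 2 or len(rhs_shape) != 2:
        raise ValueError("both operands must be two-dimensional")
    m, k = lhs_shape
    k2, n = rhs_shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return (m, n)

def check_index(shape, index):
    """Raise IndexError if a multi-index falls outside the given shape."""
    if len(index) != len(shape):
        raise IndexError(f"expected {len(shape)} indices, got {len(index)}")
    for axis, (i, size) in enumerate(zip(index, shape)):
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of bounds for axis {axis} with size {size}")

# A valid operation passes the check and yields the output shape.
ok_shape = matmul_result_shape((2, 3), (3, 4))

# An invalid one is rejected before any memory would be read or written.
try:
    matmul_result_shape((2, 3), (4, 5))
    shape_error = None
except ValueError as exc:
    shape_error = str(exc)

try:
    check_index((2, 3), (1, 3))
    index_error = None
except IndexError as exc:
    index_error = str(exc)
```

Because both helpers work on shape tuples alone, they can run once when a graph is built; only checks that depend on runtime values need to remain in the executed code.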

Parallelism models and memory layout transform

The logical representation of a tensor is separated from its physical storage through a stride system. For an N-dimensional array, the strides define the step in bytes between adjacent elements along each axis, so operations such as transposition or slice extraction only recalculate metadata and do not copy the original data. Extended layout formats, including block-sparse and fractal patterns, are constructed by factorizing the multidimensional index space into hierarchical tables, which supports coalesced memory access during parallel execution on streaming multiprocessors.
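
How a slice reduces to new metadata over the same buffer can be observed directly in NumPy. This is a minimal sketch under the assumption that NumPy is the library in use; `data_address` is a hypothetical helper that reads the documented `__array_interface__` attribute.

```python
import numpy as np

base = np.arange(20, dtype=np.int64).reshape(4, 5)    # byte strides (40, 8)

# Rows 1 and 3 with columns reversed: only offset, shape, and strides change.
view = base[1::2, ::-1]

def data_address(arr):
    """Memory address of the element at index (0, ..., 0) of arr."""
    return arr.__array_interface__["data"][0]

# The view starts at base[1, 4], which is 1 * 40 + 4 * 8 bytes into the buffer.
byte_offset = data_address(view) - data_address(base)

# Writing through the view is visible in the base array.
view[0, 0] = -1
```

Here `view.strides` is `(80, -8)`: the row step doubles because every second row is taken, and the column step is negative because the columns are traversed backwards.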

Automatic logging and deterministic replay

Deterministic recording of all non-commutative operations on a tensor can be implemented by intercepting each API call and assigning it a globally ordered, monotonically increasing identifier. The input shape metadata and the initial state of the pseudorandom number generator are serialized into a binary journal, which is made resilient to power failures by checksums and a double-write circular buffer. Replaying the journal in an identical hardware and software environment then restores the intermediate computation state bit for bit at an arbitrary step.
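
The record-and-replay idea can be sketched in a few lines. Everything here is hypothetical and simplified: the `Journal` class, the operation names, and the seeds are invented for illustration, the journal lives in memory, and the checksums and double-write buffer mentioned above are omitted.

```python
import itertools
import numpy as np

class Journal:
    """Hypothetical in-memory journal; a real one would be written to durable storage."""
    def __init__(self):
        self._next_id = itertools.count()
        self.entries = []

    def record(self, op, shape, seed):
        entry = {"id": next(self._next_id), "op": op, "shape": tuple(shape), "seed": seed}
        self.entries.append(entry)
        return entry

def apply_entry(entry, state):
    """Execute one journaled operation; randomness comes only from the recorded seed."""
    rng = np.random.default_rng(entry["seed"])
    noise = rng.standard_normal(entry["shape"])
    if entry["op"] == "add_noise":
        return state + noise
    if entry["op"] == "scale_noise":
        return state * (1.0 + 0.1 * noise)
    raise ValueError(f"unknown op: {entry['op']}")

def replay(entries, initial_state, upto):
    """Rebuild the state after the entry with identifier `upto`, replaying in id order."""
    state = initial_state.copy()
    for entry in sorted(entries, key=lambda e: e["id"]):
        if entry["id"] > upto:
            break
        state = apply_entry(entry, state)
    return state

# Original run: every operation is journaled before it is applied.
journal = Journal()
initial = np.zeros((2, 3))
state = initial.copy()
snapshots = []
for op, seed in [("add_noise", 11), ("scale_noise", 22), ("add_noise", 33)]:
    entry = journal.record(op, state.shape, seed)
    state = apply_entry(entry, state)
    snapshots.append(state.copy())

# Later: restore the intermediate state after step 1 from the journal alone.
restored = replay(journal.entries, initial, upto=1)
```

In the same environment, `restored` matches the snapshot taken after step 1 byte for byte, because the only source of nondeterminism, the random generator, is re-seeded from the journal.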

Limitations

The tensor abstraction requires the number of dimensions to be fixed in advance and has no native support for structures with irregular nesting. Representing data such as ragged arrays or graphs of variable topology therefore requires padding masks or sparse index formats, which increases the amount of unused memory. When automatic differentiation is built on an operation tape, saving intermediate values for the backward pass creates a trade-off between computational speed and peak video memory consumption; checkpointing methods address it by discarding selected intermediates and rematerializing the corresponding subgraphs on a schedule.
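
The padding-mask workaround for ragged data can be sketched as follows. This is a minimal NumPy illustration with made-up values; the variable names are assumptions, and real pipelines usually pad per batch rather than per dataset.

```python
import numpy as np

# Ragged input: rows of different lengths cannot form a regular grid directly.
ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]

max_len = max(len(row) for row in ragged)
padded = np.zeros((len(ragged), max_len), dtype=np.float32)
mask = np.zeros((len(ragged), max_len), dtype=bool)

for i, row in enumerate(ragged):
    padded[i, :len(row)] = row       # real values on the left
    mask[i, :len(row)] = True        # True marks real data, False marks padding

# Masked mean per row: padding must not contribute to the statistics.
row_means = (padded * mask).sum(axis=1) / mask.sum(axis=1)

# Fraction of the buffer that holds padding rather than data.
wasted = 1.0 - mask.mean()
```

For this input the per-row means are 2.0, 4.0, and 5.5, and one third of the allocated elements are padding, which is the memory cost the paragraph above refers to.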

Evolution of interfaces and hardware acceleration

The tensor as a fundamental data structure developed from the arrays of Fortran and APL, through the standardization of vector and matrix operations in the BLAS and LAPACK libraries, to the eager and graph execution modes of modern frameworks. In eager mode the expression A + B is computed immediately. In graph mode, operator semantics are overloaded so that the same expression does not perform the computation but builds a symbolic representation in an intermediate graph, which a JIT compiler then compiles into a specialized kernel. Such kernels exploit accelerator features such as tensor cores with mixed-precision matrix multiplication and asynchronous data prefetching across multiple cache levels.
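
The way operator overloading turns A + B into a graph rather than a result can be shown with a toy example. The `Node` class, `constant`, and `evaluate` are hypothetical names for this sketch; it interprets the graph directly and does not attempt the JIT compilation step that real frameworks perform.

```python
import numpy as np

class Node:
    """One vertex of a symbolic expression graph."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __add__(self, other):
        return Node("add", (self, constant(other)))

    def __mul__(self, other):
        return Node("mul", (self, constant(other)))

def constant(x):
    """Wrap raw data in a leaf node; existing nodes pass through unchanged."""
    return x if isinstance(x, Node) else Node("const", value=np.asarray(x))

def evaluate(node):
    """Walk the graph and compute the result only when it is requested."""
    if node.op == "const":
        return node.value
    args = [evaluate(child) for child in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError(f"unknown op: {node.op}")

A = constant([1, 2, 3])
B = constant([10, 20, 30])

expr = A + B * 2            # builds a graph; no arithmetic has run yet
result = evaluate(expr)     # computation happens here
```

After the last line `result` holds the values 21, 42, and 63, while `expr` itself is only a small tree with an "add" node at the root and a "mul" node beneath it. Having the whole expression available before execution is what lets a compiler fuse or reorder operations.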