What is EXT4 (Journaling file system for Linux)

EXT4 is the primary file system of Linux and an evolution of EXT3. It organizes data on disk as a hierarchy of directories and files and keeps a journal of pending operations. The journal protects the file system structure from corruption on sudden power loss while preserving high performance and supporting very large volumes.

EXT4 is the default in many Linux distributions, including Debian, Ubuntu, and their derivatives. It is used on servers, desktop computers, and embedded systems. Thanks to its stability and backward compatibility with EXT2/3, it is common on boot partitions, in user home directories, and in enterprise file storage where reliability proven over decades of operation is required.

Typical problems

The main difficulty is a critical loss of performance under severe fragmentation on disks filled to more than 85-90 percent. Although journaling protects metadata, it does not always guarantee the integrity of file contents during a crash unless full data journaling mode is used. A lengthy file system check by the fsck utility on very large partitions measured in tens of terabytes can take hours and lead to significant downtime.

How EXT4 works

At its core lies the classic architecture of Unix-like file systems, extended with modern scalability mechanisms. Disk space is divided into block groups. Each group contains a block bitmap, an inode bitmap, an inode table, and the data blocks themselves; backup copies of the superblock and group descriptors are kept in some groups (with the sparse_super feature, not in every one). This grouping reduces fragmentation and head movement, because the metadata and data of a single file tend to reside within one group.
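The block group arithmetic can be sketched in a few lines. This is an illustration only: it assumes 4 KiB blocks and a layout where one bitmap block describes exactly one group, and the names BLOCK_SIZE, BLOCKS_PER_GROUP, and locate_block are made up for the example.

```python
# Sketch: find the block group that holds a given block number.
# Assumes 4 KiB blocks and one block-bitmap block per group, so a group
# spans as many blocks as there are bits in a single block.

BLOCK_SIZE = 4096                      # bytes per block (assumed)
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE      # 32768 blocks described by one bitmap block
FIRST_DATA_BLOCK = 0                   # 0 when the block size is larger than 1 KiB


def locate_block(block_number: int) -> tuple[int, int]:
    """Return (group index, offset inside the group) for a block number."""
    relative = block_number - FIRST_DATA_BLOCK
    return relative // BLOCKS_PER_GROUP, relative % BLOCKS_PER_GROUP


group_bytes = BLOCKS_PER_GROUP * BLOCK_SIZE
print(group_bytes // 2**20)            # size of one group in MiB
print(locate_block(100_000))           # group and offset of block 100000
```

Under these assumptions a group covers 128 MiB, which is why files of moderate size can usually be kept inside a single group.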

The central element is the inode, a structure that stores all file metadata except the name: access permissions, timestamps, size, and pointers to the blocks holding the actual content. The scheme inherited from EXT2/3 is multi-level block addressing. The first 12 pointers are direct and reference data blocks. The thirteenth pointer is singly indirect and leads to a block containing an array of pointers to data blocks. The fourteenth is doubly indirect, and the fifteenth is triply indirect. The key innovation of EXT4 is the extent mechanism, which replaces this scheme to speed up work with large files. An extent describes a contiguous range of physical blocks by its starting block and the count of consecutive blocks. A single inode can hold up to four extents directly; for heavily fragmented files an extent tree is built, with index nodes that point to blocks of further extents. This tree is separate from the HTree, the hashed index that EXT4 uses for directories.
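How an extent list translates a file offset into a disk location can be shown with a small model. The sketch below is a simplification and not the on-disk format: the Extent tuple and the map_block helper are hypothetical names, and a flat sorted list stands in for the extent tree.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    logical: int    # first file-relative block covered by the extent
    physical: int   # first physical block on disk
    length: int     # number of contiguous blocks


def map_block(extents: list[Extent], logical_block: int) -> Optional[int]:
    """Translate a file-relative block number to a physical block.

    `extents` must be sorted by `logical`. Returns None for a hole.
    """
    starts = [e.logical for e in extents]
    i = bisect_right(starts, logical_block) - 1   # last extent starting at or before the block
    if i < 0:
        return None
    e = extents[i]
    if logical_block < e.logical + e.length:
        return e.physical + (logical_block - e.logical)
    return None


# A hypothetical file stored in two fragments with a hole in between.
file_extents = [Extent(0, 5000, 100), Extent(200, 9000, 50)]
print(map_block(file_extents, 42))    # inside the first extent
print(map_block(file_extents, 150))   # falls into the hole
print(map_block(file_extents, 210))   # inside the second extent
```

Two entries describe 150 blocks here, whereas indirect addressing would need one pointer per block.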

Journaling is implemented through a separate circular journal area. Before changes are made to the main file system structures, the operation is first written to the journal as an atomic transaction. Only after the transaction is committed in the journal are the metadata changes written to their final location on disk (the checkpoint stage). If power is lost before that happens, the next mount replays the committed but not yet applied transactions from the journal and restores structural consistency. By default, EXT4 operates in ordered mode, in which file data is flushed to disk before the related metadata is committed to the journal, so stale garbage does not appear in newly allocated blocks after a crash. File names are stored not in inodes but in directory structures, which can be indexed for fast search as an HTree (a hashed tree); this keeps lookups fast even in directories with millions of entries.
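The commit, checkpoint, and replay sequence can be illustrated with a toy model. This is a sketch of the general write-ahead idea, not of JBD2: the ToyJournal class and its methods are invented for the example, and Python dictionaries stand in for disk blocks.

```python
# Toy write-ahead journal: a sketch of commit, checkpoint, and replay.

class ToyJournal:
    def __init__(self):
        self.disk = {}        # "main" structures: block number -> contents
        self.journal = []     # transactions in the order they were started

    def begin(self):
        txn = {"blocks": {}, "committed": False}
        self.journal.append(txn)
        return txn

    def log(self, txn, block, data):
        txn["blocks"][block] = data    # recorded in the journal only

    def commit(self, txn):
        txn["committed"] = True        # stands in for writing the commit record

    def checkpoint(self):
        """Copy committed transactions to the main area and free their journal space."""
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["blocks"])
        self.journal = [t for t in self.journal if not t["committed"]]

    def replay_after_crash(self):
        """On mount: apply committed transactions, discard incomplete ones."""
        self.checkpoint()
        self.journal.clear()


fs = ToyJournal()

t1 = fs.begin()
fs.log(t1, 10, "inode A")
fs.commit(t1)

t2 = fs.begin()
fs.log(t2, 11, "inode B")      # the crash happens before this one commits

fs.replay_after_crash()
print(fs.disk)                 # only the committed change survives
```

The uncommitted transaction leaves no trace in the main structures, which is the property that keeps the file system consistent after a crash.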

EXT4 features

Block Allocation. EXT4 uses extents, contiguous ranges of physical blocks described by a single descriptor. With 4 KiB blocks, one extent can address up to 128 MiB of data, which sharply reduces the amount of mapping metadata and the CPU load when working with large files compared to indirect addressing.
Delayed Block Allocation. With this technique, a write call does not choose physical blocks immediately; the data stays in the page cache and only the required space is accounted for. Actual extent allocation occurs when dirty pages are flushed, which lets the allocator see the full size of the write, place it contiguously, and reduce fragmentation.
Persistent Allocation via fallocate. With default flags, the fallocate system call allocates blocks for a file range without writing zeros to them: the extents are marked as unwritten and read back as zeros. The FALLOC_FL_ZERO_RANGE flag zeroes a range, preferably by converting it to unwritten extents as well. This eliminates the risk of ENOSPC on subsequent writes and limits fragmentation by fixing the data layout on disk in advance.
Inode Packing into Blocks. The default on-disk inode size is increased to 256 bytes, which leaves room inside the inode for extended attributes. With the optional inline_data feature, the contents of very small files are stored in the inode as well: about 60 bytes fit in the area normally used for block pointers, and more can spill into the in-inode extended attribute space. Inline data removes the extra access to a data block when such files are read.
Accelerated Checking via Unused Inode Tracking. Group descriptors record how many inodes at the end of each group's inode table have never been used, and flags mark bitmaps and inode tables that have not been initialized (the uninit_bg feature). During a consistency check, e2fsck skips these regions without reading them, which can shorten check times substantially on large, sparsely populated volumes.
Metadata Checksums. With the metadata_csum feature, EXT4 computes crc32c checksums for the superblock, group descriptors, extent tree blocks, directory blocks, and other metadata; the journal carries its own checksums. When a checksum mismatch is detected, the structure is treated as corrupted and an error is reported, so bit errors are caught before the damaged metadata is acted on.
Multi-Block Allocator. Instead of allocating blocks one at a time, the mballoc allocator serves a request for many blocks in a single call and preallocates space for files that are still growing. It finds contiguous free regions using per-group buddy bitmaps built from the on-disk block bitmaps, which speeds up the search and minimizes disk seeks during sequential I/O.
Journal Checksum Support. JBD2 can protect journal blocks with checksums; the current checksum format is version 3. During recovery, a transaction whose blocks fail verification is not replayed, which protects against silently corrupted or lost journal writes and keeps crash recovery reliable even on degrading media.
HTree Directory Format. Instead of a linear list of file names, large directories use an HTree: a tree indexed by a hash of each name, similar to a B-tree of fixed depth. The depth is limited to two levels (three with the largedir feature), so a lookup touches only a few blocks even in directories with millions of entries, which removes the bottleneck of linear directory scans.
Flexible Block Groups. The flex_bg feature combines several adjacent block groups into one logical group. The bitmaps and inode tables of the member groups are placed together, which lets the allocator make placement decisions across the whole logical group and reduces latency when crossing regular group boundaries.
Extended Attributes in Separate Blocks. If extended attributes do not fit into the free area of the inode, EXT4 places them in a separate block referenced from the inode. With the ea_inode feature, a large attribute value can instead be stored in a dedicated inode, which supports values well beyond the usual limit of one file system block.
Metadata Images via e2image. The e2image utility saves the critical metadata of a file system to a file without copying user data. Such an image is small and quick to produce, and it helps with diagnostics and with restoring the structure of a damaged file system; it is not a snapshot of file contents.
Bad Inode Marking. On a critical inode read error, the kernel reports the problem (the EXT4_ERROR_INODE macro serves this purpose) and marks the in-memory inode as bad, preventing further use of the damaged structure. The file becomes inaccessible, but the integrity of the rest of the file system is preserved; whether the file system continues, is remounted read-only, or panics is governed by the errors= mount option.
Quotas via Hidden Files. With the quota feature, EXT4 stores limits and usage counters in reserved inodes that are not visible in the directory tree. Abandoning the external aquota.user and aquota.group files makes quota data part of the file system's own journaled metadata, so it is updated in kernel space and stays consistent after a crash.
I/O Barriers and Cache Flushes. Before writing a journal commit record, EXT4 issues a device cache flush request (historically called a write barrier). In modern kernels this is expressed in the block layer with the REQ_PREFLUSH and REQ_FUA flags, which guarantees that the journal blocks reach stable media before the commit record and prevents journal corruption on power loss.
Dynamic Extent Insertion. When there is insufficient space to add a new range into an index block, a split operation is performed. The algorithm recursively divides an extent tree node into two, preserving balance. This eliminates the need to fully rebuild the block map when large file fragmentation increases.
Inode-Level Encryption (fscrypt). EXT4 supports transparent encryption of directory trees, where each file gets its own key derived from a master key supplied by the user. Only file contents and file names are encrypted, while other metadata (size, timestamps, permissions) remains readable without the key.
Preallocation Before I/O. The kernel-internal EXT4_GET_BLOCKS_PRE_IO flag triggers block reservation before the actual write. This allows applications such as database management systems to rely on physical space being available for their logs, avoiding allocation delays at the critical moment of flushing a transaction to disk.
Nanosecond Timestamp Handling. The i_mtime_extra field and its counterparts in the inode hold 2 additional high-order bits of the seconds value and a 30-bit nanosecond field. This extends the date range to the year 2446 and provides timestamp resolution sufficient for synchronization in distributed systems with high data modification frequency.
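The timestamp split described in the last entry can be checked with a short calculation. The sketch assumes that the two epoch bits sit in the low bits of the extra field and the nanoseconds in the upper 30 bits, and that the base seconds field is a signed 32-bit value; decode_timestamp and to_datetime are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH_BITS = 2                          # extra high-order bits for the seconds value
EPOCH_MASK = (1 << EPOCH_BITS) - 1


def decode_timestamp(seconds32: int, extra: int) -> tuple[int, int]:
    """Return (seconds since 1970, nanoseconds) from the two inode fields."""
    # Interpret the base field as a signed 32-bit integer.
    if seconds32 >= 1 << 31:
        seconds32 -= 1 << 32
    seconds = seconds32 + ((extra & EPOCH_MASK) << 32)   # assumed bit layout
    nanoseconds = extra >> EPOCH_BITS
    return seconds, nanoseconds


def to_datetime(seconds: int) -> datetime:
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=seconds)


# Largest representable value: all epoch bits set, base field at its maximum.
latest_seconds, _ = decode_timestamp(0x7FFFFFFF, 0b11)
print(to_datetime(latest_seconds).year)

# Half a second past the first rollover of the 32-bit seconds counter.
secs, nsec = decode_timestamp(0, (500_000_000 << EPOCH_BITS) | 0b01)
print(secs, nsec)
```

With 34 bits of seconds in total, the last representable moment falls in the year 2446, matching the figure given above.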

Comparisons

EXT4 vs XFS (Space Management and Allocation Latency). EXT4 uses delayed block allocation to reduce fragmentation and improve write performance by coalescing small writes before flushing them to disk. XFS implements a similar mechanism but preallocates space more aggressively to prevent fragmentation under parallel I/O streams. XFS tends to manage extents more precisely under peak loads, while with EXT4 the blocks for buffered data are chosen only at flush time, so the space actually allocated on disk can lag behind what applications have written.
EXT4 vs Btrfs (Data Integrity Control). In its standard configuration, EXT4 relies on checksums only for metadata (the metadata_csum feature), leaving user data without software-level verification. Btrfs, being a copy-on-write system, computes checksums for both metadata and the data itself, which allows bit rot to be detected. The comparison favors Btrfs, which provides end-to-end integrity checking, while EXT4 leaves error detection for file contents to the underlying hardware.
EXT4 vs F2FS (Flash Storage Optimization). EXT4 supports TRIM and works well with solid-state drives, but its architecture with a journal and fixed inode tables was originally designed for magnetic disks and can cause extra writes. F2FS uses a log-structured approach with an adaptive garbage collection algorithm and awareness of NAND geometry. F2FS is designed to minimize write amplification and latency on flash, whereas EXT4 is neutral to storage type, which can cost some flash endurance.
EXT4 vs ZFS (Snapshot and Redundancy Operation). EXT4 lacks a built-in mechanism for instantaneous snapshots; creating them requires external tools at the LVM level, which are slower and require reserved space. ZFS integrates the file system and the volume manager, allowing instant, lightweight snapshots and clones without significant performance degradation. Functionally, ZFS offers native integrity checking and deduplication, while EXT4 remains a direct block management system without built-in data versioning.
EXT4 vs NTFS (Permissions and Compatibility). The classic EXT4 permission system is based on the POSIX model (owner/group/others with SUID/SGID bits). The file system is native to Linux and requires third-party drivers on Windows. NTFS uses access control lists (ACLs) with extended rights and inheritance, integrated with Active Directory. For cross-platform compatibility, NTFS wins in heterogeneous Windows networks, but EXT4 provides higher performance and lower CPU load in Unix-like environments thanks to its simpler access control structure.

OS and driver support

Ext4 is the native file system of the Linux kernel, and the in-kernel driver is the reference implementation. It has been supported as a stable file system since kernel 2.6.28, with further stabilization in the releases that followed. The ext4 driver can be built as a kernel module, and it provides backward compatibility by mounting ext2 and ext3 partitions directly as ext4 without conversion. For Windows there are the ext4fsd driver and the commercial Paragon ExtFS, which implement support through a layer between the NT API and ext4 structures. On macOS the FUSE module ext4fuse offers read-only support, and on FreeBSD the ext2fs driver was extended with basic ext4 support, with limitations around extents and journaling. Cross-platform compatibility can be improved by disabling newer file system features (such as flex_bg or 64bit) when creating the file system.

Security

Ext4 security mechanisms are implemented at the VFS and extended attribute level. Standard POSIX permissions (owner, group, and others, each with read, write, and execute bits) are stored in the inode. Access control lists (ACLs) are stored in the system.posix_acl_access extended attribute, inside the inode or in a separate block, and are cached in memory to reduce overhead during permission checks. SELinux mandatory access control is supported by storing security labels in the security.selinux extended attribute, again in the inode or in a separate block when the label does not fit. Metadata and journal integrity is protected by checksums, including those in the group descriptors and the superblock. Ext4 also supports encryption of individual directories via the fscrypt API: master keys are held in the kernel keyring, a unique key is derived for each encrypted file, and by default file contents are encrypted with AES-256-XTS and file names with AES-256-CTS-CBC.
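The POSIX permission bits kept in the inode can be inspected with Python's standard stat module. The snippet is a small illustration of how the mode field is read, not ext4-specific code; describe_mode is a hypothetical helper.

```python
import stat


def describe_mode(mode: int) -> dict:
    """Break an inode mode value into a few human-readable facts."""
    return {
        "text": stat.filemode(mode),                  # ls-style string
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
        "owner_can_write": bool(mode & stat.S_IWUSR),
        "others_can_write": bool(mode & stat.S_IWOTH),
    }


# A regular file with mode 4755: setuid, writable only by its owner.
info = describe_mode(stat.S_IFREG | 0o4755)
print(info["text"])
print(info["setuid"], info["others_can_write"])
```

ACLs and SELinux labels extend this basic check; they live in extended attributes rather than in the mode field.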

Journaling

Journaling in ext4 is implemented at the block device level through the JBD2 (Journaling Block Device 2) subsystem. All metadata changes are committed atomically to a circular journal before they are written to the main file system structures. The journal is physically located either inside the file system, as a hidden inode, or on a separate device. JBD2 operates in three modes. In ordered mode (the default), only metadata is written to the journal, and data is flushed to disk before the metadata transaction commits. In writeback mode, metadata is journaled without a strict data flush order. In journal mode, all data and metadata are journaled, so everything is written twice. Transaction atomicity is guaranteed through compound block descriptors and a two-phase commit: first all transaction blocks are written to the journal area, then a commit record with a checksum is written. After that a checkpoint runs in the background, transferring the changes from the journal to the main structures and then marking the journal space as free.

Limitations

Fundamental limitations of ext4 include a maximum volume size of 1 exbibyte (2^60 bytes) with a 4 KiB block size, made possible by 48-bit physical block addressing in extents instead of the 32-bit block numbers of ext3. The maximum file size is 16 tebibytes with 4 KiB blocks, because logical block numbers within a file are 32 bits wide (2^32 blocks); an inode holds up to four extents directly and addresses the rest through an index tree up to 5 levels deep. The number of subdirectories within a single directory is limited to about 65,000 (64,998) by the link count of the parent directory. The dir_nlink feature lifts this limit, and the dir_index feature with HTree hash trees keeps lookups efficient, although performance begins to degrade with millions of entries. The number of files is limited by the number of inodes, which is fixed at file system creation via the bytes-per-inode ratio: the default value of 16,384 bytes per inode yields roughly 61 million inodes per terabyte. The hard limit is 2^32 inodes per volume, because inode numbers are 32 bits wide.
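The figures in this section follow from simple arithmetic, which the sketch below reproduces. It assumes a 4 KiB block size and the default ratio of 16,384 bytes per inode; the variable names are chosen for the example only.

```python
BLOCK_SIZE = 4096                       # bytes, assumed block size

max_volume_bytes = 2**48 * BLOCK_SIZE   # 48-bit physical block numbers
max_file_bytes = 2**32 * BLOCK_SIZE     # 2^32 blocks per file

BYTES_PER_INODE = 16_384                # default bytes-per-inode ratio
inodes_per_terabyte = 10**12 // BYTES_PER_INODE   # decimal terabyte
inodes_per_tebibyte = 2**40 // BYTES_PER_INODE    # binary tebibyte

print(max_volume_bytes == 2**60)        # volume limit expressed as a power of two
print(max_file_bytes // 2**40)          # file size limit in TiB
print(inodes_per_terabyte)
print(inodes_per_tebibyte)
```

The "61 million" figure applies to a decimal terabyte; for a binary tebibyte the same ratio gives about 67 million inodes.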

History and development

The development of ext4 began in 2006 as a fork of ext3, initiated by Theodore Ts’o with the goal of overcoming the fundamental limitations of its predecessor while maintaining backward compatibility. The first key change was the introduction of extents, contiguous block ranges addressed by a pair (starting block, length), which replaced indirect addressing to reduce fragmentation and speed up operations on large files. In kernel 2.6.28 (December 2008), ext4 received stable status. By then it included delayed block allocation through the allocate-on-flush mechanism, which coalesces small requests into large extents and minimizes fragmentation, the multi-block allocator, and persistent preallocation via fallocate without zero-filling. Subsequent milestones included built-in support for metadata checksums (2012), file system-level encryption via fscrypt (2015), and the bigalloc project for allocating clusters instead of blocks (experimental). Adaptation to new devices is ongoing, with support for SMR drives and optimization for NVMe storage through improved parallel processing of block groups.
