QDR SRAM (Quad Data Rate static memory)

QDR (Quad Data Rate) SRAM is static memory with separate ports for reading and writing, each capable of transferring data twice per clock cycle. Unlike conventional memory with a shared data bus, it allows the processor to read old data and write new data simultaneously, eliminating bus turnaround idle time and doubling bandwidth without increasing frequency.

The technology is in demand in high-speed network devices: backbone routers, core network switches, and line cards. It is used in routing tables where instant address lookup is required, as well as in packet buffers and traffic management systems. The memory is indispensable in telecommunications equipment, measurement instruments, and FPGA accelerators where deterministic latency and streaming data processing without access mode switch pauses are required.

Typical problems

The main drawback is the high cost per megabyte of memory and significant heat dissipation due to the operation of two independent I/O ports at maximum frequencies. Separate buses require twice the number of chip pins compared to DDR, which complicates printed circuit board topology. Difficulties arise with impedance matching and trace length equalization for clock lines. With incorrect board layout or insufficient cooling, clock slips and signal integrity violations may occur, leading to bit errors in routing tables that are not obvious at first glance.

How QDR SRAM works

QDR SRAM is based on an architecture with two fully independent ports: a unidirectional port for write operations and a unidirectional port for read operations. Each port is controlled by its own clock signal, which is the key difference from traditional memory with a shared bidirectional data bus. Clock signals are supplied from an external source, and data transfer occurs on both the rising and falling edges of each clock pulse using Double Data Rate technology. This means that each data line carries two bits per clock period, and with both ports active the device completes four data transfers per clock, which is where the name Quad Data Rate comes from.

Let us examine the timing in detail. The clock source produces complementary clock pairs, usually designated K and K# for the write port and C and C# for the read port; the two signals of each pair are 180 degrees out of phase. At the beginning of a write cycle, the memory controller presents the address and data to the corresponding chip inputs. The write control signal latches the address on the rising edge of the K clock, and the chip registers capture data sequentially on the rising and falling edges of the same clock. In parallel and independently of the write operation, the read port performs its task: the controller presents the address to the read bus, and on the edges of the C clock, data from the memory cell array appears on the output lines.
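
The two-beats-per-clock transfer described above can be illustrated with a minimal Python sketch. It assumes a 36-bit port and assumes that the low half of a 72-bit value travels on the first edge; the function names and the beat ordering are illustrative choices, not taken from any datasheet.

```python
def split_into_beats(word: int, port_width: int = 36) -> list[tuple[str, int]]:
    """Split a 2*port_width-bit value into two beats, one per clock edge."""
    if not 0 <= word < (1 << (2 * port_width)):
        raise ValueError("word does not fit in two beats")
    mask = (1 << port_width) - 1
    # Assumption: the low half goes out on the rising edge of K, the high
    # half on the falling edge of K (that is, the rising edge of K#).
    return [("K rising", word & mask), ("K falling", (word >> port_width) & mask)]


def join_beats(beats: list[tuple[str, int]], port_width: int = 36) -> int:
    """Reassemble the original value from its beats in arrival order."""
    value = 0
    for index, (_edge, data) in enumerate(beats):
        value |= data << (index * port_width)
    return value


payload = 0x123456789ABCDEF012  # 72-bit example value
beats = split_into_beats(payload)
for edge, data in beats:
    print(f"{edge}: 0x{data:09X}")
```

Each clock period therefore moves one full 72-bit payload across a 36-bit port, which is the doubling effect the text attributes to dual-edge transfer.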

Structurally, an SRAM memory cell is a classic six-transistor flip-flop, but the access paths to it are multiplexed in a special way. The sense amplifier block is connected to the cells through one set of switches, and the write drivers through another. This eliminates conflicts during simultaneous read and write access to different addresses. To handle collisions when read and write addresses coincide in the same cycle, a bypass or prioritization circuit is built into the memory. It guarantees that the data being written is either forwarded directly from the input bus to the output or is already reflected in the array when the read takes place. The resulting device performance is determined by the formula: doubled clock frequency multiplied by the bus width of each port. At a frequency of 500 megahertz and a port width of 36 bits, throughput is 2 × 500 MHz × 36 bits = 36 gigabits per second for reading and the same for writing, giving a total flow of 72 gigabits per second without bus turnaround idle time.
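
As a worked check of the formula above (doubled clock frequency multiplied by the port width), the following Python sketch evaluates it for a 500 MHz clock and a 36-bit port. The function name is illustrative.

```python
def port_throughput_gbps(clock_mhz: float, port_width_bits: int) -> float:
    """Per-port throughput: two transfers per clock period on every data line."""
    transfers_per_second = 2 * clock_mhz * 1e6
    return transfers_per_second * port_width_bits / 1e9


per_port = port_throughput_gbps(500, 36)
total = 2 * per_port  # read port plus write port, both running concurrently
print(f"per port: {per_port:.0f} Gb/s, both ports: {total:.0f} Gb/s")
```

The result is 36 Gb/s per port and 72 Gb/s for the read and write ports combined.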

QDR SRAM functionality

  1. Separate-port architecture. QDR SRAM implements two fully independent I/O ports, one dedicated exclusively to reading and the other only to writing. This separation eliminates bus direction conflicts, removes high-impedance turnaround states, and allows read data to be received while write requests are being sent.
  2. Double data transfer rate. Data transfer in QDR SRAM occurs on both edges of the clock signal, doubling bandwidth without proportionally increasing the core base frequency. Strobing of each half-cycle ensures command and data capture on both the rising and falling transitions of the clock pulse.
  3. Separate clock domains. Input clock signals K and K_n form a differential pair for commands and addressing, while echo clocks CQ and CQ_n are generated by the chip itself synchronously with the outgoing read stream. This mechanism compensates for trace delays and guarantees centering of the data valid window on the memory controller side.
  4. Pipelining of read requests. The address applied to the read port enters a multi-stage pipeline, as a result of which data appears on the output lines after a strictly deterministic number of cycles, called latency. This approach allows masking the cell array access time through continuous streaming processing of new addresses without bus idle time.
  5. Burst and byte write. The write bus is controlled by byte mask signals, allowing selective modification of individual octets within a wide word without performing read-modify-write cycles. The controller activates the required masks simultaneously with the presentation of the data word and address, implementing atomic update of specified bytes.
  6. Synchronous operation control. The capture logic for external signals is strictly tied to the active edges of the input clocks. Addresses, byte masks, and control pins are latched in the same cycle, eliminating asynchronous races and simplifying timing analysis on the FPGA or ASIC side, requiring only compliance with setup and hold times.
  7. Echo clocking for reading. When outputting data, the chip presents a phase-shifted synchronization source that mirrors the internal output path. The controller uses this strobe for direct capture of incoming bits without complex clock recovery circuits and delay adjustment on the line, which is critically important at frequencies above 300 MHz.
  8. Latency configuration. The read delay parameter is fixed by built-in logic or set at the die level and is typically 1.5, 2.0, or 2.5 cycles. The stability of this value is guaranteed by the manufacturer, allowing designers to hard-code the wait pipeline into the control state machine without adaptive delay tracking mechanisms.
  9. Banked decoding organization. The address space is physically divided into several memory banks, each possessing its own row and column decoding logic. The banked structure reduces the length of local lines and load capacitance, supporting high slew rates when reading large-volume arrays.
  10. Power integrity control. The chip input buffers are powered from an isolated Vddq domain, allowing interface levels to be matched to the controller logic independently of the memory core voltage. This separation minimizes noise induced by I/O port transient currents on the sensitive cell amplification and regeneration circuits.
  11. Deep pipeline without breaks. After initial pipeline filling, a read command can be issued on every clock cycle without idle waiting, as the independent write port does not interrupt the output stream. This property allows achieving one hundred percent read bus utilization over extended blocks of sequential table reads.
  12. Electrical impedance matching. QDR SRAM output drivers are calibrated to the transmission line impedance. Matching is achieved by programming the drive strength through pins using a calibration resistor, eliminating the need for external terminators and reducing reflections on high-speed traces while maintaining sharp signal edges.
  13. Deterministic write delay. Unlike asynchronous static memories, the moment of physical data capture into the cell is rigidly tied to the clock grid. The controller knows that information is guaranteed to be latched at the edge following the presentation of address and masks, which simplifies timing window calculations.
  14. Multi-tasking port usage. The read port is used for fetching counters, packet headers, or routing tables, while the write port simultaneously updates statistics or inserts new entries. The physical independence of the paths eliminates arbitration inside the chip, transferring parallelism control exclusively to the external controller.
  15. Dual synchronization strobing. The transmitted read data is accompanied by a differential strobe whose edges are aligned to the center of the data valid window. The receiver captures a word on the first edge and immediately prepares to capture the next word on the opposite transition, implementing a continuous DDR stream.
  16. Elimination of bus turnaround states. Since the ports are unidirectional, there is no need to insert guard intervals for switching buffers from read to write mode. Any operation starts immediately upon command issuance, saving up to one or two cycles on each access type change compared to classic ZBT SRAM.
  17. Thermal and frequency stability. Internal delay circuits and output drivers are designed so that timing relationships are maintained over a wide temperature and voltage range. The read strobe deviation from data remains minimal, ensuring that the timing budget for data reception is not fully consumed by parasitic drift.
  18. Reset and initialization protocol. After power-up, the chip requires forced holding of signals in a specific state for a defined number of cycles to correctly initialize internal pipeline state machines and output impedance calibration, guaranteeing start of operation with predictable queue and array states.
  19. Address space locality. Address interleaving between read banks allows masking row activation delays. While selection occurs in one bank, the controller already addresses the next, maintaining the filling of the output FIFO without reducing the rate, which is important for direct memory access with linear addressing.
  20. Compatibility with external synchronization systems. Echo clocks can be mixed into the receiver PLL loop for dynamic compensation of temperature drift. The controller phases the internal capture register strictly to the center of the eye diagram, based on the incoming strobe rather than the global reference oscillator, increasing the noise immunity margin.
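
Several of the properties listed above (byte-masked writes, a fixed read latency, and one read plus one write per cycle) can be combined in a toy cycle-level model. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, latency is a whole number of cycles rather than 1.5 or 2.5, bytes are 8 bits wide, mask bits are active-high, and a read that coincides with a write to the same address returns the new data.

```python
from collections import deque


class QdrModel:
    """Toy model: at most one read and one write are issued per clock cycle."""

    def __init__(self, depth: int, read_latency: int = 2, bytes_per_word: int = 4):
        self.mem = [0] * depth
        self.bytes_per_word = bytes_per_word
        # One slot per pipeline stage; None means no read is in flight there.
        self.pipe = deque([None] * read_latency, maxlen=read_latency)

    def write(self, addr: int, data: int, byte_mask: int) -> None:
        """Update only the bytes whose mask bit is 1 (bit 0 = lowest byte)."""
        word = self.mem[addr]
        for i in range(self.bytes_per_word):
            if (byte_mask >> i) & 1:
                lane = 0xFF << (8 * i)
                word = (word & ~lane) | (data & lane)
        self.mem[addr] = word

    def cycle(self, read_addr=None, write=None):
        """Advance one clock; return the data of the read issued read_latency cycles ago."""
        if write is not None:
            self.write(*write)  # assumption: the write is visible to a same-cycle read
        ready = self.pipe[0]
        # Sample the array at issue time; the value then rides the pipeline.
        self.pipe.append(None if read_addr is None else self.mem[read_addr])
        return ready


qdr = QdrModel(depth=16, read_latency=2)
qdr.cycle(write=(3, 0xAABBCCDD, 0b1111))  # full-word write
qdr.cycle(write=(3, 0x00000011, 0b0001))  # update the lowest byte only
outputs = [qdr.cycle(read_addr=3), qdr.cycle(read_addr=3), qdr.cycle(), qdr.cycle()]
```

After the pipeline fills, one result leaves the model on every cycle, and the masked write changes a single byte without any read-modify-write sequence on the controller side.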

Comparisons

  • QDR SRAM vs Standard Synchronous SRAM — in standard synchronous SRAM, data transfer and addressing occur only on one clock edge, limiting bandwidth to one word per cycle. QDR SRAM implements separate input and output ports with double data rate on each, allowing simultaneous reading and writing of up to two words per cycle, doubling effective performance compared to the classic architecture.
  • QDR SRAM vs DDR SDRAM — DDR SDRAM uses a multiplexed bidirectional data bus with double data rate, requiring switching between reading and writing, which creates bus turnaround delays and reduces overall efficiency under mixed loads. QDR SRAM, with dedicated unidirectional read and write buses, completely eliminates these pipeline bubbles, providing a continuous transaction stream unattainable for DRAM architectures with shared I/O lines.
  • QDR SRAM vs RLDRAM (Reduced Latency DRAM) — although RLDRAM has improved latency compared to conventional DRAM and supports high frequencies, the internal bank mechanism still introduces deterministic row activation states and the periodic need for cell charge refresh. QDR SRAM functions completely statically, delivering data with a fixed read delay without precharge cycles, guaranteeing random access with constant low latency at clock frequencies above 500 MHz without data stream gaps.
  • QDR SRAM vs QDR-II SRAM — QDR-II is an evolutionary development of QDR technology, introducing doubled data transfer frequency while maintaining the base core clock frequency, achieved through circuitry with equivalent quadrupled rate. Additionally, QDR-II uses a pseudo-HSTL interface and improved data unpacking topology, reducing power consumption per unit of bandwidth and allowing data exchange rates up to 36–40 Gbps per package, doubling the capabilities of the first generation.
  • QDR SRAM vs DDR-II SRAM — DDR-II SRAM evolved from standard synchronous SRAM with a common bidirectional bus, adding two prefetch bits to double the data stream but leaving the fundamental limitation of sequential interleaving of read and write cycles on a single bus. QDR SRAM with separate independent ports allows 100% bandwidth utilization without bus turnaround delays, which is critically important for network processors where table lookup and modification operations occur simultaneously and continuously.
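
The effect of bus turnaround on a mixed load, which several of the comparisons above rely on, can be estimated with a small Python sketch. It assumes every operation occupies one data cycle and every change of direction on a shared bus costs a fixed number of idle cycles; both the cost model and the function names are illustrative simplifications.

```python
def shared_bus_utilization(ops: str, turnaround_cycles: int) -> float:
    """Fraction of cycles carrying data on a single bidirectional bus.

    ops is a string of 'R' and 'W'; each direction change adds idle cycles.
    """
    if not ops:
        return 0.0
    switches = sum(1 for prev, cur in zip(ops, ops[1:]) if prev != cur)
    return len(ops) / (len(ops) + switches * turnaround_cycles)


def separate_port_cycles(ops: str) -> int:
    """Cycles needed when reads and writes travel on independent ports."""
    return max(ops.count("R"), ops.count("W"))


mixed = "RWRWRWRW"
utilization = shared_bus_utilization(mixed, turnaround_cycles=1)
print(f"shared bus utilization: {utilization:.2f}")
print(f"separate ports finish in {separate_port_cycles(mixed)} cycles")
```

For a strictly alternating pattern the shared bus spends almost half its cycles idle in this model, while the separate ports carry the same traffic in half as many cycles as there are operations.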

OS and driver support

QDR SRAM is not exposed through standard file systems. Instead, its physical address range is mapped directly, either through the /dev/mem interface or by specialized kernel modules. Such a module uses functions like ioremap() and request_mem_region() to reserve the physical address range mapped to the QDR controller, and provides a character device with mmap operations so that userspace can access the memory without data copying. The driver configures the FPGA controller (via an AXI interface or PCIe BAR registers) by writing delay values and bus width to configuration registers. It then synchronizes processor cache lines through memory barriers and instructions like _flush_cache_range to guarantee the ordering and visibility of read-write operations in a multi-threaded environment. In Linux systems, the driver is typically designed as a platform driver that receives its parameters through the Device Tree.
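
The userspace side of such a mapping can be sketched in Python with the standard mmap module. The register offsets below are hypothetical, and an ordinary temporary file stands in for the device node so that the sketch runs without hardware. Cache maintenance and memory barriers, which the driver handles in the kernel, are outside its scope.

```python
import mmap
import struct
import tempfile

# Hypothetical register map; a real controller defines its own offsets.
REG_READ_LATENCY = 0x00
REG_BUS_WIDTH = 0x04
WINDOW_SIZE = mmap.PAGESIZE


def write_reg(window: mmap.mmap, offset: int, value: int) -> None:
    """Store a 32-bit little-endian value at a register offset."""
    struct.pack_into("<I", window, offset, value)


def read_reg(window: mmap.mmap, offset: int) -> int:
    """Load a 32-bit little-endian value from a register offset."""
    return struct.unpack_from("<I", window, offset)[0]


# Stand-in for the character device: a one-page ordinary file.
with tempfile.TemporaryFile() as backing:
    backing.truncate(WINDOW_SIZE)
    with mmap.mmap(backing.fileno(), WINDOW_SIZE) as window:
        write_reg(window, REG_READ_LATENCY, 2)
        write_reg(window, REG_BUS_WIDTH, 36)
        latency_readback = read_reg(window, REG_READ_LATENCY)
        width_readback = read_reg(window, REG_BUS_WIDTH)

print(latency_readback, width_readback)
```

With a real device, the temporary file would be replaced by the character device exposed by the driver, and the window size and offsets by the values from the controller documentation.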

Security

QDR SRAM has no built-in cryptography, so its security is ensured at the architectural level through physical address space isolation. A hardware firewall in the FPGA (for example, an AXI Memory Protection Unit) is configured to allow access to the QDR range only to designated bus masters, such as the DMA controller or a specific processor core, and blocks transactions from any other identifier. During initialization, the kernel driver programs the MPU registers through a protected configuration port and sets the privileged-access-only flag. It also marks the region with Device-nGnRE or Strongly-Ordered memory attributes, which prohibits speculative reads by the processor outside the allocated buffer and prevents leaks through prefetch mechanisms. On the application software side, separation is implemented through access rights on the character device and udev group policies, while error injection is detected by comparing packet checksums computed by a hardware ECC accelerator.
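
The firewall decision described above can be modeled in a few lines of Python. The rule layout, field names, base address, and master IDs are hypothetical and serve only to show the check itself: address window, permitted bus master, and privilege level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    base: int                   # first byte of the protected QDR window
    size: int                   # window length in bytes
    allowed_masters: frozenset  # bus master IDs permitted to access it
    privileged_only: bool = True


def is_allowed(rule: FirewallRule, master_id: int, addr: int,
               length: int, privileged: bool) -> bool:
    """Return True only if the whole transaction passes the rule."""
    inside = rule.base <= addr and addr + length <= rule.base + rule.size
    if not inside:
        return False  # transactions that spill outside the window are rejected
    if master_id not in rule.allowed_masters:
        return False
    return privileged or not rule.privileged_only


# Hypothetical setup: a 1 MiB window, open to masters 1 (DMA) and 5 (one core).
rule = FirewallRule(base=0x4000_0000, size=0x10_0000, allowed_masters=frozenset({1, 5}))
print(is_allowed(rule, master_id=1, addr=0x4000_0000, length=64, privileged=True))
print(is_allowed(rule, master_id=2, addr=0x4000_0000, length=64, privileged=True))
```

In hardware the same comparison is made per transaction by the protection unit; the driver only loads the rule into its registers.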

Logging

Logging uses a two-level architecture. At the hardware level, the QDR controller contains event counters (ECC corrections, timeouts, protocol errors), implemented as clear-on-read registers in the controller address space. When a critical event occurs (an uncorrectable error), the controller raises a hardware interrupt to the GIC, and the driver calls a routine that reads the counter contents through 32-bit status registers, which resets them in the same access. At the software level, the driver allocates a ring buffer in non-cacheable memory, where the ISR handler writes an event trace structure with an error mask and a timestamp from a hardware cycle counter such as the TSC. A separate kthread reads this buffer and exports it through a relay interface to sysfs or debugfs, where user utilities like trace-cmd can consume it, and writes to the system log via printk_ratelimited at the KERN_WARNING level. I/O operation logging is optionally extended with ftrace hooks at the call points of read/write functions for building timing diagrams.
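
The two levels described above, clear-on-read counters and a ring buffer filled from the interrupt handler, can be sketched as a plain Python model. The class names and the error bit assignment are hypothetical, and a monotonic software clock stands in for the hardware timestamp counter.

```python
import time
from collections import deque


class ClearOnReadCounter:
    """32-bit counter that returns its value and resets to zero in one step."""

    def __init__(self) -> None:
        self._value = 0

    def bump(self, amount: int = 1) -> None:
        self._value = (self._value + amount) & 0xFFFFFFFF

    def read(self) -> int:
        value, self._value = self._value, 0
        return value


class EventLog:
    """Fixed-size ring buffer; the oldest record is overwritten when full."""

    def __init__(self, capacity: int) -> None:
        self._ring = deque(maxlen=capacity)

    def record(self, error_mask: int, count: int) -> None:
        self._ring.append((time.monotonic_ns(), error_mask, count))

    def drain(self) -> list:
        records = list(self._ring)
        self._ring.clear()
        return records


ERR_ECC_UNCORRECTABLE = 0x1  # hypothetical bit assignment
ecc_errors = ClearOnReadCounter()
log = EventLog(capacity=4)


def on_interrupt() -> None:
    """Stand-in for the ISR: snapshot the counter and log one trace record."""
    log.record(ERR_ECC_UNCORRECTABLE, ecc_errors.read())


ecc_errors.bump(3)
on_interrupt()
records = log.drain()  # what the consumer thread would export
```

Because the counter is cleared by the read itself, no events are counted twice, and the bounded ring keeps the handler's memory use fixed even if the consumer falls behind.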

Limitations

QDR SRAM limitations follow from the separate read and write buses with independent clocking: unless the device provides address-match bypass, write and read transactions that address the same word in the same cycle risk returning undefined data. Therefore, the driver must either implement software locking through compare-and-swap commands with cyclic checking, or segment the memory into non-overlapping TX/RX buffers, eliminating races at the protocol level. The maximum throughput is strictly limited by the doubled I/O frequency and the 18/36-bit bus width, and no out-of-order burst reordering is supported because there are no command queues. Requests therefore complete strictly in issue order, and an operation that depends on the result of a previous one incurs a round-trip latency of about 5–7 cycles. Capacity is limited to tens of megabits by the physical size of the six-transistor cell, which makes QDR unusable as main RAM and restricts its application area to statistics counters, routing tables, and search buffers.
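
The segmentation approach mentioned above, splitting the memory into non-overlapping TX and RX buffers, reduces to a simple range check. The Python sketch below assumes regions are expressed as (base, length) pairs in words; the function names and the 1024-word depth are illustrative.

```python
def regions_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Each region is (base, length) in words; adjacent regions do not overlap."""
    a_base, a_len = a
    b_base, b_len = b
    return a_base < b_base + b_len and b_base < a_base + a_len


def split_tx_rx(depth_words: int, tx_words: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Carve the address space into a TX region followed by an RX region."""
    if not 0 < tx_words < depth_words:
        raise ValueError("tx_words must leave room for both regions")
    tx = (0, tx_words)
    rx = (tx_words, depth_words - tx_words)
    return tx, rx


tx_region, rx_region = split_tx_rx(depth_words=1024, tx_words=256)
print(tx_region, rx_region, regions_overlap(tx_region, rx_region))
```

If the write side only ever touches one region and the read side only the other, a read and a write can never target the same word in the same cycle, so no locking is needed.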

History and development

The development of QDR SRAM began with the QDR SRAM Consortium, founded by Cypress, IDT, Micron, and others to standardize memory with separate I/O ports. The first generation, QDR-I, implemented double data rate transfer using only the rising edge of each of the two complementary clocks, K and K#. The transition to QDR-II introduced differential echo strobes (CQ and CQ#) and doubled transfer speed with source-synchronous clocking, allowing the FPGA controller to automatically calibrate data capture through dynamic IDELAY delay lines based on a training sequence during initialization. The next evolution, QDR-II+, added built-in on-die termination, which reduced signal reflections without external resistors, and configurable latency programmed by loading a Mode Register through special cycles on the address bus. QDR-IV expanded the architecture to two independent bidirectional ports (Banked QDR), where each bank has its own pair of buses, and the controller in the FPGA implements bank interleaving to hide latency. The most modern controllers have moved to SerDes transceivers and abstract the signal level behind a MAC layer. They keep backward compatibility with the existing software stack through the same driver API, hiding generation differences behind timing configuration functions.