Review

Reconfigurable SmartNICs: A Comprehensive Review of FPGA Shells and Heterogeneous Offloading Architectures

by Andrei-Alexandru Ulmămei and Călin Bîră *
Department of Electronic Devices, Circuits and Architectures, Faculty of Electronics, Telecommunication and Information Technology, National University of Science and Technology POLITEHNICA Bucharest, 060042 Bucharest, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1476; https://doi.org/10.3390/app16031476
Submission received: 31 December 2025 / Revised: 21 January 2026 / Accepted: 25 January 2026 / Published: 1 February 2026
(This article belongs to the Special Issue Recent Applications of Field-Programmable Gate Arrays (FPGAs))

Abstract

Smart Network Interface Cards (SmartNICs) represent a paradigm shift in system architecture by offloading packet processing and selected application logic from the host CPU to the network interface itself. This architectural evolution reduces end-to-end latency toward the physical limits of Ethernet while simultaneously decreasing CPU and memory bandwidth utilization. The current ecosystem comprises three principal categories of devices: (i) conventional fixed-function NICs augmented with limited offload capabilities; (ii) ASIC-based Data Processing Units (DPUs) that integrate multi-core processors and dedicated protocol accelerators; and (iii) FPGA-based SmartNIC shells—reconfigurable hardware frameworks that provide PCIe connectivity, DMA engines, Ethernet MAC interfaces, and control firmware, while exposing programmable logic regions for user-defined accelerators. This article provides a comparative survey of representative platforms from each category, with particular emphasis on open-source FPGA shells. It examines their architectural capabilities, programmability models, reconfiguration mechanisms, and support for GPU-centric peer-to-peer datapaths. Furthermore, it investigates the associated software stack, encompassing kernel drivers, user-space libraries, and control APIs. This study concludes by outlining open research challenges and future directions in RDMA-oriented data preprocessing and heterogeneous SmartNIC acceleration.

1. Introduction

For more than two decades, the historical expectation that faster general-purpose processors would automatically absorb growing software and data demands has weakened. Single-thread CPU performance has become increasingly constrained by power density, memory latency, and diminishing returns from deeper speculation and pipelines; as a result, much of the performance progress has shifted toward parallelism and domain-specific acceleration [1].
At the same time, two external trends have accelerated. First, GPU throughput continues to grow rapidly, driven by wider SIMD-style execution, specialized tensor units, and ever-higher memory bandwidth, making GPUs the dominant engine for AI training and increasingly for data analytics. Second, data center network line rates have advanced from 10–100 Gb/s to 200–400 Gb/s and are now moving toward 800 Gb/s-class links, so each server can ingest or emit tens of megabytes every millisecond. This combination widens a data–compute gap: the network and accelerator fabric can deliver data faster than the host CPU can steer, validate, copy, and transform it at low latency.

1.1. The Data–Compute Gap

Modern microservices, storage systems, and distributed training pipelines frequently push 100 Gb/s or more per node. At 400 Gb/s, a NIC can emit roughly 50 MB every millisecond; if each packet must traverse a general-purpose networking stack, be parsed, copied between buffers, and validated in software, the aggregate CPU cost becomes significant. Azure’s experience with large-scale services shows that even “small” per-packet work (steering, checksumming, framing, encryption hooks, and queue management) can consume a disproportionate share of host cores and memory bandwidth, motivating hardware support closer to the wire [2].
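The arithmetic behind these figures can be checked with a short sketch. It assumes standard Ethernet framing overhead of 20 bytes per frame on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function names are illustrative and not taken from any cited work.

```python
# Per-frame wire overhead in bytes: preamble (7) + SFD (1) + inter-frame gap (12).
ETH_OVERHEAD_BYTES = 20


def bytes_per_millisecond(line_rate_gbps: float) -> float:
    """Raw bytes carried by the link in one millisecond."""
    return line_rate_gbps * 1e9 / 8 / 1e3


def packets_per_second(line_rate_gbps: float, frame_bytes: int) -> float:
    """Maximum frame rate for back-to-back frames of a given size."""
    wire_bits = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return line_rate_gbps * 1e9 / wire_bits


def per_packet_budget_ns(line_rate_gbps: float, frame_bytes: int) -> float:
    """Time available to handle one frame before the next one arrives."""
    return 1e9 / packets_per_second(line_rate_gbps, frame_bytes)


if __name__ == "__main__":
    for rate in (100, 400):
        print(f"{rate} Gb/s: {bytes_per_millisecond(rate) / 1e6:.1f} MB per ms")
        for frame in (64, 1518):
            mpps = packets_per_second(rate, frame) / 1e6
            budget = per_packet_budget_ns(rate, frame)
            print(f"  {frame:>4}-byte frames: {mpps:7.2f} Mpps, {budget:6.2f} ns per packet")
```

Under these assumptions, minimum-size frames at 400 Gb/s leave well under two nanoseconds per packet, which is less than the cost of a single cache miss on a host CPU.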
Kernel-bypass I/O frameworks and in-kernel fast paths (e.g., DPDK; XDP/AF_XDP) reduce overhead by shortening or bypassing the OS datapath, but they still rely on the host CPU to execute protocol logic and application-specific transforms [3]. As line rates continue to rise, the remaining gap is less about raw PCIe bandwidth and more about where computation happens: moving data to the CPU (and often to the GPU) and back is increasingly expensive in latency, energy, and contention.

1.2. SmartNICs as a Solution for the Data–Compute Gap

SmartNICs embed programmable or fixed-function accelerators directly on the network interface, enabling computation to run on the datapath before packets reach host memory. Current systems span a spectrum. At one end, ASIC SmartNICs (often branded as DPUs) integrate hardened protocol engines (e.g., TLS/IPsec, NVMe-oF, and RDMA support) alongside clusters of embedded cores; their energy efficiency and software maturity are strong, but their capabilities are largely fixed at tape-out and evolve on multi-year silicon cycles [4].
At the other end, FPGA SmartNICs provide a reconfigurable substrate that can adapt as protocols, security primitives, and ML operators evolve. Most FPGA SmartNICs are organized around a static shell that encapsulates Ethernet MACs, PCIe/DMA engines, queue management, and control interfaces, while exposing one or more partial-reconfiguration regions (PRRs) where user-defined accelerators can be swapped in at run time. This model has been used to build cloud-oriented SmartNICs that balance performance and manageability [5] and to prototype isolated, multi-tenant, P4-programmable SmartNICs with reconfiguration support [6]. More broadly, programmable data planes and in-network computing have matured into a rich design space, with SmartNICs forming a practical point in the spectrum between in-switch processing and host-centric acceleration [7].

1.3. Why Focus on Shells?

The shell is the static component that determines I/O bandwidth, DMA behavior, queue layout, SR-IOV support, clock domains, and reconfiguration latency. A well-architected shell primarily matters because it fixes the “physics” of what the NIC can do: it sets the baseline datapath latency, determines whether the line rate can be sustained under realistic traffic patterns, and constrains how easily operators can deploy new functions. First, shells that implement parsing, match–action logic, RDMA engines, and on-card buffering can eliminate PCIe round trips for common datapath work, reducing tail latency while sustaining at least 100 Gb/s line rate. Second, shells that support dynamic partial reconfiguration enable composability: operators can swap a TLS offload for an ML aggregation kernel without draining the host or replacing hardware. Third, shells increasingly act as heterogeneous bridges, exposing low-level data movement (e.g., verbs-like RDMA pipelines) and peer-to-peer PCIe paths that allow direct GPU–FPGA transfers, which is essential for in-network machine learning and GPU-centric distributed workloads.

1.4. Contribution of This Article

This study provides a comparative analysis of six representative, state-of-the-art FPGA SmartNIC shells—Coyote, OpenNIC, RecoNIC, ClickNP, Corundum, and FpgaNIC—and places them in context against conventional NICs and DPU-class devices. Rather than treating all shells as interchangeable integration templates, we highlight the distinct design philosophies that shape each platform (e.g., research-oriented rapid prototyping vs. production-leaning open-source datapaths; host-centric acceleration vs. network-centric programmability), and we connect these choices to practical use cases and fundamental constraints (notably reconfiguration granularity and isolation, host/PCIe bottlenecks, memory-system placement, and the limits imposed by timing closure and tool-flow complexity). We then examine how architectural decisions—especially static monolithic pipelines versus partial-reconfigurable regions—affect programmability, latency, and upgrade cadence, and we discuss emerging GPU–FPGA offload paths that minimize or remove CPU involvement from the data plane for low-latency streaming workloads. Finally, we synthesize techniques for performing stream-oriented preprocessing inside the NIC before RDMA (e.g., filtering, aggregation, lightweight transforms, and protocol-aware adaptations), thereby reducing GPU-side overhead and improving end-to-end pipeline efficiency. By systematically mapping strengths, gaps, and recurring trade-offs across today’s shells and device classes, we motivate the requirements for a next-generation SmartNIC architecture—one that pairs high internal bandwidth with modular reconfiguration and a unified development flow—pushing programmability, latency, and protocol agility beyond the current state of the art.
From NIC evolution to SmartNIC implementations. Modern NICs have progressively absorbed “near-host” functionality that reduces per-packet CPU overhead and enables parallel receive/transmit: multi-queue designs, receive-side scaling (RSS), checksum and segmentation offloads, and virtualization primitives such as SR-IOV. In parallel, RDMA-capable NICs pushed the envelope further by enabling kernel-bypass and zero-copy data movement, shifting parts of the communication stack into the interface and exposing new architectural constraints around queueing, memory registration, and transport reliability. These trends motivate SmartNICs as the next step: instead of isolated fixed offloads, SmartNICs make packet handling and data movement programmable under tight latency and bandwidth budgets [8,9,10].
SmartNICs in practice. Contemporary SmartNIC implementations span a spectrum from DPUs (SoC-based cards with embedded CPU complexes and fixed-function accelerators) to FPGA-based SmartNICs that expose customizable datapaths and, in many systems, partial reconfiguration. This distinction matters for developers: DPUs typically favor mature software ecosystems and predictable functionality, while FPGA SmartNICs favor datapath specialization and fast iteration on new parsing, scheduling, and preprocessing logic—at the cost of stricter timing/floorplanning constraints and higher hardware-design effort. Recent systems work also shows that obtaining consistent offload speedups depends on accurately matching a workload’s compute/memory behavior to a given SmartNIC’s micro-architecture and memory system, motivating tooling and methodology rather than ad hoc porting [11].
AI models at the network edge. A growing subset of SmartNIC use cases involves machine-learning-assisted telemetry, anomaly detection, flow inference, and policy decisions. Because SmartNIC datapaths operate under line-rate constraints, the models that appear in practice are usually compact and quantization-friendly (e.g., linear models, small MLPs/regressors, or carefully structured fixed-point inference) and are integrated in ways that preserve deterministic per-packet latency. We therefore treat artificial intelligence in the NIC as a systems-design question—how models are represented, updated, and executed within the datapath and memory hierarchy—rather than as a pure accuracy benchmark [12,13].
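As an illustration of why compact fixed-point models suit line-rate datapaths, the following sketch evaluates a linear classifier using only integer multiply-accumulate operations. The Q8 format, the three features, and the weights are hypothetical choices made for this example; they are not drawn from the cited systems.

```python
FRAC_BITS = 8  # Q8 fixed point: real value = integer / 2**8 (assumed format)


def quantize(x: float) -> int:
    """Convert a real value to its Q8 integer representation."""
    return int(round(x * (1 << FRAC_BITS)))


def linear_score_fixed(features_q, weights_q, bias_q) -> int:
    """Integer-only dot product plus bias; inputs and result are in Q8."""
    acc = bias_q << FRAC_BITS          # align the bias with the Q16 products
    for f, w in zip(features_q, weights_q):
        acc += f * w                   # Q8 * Q8 = Q16 accumulate
    return acc >> FRAC_BITS            # rescale the accumulator back to Q8


def classify(features, weights_q, bias_q, threshold_q: int = 0) -> bool:
    """Flag a flow when its fixed-point score exceeds the threshold."""
    features_q = [quantize(f) for f in features]
    return linear_score_fixed(features_q, weights_q, bias_q) > threshold_q


if __name__ == "__main__":
    # Hypothetical model over three per-flow features.
    weights_q = [quantize(w) for w in (0.5, -0.25, 1.0)]
    bias_q = quantize(-1.0)
    score_q = linear_score_fixed(
        [quantize(v) for v in (2.0, 4.0, 1.5)], weights_q, bias_q
    )
    print(score_q / (1 << FRAC_BITS))  # floating-point view of the Q8 score
```

Each feature costs one multiplier and one adder, and the loop body has no data-dependent branches, so an equivalent hardware pipeline would have a fixed latency per packet.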
Novelty and organization. Unlike prior work that focuses on a single SmartNIC class, this article connects three perspectives—(i) architectural trade-offs across FPGA shells and DPU-class designs, (ii) CPU-bypass GPU–FPGA offload paths, and (iii) in-NIC preprocessing that reshapes data before RDMA—highlighting where current platforms constrain programmability, latency, and protocol agility. The remainder of this paper is organized as follows: Section 2 summarizes the evolution and baseline architecture of traditional NICs; Section 3 reviews DPU-style SmartNICs; Section 4 compares FPGA-based SmartNIC shells and their design trade-offs; Section 5 discusses GPU–FPGA direct offload; Section 6 surveys in-NIC preprocessing for RDMA pipelines; Section 7 analyzes the software stack and programming layers; and Section 8 concludes with open challenges and future directions.

2. Traditional Network Interface Cards (NICs)

A Network Interface Card (NIC) is a largely fixed-function adapter whose primary role is to transmit and receive Ethernet (or InfiniBand) frames between the host memory system and the physical medium. In the classic model, everything above Layer-2 framing—including most transport semantics, application parsing, and policy decisions—remains the responsibility of the host CPU and the operating system networking stack.
From an architectural perspective, a traditional NIC integrates four tightly coupled functions. First, it terminates the physical link through PHY/SERDES blocks (clock recovery, equalization, and FEC where applicable). Second, it implements the MAC layer to perform framing and deframing (preamble/SFD processing, CRC generation and checking, VLAN tags, and often hardware timestamping for PTP). Third, it moves packet buffers to and from system DRAM using a DMA engine driven by descriptor rings programmed by the host driver, raising MSI-X events based on programmable moderation policies. Finally, it may include a small set of fixed accelerators—such as checksum offload, TSO/LSO, RSS, and basic flow steering—that reduce per-packet CPU work while keeping the datapath itself non-programmable and vendor-defined.
Because higher-layer protocol logic still executes on the host, the performance envelope of a traditional NIC remains coupled to CPU capacity, memory bandwidth, and the cost of software datapaths. This CPU-centric design dominated through the 1 Gb/s → 10 Gb/s era and still underpins many commodity adapters today; the difference in modern deployments is that software fast paths (kernel bypass and driver-level hooks) are increasingly used to postpone or reduce host overhead rather than changing the NIC’s fundamental role.

2.1. Historical Perspective

From the first 10 Mb/s Ethernet adapters of the 1980s to today’s 400 Gb/s ASICs, the purpose of a classic NIC has remained conceptually stable: move packets between a host buffer and the wire with minimal loss and bounded latency. As line rates climbed from 1 Gb/s to 10/40/100 Gb/s and beyond (see Table 1), vendors expanded the set of fixed offloads primarily to keep interrupts and per-packet CPU work under control (e.g., segmentation, coalescing, and receive-side scaling), while protocol evolution and policy continued to be handled in host software.

2.2. Baseline Architecture

The baseline architecture of a traditional NIC can be understood as a pipeline that converts wire-format frames into DMA-visible buffers. The front end consists of the PHY/PCS/SERDES and MAC, which recover the bitstream, apply FEC where required, and validate frames via CRC and framing rules. The core of the NIC is a DMA subsystem that reads and writes host memory based on descriptor rings and queue state maintained by the driver. A small control-plane interface (configuration registers, doorbells, and status counters) exposes link state, queue pointers, interrupt moderation settings, and steering rules. In modern adapters, this baseline is extended with multi-queue support and virtualization primitives (e.g., SR-IOV), enabling direct queue ownership by guests while still keeping most protocol logic on the CPU.
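The descriptor-ring handshake between driver and DMA engine can be modeled in a few lines. This is a simplified software model under stated assumptions: a single receive ring, one buffer per descriptor, and a per-descriptor "done" flag in place of real head/tail registers and doorbells. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Descriptor:
    buffer_addr: int       # where the NIC should DMA the frame
    length: int = 0        # filled in by the NIC on completion
    done: bool = False     # True while the driver still owns a filled buffer


class RxRing:
    """Toy receive ring: the NIC produces frames, the driver consumes them."""

    def __init__(self, size: int, buffer_bytes: int = 2048):
        self.size = size
        self.descs = [Descriptor(buffer_addr=i * buffer_bytes) for i in range(size)]
        self.memory: Dict[int, bytes] = {}   # stand-in for host DRAM
        self.nic_head = 0                    # next descriptor the NIC fills
        self.drv_tail = 0                    # next descriptor the driver reads

    def nic_receive(self, frame: bytes) -> bool:
        """NIC side: DMA a frame into the next free buffer, or drop it."""
        desc = self.descs[self.nic_head]
        if desc.done:                        # ring full, driver has fallen behind
            return False
        self.memory[desc.buffer_addr] = frame
        desc.length = len(frame)
        desc.done = True
        self.nic_head = (self.nic_head + 1) % self.size
        return True

    def driver_poll(self) -> Optional[bytes]:
        """Driver side: take one completed frame and recycle its descriptor."""
        desc = self.descs[self.drv_tail]
        if not desc.done:
            return None
        frame = self.memory.pop(desc.buffer_addr)
        desc.length = 0
        desc.done = False                    # hand the buffer back to the NIC
        self.drv_tail = (self.drv_tail + 1) % self.size
        return frame


if __name__ == "__main__":
    ring = RxRing(size=2)
    print([ring.nic_receive(f) for f in (b"a", b"b", b"c")])  # third frame dropped
    print(ring.driver_poll())
```

The model makes the coupling to host performance visible: once the driver stops polling, the ring fills and the NIC has no choice but to drop frames.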

2.3. Virtualization and Multi-Queue

Modern NICs expose hundreds or thousands of transmit and receive queues to parallelize packet processing and to isolate tenants. Receive-side scaling (RSS) distributes incoming traffic across queues using a hash over header fields so that multiple CPU cores can process packets concurrently, which is essential at 100–400 Gb/s line rates. SR-IOV extends this idea into virtualization by slicing the adapter into multiple PCIe virtual functions (VFs), allowing guest VMs or containers to steer and consume packets without hypervisor-mediated copying.
However, multi-queue and SR-IOV primarily reduce software overhead; they do not remove it. Even with direct queue ownership, the guest must still run a full networking stack and any application-specific parsing or policy. This is why contemporary deployments frequently pair traditional NICs with software fast paths. Driver-level hooks such as XDP/eBPF provide early drop/redirect decisions closer to the driver hot path, while kernel-bypass frameworks and AF_XDP can reduce copies and syscall overhead in latency-sensitive datapaths [3,25,26].
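The RSS mechanism described in this subsection is commonly implemented as a Toeplitz hash over the flow tuple followed by a lookup in an indirection table. The sketch below assumes an IPv4 TCP/UDP four-tuple, an example 40-byte key, eight queues, and a 128-entry table; all of these values are illustrative rather than taken from a specific adapter.

```python
import ipaddress
import struct

# Example 40-byte hash key (assumed value; real drivers program their own key).
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0d0ca2bcbae7b30b4"
    "77cb2da38030f20c6a42b73bbeac01fa"
)
NUM_QUEUES = 8
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(128)]


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: XOR a sliding 32-bit key window per set input bit."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least len(data) + 4 bytes")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def flow_tuple_bytes(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Serialize the four-tuple in network byte order as hash input."""
    return (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )


def select_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Map a flow to a receive queue via the low bits of its hash."""
    h = toeplitz_hash(RSS_KEY, flow_tuple_bytes(src_ip, dst_ip, src_port, dst_port))
    return INDIRECTION_TABLE[h & (len(INDIRECTION_TABLE) - 1)]


if __name__ == "__main__":
    for port in range(40000, 40004):
        print(port, select_queue("10.0.0.1", "10.0.0.2", port, 443))
```

Because the hash depends only on header fields, every packet of a flow lands on the same queue, which preserves per-flow ordering while spreading distinct flows across cores. Rebalancing load then only requires rewriting the indirection table, not the key.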

2.4. Representative Device Families

Representative device families illustrate how the classic NIC evolved from a simple DMA-based frame mover into a highly optimized I/O endpoint with increasingly rich but still largely fixed-function acceleration. Table 2 summarizes widely deployed families across multiple generations and highlights the incremental nature of NIC innovation: each step primarily adds hardware assistance for queue scaling (RSS/MSI-X), virtualization (SR-IOV/VMDq), overlay parsing (VXLAN/GENEVE), and time-sensitive features such as IEEE 1588 timestamping, while preserving the same fundamental architectural boundary—higher-layer semantics and most application logic remain on the host CPU. This table therefore serves as a baseline for the remainder of this paper: SmartNICs and DPUs should be interpreted as architectural responses to the point where incremental NIC offloads no longer suffice to contain CPU overhead at 100–400 Gb/s and beyond.

2.5. Strengths, Limitations, and Where Classic NICs Fit Today

Traditional NICs remain attractive because their hardware and software ecosystems are mature and predictable. They are supported across operating systems and hypervisors, have stable drivers and operational tooling, and their fixed-function datapaths are straightforward to validate and deploy at scale. Their bill of materials is also relatively low compared to reconfigurable solutions, and their deterministic behavior simplifies performance debugging under steady workloads.
The limitations emerge when line rates and workload complexity outpace what host software can handle efficiently. At 100–400 Gb/s, even modest per-packet work (metadata parsing, policy checks, buffer management, and security hooks) can consume substantial CPU capacity and memory bandwidth, while additional copies between NIC buffers, kernel space, and user space increase latency variance. Moreover, the feature set is effectively defined by silicon and firmware: new transport behaviors, novel telemetry, or specialized in-network transforms generally require host-side implementation. As a result, many operators employ software acceleration to extend the lifetime of classic NICs, but these techniques largely shift where CPU cycles are spent rather than eliminating them; in practice, this motivates a transition toward SmartNIC-class devices that can terminate or accelerate parts of the datapath on-card.

2.6. Role in Modern Data Center Systems

Traditional NICs remain dominant for web front-ends, scale-out storage, and edge systems where power budgets are tight and protocol requirements are stable. In hyperscale settings, they are increasingly paired with kernel-bypass and driver fast-path mechanisms to sustain throughput while controlling CPU burn. Nevertheless, as I/O-intensive services expand and the “I/O-driven server” model becomes more prominent, a fixed-function NIC often becomes the bottleneck for latency and for CPU efficiency, setting the stage for more capable SmartNICs and DPUs [27].

2.7. Bridge to SmartNICs

The central pressure point is not merely raw bandwidth but the cost of orchestrating data movement and per-packet decision-making in host software. This motivates moving selected functions—filtering, steering, security primitives, transport termination, and storage protocol handling—closer to the wire. The next section therefore introduces DPUs, which integrate these functions into a fixed-function but highly optimized SoC, and contrasts them with FPGA-based SmartNIC shells later in this paper.

3. Data-Processing Units (DPUs): Fixed-Function SmartNICs

A Data-Processing Unit (DPU)—also marketed as a SmartNIC ASIC or Infrastructure Processing Unit (IPU)—is a network adapter built around a custom system-on-chip that hardens large portions of the datapath and common infrastructure offloads into silicon. Unlike FPGA SmartNICs, a DPU trades broad reconfigurability for deterministic performance, stronger power efficiency, and a software stack that resembles a “miniature server” dedicated to I/O and security control. This shift aligns with the emerging view of the SmartNIC/DPU as a data movement controller rather than a peripheral that merely transfers packets [27].
Architecturally, modern DPUs combine (i) multi-rate Ethernet MAC/PCS blocks with high-speed SERDES, (ii) a packet parsing and classification pipeline (often P4- or microcode-programmable within fixed stage boundaries), (iii) hardened transport and storage engines (e.g., RoCEv2 RDMA, TCP segmentation/aggregation, and NVMe-oF), and (iv) on-card compute in the form of embedded CPU clusters that run control-plane services, agents, and sometimes portions of the virtual switch. High-bandwidth DMA engines and PCIe Gen4/Gen5 interfaces provide zero-copy access to host memory, and some platforms additionally target peer-to-peer paths for storage or accelerator attachment. The net result is that protocols such as RDMA and NVMe-oF can be terminated on-card, reducing host CPU overhead and stabilizing tail latency for I/O-intensive services.

3.1. Historical DPU Milestones (2017–2025)

The rapid evolution of DPUs over the last decade is best understood as a sequence of integration steps: first, the convergence of a high-throughput NIC datapath with a general-purpose on-card CPU complex; then, the progressive hardening of infrastructure primitives such as RDMA, storage fabrics, cryptography, and virtualization; and finally, the emergence of programmable packet-processing stages that can be shaped by P4-like or eBPF-like models. Table 3 provides a chronological view of major commercial families and highlights the specific inflection points that changed how DPUs are deployed: increases in port bandwidth, richer security/telemetry engines, and a more mature on-card software ecosystem. This timeline frames why DPUs are increasingly used as “infrastructure endpoints” in modern clusters—terminating network, storage, and security functions close to the wire—while also clarifying the main limitation relative to FPGA SmartNICs: their datapath capabilities are primarily defined at design time and expand only with new silicon generations.

3.2. Baseline Micro-Architecture

Table 4 lists the main architectural blocks of a DPU and summarizes the function and purpose of each.

3.3. Reference SmartNIC Hardware Architecture (Cross-Cutting View)

To avoid discussing each device family in isolation, we summarize a reference SmartNIC architecture that captures the common hardware blocks that recur across NICs, DPUs, and FPGA shells. A SmartNIC can be viewed as two tightly coupled subsystems: (i) a network datapath that receives frames, performs parsing/classification, applies actions (steering, filtering, encapsulation, encryption, and telemetry), and schedules traffic; and (ii) a host/memory subsystem that moves data and metadata between the card and the host (or GPU) via DMA/RDMA while enforcing isolation and ordering constraints. Around these, SmartNIC implementations add a control plane (embedded CPUs/firmware) and optional accelerators for compute-heavy primitives. The key architectural difference between device classes is where programmability lives: fixed-function NICs offer limited knobs around queueing and offloads; DPUs add general-purpose processing and rich I/O virtualization but keep most datapath functions in fixed engines; FPGA shells expose a programmable datapath and memory hierarchy at the cost of stricter timing/floorplanning and a larger “static infrastructure” footprint. Table 5 highlights the main blocks and their role in each class, providing a common vocabulary for the detailed discussions in Sections 2–7.
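The parse, classify, and act stages of the network datapath can be sketched as a small exact-match pipeline. The example assumes untagged Ethernet/IPv4 frames and a match key of (IP protocol, destination port); the table entries, queue numbers, and helper names are hypothetical, apart from UDP port 4791, which is the registered RoCEv2 port.

```python
import struct
from typing import Dict, Optional, Tuple

ETHERTYPE_IPV4 = 0x0800


def build_frame(proto: int, src_port: int, dst_port: int) -> bytes:
    """Build a minimal untagged Ethernet/IPv4 frame for the demo (checksums zero)."""
    eth = bytes(6) + bytes(6) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 24, 0, 0, 64, proto, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return eth + ip + struct.pack("!HH", src_port, dst_port)


def parse(frame: bytes) -> Optional[Tuple[int, int]]:
    """Parser stage: extract the (IP protocol, destination port) match key."""
    if len(frame) < 14 + 20:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return None
    ihl_bytes = (frame[14] & 0x0F) * 4       # IPv4 header length in bytes
    proto = frame[14 + 9]                    # protocol field at IP offset 9
    l4_off = 14 + ihl_bytes
    if proto not in (6, 17) or len(frame) < l4_off + 4:
        return None
    _, dst_port = struct.unpack("!HH", frame[l4_off:l4_off + 4])
    return proto, dst_port


# Match-action table: exact-match key -> (action, argument).
TABLE: Dict[Tuple[int, int], Tuple[str, int]] = {
    (17, 4791): ("steer", 3),   # RoCEv2 traffic to a hypothetical RDMA queue
    (6, 23): ("drop", 0),       # example policy: drop telnet
}
DEFAULT_ACTION = ("steer", 0)   # unparsed or unmatched traffic goes to queue 0


def process(frame: bytes) -> Tuple[str, int]:
    """Classifier + action stage: look up the key, fall back to the default."""
    key = parse(frame)
    if key is None:
        return DEFAULT_ACTION
    return TABLE.get(key, DEFAULT_ACTION)


if __name__ == "__main__":
    print(process(build_frame(17, 1000, 4791)))
    print(process(build_frame(6, 1000, 443)))
```

In this framing, the device classes differ in which part can change after deployment: a fixed-function NIC exposes only a few table entries, a DPU exposes the table and some stage behavior, and an FPGA shell lets the parser itself be replaced.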

3.4. Key On-Card Offloads and Accelerators

While the presence of embedded CPU cores is essential for orchestration, most of the performance and isolation benefits of DPUs come from their fixed-function engines and pipeline accelerators. These blocks determine which functions can execute entirely on-card with deterministic latency and without host CPU cycles, and they shape the practical boundary between “control-plane” tasks (policy, configuration, and observability) and “data-plane” tasks (classification, cryptography, and RDMA/storage termination). Table 6 therefore organizes common DPU capabilities by category and emphasizes the system-level consequence of each offload: lower CPU consumption per byte, reduced tail latency under load, and stronger multi-tenant isolation because sensitive traffic can be processed without traversing host memory. In the remainder of this paper, these categories are used as a reference checklist when comparing FPGA shells: FPGA SmartNICs can approximate many of these engines in reconfigurable logic but differ in how easily the functionality can be replaced, extended, or specialized via partial reconfiguration.

3.5. Virtualization and Multi-Tenant Isolation

DPUs place virtualization and isolation at the center of their design because they are frequently deployed in multi-tenant clouds where the NIC is a shared security boundary. Hardware queue hierarchies (often on the order of 10³–10⁴ RX/TX queues) are used to assign dedicated queue pairs to VFs, vDPA instances, or container endpoints, enabling scalable parallelism without requiring a monolithic host vSwitch to touch every packet. Address translation and access control are typically enforced using IOMMU/SMMU contexts per tenant, limiting DMA to explicitly authorized memory windows and reducing the blast radius of a compromised guest.
In addition, DPUs integrate a distinct security domain rooted in secure boot and hardware trust features (e.g., ROM-based boot chains and trusted execution modes such as TrustZone in Arm-based designs). This allows the DPU to run privileged control-plane agents and policy enforcement independently of the host OS. Recent empirical and characterization work emphasizes that these platforms are powerful but idiosyncratic: performance, programmability boundaries, and offload behavior can vary substantially across DPU generations and configurations, which must be accounted for when designing end-to-end systems [28].
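The per-tenant DMA window check described in this subsection can be approximated by a simple bounds-and-permission test. The sketch is a conceptual model only: real IOMMU/SMMU hardware walks page tables and translates addresses, whereas this example assumes flat, contiguous windows and uses hypothetical tenant names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Window:
    base: int        # first authorized address
    size: int        # window length in bytes
    writable: bool   # False means the device may only read from it


class DmaGuard:
    """Toy model of per-tenant DMA access control."""

    def __init__(self) -> None:
        self._windows: Dict[str, List[Window]] = {}

    def grant(self, tenant: str, base: int, size: int, writable: bool = True) -> None:
        """Authorize a memory window for a tenant."""
        self._windows.setdefault(tenant, []).append(Window(base, size, writable))

    def check(self, tenant: str, addr: int, length: int, write: bool) -> bool:
        """Allow a DMA only if one window fully contains it and permits the access."""
        if length <= 0:
            return False
        return any(
            w.base <= addr
            and addr + length <= w.base + w.size
            and (w.writable or not write)
            for w in self._windows.get(tenant, [])
        )


if __name__ == "__main__":
    guard = DmaGuard()
    guard.grant("vf0", base=0x1000, size=0x1000)
    print(guard.check("vf0", 0x1800, 0x100, write=True))   # inside the window
    print(guard.check("vf1", 0x1800, 0x100, write=True))   # tenant has no grant
```

A request that starts inside a window but runs past its end is rejected as a whole, which is the property that confines a compromised guest to its own memory.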

3.6. Representative Device Families (2023–2025)

To ground the discussion in concrete platforms, Table 7 lists representative DPU families that are either shipping or widely referenced in recent deployments. The table deliberately focuses on attributes that directly shape system integration: (i) port configuration and line rate (which sets the dataplane throughput target), (ii) the scale of the embedded CPU complex (which bounds control-plane capacity and the feasibility of running on-card agents such as virtual switching, telemetry, or policy enforcement), (iii) the presence of hardened protocol and security engines (which determines which infrastructure services can be terminated on-card without host involvement), and (iv) the PCIe generation and lane width (which constrains host, GPU, and storage attachment bandwidth). In later sections, these characteristics provide a point of comparison against FPGA shells, whose flexibility and partial reconfiguration capabilities trade off against the determinism and mature software ecosystems typically associated with these ASIC-based devices.

3.7. Strengths and Limitations

DPUs offer a compelling operating point for infrastructure offloads because much of the datapath is hardened in silicon: common tasks such as steering, encryption hooks, RDMA/NVMe-oF handling, and queue management can be executed with predictable microsecond-scale latency even under sustained load, while typically achieving substantially better performance-per-watt than reconfigurable designs for the same fixed offload. This advantage is reinforced by mature vendor software ecosystems—for example, full Linux environments on-card and production-grade SDKs that expose stable APIs for packet processing, storage, and security services—which lowers integration and deployment risk in large fleets. At the same time, these benefits come with structural constraints: the feature set and acceleration blocks are largely frozen until the next silicon tape-out cycle, programmability is usually bounded by the vendor’s pipeline and extension model (e.g., table edits in P4-like stages, eBPF-based hooks, or predefined accelerators rather than arbitrary custom datapaths), and the embedded CPU complex that runs control and orchestration can itself become the bottleneck when the control-plane is heavy or when many tenants compete for limited on-card compute and memory resources. Consequently, DPUs excel when the target workload aligns with supported offloads and operational maturity is critical, whereas FPGA shells remain attractive when the operator needs new datapath functions, rapidly evolving protocols, or tightly customized in-network computation.

3.8. Role in Modern Data Centers

DPUs have become mainstream in hyperscale and enterprise deployments because they shift infrastructure work—virtual switching, storage termination, and inline security—off the host CPU while improving isolation in multi-tenant environments. In public clouds, DPUs are commonly positioned as a bare-metal isolation boundary, where the control plane and tenant traffic separation are enforced on the NIC-side rather than in the host kernel (e.g., Google Cloud IPU initiatives and AWS Nitro-style architectures). In AI clusters, high-end DPU generations are increasingly used as fabric endpoints for large RoCE-based GPU pods, where the NIC must sustain high line rates with low jitter and handle telemetry, congestion signaling, and transport offloads close to the wire; this is frequently associated with NVIDIA BlueField “SuperNIC” deployments in GPU back-end networks. DPUs are also widely adopted in storage systems, where inline crypto, compression, checksums, and NVMe-oF/TCP termination reduce CPU load and stabilize tail latency—a typical pattern is embedding Pensando-class DPUs in enterprise storage appliances to accelerate data services and telemetry. Finally, in carrier and edge infrastructure, DPU/Infrastructure-processor families such as Marvell OCTEON are deployed to implement packet-core functions (e.g., UPF/BNG), combining line-rate forwarding with inline crypto and, increasingly, lightweight ML-assisted classification in power- and space-constrained environments.

3.9. Bridge to Reconfigurable SmartNICs

While DPUs deliver excellent performance per watt for stable protocols, emerging use cases (custom congestion control for AI collectives, new security formats, and in-NIC inference for proprietary models) require reprogrammability. FPGA SmartNIC shells fill this gap: they give up some power efficiency in exchange for dynamic reconfiguration, such as changing protocols on the fly or adding custom pipelines.

4. FPGA-Based SmartNIC Shells: Survey and Comparative Analysis

Modern FPGA SmartNIC platforms are typically engineered around a shell abstraction: a stable, board-validated base design that integrates the host interface (PCIe, DMA, and interrupts), the network interface (Ethernet MAC/PCS, timestamping, and flow steering), and the management/control plane, while exposing one or more user regions for custom offloads. This separation is motivated by operational realities: the shell concentrates vendor- and board-specific complexity (timing closure against hardened IP, reset sequencing, link bring-up, drivers, and compliance constraints), and the user region becomes the “innovation surface” where packet/transport/storage/AI accelerators evolve on a faster cadence.
A second motivation is that SmartNIC workloads are fundamentally parallel: to sustain 100–400 Gb/s line rate, a SmartNIC cannot rely on a single serial micro-engine. Instead, it must exploit spatial and pipeline parallelism (multi-queue RX/TX, replicated engines, and deep streaming pipelines with initiation interval close to one). Consequently, hardware architecture details—queueing model, pipeline boundaries, buffering strategy, memory hierarchy, and internal interconnect—directly determine whether a platform can meet throughput and tail-latency targets under adversarial traffic mixes.
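To make the parallelism argument concrete, the following minimal sketch computes the worst-case packet rate for minimum-size Ethernet frames and the number of clock cycles a single, non-replicated engine would have per packet. The 250 MHz fabric clock and the function names are assumptions chosen for illustration; they are not taken from any of the surveyed shells.

```python
import math

# Per-frame overhead on the wire, in bytes (standard Ethernet framing).
PREAMBLE_SFD_BYTES = 8
INTERFRAME_GAP_BYTES = 12
MIN_FRAME_BYTES = 64  # minimum frame size, FCS included


def packets_per_second(line_rate_bps, frame_bytes=MIN_FRAME_BYTES):
    """Packet rate when every frame on the link has the same size."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / wire_bits


def cycles_per_packet(line_rate_bps, clock_hz, frame_bytes=MIN_FRAME_BYTES):
    """Clock cycles available per packet for one non-replicated engine."""
    return clock_hz / packets_per_second(line_rate_bps, frame_bytes)


def min_bus_width_bits(line_rate_bps, clock_hz):
    """Smallest power-of-two datapath width that carries line rate."""
    return 2 ** math.ceil(math.log2(line_rate_bps / clock_hz))


if __name__ == "__main__":
    CLOCK_HZ = 250e6  # assumed fabric clock, for illustration only
    for gbps in (100, 400):
        rate = gbps * 1e9
        print(gbps, "Gb/s:",
              round(packets_per_second(rate) / 1e6, 1), "Mpps,",
              round(cycles_per_packet(rate, CLOCK_HZ), 2), "cycles/packet,",
              min_bus_width_bits(rate, CLOCK_HZ), "bit bus")
```

Under these assumptions, 100 Gb/s leaves fewer than two cycles per minimum-size packet, which is why an initiation interval close to one is needed, and 400 Gb/s leaves less than one cycle, which is only reachable by replicating engines or processing several packets per cycle.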

4.1. FPGA Shells: Shell Infrastructure and Subsystem Components

Concretely, an FPGA SmartNIC shell typically contains the following: (i) a host-facing subsystem (PCIe endpoint, DMA engines, doorbells/MSI-X, descriptor management, and driver-visible control/status registers); (ii) a wire-facing subsystem (Ethernet MAC/PCS/PHY interface, RSS/flow steering hooks, checksum/segmentation assists, and optional PTP timestamping); and (iii) an on-card fabric substrate (AXI-stream/NoC-style routing, clock/reset islands, performance counters, and a control-plane path for configuration and telemetry). The user region(s) then attach through standardized streaming and memory interfaces (e.g., AXI-Stream for packets; AXI4/NoC endpoints for state), so that offloads can be inserted without re-deriving the board bring-up and driver stack [34,35].
Parallel datapath processing is not an “optional optimization” but the main reason FPGA SmartNICs are viable: multi-queue designs allow the host to scale submission/completion across cores while the FPGA concurrently processes independent flows; replicated match/action or crypto/compression engines amortize per-stage latency; and careful buffering/backpressure design prevents head-of-line blocking when workloads combine small control packets with large data transfers. Frameworks such as DrawerPipe explicitly structure packet processing as interchangeable pipeline stages with uniform interfaces, making the parallel pipeline boundary a first-class design concept [36]. Likewise, EasyNet provides a 100 Gb/s networking substrate intended for HLS-based kernels, reflecting the practical need for “network I/O as a reusable shell service,” not bespoke per-project logic [37].
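The multi-queue idea can be sketched in software as a flow-to-queue mapping: packets of one flow always land on the same queue (preserving per-flow order), while different flows spread across queues and therefore across host cores. The sketch below is a simplified model, not the mechanism of any particular shell; it uses CRC32 as a stand-in hash, whereas hardware RSS implementations commonly use a Toeplitz hash with an indirection table. The function names are hypothetical.

```python
import zlib


def flow_key(src_ip, dst_ip, src_port, dst_port, proto):
    """Serialize a 5-tuple into bytes so it can be hashed."""
    return f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()


def select_queue(key, num_queues):
    """Map a flow to one receive queue; a flow always gets the same queue."""
    # CRC32 is an illustrative stand-in for the hash used by real RSS logic.
    return zlib.crc32(key) % num_queues


if __name__ == "__main__":
    NUM_QUEUES = 8
    flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 4791, "udp")
             for i in range(1000)]
    load = [0] * NUM_QUEUES
    for flow in flows:
        load[select_queue(flow_key(*flow), NUM_QUEUES)] += 1
    print("packets per queue:", load)
```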

4.2. FPGA Shells: Datapath Architecture and Design Methodologies

At a hardware level, most FPGA SmartNIC shells converge to a similar fast-path organization: a multi-port Ethernet front-end (MAC/PCS and PHY control), an ingress pipeline (parsing, classification, and optional match/action), a buffering and scheduling stage (often queue-based; sometimes with rate shaping/telemetry hooks), and a host-facing I/O subsystem that provides high-throughput DMA/RDMA semantics over PCIe. A representative example of a “NIC-first” open implementation is Corundum, which exposes the core hardware ingredients required for a modern 100 Gb/s-class interface: high-performance datapath logic, Ethernet MAC integration, a PCIe interface with a dedicated DMA engine, and precise timestamping support. It illustrates how much of a SmartNIC’s functionality is anchored in the mechanics of moving data and metadata deterministically at line rate [38]. From the shell perspective, these blocks form the non-negotiable baseline: regardless of higher-level programmability, a SmartNIC must sustain high DMA throughput, absorb burstiness, and preserve low and stable latency under contention.
Where shells differ is in how they structure programmability around that baseline. Frameworks such as ClickNP and DrawerPipe explicitly organize the datapath as composable stages/elements so that packet-processing functions can be assembled from reusable modules and mapped to pipelines without rewriting the entire NIC substrate [36,39]. In hardware terms, this typically translates to (i) a clear separation between a streaming packet path and a sideband metadata/control path, (ii) explicit stage boundaries that allow pipelining and replication, and (iii) a restricted, predictable interface between stages that improves timing closure and enables reuse across multiple functions. This architectural choice trades absolute micro-optimization for faster development and safer composition: the shell “pays” some infrastructure overhead so that user logic can be inserted as modules with well-defined backpressure and resource expectations.
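The composition idea can be illustrated with a small software model in which every stage has the same interface: it receives a packet plus sideband metadata and either forwards both or drops the packet. This is only an analogy for the stage contracts described above; it does not reproduce the ClickNP or DrawerPipe APIs, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Packet = bytes
Meta = Dict[str, object]
# Uniform stage contract: (packet, metadata) in, same pair out, or None to drop.
Stage = Callable[[Packet, Meta], Optional[Tuple[Packet, Meta]]]


def parse_eth(pkt: Packet, meta: Meta):
    """Extract the EtherType into the sideband metadata."""
    if len(pkt) < 14:
        return None  # runt frame, drop
    return pkt, dict(meta, ethertype=int.from_bytes(pkt[12:14], "big"))


def allow_ipv4(pkt: Packet, meta: Meta):
    """Pass only IPv4 frames (EtherType 0x0800)."""
    return (pkt, meta) if meta.get("ethertype") == 0x0800 else None


def count_bytes(counter: Dict[str, int]) -> Stage:
    """Build a telemetry stage that accumulates forwarded bytes."""
    def stage(pkt: Packet, meta: Meta):
        counter["bytes"] = counter.get("bytes", 0) + len(pkt)
        return pkt, meta
    return stage


def run_pipeline(stages: List[Stage], pkt: Packet):
    """Push one packet through the stages in order; stop on the first drop."""
    item: Optional[Tuple[Packet, Meta]] = (pkt, {})
    for stage in stages:
        item = stage(*item)
        if item is None:
            return None
    return item


if __name__ == "__main__":
    stats: Dict[str, int] = {}
    pipeline = [parse_eth, allow_ipv4, count_bytes(stats)]
    ipv4 = bytes(12) + b"\x08\x00" + b"payload"
    arp = bytes(12) + b"\x08\x06" + b"payload"
    print(run_pipeline(pipeline, ipv4) is not None, run_pipeline(pipeline, arp), stats)
```

Because each stage sees only the shared contract, stages can be reordered, replaced, or replicated without touching their neighbors, which is the software counterpart of the explicit stage boundaries discussed above.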
A second axis of differentiation is how shells treat memory hierarchy, reconfiguration boundaries, and isolation. Designs that target multi-service deployment and fast rollout increasingly introduce hierarchical regions and runtime-controlled reconfiguration mechanisms (e.g., multi-tenant regions and service management layers), pushing partial reconfiguration and explicit boundary protocols into the core shell architecture [40]. Similarly, cloud-oriented FPGA SmartNIC designs emphasize safe sharing of hardware resources and scaling of task graphs through different forms of parallelism, which has direct architectural consequences for buffer partitioning, arbitration, and scheduling across multiple tenants or services [5]. These goals also expose fundamental limitations: shared DMA engines, shared memories, and shared on-card accelerators can create cross-tenant interference unless the shell provides explicit performance and security isolation mechanisms. Work on SmartNIC isolation (even when demonstrated on SoC-based SmartNICs) makes the underlying point clear: without deliberate partitioning and enforcement, contention and side channels become first-order constraints on deployability [41,42].

Fundamental Hardware Limitations

Despite their programmability, FPGA SmartNIC shells face a set of recurring, hardware-rooted limits that shape achievable throughput, latency, and deployability. First, the I/O boundary is often the dominant bottleneck: PCIe DMA engines and their associated buffering/interrupt or doorbell mechanisms impose practical limits on sustained host throughput and latency variance, especially under many small messages or high queue counts. Second, the on-card memory hierarchy introduces contention effects that are easy to underestimate: even with HBM, performance depends on banking, access patterns, and NoC arbitration, and poorly partitioned buffers or shared metadata structures can create head-of-line blocking that manifests as tail-latency spikes. Third, timing closure and floorplanning become first-order constraints as soon as the shell supports large pipelines, multi-service composition, or partial reconfiguration regions; long interconnects, cross-region routes, and boundary crossings can dominate critical paths and reduce achievable frequency, while “infrastructure” logic (crossbars, schedulers, monitors, CDC, and reset/clock trees) consumes a non-trivial fraction of resources. Finally, isolation is not free: when multiple functions share MACs, DMA engines, memories, or accelerators, the shell must provide explicit mechanisms for bandwidth budgeting, backpressure propagation, and state partitioning; otherwise, interference and microarchitectural side channels can negate the benefits of multi-tenancy even if functional isolation is maintained. These limitations motivate shell designs that treat resource partitioning, predictable arbitration, and reconfiguration-aware floorplans as core architectural elements rather than as afterthoughts.
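One common way to express bandwidth budgeting with backpressure is a per-tenant token bucket. The sketch below is a generic software model under that assumption, not a mechanism reported by the surveyed shells; the class name and parameters are hypothetical, and time is passed in explicitly so that the behavior is deterministic.

```python
class TokenBucket:
    """Byte-rate budget for one tenant; time is supplied by the caller."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full burst allowance
        self.last = 0.0

    def try_send(self, nbytes, now):
        """Return True if nbytes fit the budget at time `now` (seconds)."""
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should assert backpressure toward this tenant


if __name__ == "__main__":
    tenants = {"A": TokenBucket(1000, 1500), "B": TokenBucket(1000, 1500)}
    print(tenants["A"].try_send(1500, 0.0))  # uses A's whole burst
    print(tenants["A"].try_send(100, 0.0))   # A is now throttled
    print(tenants["B"].try_send(1500, 0.0))  # B is unaffected by A
```

Keeping one bucket per tenant means that a tenant exhausting its budget is throttled at its own boundary, rather than degrading a shared DMA engine or memory port for everyone else.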

4.3. Representative Shells and Design Philosophies

Table 8 summarizes representative FPGA SmartNIC shells and closely related datapath frameworks. The set spans multiple design philosophies: (1) production NIC cores emphasizing stable interfaces and RTL control (e.g., Corundum [38]); (2) vendor-maintained shells that prioritize driver stability and platform integration (e.g., OpenNIC [43]); (3) research shells designed for rapid iteration across offload ideas (e.g., ClickNP [39]); and (4) cloud-oriented, multi-tenant shells that treat isolation, composability, and dynamic service rollout as first-order concerns (e.g., SuperNIC [5], Janus [6], and Coyote [40]). In addition, P4-enabled SmartNIC designs demonstrate how “programmable parsing + match/action” can be integrated into an FPGA NIC to support slicing and service-driven reconfiguration at the dataplane level [44].
Design philosophies, suitable use cases, and limitations. Although Table 8 compares shells on features, their design philosophies strongly shape where they are most effective and what their hard limits are. Corundum, for instance, is closest to a production NIC core: it prioritizes a clean, verifiable RTL NIC micro-architecture (queues, DMA, and datapath control), making it well suited when the contribution is a new NIC feature, a latency-critical inline primitive, or a reproducible research NIC baseline; the trade-off is that it largely assumes hardware-centric development and does not aim to provide cloud-style service composition or PR-based rollout mechanisms out of the box [38]. OpenNIC follows the opposite philosophy: it is a platform-first vendor shell that emphasizes board bring-up and a stable host-facing software stack (e.g., QDMA/DPDK integration), which makes it a pragmatic base for systems papers that need “working 100 GbE quickly” and for deploying monitoring, measurement, or match/action pipelines as plug-in datapath modules; however, the same platform emphasis implies a non-trivial fixed infrastructure overhead and—because the shell is typically static—functional evolution often requires full rebuild/redeployment rather than online PR updates. Concretely, DUMBO reports that a significant fraction of resources can be attributed to the fixed OpenNIC infrastructure and measures a baseline OpenNIC latency of roughly 960 ns [47]. Recent work also leverages OpenNIC-like shells specifically to study system-level objectives (e.g., energy-aware networking and OS integration), reinforcing the view that some shells are optimized for deployability rather than maximal datapath minimalism [48]. 
Research-first frameworks such as ClickNP are optimized for rapid datapath iteration via a modular programming model (Click-style elements compiled to FPGA modules), which is a strong fit for prototyping middleboxes, new scheduling/measurement logic, or feature exploration; the limitation is that such frameworks typically lag behind in absolute throughput targets and full offload completeness (e.g., relying on the host for TCP), and their abstraction can reduce fine-grained control over timing and resource utilization [39]. Finally, cloud-oriented shells such as Coyote treat composability, isolation, and rollout as first-order goals: PR regions, explicit boundary protocols, and runtime controllers enable safer multi-service deployment but at the cost of larger static footprint, more demanding timing/floorplanning, and additional engineering for performance/security isolation in multi-tenant settings [40,41,42]. Specialized designs (e.g., RDMA/verbs-centric RecoNIC or GPU-centric SmartNICs) are best matched to RDMA-heavy storage/AI clusters where zero-copy semantics dominate; their primary limitation is portability, since they depend on platform-specific RDMA/PCIe/IOMMU assumptions and software hooks to preserve those semantics end-to-end [45,49].

4.4. Architecture Patterns

Parallelism is not an optional optimization in SmartNICs; it is the mechanism that makes line-rate processing feasible under multi-port operation and concurrent services. At the hardware level, modern SmartNIC datapaths are inherently multi-lane (multiple MAC/PCS lanes, multiple queue pairs, and multiple memory channels/banks), and throughput scales only if packet handling, metadata generation, DMA, and optional preprocessing can run concurrently without serial bottlenecks. In practice, this means exploiting parallel receive/transmit queues (to match multi-core hosts and multiple flows), decoupling stages through pipelining (parser → match/action → scheduling → DMA/RDMA), and distributing state and buffers across on-card memories (e.g., HBM banking or multi-channel DDR) to avoid contention. The same requirement becomes stricter when a shell supports multiple reconfigurable regions or multiple services: parallelism must be paired with isolation (bandwidth budgeting, backpressure, and per-tenant resource limits) to prevent one workload from degrading others. The main limitations are architectural rather than conceptual: increasing concurrency raises floorplanning and timing-closure difficulty, amplifies NoC/memory arbitration effects, and can introduce latency variance if shared structures (queues, caches, HBM ports, and PCIe DMA engines) are not carefully partitioned. For these reasons, we treat “parallel data processing” as a first-order design objective that directly determines achievable throughput, latency stability, and safe multi-service composition on a SmartNIC.
Table 9 groups today’s shells by the architectural pattern they implement, ranging from fully static designs that require a fresh bitstream for every change to hierarchical shells that hot-swap both infrastructure and user kernels. Each pattern is accompanied by concrete adopters and a concise list of benefits and trade-offs.

4.5. Features

While the architectural pattern gives a high-level overview, practical deployment depends on concrete capabilities: maximum port count, whether dynamic PR is supported, which protocol engines are embedded into the shell, and which software toolchain developers must use. Table 10 shows these attributes for the six shells introduced earlier.

4.6. Programming-Model Spectrum

The six shells illustrate a continuum that balances raw hardware control against developer productivity. At one extreme, Corundum exposes only synthesizable Verilog; designers enjoy cycle-accurate freedom but must write RTL and close timing themselves, a workload suitable for hardware specialists and production NIC vendors. Moving up the abstraction ladder, Coyote and FpgaNIC wrap the datapath in Vitis HLS templates: kernel authors describe packet-side logic in C/C++, leaving the shell to manage AXI buses, DMA, and resets. RecoNIC and Coyote go a step further by accepting P4 descriptions of header parsing and match–action blocks, enabling network researchers to prototype new transport formats without touching RTL. At the highest level sits ClickNP, whose Click-style “elements” compile to HLS modules and can be re-wired at runtime, letting a systems engineer prototype router pipelines with the same model used in software. The cost of abstraction is twofold: less precise control over timing and resource utilization and, except for Coyote’s hierarchical design, limited ability to combine high-level and low-level modules in a single bitstream.

4.7. Build-Time and Run-Time Reconfiguration Costs

Partial reconfiguration is operationally meaningful only if two practical costs are controlled: offline build latency (how long it takes to produce a full or partial bitstream after a change) and on-card PR latency (how long traffic must be quiesced or rerouted while a region is updated). Table 11 aggregates the best-case figures reported by each project. The central engineering trade-off is visible in these numbers: designs that invest in hierarchical partitioning, strict interface contracts, and PR controllers often pay a larger static footprint and a more demanding timing-closure process, but in practice they enable “deploy new logic” cycles that are one to two orders of magnitude faster.

4.8. PR in Production: Orchestration, Reliability, and Toolchain Support

In production environments, PR is not only a reconfiguration mechanism; it becomes a distributed systems problem. A safe PR rollout typically requires (i) traffic orchestration (drain or divert flows away from the target region, preserve ordering where required, and bound packet loss), (ii) region quiescence (ensure there are no in-flight transactions across the region boundary), and (iii) rollback semantics (retain a known-good configuration and revert on timeout or validation failure). These requirements strongly influence shell architecture: designs that provide bypass paths, per-region performance isolation, and explicit boundary protocols can treat PR as an online operation rather than a maintenance window.
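The three rollout requirements (traffic orchestration, region quiescence, and rollback) can be sketched as a small control sequence. The model below is purely illustrative: `FakeRegion` and `pr_rollout` are hypothetical names, and the region object is a stand-in for whatever driver or runtime interface a real shell would provide.

```python
class FakeRegion:
    """Stand-in for one reconfigurable region; not a real driver API."""

    def __init__(self, image, healthy_images=(), stuck_transactions=0):
        self.image = image
        self.healthy_images = set(healthy_images) | {image}
        self.stuck_transactions = stuck_transactions
        self.in_flight = stuck_transactions
        self.accepting_traffic = True

    def drain(self):
        # Divert flows to a bypass path; only stuck transactions remain.
        self.accepting_traffic = False
        self.in_flight = self.stuck_transactions

    def load(self, image):
        self.image = image

    def validate(self):
        # Placeholder health check; a real one would probe the new logic.
        return self.image in self.healthy_images

    def resume(self):
        self.accepting_traffic = True


def pr_rollout(region, new_image):
    """Drain, check quiescence, load, validate; revert on failure."""
    known_good = region.image
    region.drain()
    if region.in_flight != 0:
        region.resume()
        return "aborted"  # boundary not quiescent, leave the region untouched
    region.load(new_image)
    if region.validate():
        outcome = "committed"
    else:
        region.load(known_good)  # rollback to the retained configuration
        outcome = "rolled_back"
    region.resume()
    return outcome


if __name__ == "__main__":
    region = FakeRegion("v1", healthy_images={"v2"})
    print(pr_rollout(region, "v2"), region.image)
    print(pr_rollout(region, "broken"), region.image)
```

In every outcome the region ends up running a validated image and accepting traffic again, which is the property that lets PR be treated as an online operation.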
Reliability concerns also matter at scale. Even if PR latency is sub-second, field operation must account for configuration integrity, transient faults, and bitstream management. At minimum, shells should support authenticated bitstreams, validation before activation, and operational tooling that makes PR artifacts reproducible (clear provenance from source commit to partial bitstream). Toolchain support is equally critical: reproducible floorplanning, stable interface timing, and clear separation between shell timing closure and user logic closure reduce the “integration tax” that otherwise prevents PR from being used outside research prototypes. A broad perspective on these multi-tenant and operational constraints—including attack surfaces introduced by user-programmable regions—is surveyed in [50].

4.9. Security: Secure Boot, Bitstream Integrity, and Side Channels

Shells change the security model of a NIC by allowing hardware logic to evolve post-deployment. This creates two security imperatives. First, platforms need a chain of trust from boot to runtime: secure boot of the management controller/host agent that provisions the FPGA, authenticated loading of the base shell image, and authenticated loading of partial bitstreams for PR regions. Second, shells must assume that bitstreams are an adversarial input in multi-tenant settings: protections should address bitstream tampering, replay/rollback of vulnerable configurations, and unauthorized modification of PR payloads in transit or at rest.
Standardized guidance for protecting EDA/IP artifacts (including encryption and key management practices used in hardware design flows) is captured by IEEE Std 1735 (useful as a baseline reference point for production-grade protection expectations) [51]. For cloud-style FPGA services, the most salient risks include cross-region information leakage (timing/power side-channels and shared-resource contention), boundary violations via malformed interfaces, and supply-chain risk in the PR artifact pipeline; these issues and their mitigations are discussed at survey depth in [50].
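As a minimal sketch of the two checks implied above (integrity of the partial bitstream and protection against rollback to an older configuration), the following example binds a version number to the payload with an HMAC and rejects anything older than a stored minimum version. This is an assumption-laden illustration: production devices rely on vendor-specific authenticated-encryption or signature schemes, and the function names and 4-byte version field are hypothetical.

```python
import hashlib
import hmac


def sign_bitstream(key: bytes, version: int, payload: bytes) -> bytes:
    """Produce a tag that covers both the version and the bitstream bytes."""
    message = version.to_bytes(4, "big") + payload
    return hmac.new(key, message, hashlib.sha256).digest()


def accept_bitstream(key: bytes, version: int, payload: bytes,
                     tag: bytes, min_version: int) -> bool:
    """Accept only authentic images that are not older than min_version."""
    expected = sign_bitstream(key, version, payload)
    if not hmac.compare_digest(expected, tag):
        return False  # tampered payload, altered version, or wrong key
    return version >= min_version  # refuse rollback to an older image


if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for the example only
    tag = sign_bitstream(key, 3, b"partial-bitstream-bytes")
    print(accept_bitstream(key, 3, b"partial-bitstream-bytes", tag, 2))
    print(accept_bitstream(key, 3, b"partial-bitstream-bytez", tag, 2))
    print(accept_bitstream(key, 3, b"partial-bitstream-bytes", tag, 4))
```

Covering the version inside the tag matters: otherwise an attacker could replay an old, vulnerable image under a newer version number.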

4.10. Strengths and Weaknesses of FPGA Shells

Different shells excel in different use cases; no single design dominates every metric. Table 12 shows the standout advantage and the principal limitation of each shell, giving an overall picture of where a given platform will, or will not, fit a project’s requirements.

4.11. Takeaways for a New Shell Design

A new shell design must meet a small set of concrete requirements derived from the discussion above. First, sub-second partial reconfiguration should be treated as a hard requirement: in a 100 Gb/s deployment, even one minute of downtime forfeits more than 700 GB of potential throughput, and hierarchical approaches validated by Coyote v2 and RecoNIC indicate that service-transparent PR latencies below 350 ms are attainable without compromising timing closure. Second, the shell should expose a minimal set of hardened fast-path engines to broaden applicability; embedding essential blocks—such as RoCE verbs handling, IEEE 1588 timestamping, and checksum generation—spares application teams from repeatedly re-implementing line-rate plumbing and allows the platform to function as an immediate drop-in NIC rather than merely an FPGA board. Third, a stable, well-documented software stack is a decisive adoption factor: shells accompanied by mature drivers and user-space libraries (e.g., OpenNIC, Corundum, and Coyote) tend to see disproportionate uptake compared to technically rich but API-sparse alternatives, and sustained investment in a long-lived C/P4/HLS SDK typically yields higher dividends than ultra-aggressive area optimization. Fourth, high-level entry points—HLS, P4, or Click-style composition—meaningfully compress innovation cycles; while they may consume roughly 10% additional fabric, they can reduce idea-to-bitstream turnaround from weeks to days, which is an acceptable trade-off in both academic research and prototyping. Fifth, static-region budget is finite and must be planned explicitly: experience with hierarchical shells suggests that infrastructure can appropriate 25–40% of total LUTs, so capacity planning should reserve headroom for at least two large user accelerators. 
Finally, zero-copy peer paths (e.g., GPU P2P and shared RDMA queues) are becoming decisive differentiators; we therefore treat GPU ↔ FPGA offload as a first-class design axis and discuss the detailed datapath models and constraints (topology, buffer registration, and completion semantics) in Section 5.
Implication: an industry-ready FPGA SmartNIC shell has to combine (i) deterministic, sub-second PR support, (ii) a curated but indispensable set of fixed offloads, (iii) a vendor-agnostic, version-stable SDK, and (iv) explicit architectural provision for zero-copy accelerator paths—all while preserving sufficient reconfigurable resources for future, as-yet-unknown workloads.
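The downtime figure behind the first requirement follows from simple arithmetic, reproduced in the sketch below. The helper name is hypothetical; the inputs (100 Gb/s, one minute, and 350 ms) are the values quoted in this subsection.

```python
def forfeited_gigabytes(line_rate_gbps, downtime_s):
    """Data volume (GB) that a saturated link could have carried in downtime_s."""
    return line_rate_gbps / 8 * downtime_s  # Gb/s -> GB/s, then times seconds


if __name__ == "__main__":
    print(forfeited_gigabytes(100, 60))    # one minute of downtime
    print(forfeited_gigabytes(100, 0.35))  # a 350 ms PR window
```

At 100 Gb/s, one minute corresponds to 750 GB of potential traffic, whereas a 350 ms reconfiguration window corresponds to under 5 GB, a gap of more than two orders of magnitude.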

5. GPU-to-FPGA Direct Offload

Contemporary SmartNIC innovation now targets the inverse bottleneck: removing the host CPU from GPU-to-network egress paths. In accelerator-centric servers, vast result tensors (video-analytics metadata, ML inference outputs, or compressed columnar data) originate inside GPU HBM. In a classical pipeline, the egress journey still involves (i) a DMA pull from GPU memory into host DRAM, (ii) CPU-driven protocol framing, checksumming, and congestion handling, and (iii) a second DMA push from host DRAM to the NIC. At 100–400 Gb/s, these double copies and the cache traffic they generate can occupy dozens of CPU cores that add no computational value.
A GPU → FPGA → Network offload path collapses the chain. Leveraging peer-to-peer PCIe Gen4/5 (and, prospectively, cache-coherent CXL 3.0), the FPGA SmartNIC initiates DMA reads that fetch payloads directly from GPU HBM. Packetisation, protocol termination (TCP/UDP/RoCE), and line-rate scheduling are executed in deterministic FPGA logic; frames depart the NIC without ever traversing host memory. The CPU is relegated to low-duty control-plane tasks, freeing cache bandwidth. Prototype systems (e.g., FpgaNIC, RecoNIC) have demonstrated 4–6 µs reductions in tail latency at 100 Gb/s and the release of a full CPU socket’s worth of cycles on inference nodes.
This mechanism has two distinct advantages. First, it reshapes the data-delivery path into a direct GPU-to-wire path that does not use host DRAM. Second, it maximizes GPU utilization, ensuring that accelerators are not throttled by PCIe round-trips and CPU copies.
Typical deployment domains for this architecture include distributed deep-learning training, where gradient blocks are emitted directly from GPU HBM, optionally pre-compressed or aggregated on the FPGA, and then streamed onto an RDMA fabric without CPU involvement. Another domain is LLM serving gateways: GPUs generate token logits while the FPGA performs final formatting, batching, and flow control, pushing responses straight to client connections. The same pattern applies to real-time video analytics and XR, in which feature maps or inference metadata leave GPU memory, undergo watermarking or encryption in FPGA fabric, and reach the network within a frame budget on the order of tens of milliseconds. In storage disaggregation, GPUs can handle compression or erasure coding while the FPGA encapsulates the resulting chunks into NVMe-oF or SPDK-oriented transports end-to-end, again avoiding CPU participation on the data path. Finally, ultra-low-jitter trading systems can use GPU-based risk engines to stream price vectors while the SmartNIC attaches FIX/OUCH headers and enforces pacing to satisfy stringent tick-to-trade latency and jitter targets (e.g., sub-25 μs SLAs).
A system that bypasses the CPU egress path (depicted in Figure 1) relies on a small set of enabling technologies. First, it needs a peer-to-peer physical interconnect that can sustain the required bandwidth between the GPU and the FPGA. Second, it requires an addressing and permission model that allows the FPGA to treat GPU memory as a valid DMA target, rather than forcing all transfers through host memory. Third, the design depends on symmetrical DMA capability on both sides—i.e., both devices must be able to initiate and complete the relevant memory transactions and synchronization semantics efficiently. Finally, looking forward, cache-coherent load/store semantics would further simplify the datapath by eliminating explicit copy operations altogether, enabling true shared-memory communication between the GPU and FPGA.
Table 13 summarizes the state of each layer and cites representative vendor or standards documents.

5.1. Application Domains to Benefit from GPU → FPGA → Network Offload

Distributed deep-learning training. State-of-the-art frameworks (e.g., NCCL over RoCE) spend up to 30% of training time in gradient aggregation. With peer-to-peer DMA, the SmartNIC can pull tensors directly from GPU HBM, perform a first-stage reduction in FPGA logic, and transmit a single aggregated vector on the wire. The FpgaNIC prototype reports a 1.3× wall-clock speed-up on 100 Gb/s links, while freeing an entire CPU socket for auxiliary workload management [49].
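The two figures in the preceding paragraph can be related through Amdahl's law. The sketch below assumes, purely for illustration, that the 30% aggregation share and the 1.3× speed-up describe the same workload; the cited sources do not state this, so the result should be read as a plausibility check rather than a measured value. Function names are hypothetical.

```python
def overall_speedup(fraction, stage_speedup):
    """Amdahl's law: speed-up of the whole job when one stage gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


def required_stage_speedup(fraction, target):
    """Stage speed-up needed to reach a target overall speed-up."""
    remaining = 1.0 / target - (1.0 - fraction)
    if remaining <= 0:
        raise ValueError("target exceeds the Amdahl bound for this fraction")
    return fraction / remaining


if __name__ == "__main__":
    share = 0.30  # assumed share of training time spent in aggregation
    print("upper bound:", round(1.0 / (1.0 - share), 2))
    print("stage speed-up for 1.3x:", round(required_stage_speedup(share, 1.3), 2))
```

Under this assumption, even eliminating aggregation entirely would cap the overall gain at roughly 1.43×, and reaching 1.3× requires the aggregation stage itself to run about 4.3 times faster.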
Large-language-model (LLM) serving. Tokenisation and prompt-cache lookup are latency-sensitive yet structurally simple; implementing these primitives in the FPGA allows the GPU to focus on matrix multiplies. Early benchmarks on GPUDirect-enabled BlueField systems show a 20–25% improvement in tokens-per-second for GPT-style models when request preprocessing is offloaded from the CPU [52].
Real-time video analytics and XR. Edge nodes ingest uncompressed 4K/8K streams, run CNNs or vision transformers, and must respond within a single video frame. FPGAs can perform crop/resize and color-space conversion at line rate; results are DMA-read directly from GPU memory and forwarded with end-to-end latencies under 16 ms on Jetson-based P2P prototypes [52].
Storage disaggregation with inline computation. In composable NVMe-oF fabrics, GPUs execute compression or erasure-coding kernels while the FPGA SmartNIC terminates NVMe headers and streams coded stripes onto the network, slashing host-DRAM traffic by more than 50% in RecoNIC demonstrations [45].
Low-jitter financial trading pipelines. Market-data pre-filters run on the SmartNIC (FIX/OUCH parsing; order-book updates) while GPUs compute risk or pricing models. Direct DMA of result vectors to the NIC reduces cache pollution and helps meet sub-25 μs tick-to-trade service-level agreements, as documented in recent GPUDirect RDMA application notes [52].
Collectively, published evaluations record 4–6 μs reductions in tail latency at 100 Gb/s and reclaim an entire CPU socket’s worth of cycles per server—underscoring that GPU-to-FPGA direct offload is not a niche optimization but a decisive enabler for accelerator-dominated data center workloads.

5.2. State of the Art (2024–2025)

This subsection summarizes representative SmartNIC-related developments reported in the 2024–2025 period, emphasizing what changed (capabilities, constraints, and deployment practice) relative to earlier designs.
Recent platform work points to several concrete enablers that make GPU → FPGA → network offload more practical and, increasingly, more general.
GPU-virtual-address DMA (GVAD) removes a major integration hurdle by allowing the SmartNIC to issue DMA transactions directly against GPU virtual addresses. In FpgaNIC, GPU virtual address mappings are brought into the FPGA’s AXI master so that GPUDirect RDMA can target on-card memory (HBM) without relying on intermediate pin-down buffers [49].
Shared RDMA verbs for in-NIC accelerators move the RDMA control plane closer to the datapath. RecoNIC exposes the host’s queue-pair table to FPGA kernels, enabling an accelerator to post operations such as RDMA_WRITE autonomously rather than routing every action through the host [45].
Cache-coherent attachment via CXL 3.0 suggests a future where explicit copies can be replaced by coherent load/store semantics. Early announcements for devices such as Intel Agilex 2 and AMD/Xilinx Versal Premium ES indicate CXL Type-3 capability, allowing SmartNICs to access GPU-resident data as coherent cache lines with sub-microsecond latency budgets (often quoted as <300 ns for load/store paths) [56,57].
PCIe 6.0 readiness matters because peer-to-peer designs are frequently bandwidth-limited at the link. Public PCIe 6.0 summaries commonly cite up to ∼256 GB/s bidirectional bandwidth for a ×16 connection, effectively doubling today’s headroom for P2P transfers and reducing pressure on compression or aggregation to “fit the pipe” [54].
Edge-scale GPUDirect on ARM broadens the deployment envelope. NVIDIA’s Jetson Orin driver support for external FPGA BARs under GPUDirect RDMA opens the door to sub-15 W inference gateways where the SmartNIC performs in-line preprocessing, packaging, or security functions while the GPU focuses on inference [52].
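The PCIe 6.0 bandwidth figure cited above can be reproduced from the per-lane signaling rates. The sketch below computes raw link bandwidth only; it deliberately ignores encoding, FLIT, and protocol overheads, so usable throughput is lower in practice. The table and function names are hypothetical.

```python
# Per-lane signaling rate in GT/s for recent PCIe generations.
PCIE_GT_PER_S = {"4.0": 16, "5.0": 32, "6.0": 64}


def raw_bandwidth_gbytes(generation, lanes=16):
    """Return (per-direction, bidirectional) raw bandwidth in GB/s."""
    per_direction = PCIE_GT_PER_S[generation] * lanes / 8  # bits -> bytes
    return per_direction, 2 * per_direction


if __name__ == "__main__":
    for gen in ("4.0", "5.0", "6.0"):
        one_way, both_ways = raw_bandwidth_gbytes(gen)
        print(f"PCIe {gen} x16: {one_way:.0f} GB/s per direction, "
              f"{both_ways:.0f} GB/s bidirectional")
```

The ×16 result for PCIe 6.0 is 128 GB/s per direction, or 256 GB/s bidirectional, which is twice the PCIe 5.0 value.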

5.3. Design Checklist

To bridge the discussion on hardware limitations with practical future implementations, the design checklist (depicted in Table 14) defines the technical benchmarks required to address current bottlenecks with a next-generation GPU-aware FPGA SmartNIC shell.

6. In-NIC Preprocessing for RDMA Pipelines

Remote Direct Memory Access (RDMA) has become the de facto transport for east–west traffic in hyperscale data centers, underpinning Azure’s storage fabric [2], Meta’s Memcache deployments [58], and every modern NVMe-oF disaggregation layer. By allowing a user process to issue a WRITE or READ that bypasses the remote CPU, RDMA reduces latency to single-digit microseconds and cuts host-cycle consumption by an order of magnitude. The trade-off is bandwidth efficiency: an RDMA work request transmits its payload exactly as provided by the application. Any aggregation, compression, or application-level filtering that could shrink or otherwise optimize the payload must occur before the work request is posted. In practice, that preprocessing is still executed in software, re-introducing the memory copies, cache pollution, and context switches that RDMA was meant to avoid.
FPGA-based SmartNICs provide the missing component. When preprocessing operates on GPU-resident buffers via PCIe P2P, the underlying GPU → FPGA transfer models are the same as those detailed in Section 5; this section focuses on what to preprocess and how it composes with RDMA semantics.
A shell equipped with streaming parsers, match–action tables, and on-board scratchpads can transform data inside the NIC before the first RDMA_WRITE reaches the wire. This preserves RDMA’s low-latency promise while reclaiming PCIe bandwidth and host CPU cycles. Published pipelines report tangible gains across multiple workload classes.
Gradient aggregation in the NIC (as demonstrated by SwitchML and FpgaNIC) reduces east–west traffic volume by more than 4× and trims all-reduce latency by roughly 35–40% at 100 Gb/s [49].
Inline LZ4 compression integrated into a Corundum-derived datapath increases effective link throughput from 100 Gb/s to around 170 Gb/s for log-streaming workloads, while adding less than 20 W of board power [38].
Key-prefix filtering implemented in FPGA logic prior to an RDMA GET reduces Memcached tail latency by approximately 60 μs under hotspot workloads [59].
Hence, in-NIC preprocessing is a structural requirement for the next wave of RDMA-centric fabrics: it aligns the granularity of data on the wire with the semantics of the application—without revisiting the host CPU that RDMA set out to bypass.

6.1. RDMA Fast-Path Explanation

RDMA exposes a transport in which user space posts a Work Queue Element (WQE) and rings a doorbell, after which the NIC executes the request without further host involvement; completion is reported through a lightweight Completion Queue Entry written to a Completion Queue (CQ). Figure 2 highlights the critical objects.
  • Protection Domain (PD)—a capability container that binds memory regions, queue pairs, and completion queues.
  • Memory Region (MR)—a page-pinned address range translated by the HCA/IOMMU; referenced by an lkey/rkey.
  • Queue Pair (QP)—transmit and receive rings holding WQEs. Common opcodes are RDMA_WRITE, RDMA_READ, and SEND/RECV.
  • Completion Queue (CQ)—ring of Completion Queue Entries that report success, error, and byte count.
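The relationship between these objects can be illustrated with a toy software model. This is only a sketch: the class and method names below are hypothetical and do not correspond to the verbs API or to any specific HCA, and the "NIC" is simulated by an ordinary method call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A registered buffer, looked up by a key in the way an lkey/rkey would be."""
    key: int
    buf: bytearray


@dataclass
class WQE:
    """Work Queue Element describing one RDMA_WRITE-like request."""
    wr_id: int
    lkey: int    # key of the local source region
    rkey: int    # key of the remote destination region
    length: int


@dataclass
class CQE:
    """Completion Queue Entry reporting status and byte count."""
    wr_id: int
    status: str
    byte_len: int


@dataclass
class ProtectionDomain:
    """Capability container: only regions registered here are reachable."""
    regions: dict = field(default_factory=dict)

    def register(self, key: int, size: int) -> MemoryRegion:
        mr = MemoryRegion(key, bytearray(size))
        self.regions[key] = mr
        return mr


class QueuePair:
    """Send ring plus its completion queue (receive side omitted)."""

    def __init__(self, pd: ProtectionDomain, remote_pd: ProtectionDomain):
        self.pd, self.remote_pd = pd, remote_pd
        self.send_queue = deque()
        self.cq = deque()

    def post_send(self, wqe: WQE) -> None:
        # Host side: enqueue the request (the doorbell write is implied).
        self.send_queue.append(wqe)

    def nic_process(self) -> None:
        # NIC side: drain the ring without any further host involvement.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            src = self.pd.regions.get(wqe.lkey)
            dst = self.remote_pd.regions.get(wqe.rkey)
            if src is None or dst is None or wqe.length > min(len(src.buf), len(dst.buf)):
                self.cq.append(CQE(wqe.wr_id, "error", 0))
                continue
            dst.buf[:wqe.length] = src.buf[:wqe.length]
            self.cq.append(CQE(wqe.wr_id, "success", wqe.length))

    def poll_cq(self):
        # Host side: return the next completion, or None if the CQ is empty.
        return self.cq.popleft() if self.cq else None
```

In this model the host only touches `post_send` and `poll_cq`; the copy itself and the key checks happen in `nic_process`, mirroring the division of labor described above.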

6.1.1. Latency Budget

On modern HCAs (e.g., ConnectX-6 Dx), a WRITE WQE requires only 250 ns from doorbell to first byte on the wire, and a further 600 ns until completion signaling at a line rate of 100 Gb/s [60]. However, this budget excludes any application-level marshaling performed prior to posting the WQE, which is precisely where in-NIC preprocessing offers substantial savings.
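To make the budget concrete, the short calculation below combines the two hardware figures quoted above with a software marshaling cost. The 2 μs marshaling value used in the usage note is a purely hypothetical placeholder, not a measurement from the surveyed systems; the function names are likewise illustrative.

```python
DOORBELL_TO_WIRE_NS = 250.0    # doorbell to first byte on the wire (from the text)
WIRE_TO_COMPLETION_NS = 600.0  # first byte to completion signaling (from the text)


def hardware_budget_ns() -> float:
    """HCA-side latency of one WRITE, excluding any host-side preparation."""
    return DOORBELL_TO_WIRE_NS + WIRE_TO_COMPLETION_NS


def serialization_ns(payload_bytes: int, line_rate_gbps: float = 100.0) -> float:
    """Time to clock the payload onto the link: bits divided by Gb/s gives ns."""
    return payload_bytes * 8 / line_rate_gbps


def marshal_share(marshal_ns: float) -> float:
    """Fraction of end-to-end time spent on software marshaling."""
    return marshal_ns / (marshal_ns + hardware_budget_ns())
```

For example, `hardware_budget_ns()` gives 850 ns and `serialization_ns(4096)` gives about 328 ns, so an assumed 2 μs of host-side marshaling (`marshal_share(2000.0)`, roughly 0.70) would dominate the hardware path.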

6.1.2. Role of a SmartNIC Shell

A reconfigurable shell can tap the AXI-stream path before the HCA serializer: parsers or compute kernels rewrite, compress, or aggregate the payload and then pass a compacted buffer to the DMA engine, leaving the RDMA transport unchanged. Because the QP state (PSN, ACKs, and congestion control) resides inside the shell, the accelerator needs to implement only the data transform.

6.1.3. Transport Variants

Although InfiniBand defines the verbs model, most data centers deploy RDMA over Ethernet. RoCE v2 (Routable RDMA over Converged Ethernet) encapsulates the InfiniBand transport header and payload in UDP/IP, with the IP header taking the place of the InfiniBand GRH, and is ubiquitous at 25 Gb/s and above. iWARP carries the verbs over TCP; it is largely superseded in modern clusters but can remain relevant for long-haul or less tightly controlled networks. NVMe-oF over RDMA reuses SEND/RECV opcodes to transport NVMe command capsules, enabling microsecond-scale access latencies.
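As an illustration of what RoCE v2 framing costs on the wire, the sketch below tallies header bytes for a single RDMA WRITE message. The header sizes are standard protocol values assumed here rather than figures from the surveyed papers (Ethernet 14 B, IPv4 20 B, UDP 8 B, BTH 12 B, RETH 16 B on the first packet only, ICRC 4 B, FCS 4 B, plus 20 B of preamble and inter-frame gap); VLAN tags, IPv6, and acknowledgements are ignored.

```python
ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCE v2

ETH, IPV4, UDP, BTH, RETH, ICRC, FCS = 14, 20, 8, 12, 16, 4, 4
PREAMBLE_AND_IFG = 20  # 7 B preamble + 1 B start delimiter + 12 B inter-frame gap

PER_PACKET_OVERHEAD = ETH + IPV4 + UDP + BTH + ICRC + FCS + PREAMBLE_AND_IFG


def wire_bytes(message_bytes: int, path_mtu: int) -> int:
    """Bytes occupied on the link by one RDMA WRITE of the given size."""
    packets = max(1, -(-message_bytes // path_mtu))  # ceiling division
    # RETH (virtual address, rkey, length) travels only in the first packet.
    return message_bytes + packets * PER_PACKET_OVERHEAD + RETH


def efficiency(message_bytes: int, path_mtu: int) -> float:
    """Payload bytes as a fraction of all bytes on the wire."""
    return message_bytes / wire_bytes(message_bytes, path_mtu)
```

Under these assumptions a 4096-byte WRITE occupies 4194 bytes on the wire with a 4096-byte path MTU and 4440 bytes with a 1024-byte path MTU, which is one reason payload-shrinking transforms pay off most for small or fragmented messages.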
Understanding these mechanisms clarifies where a preprocessing kernel must attach: after virtual-address translation but before transport encapsulation, so correctness is preserved while the host-CPU bypass is maximized.

6.2. Taxonomy of In-NIC Preprocessing Tasks

Network payloads can be modified at five distinct semantic layers before an RDMA_WRITE/READ is issued. Classifying these transforms clarifies which hardware blocks and shell services must be present to support a given use case.

6.2.1. Header Manipulation

Inline checksum and CRC generation is a common requirement in storage fabrics such as NVMe-oF and iSCSI-RDMA, which often demand per-segment integrity checks; computing these values on the FPGA avoids stalling the host or GPU on the critical path. Another class of header-level work is header compression and expansion, where proprietary overlays compress private-data fields and SmartNIC logic restores them just before ingress to the remote NIC/HCA.
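A per-segment integrity check of this kind can be sketched in software as a streaming CRC. The example below uses CRC32C (the Castagnoli polynomial used for iSCSI data digests) as an assumed choice; the text above does not prescribe a specific polynomial, and a hardware pipeline would process many bytes per clock rather than one byte per loop iteration.

```python
def _make_table():
    """Build the byte-wise lookup table for the reflected CRC32C polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_update(state: int, chunk: bytes) -> int:
    """Fold one segment into the running CRC state (streaming interface)."""
    for byte in chunk:
        state = _TABLE[(state ^ byte) & 0xFF] ^ (state >> 8)
    return state


def crc32c(data: bytes) -> int:
    """One-shot CRC32C: initial value and final XOR are both 0xFFFFFFFF."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF
```

The streaming form matters for a NIC datapath: segments can be folded in as they pass through, and the final XOR is applied only when the last segment of the protected unit has been seen.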

6.2.2. Payload Shaping

A prominent payload transform is vector aggregation and reduction. SwitchML and FpgaNIC aggregate machine-learning gradients in on-NIC BRAM/HBM, reducing east–west traffic by about 4× and shaving roughly 35% off all-reduce latency [49]. A second payload-shaping primitive is scatter–gather coalescing: repacking multiple small RDMA regions into a single contiguous buffer reduces work-queue-entry (WQE) count and lowers PCIe overhead.
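The aggregation pattern can be sketched as a slot that sums one chunk per worker and emits a single result once all workers have contributed. This is a simplified illustration, not the SwitchML or FpgaNIC implementation: the class name, the fixed-point scale, and the duplicate-handling policy are assumptions made for the example.

```python
SCALE = 1 << 16  # assumed fixed-point scale; integer adders are cheap in hardware


def to_fixed(values):
    """Convert floats to scaled integers before they enter the aggregator."""
    return [round(v * SCALE) for v in values]


def from_fixed(values):
    """Convert aggregated integers back to floats on the receiving side."""
    return [v / SCALE for v in values]


class SlotAggregator:
    """Element-wise sum of one chunk per worker, held in an on-NIC scratchpad."""

    def __init__(self, num_workers: int, chunk_len: int):
        self.num_workers = num_workers
        self.acc = [0] * chunk_len
        self.seen = set()

    def add(self, worker_id: int, chunk):
        """Fold in one chunk; return the sum once every worker has reported."""
        if worker_id in self.seen:
            return None  # retransmitted chunk: ignore instead of double counting
        if len(chunk) != len(self.acc):
            raise ValueError("chunk length does not match slot size")
        self.seen.add(worker_id)
        self.acc = [a + c for a, c in zip(self.acc, chunk)]
        if len(self.seen) < self.num_workers:
            return None
        result = self.acc
        self.acc = [0] * len(result)  # recycle the slot for the next round
        self.seen = set()
        return result
```

With N workers feeding one slot, N inbound chunks leave as a single outbound chunk, which is the source of the traffic reduction; the achievable factor in practice depends on topology and chunking.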

6.2.3. Content Filtering

Bloom-filter discard enables early rejection of requests that would otherwise waste host cycles. KV-Direct computes a 64-bit Bloom hash on the NIC and can drop 60–70% of cache-miss probes without touching the host [59]. Beyond fixed filters, programmable ACL and DPI pipelines can enforce policy at line rate; for example, FlowBlaze’s stateful match–action design blocks unwanted flows at 40 Gb/s while adding less than 3 μs of latency [61].
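The early-rejection idea can be sketched with a small Bloom filter. This is a generic illustration, not KV-Direct's design: the bit-array size, the number of hash functions, and the use of BLAKE2b with double hashing are all assumptions chosen for a compact example.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from two 64-bit values (double hashing).
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def should_forward(bloom: BloomFilter, key: bytes) -> bool:
    """NIC-side decision: forward only probes that may hit; drop certain misses."""
    return bloom.might_contain(key)
```

A probe for which `should_forward` returns `False` is guaranteed to be a miss and can be answered or dropped on the NIC; a `True` result still has to be checked against the real store.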

6.2.4. Security and Integrity

At the security layer, SmartNICs can apply inline encryption and message authentication. For instance, P4-programmable AES-GCM datapaths can encrypt payloads before an RDMA verb is posted, supporting zero-trust and multi-tenant policies. Another security transform is redaction and tokenisation, where the FPGA scrubs PII fields or injects tokens prior to RDMA transfer, offloading compliance logic from the GPU.
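The tokenisation transform can be sketched as a keyed, deterministic replacement of sensitive fields. The field names, the `tok_` prefix, and the choice of HMAC-SHA256 truncated to 16 hexadecimal characters are assumptions for illustration only; a deployed design would follow its own compliance requirements and key management.

```python
import hashlib
import hmac

PII_FIELDS = frozenset({"email", "ssn"})  # hypothetical set of sensitive field names


def tokenise(record: dict, key: bytes, fields=PII_FIELDS) -> dict:
    """Return a copy of the record with sensitive fields replaced by keyed tokens."""
    out = {}
    for name, value in record.items():
        if name in fields:
            mac = hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()
            out[name] = "tok_" + mac[:16]  # same input and key give the same token
        else:
            out[name] = value
    return out
```

Because the token is deterministic for a given key, downstream joins and deduplication still work on the tokenised field, while the original value never leaves the sender.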

6.2.5. Application-Specific Transforms

Some transforms are tightly tied to specific application domains. In distributed training, gradient sparsification and quantization (e.g., top-k selection or 8-bit quantization) can reduce traffic by an order of magnitude without measurable accuracy loss in reported systems [62]. In telemetry and logging pipelines, delta coding for log streams has been demonstrated in a Corundum-derived datapath with on-NIC LZ4+delta compression, achieving an effective 170 Gb/s on a 100 Gb/s link while consuming approximately 20 W of additional board power [38].
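Top-k selection and 8-bit quantization can be sketched in a few lines. This is a generic illustration of the two operators, not the scheme used in [62]; the symmetric per-message scale and the function names are assumptions.

```python
def top_k_sparsify(grad, k: int):
    """Keep the k largest-magnitude entries; return their indices and values."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    idx = sorted(order[:k])
    return idx, [grad[i] for i in idx]


def quantize_int8(values):
    """Symmetric quantization to the range [-127, 127] with one shared scale."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127 if peak else 1.0
    return scale, [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(scale: float, quantized):
    """Receiver side: recover approximate values from the 8-bit codes."""
    return [scale * q for q in quantized]
```

A sender would transmit the index list, the 8-bit codes, and the single scale instead of the dense floating-point vector; whether accuracy is preserved depends on the model and on k, as the cited systems evaluate.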
Cross-Cutting Observation
Most tasks require only a narrow set of hardware primitives—streaming CRC/crypto pipes, on-chip scratchpad, and an AXI-Stream switch—suggesting that a well-chosen repertoire of fixed engines inside the shell can serve a broad spectrum of preprocessing workloads while relegating application-specific logic to partial-reconfiguration regions.

6.3. State of the Art: Published In-NIC Preprocessing Pipelines

Academic prototypes and a handful of vendor white papers already demonstrate that significant application logic can be executed inside the NIC at line rate, well before an RDMA verb reaches the wire. Table 15 consolidates representative projects, grouped by the primary function they offload. The common pattern is clear: a streaming parser feeds a small on-chip scratchpad (typically BRAM or HBM), after which a compute kernel performs the transform and forwards a size-reduced or aggregated buffer to the RDMA engine.

6.4. Hardware Building Blocks for In-NIC Preprocessing

Because every SmartNIC shell hides the latency-critical datapath inside its static region, only a limited set of reusable hardware primitives can afford to live there; everything else must fit into a user PR region without breaking timing at 100–400 Gb/s. Table 16 catalogs those primitives, the functional gap they close, and examples of shells that already integrate them.

Integration Guidelines

Most shells already embed the parser, DMA pump, and clock-island logic; adding a thin, parameterizable CRC/crypto pipe and reserving 1–2 MB of BRAM for scratchpads enables roughly 80% of the workloads listed in Section 6.2. Deep buffering or large reductions call for HBM devices, but these should be optional to keep cost tiers flexible.

7. Software Stack and Programming Layers

Hardware flexibility does not guarantee deployment success; any hardware needs to be enabled by software. In practice, SmartNIC viability is often decided less by raw datapath capability and more by whether the software stack can (i) expose stable abstractions to applications, (ii) integrate safely with the host memory/IOMMU model, and (iii) support operational workflows such as upgrades, debugging, and performance isolation. Recent systems experience argues that SmartNICs should be viewed as data-movement controllers whose value depends on end-to-end co-design across driver, runtime, and control/management software [27,63]. At the same time, programmability layers (e.g., P4 on multicore SmartNICs) introduce their own performance/portability trade-offs: the abstraction improves developer productivity but can hide microarchitectural bottlenecks, motivating automated tuning and profiling frameworks [64]. Finally, even when the dataplane is programmable, the control plane is a reliability-critical component: inconsistent or non-deterministic updates of match/action state can disrupt correctness, so update mechanisms and their timing behavior must be treated as part of the SmartNIC system design [65]. Table 17 presents an overview of the software stacks used in SmartNIC applications.
This section highlights the software layers involved in the interaction between a SmartNIC and its host.

7.1. Host-Side Kernel Drivers

A SmartNIC is a PCIe-enabled device, so a kernel driver is required to enumerate it, configure it, and expose a safe data and control interface to user space. A first responsibility is PCIe probing and BAR mapping: drivers such as mlx5 and Intel’s idpf map doorbells and completion queues into user space via ioctl/mmap, while honoring IOMMU policies and features such as ATS that are important for GPU P2P transfers [66,67]; Section 5 discusses the end-to-end offload paths and constraints. Another key design choice is interrupt versus polling operation. At line rates of 100–400 Gb/s, datapaths typically rely on busy-polling (e.g., NAPI polling or eBPF/XDP-style processing) to avoid interrupt storms; interrupts are then reserved for link events, exceptional conditions, or partial-reconfiguration completion notifications [68]. Finally, kernel support for virtualization (e.g., SR-IOV and vDPA/virtio-net) enables multiple containers or VMs to own dedicated queues or queue pairs while sharing the same shell datapath.
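The preference for polling follows from simple arithmetic on the per-packet time budget. The calculation below assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of preamble and inter-frame gap; it is a worst-case illustration rather than a measurement.

```python
WIRE_OVERHEAD_BYTES = 20  # 7 B preamble + 1 B start delimiter + 12 B inter-frame gap


def per_packet_budget_ns(line_rate_gbps: float, frame_bytes: int = 64) -> float:
    """Time between back-to-back frames: bits on the wire divided by Gb/s gives ns."""
    return (frame_bytes + WIRE_OVERHEAD_BYTES) * 8 / line_rate_gbps


def packets_per_second(line_rate_gbps: float, frame_bytes: int = 64) -> float:
    """Maximum frame rate at the given line rate and frame size."""
    return 1e9 / per_packet_budget_ns(line_rate_gbps, frame_bytes)
```

At 100 Gb/s this gives about 148.8 million packets per second, or 6.72 ns per packet, and at 400 Gb/s the budget falls to 1.68 ns. Servicing an interrupt per packet is not feasible on such a budget, which is why the datapath polls and interrupts are kept for rare events.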

7.2. User-Space Run-Times

Above the driver, user-space run-times provide the performance-critical interfaces that applications actually program against. DPDK and SPDK are widely used to provide zero-copy packet and NVMe buffer management, and many FPGA shells ship a DPDK poll-mode driver (PMD) layered on top of their kernel driver (e.g., dpdk_qdma) [69]. For RDMA workloads, libibverbs and RDMA-CM expose queue-pair abstractions for RoCE and iWARP; in this layer, systems such as RecoNIC extend libibverbs with shared-queue-pair mechanisms so that in-NIC accelerators can post work queue entries (WQEs) without trapping through the host. In vendor ecosystems, SDKs such as DOCA or the Pensando API wrap device registers and service endpoints in C/CUDA bindings and management APIs; for example, DOCA 2.5 includes GPUDirect RDMA helpers that map BlueField doorbells into CUDA streams directly [70].

7.3. On-Card Firmware and OS

A third layer runs on the card itself and determines how the SmartNIC is initialized, managed, and updated. In the DPU class, full-featured Linux distributions are common: BlueField-3 and Pensando DSC boot Ubuntu or Yocto on multi-core Arm clusters, where containerized agents can manage OVS-DPDK pipelines, DOCA plugins, and secure-boot attestation workflows [71]. In many FPGA shells, by contrast, lightweight RTOS or bare-metal control is typical; Corundum and OpenNIC frequently rely on a microcontroller-level monitor that initializes PHYs and supports management actions such as loading partial bitstreams over PCIe BAR 0. Finally, shells that embrace dynamic services introduce a dedicated partial-reconfiguration orchestrator. For example, Coyote v2’s on-card manager quiesces affected queues, loads a new PR region via ICAP in under 300 ms, and restores queue state in a way that is transparent to host software [40].
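The quiesce, load, and restore sequence of a partial-reconfiguration orchestrator can be sketched as follows. This is a hypothetical model, not Coyote v2's manager: the class, the queue representation, and the `load_bitstream` callback are all invented for illustration, and the real ICAP interaction is replaced by a plain function call.

```python
from typing import Callable, Dict, Iterable, List


class PROrchestrator:
    """Toy on-card manager that swaps a PR region without losing queue state."""

    def __init__(self, queue_ids: Iterable[int]):
        self.queues: Dict[int, dict] = {
            q: {"enabled": True, "head": 0, "tail": 0} for q in queue_ids
        }
        self.log: List[str] = []

    def _quiesce(self, affected: Iterable[int]) -> dict:
        # Stop the affected queues and snapshot their ring pointers.
        saved = {}
        for q in affected:
            self.queues[q]["enabled"] = False
            saved[q] = (self.queues[q]["head"], self.queues[q]["tail"])
        self.log.append("quiesce")
        return saved

    def _restore(self, saved: dict) -> None:
        # Put the ring pointers back and re-enable the queues.
        for q, (head, tail) in saved.items():
            self.queues[q].update(head=head, tail=tail, enabled=True)
        self.log.append("restore")

    def reconfigure(self, affected: Iterable[int], load_bitstream: Callable[[], None]) -> None:
        saved = self._quiesce(affected)
        try:
            load_bitstream()  # stands in for the partial bitstream load
            self.log.append("load")
        finally:
            self._restore(saved)  # queues come back even if the load fails
```

Only the queues attached to the region being swapped are paused, so unrelated traffic keeps flowing, and the `finally` clause reflects the requirement that a failed load must not leave queues disabled.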
Together, these three layers deliver the abstractions and life-cycle controls that turn a reconfigurable datapath into a production-grade SmartNIC.

8. Conclusions and Future Work

This article has examined the evolution of network interface technology from simple frame movers to highly programmable SmartNICs. This study demonstrates that the fundamental driver behind this evolution is the growing mismatch between general-purpose CPUs and modern accelerators such as GPUs. Reducing the cost of data movement, rather than adding yet more compute cores, now offers the largest performance dividend.

8.1. Key Findings

Three observations emerge consistently from the surveyed systems. First, shell design is decisive: a well-structured FPGA shell can guarantee deterministic 100 Gb/s-class I/O while still supporting partial reconfiguration on sub-second time scales. In particular, hierarchical shells (e.g., Coyote v2) indicate that both infrastructure services and user kernels can be updated without interrupting live traffic, provided that reconfiguration boundaries, queue quiescing, and state management are treated as first-class design constraints. Second, direct GPU↔FPGA paths matter: peer-to-peer DMA over PCIe Gen4/5—and, prospectively, coherent attachment via CXL 3.0—eliminates multiple memory copies and can remove several microseconds from the end-to-end critical path. Across prototypes, these CPU-bypassing designs can reclaim the equivalent of an entire CPU socket while improving accelerator-dominated workloads such as distributed training by up to 30%. Third, in-NIC preprocessing changes the economics of RDMA: operations such as gradient reduction, Bloom-filter pruning, and inline encryption can be executed at network speed inside the SmartNIC, shrinking east–west traffic by up to a factor of four and reducing the CPU work per gigabyte transferred.

8.2. Software Implications

Hardware flexibility is valuable only when matched by a robust software stack. Reliable kernel drivers, user-space libraries (DPDK, libibverbs, and DOCA), and secure on-card firmware are essential for turning laboratory prototypes into production tools.

8.3. Future Work

Future work will focus on four complementary directions. First, we target an 800 Gb/s reconfigurable shell built around a Versal-HBM device capable of sustaining 4 × 200 Gb/s Ethernet ports. In this design, the high-bandwidth on-chip network would interconnect the MACs, HBM stacks, and partial-reconfiguration regions, providing sufficient internal bandwidth (>3 TB/s) to prevent congestion even when all four links operate at line rate. Second, we aim to move toward unified tool-chains: a timing-aware build flow that can integrate RTL, C/C++ HLS, and P4 within a single project, with the objective of reducing “idea-to-bitstream” latency to under 24 h. Third, we plan to develop a runtime scheduler and associated host driver/API that can deploy and manage SmartNIC compute workloads, whether or not they are directly tied to the network datapath; in this approach, a scheduler running on the embedded processors would orchestrate execution with minimal host involvement beyond workload specification. Finally, we will extend the study with a network power-consumption overview. Since network interfaces can account for around 15% of a SmartNIC’s total power, we plan to investigate prediction mechanisms for adaptive link-speed reduction in order to optimize energy consumption.

8.4. Original Contributions

Beyond summarizing prior work, this article makes several concrete contributions. First, it proposes a unified taxonomy for the SmartNIC design space by separating the landscape into comparable device classes (traditional NICs, ASIC/DPU-class SmartNICs, and FPGA-based shells) and by discussing them through a common set of architectural axes (e.g., programmability, offload locus, and operational constraints). Second, this paper provides a normalized, side-by-side analysis of representative FPGA shells and research platforms, capturing not only their feature sets but also their reconfiguration model (static versus PR/DFX), integration style, and practical trade-offs, as consolidated in Tables 8–10 and Table 12. Third, it contributes a cross-layer synthesis of GPU–FPGA peer-to-peer offload paths by consolidating the required hardware/software stack layers and the key design constraints for CPU-bypassing GPU↔FPGA datapaths, an aspect that is often fragmented across vendor documentation and isolated prototypes (Table 13 and Table 14). Fourth, the article offers a consolidated view of in-NIC RDMA-oriented preprocessing by surveying the main preprocessing operators and mapping them to the architectural building blocks that enable line-rate execution, while highlighting where bottlenecks remain (Table 15 and Table 16). Finally, based on the comparative evidence gathered throughout the survey, this paper distills actionable guidance for next-generation reconfigurable SmartNIC shells, emphasizing requirements such as low-latency partial reconfiguration, clear isolation boundaries, bandwidth budgeting across NIC/PCIe/NoC/HBM, and a stable SDK/API surface that supports heterogeneous RDMA-centric workloads.

Author Contributions

Conceptualization, A.-A.U.; methodology, A.-A.U.; formal analysis, A.-A.U.; investigation, A.-A.U.; writing—original draft preparation, A.-A.U. and C.B.; writing—review and editing, A.-A.U. and C.B.; visualization, C.B.; supervision, C.B.; project administration, A.-A.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

This study was supported by the Romanian Ministry of Research, Innovation and Digitization through the ATLAS CERN-RO grants.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABI: Application Binary Interface
ACL: Access Control List
AES: Advanced Encryption Standard
API: Application Programming Interface
ARM: Advanced RISC Machine
ASIC: Application-Specific Integrated Circuit
ATS: Address Translation Services
AXI: Advanced eXtensible Interface
BAR: Base Address Register
BER: Bit Error Rate
BF: BlueField
BRAM: Block RAM
CDR: Clock and Data Recovery
CNI: Container Network Interface
CQ: Completion Queue
CQE: Completion Queue Entry
CPU: Central Processing Unit
CRC: Cyclic Redundancy Check
CXL: Compute Express Link
DDoS: Distributed Denial of Service
DFX: Dynamic Function eXchange
DMA: Direct Memory Access
DOCA: Data-Center-on-a-Chip Architecture (NVIDIA SDK)
DPDK: Data Plane Development Kit
DPI: Deep Packet Inspection
DPU: Data Processing Unit
DRAM: Dynamic Random-Access Memory
ECC: Elliptic-Curve Cryptography
FEC: Forward Error Correction
FIFO: First-In, First-Out
FPGA: Field-Programmable Gate Array
FW: Firmware
GB/s: Gigabytes per second
Gb/s: Gigabits per second
GPU: Graphics Processing Unit
GPUDirect: NVIDIA GPUDirect
GPT: Generative Pre-trained Transformer
GRH: Global Routing Header
GVAD: GPU Virtual-Address DMA
HBM: High-Bandwidth Memory
HCA: Host Channel Adapter
HLS: High-Level Synthesis
HPC: High-Performance Computing
HTTP: Hypertext Transfer Protocol
I/O: Input/Output
IB: InfiniBand
ICAP: Internal Configuration Access Port
IEEE: Institute of Electrical and Electronics Engineers
IOMMU: Input-Output Memory Management Unit
IP: Internet Protocol
IRQ: Interrupt Request
iWARP: Internet Wide Area RDMA Protocol
KV: Key–Value
KVM: Kernel-based Virtual Machine
L2: Layer 2
L3: Layer 3
LAN: Local Area Network
LUT: Look-Up Table
LZ4: Lempel–Ziv 4
MAC: Media Access Control
MIG: Multi-Instance GPU
ML: Machine Learning
MMIO: Memory-Mapped I/O
MLP: Multilayer Perceptron
MSI-X: Message Signaled Interrupts eXtended
MTU: Maximum Transmission Unit
NIC: Network Interface Card
NoC: Network-on-Chip
NCCL: NVIDIA Collective Communications Library
NVMe: Non-Volatile Memory Express
NVMe-oF: NVMe over Fabrics
OS: Operating System
OVS: Open vSwitch
P4: Programming Protocol-independent Packet Processors
P2P: Peer-to-Peer
PCIe: Peripheral Component Interconnect Express
PCS: Physical Coding Sublayer
PHY: Physical Layer
PII: Personally Identifiable Information
PMD: Poll-Mode Driver
PR: Partial Reconfiguration
PTP: Precision Time Protocol
QoS: Quality of Service
QP: Queue Pair
RDMA: Remote Direct Memory Access
RDMA-CM: RDMA Connection Manager
RM: Reconfigurable Module
RoCE: RDMA over Converged Ethernet
RoCEv2: Routable RDMA over Converged Ethernet v2
RPC: Remote Procedure Call
RSS: Receive-Side Scaling
RTOS: Real-Time Operating System
SDK: Software Development Kit
SDN: Software-Defined Networking
SHA: Secure Hash Algorithm
SIGCOMM: ACM Special Interest Group on Data Communication
SLAs: Service-Level Agreements
SmartNIC: Smart Network Interface Card
SoC: System-on-Chip
SPDK: Storage Performance Development Kit
SR-IOV: Single Root I/O Virtualization
SRAM: Static Random-Access Memory
TCP: Transmission Control Protocol
TLS: Transport Layer Security
TPU: Tensor Processing Unit
UDP: User Datagram Protocol
URAM: UltraRAM
VCU: Video Codec Unit
VF: Virtual Function
VLAN: Virtual LAN
VM: Virtual Machine
vDPA: vhost Data Path Acceleration
VXLAN: Virtual Extensible LAN
WQE: Work Queue Element
XDP: eXpress Data Path
XR: Extended Reality

References

  1. Hennessy, J.L.; Patterson, D.A. A New Golden Age for Computer Architecture. Commun. ACM 2019, 62, 48–60.
  2. Firestone, D.; Putnam, A.; Mundkur, S.; Chiou, D.; Dabagh, A.; Andrewartha, M.; Angepat, H.; Bhanu, V.; Caulfield, A.; Chung, E.; et al. Azure Accelerated Networking: SmartNICs in the Public Cloud. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, USA, 9–11 April 2018; pp. 51–66.
  3. Vieira, M.A.M.; Castanho, M.S.; Pacífico, R.D.G.; Santos, E.R.S.; Câmara Júnior, E.P.M.; Vieira, L.F.M. Fast Packet Processing with eBPF and XDP: Concepts, Code, Challenges, and Applications. ACM Comput. Surv. 2020, 53, 16.
  4. Luizelli, M.C.; Vogt, F.; Matos, G.M.V.; Cordeiro, W.; Schaeffer Filho, A.E.; Schwarz, M.; Verdi, F.L.; Rothenberg, C.E. SmartNICs: The Next Leap in Networking. In Minicursos do XLII Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC); Sociedade Brasileira de Computação: Porto Alegre, Brazil, 2024; pp. 40–89.
  5. Lin, W.; Shan, Y.; Kosta, R.; Krishnamurthy, A.; Zhang, Y. SuperNIC: An FPGA-Based, Cloud-Oriented SmartNIC. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’24), Monterey, CA, USA, 3–5 March 2024.
  6. Sukhwani, B.; Kapur, M.; Ohmacht, A.; Schour, L.; Ohmacht, M.; Ward, C.; Haymes, C.; Asaad, S. Janus: An Experimental Reconfigurable SmartNIC with P4 Programmability and SDN Isolation. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’23), Monterey, CA, USA, 12–14 February 2023.
  7. Kianpisheh, S.; Taleb, T. A Survey on In-Network Computing: Programmable Data Plane and Technology Specific Applications. IEEE Commun. Surv. Tutor. 2023, 25, 701–761.
  8. Hanford, N.; Ahuja, V.; Farrens, M.K.; Tierney, B.; Ghosal, D. A Survey of End-System Optimizations for High-Speed Networks. ACM Comput. Surv. 2018, 51, 54.
  9. Peter, S.; Li, J.; Zhang, I.; Ports, D.R.K.; Woos, D.; Krishnamurthy, A.; Anderson, T.; Roscoe, T. Arrakis: The Operating System Is the Control Plane. ACM Trans. Comput. Syst. 2015, 33, 11.
  10. Huang, M.; Li, T.; Yang, H.; Li, C.; Zhang, Y.; Sun, Z. Survey on Ethernet RDMA Network Interface Card. J. Comput. Res. Dev. 2025, 62, 1262–1289.
  11. Qiu, Y.; Xing, J.; Hsu, K.-F.; Kang, Q.; Liu, M.; Narayana, S.; Chen, A. Automated SmartNIC Offloading Insights for Network Functions. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), Virtual, 26–29 October 2021.
  12. Zheng, C.; Hong, X.; Ding, D.; Vargaftik, S.; Ben-Itzhak, Y.; Zilberman, N. In-Network Machine Learning Using Programmable Network Devices: A Survey. IEEE Commun. Surv. Tutor. 2024, 26, 1171–1200.
  13. Sada, M.F.; Graham, J.J.; Tatineni, M.; Mishin, D.; DeFanti, T.A.; Würthwein, F. Real-Time In-Network Machine Learning on P4-Programmable FPGA SmartNICs with Fixed-Point Arithmetic and Taylor. In Proceedings of the Practice and Experience in Advanced Research Computing (PEARC), Columbus, OH, USA, 20–24 July 2025.
  14. Intel Corporation. Intel® 82559 Fast Ethernet PCI Bus LAN Controller Datasheet. 1999. Available online: https://www.intel.cn/content/dam/doc/datasheet/82559er-fast-ethernet-pci-datasheet.pdf (accessed on 7 November 2025).
  15. Microsoft Corporation. Large Send Offload (LSO) in Microsoft Windows; Microsoft Corporation: Redmond, WA, USA, 2002.
  16. Linux Kernel Community. Generic Receive Offload (GRO)—Linux Kernel Documentation. 2009. Available online: https://docs.kernel.org/networking/segmentation-offloads.html (accessed on 7 November 2025).
  17. Microsoft Corporation. Scalable Networking: Eliminating the Receive Processing Bottleneck—Introducing RSS; Microsoft Corporation: Redmond, WA, USA, 2004.
  18. PCI-SIG. PCI Express Base Specification 3.0; Annex 6: Message Signalled Interrupts eXtended (MSI-X); PCI-SIG: Beaverton, OR, USA, 2010.
  19. PCI-SIG. Single Root I/O Virtualization and Sharing Specification 1.1; PCI-SIG: Beaverton, OR, USA, 2010.
  20. Intel Corporation. Intel Virtual Machine Device Queues (VMDq); Technical Brief 324419-001. 2008. Available online: https://www.intel.sg/content/dam/www/public/us/en/documents/white-papers/vmdq-technology-paper.pdf (accessed on 7 November 2025).
  21. Høiland-Jørgensen, T.; Brouer, J.D.; Borkmann, D.; Fastabend, J.; Ahern, D.; Graf, T. The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. In Proceedings of the SIGCOMM, Budapest, Hungary, 20–25 August 2018; pp. 5–21.
  22. Linux Foundation. eBPF—Extended Berkeley Packet Filter Project Overview; Linux Foundation: San Francisco, CA, USA, 2021.
  23. Intel Corporation. Intel Ethernet Controller E810 Series Product Brief; Intel Corporation: Santa Clara, CA, USA, 2020.
  24. NVIDIA Networking. NVIDIA ConnectX-6 Dx Adapter Card Data Sheet; NVIDIA Networking: Sunnyvale, CA, USA, 2022.
  25. Wu, M.; Chen, Q.; Wang, J. Toward low CPU usage and efficient DPDK communication in a cluster. J. Supercomput. 2022, 78, 1852–1884.
  26. Parola, F.; Procopio, R.; Risso, F. Assessing the Performance of XDP and AF_XDP Based Network Functions in Edge Data Center Scenarios. In Proceedings of the ACM Conference on emerging Networking EXperiments and Technologies (CoNEXT), Virtual, 6–10 December 2021.
  27. Sherry, J. The I/O Driven Server: From SmartNICs to Data Movement Controllers. SIGCOMM Comput. Commun. Rev. 2024, 53, 9–17.
  28. Kashyap, A.; Li, Y.; Ng, D.; Lu, X. Understanding the Idiosyncrasies of Emerging BlueField DPUs. In Proceedings of the International Conference on Supercomputing (ICS), Salt Lake City, UT, USA, 8–11 June 2025.
  29. NVIDIA Networking (Mellanox Technologies). NVIDIA BlueField-3 DPU Product Brief. 2024. Available online: https://resources.nvidia.com/en-us-accelerated-networking-resource-library/datasheet-nvidia-bluefield?lx=LbHvpR&topic=networking-cloud/ (accessed on 10 January 2026).
  30. Marvell Technology Group Ltd. Marvell OCTEON® 10 CN106xx Family: Infrastructure Processor; Marvell Technology Group Ltd.: Santa Clara, CA, USA, 2024.
  31. AMD Pensando Systems. AMD Pensando DSC-2 “Elba” Data Processing Unit; AMD Pensando Systems: Milpitas, CA, USA, 2023.
  32. Broadcom Inc. Broadcom Stingray® BCM5883 Series SmartNIC SoC—Product Data Sheet; Broadcom Inc.: Palo Alto, CA, USA, 2022.
  33. Intel Corporation. Intel Infrastructure Processing Unit E2100 (“Mount Evans”) Architecture Overview; Intel Corporation: Santa Clara, CA, USA, 2023.
  34. Kubálek, J.; Cabal, J.; Špinler, M.; Iša, R. DMA Medusa: A Vendor-Independent FPGA-Based Architecture for 400 Gbps DMA Transfers. In Proceedings of the IEEE FCCM 2021, Orlando, FL, USA, 9–12 May 2021.
  35. Cabal, J.; Sikora, J.; Friedl, Š; Špinler, M.; Kořenek, J. FPL Demo: 400G FPGA Packet Capture Based on Network Development Kit. In Proceedings of the FPL 2022, Belfast, UK, 29–31 August 2022.
  36. Li, J.; Sun, Z.; Yan, J.; Yang, X.; Jiang, Y.; Quan, W. DrawerPipe: A Reconfigurable Pipeline for Network Processing on FPGA-Based SmartNIC. Electronics 2019, 9, 59.
  37. He, Z.; Korolija, D.; Alonso, G. EasyNet: 100 Gbps Network for HLS. In Proceedings of the 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 197–203.
  38. Forencich, A.; Snoeren, A.C.; Porter, G.; Papen, G. Corundum: An Open-Source 100-Gbps NIC. In Proceedings of the IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, 3–6 May 2020; pp. 38–46.
  39. Li, B.; Tan, K.; Luo, L.; Peng, Y.; Luo, R.; Xu, N.; Xiong, Y.; Cheng, P.; Chen, E. ClickNP: Highly Flexible and High Performance Network Processing with Reconfigurable Hardware. In Proceedings of the ACM SIGCOMM Conference, Florianópolis, Brazil, 22–26 August 2016.
  40. Ramhorst, B.; Korolija, D.; Heer, M.; Dann, J.; Liu, L.; Alonso, G. Coyote v2: Raising the Level of Abstraction for Data Center FPGAs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, Seoul, Republic of Korea, 13–16 October 2025; pp. 639–654.
  41. Grant, S.; Yelam, A.; Bland, M.; Snoeren, A.C. SmartNIC Performance Isolation with FairNIC: Programmable Networking for the Cloud. In Proceedings of the ACM SIGCOMM 2020, Virtual, 10–14 August 2020.
  42. Zhou, Y.; Wilkening, M.; Mickens, J.; Yu, M. SmartNIC Security Isolation in the Cloud with S-NIC. In Proceedings of the EuroSys 2024, Athens, Greece, 22–25 April 2024.
  43. AMD Inc. OpenNIC Shell for Xilinx Alveo. 2022. Available online: https://github.com/Xilinx/open-nic-shell (accessed on 7 November 2025).
  44. Yan, Y.; Beldachi, A.F.; Nejabati, R.; Simeonidou, D. P4-enabled Smart NIC: Enabling Sliceable and Service-Driven Optical Data Centres. J. Light. Technol. 2020, 38, 2688–2694.
  45. Zhong, G.; Kolekar, A.; Amornpaisannon, B.; Choi, I.; Javaid, H.; Baldi, M. A Primer on RecoNIC: RDMA-enabled Compute Offloading on SmartNIC. arXiv 2023.
  46. Pan, L.; Guo, S.; Zhang, M. Design of a Fast and Scalable FPGA-Based Bitmap for RDMA Networks. Electronics 2024, 13, 4900.
  47. Azorin, R.; Monterubbiano, A.; Castellano, G.; Gallo, M.; Pontarelli, S.; Rossi, D. Taming the Elephants: Affordable Flow Length Prediction in the Data Plane. Proc. ACM Netw. 2024, 2, 5.
  48. Liess, M.; Biersack, F.; Nolte, L.; Wild, T.; Herkersdorf, A. ecoNIC: SmartNIC-assisted Power Management for Networking Workloads in Linux Servers. Microprocess. Microsyst. 2025, 119, 105209.
  49. Wang, Z.; Huang, H.; Zhang, J.; Wu, F.; Alonso, G. FpgaNIC: An FPGA-Based Versatile 100Gb SmartNIC for GPUs. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22), Carlsbad, CA, USA, 11–13 July 2022; pp. 967–986.
  50. Ahmed, M.K.; Kealoha, M.P.; Mbongue, J.M.; Saha, S.K.; Tchinda, E.N.; Mbua, P.E.; Bobda, C. Multi-Tenant Cloud FPGA: A Survey on Security, Trust and Privacy. ACM Trans. Reconfig. Technol. Syst. 2025, 18, 23.
  51. IEEE Std 1735-2014; IEEE Recommended Practice for Encryption and Management of Electronic Design Intellectual Property (IP). IEEE: Piscataway, NJ, USA, 2015; pp. 1–90.
  52. NVIDIA Corporation. NVIDIA GPUDirect RDMA Application Note v4.0. 2024. Available online: https://docs.nvidia.com/cuda/gpudirect-rdma/ (accessed on 7 November 2025).
  53. Advanced Micro Devices Inc. AMD/Xilinx QDMA Subsystem—Product Guide. 2023. Available online: https://docs.amd.com/r/en-US/pg302-qdma (accessed on 7 November 2025).
  54. PCI-SIG. PCIe 6.0 Electrical Compliance Update. In Proceedings of the PCI-SIG Developers Conference, Mohiuddin Mazumber, India, 10 November 2025.
  55. CXL Consortium. Compute Express Link (CXL) Specification, Version 3.0. 2024. Available online: https://computeexpresslink.org/wp-content/uploads/2024/02/CXL-3.0-Specification.pdf (accessed on 7 November 2025).
  56. Intel Corporation. Intel Agilex 2 FPGA Product Brief. 2025. Available online: https://www.altera.com/products/fpga/agilex/7 (accessed on 7 November 2025).
  57. AMD/Xilinx. Versal Premium ACM; AMD/Xilinx: San Jose, CA, USA, 2025.
  58. Meta Platforms Inc. RoCE Networks for Distributed AI Training at Scale. 2024. Available online: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/ (accessed on 7 November 2025).
  59. Li, B.; Ruan, Z.; Xiao, W.; Lu, Y.; Xiong, Y.; Putnam, A.; Chen, E.; Zhang, L. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ’17), Shanghai, China, 28–31 October 2017; pp. 137–152.
  60. NVIDIA Networking. NVIDIA ConnectX-6 Dx Programmer’s Reference Manual; NVIDIA Networking: Sunnyvale, CA, USA, 2023.
  61. Pontarelli, S.; Bifulco, R.; Bonola, M.; Cascone, C.; Spaziani, M.; Bruschi, V.; Sanvito, D.; Siracusano, G.; Capone, A.; Honda, M.; et al. FlowBlaze: Stateful Packet Processing in Hardware. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), Boston, MA, USA, 26–28 February 2019; pp. 531–548.
  62. Sapio, A.; Canini, M.; Ho, C.-Y.; Nelson, J.; Kalnis, P.; Kim, C.; Krishnamurthy, A.; Moshref, M.; Ports, D.; Richtarik, P. Scaling Distributed Machine Learning with In-Network Aggregation. In Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), Virtual, 12–14 April 2021; pp. 785–808.
  63. Nickel, M.; Göhringer, D. A Survey on Architectures, Hardware Acceleration and Challenges for In-Network Computing. ACM Trans. Reconfig. Technol. Syst. 2024, 18, 10.
  64. Xing, J.; Qiu, Y.; Hsu, K.-F.; Kang, Q.; Liu, M.; Narayana, S.; Chen, A. Unleashing SmartNIC Packet Processing Performance in P4. In Proceedings of the ACM SIGCOMM Conference, New York, NY, USA, 10–14 September 2023.
  65. Stubbe, H.; Gallenmüller, S.; Simon, M.; Hauser, E.; Scholz, D.; Carle, G. Exploring Data Plane Updates on P4 Switches with P4Runtime. Comput. Commun. 2024, 225, 44–53.
  66. Linux Kernel Community. mlx5: Mellanox/NVIDIA Ethernet Driver Documentation; Linux Kernel Community: Palo Alto, CA, USA, 2024.
  67. Intel Corporation. IDPF: Intel Distributed Device Function Driver; Intel Corporation: Santa Clara, CA, USA, 2023.
  68. Netdev Community. XDP: High-Performance Packet Processing in Linux. 2023. Available online: https://xdp-project.net (accessed on 7 November 2025).
  69. DPDK Project. Data Plane Development Kit (DPDK) 23.11. 2023. Available online: https://www.dpdk.org (accessed on 7 November 2025).
  70. NVIDIA Corporation. DOCA Programming Guide (DOCA SDK v2.5.5). NVIDIA Docs. 2025. Available online: https://docs.nvidia.com/doca/sdk/DOCA-Programming-Guide/index.html (accessed on 21 January 2026).
  71. AMD Pensando. Pensando DSC-2 System Software Overview; AMD Pensando: Milpitas, CA, USA, 2024.
Figure 1. Conceptual GPU → FPGA SmartNIC → network pipeline. The figure highlights the minimum layers required for CPU-bypass: a peer-to-peer interconnect, an addressing/permission model, symmetric DMA capabilities, and a transport endpoint.
Figure 2. RDMA WRITE fast-path with GPU → FPGA direct offload. The SmartNIC reads tensors directly from GPU HBM, applies in-NIC preprocessing, segments the payload, and transmits it on the wire; WQE doorbells and CQ completions bypass the host CPU entirely. The underlying P2P/DMA mechanics are detailed in Section 5.
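The "segments the payload" step in the Figure 2 caption can be illustrated with a small sketch. It assumes the usual reliable-connection convention in which an RDMA WRITE larger than the path MTU is carried as a FIRST/MIDDLE/LAST packet sequence (or a single ONLY packet) with 24-bit packet sequence numbers. The names `Segment` and `segment_rdma_write` are hypothetical and do not come from any shell surveyed here; a hardware segmenter would also build headers and checksums, which are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    opcode: str     # WRITE_FIRST / WRITE_MIDDLE / WRITE_LAST / WRITE_ONLY
    psn: int        # packet sequence number, wraps at 24 bits
    payload: bytes


def segment_rdma_write(payload: bytes, pmtu: int = 4096,
                       start_psn: int = 0) -> List[Segment]:
    """Split one RDMA WRITE payload into PMTU-sized segments."""
    if pmtu <= 0:
        raise ValueError("pmtu must be positive")
    # A zero-length write still produces a single (empty) ONLY packet.
    chunks = [payload[i:i + pmtu] for i in range(0, len(payload), pmtu)] or [b""]
    last = len(chunks) - 1
    segments = []
    for i, chunk in enumerate(chunks):
        if last == 0:
            opcode = "WRITE_ONLY"
        elif i == 0:
            opcode = "WRITE_FIRST"
        elif i == last:
            opcode = "WRITE_LAST"
        else:
            opcode = "WRITE_MIDDLE"
        segments.append(Segment(opcode, (start_psn + i) & 0xFFFFFF, chunk))
    return segments
```

For example, a 10,000-byte tensor slice with a 4096-byte path MTU yields three segments of 4096, 4096, and 1808 bytes.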
Table 1. Key NIC offload milestones and their advantages to the host CPU.
| Era | Feature | Benefit for the Host |
| --- | --- | --- |
| 1 Gbps (1999) [14] | TCP/UDP checksum offload | Removes per-packet CRC/CSUM arithmetic from the CPU datapath. |
| 10 Gbps (2002) [15,16] | LSO/TSO, LRO/GRO | Coalesces many small segments into large ones and aggregates receives, cutting interrupts and context switches. |
| 40 Gbps (2010) [17,18] | RSS + MSI-X | Distributes flows across multiple cores via hash-based queue steering, reducing per-core packet-handling load. |
| 25/100 Gbps (2015) [19,20] | SR-IOV, VMDq | Provides hardware-level queue isolation for dozens of virtual machines or containers, eliminating hypervisor copying. |
| 100–400 Gbps (2020) [21,22] | XDP, eBPF hooks | Allows early packet drop, redirect, or forwarding decisions inside the driver hot-path, cutting kernel traversal latency. |
| line rate (2023+) [23,24] | GENEVE/VXLAN tunnel parsing | Offloads overlay encapsulation/decapsulation so overlay networks run at line rate without host CPU involvement. |
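The hash-based queue steering in the RSS row of Table 1 can be sketched in a few lines. The example below assumes a Toeplitz hash over the IPv4/TCP 4-tuple, which is the scheme commonly associated with RSS. The key is an arbitrary placeholder, and the final modulo stands in for the indirection table a real NIC would consult, so this is an illustration rather than any vendor's implementation.

```python
import socket
import struct

# Arbitrary 40-byte key chosen for illustration; real drivers program their own.
RSS_KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of `data` under `key`."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key must be at least 32 bits longer than the input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # XOR in the 32-bit key window starting at this bit position.
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              num_queues: int, key: bytes = RSS_KEY) -> int:
    """Map a flow to a receive queue (modulo replaces the indirection table)."""
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    return toeplitz_hash(key, data) % num_queues
```

Because the hash depends only on the flow tuple, every packet of a flow lands on the same queue and therefore the same core, which preserves per-flow ordering while spreading distinct flows across cores.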
Table 2. Representative traditional NIC device families and their distinguishing capabilities.
| Vendor/Family | Launch | Max Speed (Gbps) | Notable Traits |
| --- | --- | --- | --- |
| Intel 82599 | 2009 | 2 × 10 | First commodity NIC with 128 RSS queues; widely adopted in early cloud servers. |
| Mellanox ConnectX-3 | 2011 | 40 | Early RoCEv1 RDMA support in firmware, while completion-queue polling remained CPU-driven. |
| Intel XL710 | 2015 | 2 × 40 / 4 × 10 | SR-IOV for up to 256 virtual functions and VMDq steering for virtual machines. |
| Broadcom NetXtreme-E | 2017 | 25/50 | Integrated IEEE 1588 PTP timestamping and advanced flow-steering filters. |
| Intel E810 | 2020 | 100 | Hardware offloads for VXLAN/GENEVE tunnels plus a programmable flow director. |
| Marvell Alaska C400 | 2022 | 400 | PAM4 SERDES with PCIe 5 × 16; still relies on host CPU for TCP segmentation offload. |
Table 3. Historical progression of major DPU/SmartNIC ASIC families and their headline innovations.
| Year | Milestone | First Commercial Silicon | Headline Innovation |
| --- | --- | --- | --- |
| 2017 | BlueField-1 | Mellanox BF1600 | 8 × Arm A72, inline RoCE v2 and NVMe-oF at 2 × 25 Gb/s. |
| 2019 | Pensando DSC-1 “Capri” | Pensando DSC-25 | First fully P4-programmable SmartNIC ASIC (144 micro-PUs). |
| 2020 | BlueField-2 | NVIDIA BF2500/BF2800 | TLS/IPsec offload, 2 × 100 Gb/s, debut of DOCA SDK. |
| 2021 | Intel “Mount Evans” IPU | Intel E2000 | 16 × Arm N1, P4 pipeline plus NVMe-oF/TLS, 200 Gb/s links. |
| 2022 | Broadcom Stingray v2 | BCM5883 | 8 × Arm A72 @ 3 GHz, 100 Gb/s, programmable eBPF hooks. |
| 2023 | AMD Pensando DSC-2 “Elba” | DSC2-100/200 | 16 × Arm A72 + 144 P4 MPUs, dual-100 Gb/s ports. |
| 2024 | Marvell OCTEON 10 | CN106xx | 24 × Arm N2, 16 × 50 Gb lanes (≤800 Gb/s), on-die INT8 ML tiles. |
| 2025 | BlueField-3/SuperNIC | BF3600 | 16 × Arm A78AE, PCIe 5 × 16, dual 400 Gb/s ports with GPU P2P. |
Table 4. Detailed breakdown of a DPU baseline micro-architecture.
| # | Block | Block Description | Usage |
| --- | --- | --- | --- |
| 1 | PHY/PCS/SERDES | 25–112 G PAM4 lanes, Reed–Solomon FEC, CDR, auto-negotiation. | Turns optical/electrical signals into clean bit-streams with BER ≤ 10^−12, setting the maximum line-rate budget under strict power limits. |
| 2 | Multi-rate MAC | CRC add/check, IEEE 1588 timestamping, VLAN/MACsec tagger, 512-b datapath. | Delivers cut-through frames to the fabric in ≤60 ns; enforces pause-frame flow control critical for RDMA losslessness. |
| 3 | Packet parser and match–action | 8–16 fixed pipeline stages built from SRAM + small TCAMs; P4 compiler or micro-uCode programs tables. | Splits L2–L4 headers, applies ACLs/QoS, and tags metadata in  100 ns at 400 Gb/s, guaranteeing deterministic latency. |
| 4 | Transport/storage engines | TCP segmenter/reassembler, RTT/ECN logic, RoCEv2 verbs, NVMe-oF PRP walker. | Eliminates tens of host CPU cores by offloading L4 and storage protocols with sub-µs tail latency. |
| 5 | Crypto/compression complex | AES-GCM, ChaCha-Poly, SHA-x, RSA/ECC, RegEx, Zstd/LZ77 pipes, key-vault SRAM. | Inline TLS/IPsec and data-at-rest crypto at ≤0.2 nJ/byte, which is impossible in software without huge CPU cost. |
| 6 | Embedded CPU cluster | 4–24 × Arm A-class, 2–8 MB L3, TrustZone, secure boot ROM, DDR/LPDDR channel. | Runs Linux control-plane agents (OVS-DPDK, DOCA, and gRPC) while datapath stays in hard logic. |
| 7 | On-chip memory | 8–32 MB SRAM + 4–8 GB HBM2e/LPDDR5 on a high-b/w NoC (≥3 TB/s). | Single-cycle SRAM for hot flow state; deep HBM buffers absorb 400 Gb/s bursts without host DRAM thrash. |
| 8 | PCIe 5/CXL RC and DMA | 16–32 lanes, ATS/SVM, peer-to-peer, optional CXL.io/CXL.mem. | Provides > 64 GB/s bidirectional, zero-copy access to host DRAM, GPUs, or NVMe drives without CPU involvement. |
| 9 | NoC, clocks, power islands | Mesh/ring at 1–1.5 GHz, QoS arbiters, independent clock/power domains. | Isolates jitter-sensitive SERDES, enables DVFS so idle NICs drop below 10 W. |
Table 5. Cross-cutting reference architecture blocks and how they typically manifest in NICs, DPU-class SmartNICs, and FPGA shell SmartNICs.
| Block/Function | Traditional NIC | DPU-Class SmartNIC | FPGA Shell SmartNIC |
| --- | --- | --- | --- |
| Ingress/egress MAC + PCS | Fixed pipeline; tuned for throughput/latency | Same as NIC; often multi-port with SoC integration | Same as NIC; integrated into shell; may be shared across PR regions |
| Parser + classification | Limited, often fixed or narrowly programmable | Fixed engines + programmable hooks; often vendor SDK abstraction | Programmable (RTL/HLS/P4); can be specialized per workload |
| Match/action + policy | Constrained offloads (filters, steering, encapsulation) | Mix of fixed-function + embedded CPU assist | Programmable datapath; supports custom actions and preprocessing |
| Queueing + scheduling | Multi-queue (RSS), traffic classes, basic shaping | More advanced scheduling/telemetry; supports multi-tenant isolation | Custom schedulers possible; isolation must be explicitly engineered |
| DMA/RDMA engines | DMA + optional RDMA (RoCE/IB) offloads | DMA/RDMA plus virtualization/IOMMU integration; strong host isolation | DMA/RDMA endpoints may be integrated or custom; P2P paths possible |
| On-card compute | Minimal (checksums, crypto offload) | Embedded CPU cores + accelerators (crypto, regex, compression) | Programmable compute in datapath; accelerators as PR modules |
| On-card memory hierarchy | Small SRAM/cache; sometimes DRAM on higher-end cards | DRAM + caches; sometimes HBM on high-end parts | HBM/DDR + BRAM/URAM; NoC/banking becomes first-order design axis |
| Control plane | Host driver + firmware | Embedded OS/firmware; rich management APIs | Host + embedded CPU; runtime for PR/service lifecycle is common |
| Update model | Driver/firmware updates; static datapath | Firmware/app updates; datapath mostly fixed | Bitstream/partial bitstream updates; strong dependence on floorplan/timing |
Table 6. Key on-card offload accelerators integrated in modern DPU ASICs.
| Category | Fixed Engines (Examples) | Benefit for the System |
| --- | --- | --- |
| Transport | TCP segmentation/aggregation logic, RTT and ECN hardware, RoCE v2 reliable-connection verbs. | Eliminates L4 bookkeeping from host CPUs; delivers sub-µs tail latency for distributed databases and AI parameter servers. |
| Storage | NVMe-oF initiator/target pipeline, VirtIO-blk datapath, SNAP indirection logic. | Provides line-rate remote block storage and bare-metal provisioning without host-side copy or interrupt overhead. |
| Security | AES-GCM/ChaCha-Poly engines, SHA-2/3 hashers, RSA/ECC big-num PKA, inline TLS 1.3/IPsec, RegEx DPI core. | Enables encryption, authentication, and deep-packet inspection at wire speed with <0.2 nJ/byte energy cost. |
| AI/ML | INT8 matrix-multiply tiles, vector accumulation units, pattern-matching accelerators. | Supports in-NIC inference or metadata enrichment (e.g., threat classification) without involving host GPUs/CPUs. |
| Virtualization | SR-IOV/vDPA queue slicers, virtio-net front-end, nested S-MMU address translation. | Delivers per-tenant isolation and hardware vSwitch functions, allowing hundreds of VMs or containers to share a single DPU at line rate. |
Table 7. Representative DPU/SmartNIC ASIC families shipping or sampling in 2023–2025.
| Vendor/Silicon | Ports × Gb/s | Arm Cores | Fixed Offloads in Silicon | PCIe Gen |
| --- | --- | --- | --- | --- |
| NVIDIA BlueField-3 (SuperNIC) [29] | 2 × 400 | 16 × A78AE | RoCE v2, NVMe-oF, TLS 1.3, RegEx DPI, peer-to-peer GPU DMA | 5 × 16 |
| Marvell OCTEON 10 CN106 [30] | 16 × 50 (≤800) | 24 × N2 | TCP/UDP offload, IPsec, VPP fast path, inline INT8 ML tiles | 5 × 32 |
| AMD Pensando DSC-2 “Elba” [31] | 2 × 100 | 16 × A72 | 144 P4 MPUs, IPsec, NVMe-TCP, inline telemetry counters | 4 × 16 |
| Broadcom Stingray BCM5883 [32] | 1 × 100 | 8 × A72 | TLS offload, NVMe-oF target, eBPF hooks, on-chip vSwitch | 4 × 8 |
| Intel Mount Evans E2100 IPU [33] | 2 × 200 | 16 × N1 | P4-programmable pipeline, virtio-net front end, QuickAssist crypto | 5 × 16 |
Table 8. Representative FPGA SmartNIC shells and closely related datapath frameworks (not all are fully open-source “shells” in the strict vendor sense, but each embodies a reusable substrate for on-card networking/offloads).
| Shell/Framework | Year | Open Source | DFX/PR | Key Features and Emphasis |
| --- | --- | --- | --- | --- |
| Coyote v2 [40] | 2025 | Yes | Yes | Hierarchical shell with multi-tenant PR regions; focuses on safe composition and rapid service rollouts on data center FPGA cards. |
| OpenNIC [43] | 2022 | Yes | No | Vendor-maintained shell integrating DMA and host software hooks; emphasizes a stable ABI and practical board bring-up. |
| RecoNIC [45,46] | 2023–2024 | Yes | Yes | RDMA-centric compute offload; emphasizes zero-copy remote memory access paths and on-NIC acceleration coupling. |
| ClickNP [39] | 2016 | No | No | Click-style “elements” compiled to FPGA modules; emphasizes fast prototyping and datapath modularity. |
| Corundum [38] | 2020 | Yes | No | High-performance open RTL NIC core (multi-queue, scalable); emphasizes clean NIC architecture and verifiability. |
| DrawerPipe [36] | 2019 | Partial | N/A | Reconfigurable pipeline abstraction for SmartNIC packet processing; emphasizes interchangeable stages and pipeline modularity. |
| P4-enabled SmartNIC [44] | 2020 | No | N/A | P4-programmable packet processing integrated into an FPGA NIC; emphasizes slicing/service-driven dataplane changes. |
| SuperNIC [5] | 2024 | No | Yes | Cloud-oriented FPGA SmartNIC; emphasizes composability, isolation, and platform-level integration for hyperscale use. |
| Janus [6] | 2023 | No | Yes | Experimental reconfigurable SmartNIC with P4 programmability and SDN isolation; emphasizes resource isolation and policy control. |
Table 9. Architectural patterns in open-source FPGA SmartNIC shells and their trade-offs.
| Pattern | Adopters | Pros | Cons |
| --- | --- | --- | --- |
| Single static shell, no partial reconfiguration | OpenNIC [43], Corundum [38] | Simplest build flow; deterministic timing; stable driver ABI; easier verification. | Requires full synthesis/P&R for functional changes; limited agility for evolving offloads. |
| Static shell + one PR slot | ClickNP [39] | Moderate build-time compared to full shell redesign; user can swap one accelerator at run-time. | Only a single PR kernel active at once; infrastructure blocks typically still require downtime to update. |
| Hierarchical shell with multi-tenant PR | Coyote v2 [40], RecoNIC [45], SuperNIC [5], Janus [6] | Supports hot-swapping services and user kernels; enables composability and (in cloud-oriented designs) stronger isolation and controlled rollout. | Higher static-area overhead; more complex clock/region planning; requires orchestration (drain/validate/rollback) for production use. |
| Pipeline modularity with uniform stage interfaces | DrawerPipe [36], EasyNet [37] | Makes pipeline boundaries explicit; encourages parallel pipelining/replication for line rate; improves reuse and productivity for datapath kernels. | Careful buffering/backpressure needed to avoid head-of-line blocking; verification and timing closure become harder as stages evolve independently. |
| P4-programmable SmartNIC dataplane | P4-enabled SmartNIC [44], Janus [6] | Higher-level programmability for parsing + match/action; easier evolution of dataplane logic within a structured model; supports slicing concepts. | Programmability constrained by P4 model/resource limits; mapping to FPGA still has non-trivial compilation/toolchain cost. |
| GPU peer-to-peer focus | FpgaNIC [49], Coyote v2 (optional) [40] | Zero-copy NIC/FPGA → GPU transfers, ideal for in-network ML, preprocessing, or video analytics. | Requires PCIe P2P/IOMMU/ACS support and careful buffer registration/BAR sizing; portability limited to platforms exposing P2P. |
| RDMA/verbs-centric compute offload substrate | RecoNIC [45] | Strong fit for RDMA-heavy workloads; aligns accelerator pipelines with verbs semantics; enables low-latency remote-memory interaction. | Tight coupling to RDMA semantics and platform details; multi-tenant security/tooling must be engineered carefully. |
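The drain/validate/rollback orchestration that Table 9 lists as a cost of hierarchical multi-tenant PR can be sketched as a small host-side control flow. Everything here is a hypothetical model: `PrSlot`, `hot_swap`, and the `validate` callback are illustrative names, and the bodies only record state changes where a real runtime would talk to the shell's PR controller.

```python
from typing import Callable, List


class PrSlot:
    """Hypothetical software model of one partial-reconfiguration region."""

    def __init__(self, bitstream: str) -> None:
        self.active = bitstream
        self.accepting = True
        self.log: List[str] = []

    def drain(self) -> None:
        # Stop admitting new work; a real runtime would also wait for
        # outstanding requests to complete before continuing.
        self.accepting = False
        self.log.append("drain")

    def load(self, bitstream: str) -> None:
        self.active = bitstream
        self.log.append(f"load:{bitstream}")

    def resume(self) -> None:
        self.accepting = True
        self.log.append("resume")


def hot_swap(slot: PrSlot, new_bitstream: str,
             validate: Callable[[PrSlot], bool]) -> bool:
    """Drain, load, validate, and roll back to the old bitstream on failure."""
    previous = slot.active
    slot.drain()
    slot.load(new_bitstream)
    ok = validate(slot)
    if not ok:
        slot.load(previous)
        slot.log.append("rollback")
    slot.resume()
    return ok
```

The point of the structure is that traffic is only re-admitted after either the new kernel has passed validation or the previous one has been restored, so a failed update never leaves the slot serving requests in an unknown state.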
Table 10. Capability matrix for leading open-source FPGA SmartNIC shells (2025 snapshot). * Gen4 support is work-in-progress.
| Shell | Ports | Rate | PCIe | PR | Built-in Protocol Engines | Primary API/Tool-Flow |
| --- | --- | --- | --- | --- | --- | --- |
| S1 Coyote v2 [40] | 2 | 100 G | Gen4 ×16 | Yes | RoCE v2, checksum, GPU-DMA bridge | POSIX-style C API; Vitis HLS/P4 |
| S2 OpenNIC [43] | 2 | 100 G | Gen3/4 ×16 | No | RSS, SR-IOV, checksums | Xilinx QDMA + DPDK PMD |
| S3 RecoNIC [45] | 2 | 100 G | Gen4 ×16 | Yes | Shared RDMA verbs; AXI-Stream bridge | Vitis HLS, P4 front-end, C lib |
| S4 ClickNP [39] | 2/4 | 40 G | Gen3 ×8 | Hot-swap | None (TCP in host) | ClickNP HLS compiler (C-like elements) |
| S5 Corundum [38] | 1–4 | 100 G | Gen3 * | No | PTP, RSS, SG-DMA | Pure Verilog; |
| S6 FpgaNIC [49] | 1 | 100 G | Gen4 ×16 | Slots | GPU P2P BAR, basic L3/L4 | HLS templates + CUDA DMA hooks |
Table 11. Offline build time versus on-card partial-reconfiguration latency (author-reported best-case values).
| Shell | Full Build Time | PR Bit-Stream Size | On-Card PR Latency |
| --- | --- | --- | --- |
| Coyote v2 | ∼2.5 h (U280) | 12 MB | 290 ms |
| RecoNIC | ∼3 h (U280) | 10 MB | 350 ms |
| FpgaNIC | ∼2 h (VCU118) | 6 MB | 200 ms |
| OpenNIC (static) | ∼4 h (U50) | N/A | reboot required |
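One way to compare the rows of Table 11 is to divide bitstream size by reconfiguration latency, which gives an effective end-to-end load rate. The sketch below does only that arithmetic on the table's values; the function name is hypothetical. The latency presumably includes quiescing and decoupling steps as well as the raw transfer, so the result should be read as a lower bound on configuration-port throughput, not a measurement of it.

```python
# (PR bitstream size in MB, on-card PR latency in ms), copied from Table 11.
PR_DATA = {
    "Coyote v2": (12, 290),
    "RecoNIC": (10, 350),
    "FpgaNIC": (6, 200),
}


def effective_pr_rate_mb_s(size_mb: float, latency_ms: float) -> float:
    """Effective reconfiguration rate in MB/s (size divided by latency)."""
    return size_mb * 1000 / latency_ms


for shell, (size_mb, latency_ms) in PR_DATA.items():
    rate = effective_pr_rate_mb_s(size_mb, latency_ms)
    print(f"{shell}: {rate:.1f} MB/s")
```

On these author-reported figures the three shells fall in a fairly narrow band of roughly 29 to 41 MB/s, which suggests that reconfiguration time scales mostly with bitstream size.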
Table 12. Headline strengths and weaknesses of leading open-source FPGA SmartNIC shells.
| Shell | Main Strength | Primary Weakness |
| --- | --- | --- |
| Coyote v2 [40] | Hierarchical multi-tenant PR; hot-swaps both services and user kernels without traffic loss. | Large static area and more difficult timing closure. |
| OpenNIC [43] | Vendor-maintained driver stack (QDMA/XDMA, DPDK) and stable ABI. | No dynamic PR; any change requires a full rebuild and board reboot. |
| RecoNIC [45] | RDMA verbs exposed directly to accelerators, enabling zero-copy remote memory access. | Limited to Xilinx UltraScale+ boards and a smaller tooling ecosystem. |
| ClickNP [39] | Click-style HLS elements; pipelines can be rewired at run time for rapid prototyping. | Throughput capped at ≤40 Gb/s; TCP still handled by host CPU. |
| Corundum [38] | Clean, fully parameterized Verilog NIC core, ideal as a forkable base design. | Purely static build; lacks any built-in PR manager. |
| FpgaNIC [49] | GPU-centric PCIe P2P path (≤4 µs NIC → GPU); excellent for in-network ML. | Single-port design; GPL licensing and smaller community. |
Table 13. Technology stack enabling direct GPU → FPGA → Network offload (unit-consistent PCIe bandwidth figures).
| Layer | Key Mechanism | Representative Support/Notes |
| --- | --- | --- |
| Physical link | PCIe Gen4/Gen5 peer-to-peer over ×16 (16/32 GT/s; 128b/130b) | Practical payload bandwidth is commonly quoted at ∼31.5 GB/s (Gen4 ×16) or ∼63 GB/s (Gen5 ×16) per direction. Used by GPUDirect RDMA-capable GPUs and FPGA shells based on QDMA/XDMA [52,53]. |
| Future bandwidth headroom | PCIe 6.0 (64 GT/s, FLIT-mode + FEC) and CXL 3.0 fabric attach | PCIe 6.0 doubles Gen5 bandwidth; a ×16 link is commonly quoted at up to ∼256 GB/s bidirectional in public summaries. CXL 3.0 extends the same physical layer into fabric-attached accelerator pools [54,55]. |
| Addressing | ATS/PRI + IOMMU; GPU BAR and VA translation | ATS/PRI enables devices to DMA using translated addresses (platform- and driver-dependent). Demonstrated in GPU-aware FPGA SmartNIC prototypes [45,49]. |
| DMA engines | FPGA QDMA/XDMA and GPU copy/stream engines | Bidirectional, interrupt-light transfers in pull or push modes, orchestrated via CUDA GPUDirect and the SmartNIC runtime [52]. |
| Cache coherence (road-map) | CXL.mem/CXL.cache (where supported) | Coherent attach can remove explicit DMA for fine-grain loads/stores on platforms that implement the full CXL coherency stack [55]. |
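The per-direction figures quoted in Table 13 follow from the transfer rate, lane count, and line encoding. A minimal sketch of that arithmetic is given below, with a hypothetical helper name. It accounts only for 128b/130b encoding on Gen4/Gen5 and, as a simplifying assumption, treats PCIe 6.0 as a raw rate with no FLIT or FEC overhead; TLP headers and flow-control traffic are ignored throughout, so achievable DMA throughput will be lower.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int,
                        encoding_efficiency: float = 128 / 130) -> float:
    """Per-direction link bandwidth in GB/s after line encoding."""
    # Each transfer carries one bit per lane; divide by 8 to convert to bytes.
    return gt_per_s * lanes * encoding_efficiency / 8


gen4 = pcie_bandwidth_gb_s(16, 16)   # about 31.5 GB/s per direction
gen5 = pcie_bandwidth_gb_s(32, 16)   # about 63.0 GB/s per direction
# Assumption: raw rate only for Gen6, ignoring FLIT/FEC overhead.
gen6_raw = pcie_bandwidth_gb_s(64, 16, encoding_efficiency=1.0)

print(f"Gen4 x16: {gen4:.1f} GB/s, Gen5 x16: {gen5:.1f} GB/s")
print(f"Gen6 x16 raw: {gen6_raw:.0f} GB/s per direction, "
      f"{2 * gen6_raw:.0f} GB/s bidirectional")
```

The Gen6 result of 128 GB/s per direction is what the "∼256 GB/s bidirectional" figure in the table corresponds to when both directions are summed.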
Table 14. Engineering requirements for a next-generation GPU-aware FPGA SmartNIC shell.
| Requirement | Rationale | Evidence/Source |
| --- | --- | --- |
| P2P-capable PCIe Gen5/6 endpoint (≥512 MB BAR) | Sustains tens to 100+ GB/s class DMA rates (Gen5/Gen6), without host-memory staging when P2P is supported. | QDMA Gen5 IP (AMD 2023) [53]; PCIe 6.0 DevCon demo [54]. |
| ATS/PRI with IOMMU awareness | Allows the FPGA to use GPU virtual addresses. | Demonstrated in FpgaNIC GVAD path [49]. |
| PR-safe clock and power islands | Isolates 64 GT/s SERDES jitter; enables hot-swap at line-rate. | Hierarchical zoning in Coyote v2 (2025). |
| Coherent CXL 3.0 (Type-3 device) option | Removes explicit DMA; enables <300 ns random loads/stores to GPU pools. | Agilex 2 and Versal Premium ES road-maps [56,57]. |
| Curated fixed offloads (RoCE, PTP, basic crypto) | Prevents re-implementation of plumbing; eases adoption. | OpenNIC experience: driver maturity outweighs LUT savings [43]. |
| Unified SDK bridging CUDA and the shell DMA API | Mirrors GPUDirect semantics; eases ML-stack integration. | NVIDIA DOCA GPUDirect RDMA guides [52]. |
Table 15. Representative in-NIC preprocessing pipelines executed before the RDMA transport layer.
| Project | Platform/Speed | Result | Key Technique |
| --- | --- | --- | --- |
| SwitchML [62] | NetFPGA-SUME, 25 Gb/s | ∼35 % ↓ in all-reduce time | Integer adders in FPGA aggregate gradient shards; only the reduced tensor exits the NIC. |
| FpgaNIC (Agg) [49] | Xilinx VCU118, 100 Gb/s | 1.3 × higher training throughput; frees one CPU socket | On-card DDR reduction buffer; DMA result directly into peer GPUs via GPUDirect. |
| KV-Direct [59] | NetFPGA-10G, 40 Gb/s | 60 µs ↓ P99 GET latency | Bloom-filter hit/miss in FPGA; NIC drops negative queries before RDMA. |
| FlowBlaze [61] | NetFPGA-SUME, 40 Gb/s | 3 µs end-to-end firewall latency | Stateful match–action pipeline with per-flow FSM in BRAM. |
| Pensando DSC “Secure NVMe” [31] | DSC-2 ASIC, 2 × 100 Gb/s | ∼15 % host-CPU saving in storage nodes | On-NIC CRC + AES-XTS prior to NVMe-oF WRITE. |
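The integer-adder aggregation listed for SwitchML in Table 15 relies on converting floating-point gradients to fixed-point integers so that the datapath only needs adders. The sketch below illustrates that idea in software under stated assumptions: the scale factor, the function names, and the plain Python lists are all illustrative, and overflow handling and packetization are left out.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], scale: int) -> List[int]:
    """Convert floats to fixed-point integers (done on each worker)."""
    return [round(v * scale) for v in values]


def aggregate(shards: Sequence[Sequence[int]]) -> List[int]:
    """Element-wise integer sum, the only operation the NIC would perform."""
    return [sum(column) for column in zip(*shards)]


def dequantize(values: Sequence[int], scale: int) -> List[float]:
    """Convert the reduced fixed-point tensor back to floats."""
    return [v / scale for v in values]


SCALE = 1 << 16  # assumed fixed-point scale, chosen for illustration
workers = [[0.5, -1.25], [1.5, 0.25]]
reduced = dequantize(aggregate([quantize(w, SCALE) for w in workers]), SCALE)
print(reduced)
```

Only the reduced vector leaves the aggregation point, so the traffic that continues downstream is one tensor instead of one per worker.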
Table 16. Reusable hardware primitives that underpin line-rate preprocessing inside SmartNIC shells.
| Building Block | Purpose/Function | Resource Cost ¹ | Adopted in (Examples) |
| --- | --- | --- | --- |
| Streaming parser and match–action table | Splits L2–L4 headers, sets flow-metadata, filters packets at line rate. | ~6 k LUT + 2 BRAM/100 Gb/s | FlowBlaze [61], Corundum [38]. |
| Windowed CRC/AES pipe | Inline integrity or encryption; 64–128-byte streaming window. | ~8 k LUT + 4 DSP | Coyote v2 crypto PR [40], Pensando DSC [31]. |
| On-chip scratch-pad (BRAM/URAM) | Holds aggregation buffers or Bloom-filter state at single-cycle latency. | 1 MB BRAM ≈ 2 | SwitchML, FpgaNIC reduction buffer [49]. |
| HBM or DDR4 controller | Deep store for >1 MB state (gradient buckets, compression tables). | HBM2 controller tile, no LUT; adds ~6 W | RecoNIC RDMA buckets [45]. |
| AXI-Stream switch and rate limiter | Arbitrates multiple preprocessing kernels; enforces line-rate pacing. | ~3 k LUT per port | OpenNIC static shell [43]. |
| QP/DMA bridge (pump) | Writes transformed payloads to the correct RDMA work-queue descriptor. | ~5 k LUT + 2 BRAM per queue | Coyote, RecoNIC (shared verbs) [45]. |
| Partial-reconfig controller | Isolates clocks, pauses QP, reloads PR bitstream in <300 ms. | ~4 k LUT overall | Coyote v2. |

¹ Typical Xilinx UltraScale+ speed-grade 2 synthesis at 250 MHz; numbers vary with vendor tool versions.
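The Bloom-filter state that Table 16 places in on-chip scratch-pad memory can be modeled in software to show why it suits a filtering stage: membership tests never give false negatives, so a "definitely absent" answer is safe grounds for dropping a query early. The sketch below is a generic Bloom filter, not the design of any surveyed system; the class name, bit-array size, and use of salted BLAKE2b are assumptions, and hardware would more plausibly use cheap CRC-style hashes.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit-array Bloom filter; may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Derive independent hash functions by salting BLAKE2b with an index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(key, digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

With the defaults above the bit array occupies 8 KiB, which gives a sense of why such state fits comfortably in BRAM or URAM.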
Table 17. Software layers that determine whether a SmartNIC design is deployable. The dominant challenges are end-to-end: memory/IOMMU correctness, update consistency, and performance isolation.
| Software Layer | Role/Typical Components | Practical Constraints/Failure Modes |
| --- | --- | --- |
| Kernel driver + PCIe binding | Device enumeration, queue exposure, interrupts/MSI-X, DMA mapping, IOMMU integration; exposes netdev/RDMA devices and management interfaces | Throughput/latency sensitivity to queue mapping, interrupt moderation, IOMMU/ATS settings; correctness depends on memory pinning, isolation, and ordering |
| User-space datapath/RDMA stack | DPDK/VPP-style poll-mode drivers, AF_XDP, libibverbs/rdma-core; provides stable API for apps and frameworks | Busy-polling vs latency/CPU trade-offs; memory registration costs; zero-copy semantics depend on consistent buffer ownership and cache/NUMA placement |
| Control plane for programmable datapaths | P4 control (table updates), telemetry configuration, policy install; often integrated with SDN controllers or orchestration | Update consistency and timing: non-deterministic or poorly staged updates can transiently violate policy/correctness; requires explicit update strategies [65] |
| On-card runtime + lifecycle management | Service deployment, monitoring, and upgrades; PR/DFX managers, queue quiescing, state save/restore, watchdogs | Operational complexity: safe updates require quiescing/backpressure and state discipline; runtime must prevent interference between services/tenants |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
