1. Introduction
For more than two decades, the historical expectation that faster general-purpose processors would automatically absorb growing software and data demands has weakened. Single-thread CPU performance has become increasingly constrained by power density, memory latency, and diminishing returns from deeper speculation and pipelines; as a result, much of the performance progress has shifted toward parallelism and domain-specific acceleration [1].
At the same time, two external trends have accelerated. First, GPU throughput continues to grow rapidly, driven by wider SIMD-style execution, specialized tensor units, and ever-higher memory bandwidth, making GPUs the dominant engine for AI training and increasingly for data analytics. Second, data center network line rates have advanced from 10–100 Gb/s to 200–400 Gb/s and are now moving toward 800 Gb/s-class links, so each server can ingest or emit tens of megabytes every millisecond. This combination widens a data–compute gap: the network and accelerator fabric can deliver data faster than the host CPU can steer, validate, copy, and transform it at low latency.
1.1. The Data–Compute Gap
Modern microservices, storage systems, and distributed training pipelines frequently push 100 Gb/s or more per node. At 400 Gb/s, a NIC can emit roughly 50 MB every millisecond; if each packet must traverse a general-purpose networking stack, be parsed, copied between buffers, and validated in software, the aggregate CPU cost becomes significant. Azure’s experience with large-scale services shows that even “small” per-packet work (steering, checksumming, framing, encryption hooks, and queue management) can consume a substantial share of host cores and memory bandwidth, motivating hardware support closer to the wire [2].
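The arithmetic behind these figures can be made explicit. The following is a minimal sketch, assuming the standard 20 bytes of per-frame Ethernet wire overhead (preamble, start-of-frame delimiter, and inter-frame gap); the function names are illustrative and not taken from any cited system.

```python
# Back-of-envelope budget per packet at a given line rate.
# Assumption: 20 bytes of per-frame overhead on the wire
# (7-byte preamble + 1-byte SFD + 12-byte inter-frame gap).
WIRE_OVERHEAD_BYTES = 20


def bytes_per_millisecond(line_rate_gbps: float) -> float:
    """Raw bytes a link carries in one millisecond."""
    bits_per_ms = line_rate_gbps * 1e9 / 1e3
    return bits_per_ms / 8


def packets_per_second(line_rate_gbps: float, frame_bytes: int) -> float:
    """Maximum frame rate for back-to-back frames of a fixed size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return line_rate_gbps * 1e9 / bits_on_wire


def ns_per_packet(line_rate_gbps: float, frame_bytes: int) -> float:
    """Time available per frame, in nanoseconds, at full line rate."""
    return 1e9 / packets_per_second(line_rate_gbps, frame_bytes)


if __name__ == "__main__":
    for rate in (100, 400, 800):
        mb_per_ms = bytes_per_millisecond(rate) / 1e6
        print(f"{rate} Gb/s: {mb_per_ms:.0f} MB/ms, "
              f"{ns_per_packet(rate, 64):.2f} ns per 64-byte frame, "
              f"{ns_per_packet(rate, 1518):.2f} ns per 1518-byte frame")
```

Under these assumptions, 400 Gb/s corresponds to 50 MB per millisecond, and a stream of minimum-size 64-byte frames at 100 Gb/s leaves roughly 6.7 ns per packet, which is why even small per-packet software costs add up quickly.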
Kernel-bypass I/O frameworks and in-kernel fast paths (e.g., DPDK; XDP/AF_XDP) reduce overhead by shortening or bypassing the OS datapath, but they still rely on the host CPU to execute protocol logic and application-specific transforms [3]. As line rates continue to rise, the remaining gap is less about raw PCIe bandwidth and more about where computation happens: moving data to the CPU (and often to the GPU) and back is increasingly expensive in latency, energy, and contention.
1.2. SmartNICs as a Solution for the Data–Compute Gap
SmartNICs embed programmable or fixed-function accelerators directly on the network interface, enabling computation to run on the datapath before packets reach host memory. Current systems span a spectrum. At one end, ASIC SmartNICs (often branded as DPUs) integrate hardened protocol engines (e.g., TLS/IPsec, NVMe-oF, and RDMA support) alongside clusters of embedded cores; their energy efficiency and software maturity are strong, but their capabilities are largely fixed at tape-out and evolve on multi-year silicon cycles [4].
At the other end, FPGA SmartNICs provide a reconfigurable substrate that can adapt as protocols, security primitives, and ML operators evolve. Most FPGA SmartNICs are organized around a static shell that encapsulates Ethernet MACs, PCIe/DMA engines, queue management, and control interfaces, while exposing one or more partial-reconfiguration regions (PRRs) where user-defined accelerators can be swapped in at run time. This model has been used to build cloud-oriented SmartNICs that balance performance and manageability [5] and to prototype isolated, multi-tenant, P4-programmable SmartNICs with reconfiguration support [6]. More broadly, programmable data planes and in-network computing have matured into a rich design space, with SmartNICs forming a practical point in the spectrum between in-switch processing and host-centric acceleration [7].
1.3. Why Focus on Shells?
The shell is the static component that determines I/O bandwidth, DMA behavior, queue layout, SR-IOV support, clock domains, and reconfiguration latency. A well-architected shell primarily matters because it fixes the “physics” of what the NIC can do: it sets the baseline datapath latency, determines whether the line rate can be sustained under realistic traffic patterns, and constrains how easily operators can deploy new functions. First, shells that implement parsing, match–action logic, RDMA engines, and on-card buffering can eliminate PCIe round trips for common datapath work, reducing tail latency while sustaining at least 100 Gb/s line rate. Second, shells that support dynamic partial reconfiguration enable composability: operators can swap a TLS offload for an ML aggregation kernel without draining the host or replacing hardware. Third, shells increasingly act as heterogeneous bridges, exposing low-level data movement (e.g., verbs-like RDMA pipelines) and peer-to-peer PCIe paths that allow direct GPU–FPGA transfers, which is essential for in-network machine learning and GPU-centric distributed workloads.
1.4. Contribution of This Article
This study provides a comparative analysis of six representative, state-of-the-art FPGA SmartNIC shells—Coyote, OpenNIC, RecoNIC, ClickNP, Corundum, and FpgaNIC—and places them in context against conventional NICs and DPU-class devices. Rather than treating all shells as interchangeable integration templates, we highlight the distinct design philosophies that shape each platform (e.g., research-oriented rapid prototyping vs. production-leaning open-source datapaths; host-centric acceleration vs. network-centric programmability), and we connect these choices to practical use cases and fundamental constraints (notably reconfiguration granularity and isolation, host/PCIe bottlenecks, memory-system placement, and the limits imposed by timing closure and tool-flow complexity). We then examine how architectural decisions—especially static monolithic pipelines versus partial-reconfigurable regions—affect programmability, latency, and upgrade cadence, and we discuss emerging GPU–FPGA offload paths that minimize or remove CPU involvement from the data plane for low-latency streaming workloads. Finally, we synthesize techniques for performing stream-oriented preprocessing inside the NIC before RDMA (e.g., filtering, aggregation, lightweight transforms, and protocol-aware adaptations), thereby reducing GPU-side overhead and improving end-to-end pipeline efficiency. By systematically mapping strengths, gaps, and recurring trade-offs across today’s shells and device classes, we motivate the requirements for a next-generation SmartNIC architecture—one that pairs high internal bandwidth with modular reconfiguration and a unified development flow—pushing programmability, latency, and protocol agility beyond the current state of the art.
From NIC evolution to SmartNIC implementations. Modern NICs have progressively absorbed “near-host” functionality that reduces per-packet CPU overhead and enables parallel receive/transmit: multi-queue designs, receive-side scaling (RSS), checksum and segmentation offloads, and virtualization primitives such as SR-IOV. In parallel, RDMA-capable NICs pushed the envelope further by enabling kernel-bypass and zero-copy data movement, shifting parts of the communication stack into the interface and exposing new architectural constraints around queueing, memory registration, and transport reliability. These trends motivate SmartNICs as the next step: instead of isolated fixed offloads, SmartNICs make packet handling and data movement programmable under tight latency and bandwidth budgets [8,9,10].
SmartNICs in practice. Contemporary SmartNIC implementations span a spectrum from DPUs (SoC-based cards with embedded CPU complexes and fixed-function accelerators) to FPGA-based SmartNICs that expose customizable datapaths and, in many systems, partial reconfiguration. This distinction matters for developers: DPUs typically favor mature software ecosystems and predictable functionality, while FPGA SmartNICs favor datapath specialization and fast iteration on new parsing, scheduling, and preprocessing logic, at the cost of stricter timing/floorplanning constraints and higher hardware-design effort. Recent systems work also shows that obtaining consistent offload speedups depends on accurately matching a workload’s compute/memory behavior to a given SmartNIC’s micro-architecture and memory system, motivating tooling and methodology rather than ad hoc porting [11].
AI models at the network edge. A growing subset of SmartNIC use cases involves machine-learning-assisted telemetry, anomaly detection, flow inference, and policy decisions. Because SmartNIC datapaths operate under line-rate constraints, the models that appear in practice are usually compact and quantization-friendly (e.g., linear models, small MLPs/regressors, or carefully structured fixed-point inference) and are integrated in ways that preserve deterministic per-packet latency. We therefore treat artificial intelligence in the NIC as a systems-design question (how models are represented, updated, and executed within the datapath and memory hierarchy) rather than as a pure accuracy benchmark [12,13].
Novelty and organization. Unlike prior work that focuses on a single SmartNIC class, this article connects three perspectives—(i) architectural trade-offs across FPGA shells and DPU-class designs, (ii) CPU-bypass GPU–FPGA offload paths, and (iii) in-NIC preprocessing that reshapes data before RDMA—highlighting where current platforms constrain programmability, latency, and protocol agility. The remainder of this paper is organized as follows:
Section 2 summarizes the evolution and baseline architecture of traditional NICs;
Section 3 reviews DPU-style SmartNICs;
Section 4 compares FPGA-based SmartNIC shells and their design trade-offs;
Section 5 discusses GPU–FPGA direct offload;
Section 6 surveys in-NIC preprocessing for RDMA pipelines;
Section 7 analyzes the software stack and programming layers; and
Section 8 concludes with open challenges and future directions.
2. Traditional Network Interface Cards (NICs)
A Network Interface Card (NIC) is a largely fixed-function adapter whose primary role is to transmit and receive Ethernet (or InfiniBand) frames between the host memory system and the physical medium. In the classic model, everything above Layer-2 framing—including most transport semantics, application parsing, and policy decisions—remains the responsibility of the host CPU and the operating system networking stack.
From an architectural perspective, a traditional NIC integrates four tightly coupled functions. First, it terminates the physical link through PHY/SERDES blocks (clock recovery, equalization, and FEC where applicable). Second, it implements the MAC layer to perform framing and deframing (preamble/SFD processing, CRC generation and checking, VLAN tags, and often hardware timestamping for PTP). Third, it moves packet buffers to and from system DRAM using a DMA engine driven by descriptor rings programmed by the host driver, raising MSI-X events based on programmable moderation policies. Finally, it may include a small set of fixed accelerators—such as checksum offload, TSO/LSO, RSS, and basic flow steering—that reduce per-packet CPU work while keeping the datapath itself non-programmable and vendor-defined.
Because higher-layer protocol logic still executes on the host, the performance envelope of a traditional NIC remains coupled to CPU capacity, memory bandwidth, and the cost of software datapaths. This CPU-centric design dominated through the 1 Gb/s → 10 Gb/s era and still underpins many commodity adapters today; the difference in modern deployments is that software fast paths (kernel bypass and driver-level hooks) are increasingly used to postpone or reduce host overhead rather than changing the NIC’s fundamental role.
2.1. Historical Perspective
From the first 10 Mb/s Ethernet adapters of the early 1980s to today’s 400 Gb/s ASICs, the purpose of a classic NIC has remained conceptually stable: move packets between a host buffer and the wire with minimal loss and bounded latency. As line rates climbed from 1 Gb/s to 10/40/100 Gb/s and beyond (see Table 1), vendors expanded the set of fixed offloads primarily to keep interrupts and per-packet CPU work under control (e.g., segmentation, coalescing, and receive-side scaling), while protocol evolution and policy continued to be handled in host software.
2.2. Baseline Architecture
The baseline architecture of a traditional NIC can be understood as a pipeline that converts wire-format frames into DMA-visible buffers. The front end consists of the PHY/PCS/SERDES and MAC, which recover the bitstream, apply FEC where required, and validate frames via CRC and framing rules. The core of the NIC is a DMA subsystem that reads and writes host memory based on descriptor rings and queue state maintained by the driver. A small control-plane interface (configuration registers, doorbells, and status counters) exposes link state, queue pointers, interrupt moderation settings, and steering rules. In modern adapters, this baseline is extended with multi-queue support and virtualization primitives (e.g., SR-IOV), enabling direct queue ownership by guests while still keeping most protocol logic on the CPU.
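The descriptor-ring handshake between driver and DMA engine can be illustrated with a toy software model. This is a minimal sketch, assuming a single receive ring with one slot kept empty to distinguish "full" from "empty"; the class and field names are hypothetical and do not correspond to any particular device's register layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    buffer_addr: int     # host address of the packet buffer (illustrative)
    length: int = 0      # bytes written by the modeled NIC
    done: bool = False   # set by the NIC once the buffer holds a frame


class RxRing:
    """Toy receive ring: the driver posts buffers, the NIC fills them."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots: List[Optional[Descriptor]] = [None] * size
        self.tail = 0      # next slot the driver will post (doorbell value)
        self.nic_head = 0  # next slot the NIC will fill
        self.clean = 0     # next completed slot the driver will reap

    def post(self, buffer_addr: int) -> bool:
        """Driver side: hand an empty buffer to the NIC."""
        nxt = (self.tail + 1) % self.size
        if nxt == self.clean:          # ring full (one slot stays empty)
            return False
        self.slots[self.tail] = Descriptor(buffer_addr)
        self.tail = nxt
        return True

    def nic_receive(self, frame_len: int) -> bool:
        """NIC side: DMA a frame into the next posted buffer, or drop it."""
        if self.nic_head == self.tail:  # no posted buffers left
            return False
        desc = self.slots[self.nic_head]
        desc.length = frame_len
        desc.done = True
        self.nic_head = (self.nic_head + 1) % self.size
        return True

    def reap(self) -> Optional[Descriptor]:
        """Driver side: collect the next completed descriptor, if any."""
        if self.clean == self.nic_head:
            return None
        desc = self.slots[self.clean]
        self.clean = (self.clean + 1) % self.size
        return desc
```

In this model a frame that arrives when no buffer is posted is simply dropped, which mirrors why drivers try to keep rings replenished ahead of the traffic rate.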
2.3. Virtualization and Multi-Queue
Modern NICs expose hundreds or thousands of transmit and receive queues to parallelize packet processing and to isolate tenants. Receive-side scaling (RSS) distributes incoming traffic across queues using a hash over header fields so that multiple CPU cores can process packets concurrently, which is essential at 100–400 Gb/s line rates. SR-IOV extends this idea into virtualization by slicing the adapter into multiple PCIe virtual functions (VFs), allowing guest VMs or containers to steer and consume packets without hypervisor-mediated copying.
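To make the RSS mechanism concrete, the sketch below hashes an IPv4/TCP 4-tuple with a Toeplitz hash (the hash family commonly used for RSS) and maps the result to a queue through an indirection table. The key bytes, the table size of 128 entries, and the function names are illustrative assumptions rather than values from any specific adapter.

```python
import socket
import struct

# Illustrative 40-byte key (an assumption); real drivers program their own key.
RSS_KEY = bytes(range(1, 41))
# Assumed indirection-table size; the low hash bits index into this table.
INDIRECTION_ENTRIES = 128


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: for every set input bit, XOR in the 32-bit
    window of the key that starts at that bit position."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 31:
        raise ValueError("key too short for this input length")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              num_queues: int) -> int:
    """Pick a receive queue for a flow via hash plus indirection table."""
    table = [i % num_queues for i in range(INDIRECTION_ENTRIES)]
    flow = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    return table[toeplitz_hash(RSS_KEY, flow) % INDIRECTION_ENTRIES]


if __name__ == "__main__":
    q = rss_queue("10.0.0.1", "10.0.0.2", 40000, 443, num_queues=8)
    print("flow mapped to queue", q)
```

Because the hash depends only on header fields, all packets of one flow land on the same queue, which preserves per-flow ordering while spreading distinct flows across cores.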
However, multi-queue and SR-IOV primarily reduce software overhead; they do not remove it. Even with direct queue ownership, the guest must still run a full networking stack and any application-specific parsing or policy. This is why contemporary deployments frequently pair traditional NICs with software fast paths. Driver-level hooks such as XDP/eBPF provide early drop/redirect decisions closer to the driver hot path, while kernel-bypass frameworks and AF_XDP can reduce copies and syscall overhead in latency-sensitive datapaths [3,25,26].
2.4. Representative Device Families
Representative device families illustrate how the classic NIC evolved from a simple DMA-based frame mover into a highly optimized I/O endpoint with increasingly rich but still largely fixed-function acceleration.
Table 2 summarizes widely deployed families across multiple generations and highlights the incremental nature of NIC innovation: each step primarily adds hardware assistance for queue scaling (RSS/MSI-X), virtualization (SR-IOV/VMDq), overlay parsing (VXLAN/GENEVE), and time-sensitive features such as IEEE 1588 timestamping, while preserving the same fundamental architectural boundary—higher-layer semantics and most application logic remain on the host CPU. This table therefore serves as a baseline for the remainder of this paper: SmartNICs and DPUs should be interpreted as architectural responses to the point where incremental NIC offloads no longer suffice to contain CPU overhead at 100–400 Gb/s and beyond.
2.5. Strengths, Limitations, and Where Classic NICs Fit Today
Traditional NICs remain attractive because their hardware and software ecosystems are mature and predictable. They are supported across operating systems and hypervisors, have stable drivers and operational tooling, and their fixed-function datapaths are straightforward to validate and deploy at scale. Their bill of materials is also relatively low compared to reconfigurable solutions, and their deterministic behavior simplifies performance debugging under steady workloads.
The limitations emerge when line rates and workload complexity outpace what host software can handle efficiently. At 100–400 Gb/s, even modest per-packet work (metadata parsing, policy checks, buffer management, and security hooks) can consume substantial CPU capacity and memory bandwidth, while additional copies between NIC buffers, kernel space, and user space increase latency variance. Moreover, the feature set is effectively defined by silicon and firmware: new transport behaviors, novel telemetry, or specialized in-network transforms generally require host-side implementation. As a result, many operators employ software acceleration to extend the lifetime of classic NICs, but these techniques largely shift where CPU cycles are spent rather than eliminating them; in practice, this motivates a transition toward SmartNIC-class devices that can terminate or accelerate parts of the datapath on-card.
2.6. Role in Modern Data Center Systems
Traditional NICs remain dominant for web front-ends, scale-out storage, and edge systems where power budgets are tight and protocol requirements are stable. In hyperscale settings, they are increasingly paired with kernel-bypass and driver fast-path mechanisms to sustain throughput while controlling CPU burn. Nevertheless, as I/O-intensive services expand and the “I/O-driven server” model becomes more prominent, a fixed-function NIC often becomes the bottleneck for latency and for CPU efficiency, setting the stage for more capable SmartNICs and DPUs [27].
2.7. Bridge to SmartNICs
The central pressure point is not merely raw bandwidth but the cost of orchestrating data movement and per-packet decision-making in host software. This motivates moving selected functions—filtering, steering, security primitives, transport termination, and storage protocol handling—closer to the wire. The next section therefore introduces DPUs, which integrate these functions into a fixed-function but highly optimized SoC, and contrasts them with FPGA-based SmartNIC shells later in this paper.
3. Data-Processing Units (DPUs): Fixed-Function SmartNICs
A Data-Processing Unit (DPU), also marketed as a SmartNIC ASIC or Infrastructure Processing Unit (IPU), is a network adapter built around a custom system-on-chip that hardens large portions of the datapath and common infrastructure offloads into silicon. Unlike FPGA SmartNICs, a DPU trades broad reconfigurability for deterministic performance, stronger power efficiency, and a software stack that resembles a “miniature server” dedicated to I/O and security control. This shift aligns with the emerging view of the SmartNIC/DPU as a data movement controller rather than a peripheral that merely transfers packets [27].
Architecturally, modern DPUs combine (i) multi-rate Ethernet MAC/PCS blocks with high-speed SERDES, (ii) a packet parsing and classification pipeline (often P4- or microcode-programmable within fixed stage boundaries), (iii) hardened transport and storage engines (e.g., RoCEv2 RDMA, TCP segmentation/aggregation, and NVMe-oF), and (iv) on-card compute in the form of embedded CPU clusters that run control-plane services, agents, and sometimes portions of the virtual switch. High-bandwidth DMA engines and PCIe Gen4/Gen5 interfaces provide zero-copy access to host memory, and some platforms additionally target peer-to-peer paths for storage or accelerator attachment. The net result is that protocols such as RDMA and NVMe-oF can be terminated on-card, reducing host CPU overhead and stabilizing tail latency for I/O-intensive services.
3.1. Historical DPU Milestones (2017–2025)
The rapid evolution of DPUs over the last decade is best understood as a sequence of integration steps: first, the convergence of a high-throughput NIC datapath with a general-purpose on-card CPU complex; then, the progressive hardening of infrastructure primitives such as RDMA, storage fabrics, cryptography, and virtualization; and finally, the emergence of programmable packet-processing stages that can be shaped by P4-like or eBPF-like models.
Table 3 provides a chronological view of major commercial families and highlights the specific inflection points that changed how DPUs are deployed: increases in port bandwidth, richer security/telemetry engines, and a more mature on-card software ecosystem. This timeline frames why DPUs are increasingly used as “infrastructure endpoints” in modern clusters—terminating network, storage, and security functions close to the wire—while also clarifying the main limitation relative to FPGA SmartNICs: their datapath capabilities are primarily defined at design time and expand only with new silicon generations.
3.2. Baseline Micro-Architecture
Table 4 summarizes the main architectural blocks of a DPU and the role each block plays.
3.3. Reference SmartNIC Hardware Architecture (Cross-Cutting View)
To avoid discussing each device family in isolation, we summarize a reference SmartNIC architecture that captures the common hardware blocks that recur across NICs, DPUs, and FPGA shells. A SmartNIC can be viewed as two tightly coupled subsystems: (i) a network datapath that receives frames, performs parsing/classification, applies actions (steering, filtering, encapsulation, encryption, and telemetry), and schedules traffic; and (ii) a host/memory subsystem that moves data and metadata between the card and the host (or GPU) via DMA/RDMA while enforcing isolation and ordering constraints. Around these, SmartNIC implementations add a control-plane (embedded CPUs/firmware) and optional accelerators for compute-heavy primitives. The key architectural difference between device classes is where programmability lives: fixed-function NICs offer limited knobs around queueing and offloads; DPUs add general-purpose processing and rich I/O virtualization but keep most datapath functions in fixed engines; FPGA shells expose a programmable datapath and memory hierarchy at the cost of stricter timing/floorplanning and a larger “static infrastructure” footprint.
Table 5 highlights the main blocks and their role in each class, providing a common vocabulary for the detailed discussions in Sections 2 through 7.
3.4. Key On-Card Offloads and Accelerators
While the presence of embedded CPU cores is essential for orchestration, most of the performance and isolation benefits of DPUs come from their fixed-function engines and pipeline accelerators. These blocks determine which functions can execute entirely on-card with deterministic latency and without host CPU cycles, and they shape the practical boundary between “control-plane” tasks (policy, configuration, and observability) and “data-plane” tasks (classification, cryptography, and RDMA/storage termination).
Table 6 therefore organizes common DPU capabilities by category and emphasizes the system-level consequence of each offload: lower CPU consumption per byte, reduced tail latency under load, and stronger multi-tenant isolation because sensitive traffic can be processed without traversing host memory. In the remainder of this paper, these categories are used as a reference checklist when comparing FPGA shells: FPGA SmartNICs can approximate many of these engines in reconfigurable logic but differ in how easily the functionality can be replaced, extended, or specialized via partial reconfiguration.
3.5. Virtualization and Multi-Tenant Isolation
DPUs place virtualization and isolation at the center of their design because they are frequently deployed in multi-tenant clouds where the NIC is a shared security boundary. Hardware queue hierarchies (often on the order of – RX/TX queues) are used to assign dedicated queue pairs to VFs, vDPA instances, or container endpoints, enabling scalable parallelism without requiring a monolithic host vSwitch to touch every packet. Address translation and access control are typically enforced using IOMMU/SMMU contexts per tenant, limiting DMA to explicitly authorized memory windows and reducing the blast radius of a compromised guest.
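The per-tenant DMA restriction described above can be sketched as a simple window check. This is an illustrative model only: real IOMMU/SMMU contexts operate on page tables and translated addresses, whereas the sketch below assumes flat, untranslated address windows, and the `Window` and `DmaGuard` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Window:
    base: int        # first authorized address
    size: int        # window length in bytes
    writable: bool   # whether DMA writes are allowed


class DmaGuard:
    """Toy per-tenant DMA filter: a request passes only if it lies entirely
    inside one window granted to that tenant with sufficient permission."""

    def __init__(self) -> None:
        self._contexts: Dict[str, List[Window]] = {}

    def grant(self, tenant: str, window: Window) -> None:
        self._contexts.setdefault(tenant, []).append(window)

    def allowed(self, tenant: str, addr: int, length: int, write: bool) -> bool:
        if length <= 0:
            return False
        end = addr + length
        return any(
            w.base <= addr and end <= w.base + w.size
            and (w.writable or not write)
            for w in self._contexts.get(tenant, [])
        )


if __name__ == "__main__":
    guard = DmaGuard()
    guard.grant("tenant-a", Window(0x1000, 0x1000, writable=True))
    print(guard.allowed("tenant-a", 0x1800, 0x100, write=True))   # inside window
    print(guard.allowed("tenant-a", 0x1F00, 0x200, write=True))   # crosses the end
```

A request from a tenant with no granted windows is rejected by default, which is the property that limits the blast radius of a compromised guest in this simplified view.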
In addition, DPUs integrate a distinct security domain rooted in secure boot and hardware trust features (e.g., ROM-based boot chains and trusted execution modes such as TrustZone in Arm-based designs). This allows the DPU to run privileged control-plane agents and policy enforcement independently of the host OS. Recent empirical and characterization work emphasizes that these platforms are powerful but idiosyncratic: performance, programmability boundaries, and offload behavior can vary substantially across DPU generations and configurations, which must be accounted for when designing end-to-end systems [28].
3.6. Representative Device Families (2023–2025)
To ground the discussion in concrete platforms, Table 7 lists representative DPU families that are either shipping or widely referenced in recent deployments. The table deliberately focuses on attributes that directly shape system integration: (i) port configuration and line rate (which sets the dataplane throughput target), (ii) the scale of the embedded CPU complex (which bounds control-plane capacity and the feasibility of running on-card agents such as virtual switching, telemetry, or policy enforcement), (iii) the presence of hardened protocol and security engines (which determines which infrastructure services can be terminated on-card without host involvement), and (iv) the PCIe generation and lane width (which constrains host, GPU, and storage attachment bandwidth). In later sections, these characteristics provide a point of comparison against FPGA shells, whose flexibility and partial reconfiguration capabilities trade off against the determinism and mature software ecosystems typically associated with these ASIC-based devices.
3.7. Strengths and Limitations
DPUs offer a compelling operating point for infrastructure offloads because much of the datapath is hardened in silicon: common tasks such as steering, encryption hooks, RDMA/NVMe-oF handling, and queue management can be executed with predictable microsecond-scale latency even under sustained load, while typically achieving substantially better performance-per-watt than reconfigurable designs for the same fixed offload. This advantage is reinforced by mature vendor software ecosystems—for example, full Linux environments on-card and production-grade SDKs that expose stable APIs for packet processing, storage, and security services—which lowers integration and deployment risk in large fleets. At the same time, these benefits come with structural constraints: the feature set and acceleration blocks are largely frozen until the next silicon tape-out cycle, programmability is usually bounded by the vendor’s pipeline and extension model (e.g., table edits in P4-like stages, eBPF-based hooks, or predefined accelerators rather than arbitrary custom datapaths), and the embedded CPU complex that runs control and orchestration can itself become the bottleneck when the control-plane is heavy or when many tenants compete for limited on-card compute and memory resources. Consequently, DPUs excel when the target workload aligns with supported offloads and operational maturity is critical, whereas FPGA shells remain attractive when the operator needs new datapath functions, rapidly evolving protocols, or tightly customized in-network computation.
3.8. Role in Modern Data Centers
DPUs have become mainstream in hyperscale and enterprise deployments because they shift infrastructure work—virtual switching, storage termination, and inline security—off the host CPU while improving isolation in multi-tenant environments. In public clouds, DPUs are commonly positioned as a bare-metal isolation boundary, where the control plane and tenant traffic separation are enforced on the NIC-side rather than in the host kernel (e.g., Google Cloud IPU initiatives and AWS Nitro-style architectures). In AI clusters, high-end DPU generations are increasingly used as fabric endpoints for large RoCE-based GPU pods, where the NIC must sustain high line rates with low jitter and handle telemetry, congestion signaling, and transport offloads close to the wire; this is frequently associated with NVIDIA BlueField “SuperNIC” deployments in GPU back-end networks. DPUs are also widely adopted in storage systems, where inline crypto, compression, checksums, and NVMe-oF/TCP termination reduce CPU load and stabilize tail latency—a typical pattern is embedding Pensando-class DPUs in enterprise storage appliances to accelerate data services and telemetry. Finally, in carrier and edge infrastructure, DPU/Infrastructure-processor families such as Marvell OCTEON are deployed to implement packet-core functions (e.g., UPF/BNG), combining line-rate forwarding with inline crypto and, increasingly, lightweight ML-assisted classification in power- and space-constrained environments.
3.9. Bridge to Reconfigurable SmartNICs
While DPUs deliver strong performance per watt for stable protocols, emerging use cases (custom congestion control for AI collectives, new security formats, and in-NIC inference for proprietary models) require reprogrammability. FPGA SmartNIC shells fill this gap: they give up some power efficiency in exchange for dynamic reconfiguration, such as changing protocols on the fly or adding custom pipelines.
4. FPGA-Based SmartNIC Shells: Survey and Comparative Analysis
Modern FPGA SmartNIC platforms are typically engineered around a shell abstraction: a stable, board-validated base design that integrates the host interface (PCIe, DMA, and interrupts), the network interface (Ethernet MAC/PCS, timestamping, and flow steering), and the management/control plane, while exposing one or more user regions for custom offloads. This separation is motivated by operational realities: the shell concentrates vendor- and board-specific complexity (timing closure against hardened IP, reset sequencing, link bring-up, drivers, and compliance constraints), and the user region becomes the “innovation surface” where packet/transport/storage/AI accelerators evolve on a faster cadence.
A second motivation is that SmartNIC workloads are fundamentally parallel: to sustain 100–400 Gb/s line rate, a SmartNIC cannot rely on a single serial micro-engine. Instead, it must exploit spatial and pipeline parallelism (multi-queue RX/TX, replicated engines, and deep streaming pipelines with initiation interval close to one). Consequently, hardware architecture details—queueing model, pipeline boundaries, buffering strategy, memory hierarchy, and internal interconnect—directly determine whether a platform can meet throughput and tail-latency targets under adversarial traffic mixes.
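The need for spatial parallelism follows from simple throughput arithmetic. The sketch below is illustrative: the 512-bit datapath width and 250 MHz clock are assumed example values, not figures from any shell surveyed here, and the model ignores the efficiency loss from partially filled bus words at packet boundaries.

```python
import math

WIRE_OVERHEAD_BYTES = 20  # preamble + SFD + inter-frame gap per Ethernet frame


def bus_throughput_gbps(width_bits: int, clock_mhz: float) -> float:
    """Peak throughput of a streaming bus that moves one word per cycle."""
    return width_bits * clock_mhz / 1e3


def lanes_needed(line_rate_gbps: float, width_bits: int, clock_mhz: float) -> int:
    """Parallel datapaths of the given width needed to keep up with the link."""
    return math.ceil(line_rate_gbps / bus_throughput_gbps(width_bits, clock_mhz))


def engines_needed(line_rate_gbps: float, frame_bytes: int,
                   cycles_per_packet: int, clock_mhz: float) -> int:
    """Replicated engines needed when each spends a fixed number of cycles
    per packet (an initiation interval of one means one cycle per packet)."""
    pps = line_rate_gbps * 1e9 / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)
    per_engine_pps = clock_mhz * 1e6 / cycles_per_packet
    return math.ceil(pps / per_engine_pps)


if __name__ == "__main__":
    print(bus_throughput_gbps(512, 250))      # 128.0 Gb/s per datapath
    print(lanes_needed(400, 512, 250))        # parallel datapaths for 400 Gb/s
    print(engines_needed(100, 64, 10, 250))   # 10-cycle engine, 64-byte frames
    print(engines_needed(100, 64, 1, 250))    # fully pipelined engine
```

With these example values, a single 512-bit datapath at 250 MHz carries 128 Gb/s, so 400 Gb/s needs four such paths; an engine that takes ten cycles per packet must be replicated six times to handle minimum-size frames at 100 Gb/s, whereas a fully pipelined engine suffices alone.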
4.1. FPGA Shells: Shell Infrastructure and Subsystem Components
Concretely, an FPGA SmartNIC shell typically contains the following: (i) a host-facing subsystem (PCIe endpoint, DMA engines, doorbells/MSI-X, descriptor management, and driver-visible control/status registers); (ii) a wire-facing subsystem (Ethernet MAC/PCS/PHY interface, RSS/flow steering hooks, checksum/segmentation assists, and optional PTP timestamping); and (iii) an on-card fabric substrate (AXI-stream/NoC-style routing, clock/reset islands, performance counters, and a control-plane path for configuration and telemetry). The user region(s) then attach through standardized streaming and memory interfaces (e.g., AXI-Stream for packets; AXI4/NoC endpoints for state), so that offloads can be inserted without re-deriving the board bring-up and driver stack [34,35].
Parallel datapath processing is not an “optional optimization” but the main reason FPGA SmartNICs are viable: multi-queue designs allow the host to scale submission/completion across cores while the FPGA concurrently processes independent flows; replicated match/action or crypto/compression engines amortize per-stage latency; and careful buffering/backpressure design prevents head-of-line blocking when workloads combine small control packets with large data transfers. Frameworks such as DrawerPipe explicitly structure packet processing as interchangeable pipeline stages with uniform interfaces, making the parallel pipeline boundary a first-class design concept [36]. Likewise, EasyNet provides a 100 Gb/s networking substrate intended for HLS-based kernels, reflecting the practical need for “network I/O as a reusable shell service,” not bespoke per-project logic [37].
4.2. FPGA Shells: Datapath Architecture and Design Methodologies
At a hardware level, most FPGA SmartNIC shells converge to a similar fast-path organization: a multi-port Ethernet front-end (MAC/PCS and PHY control), an ingress pipeline (parsing, classification, and optional match/action), a buffering and scheduling stage (often queue-based; sometimes with rate shaping/telemetry hooks), and a host-facing I/O subsystem that provides high-throughput DMA/RDMA semantics over PCIe. A representative example of a “NIC-first” open implementation is Corundum, which exposes the core hardware ingredients required for a modern 100 Gb/s-class interface—high-performance datapath logic, Ethernet MAC integration, a PCIe interface with a dedicated DMA engine, and precise timestamping support—illustrating how much of a SmartNIC’s functionality is anchored in the mechanics of moving data and metadata deterministically at line rate [
38]. From the shell perspective, these blocks form the non-negotiable baseline: regardless of higher-level programmability, a SmartNIC must sustain high DMA throughput, absorb burstiness, and preserve low and stable latency under contention.
Where shells differ is in how they structure programmability around that baseline. Frameworks such as ClickNP and DrawerPipe explicitly organize the datapath as composable stages/elements so that packet-processing functions can be assembled from reusable modules and mapped to pipelines without rewriting the entire NIC substrate [
36,
39]. In hardware terms, this typically translates to (i) a clear separation between a streaming packet path and a sideband metadata/control path, (ii) explicit stage boundaries that allow pipelining and replication, and (iii) a restricted, predictable interface between stages that improves timing closure and enables reuse across multiple functions. This architectural choice trades absolute micro-optimization for faster development and safer composition: the shell “pays” some infrastructure overhead so that user logic can be inserted as modules with well-defined backpressure and resource expectations.
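The idea of explicit stage boundaries with well-defined backpressure can be sketched in software as a chain of stages connected by bounded FIFOs. This is a simplified, cycle-stepped model written for illustration only; the `Stage` class, the `step` function, and the FIFO depth are assumptions and do not reproduce the ClickNP or DrawerPipe interfaces.

```python
from collections import deque

class Stage:
    """One pipeline stage: a transform plus a bounded output FIFO."""
    def __init__(self, name, fn, depth=2):
        self.name, self.fn, self.depth = name, fn, depth
        self.fifo = deque()

    def ready(self):
        # Backpressure signal: accept input only while the FIFO has room.
        return len(self.fifo) < self.depth

    def push(self, pkt):
        assert self.ready(), "upstream ignored backpressure"
        self.fifo.append(self.fn(pkt))

def step(stages, source, sink):
    """Advance the pipeline by one cycle, draining from the last stage back."""
    if stages[-1].fifo:
        sink.append(stages[-1].fifo.popleft())
    for up, down in reversed(list(zip(stages, stages[1:]))):
        if up.fifo and down.ready():
            down.push(up.fifo.popleft())
    if source and stages[0].ready():
        stages[0].push(source.popleft())

# Hypothetical three-stage datapath: parse -> classify -> schedule.
stages = [
    Stage("parse", lambda p: {**p, "parsed": True}),
    Stage("classify", lambda p: {**p, "queue": p["flow"] % 4}),
    Stage("schedule", lambda p: p),
]
source = deque({"flow": i} for i in range(8))
sink = []
while len(sink) < 8:
    step(stages, source, sink)
print([p["queue"] for p in sink])
```

Because every stage sees only `ready()` and `push()`, a stage can be replaced or replicated without touching its neighbours, which is the software analogue of the restricted inter-stage interface described above.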
A second axis of differentiation is how shells treat memory hierarchy, reconfiguration boundaries, and isolation. Designs that target multi-service deployment and fast rollout increasingly introduce hierarchical regions and runtime-controlled reconfiguration mechanisms (e.g., multi-tenant regions and service management layers), pushing partial reconfiguration and explicit boundary protocols into the core shell architecture [
40]. Similarly, cloud-oriented FPGA SmartNIC designs emphasize safe sharing of hardware resources and scaling of task graphs through different forms of parallelism, which has direct architectural consequences for buffer partitioning, arbitration, and scheduling across multiple tenants or services [
5]. These goals also expose fundamental limitations: shared DMA engines, shared memories, and shared on-card accelerators can create cross-tenant interference unless the shell provides explicit performance and security isolation mechanisms. Work on SmartNIC isolation (even when demonstrated on SoC-based SmartNICs) makes the underlying point clear: without deliberate partitioning and enforcement, contention and side channels become first-order constraints on deployability [
41,
42].
Fundamental Hardware Limitations
Despite their programmability, FPGA SmartNIC shells face a set of recurring, hardware-rooted limits that shape achievable throughput, latency, and deployability. First, the I/O boundary is often the dominant bottleneck: PCIe DMA engines and their associated buffering/interrupt or doorbell mechanisms impose practical limits on sustained host throughput and latency variance, especially under many small messages or high queue counts. Second, the on-card memory hierarchy introduces contention effects that are easy to underestimate: even with HBM, performance depends on banking, access patterns, and NoC arbitration, and poorly partitioned buffers or shared metadata structures can create head-of-line blocking that manifests as tail-latency spikes. Third, timing closure and floorplanning become first-order constraints as soon as the shell supports large pipelines, multi-service composition, or partial reconfiguration regions; long interconnects, cross-region routes, and boundary crossings can dominate critical paths and reduce achievable frequency, while “infrastructure” logic (crossbars, schedulers, monitors, CDC, and reset/clock trees) consumes a non-trivial fraction of resources. Finally, isolation is not free: when multiple functions share MACs, DMA engines, memories, or accelerators, the shell must provide explicit mechanisms for bandwidth budgeting, backpressure propagation, and state partitioning; otherwise, interference and microarchitectural side channels can negate the benefits of multi-tenancy even if functional isolation is maintained. These limitations motivate shell designs that treat resource partitioning, predictable arbitration, and reconfiguration-aware floorplans as core architectural elements rather than as afterthoughts.
4.3. Representative Shells and Design Philosophies
Table 8 summarizes representative FPGA SmartNIC shells and closely related datapath frameworks. The set spans multiple design philosophies: (1) production NIC cores emphasizing stable interfaces and RTL control (e.g., Corundum [
38]); (2) vendor-maintained shells that prioritize driver stability and platform integration (e.g., OpenNIC [
43]); (3) research shells designed for rapid iteration across offload ideas (e.g., ClickNP [
39]); and (4) cloud-oriented, multi-tenant shells that treat isolation, composability, and dynamic service rollout as first-order concerns (e.g., SuperNIC [
5], Janus [
6], and Coyote [
40]). In addition, P4-enabled SmartNIC designs demonstrate how “programmable parsing + match/action” can be integrated into an FPGA NIC to support slicing and service-driven reconfiguration at the dataplane level [
44].
Design philosophies, suitable use cases, and limitations. Although
Table 8 compares shells on features, their design philosophies strongly shape where they are most effective and what their hard limits are. Corundum, for instance, is closest to a production NIC core: it prioritizes a clean, verifiable RTL NIC micro-architecture (queues, DMA, and datapath control), making it well suited when the contribution is a new NIC feature, a latency-critical inline primitive, or a reproducible research NIC baseline; the trade-off is that it largely assumes hardware-centric development and does not aim to provide cloud-style service composition or PR-based rollout mechanisms out of the box [
38]. OpenNIC follows the opposite philosophy: it is a platform-first vendor shell that emphasizes board bring-up and a stable host-facing software stack (e.g., QDMA/DPDK integration), which makes it a pragmatic base for systems papers that need “working 100 GbE quickly” and for deploying monitoring, measurement, or match/action pipelines as plug-in datapath modules; however, the same platform emphasis implies a non-trivial fixed infrastructure overhead and—because the shell is typically static—functional evolution often requires full rebuild/redeployment rather than online PR updates. Concretely, DUMBO reports that a significant fraction of resources can be attributed to the fixed OpenNIC infrastructure and measures a baseline OpenNIC latency of roughly 960 ns [
47]. Recent work also leverages OpenNIC-like shells specifically to study system-level objectives (e.g., energy-aware networking and OS integration), reinforcing the view that some shells are optimized for deployability rather than maximal datapath minimalism [
48]. Research-first frameworks such as ClickNP are optimized for rapid datapath iteration via a modular programming model (Click-style elements compiled to FPGA modules), which is a strong fit for prototyping middleboxes, new scheduling/measurement logic, or feature exploration; the limitation is that such frameworks typically lag behind in absolute throughput targets and full offload completeness (e.g., relying on the host for TCP), and their abstraction can reduce fine-grained control over timing and resource utilization [
39]. Finally, cloud-oriented shells such as Coyote treat composability, isolation, and rollout as first-order goals: PR regions, explicit boundary protocols, and runtime controllers enable safer multi-service deployment but at the cost of larger static footprint, more demanding timing/floorplanning, and additional engineering for performance/security isolation in multi-tenant settings [
40,
41,
42]. Specialized designs (e.g., RDMA/verbs-centric RecoNIC or GPU-centric SmartNICs) are best matched to RDMA-heavy storage/AI clusters where zero-copy semantics dominate; their primary limitation is portability, since they depend on platform-specific RDMA/PCIe/IOMMU assumptions and software hooks to preserve those semantics end-to-end [
45,
49].
4.4. Architecture Patterns
Parallelism is not an optional optimization in SmartNICs; it is the mechanism that makes line-rate processing feasible under multi-port operation and concurrent services. At the hardware level, modern SmartNIC datapaths are inherently multi-lane (multiple MAC/PCS lanes, multiple queue pairs, and multiple memory channels/banks), and throughput scales only if packet handling, metadata generation, DMA, and optional preprocessing can run concurrently without serial bottlenecks. In practice, this means exploiting parallel receive/transmit queues (to match multi-core hosts and multiple flows), decoupling stages through pipelining (parser → match/action → scheduling → DMA/RDMA), and distributing state and buffers across on-card memories (e.g., HBM banking or multi-channel DDR) to avoid contention. The same requirement becomes stricter when a shell supports multiple reconfigurable regions or multiple services: parallelism must be paired with isolation (bandwidth budgeting, backpressure, and per-tenant resource limits) to prevent one workload from degrading others. The main limitations are architectural rather than conceptual: increasing concurrency raises floorplanning and timing-closure difficulty, amplifies NoC/memory arbitration effects, and can introduce latency variance if shared structures (queues, caches, HBM ports, and PCIe DMA engines) are not carefully partitioned. For these reasons, we treat “parallel data processing” as a first-order design objective that directly determines achievable throughput, latency stability, and safe multi-service composition on a SmartNIC.
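The multi-queue principle can be illustrated with a small flow-steering sketch: each packet's 5-tuple is hashed to a queue index so that packets of the same flow stay in order on one queue while different flows spread across queues. Receive-side scaling in real NICs typically uses a Toeplitz hash; CRC32 is used here only as a convenient stand-in, and the `steer` function and the example 5-tuples are hypothetical.

```python
import zlib

def steer(flow: tuple, num_queues: int) -> int:
    """Map a 5-tuple to a queue index; identical flows always map identically."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_queues  # stand-in for a Toeplitz RSS hash

# 1000 synthetic UDP flows (protocol 17) that differ only in source port.
flows = [("10.0.0.1", "10.0.0.2", 17, 40000 + i, 4791) for i in range(1000)]
load = [0] * 8
for f in flows:
    load[steer(f, 8)] += 1
print(load)  # per-queue flow counts
```

The per-queue counts show how evenly the hash spreads flows; a skewed result would correspond to the serial bottleneck that the text warns about.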
Table 9 groups today’s shells by the architectural pattern they implement—ranging from fully static designs that require a fresh bit-stream for every change to hierarchical shells that hot-swap both infrastructure and user kernels. Each pattern is accompanied by concrete adopters and a concise list of benefits and trade-offs.
4.5. Features
While the architectural pattern gives a high-level overview, practical deployment depends on concrete capabilities: maximum port count, whether dynamic PR is supported, which protocol engines are embedded into the shell, and what software tool-chain developers must use.
Table 10 shows these attributes for the six shells introduced earlier.
4.6. Programming-Model Spectrum
The six shells illustrate a continuum that balances raw hardware control against developer productivity. At one extreme, Corundum exposes only synthesizable Verilog; designers enjoy cycle-accurate freedom but must write RTL and close timing themselves, a workload suited to hardware specialists and production NIC vendors. Moving up the abstraction ladder, Coyote and FpgaNIC wrap the datapath in Vitis HLS templates: kernel authors describe packet-side logic in C/C++, leaving the shell to manage AXI buses, DMA, and resets. RecoNIC and Coyote go a step further by accepting P4 descriptions of header parsing and match–action blocks, enabling network researchers to prototype new transport formats without touching RTL. At the highest level sits ClickNP, whose Click-style “elements” compile to HLS modules and can be re-wired at runtime, letting a systems engineer prototype router pipelines with the same model used in software. The cost of abstraction is twofold: less precise control over timing and resource utilization and, except for Coyote’s hierarchical design, limited ability to combine high-level and low-level modules in a single bit-stream.
4.7. Build-Time and Run-Time Reconfiguration Costs
Partial reconfiguration is operationally meaningful only if two practical costs are controlled: offline build latency (how long it takes to produce a full or partial bitstream after a change) and on-card PR latency (how long traffic must be quiesced—or rerouted—while a region is updated).
Table 11 aggregates the best-case figures reported by each project. The central engineering trade-off is visible in these numbers: designs that invest in hierarchical partitioning, strict interface contracts, and PR controllers often pay a larger static footprint and a more demanding timing-closure process, but they enable one to two orders-of-magnitude faster “deploy new logic” cycles in practice.
4.8. PR in Production: Orchestration, Reliability, and Toolchain Support
In production environments, PR is not only a reconfiguration mechanism; it becomes a distributed systems problem. A safe PR rollout typically requires (i) traffic orchestration (drain or divert flows away from the target region, preserve ordering where required, and bound packet loss), (ii) region quiescence (ensure there are no in-flight transactions across the region boundary), and (iii) rollback semantics (retain a known-good configuration and revert on timeout or validation failure). These requirements strongly influence shell architecture: designs that provide bypass paths, per-region performance isolation, and explicit boundary protocols can treat PR as an online operation rather than a maintenance window.
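The three rollout requirements can be sketched as a small orchestration routine. This is a minimal illustration under stated assumptions: the region is modelled as a dictionary, and the `drain`, `load`, `validate`, and `resume` callbacks are hypothetical hooks rather than the API of any cited shell. Timeout handling is omitted for brevity.

```python
from enum import Enum, auto

class Outcome(Enum):
    ACTIVATED = auto()
    ROLLED_BACK = auto()

def rollout(region, new_image, *, drain, load, validate, resume):
    """Drain, check quiescence, load, validate, and revert on failure."""
    known_good = region["image"]          # retained for rollback
    drain(region)                         # (i) divert flows away from the region
    if region["inflight"] != 0:           # (ii) region must be quiescent
        resume(region)
        return Outcome.ROLLED_BACK
    load(region, new_image)
    if not validate(region):              # (iii) revert on validation failure
        load(region, known_good)
        resume(region)
        return Outcome.ROLLED_BACK
    resume(region)
    return Outcome.ACTIVATED

# Hypothetical callbacks for a dry run.
def drain(r):
    r["traffic"], r["inflight"] = "bypass", 0

def resume(r):
    r["traffic"] = "region"

def load(r, image):
    r["image"] = image

region = {"image": "v1", "inflight": 3, "traffic": "region"}
good = rollout(region, "v2", drain=drain, load=load,
               validate=lambda r: r["image"] == "v2", resume=resume)
bad = rollout(region, "v3", drain=drain, load=load,
              validate=lambda r: False, resume=resume)
print(good, bad, region["image"])
```

In the second call, validation fails and the region ends up on the previously activated image again, which is the rollback behaviour that lets PR be treated as an online operation.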
Reliability concerns also matter at scale. Even if PR latency is sub-second, field operation must account for configuration integrity, transient faults, and bitstream management. At minimum, shells should support authenticated bitstreams, validation before activation, and operational tooling that makes PR artifacts reproducible (clear provenance from source commit to partial bitstream). Toolchain support is equally critical: reproducible floorplanning, stable interface timing, and clear separation between shell timing closure and user logic closure reduce the “integration tax” that otherwise prevents PR from being used outside research prototypes. A broad perspective on these multi-tenant and operational constraints—including attack surfaces introduced by user-programmable regions—is surveyed in [
50].
4.9. Security: Secure Boot, Bitstream Integrity, and Side Channels
Shells change the security model of a NIC by allowing hardware logic to evolve post-deployment. This creates two security imperatives. First, platforms need a chain of trust from boot to runtime: secure boot of the management controller/host agent that provisions the FPGA, authenticated loading of the base shell image, and authenticated loading of partial bitstreams for PR regions. Second, shells must assume that bitstreams are an adversarial input in multi-tenant settings: protections should address bitstream tampering, replay/rollback of vulnerable configurations, and unauthorized modification of PR payloads in transit or at rest.
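A minimal sketch of authenticated loading with rollback protection is shown below. It is an illustration only: a production chain of trust would normally rely on asymmetric signatures and hardware-protected keys, whereas this example uses an HMAC with a shared key to stay within the Python standard library. The container layout (4-byte version, 32-byte tag, payload) and both function names are assumptions.

```python
import hashlib
import hmac
import struct

def sign_bitstream(key: bytes, version: int, payload: bytes) -> bytes:
    """Prepend a version and an HMAC-SHA256 tag covering version + payload."""
    header = struct.pack(">I", version)
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + tag + payload

def verify_bitstream(key: bytes, blob: bytes, min_version: int) -> bytes:
    """Return the payload if it is authentic and not older than min_version."""
    header, tag, payload = blob[:4], blob[4:36], blob[36:]
    expected = hmac.new(key, header + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):       # tampering check
        raise ValueError("bitstream authentication failed")
    (version,) = struct.unpack(">I", header)
    if version < min_version:                        # anti-rollback check
        raise ValueError("rollback to older bitstream rejected")
    return payload

key = b"demo-key-not-for-production"
blob = sign_bitstream(key, version=7, payload=b"\x00partial-bitstream\x01")
print(verify_bitstream(key, blob, min_version=5))
```

Binding the version into the authenticated data is what prevents an attacker from replaying an older, validly signed but vulnerable configuration.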
Standardized guidance for protecting EDA/IP artifacts (including encryption and key management practices used in hardware design flows) is captured by IEEE Std 1735 (useful as a baseline reference point for production-grade protection expectations) [
51]. For cloud-style FPGA services, the most salient risks include cross-region information leakage (timing/power side-channels and shared-resource contention), boundary violations via malformed interfaces, and supply-chain risk in the PR artifact pipeline; these issues and their mitigations are discussed at survey depth in [
50].
4.10. Strengths and Weaknesses of FPGA Shells
Different shells excel in different use-cases; no single design dominates every metric.
Table 12 shows the standout advantage and the principal limitation of each shell, giving an overall picture of where a given platform will (or will not) fit a project’s requirements.
4.11. Takeaways for a New Shell Design
A new shell design must meet a small set of concrete requirements derived from the discussion above. First, sub-second partial reconfiguration should be treated as a hard requirement: in a 100 Gb/s deployment, even one minute of downtime forfeits more than 700 GB of potential throughput, and hierarchical approaches validated by Coyote v2 and RecoNIC indicate that service-transparent PR latencies below 350 ms are attainable without compromising timing closure. Second, the shell should expose a minimal set of hardened fast-path engines to broaden applicability; embedding essential blocks—such as RoCE verbs handling, IEEE 1588 timestamping, and checksum generation—spares application teams from repeatedly re-implementing line-rate plumbing and allows the platform to function as an immediate drop-in NIC rather than merely an FPGA board. Third, a stable, well-documented software stack is a decisive adoption factor: shells accompanied by mature drivers and user-space libraries (e.g., OpenNIC, Corundum, and Coyote) tend to see disproportionate uptake compared to technically rich but API-sparse alternatives, and sustained investment in a long-lived C/P4/HLS SDK typically yields higher dividends than ultra-aggressive area optimization. Fourth, high-level entry points—HLS, P4, or Click-style composition—meaningfully compress innovation cycles; while they may consume roughly 10% additional fabric, they can reduce idea-to-bitstream turnaround from weeks to days, which is an acceptable trade-off in both academic research and prototyping. Fifth, static-region budget is finite and must be planned explicitly: experience with hierarchical shells suggests that infrastructure can appropriate 25–40% of total LUTs, so capacity planning should reserve headroom for at least two large user accelerators. 
Finally, zero-copy peer paths (e.g., GPU P2P and shared RDMA queues) are becoming decisive differentiators; we therefore treat GPU ↔ FPGA offload as a first-class design axis and discuss the detailed datapath models and constraints (topology, buffer registration, and completion semantics) in
Section 5.
Implication: an industry-ready FPGA SmartNIC shell has to combine (i) deterministic, sub-second PR support, (ii) a curated but indispensable set of fixed offloads, (iii) a vendor-agnostic, version-stable SDK, and (iv) explicit architectural provision for zero-copy accelerator paths—all while preserving sufficient reconfigurable resources for future, as-yet-unknown workloads.
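The downtime figure cited in the first requirement above follows from simple arithmetic, reproduced here as a short sketch. The function name is illustrative, and the calculation assumes the link would otherwise be fully utilized.

```python
def forfeited_gigabytes(line_rate_gbps: float, downtime_s: float) -> float:
    """Data that a fully loaded link could have carried during the outage (GB)."""
    return line_rate_gbps * downtime_s / 8  # bits -> bytes

one_minute = forfeited_gigabytes(100, 60)
sub_second = forfeited_gigabytes(100, 0.35)
print(f"60 s outage at 100 Gb/s:  {one_minute:.1f} GB")
print(f"350 ms PR at 100 Gb/s:    {sub_second:.3f} GB")
```

One minute at 100 Gb/s corresponds to 750 GB, consistent with the "more than 700 GB" figure, while a 350 ms reconfiguration window corresponds to a few gigabytes.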
5. GPU–FPGA Direct Offload
Contemporary SmartNIC innovation now targets the inverse bottleneck: removing the host CPU from GPU-to-network egress paths. In accelerator-centric servers, large result tensors (video-analytics metadata, ML inference outputs, or compressed columnar data) originate inside GPU HBM. In a classical pipeline, the egress journey still involves (i) a DMA pull from GPU memory into host DRAM, (ii) CPU-driven protocol framing, checksumming, and congestion handling, and (iii) a second DMA push from host DRAM to the NIC. At 100–400 Gb/s, these double copies and the associated cache traffic can consume dozens of CPU cores while adding no computational value.
A GPU → FPGA → Network offload path collapses the chain. Leveraging peer-to-peer PCIe Gen4/5 (and, prospectively, cache-coherent CXL 3.0), the FPGA SmartNIC initiates DMA reads that fetch payloads directly from GPU HBM. Packetisation, protocol termination (TCP/UDP/RoCE), and line-rate scheduling are executed in deterministic FPGA logic; frames depart the NIC without ever traversing host memory. The CPU is relegated to low-duty control-plane tasks, freeing cache bandwidth. Prototype systems (e.g., FpgaNIC, RecoNIC) have demonstrated 4–6 µs reductions in tail latency at 100 Gb/s and release of a full CPU socket’s worth of cycles on inference nodes.
This mechanism offers two distinct advantages. First, it reshapes the data-delivery path into a direct GPU-to-wire path that never touches host DRAM. Second, it maximizes GPU utilization, ensuring that accelerators are not throttled by PCIe round-trips and CPU copies.
Typical deployment domains for this architecture include distributed deep-learning training, where gradient blocks are emitted directly from GPU HBM, optionally pre-compressed or aggregated on the FPGA, and then streamed onto an RDMA fabric without CPU involvement. Another domain is LLM serving gateways: GPUs generate token logits while the FPGA performs final formatting, batching, and flow control, pushing responses straight to client connections. The same pattern applies to real-time video analytics and XR, in which feature maps or inference metadata leave GPU memory, undergo watermarking or encryption in FPGA fabric, and reach the network within a frame budget on the order of tens of milliseconds. In storage disaggregation, GPUs can handle compression or erasure coding while the FPGA encapsulates the resulting chunks into NVMe-oF or SPDK-oriented transports end-to-end, again avoiding CPU participation on the data path. Finally, ultra-low-jitter trading systems can use GPU-based risk engines to stream price vectors while the SmartNIC attaches FIX/OUCH headers and enforces pacing to satisfy stringent tick-to-trade latency and jitter targets (e.g., sub-25 μs SLAs).
A system that bypasses the CPU egress path (depicted in
Figure 1) relies on a small set of enabling technologies. First, it needs a peer-to-peer physical interconnect that can sustain the required bandwidth between the GPU and the FPGA. Second, it requires an addressing and permission model that allows the FPGA to treat GPU memory as a valid DMA target, rather than forcing all transfers through host memory. Third, the design depends on symmetrical DMA capability on both sides—i.e., both devices must be able to initiate and complete the relevant memory transactions and synchronization semantics efficiently. Finally, looking forward, cache-coherent load/store semantics would further simplify the datapath by eliminating explicit copy operations altogether, enabling true shared-memory communication between the GPU and FPGA.
Table 13 summarizes the state of each layer and cites representative vendor or standards documents.
5.1. Application Domains to Benefit from GPU → FPGA → Network Offload
Distributed deep-learning training. State-of-the-art frameworks (e.g., NCCL over RoCE) spend up to 30% of training time in gradient aggregation. With peer-to-peer DMA, the SmartNIC can pull tensors directly from GPU HBM, perform a first-stage reduction in FPGA logic, and transmit a single aggregated vector on the wire. The FpgaNIC prototype reports a 1.3× wall-clock speed-up on 100 Gb/s links, while freeing an entire CPU socket for auxiliary workload management [
49].
Large-language-model (LLM) serving. Tokenisation and prompt-cache lookup are latency-sensitive yet structurally simple; implementing these primitives in the FPGA allows the GPU to focus on matrix multiplies. Early benchmarks on GPUDirect-enabled BlueField systems show a 20–25% improvement in tokens-per-second for GPT-style models when request preprocessing is offloaded from the CPU [
52].
Real-time video analytics and XR. Edge nodes ingest uncompressed 4K/8K streams, run CNNs or vision transformers, and must respond within a single video frame. FPGAs can perform crop/resize and color-space conversion at line rate; results are DMA-read directly from GPU memory and forwarded with end-to-end latencies under 16 ms on Jetson-based P2P prototypes [
52].
Storage disaggregation with inline computation. In composable NVMe-oF fabrics, GPUs execute compression or erasure-coding kernels while the FPGA SmartNIC terminates NVMe headers and streams coded stripes onto the network, slashing host-DRAM traffic by more than 50% in RecoNIC demonstrations [
45].
Low-jitter financial trading pipelines. Market-data pre-filters run on the SmartNIC (FIX/OUCH parsing; order-book updates) while GPUs compute risk or pricing models. Direct DMA of result vectors to the NIC reduces cache pollution and helps meet sub-25 μs tick-to-trade service-level agreements, as documented in recent GPUDirect RDMA application notes [
52].
Collectively, published evaluations record 4–6 μs reductions in tail latency at 100 Gb/s and reclaim an entire CPU socket’s worth of cycles per server—underscoring that GPU-to-FPGA direct offload is not a niche optimization but a decisive enabler for accelerator-dominated data center workloads.
5.2. State of the Art (2024–2025)
This subsection summarizes representative SmartNIC-related developments reported in the 2024–2025 period, emphasizing what changed (capabilities, constraints, and deployment practice) relative to earlier designs.
Recent platform work points to several concrete enablers that make GPU → FPGA → network offload more practical and, increasingly, more general.
GPU-virtual-address DMA (GVAD) removes a major integration hurdle by allowing the SmartNIC to issue DMA transactions directly against GPU virtual addresses. In
FpgaNIC, GPU virtual address mappings are brought into the FPGA’s AXI master so that GPUDirect RDMA can target on-card memory (HBM) without relying on intermediate pin-down buffers [
49].
Shared RDMA verbs for in-NIC accelerators move the RDMA control plane closer to the datapath. RecoNIC exposes the host’s queue-pair table to FPGA kernels, enabling an accelerator to post operations such as
RDMA_WRITE autonomously rather than routing every action through the host [
45].
Cache-coherent attachment via CXL 3.0 suggests a future where explicit copies can be replaced by coherent load/store semantics. Early announcements for devices such as Intel Agilex 2 and AMD/Xilinx Versal Premium ES indicate CXL Type-3 capability, allowing SmartNICs to access GPU-resident data as coherent cache lines with sub-microsecond latency budgets (often quoted as <300 ns for load/store paths) [
56,
57].
PCIe 6.0 readiness matters because peer-to-peer designs are frequently bandwidth-limited at the link. Public PCIe 6.0 summaries commonly cite up to ∼256 GB/s bidirectional bandwidth for a ×16 connection, effectively doubling today’s headroom for P2P transfers and reducing pressure on compression or aggregation to “fit the pipe” [
54].
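The link-bandwidth argument can be checked with a short calculation. The sketch below uses the raw per-lane signalling rates of PCIe 4.0, 5.0, and 6.0 (16, 32, and 64 GT/s) and deliberately ignores encoding and protocol overhead, so the results are upper bounds; the function name is illustrative.

```python
# Raw signalling rate per lane in GT/s (one direction).
GT_PER_LANE = {"4.0": 16, "5.0": 32, "6.0": 64}

def raw_gbytes_per_s(gen: str, lanes: int = 16) -> float:
    """Upper-bound one-direction bandwidth in GB/s, ignoring all overheads."""
    return GT_PER_LANE[gen] * lanes / 8

nic_gbytes_per_s = 400 / 8  # a 400 Gb/s port moves 50 GB/s
for gen in GT_PER_LANE:
    one_way = raw_gbytes_per_s(gen)
    verdict = "can" if one_way >= nic_gbytes_per_s else "cannot"
    print(f"PCIe {gen} x16: {one_way:.0f} GB/s per direction, "
          f"{2 * one_way:.0f} GB/s bidirectional; {verdict} feed 400 Gb/s")
```

Under these assumptions a PCIe 6.0 ×16 link reaches 128 GB/s per direction (256 GB/s bidirectional), whereas a PCIe 4.0 ×16 link falls short of a single 400 Gb/s port even before overheads are counted.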
Edge-scale GPUDirect on ARM broadens the deployment envelope. NVIDIA’s Jetson Orin driver support for external FPGA BARs under GPUDirect RDMA opens the door to sub-15 W inference gateways where the SmartNIC performs in-line preprocessing, packaging, or security functions while the GPU focuses on inference [
52].
5.3. Design Checklist
To bridge the discussion on hardware limitations with practical future implementations, the design checklist (depicted in
Table 14) defines the technical benchmarks required to address current bottlenecks with a next-generation GPU-aware FPGA SmartNIC shell.
6. In-NIC Preprocessing for RDMA Pipelines
Remote Direct Memory Access (RDMA) has become the de facto transport for east–west traffic in hyperscale data centers, underpinning Azure’s storage fabric [
2], Meta’s Memcache deployments [
58], and every modern NVMe-oF disaggregation layer. By allowing a user process to issue a
WRITE or
READ that bypasses the remote CPU, RDMA reduces latency to single-digit microseconds and lowers host-cycle consumption by an order of magnitude. The trade-off is bandwidth efficiency: an RDMA work request transmits its payload exactly as provided by the application. Any aggregation, compression, or application-level filtering that could shrink or otherwise optimize the payload must occur before the work request is posted. In practice, that preprocessing is still executed in software, reintroducing the memory copies, cache pollution, and context switches that RDMA was meant to avoid.
FPGA-based SmartNICs provide the missing component. When preprocessing operates on GPU-resident buffers via PCIe P2P, the underlying GPU → FPGA transfer models are the same as those detailed in
Section 5; this section focuses on what to preprocess and how it composes with RDMA semantics.
A shell equipped with streaming parsers, match–action tables, and on-board scratchpads can transform data inside the NIC before the first RDMA_WRITE reaches the wire. This preserves RDMA’s low-latency promise while reclaiming PCIe bandwidth and host CPU cycles. Published pipelines report tangible gains across multiple workload classes.
Gradient aggregation in the NIC (as demonstrated by SwitchML and FpgaNIC) reduces east–west traffic volume by more than
and trims all-reduce latency by roughly 35–40% at 100 Gb/s [
49].
Inline LZ4 compression integrated into a Corundum-derived datapath increases effective link throughput from 100 Gb/s to around 170 Gb/s for log-streaming workloads, while adding less than 20 W of board power [
38].
Key-prefix filtering implemented in FPGA logic prior to an RDMA GET reduces Memcached tail latency by approximately 60 μs under hotspot workloads [
59].
Hence, in-NIC preprocessing is a structural requirement for the next wave of RDMA-centric fabrics: it aligns the granularity of data on the wire with the semantics of the application—without revisiting the host CPU that RDMA set out to bypass.
6.1. RDMA Fast-Path Explanation
RDMA exposes a transport in which user space posts a Work Queue Element (WQE) that the NIC executes without host involvement; completion is reported through an entry written to a Completion Queue (CQ).
Figure 2 highlights the critical objects.
Protection Domain (PD)—a capability container that binds memory regions, queue pairs, and completion queues.
Memory Region (MR)—a page-pinned address range translated by the HCA/IOMMU; referenced by an lkey/rkey.
Queue Pair (QP)—transmit and receive rings holding WQEs. Common opcodes are RDMA_WRITE, RDMA_READ, and SEND/RECV.
Completion Queue (CQ)—ring of Completion Queue Entries that report success, error, and byte count.
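The relationship between these four objects can be illustrated with a toy, single-process model. It is not the verbs API: the class and method names are hypothetical, both endpoints are collapsed into one process, and the `process` method stands in for the work that the NIC hardware performs.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryRegion:
    rkey: int
    buf: bytearray

@dataclass
class ProtectionDomain:
    regions: dict = field(default_factory=dict)  # rkey -> MemoryRegion

    def register(self, rkey: int, size: int) -> MemoryRegion:
        mr = MemoryRegion(rkey, bytearray(size))
        self.regions[rkey] = mr
        return mr

@dataclass
class QueuePair:
    pd: ProtectionDomain
    send_queue: deque = field(default_factory=deque)  # pending WQEs
    cq: deque = field(default_factory=deque)          # completion entries

    def post_write(self, wr_id, rkey, offset, payload):
        """Enqueue a WRITE work request; nothing executes yet."""
        self.send_queue.append((wr_id, rkey, offset, bytes(payload)))

    def process(self):
        """Stand-in for the NIC: execute each WQE and emit one CQE."""
        while self.send_queue:
            wr_id, rkey, offset, payload = self.send_queue.popleft()
            mr = self.pd.regions.get(rkey)
            if mr is None or offset + len(payload) > len(mr.buf):
                self.cq.append({"wr_id": wr_id, "status": "error", "bytes": 0})
                continue
            mr.buf[offset:offset + len(payload)] = payload
            self.cq.append({"wr_id": wr_id, "status": "success",
                            "bytes": len(payload)})

pd = ProtectionDomain()
mr = pd.register(rkey=0x1234, size=16)
qp = QueuePair(pd)
qp.post_write(wr_id=1, rkey=0x1234, offset=4, payload=b"data")
qp.post_write(wr_id=2, rkey=0xBAD, offset=0, payload=b"oops")  # unknown rkey
qp.process()
print(list(qp.cq))
```

The second request fails because its key does not resolve to a registered region, mirroring how the protection domain and memory-region keys gate remote access.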
6.1.1. Latency Budget
On modern HCAs (e.g., ConnectX-6 Dx) a
WRITE WQE requires only
ns from doorbell to first byte on the wire, and a further
ns until completion signaling at a line rate of 100 Gb/s [
60]. However, this budget excludes any application-level marshaling performed prior to posting the WQE, which is precisely where in-NIC preprocessing offers substantial savings.
6.1.2. Role of a SmartNIC Shell
A reconfigurable shell can tap the AXI-stream path before the HCA serializer: parsers or compute kernels rewrite, compress, or aggregate the payload and then pass a compacted buffer to the DMA engine, leaving the RDMA transport unchanged. Because the QP state (PSN, ACKs, and congestion control) resides inside the shell, the accelerator needs to implement only the data transform.
6.1.3. Transport Variants
Although InfiniBand defines the verbs model, most data centers deploy RDMA over Ethernet. RoCE v2 (Routable RDMA over Converged Ethernet) carries the InfiniBand transport headers and payload over UDP/IP, with the IP header taking the place of the InfiniBand GRH, and is ubiquitous at 25 Gb/s and above. iWARP carries the verbs over TCP; it is largely superseded in modern clusters but can remain relevant for long-haul or less tightly controlled networks. NVMe-oF over RDMA reuses SEND/RECV opcodes to transport NVMe command capsules, enabling microsecond-scale access latencies.
Understanding these mechanisms clarifies where a preprocessing kernel must attach: after virtual-address translation but before transport encapsulation, so correctness is preserved while the host-CPU bypass is maximized.
6.2. Taxonomy of In-NIC Preprocessing Tasks
Network payloads can be modified at five distinct semantic layers before an RDMA_WRITE or RDMA_READ is issued. Classifying these transforms clarifies which hardware blocks and shell services must be present to support a given use case.
6.2.1. Header Manipulation
Inline checksum and CRC generation is a common requirement in storage fabrics such as NVMe-oF and iSCSI-RDMA, which often demand per-segment integrity checks; computing these values on the FPGA avoids stalling the host or GPU on the critical path. Another class of header-level work is header compression and expansion, where proprietary overlays compress private-data fields and SmartNIC logic restores them just before ingress to the remote NIC/HCA.
6.2.2. Payload Shaping
A prominent payload transform is vector aggregation and reduction. SwitchML and FpgaNIC aggregate machine-learning gradients in on-NIC BRAM/HBM, reducing east–west traffic by about
and shaving roughly 35% off all-reduce latency [
49]. A second payload-shaping primitive is scatter–gather coalescing: repacking multiple small RDMA regions into a single contiguous buffer reduces work-queue-entry (WQE) count and lowers PCIe overhead.
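The aggregation primitive can be illustrated with a minimal slot-based accumulator. This is a sketch of the general idea rather than the SwitchML or FpgaNIC design: the class name, the duplicate-handling rule, and the 16-bit fixed-point scale are all assumptions made for the example.

```python
class AggregationSlot:
    """Sums one chunk position across workers and releases it when complete."""

    def __init__(self, num_workers, chunk_len):
        self.num_workers = num_workers
        self.acc = [0] * chunk_len
        self.seen = set()

    def add(self, worker_id, chunk):
        """Accumulate one worker's chunk; return the sum once all have arrived."""
        if worker_id in self.seen:          # duplicate, e.g. a retransmission
            return None
        if len(chunk) != len(self.acc):
            raise ValueError("chunk length mismatch")
        self.seen.add(worker_id)
        for i, value in enumerate(chunk):
            self.acc[i] += value
        if len(self.seen) == self.num_workers:
            return list(self.acc)
        return None


def to_fixed(values, scale=1 << 16):
    """Convert floats to integers so that addition is exact and order-independent."""
    return [round(v * scale) for v in values]


def from_fixed(values, scale=1 << 16):
    return [v / scale for v in values]
```

With N workers, N chunks enter the slot and a single chunk leaves it, which is where the traffic reduction comes from. Integer accumulation is used because it maps naturally onto simple adders and avoids floating-point ordering effects.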
6.2.3. Content Filtering
Bloom-filter discard enables early rejection of requests that would otherwise waste host cycles. KV-Direct computes a 64-bit Bloom hash on the NIC and can drop 60–70% of cache-miss probes without touching the host [59]. Beyond fixed filters, programmable ACL and DPI pipelines can enforce policy at line rate; for example, FlowBlaze’s stateful match–action design blocks unwanted flows at 40 Gb/s while adding less than 3 μs of latency [61].
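The early-rejection idea can be shown with a small Bloom filter. The sketch below is generic and does not reproduce the KV-Direct design: the filter size, the number of hash functions, and the use of BLAKE2b with double hashing are assumptions chosen for a compact software example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive all bit positions from two 64-bit halves of a single digest.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def filter_probes(bloom, keys):
    """Forward keys that may be present; definite misses are dropped early."""
    return [k for k in keys if bloom.might_contain(k)]
```

Because a Bloom filter never produces false negatives, dropping a probe that fails the test is always safe. False positives only mean that some misses still reach the host.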
6.2.4. Security and Integrity
At the security layer, SmartNICs can apply inline encryption and message authentication. For instance, P4-programmable AES-GCM datapaths can encrypt payloads before an RDMA verb is posted, supporting zero-trust and multi-tenant policies. Another security transform is redaction and tokenisation, where the FPGA scrubs PII fields or injects tokens prior to RDMA transfer, offloading compliance logic from the GPU.
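As a software analogy for the redaction and authentication steps, the sketch below replaces selected fields with keyed tokens and appends a message authentication tag before the buffer would be handed to the transport. It uses HMAC-SHA256 from the Python standard library and provides integrity only. It is not a substitute for the AES-GCM datapath mentioned above, which would additionally encrypt the payload. All function names and the token format are hypothetical.

```python
import hashlib
import hmac

TAG_LEN = 32  # bytes in an HMAC-SHA256 tag


def tokenise_record(record, pii_fields, key):
    """Return a copy in which each listed field is replaced by a keyed token."""
    out = dict(record)
    for name in pii_fields:
        if name in out:
            mac = hmac.new(key, str(out[name]).encode("utf-8"), hashlib.sha256)
            out[name] = "tok_" + mac.hexdigest()[:16]
    return out


def seal(payload, key):
    """Append an HMAC-SHA256 tag so that the receiver can verify integrity."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def open_sealed(blob, key):
    """Verify and strip the tag; raise ValueError if the payload was altered."""
    if len(blob) < TAG_LEN:
        raise ValueError("blob too short")
    payload, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return payload
```

The tokens are deterministic for a given key, so downstream joins on the tokenised field still work, while the original value never leaves the card in clear text.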
6.2.5. Application-Specific Transforms
Some transforms are tightly tied to specific application domains. In distributed training, gradient sparsification and quantization (e.g., top-k selection or 8-bit quantization) can reduce traffic by an order of magnitude without measurable accuracy loss in reported systems [62]. In telemetry and logging pipelines, delta coding for log streams has been demonstrated in a Corundum-derived datapath with on-NIC LZ4+delta compression, achieving an effective 170 Gb/s on a 100 Gb/s link while consuming approximately 20 W of additional board power [38].
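The two gradient transforms named above can be sketched in plain Python as follows. The sketch assumes a symmetric linear 8-bit quantizer and a simple sort-based top-k selection. These are common textbook choices rather than the specific schemes of the cited systems, and the function names are hypothetical.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return [(i, grad[i]) for i in sorted(order[:k])]


def densify(pairs, length):
    """Rebuild a dense vector, with zeros where entries were dropped."""
    out = [0.0] * length
    for i, value in pairs:
        out[i] = value
    return out


def quantize_int8(values):
    """Symmetric linear quantization to [-127, 127]; returns (scale, codes)."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127 if peak else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_int8(scale, codes):
    return [c * scale for c in codes]
```

For a five-element gradient `[0.1, -2.0, 0.05, 1.5, -0.3]` with `k = 2`, only the entries at indices 1 and 3 are transmitted, and each value then fits in one byte plus a shared scale factor. A hardware implementation would typically replace the full sort with a streaming threshold or a small priority structure.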
Cross-Cutting Observation
Most tasks require only a narrow set of hardware primitives—streaming CRC/crypto pipes, on-chip scratchpad, and an AXI-Stream switch—suggesting that a well-chosen repertoire of fixed engines inside the shell can serve a broad spectrum of preprocessing workloads while relegating application-specific logic to partial-reconfiguration regions.
6.3. State of the Art: Published In-NIC Preprocessing Pipelines
Academic prototypes and a handful of vendor white papers already demonstrate that significant application logic can be executed inside the NIC at line rate, well before an RDMA verb reaches the wire.
Table 15 consolidates representative projects, grouped by the primary function they offload. The common pattern is clear: a streaming parser feeds a small on-chip scratchpad (typically BRAM or HBM), after which a compute kernel performs the transform and forwards a size-reduced or aggregated buffer to the RDMA engine.
Trends
Across published systems, the highest-impact results consistently depend on three enablers already discussed in Sections 4 and 5. The first is a true streaming datapath exposed by the shell (typically via AXI-Stream), which allows preprocessing to operate inline without buffering entire messages. The second is low-latency on-card state, provided either by scratchpad memories or by HBM placed close to the parser, enabling operations such as aggregation, sliding CRC windows, or small stateful filters. The third is fast reinjection of processed data into the RDMA queue pair without a CPU round trip, so that the transformed payload can be transmitted under the same verbs semantics and completion model.
As PCIe 6.0 and CXL 3.0 become available, the primary bottleneck is likely to shift from link bandwidth toward on-card memory capacity and the sophistication of the in-stream compute cores. This further underscores the need for shells that balance deterministic I/O with ample reconfigurable logic and carefully engineered memory hierarchies.
6.4. Hardware Building Blocks for In-NIC Preprocessing
Because every SmartNIC shell hides the latency-critical datapath inside its static region, only a limited set of reusable hardware primitives can afford to live there; everything else must fit into a user PR region without breaking timing at 100–400 Gb/s.
Table 16 catalogs those primitives, the functional gap they close, and examples of shells that already integrate them.
Integration Guidelines
Most shells already embed the parser, DMA pump, and clock-island logic; adding a thin, parameterizable CRC/crypto pipe and reserving 1–2 MB of BRAM for scratchpads is sufficient for 80% of the workloads listed in Section 6.2. Deep buffering or large reductions call for HBM devices, but these should be optional to keep cost tiers flexible.
7. Software Stack and Programming Layers
Hardware flexibility does not guarantee deployment success; any hardware needs to be enabled by software. In practice, SmartNIC viability is often decided less by raw datapath capability and more by whether the software stack can (i) expose stable abstractions to applications, (ii) integrate safely with the host memory/IOMMU model, and (iii) support operational workflows such as upgrades, debugging, and performance isolation. Recent systems experience argues that SmartNICs should be viewed as data-movement controllers whose value depends on end-to-end co-design across driver, runtime, and control/management software [27,63]. At the same time, programmability layers (e.g., P4 on multicore SmartNICs) introduce their own performance/portability trade-offs: the abstraction improves developer productivity but can hide microarchitectural bottlenecks, motivating automated tuning and profiling frameworks [64]. Finally, even when the dataplane is programmable, the control plane is a reliability-critical component: inconsistent or non-deterministic updates of match/action state can disrupt correctness, so update mechanisms and their timing behavior must be treated as part of the SmartNIC system design [65].
Table 17 presents an overview of the software stacks used in SmartNIC applications.
This section highlights the main types of software involved in the interaction between a SmartNIC and its host.
7.1. Host-Side Kernel Drivers
A SmartNIC is a PCIe-enabled device, so a kernel driver is required to enumerate it, configure it, and expose a safe data and control interface to user space. A first responsibility is PCIe probing and BAR mapping: drivers such as mlx5 and Intel’s idpf map doorbells and completion queues into user space via ioctl/mmap, while honoring IOMMU policies and features such as ATS that are important for GPU P2P transfers [66,67]; Section 5 discusses the end-to-end offload paths and constraints. Another key design choice is interrupt versus polling operation. At line rates of 100–400 Gb/s, datapaths typically rely on busy-polling (e.g., NAPI polling or eBPF/XDP-style processing) to avoid interrupt storms; interrupts are then reserved for link events, exceptional conditions, or partial-reconfiguration completion notifications [68]. Finally, kernel support for virtualization (e.g., SR-IOV and vDPA/virtio-net) enables multiple containers or VMs to own dedicated queues or queue pairs while sharing the same shell datapath.
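The preference for polling at these line rates follows from simple arithmetic on the per-packet time budget, which the short sketch below reproduces. It assumes standard Ethernet framing, where each frame is accompanied by 20 bytes of preamble and inter-frame gap on the wire. The function names are hypothetical.

```python
ETH_WIRE_OVERHEAD = 20  # bytes of preamble, start delimiter, and inter-frame gap


def packets_per_second(line_rate_gbps, frame_bytes):
    """Maximum frame rate for a given line rate and frame size."""
    wire_bits = (frame_bytes + ETH_WIRE_OVERHEAD) * 8
    return line_rate_gbps * 1e9 / wire_bits


def per_packet_budget_ns(line_rate_gbps, frame_bytes):
    """Time available to handle one frame if the link is kept saturated."""
    return 1e9 / packets_per_second(line_rate_gbps, frame_bytes)
```

With minimum-size 64-byte frames, the budget is about 6.72 ns per packet at 100 Gb/s and about 1.68 ns at 400 Gb/s. Even with 1500-byte frames at 400 Gb/s it is only about 30 ns. Budgets of this size leave no room for a per-packet interrupt, which is consistent with reserving interrupts for rare events as described above.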
7.2. User-Space Run-Times
Above the driver, user-space run-times provide the performance-critical interfaces that applications actually program against. DPDK and SPDK are widely used to provide zero-copy packet and NVMe buffer management, and many FPGA shells ship a DPDK poll-mode driver (PMD) layered on top of their kernel driver (e.g., dpdk_qdma) [69]. For RDMA workloads, libibverbs and RDMA-CM expose queue-pair abstractions for RoCE and iWARP; in this layer, systems such as RecoNIC extend libibverbs with shared-queue-pair mechanisms so that in-NIC accelerators can post work queue entries (WQEs) without trapping through the host. In vendor ecosystems, SDKs such as DOCA or the Pensando API wrap device registers and service endpoints in C/CUDA bindings and management APIs; for example, DOCA 2.5 includes GPUDirect RDMA helpers that map BlueField doorbells into CUDA streams directly [70].
7.3. On-Card Firmware and OS
A third layer runs on the card itself and determines how the SmartNIC is initialized, managed, and updated. In the DPU class, full-featured Linux distributions are common: BlueField-3 and Pensando DSC boot Ubuntu or Yocto on multi-core Arm clusters, where containerized agents can manage OVS-DPDK pipelines, DOCA plugins, and secure-boot attestation workflows [71]. In many FPGA shells, by contrast, lightweight RTOS or bare-metal control is typical; Corundum and OpenNIC frequently rely on a microcontroller-level monitor that initializes PHYs and supports management actions such as loading partial bitstreams over PCIe BAR 0. Finally, shells that embrace dynamic services introduce a dedicated partial-reconfiguration orchestrator. For example, Coyote v2’s on-card manager quiesces affected queues, loads a new PR region via ICAP in under 300 ms, and restores queue state in a way that is transparent to host software [40].
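The quiesce, load, and restore sequence of such an orchestrator can be sketched as a small state holder. This is a toy model and not Coyote v2 code: the class name, the `load_bitstream` callback, and the representation of queues as integer identifiers are all assumptions introduced for illustration.

```python
class PrOrchestrator:
    """Toy model of the quiesce -> load -> restore sequence for one PR region."""

    def __init__(self, queues_by_region, load_bitstream):
        self.queues_by_region = queues_by_region      # region name -> queue ids
        self.load_bitstream = load_bitstream          # callable(region, image) -> bool
        self.active = {q for qs in queues_by_region.values() for q in qs}
        self.log = []

    def reconfigure(self, region, image):
        affected = list(self.queues_by_region[region])
        for q in affected:                 # 1. stop issuing work on affected queues
            self.active.discard(q)
            self.log.append(("quiesce", q))
        try:
            ok = self.load_bitstream(region, image)   # 2. e.g. a write through ICAP
            self.log.append(("load", region, ok))
        finally:
            for q in affected:             # 3. resume even if the load failed
                self.active.add(q)
                self.log.append(("restore", q))
        return ok
```

Only the queues mapped to the region being replaced are paused. Queues that belong to other regions remain active throughout, which is the property that lets live traffic continue during an update.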
Together, these three layers deliver the abstractions and life-cycle controls that turn a reconfigurable datapath into a production-grade SmartNIC.
8. Conclusions and Future Work
This article has examined the evolution of network interface technology from simple frame movers to highly programmable SmartNICs. This study demonstrates that the fundamental driver behind this evolution is the growing mismatch between general-purpose CPUs and modern accelerators such as GPUs. Reducing the cost of data movement, rather than adding yet more compute cores, now offers the largest performance dividend.
8.1. Key Findings
Three observations emerge consistently from the surveyed systems. First, shell design is decisive: a well-structured FPGA shell can guarantee deterministic 100 Gb/s-class I/O while still supporting partial reconfiguration on sub-second time scales. In particular, hierarchical shells (e.g., Coyote v2) indicate that both infrastructure services and user kernels can be updated without interrupting live traffic, provided that reconfiguration boundaries, queue quiescing, and state management are treated as first-class design constraints. Second, direct GPU↔FPGA paths matter: peer-to-peer DMA over PCIe Gen4/5—and, prospectively, coherent attachment via CXL 3.0—eliminates multiple memory copies and can remove several microseconds from the end-to-end critical path. Across prototypes, these CPU-bypassing designs can reclaim the equivalent of an entire CPU socket while improving accelerator-dominated workloads such as distributed training by up to 30%. Third, in-NIC preprocessing changes the economics of RDMA: operations such as gradient reduction, Bloom-filter pruning, and inline encryption can be executed at network speed inside the SmartNIC, shrinking east–west traffic by up to a factor of four and reducing the CPU work per gigabyte transferred.
8.2. Software Implications
Hardware flexibility is valuable only when matched by a robust software stack. Reliable kernel drivers, user-space libraries (DPDK, libibverbs, and DOCA), and secure on-card firmware are essential for turning laboratory prototypes into production tools.
8.3. Future Work
Future work will focus on four complementary directions. First, we target an 800 Gb/s reconfigurable shell built around a Versal-HBM device capable of sustaining 4 × 200 Gb/s Ethernet ports. In this design, the high-bandwidth on-chip network would interconnect the MACs, HBM stacks, and partial-reconfiguration regions, providing sufficient internal bandwidth (>3 TB/s) to prevent congestion even when all four links operate at line rate. Second, we aim to move toward unified tool-chains: a timing-aware build flow that can integrate RTL, C/C++ HLS, and P4 within a single project, with the objective of reducing “idea-to-bitstream” latency to under 24 h. Third, we plan to develop a runtime scheduler and associated host driver/API that can deploy and manage SmartNIC compute workloads, whether or not they are directly tied to the network datapath; in this approach, a scheduler running on the embedded processors would orchestrate execution with minimal host involvement beyond workload specification. Finally, we will extend the study with a network power-consumption overview. Since network interfaces can account for around 15% of a SmartNIC’s total power, we plan to investigate prediction mechanisms for adaptive link-speed reduction in order to optimize energy consumption.
8.4. Original Contributions
Beyond summarizing prior work, this article makes several concrete contributions. First, it proposes a unified taxonomy for the SmartNIC design space by separating the landscape into comparable device classes (traditional NICs, ASIC/DPU-class SmartNICs, and FPGA-based shells) and by discussing them through a common set of architectural axes (e.g., programmability, offload locus, and operational constraints). Second, it provides a normalized, side-by-side analysis of representative FPGA shells and research platforms, capturing not only their feature sets but also their reconfiguration model (static versus PR/DFX), integration style, and practical trade-offs, as consolidated in Tables 8, 9, 10, and 12. Third, it contributes a cross-layer synthesis of GPU–FPGA peer-to-peer offload paths by consolidating the required hardware/software stack layers and the key design constraints for CPU-bypassing GPU↔FPGA datapaths, an aspect that is often fragmented across vendor documentation and isolated prototypes (Tables 13 and 14). Fourth, the article offers a consolidated view of in-NIC RDMA-oriented preprocessing by surveying the main preprocessing operators and mapping them to the architectural building blocks that enable line-rate execution, while highlighting where bottlenecks remain (Tables 15 and 16). Finally, based on the comparative evidence gathered throughout the survey, the article distills actionable guidance for next-generation reconfigurable SmartNIC shells, emphasizing requirements such as low-latency partial reconfiguration, clear isolation boundaries, bandwidth budgeting across NIC/PCIe/NoC/HBM, and a stable SDK/API surface that supports heterogeneous RDMA-centric workloads.