DrawerPipe: A Reconﬁgurable Pipeline for Network Processing on FPGA-Based SmartNIC

: In the public cloud, FPGA-based SmartNICs are widely deployed to accelerate network functions (NFs) for datacenter operators. We argue that with the trend of network as a service (NaaS) in the cloud is also meaningful to accelerate tenant NFs to meet performance requirements. However, in pursuit of high performance, existing work such as AccelNet is carefully designed to accelerate speciﬁc NFs for datacenter providers, which sacriﬁces the ﬂexibility of rapidly deploying new NFs. For most tenants with limited hardware design ability, it is time-consuming to develop NFs from scratch due to the lack of a rapidly reconﬁgurable framework. In this paper, we present a reconﬁgurable network processing pipeline, i.e., DrawerPipe, which abstracts packet processing into multiple “drawers” connected by the same interface. NF developers can easily share existing modules with other NFs and simply load core application logic in the appropriate “drawer” to implement new NFs. Furthermore, we propose a programmable module indexing mechanism, namely PMI, which can connect “drawers” in any logical order, to perform distinct NFs for different tenants or ﬂows. Finally, we implemented several highly reusable modules for low-level packet processing, and extended four example NFs (ﬁrewall, stateful ﬁrewall, load balancer, IDS) based on DrawerPipe. Our evaluation shows that DrawerPipe can easily ofﬂoad customized packet processing to FPGA with high performance up to 100 Mpps and ultra-low latency ( < 10 µs). Moreover, DrawerPipe enables modular development of NFs, which is suitable for rapid deployment of NFs. Compared with individual NF development, DrawerPipe reduces the line of code (LoC) of the four NFs above by 68%. We also evaluate the power of four NFs (i.e., ﬁrewall, stateful ﬁrewall, L4LB, IDS) integrated project with a 40-rule packet classiﬁcation. The result shows that the total power is 2.139 w, consisting of 0.050 w for clocks, 0.089 w for signals, 0.0058 w for logics, 0.018 for BRAM; 0.0055 w for MMCM, 0.154 w for I/O, 1.544 w for PS7 and 0.172 w for Device static. Then, we test the schematic RTL of a four-NF integrated project. The results show that there are six cells, 262 I/O ports, and 343 Nets in the entire SoC project.


Introduction
Modern public clouds provide computing, storage, and other types of services for multiple customers (i.e., tenants) on a shared infrastructure. To ensure security and performance isolation, each tenant is deployed in a virtualized network environment. Consequently, flexible network functions (NFs) are required to be deployed in two respects. First, datacenter operators need to implement NFs to enforce tenant isolation while guaranteeing Service Level Agreements (SLAs) [1,2]. Second, with the trend of network as a service (NaaS) in the cloud [3][4][5], tenants (especially enterprises) have moved line-of-business applications to the cloud [4]. For instance, Walmart has focused on migrating its thousands of internal business applications to Microsoft Azure to decrease operational costs associated with legacy architecture [6]. Thus, tenants also need to deploy a variety of customized NFs in their virtual networks. • We present a reconfigurable network processing pipeline for SmartNIC, i.e., DrawerPipe, which abstracts packet processing into multiple "drawers" with the same interface, and provides high flexibility to add, remove, or replace modules in the "drawers" to implement custom NFs. • We design a Programmable Module Indexing mechanism, i.e., PMI, and a PMI compiler, which allow developers to specify the module execution order for each flow to perform required NFs. • We implement a DrawerPipe prototype with five reusable modules on an FPGA integrated platform, and extend four example NFs (firewall, stateful firewall, load balancer, IDS). We then evaluate the PMI by constructing multiple service chains.
Experiment results show that DrawerPipe can offload customized packet processing to FPGA with high performance up to 100 Mpps and ultra-low latency (<10 µs). NFs deployed on DrawerPipe can share the same functionalities, often significantly reducing development efforts. Compared to developing the complete logic for each NF, DrawerPipe reduces the line of code (LoC) of above four NFs by 68%. At the same time, our PMI compiler (written in Python) can quickly construct one module chain for flows that follow the same service chain. In our test, PMI compiler builds 2-32 module chains for 10K flows within 70ms.
The rest of the paper is organized as follows: Section 2 introduces the requirements and approach of our work. Section 3 proposes the design of DrawerPipe. Section 4 shows PMI and its optimizations. Section 5 provides the experimentation and evaluation of DrawerPipe prototype, four extended NFs, and an PMI compiler. Related works on NF development are discussed in Section 6. Finally, Section 7 draws the conclusions.

Requirements and Approach
DrawerPipe targets designing a fast reconfigurable network processing pipeline for FPGA-based SmartNICs, which allows developers to develop and deploy multiple NFs easily on the FPGA while sharing the same functionalities. By analyzing the processing features of various commonly deployed NFs [24,25,29], we identify three key requirements for such a reconfigurable pipeline. R1: DrawerPipe supports sharing same functionalities between NFs to prevent redundant development while ensuring the data areolation. For example, we have implemented a firewall (FW) for filtering packets from malicious hosts, and an IDS [21] to alert a system administrator after detecting intrusion, as outlined in Figure 1. To prevent redundant development, FW and IDS should share similar functionalities, e.g., header parsing and packet classification. When multiple NFs use the same module in FPGA, we should ensure the data areolation between NFs and return processed data to the right NF. For example, as shown in Figure 1c, FW gets a matched ruleID (i.e., rule a) from packet classifier, while IDS obtains another matched ruleID (i.e., rule b) even for the same flow. R2: DrawerPipe supports FPGA/CPU co-design while using minimal FPGF-CPU communication times without affecting correctness. FPGA is no panacea, and some tasks are not suitable for FPGA [2]. Thus, DrawerPipe should support FPGA/CPU co-processing for complex NFs. Although previous work, R3: DrawerPipe supports users to customize the order in which modules are executed. We found that network traffic usually traverses a sequence of NFs these days, i.e., service chain. For example, as outlined in Figure 3, while packets belonging to tenant 2 perform L4LB for balancing accessing responses, packets belonging to tenant 3 enter the FW first for filtering malicious packets. Moreover, one NF can be decomposed into multiple software or hardware modules for reusing. Thus, DrawerPipe needs to build one module execution sequence (called module chain in this paper) for each service chain. Although it is easy to customize the execution order of software modules by adjusting the order in which functions are called [32,33], the hardware pipeline cannot dynamically modify the connection relationship between modules without resynthesizing.  . The NF developer specifies the module execution sequence for each type of packets. We need to decompose NFs and map them to one physical pipeline according to module execution sequence.
Our primary approach is to design a modular and reconfigurable network processing pipeline for FPGA-based SmartNICs, which allows developers to extend new NFs by inserting customized modules and specify the module execution sequence to perform required NFs. As shown in Figure 4, DrawerPipe consists of three components designed to meet all requirements:  Figure 4. NF developers write scripts to describe module chains, which are compiled into a module graph for constructing and configuring the packet processing pipeline based on DrawerPipe.
• Modular and reconfigurable pipeline: According to the characteristics of packet processing existed in commonly deployed NFs, DrawerPipe abstract packet processing into multiple "drawers" with the same interface, and provides several highly reusable modules for low-level packet processing. DrawerPipe allows NFs sharing similar functionalities while ensuring data areolation using two methods. First, to ensure matching isolation, every rule Programmable module indexing mechanism: Motivated by the idea of building linked list in C/C++, PMI allows users to configure the next module to process packets one by one. Thus, users can specify the module chain traversed by packets to obtain any required service chain for multiple tenants (R3). To reduce FPGA-CPU communication times, PMI steers packets through as many hardware modules as possible before passing through software ones (R2). Furthermore, we use PMI to distinguish flows or tenants that need to look up different logic rule tables for data areolation between NFs (R1).

Design of DrawerPipe and DrawerPipe Shell
In this section, we propose the DrawerPipe model which abstracts packet processing into multiple "drawers" with the same interface and provides five highly reusable modules for basic packet processing. Then, we present the design of DrawerPipe shell which hinds the low-level packet processing details.

DrawerPipe Model
We analyze the processing features of commonly deployed NFs [24,25,29], and find that most NFs require essential and similar processing steps. For example, most NFs parse packet headers and then classify packets based on these headers, and finally switch packets according to the destination MAC addresses. NFs can be considered as adding customized logic to these basic processing steps. For instance, we can implement an L4LB by plugging a software module allocating a server for each flow, and a hardware module searching an allocated server for each packet.
Leveraging this property, we present DrawerPipe that abstracts packet processing into five general stages, i.e., header parsing, field extraction, packet classification, user-defined action, and egress management. Each stage consists of multiple "drawers" connected by the same interface, as shown in Figure 5. Users can easily extend new NFs by inserting, removing, or replacing core application logic in the "drawers" without understanding the design details of other modules. Moreover, DrawerPipe provides five highly reusable modules (basic modules), i.e., header Parser (P), field Extractor (E), packet Classifier (C), L2 Switching (S), and Transmitter (T), for essential packet processing.  The header parsing stage is used to identify header types of packets and separate the packet header from the payload. While the payload is cached in the packet buffer, the packet header with an obtained header type is sent to the field extractor for further processing. Although the basic header parser (i.e., L4 Parser) only supports some common network protocols such as ARP, IPv4, IPv6, ICMP, TCP, and UDP, developers can easily extend custom protocols by adding new parsing modules following the basic parser without understanding or modifying existing parsing modules.
Field extraction fetches appropriate fields from the packet header according to its header type, such as extracting source and destination MAC addresses for ARP packets. The basic field extractor fetches source and destination IP addresses, IP protocol, source, and destination ports (if any) to construct a five-tuple, with source and destination MAC addresses for L2 learning and switching. As most NFs parse similar network protocols, they can share the same packet parser and only need to insert user-specific field extractors following the basic extractor for fetching required fields.
The function of packet classification is to classify or filter packets based on extracted fields, e.g., five-tuple. DrawerPipe provides a basic packet classifier based on the BitVector (BV) algorithm [34] which has predictable search latency and is suitable to be implemented in FPGAs. To trade-off hardware resource consumption and search latency, developers can replace the BV algorithm with other packet classification algorithms [35], such as decision tree (e.g., Hicuts [36], HyperCuts [37]), decomposition [38], tuple space searching (TSS) [39], or hash-based exact matching.
User-defined action is used to implement custom functionality, such as load balance. The basic module in the user-defined action stage is L2 switching, which obtains an egress port by searching the L2 forwarding table (i.e., mappings of a destination MAC address to egress port). Therefore, modification of a destination MAC address should be placed before the L2 switching, and the processing related with egress port can only be set after the L2 switching. For example, IP forwarding should be placed before the L2 switching, and measurements for counting packets sending to each egress port can only be set after the L2 switching.
The egress managing stage is used to implement packet scheduling, and drop or forward packets according to the processing actions generated during the packet classification (as ACL) or user-specific processing. DrawerPipe provides a transmitter to realize the latter functionality. NF developers can add scheduling algorithms by plugging custom modules before the transmitter. For example, we can apply Token Bucket Filtering [40] for traffic shaping, or Weighted Fair Queuing (WFQ) [41] for packet scheduling.
In detail, we take the extending L3 routing function to a basic five-stage DrawerPipe as an example. As shown in Figure 5, we add an IP forwarding module (Routing) before the L2 switching. The routing module reads a destination IP address from the metadata and performs the longest prefix match to get the next hop. Then, it uses the destination MAC address of the next hop to replace the original one, and decreases the TTL in the IP header. Finally, these modified fields will be written back to the packet in the transmitter.
To support hardware/software co-processing, users can insert either a hardware or software module in the DrawerPipe. Hardware modules can utilize massive parallelism provided by FPGAs to implement the part of packet processing that requires high performance. Software modules run as processes with high flexibility and can execute complicated logic. We extend a new NF in DrawerPipe in two steps. First, we decompose the processing logic of the NF into software and hardware parts. In each part, we share the same functionalities with existing NFs, and load customized modules in the "drawers". Furthermore, we design a DrawerPipe shell (described in Section 3.2) to exchange packets with attached metadata between the FPGA and CPUs. As a result, software and hardware modules can share the intermediate results by reading and modifying the metadata. Second, we use PMI (described in Section 4) to link these software and hardware modules for co-processing. In addition, we leverage PMI to specify module chains traversed by packets to obtain required service chains.

DrawerPipe Shell
DrawerPipe shell is the platform-related processing logic around DrawerPipe and provides the execution environment for packet processing logic loaded in the "drawers". It defines an API that allows hardware or software modules specified by users to access common functionality through a set of target-agnostic abstractions. Consequently, the DrawerPipe shell should meet the following requirements. First, it should provide a uniform interface that is portable across many different hardware platforms. Second, it should also provide an efficient medium to transport data across hardware and software modules. Finally, it should provide auxiliary functionalities to support receiving/sending packets from/to physical ports, buffering packets, CRC checking and calculation, attaching a receiving timestamp, and so on.
Note that developers can create a variety of potential NFs, ranging from simple ones such as firewall, to complex ones such as IDS, which needs software/hardware co-processing. To support these different use-cases, DrawerPipe shell provides hardware and software APIs. As shown in Figure 4, the hardware API includes four main functions, i.e., receiving packets from ingress ports, sending packets to egress ports, configuring registers or SRAM (tables) in hardware modules, and exchanging packets with software. The software API provides two main functions for receiving (or sending) packets from (to) hardware, and configuring hardware registers or RAM-based tables. Moreover, the functions of DrawerPipe shell are fixed, which can be implemented using IP cores provided by existing FPGA providers without frequent updates. Thus, modules in DrawerPipe do not need to modify their interface and logic when porting to other platforms.
Below, we divide the DrawerPipe shell into three major components, i.e., transceiver management, memory management, and FPGA-CPU communication.
Transceiver Management. DrawerPipe shell provides a transceiver management unit that enables DrawerPipe to use the media access control (MAC) and physical (PHY) layers specific to a target platform. The transceiver management unit uses vendor-specific protocols to receive and send packets, and provides DrawerPipe with data in a standard format. Developers do not need to understand the design details of PHY and MAC layer logic, such as gmii-rgmii conversion and CRC checking and calculation. As a result, modules in DrawerPipe are target-agnostic and portable across many FPGA platforms.
Memory Management. When packets arrive at DrawerPipe, they will be stored in the packet buffer after packet parsing, as outlined in Figure 5. The packet buffer can be designed in two ways. A straightforward approach is to use FIFO queues, i.e., one queue for each ingress port, and outputs packets in the order in which they are received. However, simple FIFO queues are not sufficient for more advanced packet processing functionalities, such as packet scheduling, which require reordering packets as they are processed.
DrawerPipe shell also provides an optional memory buffer based on SRAM. The SRAM-based memory management realizes two functions: dispatch and recycle. The dispatching function allocates a free packet block which can buffer a maximum transmission unit (MTU), and returns a unique packet identifier (packetID) when it receives a packet. Upon the completion of taking the packet out of the packet buffer, the recycling function frees the corresponding packet block, and recycles the packetID.
FPGA-CPU Communication. DrawerPipe integrates an FPGA-CPU communication channel between the hardware and software. We have designed two FPGA-CPU communication channels: one is built on top of the PCI express (PCIe) protocol [31], which is the de-facto protocol for internal communication within network devices. For example, in our previous FPGA-integrated network processing platform, such as iRouter [42], NetMagic-Pro [43], and OpenBox-S28 [44,45], we use PCIe-based channel to connect Intel FPGA (Stratix V) and Intel CPU (i7-4700eq). Another is built on top of the Advanced eXtensible Interface (AXI) [46] bus. For example, in our experiment, we use AXI-based channel to deliver message between FPGA and CPU. Based on FPGA-CPU communication channel, we provide two kinds of APIs for CPU: recv/send API and configuration API, as shown in Figure 4. The recv/send API used to exchange packets and metadata with the FPGA is the foundation for supporting FPGA-CPU co-processing. The configuration API provides functions of reading and configuring registers or SRAM-based tables in hardware modules.

Design of PMI
In this section, we present the strawman design of PMI and resource optimization in PMI implementation. We also design a PMI compiler to simplify the configuration of PMI table in each module.

Strawman Design of PMI
Most NFs deployed in DrawerPipe are composed of multiple modules, and these NFs can share modules with similar functionalities. To perform specific NFs, we need to assign a module execution sequence for flows that follow the same service chain. Inspired by linked list in C language, PMI constructs an ordered module chain by configuring the next module to process packets for each module.
The module in DrawerPipe has a unique module identifier (mid), and PMI uses next mid (nmid) to indicate the next module following the current module. The function of nmid is similar to the GOTO command in OpenFlow [47] for skipping unrelated matching tables. Users can specify the nmid to bypass some modules that do not need to process the current packet.
PMI uses the combination of mid and nmid to construct module chains. The processing flow can be expressed as: (1) the initial module (i.e., L4 parser) receives a packet, and obtains the next module by searching mappings of flow identifier (i.e., flowID such as five-tuple) to nmid. Moreover, the obtained nmid will be filled in metadata; (2) following module compares its local mid (lmid) with the nmid carried by the metadata. If lmid is equal to nmid, the current module needs to process this packet and update the nmid; otherwise, the packet can bypass the current module; (3) repeating step (2) until the packet enters the last module of DrawerPipe (i.e., transmitter); (4) transmitter forwards the packet to egress ports when the nmid is 0 (means no succeeding module), or sends the packet back to L4 parser for performing step (1) to (4) in a new loop.
As illustrated in Figure 6, we construct a module chain, i.e., parser (P) → extractor (E) → L4LB (including a software module and a hardware module) → switching (S) → transmitter (T) for outbound TCP packets, in two steps. First, we assign one mid for each module (e.g., the mid of parser is "1"). Second, we specify nmid for outbound TCP in all traversed modules (e.g., the nmid for outbound TCP packets in the extractor is "6"). As a result, packets belong to outbound TCP will traverse parser, extractor, L4LB's FPGA module, L4LB's CPU module (for new flow), and transmitter before being forwarded to the egress port. To connect software and hardware modules, we set the mid of software modules from 128, and the leverage transmitter to separate the packets sent to software modules from the packets forwarded to egress ports. Consequently, packets whose nmid is equal to or bigger than 128 are sent to software modules, and packets whose nmid smaller than 128 are returned to hardware modules.
To implement the PMI mechanism, each DrawerPipe module consists of three parts: nmid comparing logic, packet processing logic, and nmid lookup logic, as shown in Figure 7. The nmid comparing logic is relatively simple, and determines whether to process packets by comparing its lmid with the nmid conveyed in the metadata. Processing logic is realized by users to perform the application-specific function. The nmid lookup logic maintains a PMI table consisting of two fields: condition (representing flowID) and nmid. The nmid lookup logic examines packet's flowID with entries in PMI table to find the nmid.

Optimizations of PMI
As described above, the strawman design maintains a PMI table in each module, and PMI tables are independent. In such a distributed model, packets should search multiple PMI tables to perform required NFs. A nmid lookup example is given in Figure 8a, in L4LB's FPGA module, the first packet of f low3 matches the default entry and then updates its nmid to 129.
Although it is easy to implement a PMI table and write a reusable lookup logic based on a hash algorithm, the disadvantages of the strawman PMI model are also obvious. First, as the number of flows grows, maintaining a PMI table in each module consumes lots of memory resources and introduces additional lookup delay due to hash conflict. Second, inserting a new module may need to modify the PMI table (e.g., adding matching entries or changing matching fields) of all upstream modules, which can be complicated. For instance, in Figure 8a, if there is no FW in the current pipeline and we want to add an FW after the packet classifier to filter malicious flows, we need to find the upstream modules (i.e., extractor, packet classifier) and configure their PMI tables to direct traffic to the modules of FW.
To solve the problems above, we present two optimized PMI models, i.e., centralized PMI and hybrid PMI models. While the centralized PMI model specifies the module execution sequence in a centralized module, the hybrid PMI model compresses the same execution sequence as a path, and assigns nmid by searching path-nmid mappings in each module.

Centralized PMI Model
As shown in Figure 8b, compared with maintaining a PMI table in each module, centralized PMI maintains only one mapping table in the mid-chain (midc) dispatcher (D) after the extractor.
The mapping table includes two components, i.e., condition and mid chain. The mid chain consists of a list of modules that should be traversed by packets. Therefore, the received packet is sent to the midc dispatcher, and obtains a mid chain by examining its flowID with entries in the midc dispatch table. The following modules only need to check the mid chain, without maintaining a condition-nmid mapping table like strawman PMI model.
The processing flow of modules after midc distributor can be described as: (1) comparing the module's lmid with the header of mid chain carried by metadata. If the lmid is equal to the head mid, this module should process the current packet; otherwise, output the current packet without processing; (2) updating the mid chain by popping out the head mid after processing.
Still, taking the first packet of f low3 in Figure 8b as an example, the packet matches the default entry in the midc dispatcher and obtains a mid chain as 6 → 129 → 7 → 8. That is, the current packet needs to pass through four modules (i.e., L4LB's FPGA and L4LB's CPU modules, switching and transmitter) after the midc dispatcher. When mid chain reaches the end of the list, the packet can jump out of the DrawerPipe and be forwarded or discarded according to the output port in the metadata.
The centralized PMI model does not maintain a PMI table in each module, which can save a lot of storage resources and time used for performing PMI table lookups. However, it is hard for hardware to maintain a long and variable mid chain in the midc dispatch table and metadata. Moreover, it causes a waste of resources if we distribute a fixed space in midc table and metadata for the longest mid chain.

Hybrid PMI Model
The hybrid PMI model combines the advantages of both centralized and strawman models. As outlined in Figure 8c, the hybrid PMI model has two different characteristics. First, we compress the same module execution sequence into one path, which can save a lot of memory resources by avoiding recording one module chain for each flow. For example, in Figure 8c, f low1 and f low2 have the same module execution sequence, and we use path I to represent this sequence. As a result, a path dispatcher in hybrid PMI model saves a mapping of condition to path identifier (pathID) instead of a variable mid chain. Second, the hybrid PMI module maintains pathID-nmid mappings (namely path table) in each module.
Taking Figure 8c as an example, the first packet of f low3 matches the default entry in the path dispatcher and obtains path II, which means the packet should traverse modules represented by Path II. The following modules only need to search the path table to obtain the nmid, such as the nmid of path II in L4LB's FPGA module being 129 (i.e., L4LB's CPU module). Consequently, the current packet will be sent to L4LB's CPU module after the processing of L4LB's FPGA module.
The processing flow of hybrid PMI model can be described as: (1) DrawerPipe receives a packet, and sends it to a parser and extractor for basic packet processing; (2) then, path dispatcher gets this packet and examines its flowID with condition−path mappings to find a pathID and an initial nmid. The pathID and initial nmid will be filled in metadata; (3) succeeding modules compare their lmid with the nmid. If they are equal, the current module needs to process this packet and update nmid by searching the pathID−nmid mappings; otherwise, packet could bypass the current module; (4) if nmid is 0, this packet has already been processed by all modules, and can jump out of the pipeline.
There are two advantages in the hybrid PMI model. First, we merge the same module chain into a path, which is more memory-efficient than the centralized one. Second, as pathID is ordered from zero, modules can directly use pathID as the index to search the path-nmid table without any hash conflict. Consequently, we adopt the hybrid PMI model in DrawerPipe.

PMI Compiler Design
The intention of designing a PMI compiler is to transform service chaining intent into module connections, i.e., the configuration of path dispatch and path tables. As NFs deployed in DrawerPipe are composed of multiple modules, we can use module chains to represent service chaining intent.
Module chains are described in a simple script consisting of two-tuple, i.e., (condition, {module name}), as shown in Figure 9. The condition (i.e., flowID) consists of one or multiple matching fields, such as five-tuple. Each condition corresponds to a list of modules (i.e., {module name}), which represents the execution order of modules. In Figure 9, ( f low1, {P, E, FW-1, FW-2, LB-F, LB-C, S, T}) means packets belonging to f low1 pass through parser, extractor, FW-1, FW-2, LB-F, LB-C, switching and transmitter in order.  Path Table ( Based on the above abstraction of module chains, the PMI compiler transforms module chains into configuration content of the path dispatch table and path table as follows. First, the PMI compiler generates a module graph based on module chains, as outlined in Figure 9. Since DrawerPipe has only one physical processing pipeline, we merge multiple module chains to this pipeline described by module graph. Referring to Algorithm 1, we firstly initialize freeMid and moduleGraph. Then, in pseudocode 3-13 lines, we traverse all module chains stored in moduleChainArray (reading from the script), and dispatch a unique mid for each module without repetition. Meanwhile, we append these modules to moduleGraph. Similarly, we dispatch each different module list a pathID in the pseudocode 12 line. Finally, in pseudocode 14-20 lines, we examine all possible downstream modules for each module, and record them in the node array (described as moduleGraph[node]).

Module Chains
Second, the PMI compiler generates path information (configuration of path dispatch and path tables) based on module chains and the module graph, as shown in Figure 9. We find that each module chain corresponds to an entry in the path dispatch table, and each path corresponds to an entry in the path table. Moreover, if there is only one nmid in path table, we can merge these entries into one default entry. For instance, the default nmid of the parser is the extractor. Leveraging these observations, the PMI compiler generates path information in two steps, described in Algorithm 2: (1) In pseudocode 1-4 lines, we record a condition to pathID mapping for each module chain in the path dispatch table array (pathDTb[i]). (2) Then, in pseudocode 5-13 lines, we fill the pathID to nmid mapping for each moduleGraph module and merge the same nmid to a default entry. while node = NULL do 6: if node not in moduleGraph then 7: node.mid ← f reeMid + +

Evaluation
Based on the DrawerPipe model, we implemented a basic five-stage pipeline on a Xilinx FPGA integrated network processing platform that consists of a Zynq-7000 SoC chip [46], 4 GB DDR3 DRAM, and four 1 Gbps ports. In more detail, the Zynq-7000 SoC chip we used integrates an FPGA chip and two ARM Cortex-A9 processors .The FPGA chip has 53200 Slice LookUp Tables (Slice LUTs), 106,400 Slice registers, and 140 Block RAM tiles (i.e., 2520 Kb).
Below, we evaluate the DrawerPipe model and PMI mechanism from three aspects. First, we have extended four example NFs based on DrawerPipe to demonstrate the flexibility of DrawerPipe model. Second, we tested the performance and resource consumption of key modules in DrawerPipe. Third, we implement and compare resource consumption of three PMI models, and construct multiple module chains to evaluate the performance of our PMI compiler.

Applications
To evaluate the flexibility of DrawerPipe model, we have extended four example NFs based on the DrawerPipe model. These NFs include five-tuple based firewall, stateful firewall, L4 load balancer, and IDS. The hardware modules are written in Verilog, and the software modules are implemented using C. Using these case studies, we demonstrate that the DrawerPipe model supports diverse applications and enables modular development by allowing developers to focus on the core application logic. Table 1 summarizes the LoC of four NFs developed in two methods, i.e., individual and reusing. As mentioned in Section 2, individual development often requires developers to write complete packet processing code for each NF. Therefore, we adopt the reusing method to simplify NF development by reusing the same functionalities with other NFs and five basic modules provided by DrawerPipe. As shown in Table 1, compared with individual NF development, the reusing method only requires writing 12.4%-54.1% code for application-specific logic based on the DrawerPipe framework. Moreover, our experience also shows that DrawerPipe provides high flexibility of adding, removing, or replacing modules to implement custom NFs.

A1. Firewall.
The function of the firewall is to filter the specific flow whose flowID matches the pre-configured filtering rules. The rules of the firewall may have masks to filter one kind of packet. For instance, if we want to drop TCP packets, we can set the filtering rule as ( * , * , 6, * , * ) where IP protocol is equal to "6" and other fields of 5-tuple can be any value.
As DrawerPipe provides the basic packet processing functionalities of parsing, filed extraction, and packet classification, we can reuse these basic modules and add an action table to implement the firewall. The action table maintains an action (drop or accept) for each rule in the packet classifier. Consequently, if one packet hits a rule in packet classifier and gets the matched index, the firewall will drop or accept this packet according to the action directed by the index.
A2. Stateful firewall (SFW). Some NFs process packets based on connections, not just packets, such as stateful firewall and L4 load balancer. Our SFW maintains a list of IP addresses for internal hosts, and only allows the TCP connection request whose source IP address belongs to this IP address list. Consequently, SFW allows outbound TCP connections, but rejects inbound ones. However, the external host is allowed to send packets to the internal host through the established TCP connection. Thus, SFW needs to track the status of all active connections, including both client and server connection statuses, such as requested or established, sequence and acknowledgement numbers, and window size scaling factor.
We implement SFW by plugging two hardware modules between packet classifier and L2 switching. One module is used to examine the source IP address of packets with the internal IP address list for filtering inbound TCP connections. The other module tracks TCP connections, and drops invalid packets according to the connection status. SFW discards packets in three situations. First, the packet's sequence number exceeds the safe range of a received acknowledgement number (occurring in replay attacks [48]). Second, packets do not belong to the current status, e.g., receiving an FIN packet in the requested status which waits for an SYN-ACK packet). Third, packets match a closed or time-out connection. Thus, we maintain a timer for each connection, and close the connection when its timer expires (e.g., out timeout is set to 100 sec).

A3. L4 load balancer (L4LB).
Load balancer has been widely used in modern network, such as loading traffic to different paths in a multi-path environment, sending accessing requests to distinct servers to balance the workload in datacenter. Our L4LB is the latter, which balances the workload by sending more requests to servers that have lower CPU usage while guaranteeing the same flow processed by the same server based on 5-tuple.
L4LB also needs to track TCP status, e.g., assigning a destination server for each new connection, replacing destination (or source) IP address for established connections, and deleting the mapping of flowID to a destination server when the connection is closed. In our implementation, we insert a mapping searching module after the SFW, and share the connection management function with SFW. The mapping searching module is used to examine 5-tuple of packets with mappings to find the destination server. In addition, we developed a software module to communicate with servers to obtain their CPU usages, and to dispatch an appropriate server for each new connection to balance workload.
A4. Simplified IDS. Snort [21] is an open source intrusion detection system developed by the C language. The processing flow of snort can be decomposed into four steps. First, Snort receives packets from the network interface card (NIC) based on libpcap or DPDK [10]. Second, it parses received packet and extracts five-tuple according to the header type. In particular, it extracts source and destination IP addresses, IP protocol, ICMP type and code for ICMP packets. Third, it examines these extracted fields with pre-configured rules, and executes the last step if they match. Finally, it performs option fields lookups which consist of exact matching and regular expression matching. If they match, it will generate alarm information according to the processing action.
Due to the limited performance and high processing delay of Snort software, it is difficult to achieve line rate processing of 10 Gps. For this reason, we offload part of the processing of Snort to FPGA for acceleration, including packet parsing and packet classification used for rule matching based on 5-tuple, and left option matches on software. In our implementation, we added a hardware module after the packet classifier to direct specific flows to software, and developed a software module to perform option fields lookups to filter malicious packets. Finally, hardware and software modules interact with packet and control information through the DrawerPipe shell.

Performance and Resource Utilization
In this experiment, we evaluate the performance and resource utilization of key modules in DrawerPipe, including five basic modules and application-specific modules for implementing the four NFs above. Since our platform does not provide multiple 10 Gbps interfaces, we test the performance of these modules using simulation by Xilinx Vivado (the version is 2017.04) [27] and Intel Quartus (the version is Quartus prime 16.0.2) [28], which have been common choices in previous hardware tests [49,50]. We further investigate the performance of software modules contained in four NFs on our platform. Table 2 presents the key modules we have implemented in DrawerPipe. The listed modules include five basic modules: (L4_Parser, Extractor, Packet Classifier, Switching, Transmitter), two kinds of packet buffers (based on FIFO and SRAM), and six application-related modules (Firewall, Stateful Firewall, L4LB_FPGA, L4LB_CPU, IDS_FPGA, IDS_CPU). We first test the Maximum clock Frequency (FMax) of each module, and the lowest one is 232.88 MHz, as shown in Table 2. Thus, in the following simulation experiment, we test the throughput and latency under the clock frequency of 200 MHz, which is a typical clock frequency of FPGAs [2]. For the top part of Table 2, the module needs to touch every byte of a packet. We show the throughput in Gbps. The modules in the bottom part of the table, however, process only the packet header (metadata). Therefore, it makes more sense to measure the throughput using packet per second (pps). Although most modules can be completely pipelined, i.e., it can input search key and output result of every clock, some modules cannot process one metadata every clock, such as extractor processes with one metadata every two clocks, and can only achieve 100 Mpps. The width of Metadata we adapt is 256 bits, which includes five tuple. We also show the processing latency of each module in Table 2. As we see, this latency is low: the mean is 8 clocks (equal to 40 ns at 200 MHz), and the maximum is merely 26 clocks (Stateful firewall). We note that the delay is not related to the throughput, as modules can leverage FPGA's parallelism to process multiple packets at the same time. For instance, a BV-based packet classifier incurs a long delay (i.e., 17 clocks in our design) to find the matched rule with the highest priority. However, this packet classifier can be completely pipelined, i.e., it can input search key and output matched rule every clock.
We evaluate the three types of resource utilization for each module, including Slice LUTs, Slice registers, Block memory, and summarize the result in the last three columns of Table 2. The utilization is normalized to the capacity of the FPGA chip. We can see that most modules use only a small amount of logic and register resources. This is reasonable as most operations on packets are simple. Packet classifiers have moderate usage of logic and register resources because they have bigger logic, such as shift operation for updating match rules. The block memory usage, however, heavily relies on configurations of modules. For example, the block memory usage grows linearly with the number of rules supported in the packet classifier. Overall, our FPGA chip has sufficient capacity to support multiple simplified NFs by assembling above modules. Thus, a developer can use the onboard DDR memory to get enough storage space at the expense of introducing additional access delay.
Since servers in the datacenter are always virtualized to deploy multiple virtual machines, we encountered a problem that FPGA (i.e., Xilinx Zynq-7000 SoC chip we use) has limited resources to support tens of thousands of flows (or rules) when we implemented NFs. We find that there are three methods to support more flows (or rules). First, we can use high-capability FPGAs such as Xilinx Virtex UltraScale series and Intel Stratix 10 series, which have 20× memory resources than Xilinx Zynq-7000 SoC chip. However, the maximum flows (or rules) that can be maintained by onboard SRAM is less than the worst case even if we use the most advanced FPGA chip. Second, we can use onboard DDR memory (DRAM) to get enough storage space at the expense of introducing additional access delay. As the total number of supported flows or rules is limited only by the DRAM capacity, it is sufficient to support O(1M) flows and O(100K) rules with 4 GB DDR memory. Third, similar to processing of SDN, we can maintain part of the flows on FPGA's SRAM and send packets that miss the flow table to CUP for processing. This method uses DDR to storage flows (or rules), which has good scalability. However, compared to SRAM-based strategy, two DRAM-based methods incur additional access latency. To maintain a high performance, we can use SRAM as an L1 cache and use DRAM to store the left flows (or rules). At the same time, we should adopt an efficient replacement strategy to maintain a high SRAM hit rate to reduce the DRAM access times. In fact, considering scalability and implementation complexity, we adopt the third approach, i.e., maintaining a part of flows on FPGA's SRAM and sending packets that miss the flow table to CUP for processing. In addition, we can also use a high-capability FPGA to store more flows (or rules) in the next work.
We also evaluate the power of four NFs (i.e., firewall, stateful firewall, L4LB, IDS) integrated project with a 40-rule packet classification. The result shows that the total power is 2.139 w, consisting of 0.050 w for clocks, 0.089 w for signals, 0.0058 w for logics, 0.018 for BRAM; 0.0055 w for MMCM, 0.154 w for I/O, 1.544 w for PS7 and 0.172 w for Device static. Then, we test the schematic RTL of a four-NF integrated project. The results show that there are six cells, 262 I/O ports, and 343 Nets in the entire SoC project.
Furthermore, we evaluate the performance of software modules in two NFs (i.e., L4LB and IDS) on our platform. As mentioned above, the function of L4LB_CPU module is to dispatch a relatively idle server for each new connection, and IDS_CPU module is used to inspect packet payload with pre-configured rules. In the performance test of the L4LB_CPU module, we set the number of servers available for selection to 10, and configure the maximum number of flows to 1 K. That is, we support a load balance of 1 K active flows for 10 servers. We configure IDS_CPU module with almost 2 K option rules commonly used in the Snort.
In this experiment, we connect our platform with an IXIA emulator [51] which can generate packets with different size. The packets sent by IXIA pass through L4LB_CPU or IDS_CPU module and then return to IXIA. Consequently, we subtract the timestamps of sending and receiving packets to get processing delay, and calculate the throughput by counting the number of packets received per second. To better illustrate the performance of L4LB_CPU module, we also test the performance of L2 forwarding packets by FPGA (FPGA_FWD) and L2 forwarding packets by CPU (CPU_FWD), and use them as a reference. Moreover, in the scheme of L4LB_CPU_w_Basic_FPGA, the function of allocating server for new flows is implemented on CPU while the basic processing such as header parsing is implemented on FPGA. Figure 10a shows the throughput with different packet sizes. With all sizes, FPGA L2 forwarding can almost achieve line rates, i.e., 764 Mbps with 64 B packets and 981 Mbps with 1500 B packets. This is reasonable as two packets have to maintain an Interframe Gap (IFG), and sending smaller packets wastes more bandwidth. CPU straight forwarding, however, achieves only a maximum of 565 Mbps, and the throughput decreases as packets become smaller. This is because, with a smaller size, the number of packets needing to be processed increases. While L4LB_CPU_w_Basic_FPGA module can achieve 37 Mbps with 64 B packets and 499 Mbps with 1500 B packets, IDS_CPU_w_Basic_FPGA can only achieve 26 Mbps and 113 Mbps when packet sizes are 64 B and 1500 B, respectively. The reason is that the packet processing of L4LB_CPU module is relatively simple, but IDS_CPU module needs to perform much more complicated logic such as regular expression matching. In addition, since IDS_CPU module inspects every byte of a packet, its throughput does not increase significantly as the packet size increases. Fortunately, since the number of packets that need to be sent to the CPU is small, such as L4LB only needing to send the first packet of each flow to CPU, using two low-power ARM cores is enough to handle these exception packets. Figure 10b shows the latency with different packet sizes. Again, FPGA L2 forwarding yields low latency smaller than 10 µs, but CPU L2 forwarding incurs a larger latency up to 60 µs. For the same reason, the latency of L4LB_CPU_w_Basic_FPGA is a little more than CPU L2 forwarding, but the latency of IDS_CPU_w_Basic_FPGA is large and increases significantly as the packet size increases. We note that the difference between FPGA and CPU L2 forwarding delay is mainly caused by DMA, and the difference between CPU L2 forwarding and L4LB_CPU_w_Basic_FPGA or IDS_CPU_w_Basic_FPGA delay is the overhead for implementing application-related processing.

PMI Evaluation
In this subsection, we compare the resource consumption of three PMI models and evaluate the performance of our PMI compiler based on hybrid PMI model.
In this experiment, we implement three PMI model, i.e., strawman, centralized, and hybrid. The PMI tables in all models are implemented using Cuckoo Hashing (with two hash buckets) and contains 2K entries that match against flow 5-tuples. We fix the number of total flows to 1K and construct four kinds of module chains outlined in Figure 9. Table 3 shows three kinds of resource utilization in each PMI model. We can see that all schemes use only a small amount of logic and register resources and have a moderate usage of block memory to maintain PMI tables and buffer packet/metadata. The resource utilization of nmid lookup in each scheme is almost equal. However, strawman needs to maintain a nmid lookup in each module except the parser and the transmitter, which consume more resources than the centralized and hybrid models. Moreover, as mentioned above, the centralized model needs to maintain a module chain for each flow, which also uses more block memory resources than the hybrid one. Therefore, the total block memory used by strawman is centralized PMI models are 4.15× and 1.20× more than the hybrid one. Table 3. The resource utilization of three PMI models, i.e., strawman, centralized and hybrid, when constructing four module chains (outlined in Figure 9) for 1K flows.

Main Modules
Resource (%) In addition, we evaluate the performance of a PMI compiler on a machine with Intel core i7-8550U CPUs (1.80 GHz, 8 cores in total), 16 GB of RAM. Figure 11 shows the compiling latency of generating 2-32 module chains combining the four NFs above with the different number of flows. We can see that the compiling latency grows linearly as the number of flows increase. However, the difference between constructing 2-32 module chains for the same number of flows is very small. For instance, the compiling latency of mapping 100 flows to 2 and 32 module chains are 0.681 ms and 0.695 ms, respectively. This is because most of the time is spent reading module chains described in the script and generating the corresponding module graph, and the time consumed by generating configuration rules for path dispatch and path tables is small.

Related Work
Software NFs have perfect flexibility and can implement many complicated packet processing. Recently, many software systems have been designed to implement various types of NFs [25,29,33,52]. However, software NFs have two fundamental limitations: limited capacity and highly variable latency. However, most software systems exploit the multi-core parallelism in CPUs to achieve close to 10 Gbps throughput per machine, and scale out to use more machines when higher capacity is needed. While software NFs can scale out to provide more capacity, doing so adds considerable costs in both capital expenditure and operating expense [2,8]. Some other optimizations, such as bypassing kernel protocol stacks (e.g., DPDK [10]) and processing a vector of packets at a time (e.g., VPP [11]), can greatly increase the throughput of packet processing, but there are still large and fluctuating processing delays [8].
Recent work presents the idea of developing NFs on FPGA-based SmartNICs to accelerate specific NFs while maintaining high flexibility [2,8,67]. However, the state-of-the-art SmartNIC framework, i.e., AccelNet [8], is carefully designed to accelerate specific NFs for datacenter operators, which require professional hardware design ability. Thus, without rich experience and expertise, it is still challenging for most tenants to rapidly deploy their NFs on the SmartNICs. Moreover, there is limited work on orchestrating the performing order of NFs (i.e., service chaining), on the same FPGA/CPU co-processing platform.
There is also some related work presented to reduce the difficulty of developing NFs with FPGAs [2,14,15]. The core idea of these methods is describing NF in high-level languages (e.g., C/C++), and then compiling it to hardware code using High-Level Synthesis (HLS) tools (e.g., Xilinx Vivado HLS [27] or Altera OpenCL SDK [28]). Existing HLS tools, however, are still difficult to address data hazards, resulting in low processing performance and high resource consumption [8,15]. Moreover, the current work of developing NFs using high-level languages, e.g., ClickNP [2], EMU [14], focuses on accelerating individual NF, and lacks the way of deploying multiple NFs on one FPGA where NFs should share same functionalities while ensuring the data areolation. In addition, other work presents programmable architecture, such as PISA [18], abstracts packet processing into multiple Match-Action Tables (MATs) [19], and mapping network processing described in P4 [20] into these MATs. Currently, PISA focuses on processing packet in hardware, and it does not support software/hardware co-processing. In DrawerPipe, software module is equivalent to the hardware module, and both can be designed for data-plane processing. For example, limited by FPGA resources, we only offload part of the processing of Snort to FPGA for acceleration, including packet parsing, packet classification used, and left option matches on software. Therefore, a packet may be processed by hardware and software modules.
DrawerPipe is compatible with two methods using high-level programmable language, i.e., modules generated by HLS tools or MATs can be loaded in the "Drawer" to construct NFs with handwritten modules. Currently, although most NFs are still manually written in hardware description languages (HDL) in the datacenter [8], we may use high performance modules generated by HLS tools with the optimization of compilers in the future.

Conclusions and Future Work
In this paper, we present DrawerPipe, a modular and reconfigurable network processing pipeline for FPGA-based SmartNICs. DrawerPipe abstracts packet processing into multiple "drawers" connected using the same interface, and allows users to extend NFs by loading their core processing logic in the "drawers" while reusing the existing modules. As most NFs carry out similar processing stages, DrawerPipe also provides five highly reusable modules for basic packet processing. Furthermore, DrawerPipe adopts a programmable module indexing mechanism, namely PMI, to connect software and hardware modules for FPGA/CPU co-processing and to construct various module chains to perform required NFs for different flows. Although we need to redefine Metadata when change supported protocols (e.g., replacing IPv4 with IPv6), it is easy to extend the width of Metadata. In DrawerPipe, we only need to redefine the width of Metadata and modify the logic of extracting fields from Metadata for each module, without redesigning the core application logic. Moreover, we have designed an IPv6-based DrawerPipe, and implemented a switch supporting IPv6-based Segment Routing (SRv6) [68], which can be used to construct multiple overlay networks for different services. Our practice shows that DrawerPipe is flexible as it can rapidly reconfigure FPGA to construct specific NFs and achieves high performance using well-designed modules.
However, there are also some avenues that suggest improvements in scalability. For example, current SBV-based packet classification and SRAM-based packet buffer consume a lot of on-chip memory resource. Our future work will focus on reducing the resource consumption by optimizing or redesigning packet classification algorithm, such as employing the TSS [39] algorithm to replace the SBV algorithm. We will also extend the API of DrawerPipe shell for developers to use the onboard DDR memory to buffer packets or rules with an efficient replacement strategy.