1. Introduction
The public cloud has become a popular way to dynamically deliver on-demand computing and storage resources to multi-tenant systems, which effectively reduces both capital expenditure (CAPEX) and operational expenditure (OPEX) [
1]. Since more and more enterprises and individuals are migrating their businesses and networks to the public cloud, building a robust network security environment for multi-tenant systems is becoming a critical challenge for public cloud providers [
2].
There are three important security protection features in the public cloud. (1) Boundary protection. Because diverse attacks are launched from the external network, it is important to deploy network protection devices at the boundary of the public cloud. (2) Multi-function deployment. In the public cloud, the business of each tenant is conducted in a dedicated virtual network, and security function requirements differ across tenants [
3]. (3) Flow-based security service orchestration. Similar to the flow-based quality of service (QoS) in a network, the granularity of security protection is flow based in order to establish a good tradeoff between performance and security [
4].
To satisfy the above requirements, the integrated security stack concept has been proposed for highly efficient collaboration of multiple security functions on a single system. However, how to implement an integrated security stack remains an open question once function, performance, and cost are all considered. From the related work, we identify three typical roadmaps for building an integrated security stack:
(1) Customized hardware design. Traditional security middleboxes (e.g., firewall and intrusion detection systems) are built and optimized based on customized ASIC chips for specific purposes. Each individual middlebox provides high packet processing performance while exhibiting limited programmability [
5]. The system-level overhead for building a security function chain is substantial due to connecting these security middleboxes with network cables.
(2) Software-only design. Network function virtualization (NFV [
6]) has enabled public cloud providers like Amazon and Microsoft to deliver high flexibility through software-based network functions deployed on general-purpose servers [
7]. Recently, some software solutions (e.g., DPDK [
8], VPP [
9], etc.) have been developed to accelerate packet processing with zero-copy and vectorized optimizations. While 40G/100G line-rate forwarding performance in Layer 2–3 could be reached with the above methods, software-only implementations face inherent limitations in throughput and latency when handling complex security processing tasks in Layer 4+ (e.g., stateful functions or deep packet inspection) [
10]. Since many security functions are deployed in-line to process flows without a traffic mirror, existing software solutions struggle to satisfy service-level agreements (SLAs).
(3) Software/Hardware co-design. In order to balance processing performance and flexibility, software/hardware co-design based on CPU and programmable hardware has emerged as a promising solution [
11]. Current programmable hardware solutions include GPUs, FPGAs, network processors (NPs), and programmable ASIC chips. GPUs are typically deployed in look-aside mode, where the communication between the CPU and GPUs introduces relatively greater processing latency [
12]. Programmable ASICs implement functions based on match-action rules, lacking support for complex operations (e.g., stateful processing) [
13]. Compared to NPs, FPGAs use a non-von-Neumann architecture, which has great potential for accelerating complex applications with lower processing latency [
14].
Among the three roadmaps above, the CPU/FPGA-based SmartNIC approach offers a good tradeoff between performance and flexibility. In this paper, we focus on accelerating security functions in the public cloud with CPU/FPGA-based SmartNICs. Microsoft has deployed FPGA-based SmartNICs in the Azure cloud for years to accelerate its Virtual Filter Platform (VFP [
15]), where cloud administrators map VFP-configured functions (e.g., ACLs, VNETs, and load balancers) into related actions implemented on SmartNICs. By aggregating rules from all functions through the generic flow table (GFT) in SmartNICs, the processing performance is optimized with a fast hardware path [
16].
Although the practices of Microsoft demonstrate advantages in their customized scenarios, there are two critical restrictions for network operators aiming to develop and deploy security functions on CPU/FPGA-based SmartNIC platforms. (1) Substantial development cost. Development based on CPU/FPGA-based SmartNIC platforms requires professional hardware/software co-design capabilities that most tenants and network operators do not have [
17]. It is time-consuming and tedious to develop functions from scratch with low-level APIs to meet diverse security requirements. (2) Inefficient inter-function cooperation. Many designs emphasize improving the intra-function performance while disregarding the system-level co-design among functions. In such fragmented designs, there is a great deal of redundant logic, and the frequent cross-function communications are the key bottlenecks of system performance.
To optimize system performance, we start by analyzing the features of security functions and summarize three critical observations. (1) For security detection functions based on packet payloads, mapping user security policies into executable security rules in the data plane incurs great overhead. (2) Although the processing logic of each security function is complex, many operations are shareable among these functions. (3) Security functions do not modify packet data. Moreover, the security function chain that each packet traverses is determined by its flow identification, and the processing result of each function only determines whether to bypass the subsequent functions. These observations provide the theoretical foundation for the architecture and optimization design in this paper.
We propose PASS, a flexible programmable framework for building an integrated security stack in the public cloud. PASS provides well-defined APIs for users to develop security functions efficiently while delivering high-performance processing with FPGA-based SmartNIC acceleration. By referring to the software-defined networking (SDN) architecture [
18], PASS enables centralized security policy orchestration based on a security controller. It eliminates per-function configuration operations while improving the efficiency of security management.
The main characteristics of PASS cover three aspects. (1) PASS divides the system architecture into three planes: the control plane, the security auxiliary plane, and the data plane. To support fast translation and distribution of security policies, PASS offloads complex security policy operations to the security auxiliary plane. This greatly reduces the processing latency and the volume of generated policy data compared with the typical two-plane architecture. (2) PASS optimizes the inter-function processing flow by extracting shared packet operations (e.g., packet parsing and packet classification) as pre-processing modules. It provides users with flow-granularity orchestration of security function chains. All shared control information is carried in user-defined metadata, which allows redundant processing logic to be eliminated. (3) PASS provides users with high-level APIs that hide generic low-level logic in platform-specific libraries. It divides a typical security function into three stages: rule parsing, packet processing, and statistics reporting. Because the generic logic (e.g., resource management, match algorithms, etc.) is provided as a library, developers only need to focus on designing the core processing logic and data structures. Moreover, modules developed with the same APIs can be reused easily.
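The three-stage division in (3) can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class and method names (`SecurityFunction`, `parse_rule`, `process_packet`, `report_stats`) and the toy rule syntax are hypothetical and are not the PASS APIs, which target C/Verilog.

```python
from abc import ABC, abstractmethod
from collections import Counter


class SecurityFunction(ABC):
    """Hypothetical base class: rule parsing, packet processing, statistics reporting."""

    def __init__(self):
        self.rules = {}         # rule identifier -> parsed rule
        self.stats = Counter()  # verdict -> number of packets

    @abstractmethod
    def parse_rule(self, raw_rule):
        """Stage 1: translate a raw rule into the function's internal form."""

    @abstractmethod
    def process_packet(self, fid, packet):
        """Stage 2: return "PASS" or "DROP" for one packet of flow `fid`."""

    # Generic logic below plays the role of the shared library code.
    def add_rule(self, fid, raw_rule):
        self.rules[fid] = self.parse_rule(raw_rule)

    def handle(self, fid, packet):
        verdict = self.process_packet(fid, packet)
        self.stats[verdict] += 1
        return verdict

    def report_stats(self):
        """Stage 3: return a snapshot of the per-verdict counters."""
        return dict(self.stats)


class PacketFilter(SecurityFunction):
    """Toy packet-filtering firewall; the rule syntax "deny <port>" is assumed."""

    def parse_rule(self, raw_rule):
        action, port = raw_rule.split()
        return {"action": action, "dst_port": int(port)}

    def process_packet(self, fid, packet):
        rule = self.rules.get(fid)
        if rule and rule["action"] == "deny" and rule["dst_port"] == packet["dst_port"]:
            return "DROP"
        return "PASS"
```

In this sketch a developer writes only `parse_rule` and `process_packet`; rule storage and counters are inherited, which mirrors the idea of keeping generic logic in a library.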
We implement a PASS prototype on a CPU/FPGA-based SmartNIC platform (based on the FAST framework [
19]). On this platform, three typical security functions (packet-filtering firewall, stateful firewall, and intrusion detection system) are developed in C/Verilog. The evaluation shows three advantages of PASS. (1) Compared to security functions developed from scratch without module reuse, PASS reduces the amount of code by 65% on average. (2) Compared to software-only implementations, PASS improves security processing performance by up to 76%. (3) Compared to the traditional two-plane design, where all policy-related operations are executed in the centralized control plane, offloading these operations to the security auxiliary plane reduces the latency of policy translation and distribution by up to 90%.
Overall, the contributions are as follows.
(1) We present the motivations for developing security functions on CPU/FPGA-based SmartNIC platforms, and propose the design goals of the programmable security development framework PASS (Section 2).
(2) We design the software-defined three-layer architecture of PASS, and provide the key optimizations and programming models (Section 3).
(3) We implement the PASS framework and three typical security functions to verify the technical feasibility (Section 4).
(4) We build an experimental testbed, and the evaluation shows that PASS improves packet processing performance by up to 76% and reduces policy distribution latency by up to 90% (Section 5).
2. Motivation
2.1. Programmable Platforms
Currently, network researchers build homogeneous or heterogeneous programmable platforms [
20] to meet the growing demand for network programmability using CPUs, FPGAs, GPUs [
21], programmable chips (PCs) [
22], and network processors (NPs) [
23]. In public cloud environments, there are four key requirements for programmable platforms to support the deployment of security-critical functions.
(1) High Performance. With the rapid expansion of business and resource scale in the public cloud, the throughput performance of network and security devices is required to reach 100G/400G. In addition, to satisfy the QoS (quality of service) requirements of latency-sensitive applications, the underlying network infrastructure should deliver data to the destination within the specified deadline. Currently, many security functions (e.g., firewalls, intrusion prevention systems, etc.) are deployed in the packet processing path using in-line mode [24], which increases end-to-end latency due to security processing overhead. Thus, enhancing security while ensuring low-latency packet transmission remains a challenge in the design and implementation of security functions.
(2) High Flexibility. In the public cloud, the security function chains traversed by flows from different or the same tenants are diverse. Moreover, the types of security functions to be deployed may change dynamically as the network state evolves [25]. Therefore, programmable platforms should support flexible security policy configuration and dynamic security function reconfiguration.
(3) Multi-function Support. Since public clouds serve the security needs of multiple tenants using shared resources, programmable platforms should support the simultaneous deployment of multiple security functions. This approach improves resource utilization efficiency while reducing latency caused by inter-function cooperation. Furthermore, strict resource isolation should be enforced to prevent interference among tenants.
(4) Low Cost. It is essential for public cloud providers to minimize cost while meeting user requirements. The primary costs associated with deploying programmable platforms include equipment modification and power consumption [26]. Thus, leveraging existing commercial server resources and deploying programmable devices in an incremental plug-in mode are considered optimal solutions.
The characteristics of different programmable platforms in existing research are compared in terms of performance, flexibility, cost, deployment point, and networking mode, as shown in 
Table 1.
(1) Performance. Although software processing performance can achieve 40G/100G line rate in L2–L3 forwarding with optimization techniques (e.g., DPDK, VPP), there are significant gaps between software and hardware solutions in L4+ layer applications. Programmable switches face similar limitations due to the lack of relevant actions. GPUs are deployed in look-aside mode and require frequent memory copy operations, resulting in high interaction latency. Thus, GPUs are more suitable for accelerating AI and big data applications. In contrast, FPGAs are deployed in in-line mode and provide a fast path for packet processing. With advantages in cost-effectiveness and power efficiency, FPGAs are popular for accelerating complex network applications with lower processing latency.
(2) Flexibility. Programmable switches implement functions with multi-stage match-action tables, which provide limited support for stateful functions. Similarly, ASIC-based switches or FPGA solutions are suitable for the implementation of simple functions. Complex functions are better handled by CPUs. FPGAs are popular as platforms supporting logic offload based on the characteristics of security functions.
(3) Cost. Radically replacing commercial switches with programmable switches in the public cloud incurs high costs. By contrast, deploying SmartNICs incrementally on host servers involves lower expenses. Generally, SmartNICs are implemented based on FPGAs or NPs. FPGAs are designed using a non-von-Neumann architecture, which makes them more suitable for accelerating packet stream processing. Furthermore, incremental FPGA deployment effectively minimizes CPU resource consumption.
(4) Deployment and Networking. The underlay networking mode is required for deploying security functions at the switch layer. It allows users to access physical switch resources directly. By contrast, overlay networking is necessary to deploy security functions at the host layer. Although the design complexity of endpoint systems increases, the overlay networking mode simplifies the deployment of security functions.
To sum up, the best platform selection for satisfying the four requirements mentioned above is the host CPU/FPGA-based SmartNIC. Currently, Microsoft has been deploying FPGA-based SmartNICs in its public cloud at scale for years. Microsoft’s solution provides a fast hardware path for specific scenarios. However, there is an urgent need to provide a programmable framework that allows users to develop high-performance network security functions through hardware/software co-design.
2.2. Network Security Function
In order to support the design of a programmable development framework, we conduct an in-depth study on the characteristics of network security functions. The processing workflow of security functions is abstracted into three sequential phases (Decision -> Execution -> Feedback). (1) Decision phase. User applications dispatch abstracted security policies after analyzing the current network states. These policies are then translated into executable security rules. (2) Execution phase. The data plane performs security analysis on the traffic according to the defined rules. This phase consists of protocol parsing, packet header-based detection, packet payload-based detection, and execution of security actions. (3) Feedback phase. The data plane reports statistical information to the control plane for further security analysis and policy decision-making.
Security detection and protection are carried out by performing these three phases cyclically, which helps prevent user networks from security attacks. Next, we analyze security functions from two aspects as follows.
(1) Security function deployment. (a) Diversity. The network topology and internal traffic patterns differ across multi-tenant environments, resulting in diversity in the types and deployment locations of security functions. In addition, the security detection requirements for different flows belonging to the same tenant may also vary. (b) Dynamics. On one hand, the types of flows on a link change as user applications start or terminate. On the other hand, the bandwidth of flows varies over time. For example, the number of data accesses during the day is significantly higher than at midnight. These two factors require security devices to support dynamic, flow-based orchestration of security functions.
(2) Security function design. (a) Security rules are complex, especially in packet payload detection applications. We take Snort [
33] as an example. Each security rule consists of a rule header and multiple rule bodies. The rule header and bodies describe flow and attack features, respectively. Since the attack features are diverse, the translations from abstracted security policies to executable security rules introduce considerable overhead. (b) There are many shared operations among these complex security functions. Snort usually performs exact matching and regex matching on a packet multiple times. The stateful firewall (SFW) needs to support the management of TCP flow states. Much of the processing logic (e.g., packet parsing, matching, etc.) and related intermediate results can be shared among these functions. The exact and regex matching actions are the same across the Intrusion Detection System (IDS), the Web Application Firewall (WAF), and Data Loss Prevention (DLP), although their rule sets differ. (c) Security functions do not modify packet data. Unlike network functions (e.g., NAT, load balancer), which require modification of packet data (e.g., addresses, ports, etc.), security functions only need to match packet data and perform PASS/DROP actions. For network function chains, the action affects the function path that packets traverse. By contrast, security function chains are determined by flow features, which are unrelated to the actions.
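Observation (c) can be illustrated with a short Python sketch: the chain is selected from flow features alone, each function only reads the packet, and a DROP verdict merely skips the remaining functions. The flow key, the chain table, and the toy checks (`pfw`, `ids`) are hypothetical and chosen only for illustration.

```python
def pfw(packet):
    """Toy packet filter: block telnet (port 23 is an arbitrary example)."""
    return "DROP" if packet["dst_port"] == 23 else "PASS"


def ids(packet):
    """Toy payload check: a single exact-match signature."""
    return "DROP" if b"attack" in packet["payload"] else "PASS"


# Chains are keyed by flow features only; verdicts never change the chain.
CHAINS = {
    ("10.0.0.1", 80): [("PFW", pfw), ("IDS", ids)],
    ("10.0.0.2", 23): [("PFW", pfw)],
}


def select_chain(packet):
    """Look up the security function chain from (dst_ip, dst_port)."""
    return CHAINS.get((packet["dst_ip"], packet["dst_port"]), [])


def run_chain(chain, packet):
    """Run read-only functions in order; the first DROP bypasses the rest."""
    for name, check in chain:
        if check(packet) == "DROP":
            return "DROP", name
    return "PASS", None
```

For example, a packet to 10.0.0.1:80 whose payload contains the signature passes the PFW and is dropped by the IDS, while the packet bytes themselves are never rewritten.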
2.3. Comparison of Typical Programmable Frameworks
In this paper, we aim to provide a flexible programmable security development framework based on hardware/software co-design within the SDN architecture. SDN decouples the network architecture into control and data planes, enabling centralized control policy generation and distribution via a logically centralized controller. It significantly improves network management efficiency by using global network resources as input. In this paper, three typical software-hardware co-design frameworks (VFP, OpenBox [
27], and ClickNP [
32]) are compared with PASS in 
Table 2.
(1) Microsoft VFP. It orchestrates network and security function chains based on a host-based SDN architecture. Network administrators deploy specific service function chains (SFCs) on VFP according to application requirements, where each function is managed by a dedicated controller to reduce inter-component dependencies and improve scalability. However, the architectural design of VFP is not intended to provide a unified programmable framework. First, it is tedious and time-consuming for network operators to develop and manage multiple controllers. In addition, distributing SFC policies through multiple controllers introduces greater complexity. Second, the inter-function connections are fixed, where the input/output flows traverse the same SFC in reverse order. This makes it difficult to support flow-based SFC deployment and orchestration.
(2) OpenBox. It presents an abstraction of packet processing applications for the development and deployment of network functions. In addition, it decouples the control plane of network functions from the data plane and allows the reuse of data plane elements by multiple logical NFs. However, as a general network function development framework, it lacks customized optimizations for security functions. For example, it is particularly important to provide design guidance for mapping security functions to hardware/software and control plane/data plane components.
(3) ClickNP. It focuses on accelerating network functions with programmable hardware. It provides a modular architecture, resembling the well-known Click model. In addition, it provides a high-level C-like language to program FPGAs efficiently and proposes a set of optimization techniques to utilize the parallelism in FPGAs and reduce I/O overhead. Since it focuses on providing a framework for developing network functions, optimizations for the design of security function chains and the control plane are not carried out in ClickNP.
2.4. Design Goals
In conclusion, developing a high-performance programmable framework is urgently needed for implementing security functions with the introduction of SDN and SmartNICs. By referring to the security function characteristics in 
Section 2.2 and the comparison in 
Table 2 in 
Section 2.3, we propose three key design goals for a software-defined, programmable security framework with SmartNIC acceleration.
G1: Dividing security functions between the control plane and data plane rationally.
SDN proposes to decouple the network architecture into control and data planes. However, the detailed division between control logic and execution logic should be determined based on specific scenario characteristics. Fresco [
34] proposes a modular security development model for extending security functions within the controller. This design requires forwarding a large number of packets to the controller, which can easily make the controller a performance bottleneck. Avant-Guard [
35] and OFX [
36] propose migrating security processing logic from the control plane to the data plane, which greatly reduces cross-plane data volume. These studies mainly focus on optimizing packet data processing, while neglecting the complexity of security rule generation and distribution. Specifically, when the translation from high-level abstract policies to low-level executable rules is performed in the control plane, it introduces high cross-plane communication latency and generates a large volume of packet data. Therefore, it is particularly important to reconsider the rational division of functions between the control and data planes.
G2: Accelerating security function chains in the data plane.
The packet processing overhead in the data plane consists of I/O overhead and security processing overhead. I/O overhead arises from packet receiving and transmitting operations. Currently, existing I/O acceleration frameworks (e.g., DPDK [
8], netmap [
37], etc.) optimize I/O performance using zero-interrupt and kernel-bypass techniques. Compared to network functions, the proportion of packet data processing to I/O processing is higher in security functions due to the deep analysis of packet contents. In this paper, we emphasize reducing security processing overhead from both intra-function and inter-function perspectives.
Intra-function acceleration is achieved by offloading performance- and resource-critical logic to the SmartNIC. HEX [
38] divides security processing into six phases and implements them on NetFPGA, with software applications only analyzing alert information. In practice, some complex security processing logic, such as packet payload intrusion detection, consumes significant resources and incurs high development costs. Thus, it is more suitable to implement such logic in software.
The inter-function acceleration focuses on optimizing the cooperation of SFCs. Although OpenBox [
27] proposes implementing shared header parsing and classification at the initial processing stage, it has two critical limitations. First, the connections between modules are fixed, so it cannot adapt to the requirements of dynamically deploying SFCs. Second, intermediate results are not reused across functions, which calls for rational hardware–software co-design.
Thus, there are still significant optimization opportunities for accelerating both intra-function and inter-function processing, which are deeply explored in this paper.
G3: Improving development efficiency with well-defined APIs.
Besides performance optimizations, designing a SmartNIC-accelerated security development framework requires enhancing user development efficiency. Specifically, for some complex security functions (e.g., Snort), where the operations optimizable by hardware acceleration are limited, providing well-defined APIs to accelerate the development cycle is very important. Here, the southbound protocols and related APIs between the control and data planes have a significant impact on development effort and complexity.
OpenFlow [
39] is the most widely used southbound protocol in network management. However, it focuses on protocol universality while lacking support for security functions. OFX supports the development and deployment of security functions by extending the flow tables and actions of OpenFlow. Although this evolutionary design is compatible with OpenFlow, it lacks a detailed protocol specification to support functionality and reliability. As for programming interfaces, some studies provide users with high-level APIs by customizing dedicated operating systems for specific scenarios. For example, mOS [
40] offers users flow state management services, where security functions are efficiently developed based on state events. Therefore, in order to accelerate security function development, it is important to enable users to focus on designing core data structures and processing logic by abstracting away underlying communication, resource management, matching algorithms, etc.
3. PASS Design
3.1. Overview
By referring to the SDN architecture and the features of security functions, the PASS framework is divided into three collaborative planes, as shown in 
Figure 1. The security controller runs in the control plane for global security management. The newly added security auxiliary plane hosts the PASS agent, which takes over high-overhead operations from the controller. Hardware/software security processing on packet data is performed in the data plane. Cross-plane communication is implemented based on the customized PASS protocol.
Security Controller. It processes the input security policies from the network manager and the statistical data reported from the data plane. In order to minimize the latency of policy distribution and reserve more computing resources for alert analysis, the controller only needs to dispatch abstract policies to the security auxiliary plane.
PASS Agent. It supports offloading of high-overhead control logic through three functions. First, it accelerates security rule translation and distribution using cached security rule sets. Second, it supports dynamic reconfiguration of the security functions. Third, it compresses the security statistics to reduce the volume of reported data.
Security Functions. PASS divides a security function into a software security function (sSF) and a hardware security function (hSF), as described in 
Section 3.3. The sSFs and hSFs are mapped to the CPU and the FPGA-based SmartNIC, respectively. The sSFs are developed with PASS APIs and run as OS processes. The OS process is selected because it provides a good trade-off between inter-sSF isolation and processing efficiency. Packet forwarding among sSFs is implemented via virtual switches (vSwitches). The hSFs are developed and deployed as FPGA modules, and the rules in hSFs are configured via the corresponding sSFs.
3.2. Unified Rule Management in PASS Agent
In a typical SDN architecture, controllers translate user-defined high-level security policies into matchable rules in the data plane. Notably, the number of received rules and the rule storage algorithms vary across different security functions. Thus, for an SFC policy, the positions for storing the mapped rules in different functions may differ. For example, a network operator inputs a policy to control flow A traversing a packet-filtering firewall (PFW) and an IDS in sequence. The security controller maps this policy into two security rules and dispatches them to the PFW and IDS, respectively. Since the 1st entry of the PFW and the 2nd entry of the IDS are already occupied, the newly received rules are stored in the 2nd entry of the PFW and the 3rd entry of the IDS, respectively, as shown in 
Figure 2a. There are three limitations in the design described above: (1) Rule translation for payload-based detection applications (e.g., IDS) requires looking up content-based rules from a database. In most cases, the number of mapped rules and the distribution latency are high. (2) The entry positions used to store rules for functions within the same SFC differ, which makes reuse of matching results impractical. (3) The entire rule headers must be carried when delete instructions are issued.
To overcome these challenges, policy translation and unified indexing for security rules are implemented in the PASS agent, as shown in 
Figure 2b. In our design, the controller only distributes user-defined policies to the agent. The agent decomposes the policy when a flow traverses multiple security functions. It prefetches all security rules from the controller and provides local access services for both the agent and security functions. The agent allocates a globally unique identifier to each security rule. Thus, it only needs to dispatch the rule identifier to the related security functions. In detail, a global rule table (GRT) is built in the agent to allocate a unique identifier (fid) to each rule. Since the control plane may frequently add and delete security rules, we design an algorithm that always allocates the free entry with the smallest fid, which improves the resource utilization of rule tables. The PFW and IDS store the security rules in their tables at the position given by the received fid, without extra computation.
There are three advantages in our design. First, the rule translation latency and rule data are greatly reduced in the control plane. Second, the globally unique fid can serve as the index for rule lookup in inter-function cooperation. Third, rules can be deleted using their fids without requiring the complete rule information.
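The smallest-fid allocation can be sketched in Python with a min-heap of free entries. This is a minimal illustration, not the authors' algorithm: the class name `GlobalRuleTable`, the fixed capacity, and the 0-based fids are assumptions.

```python
import heapq


class GlobalRuleTable:
    """Hypothetical GRT: hands out the smallest free fid and recycles deleted ones."""

    def __init__(self, capacity):
        # A sorted list is already a valid min-heap.
        self.free = list(range(capacity))
        self.rules = {}  # fid -> rule

    def add(self, rule):
        """Store `rule` in the free entry with the smallest fid and return that fid."""
        if not self.free:
            raise RuntimeError("global rule table is full")
        fid = heapq.heappop(self.free)
        self.rules[fid] = rule
        return fid

    def delete(self, fid):
        """Delete by fid only; the complete rule header is not needed."""
        del self.rules[fid]
        heapq.heappush(self.free, fid)
```

Because every function in a chain stores its rule at the same fid, a match on that fid in one function can be used as a direct index in the next, and freed entries are refilled before higher ones are touched.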
3.3. Mapping Security Function into SW/HW
In our design, the processing model of security functions is abstracted into two paths, as depicted in 
Figure 3. The first path consists of “protocol parse -> header-based detection -> action”. Similar to the OpenFlow protocol, the protocol parser identifies key metadata (e.g., packet type, address, etc.) for the subsequent modules. Header-based detection is categorized into stateless and stateful detection. Stateless detection performs matching operations on packet headers, while stateful detection requires executing state analysis on the managed flow states. The post-detection action consists of two parts. First, it generates statistics for further analysis in the controller. Second, it performs packet forwarding or drop operations. Path 1 is suitable for implementation on FPGA-based SmartNICs due to the limited number of detection fields and rules. L3–L4-related security functions (e.g., PFW, DDoS detection systems, and SFW) can be mapped onto Path 1.
Compared to Path 1, the more complex payload-based detection is performed in Path 2. Path 2 is abstracted as “protocol parse -> header-based detection -> payload-based detection -> action”, where the protocol parsing and header-based detection are performed in FPGA, as in Path 1. Since the processing logic and rules in payload-based detection are far more complex, implementing them fully in FPGA is impractical due to the high development and resource costs. Thus, it is more suitable to implement them in software or accelerate partial logic with hardware.
In order to shorten the processing path by eliminating redundant logic, we design a metadata structure and attach it in front of each packet to transmit intermediate results. The metadata is composed of generic fields and user-defined fields, as shown in 
Figure 4. The generic fields include fid, path, and action. fid is used to distinguish different flows, while path contains information about the security function chain. The action field includes To_CPU, To_PORT, and DROP. The user-defined fields are used to store intermediate results. The information extracted or generated by former security functions can be reused in subsequent security functions. In particular, the security devices are usually deployed at the network edge. Thus, metadata is used for cooperation among security functions within a device. When the packet is transmitted to the network or end system, the metadata in each packet is removed.
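A minimal sketch of attaching and stripping such metadata is given below. The field widths, the numeric action codes, and the encoding of the path as a bitmap are assumptions chosen for illustration; only the field names (fid, path, action, user-defined) come from the text.

```python
import struct

# Assumed numeric codes for the three actions named in the text.
ACTIONS = {"To_CPU": 0, "To_PORT": 1, "DROP": 2}

# Assumed layout: 4-byte fid, 4-byte path bitmap, 1-byte action,
# 1-byte length of the user-defined area (network byte order, no padding).
GENERIC = struct.Struct("!IIBB")

def attach(packet, fid, path, action, user=b""):
    """Prepend metadata to a packet before it enters the function chain."""
    return GENERIC.pack(fid, path, ACTIONS[action], len(user)) + user + packet

def detach(frame):
    """Split a frame into (metadata, original packet), e.g. at device egress."""
    fid, path, action, user_len = GENERIC.unpack_from(frame)
    start = GENERIC.size
    user = frame[start:start + user_len]
    return (fid, path, action, user), frame[start + user_len:]

# Path (1, 3) encoded as a bitmap with bits 1 and 3 set.
frame = attach(b"payload", fid=1, path=(1 << 1) | (1 << 3), action="To_CPU", user=b"GET")
meta, packet = detach(frame)
```

Calling `detach` before transmission mirrors the statement that the metadata is removed when the packet leaves the device.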
  3.4. Cooperation Between Security Functions
PASS enables cooperation among security functions within a single device. It effectively eliminates cross-device latency in an SFC and supports shortening the processing path. Since PASS divides a security function into sSF and hSF, there are two HW/SW interaction modes when a flow traverses an SFC.
In mode 1, the complete processing logic in each function is executed sequentially, as illustrated in 
Figure 5a. In detail, the “hSF -> sSF -> hSF -> sSF” path involves four I/O communications between HW and SW. The number of I/O accesses grows linearly with the SFC length, adding two for each function in the chain. In mode 2, packets are forwarded to sSFs only after all the hSFs in the SFC have processed them, as illustrated in 
Figure 5b. The “hSF -> hSF -> sSF -> sSF” path involves only two I/O communications between HW and SW. The number of I/O accesses remains unchanged as the SFC length grows. Compared to mode 1, we adopt mode 2 to reduce I/O access. In our design, all the intermediate results of hSFs are stored in the metadata (as described in 
Section 3.3) to provide reusable information directly to sSFs.
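The difference between the two modes can be checked by counting HW/SW boundary crossings for a chain of n functions. The sketch below assumes that a packet enters and leaves the device through hardware ports, which is consistent with the counts of four and two given above; the function names are hypothetical.

```python
def crossings(order):
    """Count HW/SW boundary crossings for a stage order such as 'hshs'.

    'h' is a hardware stage (hSF) and 's' a software stage (sSF). The packet
    is assumed to arrive at and depart from a hardware port.
    """
    stages = ["h"] + list(order) + ["h"]
    return sum(1 for a, b in zip(stages, stages[1:]) if a != b)

def mode1(n):
    # Each function runs its hSF and then its sSF before the next function.
    return "hs" * n

def mode2(n):
    # All hSFs run first, then all sSFs.
    return "h" * n + "s" * n

table = [(n, crossings(mode1(n)), crossings(mode2(n))) for n in (1, 2, 3, 4)]
```

Under this assumption mode 1 needs 2n crossings for n functions, while mode 2 needs two regardless of n.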
Both the stateless and stateful detection functions include L3–L4 packet header matching functions. Since packet header parsing and classification are shared operations, we extract these operations as a shared hSF in the pre-processing stage. The pre-process sSF is used to configure the rule tables in the pre-processing hSF. In addition, SFC orchestration can be completed in the pre-processing stage because the SFC that a flow traverses depends only on the flow features. The header-based classification table consists of packet header features as keys (such as five tuples), fid, and path. When a packet is processed by the pre-processing hSF, the metadata attached to the packet carries the fid and path. Each subsequent hSF determines whether to process the received packet by matching the carried path information with its built-in path selector module. If a packet does not need to be processed by a subsequent sSF/hSF according to previous actions, the corresponding sSF/hSF identifiers are removed from the path information. Furthermore, all subsequent sSFs/hSFs can index the related entries based on the fid without relying on unnecessary flow features.
We illustrate the PFW+IDS chain with an example in 
Figure 6. (1) The operators input a security policy to the security controller. The policy states that HTTP packets from A to B should traverse PFW (No.1) and IDS (No.3) in sequence. The controller distributes the policy to all the security devices in the data plane. (2) Upon receiving this policy, the PASS agent computes the fid, path, and rules according to this policy. These rules are then dispatched to the shared Pre-process sSF, PFW sSF, and IDS sSF, respectively. (3) The classification table in the shared Pre-process sSF stores the key (flow feature), fid, and path. When packets from A to B arrive, the second flow entry is hit, which indicates that the fid and path are 1 and (1,3), respectively. (4) The path selector module in each hSF uses the path to decide whether this packet should be processed by that hSF. The PFW hSF performs fid-based filtering, and the resulting action is to pass the packet. (5) Since the SFC does not involve SFW, the path selector in the SFW hSF transmits this packet directly to the IDS hSF. (6) The IDS hSF checks whether it is an HTTP packet and stores the extracted HTTP type (GET or POST) in the user-defined metadata for further payload-based detection in the IDS sSF. (7) When the packets are mirrored to the IDS sSF, payload-based security analysis is performed against the cached set of rule bodies, using the carried fid as an index. If threats are detected, alerts and logs are reported to the controller for further actions.
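The hardware part of this walk-through can be mimicked in a few lines. The sketch is illustrative only: the table contents, the function number 2 for SFW, the first (non-matching) flow entry, and all helper names are assumptions, and real hSFs operate on parsed header fields in FPGA logic rather than Python dictionaries.

```python
PFW, SFW, IDS = 1, 2, 3   # PFW=1 and IDS=3 follow the example; SFW=2 is assumed

# Classification table of the pre-processing stage: (key, fid, path).
CLASSIFICATION = [
    ({"src": "C", "dst": "D", "dport": 53}, 0, (PFW,)),       # hypothetical first entry
    ({"src": "A", "dst": "B", "dport": 80}, 1, (PFW, IDS)),   # entry hit in the example
]

def pre_process(pkt):
    """Return the metadata (fid, path, user-defined area) for a packet."""
    for key, fid, path in CLASSIFICATION:
        if all(pkt.get(k) == v for k, v in key.items()):
            return {"fid": fid, "path": list(path), "user": {}}
    return None

def run_hw_chain(pkt, meta, pfw_actions):
    """Walk the hSFs in order; each path selector checks meta['path']."""
    trace = []
    for func in (PFW, SFW, IDS):
        if func not in meta["path"]:
            trace.append((func, "bypass"))
        elif func == PFW:
            action = pfw_actions[meta["fid"]]   # fid-indexed lookup, no header match
            trace.append((func, action))
            if action == "drop":
                meta["path"] = []               # later functions are no longer needed
        elif func == IDS:
            method = pkt["payload"].split(b" ", 1)[0].decode()
            meta["user"]["http_method"] = method   # reused later by the IDS sSF
            trace.append((func, "to_cpu"))
        else:
            trace.append((func, "pass"))
    return trace

pkt = {"src": "A", "dst": "B", "dport": 80, "payload": b"GET /index.html HTTP/1.1\r\n"}
meta = pre_process(pkt)
trace = run_hw_chain(pkt, meta, pfw_actions={1: "pass"})
```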
  3.5. PASS Programming Model
In order to improve development efficiency, PASS provides a unified programming model to support module reuse among different users. PASS is a modular development framework, where security functions are divided into user-specific modules and platform-specific modules, as shown in 
Figure 7. The PASS APIs and the PASS southbound protocol are the core components of the PASS programming model.
(1) PASS APIs. Platform-specific modules hide the complex underlying implementation by providing standardized interfaces. They consist of the FAST library, FAST OS, and PASS library. The FAST library and OS are provided by the open-source FAST project [
19]. The complex implementations of the PCIe, DMA, and I/O drivers are transparent to developers via the FAST library and OS. We provide PASS APIs to hide communication between security functions and the agent, CPU resource management, complex matching algorithms, etc. The main PASS APIs are listed in 
Table 3. This allows developers to focus on developing the core security logic. Developers only need to design and register three user-defined callbacks with the platform. The platform allocates CPU resources and creates threads to run them. Specifically, rule_mgt_callback() is used to parse and configure rules into the rule tables of sSF and hSF. pkt_handler_callback() is used to perform complex logic (such as payload matching) with security rules. log_mgt_callback() is used to collect and analyze log information. The packet parsing and classification in hSFs are platform-specific modules. In addition, we provide the Stride BV matching algorithm as an IP block. The input to each hSF is a packet together with its metadata. Therefore, users only need to design the hardware rule table and core state machine.
 We program a simplified Snort with PASS as an example in 
Figure 8. The Snort security functions are divided into IDS sSF (left part) and IDS hSF (right part). (1) IDS sSF. The developers only need to implement three callback functions, including ids_rule_callback(), ids_pkt_callback(), and ids_log_callback(). Specifically, PASS provides users with shared functions for exact match and regex match on the packet payload. (2) IDS hSF. Since the packet header-based parsing and classification are performed in the pre-process hSF, the IDS hSF determines whether to process the received packets based on the path information in the input metadata. Then, IDS hSF extracts the application-level protocol fields (e.g., HTTP) and determines whether to direct the packets to the IDS sSF based on ids_hw_flow_table. In particular, PASS provides users with reusable lookup algorithms (e.g., Stride BV).
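The three-callback structure of the IDS sSF can be sketched as below. `ToyPlatform` is a stand-in written for this example and is not the PASS library; its method names, the rule dictionary format, and the alert strings are hypothetical. Only the callback names come from the text.

```python
import re

class ToyPlatform:
    """Stand-in for the platform side: stores callbacks and invokes them."""

    def __init__(self):
        self.callbacks = {}

    def register(self, rule_mgt_callback, pkt_handler_callback, log_mgt_callback):
        self.callbacks = {
            "rule": rule_mgt_callback,
            "pkt": pkt_handler_callback,
            "log": log_mgt_callback,
        }

    def load_rule(self, fid, rule):
        self.callbacks["rule"](fid, rule)

    def deliver(self, fid, payload):
        alert = self.callbacks["pkt"](fid, payload)
        if alert is not None:
            self.callbacks["log"](alert)

rule_table = {}   # fid -> list of (kind, pattern)
alerts = []

def ids_rule_callback(fid, rule):
    rule_table.setdefault(fid, []).append((rule["kind"], rule["pattern"]))

def ids_pkt_callback(fid, payload):
    # The carried fid selects the rule bodies, so no header matching is repeated.
    for kind, pattern in rule_table.get(fid, []):
        hit = pattern in payload if kind == "exact" else re.search(pattern, payload)
        if hit:
            return f"fid={fid} matched {kind} rule"
    return None

def ids_log_callback(alert):
    alerts.append(alert)

platform = ToyPlatform()
platform.register(ids_rule_callback, ids_pkt_callback, ids_log_callback)
platform.load_rule(1, {"kind": "exact", "pattern": b"/etc/passwd"})
platform.load_rule(1, {"kind": "regex", "pattern": rb"cmd=\w+"})
platform.deliver(1, b"GET /index.html")
platform.deliver(1, b"GET /?cmd=ls")
```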
(2) PASS Southbound Protocol. Existing security function development frameworks support cross-plane communication by extending the OpenFlow experimental messages. Currently, different developers propose various extension proposals due to the lack of unified specifications. In this paper, we propose the PASS southbound protocol based on an analysis of the characteristics of security functions. The PASS southbound protocol format consists of a general packet head and a sub-protocol packet, as shown in 
Figure 9. (a) General packet head. It includes protocol version, packet type, and packet length. (b) Sub-protocol Packet. The sub-protocol packets store function-specific messages. There are six types of sub-protocol packets, as shown in 
Table 4. Each sub-protocol packet consists of a customized packet head and body.
 We take the rule management message as an example in 
Figure 9. The packet head contains Device_ID (the identifier of the device that receives this message), Rule_Type (the type of rule, e.g., PFW, SFW, or IDS), Rule_Op (the rule operation, e.g., add, delete, or update), Rule_Group_ID (the group identifier of the rule), Rule_Total_Num (the number of rules in the same group), and Current_Rule_ID (the index of the current rule in this message). To improve lookup efficiency, the rule bodies associated with the same rule header belong to the same rule group. The rule information is described using a Key -> Value format based on JSON for high flexibility.
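As a rough illustration of this layout, the sketch below encodes and decodes one rule management message. The byte widths of every field, the numeric codes for packet type, rule type, and rule operation, and the example rule keys are assumptions; the source specifies only the field names and the JSON-based Key -> Value body.

```python
import json
import struct

GENERAL_HEAD = struct.Struct("!BBH")    # version, packet type, total length (assumed widths)
RULE_HEAD = struct.Struct("!HBBHHH")    # Device_ID, Rule_Type, Rule_Op, Rule_Group_ID,
                                        # Rule_Total_Num, Current_Rule_ID (assumed widths)
PKT_TYPE_RULE_MGT = 1                   # assumed code for the rule management sub-protocol
RULE_TYPES = {"PFW": 0, "SFW": 1, "IDS": 2}
RULE_OPS = {"add": 0, "delete": 1, "update": 2}

def encode_rule_msg(device_id, rule_type, rule_op, group_id, total, current, rule):
    body = json.dumps(rule, sort_keys=True).encode()
    head = RULE_HEAD.pack(device_id, RULE_TYPES[rule_type], RULE_OPS[rule_op],
                          group_id, total, current)
    length = GENERAL_HEAD.size + len(head) + len(body)
    return GENERAL_HEAD.pack(1, PKT_TYPE_RULE_MGT, length) + head + body

def decode_rule_msg(msg):
    version, pkt_type, length = GENERAL_HEAD.unpack_from(msg)
    if length != len(msg):
        raise ValueError("length field does not match message size")
    fields = RULE_HEAD.unpack_from(msg, GENERAL_HEAD.size)
    body = json.loads(msg[GENERAL_HEAD.size + RULE_HEAD.size:])
    return version, pkt_type, fields, body

msg = encode_rule_msg(7, "IDS", "add", group_id=3, total=2, current=0,
                      rule={"proto": "tcp", "dport": 80, "content": "cmd.exe"})
version, pkt_type, fields, body = decode_rule_msg(msg)
```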
  4. PASS Implementation
The PASS implementation consists of the security controller, the security agent, and the security functions.
In the control plane, we develop a lightweight security controller in Java (2000 lines of code). We select Java for three main advantages. (1) Rich libraries for network programming, e.g., RESTful APIs and the OSGi framework. These libraries accelerate the development of security controller functions, such as southbound protocol parsing and northbound API design. (2) Cross-platform portability. Java-based controllers can run on different operating systems without rewriting code for specific hardware. This is crucial for building cross-vendor, multi-data-center SDN architectures. (3) Distributed clustering. Java’s distributed computing frameworks enable multi-node controller clusters with high availability and load balancing. This supports the development of distributed security controllers to enhance robustness in the future.
Although there are already many open-source SDN controllers (e.g., Floodlight [
41], RYU [
42], ODL [
43], etc.), we choose to develop a controller from scratch for two reasons. (1) PASS uses a customized southbound communication protocol, whereas these open-source controllers are based on the OpenFlow protocol. In the future, the PASS protocol will be embedded into OpenFlow to improve compatibility. (2) We offload many security-related functions into the security auxiliary plane to optimize latency. Adapting an open-source controller to this design would require substantial secondary development because these controllers contain many complex components and function calls.
In the security auxiliary plane, we implement the PASS agent in C (800 lines of code). Currently, the PASS agent supports rule management, software function start/stop, and other features. Since the PASS agent is designed based on an event-driven architecture, more functions can be flexibly extended by defining new events and registering related callbacks.
In the data plane, three security functions are developed, including PFW, SFW, and IDS. (1) PFW supports filtering packets by matching five-tuples with masks. Parallel lookup is implemented on FPGA (using the Stride BV algorithm), while CPUs execute rule and statistics-related tasks. (2) SFW supports complete state-based filtering based on the establishment and termination of the TCP protocol. The related state management and filtering features are implemented on FPGA. (3) IDS supports the core processing logic of Snort, such as exact and regex matching on the payload. Packet parsing and header-based detection are performed on FPGA, while the remaining complex logic (payload-based detection, rule and statistics management) runs on CPUs.
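For reference, the semantics of five-tuple matching with masks can be written as a plain linear scan. This sketch shows only what a match means; it is not the Stride BV algorithm used on the FPGA, and the rule set, the default action, and the helper names are hypothetical.

```python
import ipaddress

def prefix(cidr):
    """Convert an IPv4 prefix into a (value, mask) pair of integers."""
    net = ipaddress.ip_network(cidr)
    return int(net.network_address), int(net.netmask)

def exact(value, bits):
    return value, (1 << bits) - 1

ANY = (0, 0)   # a zero mask matches every value

# Each rule: ((src, dst, sport, dport, proto) as (value, mask) pairs, action).
RULES = [
    ((prefix("10.0.0.0/8"), prefix("192.168.1.0/24"), ANY, exact(80, 16), exact(6, 8)), "pass"),
    ((ANY, ANY, ANY, ANY, ANY), "drop"),
]

def pfw_lookup(src, dst, sport, dport, proto):
    key = (int(ipaddress.ip_address(src)), int(ipaddress.ip_address(dst)),
           sport, dport, proto)
    for fields, action in RULES:
        # A field matches when key and rule value agree on all masked bits.
        if all((k & m) == (v & m) for k, (v, m) in zip(key, fields)):
            return action
    return "drop"   # assumed default when no rule matches
```

For example, `pfw_lookup("10.1.2.3", "192.168.1.9", 40000, 80, 6)` hits the first rule, while the same flow toward 192.168.2.9 falls through to the catch-all entry.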
Development efforts. We implement these three functions on a CPU/FPGA-based SmartNIC using two solutions. (1) Strawman. The functions are implemented based on the FAST API without module reuse. FAST provides a general-purpose framework for hardware/software co-design. It hides the complex operations of DMA, PCIe, and Linux kernel implementation from users. (2) PASS. The functions are implemented based on the PASS API with module reuse. PASS APIs are implemented by extending the security-related libraries based on FAST APIs. We compare the lines of code (LoC) for both software and hardware in 
Table 5. The experimental results show that PASS can reduce the code volume by an average of 65%.
 Resource Utilization. The FPGA logic utilization is shown in 
Table 6. We categorize FPGA resources into three types: Device-Specific Module, Platform-Basic Module, and Function-Specific Module. (1) Device-specific modules. These are specific to the FPGA hardware, including the Ethernet ports, debug units, etc. They consume 27% Slice LUTs (6931) and 12% Block RAM Tiles (10). (2) Platform-basic modules. These refer to the FAST OS (as depicted in 
Figure 7), including PCIe, DMA, etc. They consume 52% Slice LUTs (13,376) and 45% Block RAM Tiles (37). (3) Function-specific modules. These include pre-process hSF, PFW hSF, SFW hSF, and IDS hSF. They occupy 21% Slice LUTs (5237) and 43% Block RAM Tiles (34.5). As shown in 
Table 6, the Function-specific modules occupy the fewest Slice LUTs, while Platform-basic modules consume the most. Thus, extracting shared logic into Platform-basic modules is important for reducing development efforts.
   5. Evaluation
  5.1. Experimental Setup
We set up two experimental testbeds to evaluate the performance improvement on the FAST-based network experimental platform, as depicted in 
Figure 10.
Testbed 1. It is used for performance evaluation of PFW-only, IDS-only, and PFW-IDS. Since the PFW and IDS security functions are stateless, we use a network emulator to generate user-defined traffic and test the round-trip time (RTT). It consists of a security controller, a PASS prototype, and a network emulator. The PASS prototype and network emulator are built on a Xilinx Artix-7 FPGA connected to an ARM Cortex-A9 CPU (866 MHz, single core with two hardware threads) via PCIe. In the PASS prototype, one thread is allocated to the OS while the other is used for running sSFs. The network emulator is implemented based on the open-source Project FAST-ANT [
44]. It supports precise packet TX/RX service with ∼10  
s jitter. All the links between the PASS prototype and the emulator are 1 GE fibers.
 Testbed 2. It is used for performance evaluation of SFW-only and SFW-IDS. Since the SFW security function depends on the analysis of TCP connection establishment, we use a commercial off-the-shelf (COTS) server to act as both the TCP server and client. We use iperf to generate parallel TCP streams for testing throughput performance. In order to test the handshake time and the maximum number of connections per second, we deploy an Apache HTTP server and use a tool to initiate a large number of TCP connections within one second.
In order to validate the performance optimization of PASS comprehensively, three types of experimental scenarios are designed as follows.
- Since PASS decomposes security functions into sSF and hSF, we compare the performance improvement of a single security function with and without hardware acceleration. 
- Since PASS proposes a high-efficiency SFC cooperation model, we compare the performance improvement between hardware-accelerated SFC and software-based SFC. 
- Since PASS offloads the security policy translation to the security agent, we compare the latency before and after offloading these operations. 
  5.2. Single Security Function Acceleration
In order to compare the performance improvement with hardware acceleration, we also implement the PFW, SFW, and IDS purely in software. The packet forwarding performance consists of I/O performance and processing performance. In this paper, we focus on improving packet processing performance. Before analyzing the experimental results, we describe the performance measurement methods. (1) In order to remove the overhead introduced by packet I/O, we test the forwarding bandwidth and latency of hardware direct forwarding and software direct forwarding as the baseline. The performance improvement is obtained by comparing hardware security processing with software security processing after subtracting the direct forwarding overhead from each. (2) Since the internal processing latency varies across FPGAs, we measure the loopback latency of the network tester and subtract it from the total latency. In addition, we measure the non-blocking latency by setting the packet sending interval to 1 ms. When packets are blocked in the socket queue, the queuing delay becomes significant and is considered part of the I/O overhead.
PFW performance. We test the throughput and latency of HW PFW and SW PFW with 256/512/1024/1500B packets, respectively, in 
Figure 11. The throughput of Emulator Loopback, HW Direct FWD, and HW PFW FWD reaches line speed. In contrast, the throughput of SW Direct FWD and SW PFW FWD increases as the packet size increases. The performance improvement ranges from 13% to 38% when comparing the results before and after hardware acceleration (excluding I/O overhead). The latency of HW Direct FWD and HW PFW FWD is the same and exhibits no jitter. The latency of SW Direct FWD and SW PFW FWD also increases with packet size. This is because of frequent buffer allocations/de-allocations and memory copies between kernel and user space. The memory copy overhead is directly proportional to the packet size.
SFW performance. We use iperf to establish 8 TCP connections and measure the throughput under different MSS (Maximum Segment Size) values, as shown in 
Figure 12a. The HW SFW FWD achieves line-rate forwarding at any packet size. Excluding I/O overhead, hardware acceleration improves performance by 20.5–28%. The CDF (Cumulative Distribution Function) of TCP connection establishment time is computed over 1000 connections, as depicted in 
Figure 12b. The establishment times of HW SFW FWD and HW Direct FWD are 15.5 µs and 14.7 µs, respectively. As for SW SFW FWD and SW Direct FWD, the establishment times are approximately 100 µs and 76 µs when the probability is 80%. Furthermore, we analyze the number of TCP connections established when initiating 20,000 TCP requests per second. This experiment is performed 100 times for each use case, and the results are shown in 
Figure 12c. The maximum number of TCP connections for HW SFW FWD and HW Direct FWD is around 8000 with high probability, while those of SW Direct FWD and SW SFW FWD are about 5500 and 2400, respectively.
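Statements such as "approximately 100 µs when the probability is 80%" read a value off an empirical CDF. A minimal sketch of that computation follows; the sample values are made up for illustration and are not measurements from the paper.

```python
def empirical_cdf(samples):
    """Return sorted (value, cumulative probability) pairs."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def value_at(samples, prob):
    """Smallest sample x such that at least `prob` of all samples are <= x."""
    for x, p in empirical_cdf(samples):
        if p >= prob:
            return x
    return None

# Hypothetical connection establishment times in microseconds.
times_us = [70, 72, 74, 76, 78, 80, 90, 100, 150, 300]
p80 = value_at(times_us, 0.8)
```

With these ten hypothetical samples, `p80` is the eighth smallest value, so the two slowest connections do not affect the reported figure.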
 IDS performance. The workload of IDS depends on the input packet size and rule set. In this experiment, IDS performs an exact match and a regex match, respectively, on the payload of each packet under different packet sizes using the same security rules. The throughput and latency of IDS are shown in 
Figure 13. The bandwidth and latency gap between HW-SW IDS FWD and SW IDS FWD becomes smaller as the packet size increases. The reason is that the proportion of optimized overhead from header-based detection via hardware acceleration is reduced, while the proportion of payload-based detection overhead increases. Thus, the throughput improvement from hardware acceleration ranges from 7% to 20%, and the latency is reduced by 3% to 18%. Since not all traffic is required to pass through IDS in actual scenarios, flows traversing the IDS function are filtered by packet parsing and matching in the FPGA. Processing performance is optimized by directing the specified traffic to software. Moreover, the development effort required to design a complex security function is greatly reduced with PASS APIs.
 Summary. For single security function acceleration, the performance improvement benefits from offloading packet header-related functions to hardware. Since the packet header fields can be easily extracted based on offset and length, FPGA has great potential to perform operations related to packet header parsing and matching.
  5.3. Security Function Chain Acceleration
We design two security function chain use cases (PFW-IDS and SFW-IDS) to evaluate performance optimization. The corresponding software-only functions are implemented as a reference.
PFW -> IDS. The throughput and latency under different packet sizes are measured, as shown in 
Figure 14. Compared to the single IDS function, the performance improvement of PFW -> IDS is more significant. The throughput increases by 14% to 50%, while the latency is reduced by 8% to 26%.
SFW -> IDS. We establish 8 TCP connections with iperf and measure the throughput under different MSS values, as shown in 
Figure 15a. Since the processing logic of SFW is more complex than that of PFW, the throughput improvement from PASS acceleration is greater (26% to 77%). In addition, we depict the CDF of the establishment time for 1000 TCP connections in 
Figure 15b. The handshake latency of HW-SW SFW-IDS and SW SFW-IDS is 90 µs and 140 µs, respectively, at a probability of 80%.
 Summary. Unlike single-function acceleration, the performance improvements of security function chain acceleration benefit from compressing the packet processing path and reusing intermediate results. Thus, as the length of the security function chain increases, the performance gains continue to grow. In addition, it is of great significance to analyze and extract the reusable processing logic among different security functions.
  5.4. Control Plane Acceleration
We design three solutions for distributing security policies to compare the performance improvements.
- Solution 1 (S1): The content-based rule database is built in the controller. The translations from input policies to rules are performed in the controller, followed by dispatching these rules to the agent. 
- Solution 2 (S2): The content-based rule database is built in the agent. The controller looks up the rule identifiers according to the user policies and dispatches these grouped identifiers to the agent. 
- Solution 3 (S3): The content-based rule database is built in the agent. The agent performs the translation and distribution according to the policies received from the controller. 
In this experiment, 1,800 rules are selected from the Snort rule library and classified into five categories: information leakage, code execution, Trojan attacks, botnets, and buffer overflow attacks. The security policies for these attacks are distributed to the agent sequentially. We provide statistics on the data volume and latency for the three solutions, as shown in 
Table 7. Latency is measured from the time the controller receives the security policies to the time it receives the corresponding ACK messages.
The experimental results show that as the number of rules increases, the data volume and latency in S1 and S2 increase, while those in S3 remain unchanged. S1 and S2 dispatch raw rule data and rule identifiers, respectively. When the packet length exceeds the MTU, the packet must be fragmented and reassembled by the protocol stack. By contrast, the packet size in S3 is only 0.4 KB, avoiding complex fragmentation and reassembly because S3 transmits abstracted policies. Compared to S1 and S2, S3 reduces latency by 82% and 65%, respectively. Notably, since the cached security rules can be shared between the agent and sSFs, the agent only needs to configure rule identifiers on sSFs without sending complete rule data.
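The effect of message size on fragmentation can be estimated with a short calculation. The sketch assumes IPv4 with a 20-byte header and a 1500-byte MTU, neither of which is stated in the text, and the 20,000-byte message is a hypothetical size rather than a value from Table 7.

```python
import math

def ipv4_fragments(datagram_payload, mtu=1500, ip_header=20):
    """Number of IPv4 fragments needed for `datagram_payload` bytes.

    Every fragment except the last must carry a multiple of 8 bytes.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    return max(1, math.ceil(datagram_payload / per_fragment))

policy_only = ipv4_fragments(400)      # about 0.4 KB, as reported for S3
bulk_rules = ipv4_fragments(20000)     # hypothetical message carrying raw rule data
```

Under these assumptions a 0.4 KB policy fits in a single packet, whereas a message carrying raw rule data is split into many fragments that must be reassembled at the receiver.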
Summary. Compared to network functions, security functions have more complex rule sets for detecting diverse security attacks, especially in the rule bodies. By caching the full or frequently used rule sets in the agent of each PASS-based security device, the communication overhead between the control plane and the data plane can be greatly reduced. Furthermore, all packets can be processed in the data plane without submitting raw packet data to the control plane.
  6. Discussion
Trade-offs of PASS. (1) Customization vs. Generalization. As described in 
Section 3, we chose to customize the southbound protocol and packet metadata to improve processing performance and resource utilization. As a result, the PASS framework is not easily compatible with other SDN programming frameworks based on the OpenFlow protocol. This means that all software in both the control plane and data plane must be replaced to deploy the PASS framework. (2) Scalability of the control plane. In this paper, we focus on optimizing the latency of rule translation and distribution and propose offloading many controller operations to the security agent. Furthermore, improving the robustness of PASS by supporting scalability with multiple controllers is important. In the future, we will place more emphasis on designing communication protocols for controller-to-controller and agent-to-multi-controller communication.
 Security for PASS. Since PASS provides physically shared security resources for multi-tenants, it is important to allocate logically exclusive resources to different tenants. The flow identification (fid), as a key field in the shared metadata, determines the mapping between security rules and packets. When the security controller dispatches a security policy, the tenant identification is also carried to the security agents. The agents allocate conflict-free fids for flows from multiple tenants. Therefore, flows from different tenants traverse different paths and hit different entries in the data plane. In addition, the agent supports developers in designing other fid allocation algorithms via open APIs. In the future, we will study trust models to enhance the security of PASS under adversarial scenarios.
Adaptability to Dynamic Scenarios. In multi-tenant scenarios, security policies are often updated dynamically in response to changes in traffic and threat conditions. In PASS, the controller only dispatches abstracted policies, while all translation and distribution operations are performed by the agents on security devices. According to 
Table 7, the policy distribution latency remains at 97 ms, unaffected by the number of rules. In addition, to minimize the impact on data-plane forwarding during policy updates, each rule entry in hardware and software includes a status flag. An entry is considered valid only when its status is 1, so the old rules remain operational until the new rules take effect. In the future, we will conduct further analysis and optimization of bottlenecks under dynamic scenarios.
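One way to use such a status flag is to stage a new entry as invalid and switch it on only when it is complete. The sketch below illustrates that idea; the stage/commit sequence and all names are assumptions, since the text specifies only that an entry with status 1 is valid.

```python
class RuleTable:
    """Toy rule table whose entries carry a status flag (1 = valid)."""

    def __init__(self):
        self.entries = []   # each entry: {"fid": int, "rule": str, "status": 0 or 1}

    def lookup(self, fid):
        for entry in self.entries:
            if entry["fid"] == fid and entry["status"] == 1:
                return entry["rule"]
        return None

    def stage(self, fid, rule):
        """Write a new entry that lookups ignore until it is committed."""
        entry = {"fid": fid, "rule": rule, "status": 0}
        self.entries.append(entry)
        return entry

    def commit(self, staged):
        """Validate the staged entry, then retire older entries for the same fid."""
        staged["status"] = 1
        self.entries = [e for e in self.entries
                        if e is staged or e["fid"] != staged["fid"]]

table = RuleTable()
table.commit(table.stage(1, "rule v1"))
pending = table.stage(1, "rule v2")
during_update = table.lookup(1)   # the old rule still serves traffic
table.commit(pending)
after_update = table.lookup(1)
```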
   7. Related Work
(1) Optimizations on the Data Plane
Research on data plane optimization can be categorized into network I/O acceleration and packet processing acceleration. Network I/O acceleration aims to improve packet transmission and reception throughput through optimized I/O software frameworks (e.g., DPDK, PF_RING [
45], Netmap). Packet processing acceleration focuses on reducing the computational and storage overhead caused by complex packet processing logic. In this paper, we focus on accelerating packet processing. Zero-interrupt and zero-copy optimization techniques are orthogonal to our work; these I/O optimizations will be integrated in future work to further enhance performance.
The related work on network processing acceleration can be categorized into three dimensions: (1) Architecture-level optimization. NetBricks [
28] reduces resource isolation overhead by replacing VMs with lightweight containers. VPP improves cache hit rates through batch processing. NFP [
46] enhances resource utilization by parallelizing multiple network functions across multiple cores. Since security functions typically do not modify packets, their parallelism can be further improved. (2) Application-level optimization. OpenBox abstracts processing workflows into fine-grained graphs and extracts shared operations to eliminate redundancy. Inspired by this approach, PASS abstracts HW/SW shared modules (e.g., packet parsing, header classification) into a pre-processing module and defines a control block to carry key intermediate metadata. (3) Heterogeneous processing optimization. Numerous studies explore accelerating NFV using GPUs or FPGAs. PacketShader [
31] and NBA [
47] propose GPU-based network function acceleration. These GPUs are deployed in a look-aside mode, where the NIC cannot perform DMA directly to GPU memory. Although ClickNP [
32] maps network functions to both CPU and FPGA, it lacks guidance on rational function partitioning between software and hardware. VFP provides a hardware fast path for network processing using FPGAs; however, it does not elaborate on hardware/software co-design. In this paper, PASS addresses these limitations by abstracting security function processing into two paths and proposing design specifications that enable collaboration between software and hardware as well as among different functions, guided by the characteristics of security functions.
(2) Optimization of the control plane
Although SDN decouples the control and data planes, there are no standardized specifications for how to map security functions across these two planes. If a large number of packets are forwarded from the data plane to the control plane, the centralized controller can become a performance bottleneck. DIFANE [
48] and DevoFlow [
49] propose caching control rules in the data plane, allowing all packets to be processed locally. In addition to avoiding forwarding packets to the control plane, PASS offloads policy translation to the security auxiliary plane to reduce latency and minimize interaction data volume.
(3) Optimization of development efficiency
In addition to performance optimization, existing research also focuses on improving development efficiency and reducing complexity. Since middleboxes are required to support complex state-related management, mOS provides developers with abstracted fine-grained flow events through a customized network stack. Developers only need to select the appropriate state event and register user-defined callbacks. Similarly, the PASS agent is designed with an event-driven architecture and supports offloading more workload to the security auxiliary plane by defining events and registering related functions. OFX supports security function development by extending the OpenFlow protocol; however, security functions must be developed from scratch, incurring significant development costs. PASS addresses these challenges by providing high-level, well-defined APIs. These APIs are designed by abstracting the security function processing model and hiding complex and shared packet operations.