Enabling Packet Classiﬁcation with Low Update Latency for SDN Switch on FPGA

: Field Programmable Gate Array (FPGA) is widely used in real-time network processing such as Software-Deﬁned Networking (SDN) switch due to high performance and programmability. Bit-Vector (BV)-based approaches can implement high-performance multi-ﬁeld packet classiﬁcation, on FPGA, which is the core function of the SDN switch. However, the SDN switch requires not only high performance but also low update latency to avoid controller failure. Unfortunately, the update latency of BV-based approaches is inversely proportional to the number of rules, which means can hardly support the SDN switch effectively. It is reasonable to split the ruleset into sub-rulesets that can be performed in parallel, thereby reducing update latency. We thus present SplitBV for the efﬁcient update by using several distinguishable exact-bits to split the ruleset. SplitBV consists of a constrained recursive algorithm for selecting the bit positions that can minimize the latency and a hybrid lookup pipeline. It can achieve a signiﬁcant reduction in update latency with negligible memory growth and comparable high performance. We implement SplitBV and evaluate its performance by simulation and FPGA prototype. Experimental results show that our approach can reduce 73% and 36% update latency on average for synthetic 5-tuple rules and OpenFlow rules respectively.


Introduction
Software-Defined Networking (SDN) [1] satisfies both flexibility and programmability for routers. SDN separates the control plane (named controller) and data plane (named switch) of routers. OpenFlow [2] is deployed in SDN as a de facto standard, which defines the communication mechanism and message format between the controllers and the switches [3]. SDN controllers have a necessary demand on SDN switches that the control commands must be transmitted timely and executed efficiently, namely update performance guarantees.
The SDN switches use lookup tables with match-action rules to enforce the forwarding strategy, which is essentially a multi-field packet classification problem [2]. Commands issued by the SDN controllers usually become regular rule update operations (insert, delete, and modify) for the lookup tables (called rulesets). Recently, the research topic puts more emphasis on achieving both fast packet classification and fast rule updates. Ternary Content Addressable Memory (TCAM) has been widely adopted in the industrial field [4,5] for enabling line-speed classification. Despite that, TCAM has the significant disadvantages of capacity-limited, power-hungry, and invalid for range-match [5][6][7]. Field Programmable Gate Array (FPGA) has been recognized as a promising alternative in real-time network processing applications [8][9][10]. Due to its high performance with flexible reconfigurability, FPGA is increasingly being utilized for accelerating SDN software switches [11,12]. Decision-tree-based algorithms [13][14][15][16] and Tuple-Space-Search (TSS) algorithms [17][18][19][20] are two major approaches implemented on the CPU/GPU platform. However, the characteristics of these algorithms conflict with FPGA devices and cannot be effectively migrated to FPGA. Bit-Vector-based (BV-based) algorithms [21][22][23][24][25] exploit hardware parallelism of FPGA and easily achieve line-speed classification performance. In particular, the state-of-the-art research TPBV [24] and further optimized WeeBV [25,26] have implemented microsecond-level data structure updates on the hardware pipeline by using self-reconfiguration processing elements.
Unfortunately, the worst-case latency of completing one update operation by the above researches [24][25][26] is unbearable, which is crucial for SDN controllers. It has been confirmed that most flows are small in SDN [27], which generate plenty of new rules that needed to be updated. Excessive latency of the update operation raises two serious problems: (1) process pressure of the control plane caused by accumulative new packets [28]; (2) losing packets on the forwarding plane, which has a destructive impact on latency-sensitive applications [29].
In this paper, we present SplitBV to ensure as small latency as possible for high-performance SDN switches on FPGA. SplitBV selects several distinguishable exact-bits to split the ruleset into independent sub-rulesets without rule replication. These diminutive sub-rulesets can be implemented parallelly in BV-based pipelines, thus reducing update latency tremendously. Moreover, the update and lookup processes between different pipelines do not conflict with each other, which further reduces the damage of rule updates to classification performance. Our contributions in this work include: • A novel architecture for packet classification, SplitBV. SplitBV divides the ruleset into independent sub-rulesets and uses BV-based pipelines to match the sub-rulesets in parallel. • A recursive splitting algorithm based on greedy sorting, SplitAL. SplitAL first determines the searching steps of matching fields based on greedy strategy and then employs a constrained recursive algorithm to resolve the problem of exponential computational complexity. • A hybrid matching pipeline on FPGA, SplitHP. SplitHP contains a hash element for mapping different sub-rulesets into different two-dimensional pipeline consisted of processing elements (PEs, proposed in [24])).
We generate synthetic 10K 5-tuple rules and OpenFlow1.0 rules from well-known ClassBench [30] and the lastest ClassBench-ng [31] to evaluate our scheme. It is reasonable to choose rules around 10K since recent researches [32,33] have validated against dozens of thousands of rules. Experimental results show that SplitBV reduces the update latency by average 37% and 41% compared with TPBV on the two typical rulesets respectively.
The rest of the paper is organized as follows. Section 2 reviews the background and discusses problem of this paper. Section 3 surveys and analyses prior works. In Section 4, we detail our scheme. The hardware architecture is proposed in Section 5. We present experimental results in Section 6. Finally, the paper is concluded in Section 7.

Multi-Field Packet Classification
Packet classification can be defined as: Given a ruleset consisted of N rules and an input packet header consisted of F fields, find all the rules matching the packet header and export the highest priority rule. Each rule is an individual predefined entry used for classifying a packet stream, which is associated with a unique rule ID (RID), a priority and an action. A packet header matches a rule, which means that matches all matching fields of the rule.
Classical packet classification contains five matching fields: Source IP Address, Destination IP Addres, Source Port, Destination Port, and Protocol. Each matching field has a different match type. For OpenFlow, classical packet classification is transformed into multi-field packet classification. OpenFlow1.0 has 12 matching fields, while version 1.5.1 extends to 45 matching fields. Besides, the multi-field packet classification in OpenFlow has the following characteristics: (1) more rules, to support fine-grained management of SDN controllers; (2) faster rule update, caused by plenty of short streams in SDN. Packet classification for SDN switches should achieve both high-performance and low update latency.

Rule Updates on SDN Switch
SDN divides the function of the router into two parts: controller and switch. Figure 1 shows a regular router model that uses FPGA acceleration. A controller is typically implemented on the CPU platform to generate policies for new packets. The switch on FPGA usually consists of three modules: Parser, Packet Classification (Classific.), and Action. The Parser module takes head messages from the input packet as a key. Packet Classification is the core module, which uses the key to match the rule with the highest priority from the ruleset, and then transport results to the Action module for execution. A new packet generates a miss that will be uploaded to the controller, and then the controller dispatch the corresponding rule update into the Packet Classification module.
From the beginning of the first miss generation, continuously arriving packets will generate misses successively until the new rule is configured on the data plane. It is worth noting that the new rule entering the data plane will not take effect immediately. The controllers calculate a new rule at the microsecond level, while the data plane completes the update configuration at the millisecond level [34]. Excessive latency will accumulate plenty of misses for controller, resulting in a sharp decline in controller performance. Excessive latency will accumulate plenty of misses for the controller, resulting in a sharp decline of controller performance [28]. Meanwhile, massive miss-route packets on the data plane can seriously affect latency-sensitive applications [29].

Packet Classification Algorithms
Packet classifications recently can be divided into two main categories: RAM (Random Access Memory)-based algorithmic solutions and TCAM (Ternary Content Addressable Memory)-based solutions [3,35]. Each storage unit in TCAM can have three different types of states: {0, 1, and *}, and TCAM can search all rules in parallel in a single lookup cycle. TCAM suffers high cost and high power consumption, and hard to perform range matching [3]. Anat Bremler-Barr et al. proposed gray code [36] and layered interval code [37] to enable TCAM support range matching at the cost of entry explosion. Thus in practice, the number of range fields is severely limited.
Decision-tree-based approaches [13][14][15] analyze all fields in a ruleset to construct decision trees for packet classification. Tree depth and rule duplication in a decision tree affect the searching efficiency and memory requirement of one implementation. Also, the state-of-the-art CutSplit [16] can hardly make an excellent trade-off among storage, performance, and updating, Tuple space solutions (TSS) [17,18] leverage the fact that the number of distinct tuples is much less than that of rules in a ruleset. TSS can insert or delete rules from distinct tuples in amortized one memory access, resulting in high update performance. However, with the growing field number in a ruleset, both tuple number and tuple size increase. A longer processing latency could be expected. PartitionSort [19] reduces average lookup times by presenting a pre-computed priority for each tuple space, but there is no guarantee of its worst-case performance. The further TupleMerge [20] relaxes the rule's restrictions on entering the same tuple to reduce the number of tuples. However, too many tuples merging can lead to hash conflicts, which can seriously affect performance.

BV-Based Solutions Analyses
The BV-based approaches split matching fields (e.g., source IP address) into L s sub-fields (s (1 s L) denotes the length of a sub-field, and L denotes the length of total matching fields). A bit-vector (BV) of length N is used to represent the matching result of N rules on a sub-field; usually, N is the number of the entire ruleset. Each bit of BV corresponds to a rule, with 0 for success and 1 for failure. When performing AND operations on L s BVs of all sub-fields, the matching result of the ruleset for the input packet is obtained. FSBV [22] is a particular case where s is 1, which creates bit-level sub-field partitioning. StrideBV [23] increases s for enhancing performance. The state-of-the-art TPBV [24] divides a ruleset into N n sub-rulesets for improving the scalability on large rulesets. WeeBV [26] further reduces the memory consumption of TPBV but damages rule updates on the hardware.
To support large-scale ruleset and ensure performance, TPBV implements a two-dimensional pipeline consisted of processing elements (PEs) and priority encoders (PrEncs) on FPGAs [24]. Each PE is responsible for matching a sub-field and is self-reconfigurable. However, the increasing number of rulesets forces the pipeline array to expand, which severely affects update latency. Figure 2 shows the worst-case update latency of TPBV, where parameters s and n are the best scheme in [24]. The update latency of TPBV consists of two parts: (1) propagation time, the time from entering the pipeline to finding the updated location; (2) processing time, the time from the execution of the updated instruction at the updated location to the new rule effective. It can be seen from Figure 2 that propagation time has a significant impact on update latency. An matching example of TPBV is illustrated in Figure 3a. A bit-vector BV i K j is used to represent the matching result of K j for the corresponding matching sub-field j of the sub-ruleset i . The values of BV i K j can be calculated by algorithms in [23]. In this example, parameter s determines that each horizontal pipeline has two PEs ( L s = 4 2 =2), and parameter n determines that there are two horizontal pipelines ( N n = 6 3 =2). The propagation path of packet header is: , the update command have to wait 2 clock cycles for propagating.
The parameters s and n are generally related to the characteristics of the FPGA devices used. In general, each FPGA has its fixed s and n to achieve maximum performance. This means that the "number of layers" of a two-dimensional pipeline is related to the size of the ruleset, in other words, a larger ruleset causes a larger propagation time. We note that the exact-bits in the rules can be used to split the ruleset. As shown in Figure 3b, selecting the first bit in the rule as the split-bit, ruleset can be divided into two sub-rulesets {R 0 , R 1 , R 4 }, and {R 2 , R 3 , R 5 }. With the same s and n, the two sub-rulesets correspond to two independent pipelines, respectively. In particular, the input packet head will enter only one pipeline according to 1st bit. If the 1st bit of input packet is 1, the rules in the previous sub-ruleset {R 0 , R 1 , R 4 } must not be matched. Therefore, these two pipelines can handle updates and matches concurrently. In this case, the propagation time of the entire pipeline is reduced to half of its original value. Based on the above observation, we are motioned to reduce propagation time to reduce rule update latency.

Ideas and Architecture
As mentioned above, our proposed SplitBV divides the whole ruleset into different sub-ruleset by selecting several specific bit locations. Notice that there is no rule replication between these sub-rulesets. Rule replication brings two problems: more memory consumption and the challenge of rule updates. If a rule exists in more than one sub-ruleset, it will result in multiple rule updates. Rule replication occurs when there is a wildcard in the selected location. Given the selected bit positions, the candidate rules that are exact bits at these positions constitute a sub-rule set. Figure 4 shows the architecture of SplitBV. The splitting of ruleset can be divided into F steps (F is the number of matching fields). Each step corresponds to a field partition process because the rules have different characteristics in different matching fields. Ideally, the original ruleset can find p bit positions on one field, Splitting all rules into a HASH table of 2 p table items. In this case, only one step, sub − ruleset 1 needs to be executed, which is the original ruleset. Each item in a HASH table contains rules that are looked up using TPBV-pipeline [24]. However, it is often difficult to find such a bit position in a single field. Because the rules are not usually accurate in all fields, that is, in some fields is a full wildcard match. To avoid rule replication, SplitBV tries to partition the ruleset in different fields. It is a noteworthy fact that no rule can be a full wildcard match on all fields. Therefore, up to F sub-rulesets can be obtained from the original ruleset without rule replication. At the same time, for each sub-ruleset, at least one-bit position can be found to generate a HASH table. Each item corresponds to the TPBV pipeline responsible for matching N (i,j) rules.Moreover, if the number of remaining rules is less than the minimum size of the HASH table (2 × n), then we can directly use TPBV for matching without splitting. Step 1

Sub-RulesetF
Step i Step Step 1 Step i Step   (N (i,j) ) n − 1. This is because the TPBV pipelines corresponding to the different HASH tables can be matched in parallel. For a HASH table, the input packet will enter only one TPBV pipeline corresponding to a table item. Therefore, in order to minimize the latency, of SplitBV, especially propagation time, it is necessary to find a field selection order and a bit location scheme on each field to minimize Max (N (i,j) ).

Brute Force Strategy
We first define the following variables: the total number of rules N, the total length of all matching fields L, and the number of matching fields F. An alternative combination consists of two parts: the matching field selection order and the bit position selection on each field. Given an alternative combination, the original ruleset can be quickly split, and the propagation time can be easily calculated. In order to get the combination of minimal propagation time, the brute force strategy is to find out all possible alternative combinations and then calculate each alternative combination. The selection order of matching fields is a sequential arrangement problem, and there are a total of F! matching arrangement schemes. For matching fields with matching length of f i , p (1 p f i ) bits can be randomly selected. Formula 1 gives the total number of selected location combinations.
Given a matching field order permutation StepList, formula 2 gives the total number of selected position combinations on F matching fields. Therefore, the brute force strategy must calculate the values of F! * 2 L schemes, and then select the optimal solution through the sorting algorithm. This kind of computational complexity is obviously unbearable.

Split Algorithm
Notice that the two parts of an alternative combination are in sequence. Therefore, we first try to determine the near-optimal matching field selection order quickly and then propose a constrained recursive algorithm to find all the bit position combination schemes.

Distinguish by field
We use Dis to denote the distinguishability of a field, and formula 3 gives the calculation function of Dis. Pre f ixKind represents the number of different prefixes of the original rule set in the current field, and the prefix consists of three characters. The greater the distinguishability of a field, the more precise specific locations can be found in this field. Meanwhile, it also means that these exact bit positions can be used to build a HASH table, where each table item contains as small a set of rules as possible. We adopt the idea of greedy choice: calculate the distinguishable for each field, and then give StepList in descending order.

Select-Bits Combination
There is a noteworthy fact that wildcards in prefix matching appear continuously and must appear after exact bits. We adopt a simple strategy that we select consecutive p i (1 p i f i ) bits starting from the first bit of f ield i for the sub-rulesets. Even with the above strategy, the total number of bit location selection combination is still exponential, and formula 4 gives the upper limit (derived by inequality of arithmetic and geometric means).
In order to further reduce the number of alternatives and approach the optimal solution as much as possible, we propose a recursive algorithm SelectBits with constraints. Algorithm 1 shows the pseudo-code of the SelectBit algorithm. SelectBits recurses the matching fields in StepList sorted by Dis in descending order from beginning to end. In particular, line 8 of SelectBits gives constraints to reduce the number of recursions. CurRuleSet.num indicates the number of rules currently waiting to be partitioned. Each table item in, HASH table is expected to have at least n rules. For f ield i , the p i needs to satisfy: 2 p i × n CurRuleSet.num ⇔ p i log 2 CurRuleSet.num n . In our tests, we found that log 2 CurRuleSet.num n is usually much less than f i , which greatly reduces the number of recursions. For the 10K rules, the average computing percentage ( Search times Rule number ) of the 5-tuple rule is 1.18% (0.85% of ACL, 2.13% of FW, and 0.55% of IPC), and the average computing percentage of the OpenFlow1.0 (12-tuple) rules is 2.37%. Line 2 gives the conditions for recursive termination: (1) all fields have been searched stepped by StepList; (2) the number of remaining rules is less than the minimum size of the HASH table (2 × n, when p = 1), and then set p = 0 and return. In addition, considering that the greedy strategy of StepList may lead to a locally but bot globally optimal solution, line 18 allows skipping of f ield i that currently cannot be effectively partitioned. Skipping all fields is obviously the wrong solution, and line 3 excludes this situation. if CurStep > F or CurRuleSet.num < 2 × n then 3: if PList is all −1 then return

Algorithm 1 SelectBits
CurRuleSet.num n ) do 9: for rule in CurRuleSet do 10: if no wildcard in rule  Table 1 gives an example of a ruleset that contains 12 rules for two matching fields, with the cluster parameter n. First, we calculate Dis(Field 1 ) = 3 4 and Dis(Field 2 ) = 1 2 . The selection order of matching field is [Field 1 , Field 2 ]. First, the ruleset is divided on the Field 1 . According to line 8 of the SelectBits algorithm, the optional range of p 1 is 1 and 2. When p 1 = 1, the ruleset is divided into sub − ruleset 1 = {0 : {R 0 , R 3 , R 6 , R 8 }, 1 : {R 1 , R 2 , R 5 , R 10 , R 11 }} and sub − ruleset 2 = {R 4 , R 7 , R 9 }. Then the remaining sub − ruleset 2 is divided on Field 2 , and the search will be stopped since the number of remaining rules is less than 2 × n = 6. Therefore, the bit position selection scheme Cb 1 = [1, 0] for p 1 =1 (0 meaning that the remaining rules are not split in Field 2 after the Field 1 selection), and Max(N (i,j) ) n = 5 3 = 2. In the same way, the bit location selection scheme Cb 2 = [2, 0] for pendant 2 is divided into sub − ruleset 1 ={00 : {R 0 , R 3 }, 01 : {R 6 , R 8 }, 10 : {R 1 , R 5 , R 11 }, 11 : {R 2 , R 10 }, sub − ruleset 2 = {R 4 , R 7 , R 9 }, and Max(N (i,j) ) n = 3 3 = 1. It can be seen that Cb 2 has less propagation time than Cb 1 . However, 4 horizontal pipelines will be used for Cb 1 with the given parameter n, while 5 horizontal pipelines will be used for Cb 2 . It is necessary to make a trade-off between propagation time and increasing memory consumption caused by pipeline expansion.

Hybrid SplitHP
We propose a hybrid two-dimensional pipeline to implement SplitBV, named SplitHP on FPGA. The SplitHP consists of two stages, the first stage is composed of the splitting elements (SEs), and the second stage is composed of several horizontal two-dimensional pipelines composed of PEs and PrEncs. Figure 5 shows an example of the hardware framework of SplitHP for Figure 3b. In the first stage, each SE-p i corresponds to a sub-ruleset, obtained by splitting on f ield i with p i bits. SE-p i associates different rules corresponding to 2 p i items of the HASH table to different horizontal pipelines subsequently. In the second stage, the rules in each table item are matched using a two-dimensional pipeline implemented by TPBV. Note that, partitioning the ruleset changes the original order of rules, but does not affect the correctness of the final match results. PrEnc [23] reports a local highest priority match, which does not limit the order in which rules are arranged. The final match result is collected by the vertical pipeline of the priority encoders. The input packet header will be sent to all SE-p i s in the first stage in parallel, while the entire ruleset is matched to obtain the correct result. The two-dimensional pipelines corresponding to a SE-p i are isolated and independent, that is, the packet header will only enter one of the two-dimensional pipelines and only one PrEnc will output a valid matching result. We use a multiplexer (MUX) to integrate the matching results of the SE-p i and send it to PrEnc in the bottom horizontal pipeline. As mentioned in Section 3.1, the worst-case propagation time of SplitHP depends on Max (N (i,j) ). In Figure 5, PrEnc [4] executes one clock cycle ahead of PrEnc[1]. We adopt a simple strategy that the PrEnc in the bottom horizontal pipeline waits until all the matching results are received.  The detailed design of the main elements in SplitHP is described below and shown in Figure 6.

Splitting Element (SE)
There are two types of SE: SE-0 and SE-p i . SE-0 is usually used only in the bottom horizontal pipeline and is responsible for matching the remaining sub-ruleset (the number of rules is less than 2 × n). SE-p i is used to find the corresponding item of the HASH table for the packet header. The SE-p i contains the following components: (1) Controller, is responsible for extracting split-bits from packet header; (2) DMUX, one or multi-level demultiplexer for outputting packet header to the corresponding pipeline. Figure 6b,c show the SE with p i equals 1 and 2, respectively. When p i is greater than 1, multi-level DMUX is used to satisfy 2 p i table items. Research has shown that multi-level small DMUX can provide higher performance than a large DMUX [38].

Process Element (PE)
Considering the simplicity of implementation, we reused the PE in [26]. In particular, for the two-dimensional pipeline corresponding to each item of the SE, the packet header is transmitted in the horizontal direction in sequence only in the uppermost horizontal pipeline. Then, the packet header is propagated to the next horizontal pipeline at the same time every other clock cycle.

Update Strategy
Rule update can be seen as three operations: modification, insertion, and deletion. The controller component in PE can automatically configure the writable memory according to update commands. In addition, valid-bit is used to support fast deletion [24]. We follow the strategy of reuse, where inserts are processed at the location of invalid rules.
Delete and Modify. To delete a rule is to set the valid-bit of the rule to 0. The BV that PE fetches from memory will be ANDed with valid-bits, and invalid rules will not be matched. A modification can be divided into a deletion and an insertion at the same location.
Insert. For inserts, the controller component of PE can convert the new rules into the entire BV table and rewrite the memory component. The main difficulty is whether an invalid rule can be found in the corresponding table item. Plenty of short streams in SDN also means frequent invalidation of rules, which generates increasing deletion operations. Therefore, finding an invalid rule is easy. In the worst case, SE may need to be refactored, which can be quickly resolved by using FPGA technology DPR(Dynamic Partial Reconfiguration) [39,40].

Experimental Setup
Synthetic classifiers: To test the performance of our scheme and prior art, we generate the 5-tuple rules by the well-known ClassBench [30] and the OpenFlow1.0(OF) rules (with 12 fields) by the inherited ClassBench-ng [31]. In our experiments, the 5-tuple rules we have used contain Accesses Control List (ACL), Firewall (FW) and IP Chain (IPC). The seed files from real-life 5-tuple rules and a data-center can make the performance as close to practice as possible.
Implementation platform: We implement the SplitHP with an Intel Arria II GX EP2AGX65DF25C4 FPGA, which consists of 4.35Mb Block RAM, 50,600 Combinational ALUTs, and 51336 logic registers. The splitting algorithm is implemented by software and runs on a machine with Intel Xeon E5-3650 CPU and Ubuntu 16.04 LTS operation system.
Parameter selection: There are two key parameters s and n in TPBV, which will affect the update delay of the hardware pipeline and the memory footprint of the BV vector table. In order to make a fair comparison with TPBV, we choose the best parameter given in [24], that is, swatches 4 and 8.

Split-Bits Trade-Off
As mentioned in Section 4.3, SplitBV will lead to the expansion of the hardware pipeline when choosing the location of dividing bits. This is because the selection bit length p is increased so that the propagation time reduces the number of items in the HASH table and increases at the same time. Because each table item corresponds to at least one horizontal pipeline, the total number of pipelines implemented by SplitBV hardware increases. We used 12 different seed files (5 kinds of ACL,5, 5 kinds of FW, 2 kinds of IPC, 2 kinds of OF) to generate 10 K rules for evaluation.

Pipeline Expend
Pipeline Expend (c) IPC1-seed 10 K.  Due to limited space, Figure 7 shows the tradeoff of running split-algorithm only on rule sets generated by four different types of representative seed files. In terms of a clock cycle, we give the expansion multiple of the number of horizontal pipelines of the combined scheme represented by propagation time, pipeline expend compared with the number of horizontal pipelines of the original TPBV. We arrange all the alternatives in the ascending order of the propagation clock cycle.
As can be seen from Figure 7, both metrics are generally in line with an exponential rise (or decline). Because the SelectBits algorithm allows you to skip a field, where two propagation clock cycles are close to each other, there may also be a big difference in pipeline expend. The best solution is near the intersection of the trend curves of the two metrics.

Update Latency
We measured the worst-case update latency for TPBV and SplitBV, respectively. The update delay in the worst case is an important index to evaluate the performance and reliability of the SDN switch. We assume that the clock rate of the FPGA implementation is the usual 100 MHz, then the clock cycle is 10ns. Figure 8 shows the update delay comparison between SplitBV and TPBV. SplitBV reduces update latency by an average of 73%, 74%, 65% and 36% for ACL, FW, IPC, and OpenFlow1.0, respectively. In particular, it is an accepted fact that SDN can achieve 42,000 rules updates per second [34]. This means that the update delay must be controlled at a subtle level of 20. The results in Figure 8 show that SplitBV meets the update latency requirements under SDN for various rulesets of the order of magnitude of 10 K.

Memory Consumption
We also measured the memory footprint of SplitBV and TPBV. The limited Block Memory on FPGA severely limits the scalability of BV-based approaches. A large number of BV tables must be stored to support fast parallel matching on the hardware pipeline. Considering that our implementation of PE is not consistent with the details in the original text, we only compare the memory footprint of the BV table. Figure 8 also shows the comparison of SplitBV's best scheme with the original TPBV. The data on the red column is a multiple of memory expansion. SplitBV increases memory consumption by an average of 1.00137, 1.00149, 1.00266 and 1.00385 times for ACL, FW, IPC, and OpenFlow1.0, respectively. That is because SplitBV requires more pipelines than the original TPBV. In other words, the cost of reducing update latency is the expansion of memory consumption. Fortunately, the increase of memory is acceptable.

Resource and Throughput
Limited by hardware resources, we tested an additional simple prototype system on FPGA for 36 rules. The size of the Block Memory of the FPGA we use is m9K, which is equal to 9 × 1024 = 9 × 2 9 bits. Therefore, we set parameter s = 9 and n = 36 so that each PE has two memory blocks. In this way, TPBV only needs one horizontal pipeline, and SplitBV has a maximum of 2 SEs. Table 2 shows the resource consumption and performance of TPBV and SplitBV, and there are three schemes for SplitBV to be tested. Scheme Cb1, Cb2 and Cb3 represent the first SE with 1, 2 and 3 levels of DMUXs, respectively. For level i, SE − i connects 2 i horizontal pipelines. Considering that a small number of rules can lead to wasted memory, we are more concerned with logical resource overhead. We use Clock Rate to reflect throughput. SplitBV increases the number of horizontal pipelines, so logical resources increase accordingly. Table 2 shows that the increase in the number of DMUX series in SE results in linear growth of logical resources without significantly compromising performance.

Conclusions
In this paper, we present a scheme named SplitBV to support diminutive update latency of SDN switch while ensuring high throughput processing performance. SplitBV splits the ruleset into independent sub-rulesets without rule replication by several distinguishable exact-bits. SplitBV then utilizes BV-based pipelines to parallelly match these sub-rulesets. Besides, SplitBV utilizes a constrained recursive algorithm, which is used to determine the near-optimal bit-position combination schemes. A two-stage hybrid pipeline called SplitHP is proposed to implement SplitBV on FPGA. SplitHP uses the DMUX to send packet header to different pipelines and uses MUX to integrate the final matching results. Experimental results show that SplitBV has an excellent update latency reduction on both 5-tuple and OpenFlow rulesets. Moreover, SplitBV brings negligible memory growth but maintain comparable high performance. As our future work, we will improve SplitBV from the following aspects: (1) test MUX and DMUX model for large-scale ruleset; (2) improve splitting element with DPR technology.