Estimating the Memory Consumption of a Hardware IP Defragmentation Block

IP fragmentation is still prevalent on the Internet. Defragmented traffic is a prerequisite for many network processing algorithms. This work focuses on the size and organization of a flow table, which is an essential ingredient of the hardware IP defragmentation block. Previous research suggests that fragmented IP traffic is highly local, and a relatively small flow table (on the order of a thousand entries) can process most of the traffic. Samples of IP traffic were obtained from public data sources and used for a statistical analysis, revealing the key factors in achieving design goals. The findings were backed by an extensive design space exploration of the software defragmentation model, which resulted in the efficiency estimates. To provide a robust score of the simulation model, a new validation technique is employed that helps to overcome the limitations of the samples.


Introduction
According to surveys [1] (and backed by this research), IP fragmentation constitutes a tiny fraction of Internet traffic (below 1%). However, its impact on the key Internet technologies cannot be measured by volume alone. The development of network architectures has been increasingly focusing on security and privacy on the one hand and flexibility of configuration on the other. It turns out that both needs are satisfied by a similar technological solution, namely network tunnels. Security and privacy are currently provided by end-to-end encryption protocols such as Transport Layer Security (TLS) [2] and IPSec [3], while the other end of the spectrum is represented by unencrypted IP tunnels used in Software-Defined Networking (SDN), such as MPLS/GRE [4] or VxLAN [5]. Each of the mentioned protocols has its technological niche. For instance, TLS has become the de facto standard for web browsing security and other application-layer traffic in data centers [6], and IPSec is deployed, among others, as a security measure for User Equipment in non-trusted 3GPP networks [7].
All the deployments described above must deal with potential IP traffic fragmentation. This is a prevalent challenge for both IPv4 and IPv6 networks, despite numerous attempts to circumvent it [8]. In addition, the centralized nature of tunnels and the sheer volume of traffic in question are causing an increasing adoption of hardware-based TLS [9], IPSec [10], and SDN [10,11] protocol stacks. Protocol acceleration IP blocks (in hardware parlance, an IP block is a hardware functionality enclosed as an independent, re-usable module) are implemented either in silicon or using Field Programmable Gate Array (FPGA) technology. A silicon-based solution can be a part of a System On Chip (SOC) [12] or take the form of a Smart Network Interface Card (NIC). FPGA-based IP blocks are implemented in reconfigurable Smart NICs such as Intel PAC N3000 [9] or Xilinx Alveo [13].
Given the task of designing a protocol acceleration block in FPGA or silicon technology, an engineer faces the challenge of predicting the utilization of hardware resources required for flawless operation in a particular deployment scenario. The resources in question mostly consist of memory blocks that could be static or dynamic RAM. Memory consumption is the main factor determining the cost of the design, up to the point of rendering it completely infeasible when the predicted RAM consumption exceeds platform capacity. It is worth noting that contemporary high-end FPGA silicon typically has on the order of hundreds of megabits of internal Static Random Access Memory (SRAM), a magnitude that is not so impressive given that those cards are designed for interfacing with 100 Gb/s Ethernet links. One of the ways to escape the limitations of static RAM arrays is to use external dynamic memory banks (DRAM). Some of the cards have separate external DRAM dies (DDR4) or internal memory slices (HBM2) [13], but relying on a memory controller has certain consequences. First and foremost, it mandates a completely different hardware logic as opposed to the fast, synchronous SRAM. Secondly, it greatly increases the overall hardware cost and energy consumption, especially for SOC platforms.
Taking all these factors into consideration, any attempt to design a hardware protocol stack should be preceded by a careful estimation of memory consumption for a given design. Such estimation should help to make a business decision (influencing cost) as well as a design choice for the target platform.

Significance of the Research
The evolution of fiber networks leads to a rapid increase in network data throughput. Leading examples of this trend are represented by an adoption of 100 Gb and 400 Gb Ethernet standards. Meanwhile, the general-purpose CPU computing power lags behind. In effect, the network processing tasks are being shifted from software to hardware. The hardware takes the form of a Smart NIC, such as Intel X700 [11] or Mellanox Connectx6 [10], or becomes a SOC with integrated network functions, such as Marvell Octeon family [12].
This high-end networking hardware is used predominantly in:
• 5G infrastructure, e.g., packet data gateways;
• Security appliances, such as a firewall or an Intrusion Detection System (IDS).
In both cases IP defragmentation plays a significant role in the performance of a system. The effectiveness of IDS, for instance, depends on the ability to reconstruct the original data stream from fragmented packets. That is, the core functionality relies on Deep Packet Inspection (DPI). Some of the popular evasion techniques (means to bypass security measures) rely on a specially crafted fragmented packet stream.
In spite of those challenges, there is a pressing need to equip the networking hardware with the IP defragmentation function. As always, the industry strives to optimize the use of resources such as SRAM, which is relatively expensive (with respect to silicon area) and energy-consuming.
This work attempts to address the resource optimization problem in a systematic manner, that is, by performing a design space search and interpreting the results using modern statistical techniques.
In summary, the significance of this work stems from the significance of network accelerator hardware, which becomes ubiquitous both in the cloud operators' data centers as well as in the telecom infrastructure. Moreover, this research originates from the industry itself, as one of the authors has been working for the networking silicon company. In this case it is not a purely theoretical endeavor but an effect of cooperation between academia and industry.

Problem Statement
This article addresses the design space exploration problem to specify the memory consumption of a typical (as hypothesized in this article) IP reassembly block, using contemporary data science methods. These methods consist of unveiling a relevant statistical image of the network data, constructing the simulation model equipped with tuning parameters and performing Monte Carlo estimation of key model performance indicators.
The main hypothesis is that the traffic locality of IP fragments is very high, which translates to the proximity of the packets from the same IP flow. Therefore, the memory requirements of the IP reassembly module should not be measured by the total number of IP connections but by the local traffic behavior. As similar prior experiments (Section 2.3) demonstrate, a flow table of modest size in the order of 1000 entries or even less should be able to process most of the traffic.

Contribution of the Article
The original contribution of the article includes:
• carrying out an extensive design space exploration that yields robust confidence intervals for the obtained performance metrics and confirms the original hypothesis about the performance of an IP reassembly block;
• developing an original Monte Carlo validation method for estimating the performance of network protocol accelerators and, in particular, of an IP reassembly block.
As noted in Section 2.5, none of the previous works on flow processing focused exclusively on IP fragments. Although this problem seems to be similar to tracking connections, it has its unique characteristics and challenges. The similarities and differences between connection (TCP, UDP) tracking and IP defragmentation are summarized in Table 1. The TCP reassembly function is presented for reference despite belonging to a different problem domain. IP defragmentation has more similarities to TCP reassembly than to connection tracking, since both require buffering and maintenance of the state. Tracking the state of L4 connections is optional, and implementations based on the concept of a cache (e.g., Yamaki [14], Tanaka [15] in Table 2) often do not keep the connection state. The bandwidth criterion in Table 1 means the availability of the relevant traffic in network samples. This presents a unique challenge for this work, as IP fragments are considered an exception in the general Internet traffic. This situation, however, may change during a DDoS attack, when the bandwidth of fragmented IP traffic might be significant. In summary, IP defragmentation is challenging due to the low bandwidth ratio in the publicly available traffic samples and due to the need to maintain a state. Furthermore, there is no concept of long-lived connections in the realm of IP fragments, while TCP flows may last for days. These differences present a sufficient rationale for focusing on IP defragmentation alone in a novel manner.

Paper Organization
The paper is organized as follows. Section 2 describes the state of the art. Next, a software simulation model is derived in Section 3.1. An original methodology of computing a robust score and confidence interval is laid out in Section 3.2. Results of the simulations are included in Section 4. Finally, Section 5 draws conclusions from the experimental data.
IP fragmentation was originally specified in RFC 791 (Postel [17], Internet Protocol). Later on, the usage of the IP ID field was altered in RFC 6864 (Touch [18], Updated Specification of the IPv4 ID Field). The IPv6 protocol, as described in RFC 8200 (Deering and Hinden [19], Internet Protocol, Version 6 (IPv6) Specification), contains the Fragment Extension, which plays the same role as the Flags, Offset, and ID fields in the IPv4 header. Hardware IP reassembly acceleration is a feature of several high-end networking SOC platforms, such as QorIQ from NXP [20]. The problem is also addressed in a patent by Lin and Manral [21]. This function is similar in principle to the TCP acceleration present in some network accelerator cards, e.g., Mellanox Connectx6 [10].

Related Work
The concept of stateful Layer 3-4 processing in hardware is also present in scientific literature. Zhao [22] developed a TCP state tracking engine. Ruiz [23] created an open-source TCP/IP stack for the FPGA platform.

IP Defragmentation Algorithm
This section presents a brief overview of the IP defragmentation algorithm. The algorithm is generalized to work for both IPv4 and IPv6. The difference is in the packet structure. Figure 1 presents the difference between IPv4 and IPv6 fragments. The payload is exemplified by the TCP packet (it may be any payload, however). In the case of IPv4, the fragmentation status is embedded in the IPv4 header. In the case of IPv6, there is an extra header (Fragment Header) that resides between the main IPv6 header and the payload.
In IPv4, all relevant fields (contained in the IPv4 header) are:
• id: a 16-bit field identifying the fragment series;
• total length: the length of the entire IP datagram;
• offset: a 13-bit data offset, expressed in units of 8 bytes, from the beginning of the original (not fragmented) IP packet;
• MF: the "more fragments" flag indicating that this is not the last fragment;
• DF: the "do not fragment" flag preventing fragmentation by gateways and routers;
• proto: the type of the payload, e.g., TCP or UDP.
In the case of IPv6, some fields are moved to the fragment header (retaining similar semantics to IPv4). Overall, the relevant fields are:
• id: a 32-bit field identifying the fragment series;
• payload length: the length of the payload (including extension headers);
• offset: a 13-bit data offset, expressed in units of 8 bytes, from the beginning of the original (not fragmented) IP packet;
• M: the "more fragments" flag indicating that this is not the last fragment;
• next: the type of the next header, e.g., TCP, UDP, or an extension header.
The IPv6 protocol prevents fragmentation by the intermediate nodes, so a packet may be treated as having the DF = 1 flag. The source and destination address fields are left out from the picture as they require no explanation in this context. The fragment series is identified by the tuple:
$(IP_{src}, IP_{dst}, id, proto) \quad (1)$
A fragment flow or a fragment series is thus uniquely identified by tuple (1), provided that the proto field is left "blank" for IPv6. Note that the bit widths of the fields differ between the two protocols, so an implementation may choose to zero-extend the IPv4 fields to retain generality.
Figure 2 depicts the transformation of a "whole" IPv4 packet into a fragment series. When a transmitting device detects that the IP datagram does not fit into the Maximum Transmission Unit (MTU), the packet payload is split into several parts and each part gets its own IP header. If the original payload is 2000 bytes, then the size of the IPv4 packet is at least 2020 bytes due to the packet header overhead (the 20-byte length assumes no IP options in the header). Subsequently, the packets of the two-part fragment series have 1520 and 520 bytes, respectively. In the case of IPv6, the overhead consists of the length of all the extension headers between the IPv6 packet header and the data beyond the extension header (there may be many extension headers). The last fragment in the series (containing the final bytes of the payload) has the MF = 0 flag (or M = 0 for IPv6), thus indicating the final size of the packet. Since IP delivery is not reliable, the packets may be reordered, dropped, or duplicated, and the receiving node must attempt to reconstruct the original payload in all those cases. As a consequence, each fragment series must be buffered in the receiving device until a complete reconstruction is made. This process, if left uncompleted, is eventually interrupted by a timeout.
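The identification of a fragment series can be illustrated with a short Python sketch. It assumes that the key consists of the source address, the destination address, the id, and the proto field, that IPv4 values are zero-extended into the wider IPv6-sized slots, and that proto is set to 0 ("blank") for IPv6. The function name `fragment_flow_key` is hypothetical, and IPv4 and IPv6 keys share one key space here, which a real design might separate with a version bit.

```python
import ipaddress


def fragment_flow_key(src: str, dst: str, frag_id: int, proto: int = 0):
    """Build a lookup key for a fragment series (hypothetical helper).

    IPv4 addresses become integers that fit into a 128-bit slot and the
    16-bit IPv4 id fits into the 32-bit id slot (zero extension).
    For IPv6 the proto field is left "blank" (0).
    """
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if src_ip.version != dst_ip.version:
        raise ValueError("source and destination must use the same IP version")
    id_bits = 16 if src_ip.version == 4 else 32
    if not 0 <= frag_id < (1 << id_bits):
        raise ValueError("fragment id out of range")
    if src_ip.version == 6:
        proto = 0  # assumption: proto does not take part in the IPv6 key
    return (int(src_ip), int(dst_ip), frag_id, proto)
```

Two fragments belong to the same series exactly when this function returns equal tuples for them.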
The process of IP reassembly is serviced by the flow table. A flow table is a lookup structure that keeps a state and payload for each active fragment flow. The data kept in the flow table should represent a sorted list of received fragments. When the list is complete, the original IP packet is reconstructed, and the entry is removed from the table. Figure 3 contains such a list (the length field is a payload-only length without the header overhead). The fragment series with id = 0x100A has a known length, since the last part was received with MF = 0. The total length is calculated as the offset of the last fragment (1800) plus its length. Since the two previous packets contain only 1200 bytes, there are still 600 bytes lacking. The fragment series with id = 0x100B has an unknown length, since there was no packet with MF = 0.
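The bookkeeping described above can be sketched in Python as follows. This is a minimal illustration and not the simulator used in this work: the class name `FragmentList` and its methods are hypothetical, offsets and lengths are in bytes, and overlapping fragments are simply counted once.

```python
import bisect


class FragmentList:
    """Sorted list of received fragments for one fragment series (sketch)."""

    def __init__(self):
        self.fragments = []       # sorted list of (offset, length) in bytes
        self.total_length = None  # unknown until a fragment with MF = 0 arrives

    def add(self, offset, length, more_fragments):
        entry = (offset, length)
        pos = bisect.bisect_left(self.fragments, entry)
        # Exact duplicates are ignored.
        if pos < len(self.fragments) and self.fragments[pos] == entry:
            return
        self.fragments.insert(pos, entry)
        if not more_fragments:
            # The last fragment fixes the total payload length.
            self.total_length = offset + length

    def missing_bytes(self):
        """Bytes still lacking, or None while the total length is unknown."""
        if self.total_length is None:
            return None
        covered, end = 0, 0
        for offset, length in self.fragments:
            start = max(offset, end)  # skip bytes already covered
            stop = offset + length
            if stop > start:
                covered += stop - start
                end = stop
        return self.total_length - covered

    def is_complete(self):
        return self.missing_bytes() == 0
```

With two 600-byte fragments at offsets 0 and 600 and a last fragment at offset 1800 (its length of 200 bytes is an assumed value), `missing_bytes()` reports 600, which matches the situation described for the series with a known length.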
The IP defragmentation process follows the "best-effort" approach. A failed attempt to defragment an IP packet (e.g., due to lack of resources) is not considered fatal but should be avoided. In the Internet network architecture, it is the responsibility of a higher layer (TCP, DNS, etc.) to guarantee reliable delivery.

Network Traffic Statistics
Stateful traffic processing algorithms can vary in performance and resource consumption based on the statistical distribution of connections (flows) in the traffic. Real-world flow distribution in traffic has been characterized as highly skewed, long-tailed, and non-stationary by Adamic and Huberman [24]. There are noticeable small-scale phenomena, such as burstiness, as well as a dependence on the time of day in a particular time zone. Both effects are well described in Ribeiro [25] and Benson [26]. Aggregated, high-bandwidth links were also extensively researched in Arfeen [27], where both small-scale and large-scale correlations can be detected. Fragmented IP traffic is a well-understood but somewhat marginal phenomenon in research due to its small overall share in Internet traffic bandwidth. The extensive analysis in Shannon [1] claims that it contributes less than 1% of total traffic.

Flow State Memory and Caching
Stateful flow processing can be employed in various algorithms. One of the popular techniques is flow caching. Caching of unicast routing decisions was demonstrated to be highly effective in Feldmeier [28]. It can speed up various networking algorithms such as label switching described in Kim [29] or OpenFlow rule processing in Congdon [16]. The design of OpenVSwitch, an open-source SDN forwarding plane described in Pfaff [30], is centered around the concept of caching. This technique is known to work well both in software and hardware. An example of a cache-based hardware architecture can be found in Okuno [31]. The advent of high-bandwidth fiber networks revived the interest in the energy efficiency of TCAM memory, which led Tanaka [15] to propose a flow cache as a means of reducing the TCAM hit rate. The general evaluation of flow cache memory performance is presented in Czekaj and Jamro [32], and subsequently, in Yamaki [14]. A recent example of using an FPGA-based Smart NIC for flow-aware network processing can be found in Li [33].

Creating Synthetic Traffic in Networking
The need for creating synthetic traffic workloads has been well recognized in the industry. This is the basis of commercial network equipment testers such as IXIA [34] or Spirent [35]. A synthetic workload may be built out of real-world traffic samples containing single connections. Erlacher and Dressler [36] proposed using a mix of ordinary and attack payloads using the TRex traffic generator [37]. Gadelrab [38] used a similar technique for security research. Cerqueira [39] evaluated time series forecasting methods, showing that blocked cross-validation provides the best estimate of predicted performance for stationary time series and the second best one for a non-stationary case. This is especially relevant for networking as real-world traffic is proven to be non-stationary, as discussed in Arfeen [27] and Adamic and Huberman [24].
Table 2 presents a comparison of related works that focus on a flow table design. Most of the presented publications (except Tanaka [15]) concern a flow table of a size of 1 K (one thousand entries) or less. The associativity is also similar (e.g., 2, 4, or 8), except for Congdon [16], which used a small, fully associative Content-Addressable Memory (CAM). Two notable differences contribute to the novelty of this work. The first is the testing method, which uses network samples in a novel way (see Section 2.4). The second is the protocol concerned, which is not a Layer 4 transport.

Materials and Methods
3.1. Software Simulation of the IP Reassembly Module
3.1.1. Flow Table Design
The main subject of this work is to assess the required size (memory consumption) of an IP reassembly flow table that keeps the fragmented payload. Since the primary goal is a hardware design space exploration, the flow table is constructed in a hardware-friendly manner. The results do not lose generality, as the same scheme (or an even more sophisticated one) can be efficiently implemented in software. The design attempts to maximize performance while keeping the memory organization simple and deterministic. This is especially important for high-speed fiber networks where the 100 Gb/s throughput (and beyond) mandates a pipelined architecture [15]. The conceptual organization of the flow table is presented in Figure 4. The main design factors are:
The flow table design resembles the CPU data cache organization, hence the term "associativity" or "ways" is preferred over the "bucket size" (used for hash tables). The address of the memory entry is generated by the hashing procedure, which is not present in data caches (which use a physical or virtual address as a key). The LRU policy works within a "set" or a bucket. For example, for a four-way memory, the replacement policy can overwrite the oldest out of the four entries.
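The organization described above (a hashed set index, a fixed number of ways per set, and LRU replacement within a set) can be sketched in Python. This is a hedged illustration rather than the simulator used in this work: the class name `FlowTable`, its method names, and the use of CRC32 as a stand-in for the hardware hash function are all assumptions.

```python
import zlib


class FlowTable:
    """Set-associative flow table with per-set LRU replacement (sketch)."""

    def __init__(self, ways, sets):
        self.ways = ways
        self.sets = sets
        # Each set holds at most `ways` entries: key -> (timestamp, state).
        self.table = [dict() for _ in range(sets)]
        self.clock = 0  # logical timestamp driving the LRU policy

    def _set_index(self, key):
        # CRC32 stands in for the hardware hash function (assumption).
        return zlib.crc32(repr(key).encode()) % self.sets

    def access(self, key, state=None):
        """Look up or insert `key`; return the evicted key, if any."""
        self.clock += 1
        bucket = self.table[self._set_index(key)]
        evicted = None
        if key not in bucket and len(bucket) >= self.ways:
            # Overwrite the entry with the oldest timestamp in this set only.
            evicted = min(bucket, key=lambda k: bucket[k][0])
            del bucket[evicted]
        old_state = bucket[key][1] if key in bucket else None
        bucket[key] = (self.clock, state if state is not None else old_state)
        return evicted
```

An eviction of a fragment flow that has not yet been completed corresponds to a failed defragmentation, which is what a small or weakly associative table trades for lower memory consumption.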
The main design challenge for such an algorithmic block is to choose a memory organization that is capable of holding all concurrent fragment flows without sacrificing too much silicon space. The rest of the article lays out the methodology and results of an experiment that systematically explores the solution space.
The IP reassembly algorithm has been implemented in software to perform a simulation of the proposed hardware block. Every attempt at implementing such a block in hardware should be preceded by algorithmic analysis.
An additional benefit of the simulation is that the results are applicable for software implementation as well. It can be safely assumed that a software variant can be much more sophisticated when it comes to control logic and thus outperform hardware on the algorithmic level (e.g., using cuckoo hashing or dynamic allocation). Thus, any benefit of this simulation can be applied to software with equal or even greater success.
Design space exploration parameters:
5. LRU replacement policy based on a timestamp.
6. IP fragments are properly defragmented even when they arrive out of order or duplicated.
8. No data flow modeling.
9. No cycle-level modeling.
The size of the flow table A × B (point 1) was chosen to be relatively small, which is typical for the network accelerators based on SRAM memory (e.g., Intel X700 [11]). Other hardware-oriented publications (e.g., Yamaki [14]) also consider the size in the order of a thousand entries.
The associativity of the table, A, defined in point 2, is an important parameter for the hardware design, as the complexity of the implementation rises significantly beyond four ways. In particular, the idealized LRU policy is often replaced by pseudo-LRU, which is a design trade-off [40]. Furthermore, as discussed in Section 4, the benefits of increased associativity tend to diminish beyond 8.
Two design parameters, namely packet buffer space and fragment memory size (points 3 and 7), are considered "infinite". This modeling practice is motivated by the need to limit the number of design parameters and focus only on the essential ones. The size of the buffer space strongly depends on the size of the flow table and can be estimated as $A \times B \times S_{flow}$, where $S_{flow}$ is the average size of a defragmented (original) IP packet. Thus, the buffer space can be considered a dependent variable.
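As a worked illustration of the buffer space estimate, assume hypothetical values of A = 4 ways, B = 256 sets (interpreting B as the number of sets, which is an assumption), and an average defragmented packet size of 1500 bytes. None of these numbers are taken from the experiments.

```latex
\[
  A \times B \times S_{flow}
  = 4 \times 256 \times 1500~\text{B}
  = 1\,536\,000~\text{B}
  \approx 12.3~\text{Mb}
\]
```

Under these assumed values, the buffer space is roughly 12 megabits for a table of 1024 entries.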
The fragment list size is an independent variable, but the proper value can be determined from the analysis of packet traces. This is done in Section 4.3.4. Since that value can be determined empirically, there is little incentive to assume it a priori.
The lack of data flow and cycle-level modeling stems from the standard hardware design practice. The key design parameters should be determined before the more detailed cycle-accurate model is implemented.
In summary, the main point of the simulation is to characterize the behavior of the flow table while ignoring or simplifying other aspects of the design.

Simulation System
The simulation system consists of several applications designed to work as a data pipeline. The schematic of the architecture can be found in Figure 5. The test design is generated and stored in a workflow definition (described in Section 3.2.2). Next, a test is constructed on the fly using a mixer application. A simulator accepts an input stream constructed by the mixer. The output of the simulation contains tables of network flows and cache events, which are stored for offline processing by statistical software. The essential parts of the process are described in Section 3.2.

Rationale
The methodology described in this section is originally designed to increase the robustness of simulation results and statistical estimates. Typical tests of network equipment are conducted using either simulated traffic or real-life packet traces. As noted in Section 2.4, there are hybrid approaches that use captured packet traces to construct specific tests.
The main drawback of the original data captured in the real network is that the fragmented packets are rare. Using some form of sample generation helps to overcome the limitation of insufficient bandwidth. Synthetic data creation is popular among network security researchers (e.g., Gadelrab [38]). The machine learning (ML) community has also been using synthetic test creation to a great extent. This is a cornerstone of making ML systems more robust and preventing overfitting to the data. Randomized test creation ensures that ML systems and statistical inference tools capture the most important structure of the data while ignoring the accidental parts. Thus, cross-validation, split-validation, and derivatives are mandatory tools for both training and estimation [41,42].
The key observation is that a limited-size flow table with a replacement policy can be treated as an inference system or as a predictor. The overall simulation is a "model" that can be evaluated in the same manner as ML models. The only difference is the fact that the default LRU policy is fixed, so there is no training involved. Thus, the cross-validation technique can be used as a basis for constructing the validation test for a network processing model.

Test Generation
Deficiencies of traffic samples may stem from several factors:
2. Time-dependent traffic characteristics (network traffic from a single time zone follows daily and weekly patterns [26]);
3. Insufficient bandwidth in the original sampled traffic;
4. Operator-specific traffic profile.
The technique proposed below addresses all the issues above using a scheme similar to K-fold cross-validation from the ML field. Since the original method is sub-optimal for non-stationary time series [39], it has been adapted by mixing packet traces as opposed to creating a sequential vector. Algorithm 1 is the main test generation algorithm. Algorithm 2 is an address anonymization algorithm invoked by the main algorithm. Algorithm 3 is a traffic mixing algorithm invoked by the main algorithm.
The prerequisites for the algorithm are as follows:
• $S = \{T_i\}$, $|S| = K$: the set of all packet traces, each trace spanning an equal time;
• $T_i = \{p_j : time(p_j) \ge 0,\ time(p_0) = 0\}$: a packet trace, i.e., a sequence of packets $p_j$, each with the associated timestamp $time(p_j)$, such that the first timestamp is 0 (the trace records only the relative time);
• M: the desired total number of tests;
• N: the number of packet traces forming a single test, N < K;
• $IP^0_{src}$, $IP^0_{dst}$: predefined IPv4 address constants.
The final remark about the algorithm is that the use of sampling makes the task of creating a test set somewhat independent of the available number of traces as well as of the individual trace size (except for the constraint N < K, which is due to sampling without replacement). Figure 6 depicts the process by showing that the K = |S| packet traces yield M tests, each consisting of N traces.
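The sampling and mixing steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 1: a trace is assumed to be a list of (timestamp, packet) tuples already sorted by timestamp, the function names `mix` and `generate_tests` are hypothetical, and the address anonymization step is omitted for brevity.

```python
import heapq
import random


def mix(traces):
    """Merge time-sorted traces into one trace ordered by timestamp."""
    return list(heapq.merge(*traces, key=lambda packet: packet[0]))


def generate_tests(traces, m, n, seed=0):
    """Yield m synthetic tests, each mixing n traces sampled without replacement."""
    if not n < len(traces):
        raise ValueError("N must be smaller than K")
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    for _ in range(m):
        chosen = rng.sample(traces, n)  # sampling without replacement
        yield mix(chosen)
```

Because every trace starts at relative time 0, merging by timestamp overlays the sampled traces on top of each other, which raises the bandwidth of the resulting synthetic trace.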

Estimators and Confidence Intervals
The output of the simulation consists of M discrete statistics in the form of a cumulative distribution function (CDF). Each statistic from a test is an empirical CDF function $F_k$. The function $F_k(x_i) = y_i$ is a discrete cumulative "histogram" defined on L intervals such that each pair $(x_i, y_i)$ denotes $y_i$ observations of the random variable from the interval $[x_i, x_{i+1})$. The combined CDF function F is a collection of all empirical CDFs from the M tests, which can be treated as a distribution of results. To produce a point estimate from M CDF functions, the mean $\mu_F(x_i)$ and Confidence Interval (CI) estimates $CI(x_i)$ are computed for each interval.
The confidence interval estimate CI is calculated by estimating the 5th and 95th percentiles of the sample distribution: $T_5$, $T_{95}$. The quantile estimation method was chosen to be the delete-d jackknife with random subsampling [43]. Appendix A describes this method in detail. It is a first-order accurate estimator (similar to the bias-corrected bootstrap) [42] but less computationally intensive than the bootstrap for a small sample size (less than 1000).
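The per-interval summary can be illustrated with the Python sketch below. Note the substitution: it uses a plain nearest-rank empirical percentile instead of the delete-d jackknife used in this work, so it only shows the shape of the computation. The function names `percentile` and `summarize_cdfs` are hypothetical.

```python
import math
from statistics import mean


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, 0 < q <= 100."""
    rank = max(1, math.ceil(len(sorted_values) * q / 100))
    return sorted_values[rank - 1]


def summarize_cdfs(cdfs):
    """Per-interval mean and (5th, 95th) percentile band over M empirical CDFs.

    `cdfs` is a list of M equally long lists; cdfs[k][i] is F_k(x_i).
    """
    summary = []
    for values in zip(*cdfs):  # all M values observed for one interval x_i
        ordered = sorted(values)
        summary.append((mean(ordered), percentile(ordered, 5), percentile(ordered, 95)))
    return summary
```

Each tuple in the result corresponds to one interval and holds the mean together with the lower and upper bound of the band.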
The exact choice of L and the exact number of intervals depends on the generated statistic. For Figures 7, 8, and 9a it follows the logarithmic scale: where dom($F_k(x)$) denotes the domain of the empirical CDF function $F_k(x)$ from Equation (2).

Data Sources
For the simulation, two popular network traffic datasets were extensively used:
• WAND, Waikato Internet Traffic Storage archives from the WAND research group [44,45]. The traffic used in the paper comes from a New Zealand Internet Service Provider (ISP) core router.
• MAWI, MAWI Working Group Traffic Archive [46,47]. This is the most recent traffic coming from a Japanese ISP's backbone link.

Simulation Parameters
The simulation was configured as follows:
The test composition ratio R deserves a longer explanation, as it controls the outcome of the simulation. Since the number of tests M is constant throughout the experiment, the ratio R determines the number N of packet traces being sampled (Algorithm 1) and mixed (Algorithm 3). As the ratio R increases up to 1.0, the number of tests increases up to M. The number of sampled and mixed tests has two consequences for the statistics obtained from the simulation:
• The bandwidth of a synthetic packet trace (mixed by Algorithm 3) increases with the ratio R. This is important in the case of IP fragmentation, as the original packet traces contain less than 1% of the relevant traffic.
• The statistical diversity (variance of the estimators) obtained from samples diminishes as R grows. This is observed empirically in Section 4.3 and can be deduced from the standard formula for the variance of a sample mean:
$\mathrm{Var}(\bar{X}) = \sigma^2 / n$
where $\sigma^2$ is the variance of a single observation and n is the sample size.
As a result, the choice of the constant M should be a function of the desired traffic bandwidth and variance of the results. Note that M is not dependent on the number of original samples K. Therefore, the choice of M can be arbitrary. This is a major advantage of such a sampling method over the traditional approach. It is somewhat analogous to the application of the bootstrap, which improves the robustness of any statistical estimator based on the limited data at hand.

Step 2 For each flow $f_i \in U$, create a new flow $f'_i = (Unique(IP^0_{src}, id, i), Unique(IP^0_{dst}, id, i))$, where $Unique(IP, id, i) = IP + (id \ll 24) + i$. The unique IP address is a bit concatenation of two parts: a unique trace number IP + (id << 24) and a unique flow number i occupying the lower 24 bits of the IPv4 address.
Step 3 Create a new trace T by replacing all the IP addresses of the flows $f_i$ with $f'_i$.
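The address construction Unique(IP, id, i) = IP + (id << 24) + i can be sketched in Python with the standard `ipaddress` module. The function names and the base constants `10.0.0.0` and `110.0.0.0` are hypothetical; the sketch assumes that the lower 24 bits of each base constant are zero, so that the flow number occupies them without carrying into the trace number.

```python
import ipaddress


def unique_address(base_ip: str, trace_id: int, flow_index: int) -> str:
    """Unique(IP, id, i) = IP + (id << 24) + i, returned as dotted-quad text."""
    if not 0 <= flow_index < (1 << 24):
        raise ValueError("flow index must fit in the lower 24 bits")
    value = int(ipaddress.IPv4Address(base_ip)) + (trace_id << 24) + flow_index
    # IPv4Address raises an error if the sum does not fit in 32 bits.
    return str(ipaddress.IPv4Address(value))


def anonymize_flow(trace_id, flow_index, src0="10.0.0.0", dst0="110.0.0.0"):
    """Return the replacement (source, destination) pair for one flow."""
    return (unique_address(src0, trace_id, flow_index),
            unique_address(dst0, trace_id, flow_index))
```

Since the trace number lands in the top octet and the flow number in the lower three, flows coming from different traces cannot collide after mixing as long as the base constants are chosen so that the sums stay within 32 bits.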

Algorithm 3 Trace mixing. function Mix(S) → $T_{mix}$
Step 1 Create a new trace $T_{mix}$ by constructing a union of all traces: $T_{mix} := \bigcup_{T \in S} T$
Step 2 Sort all packets $p_i \in T_{mix}$ in ascending order according to the timestamp $time(p_i)$

Figure 7 shows the efficiency metric collected among all test runs. The X axis is the table size (number of entries), and the Y axis is the efficiency measured as a success rate:
$E = \frac{N_{successful}}{N_{complete}} \quad (10)$
where $N_{complete}$ is the total number of complete (defragmented) IP packets and $N_{successful}$ is the number of IP packets successfully defragmented by the given configuration (e.g., 16 entries, one-way). The efficiency data in Figure 7 demonstrates that higher bandwidth imposes more "stress" on the model, which lowers the efficiency (e.g., the lowest score is 0.8 for R = 0.9 vs. 0.92 for R = 0.1) but at the same time reduces the spread of the results.

Efficiency
The main observation is that the efficiency results support the research hypothesis: the IP defragmentation algorithm is highly efficient with relatively small table size. If the one-way table is excluded from the consideration, the efficiency of 95% is reached with the table size of 128.
The second conclusion is that there are diminishing returns from increasing the associativity beyond 4. Technical challenges that haunt the highly associative memories increase the relevance of minimizing this parameter. Figure 8 displays the sample mean and confidence interval (CI) of intra-arrival count. This metric is a distance between two packet fragments from the same flow, so it is an indicator of traffic locality. For instance, a CDF value 0.6 for distance 8 means that in 60% of cases, the next packet from the same flow was at most eight packets apart. This chart demonstrates the effect of increasing the throughput of the packet traces used for testing. For example, for a distance of 1, which means that the next packet is from the same fragment flow, the possible range of values is between 0.35 for R = 0.9 and 0.6 for R = 0.1.
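The intra-arrival count described above can be computed with the short Python sketch below. It is a hedged illustration: the input is assumed to be the sequence of flow identifiers of consecutive fragment packets in a trace, and the function name `intra_arrival_cdf` is hypothetical.

```python
from collections import Counter


def intra_arrival_cdf(flow_ids):
    """Empirical CDF of the packet distance to the previous packet of the same flow."""
    last_seen = {}          # flow id -> index of its most recent packet
    distances = Counter()   # distance -> number of observations
    for index, flow in enumerate(flow_ids):
        if flow in last_seen:
            distances[index - last_seen[flow]] += 1
        last_seen[flow] = index
    total = sum(distances.values())
    cdf, running = {}, 0
    for distance in sorted(distances):
        running += distances[distance]
        cdf[distance] = running / total
    return cdf
```

A distance of 1 means that two consecutive packets belong to the same fragment flow, so the value of the CDF at 1 is a direct indicator of how strongly the fragments cluster together.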

Intra-Arrival
The main conclusion from Figure 8 is that traffic locality is strong for all tests. Even accounting for the confidence intervals, there is at least a 50% chance that one of the next five packets is from the same flow, as can be read from the lower confidence bound of the series with R = 0.9.
The high traffic locality is the key to efficient flow processing and supports the results from Figure 7. The data should be interpreted as follows: fragmented IP packets are sent as a consecutive series of network frames and retain that property even in the aggregated traffic. This allows for processing them with relatively low memory consumption.
The notable trend that can be observed in Figure 8 and subsequent charts is that increasing the mix ratio R decreases the overall size of the confidence sets. This can be explained by the fact that the original diversity contained in the packet traces (smaller than synthetic) is recovered when the sampling ratio approaches 100%.

Flow Parallelism
In Figure 9, the number of flows in a packet window was approximated by a linear function for each of the data series associated with ratio R. The original metric measures how many unique flows can be found in a rolling window of a certain size:

$$W(n) = \mathrm{UniqueFlows}(w), \quad w = (p_i, p_{i+1}, \ldots, p_{i+n-1})$$

where $p_i$ is the $i$th packet in a trace, w is a packet window of size n, and UniqueFlows(w) is a flow counting function. This metric can be interpreted as a measure of "flow parallelism", i.e., how many flows are "active" in the same unit of time. For a single test, a window size n produces a population of measurements (one per window position), i.e., W(n) is a sampled random variable. The final metric is therefore the 95th percentile of all measurements W(n) for a given n. For example, a window of 256 with 100 flows, that is, W(256) = 100, means that in 95% of cases there are no more than 100 unique IP fragment flows in 256 consecutive packets. Since there are M tests, the estimate W(n) is itself a population sample, i.e., there are M samples of W(n) for each data series in Figure 9.
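The flow parallelism metric can be sketched in a few lines of Python. The sketch counts the unique flows in every window of n consecutive packets and reports the 95th percentile of those counts. The function names are hypothetical, the nearest-rank percentile definition is an assumption (the percentile method is not specified here), and the trace is assumed to contain at least n packets.

```python
import math
from typing import List, Sequence


def window_flow_counts(flow_ids: Sequence[int], n: int) -> List[int]:
    """Number of unique flows in every window of n consecutive packets."""
    return [len(set(flow_ids[i:i + n])) for i in range(len(flow_ids) - n + 1)]


def flow_parallelism(flow_ids: Sequence[int], n: int, percentile: float = 95.0) -> int:
    """W(n): nearest-rank percentile of the per-window unique flow counts."""
    counts = sorted(window_flow_counts(flow_ids, n))
    rank = math.ceil(percentile / 100 * len(counts))  # 1-based nearest rank
    return counts[rank - 1]


flows = [1, 1, 2, 2, 1, 3, 3, 4, 4, 4]  # toy trace of flow identifiers
w4 = flow_parallelism(flows, 4)         # 3 for this toy trace
```

The quadratic cost of rebuilding a set per window is acceptable for a sketch. A production model would more likely maintain a sliding counter of flows to process long traces.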
The point measurements (n, W(n)) are approximated by f(n) = an using the least squares method (the constant term, a.k.a. intercept, is fixed at 0). The slope coefficient a is slightly different for each data set, so Figure 9a displays the mean slope for each series of M tests along with the two biggest [5%, 95%] confidence regions. Each data series represents all tests created with the same mix ratio R. The detailed results of the linear fit can be found in Table 3. The minimal coefficient of determination $r^2$ is defined as

$$r^2 = 1 - \frac{\upsilon(W(n) - f(n))}{\upsilon(W(n))}$$

where (n, W(n)) are samples, f(n) ≈ W(n) is an approximating function and υ is a variance estimator. $r^2$ is more than 0.97 in all cases (shown in Table 3), which suggests a strong linear relationship (at least 97% of the variance is explained by the linear function f(n) = an).
Figure 9b shows the Gaussian kernel density estimate of the distribution of the slope parameter a. The variance becomes smaller as R grows. None of the distributions passes a t-test for identity with any other: the p-value is negligible for all pairs of distributions (less than $10^{-36}$).
Measuring the flow parallelism provides additional evidence for flow locality. Highly local traffic should have a lower W(n) for a given n than non-local traffic. Figure 9b shows the relation between the mix ratio R and the distribution of the slope of W(n) in the samples. The differences between the slope distributions are consistent with the previous metrics, e.g., intra-arrival. That is, the variance becomes smaller with higher R.
Figure 10 represents a histogram of the sample mean of the packet count in a single IP fragment flow. The expected number of fragments is 2, while the most extreme cases reach 8 packets and more. This follows the findings from Shannon et al. [1]. A higher number of packets in the fragment series most likely indicates an invalid flow (it can be a part of a DDOS attack or a firewall evasion technique). Fragment flows with packet count 1 are incomplete and cannot be reassembled.

Figure 10. Histogram of the sample mean and the confidence intervals of the fragment count. Each data series represents a different ratio R.
The distribution of the flow length can be used to resolve the question from Section 3.1.1 about the fragment list size. Since the number of fragments in a flow seldom exceeds 8, a list length of 8 should suffice for a practical implementation.

Simulation Results
The results from Section 4 indicate that the initial hypothesis of high traffic locality holds for a wide range of tests with real-world data. Furthermore, the amount of flow memory needed to successfully reassemble most of the IP fragments does not have to be large. The number of ways should be no less than 4 and no more than 16, while the flow table size should be 128 or larger. That parameter range yields an efficiency of at least 95% across tests.
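To make the two parameters (table size and number of ways) concrete, the following is a minimal Python sketch of a set-associative flow table together with the efficiency measure E (successfully defragmented packets over complete packets). It is an illustration under stated assumptions, not the simulation model used in this work: the `FlowTable` class, the modulo set index, the per-set LRU replacement, and the `(flow_id, total_fragments)` packet representation are all hypothetical, and timeouts and fragment list limits are ignored.

```python
from collections import OrderedDict
from typing import List, Sequence, Tuple

# One fragment: (flow_id, total_fragments_in_flow). Illustrative format only.
Fragment = Tuple[int, int]


class FlowTable:
    """Set-associative flow table with per-set LRU replacement (assumed policy)."""

    def __init__(self, entries: int, ways: int) -> None:
        assert entries % ways == 0, "entries must be a multiple of ways"
        self.ways = ways
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(entries // ways)]

    def process(self, flow_id: int, total: int) -> bool:
        """Record one fragment; return True when its flow is fully reassembled."""
        bucket = self.sets[flow_id % len(self.sets)]  # hypothetical index function
        seen = bucket.pop(flow_id, 0) + 1             # fragments seen so far
        if seen == total:
            return True                               # complete: entry released
        if len(bucket) >= self.ways:
            bucket.popitem(last=False)                # evict least recently used
        bucket[flow_id] = seen                        # store as most recently used
        return False


def efficiency(trace: Sequence[Fragment], entries: int, ways: int) -> float:
    """E = N_successful / N_complete, assuming every flow in the trace is complete."""
    table = FlowTable(entries, ways)
    successful = sum(table.process(flow, total) for flow, total in trace)
    complete = len({flow for flow, _ in trace})
    return successful / complete


# Flows 1 and 3 map to the same set and are interleaved; flow 2 is not.
trace = [(1, 2), (3, 2), (1, 2), (3, 2), (2, 2), (2, 2)]
one_way = efficiency(trace, entries=2, ways=1)   # flows 1 and 3 evict each other
two_way = efficiency(trace, entries=4, ways=2)   # both flows fit in one set
```

In this toy trace the one-way table reassembles only one of three packets, while two ways are enough to reassemble all of them, which mirrors the qualitative effect of associativity discussed above.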

Methodology
The original methodological approach undertaken for this particular simulation experiment deserves an independent assessment. The main conclusion from the large-scale traffic mixing experiment is that it allows the range of tests or simulations to be "enriched" in the case of an insufficient number of original test samples or when individual sample diversity is unsatisfactory. The mixing ratio R plays a crucial role in controlling the variance of the results. Choosing an R that is too high may result in a large number of traffic tests that are highly similar to each other. If the main purpose of such an experiment is parameter tuning, this may lead to "overfitting" (this hypothesis is beyond the scope of this work).
The randomized nature of the original sample selection for mixing should be controlled by "seeding" the random number generator. This ensures that individual tests are reproducible and do not need to be stored but can be generated on the fly. An additional benefit is that the required storage space depends only on the number of original samples rather than on the total number of tests. This is not a trivial concern, as packet traces from high-speed networks can easily consume terabytes of storage. The obvious cost of a large number of tests is the amount of computing power, which in this particular case approached hundreds of CPU hours. The recommended way to tackle this problem is to run many test cases in parallel on a computing cluster or a high-end multi-core platform.
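A minimal Python sketch of seeded test generation is shown below. It assumes, for illustration only, that the mix ratio R is the fraction of original traces drawn for each test. The function name, the file names, and the use of the test number as the seed are hypothetical.

```python
import random
from typing import List, Sequence


def select_traces(trace_names: Sequence[str], ratio: float, seed: int) -> List[str]:
    """Pick a reproducible random subset of the original traces for one test."""
    rng = random.Random(seed)  # private generator, global random state untouched
    # Assumption: ratio is the fraction of original traces used in one mix.
    count = max(1, round(ratio * len(trace_names)))
    return sorted(rng.sample(list(trace_names), count))


originals = [f"trace_{i:02d}.pcap" for i in range(10)]  # hypothetical file names
test_7 = select_traces(originals, ratio=0.5, seed=7)
same_again = select_traces(originals, ratio=0.5, seed=7)  # regenerated on the fly
```

Because the subset is a pure function of the seed, only the original traces and the list of seeds need to be kept. Each worker of a parallel run can rebuild its own test from the seed alone.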

Future Work
The sample generation technique presented in this paper can be adapted to a wide range of simulation problems, such as design space search for algorithms, hardware model verification, or fuzzing software systems (randomized test generators). Network-oriented machine learning is a growing and popular field that could directly benefit from large-scale packet sample synthesis. However, it is still undetermined how this particular method affects the supervised learning process or out-of-sample verification.
An obvious research direction when it comes to IP defragmentation is the replacement policy employed by the flow table. Some research mentioned in Section 2 (e.g., Yamaki [14]) was exclusively focused on developing a technique that can beat LRU in either performance or resource efficiency. Since the problem of IP defragmentation is somewhat different from connection tracking (as discussed in Section 1.3), it deserves an independent assessment of cache replacement policies.

TLS Transport Layer Security.

MPLS/GRE A tunnel protocol based on Multi Protocol Label Switching (MPLS) and Generic Routing Encapsulation (GRE).
HBM2 High Bandwidth Memory, an on-chip dynamic RAM optimized for high bandwidth. Popular in Graphics Processing Units (GPUs) and high-end FPGAs.
DDOS Distributed Denial of Service attack, a massive request stream, originating from many (geographically distributed) clients, aimed at overwhelming a network service.

Appendix A. Estimation Method
Let $\{X_1, \ldots, X_n\}$ denote a collection of independent and identically distributed (i.i.d.) samples. Let $T_n = T_n(X_1, \ldots, X_n)$ be a statistic of interest. The variance estimator $\upsilon_{JACK-d}$ of a statistic $T_n$ is based on computing a statistic $T_{r,s}$ on subsets $s \subset \{1, \ldots, n\}$ of size $r = n - d$ (d is the number of deleted samples in a single set). There are $N = \binom{n}{d}$ possible subsets s, so full estimation may be computationally more expensive than bootstrap, which typically uses a constant number of subsets from 1000 to 10,000 [41]. However, the collection of N statistics $T_{r,s}$ can also be sampled, in which case the method is called "random subsampling". The method draws a random sample $\{s_1, \ldots, s_m\}$ of m subsets and computes

$$\upsilon_{JACK-d} = \frac{r}{dm} \sum_{i=1}^{m} \left(T_{r,s_i} - \bar{T}\right)^2, \qquad \bar{T} = \frac{1}{m} \sum_{i=1}^{m} T_{r,s_i}$$

In order to retain consistency of the estimator, the numbers m and d should be chosen so that n/d is bounded while m → ∞ [43,48]. The term $\bar{T}$ is an estimate of the statistic obtained as a sample mean, and $\frac{r}{dm}$ is a variance correction term. Since the jackknife estimate is approximately normally distributed, the confidence interval can be computed using the Student approximation of the normal distribution. The confidence interval at level 1 − α can be obtained by

$$CI_{Jack} = \left[\bar{T} - t_{\alpha/2} \sqrt{\upsilon_{JACK-d}},\; \bar{T} + t_{\alpha/2} \sqrt{\upsilon_{JACK-d}}\right] \quad (A3)$$

where $t_{\alpha/2}$ is the critical value given by the inverse Student CDF.
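The estimation procedure can be illustrated with a short Python sketch of the delete-d jackknife with random subsampling. It assumes the common form of the variance estimate, (r / (d m)) Σ (T_{r,s_i} − T̄)² with T̄ the mean of the m subset statistics, which matches the description of T̄ and of the correction term r/(dm) above. To stay within the standard library, the sketch substitutes the normal quantile for the Student quantile, which is an approximation and a deviation from the text. All names are hypothetical, and the data values are made up for the example.

```python
import random
from statistics import NormalDist, mean
from typing import Callable, List, Sequence, Tuple


def jackknife_ci(
    samples: Sequence[float],
    statistic: Callable[[List[float]], float],
    d: int,
    m: int,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, Tuple[float, float]]:
    """Delete-d jackknife with random subsampling: (estimate, variance, CI)."""
    n = len(samples)
    r = n - d                                  # size of each retained subset
    rng = random.Random(seed)                  # seeded for reproducibility
    # Evaluate the statistic on m random subsets with d samples deleted.
    stats = [statistic(rng.sample(list(samples), r)) for _ in range(m)]
    t_bar = mean(stats)
    variance = r / (d * m) * sum((t - t_bar) ** 2 for t in stats)
    # Assumption: normal quantile used in place of the Student quantile.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * variance ** 0.5
    return t_bar, variance, (t_bar - half_width, t_bar + half_width)


# Made-up efficiency scores from ten hypothetical test runs.
data = [0.91, 0.95, 0.93, 0.97, 0.92, 0.96, 0.94, 0.95, 0.93, 0.96]
est, var, (low, high) = jackknife_ci(data, mean, d=3, m=500)
```

With a library that provides the Student distribution, the normal quantile could be replaced by the corresponding t quantile. The appropriate number of degrees of freedom is not stated here and would have to be chosen by the user.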