An Efficient Indexing Scheme for Network Traffic Collection and Retrieval System

Abstract: Historical network traffic retrieval, at both the packet and flow levels, has been applied in many fields of network security, such as network traffic analysis and network forensics. To retrieve specific packets from a vast number of packet traces, building indexes for the query attributes is an effective solution. However, packet indexing brings challenges of storage consumption and construction time overhead. To address these challenges, we propose an efficient indexing scheme called IndexWM based on the wavelet matrix data structure. Moreover, we design a packet storage format based on the PcapNG format for our network traffic collection and retrieval system, which can speed up the extraction of index data from packet traces. Offline experiments on randomly generated network traffic and actual network traffic are performed to evaluate the performance of the proposed indexing scheme. We choose an open-source and widely used bitmap indexing scheme, FastBit, for comparison. Apart from the native bitmap compression method Word-Aligned Hybrid (WAH), we implement an efficient bitmap compression method, Scope-Extended COMPAX (SECOMPAX), in FastBit for performance evaluation. The comparison results show that our scheme outperforms the selected bitmap indexing schemes in terms of time consumption, storage consumption and retrieval efficiency.

• Hash: A hash index uses a hash algorithm to convert the key value into a hash value. According to the hash value, the location at which an entry is inserted or queried can be found immediately, so lookups are very fast. Linked lists are the most commonly used method to resolve hash conflicts. Inevitably, as the number of entries increases, hash collisions become more frequent, which significantly reduces retrieval efficiency. Moreover, a hash index cannot support range queries or union queries. In Reference [7], Kornexl, Paxson et al. proposed a typical hash-based indexing scheme, Time Machine, for the archiving and retrieval of network traffic streams. It can greatly reduce the recorded volume by exploiting the heavy-tailed characteristics of network traffic. Due to the characteristics of hashing, however, Time Machine faces the problem that retrieval efficiency drops significantly as network traffic grows.
• B+ Tree: The B+ tree has been widely used in relational database management systems, such as MySQL and SQL Server. Compared with the hash index, the B+ tree index has good support for range queries. However, the B+ tree is not suitable for indexing columns with many duplicate values. For network traffic retrieval, some commonly used index attributes, such as the destination port, typically contain a large number of duplicate values. In References [10,11], the B+ tree was used to build indexes for network traffic. According to the evaluation results in Reference [10], the indexing performance for network traffic using B+ trees is up to 40 K records per second. Hence, the indexing efficiency of the B+ tree cannot meet the requirements of current 10G or even higher network links.
• Bitmap: Indexing packets with bitmaps is a popular way to solve the problem of network traffic retrieval. Many indexing schemes based on bitmaps, such as pcapIndex [12], FastBit [13] and NET-FLi [14], have been studied. A bitmap index maps a key value to a bit vector that contains 0 s and 1 s. Table 1 illustrates a bitmap example. The bitmap index supports efficient queries and can easily handle complex queries. However, its disadvantage is that it takes up much storage space when handling a large number of records. To address this challenge, many bitmap compression algorithms have been proposed. In Reference [16], Wu et al. proposed the WAH algorithm. For further compression on the basis of WAH, many variants have been proposed, including Position List WAH (PLWAH) [17], Compressed 'n' Composable Integer Set (CONCISE) [18], Compressed Adaptive Index (COMPAX) [14], SECOMPAX [19], Byte Aligned Hybrid (BAH) [20] and Compressing Dirty Snippet (CODIS) [21]. Some of them have already been applied to network traffic archival and retrieval. In Reference [22], Chen et al. presented a survey of different traffic archival and retrieval systems and various bitmap index compression algorithms. FastBit is an open-source data processing library that offers a set of searching functions supported by compressed bitmap indexes. FastBit uses the basic WAH for bitmap compression and is the most widely used bitmap index technique; for example, it has been used in several network traffic archiving and retrieval systems proposed in References [8,23,24]. In Reference [25], Fusco et al. presented a real-time WAH-based bitmap index for network traffic. In Reference [14], COMPAX was applied to the indexing of streaming network traffic. Although bitmaps have been compressed by various methods, the storage usage and retrieval efficiency still cannot meet the requirements of high-speed network links.
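To make the bitmap idea of Table 1 concrete, the following minimal sketch (illustrative only, not FastBit's API or an actual compressed representation) builds one uncompressed bit vector per distinct key value and answers an equality query by scanning the chosen vector:

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct key value to a bit vector: bit i is 1
    iff record i holds that value (as in Table 1)."""
    n = len(values)
    index = defaultdict(lambda: [0] * n)
    for i, v in enumerate(values):
        index[v][i] = 1
    return dict(index)

def query_eq(index, value):
    """Return the record positions matching `value`."""
    bits = index.get(value, [])
    return [i for i, b in enumerate(bits) if b]

# Example: destination ports of six packets
ports = [80, 443, 80, 22, 443, 80]
idx = build_bitmap_index(ports)
print(idx[80])             # [1, 0, 1, 0, 0, 1]
print(query_eq(idx, 443))  # [1, 4]
```

Complex queries reduce to bitwise AND/OR over such vectors, which is exactly what makes compression schemes like WAH attractive: they operate on the compressed words directly.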

Overview of Wavelet Matrix
In this paper, the proposed indexing scheme is implemented based on the wavelet matrix data structure introduced by Claude et al. in Reference [26]. This section provides a concise description of the wavelet matrix. The wavelet matrix is an alternative representation for large alphabets that retains all the properties of wavelet trees but is significantly faster [26]. Therefore, we first give a brief overview of the wavelet tree [27,28] before illustrating the usage of the wavelet matrix.

Wavelet Tree
The wavelet tree is a compact and efficient data structure that converts a string into a balanced binary tree composed of bit vectors. It was introduced by Grossi, Gupta and Vitter in Reference [27]. According to Reference [28], the wavelet tree data structure has been adopted in various applications, such as basic and weighted point grids, sets of rectangles, strings, inverted indexes, document retrieval indexes, full-text indexes and so on.
Let S = {a_1, a_2, ..., a_n} denote a sequence of symbols a_i ∈ Σ, where Σ is an alphabet of size σ. Three basic operations, namely rank, select and access, are defined for the wavelet tree. For a sequence S, the operation rank_a(S, i) returns the number of times that the symbol a appears in S from the start position to position i, the operation select_a(S, j) returns the position of the j-th occurrence of symbol a in S, and the operation access(S, i) retrieves the symbol at position i in S. Figure 1 shows an example of the balanced wavelet tree constructed from the sequence S = {1521863875843287}. At the root node, the alphabet Σ = {1, 2, 3, 4, 5, 6, 7, 8} of S can be divided into two parts, Σ_l = {1, 2, 3, 4} and Σ_r = {5, 6, 7, 8}. The symbols in S belonging to the alphabet Σ_l are marked with "0", and the symbols belonging to the alphabet Σ_r are marked with "1". Thus, the sequence S can be encoded as the bit sequence {0100110111100011} at the root node. In the bitmap of each node, a 0 bit means the corresponding symbol belongs to the left subtree of this node, and a 1 bit means that it belongs to the right subtree. This procedure is executed recursively at each node until reaching the leaf nodes.
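The root-level encoding described above can be reproduced in a few lines of Python (an illustrative sketch, not the paper's implementation): symbols in the lower half of the alphabet are marked 0 and the rest 1.

```python
def root_bitmap(seq, alphabet):
    """Encode a sequence as the root-level bit vector of a balanced
    wavelet tree: 0 for the lower half of the alphabet, 1 for the upper."""
    mid = alphabet[len(alphabet) // 2]  # first symbol of the right half
    return "".join("0" if s < mid else "1" for s in seq)

S = [1, 5, 2, 1, 8, 6, 3, 8, 7, 5, 8, 4, 3, 2, 8, 7]
print(root_bitmap(S, list(range(1, 9))))  # 0100110111100011, as in Figure 1
```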

A balanced wavelet tree constructed from sequence S has σ leaves and a height of lg σ. At each level, n bits are required to represent the bit sequence, so in this basic form the space required by the wavelet tree is n lg σ + o(n lg σ) + O(σ lg n) bits. Using the bitmap implementation in Reference [29], the three operations described above are performed in O(lg σ) time. The execution time is therefore independent of the length of sequence S and depends only on the alphabet size σ. However, the space consumption increases significantly for large alphabets, because extra O(σ lg n) bits are required to store the topology of the tree. Removing the O(σ lg n) bits used for pointers is thus important for large alphabets, and the wavelet matrix is a good solution for this purpose.

Wavelet Matrix
Compared with the wavelet tree, the wavelet matrix employs a simpler mapping mechanism from one level to the next. Specifically, all the symbols that are marked with "0" at each level go left, and the rest go right. For each level, a single value z_l is stored to record the number of zeros in level l. Figure 2a,b show the wavelet tree without pointers and the wavelet matrix for the sequence in Figure 1, respectively.
To further illustrate the usage of the wavelet matrix, we describe the three basic operations on a sequence S in detail. Let B_l denote the bitmap at level l and let i = i_0. To perform access(S, i), the first step is to check the value of B_0(i_0). If B_0(i_0) = 0, the corresponding position i_1 at the next level is set to rank_0(B_0, i_0). Otherwise, i_1 is set to z_0 + rank_1(B_0, i_0). A similar operation is then performed at the next level, and this procedure is repeated until a leaf node is reached. The value at the leaf node is the result of access(S, i).
For the computation of rank_a(S, i), the scope from position 0 to position i should be tracked step by step. Initially, p_0 = 0 and i = i_0. At each level l, p_l is mapped to p_{l+1} and i_l is mapped to i_{l+1}. If the symbol a is marked with "0" at level l, then p_{l+1} is set to rank_0(B_l, p_l) and i_{l+1} is set to rank_0(B_l, i_l). Otherwise, p_{l+1} = z_l + rank_1(B_l, p_l) and i_{l+1} = z_l + rank_1(B_l, i_l). The procedure is repeated until arriving at the leaf level, and the final result of rank_a(S, i) is i_l − p_l. Compared with the standard wavelet tree, the rank operation needs an extra binary rank operation on the wavelet matrix, because the position p_l must be tracked. It can be accelerated by storing an array C, in which C[a] points to the start position of the symbol a in the array of leaves; the final return value of rank_a(S, i) is then i_l − C[a].
The select is the inverse operation of rank. For the operation select_a(S, j), starting from the leaf node corresponding to the symbol a, the position of a is tracked upward. If the position j_l of symbol a at level l is mapped from a "0" in level l − 1, the corresponding position j_{l−1} is select_0(B_{l−1}, j_l). Otherwise, the position j_{l−1} is select_1(B_{l−1}, j_l − z_{l−1}). The position j_0 obtained at the root bitmap is the final result.
Algorithm 1 gives the pseudocode for the three basic operations on the wavelet matrix: on the wavelet matrix of sequence S, access(0, i) returns S[i], rank(0, a, i) returns rank_a(S, i), and select(a, j) returns select_a(S, j). In the pseudocode, label(v) denotes the symbol at the leaf node v and mark_l(a) denotes the corresponding bit of symbol a at level l.
Just like the wavelet tree, the wavelet matrix is an excellent index data structure. It is very suitable for the compressed representation of data, which is achieved by using specific coding on the bitmap or changing the shape of the tree. At the same time, it supports three fast basic operations: rank, select and access. These properties make the wavelet matrix very useful for packet indexing in the case of high-volume network traffic. Our work is dedicated to extracting packet fields from network packet traces and then indexing these fields for the fast retrieval of network traffic. We give a detailed description of our index scheme based on the wavelet matrix in Section 4.
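The level-wise mapping above can be sketched directly in Python. This is a minimal, unoptimized illustration of the standard wavelet matrix construction and its access/rank operations (a linear scan stands in for the constant-time bitmap rank used in practice; it is not the paper's implementation):

```python
class WaveletMatrix:
    """Minimal wavelet matrix supporting access and rank, following the
    level-wise mapping described above (0-marked symbols go left)."""
    def __init__(self, seq, bits):
        self.bits = bits
        self.levels = []   # one bitmap per level, most significant bit first
        self.zeros = []    # z_l for each level
        cur = list(seq)
        for l in range(bits):
            shift = bits - 1 - l
            bm = [(s >> shift) & 1 for s in cur]
            self.levels.append(bm)
            self.zeros.append(bm.count(0))
            # stable partition: 0-marked symbols left, 1-marked right
            cur = [s for s, b in zip(cur, bm) if b == 0] + \
                  [s for s, b in zip(cur, bm) if b == 1]

    @staticmethod
    def _rank(bm, bit, i):
        """Number of `bit` values in bm[0:i] (O(n) scan for clarity)."""
        return sum(1 for b in bm[:i] if b == bit)

    def access(self, i):
        sym = 0
        for l, bm in enumerate(self.levels):
            b = bm[i]
            sym = (sym << 1) | b
            i = self._rank(bm, 0, i) if b == 0 else \
                self.zeros[l] + self._rank(bm, 1, i)
        return sym

    def rank(self, a, i):
        """Occurrences of symbol a in seq[0:i]."""
        p = 0
        for l, bm in enumerate(self.levels):
            if (a >> (self.bits - 1 - l)) & 1 == 0:
                p, i = self._rank(bm, 0, p), self._rank(bm, 0, i)
            else:
                p = self.zeros[l] + self._rank(bm, 1, p)
                i = self.zeros[l] + self._rank(bm, 1, i)
        return i - p

S = [1, 5, 2, 1, 8, 6, 3, 8, 7, 5, 8, 4, 3, 2, 8, 7]
wm = WaveletMatrix(S, 4)   # symbols fit in 4 bits
print(wm.access(4))        # 8
print(wm.rank(8, 16))      # 4 occurrences of 8 in the whole sequence
```

Note that only lg σ bitmaps and the z_l values are stored; no tree pointers are needed, which is exactly the advantage over the pointer-based wavelet tree.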

Implementation

In this section, we first introduce the architecture of our network traffic collection and retrieval system and the function of each module in the system. Then we show the design of the packet storage format based on the PcapNG format. Finally, we give the implementation details of the proposed indexing scheme called IndexWM.

System Architecture

In this part, we present the architecture of the proposed network traffic collection and retrieval system in detail. As illustrated in Figure 3, the system consists of four modules. The collection module obtains real-time network traffic from the network interface or reads network packets from packet trace files stored on disk. In addition, this module performs session reassembly to eliminate the influence of packet disorder, retransmission and fragmentation on subsequent analysis, analyzes application layer protocols and generates application logs. The storage module records raw packets, flow information and application layer logs. The raw packets are stored in a custom format based on PcapNG. Application layer logs contain application data, which can be used for the analysis of historical network behaviors. The indexing module extracts index data from the packet trace files and builds indexes for packets. The retrieval module receives queries from users and retrieves the indexes associated with the query attributes to obtain the numbers and offsets of the matched packets. With these offsets, the system can easily extract the matched packets from the packet trace files. Usually, queries are generated based on the application logs.
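The data flow between the four modules can be sketched as follows. All class and method names here are hypothetical placeholders chosen for illustration; they are not the system's real API, and the single-attribute index stands in for the full IndexWM scheme:

```python
class CollectionModule:
    def capture(self, source):
        """Yield (reassembled) packets from an interface or a trace file."""
        yield from source

class StorageModule:
    def __init__(self):
        self.trace = []                 # stands in for a PcapNG-based trace file
    def record(self, pkt):
        self.trace.append(pkt)
        return len(self.trace) - 1      # offset of the stored packet

class IndexingModule:
    def __init__(self):
        self.index = {}                 # attribute value -> packet offsets
    def add(self, pkt, offset):
        self.index.setdefault(pkt["dstport"], []).append(offset)

class RetrievalModule:
    def __init__(self, storage, indexing):
        self.storage, self.indexing = storage, indexing
    def query(self, dstport):
        """Look up offsets in the index, then extract packets from the trace."""
        return [self.storage.trace[o]
                for o in self.indexing.index.get(dstport, [])]

storage, indexing = StorageModule(), IndexingModule()
for pkt in CollectionModule().capture([{"dstport": 80}, {"dstport": 22}]):
    indexing.add(pkt, storage.record(pkt))
print(RetrievalModule(storage, indexing).query(80))  # [{'dstport': 80}]
```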

Packet Storage Format

Here we first give a concise introduction to the Pcap Next-Generation (PcapNG) file format [3]. A PcapNG file must start with a Section Header Block (SHB). Generally, the whole file contains only one section, but a dump file may contain more than one SHB. A typical PcapNG file is composed of a Section Header Block (SHB), a single Interface Description Block (IDB) and several Enhanced Packet Blocks (EPBs). The PcapNG file format possesses good extensibility and portability. Meanwhile, it supports appending data to the end of a given file without affecting the readability of the file.

As described in Section 4.1, the collection module carries out a deep analysis of the network traffic to obtain some metadata of packets, such as the timestamp, payload length and application layer protocol. By attaching this metadata to the packet trace file, the difficulty of extracting index data is significantly reduced for the indexing module. Therefore, we introduce a custom packet storage format based on the PcapNG format. In our design, we add options that include some metadata to the SHB and EPBs. At the same time, a Timestamp Index Block (TIB) is defined and appended to the end of the trace file. The packet storage format we designed is shown in Figure 4. Next, we give a detailed description of this format.

The option added to an SHB contains file description information, including, but not limited to, the magic number, file generation time and packet capture device information. For each EPB, we embed an option that contains the description information of the packet or flow entry. The description information includes, but is not limited to, the VLAN Tag, layer 3 protocol (L3_Proto), layer 3 offset (L3_Offset), layer 4 protocol (L4_Proto), layer 4 offset (L4_Offset), Payload Length (PayloadLen), Application Layer Protocol (App_Proto), Session ID, etc. The specific format of the EPB option can be flexibly defined according to the practical system requirements. Thus, there is no need to parse packets again when we build indexes for them.

The retrieval module faces the problem of how to retrieve packets quickly among a huge number of packets and how to limit the scope of retrieval, thereby enhancing the efficiency of packet retrieval. Generally, the packet retrieval conditions contain start and end timestamps, but it is not a good idea to build indexes for the timestamp of every packet. To solve this problem, we design a Timestamp Index Block (TIB) that contains packet numbers and the corresponding timestamp information. The TIB is appended to the end of the file. Figure 5 shows the format of the TIB.

Now we explain the steps of recording packets with the proposed packet storage format in detail. We first create an SHB with a file description option and a single IDB. Then, for each received packet, we create an EPB to record the content of the packet. Meanwhile, an option that contains the packet description information is added to the EPB. This process of writing EPBs is repeated until no packets arrive or the file size reaches the set threshold. Starting from the first packet, we record the current packet number and timestamp every N packets and add them to the TIB. The value of N can be adjusted according to the file size threshold. In any case, the number and timestamp of the last packet must be written to the TIB. Finally, we append the TIB to the end of the file.
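The recording steps above can be sketched as follows. The block layouts are simplified placeholders (dictionaries rather than the real PcapNG byte encoding), so this only illustrates the control flow: one EPB per packet, a TIB entry every N packets starting from the first, and always an entry for the last packet:

```python
def record_trace(packets, n=4):
    """Sketch of the recording loop: `packets` is a list of
    (timestamp, data) pairs; returns the block sequence of one file."""
    blocks = [{"type": "SHB"}, {"type": "IDB"}]
    tib = []
    for num, (ts, data) in enumerate(packets, start=1):
        blocks.append({"type": "EPB", "timestamp": ts, "data": data})
        # every n-th packet starting from the first, plus the last packet
        if (num - 1) % n == 0 or num == len(packets):
            tib.append((num, ts))
    blocks.append({"type": "TIB", "entries": tib})
    return blocks

pkts = [(100 + i, b"pkt") for i in range(10)]
trace = record_trace(pkts, n=4)
print(trace[-1]["entries"])  # [(1, 100), (5, 104), (9, 108), (10, 109)]
```

During retrieval, a binary search over these sparse (packet number, timestamp) pairs narrows a time range to a small packet-number interval without indexing every timestamp.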

Indexing

Based on the wavelet matrix data structure, we propose an indexing scheme called IndexWM. Let us first introduce the attributes used for indexing. Although any field in a packet can be used as an index attribute, the common index attributes include the timestamp, source IP address (SRCIP), destination IP address (DSTIP), source port (SPORT), destination port (DPORT), layer 4 protocol (L4_Proto), payload length (PayloadLen), application layer protocol (App_Proto) and so on. In this paper, we select the classical 5-tuple, SRCIP, DSTIP, SPORT, DPORT and L4_Proto, as index attributes to simplify the implementation and evaluation.

As shown in Figure 6, the first step of indexing is to extract the values of the index attributes from the original packet trace file. In our implementation, the packet trace file is archived in the format described in Section 4.2, so the process of extracting the 5-tuple information is significantly simplified. For example, to get the IP address of a packet, we only need to obtain the L3 offset and L3 protocol from the EPB option and then jump to the corresponding position to read the IP address value according to the L3_Proto format. The values of other fields in the packet can be obtained similarly. Moreover, we maintain an array that stores the offsets of the packets; the size of this array is equal to the number of packets. Using the offsets in this array, we can quickly extract the matched packets from the packet trace file. After all the index data have been extracted from the trace file, we construct a wavelet matrix for each attribute of the 5-tuple. In our implementation, we divide the SRCIP and DSTIP into four parts to reduce the size of the alphabet, thereby improving the retrieval efficiency of network traffic. The timestamp and packet number information in the TIB at the end of the trace file is utilized to index the timestamp.
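The splitting of IP addresses into four parts can be sketched as follows. Each IPv4 address becomes four octet sequences, so every wavelet matrix works over an alphabet of at most 256 symbols instead of 2^32. The record layout and column names here are simplified stand-ins for the EPB option fields, not the system's real format:

```python
def extract_index_data(records):
    """`records` holds (offset, srcip, dstip, sport, dport, proto) tuples
    with IPs as 32-bit integers. Returns per-attribute symbol sequences
    plus the parallel packet-offset array."""
    columns = {f"srcip{i}": [] for i in range(4)}
    columns.update({f"dstip{i}": [] for i in range(4)})
    columns.update({"sport": [], "dport": [], "proto": []})
    offsets = []
    for off, src, dst, sport, dport, proto in records:
        offsets.append(off)
        for i in range(4):
            shift = 8 * (3 - i)                       # octet i, network order
            columns[f"srcip{i}"].append((src >> shift) & 0xFF)
            columns[f"dstip{i}"].append((dst >> shift) & 0xFF)
        columns["sport"].append(sport)
        columns["dport"].append(dport)
        columns["proto"].append(proto)
    return columns, offsets

# 10.10.0.80 -> 0x0A0A0050, 192.168.0.1 -> 0xC0A80001
cols, offs = extract_index_data([(0, 0x0A0A0050, 0xC0A80001, 1234, 80, 6)])
print(cols["srcip0"], cols["srcip3"])  # [10] [80]
```

One wavelet matrix would then be built per column, with the offsets array translating matched positions back into file offsets.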
In Reference [30], Raman, Raman and Rao proposed the RRR implementation for bitmaps, which can answer rank queries in O(1) time and provides implicit compression for bit sequences. With the RRR structure, a bitmap is divided into blocks. Each block is represented by b bits, and f blocks are grouped into a superblock. Each block can be replaced with a pair of values, namely a class value c and an offset value o. The class value is the number of 1 s in the block, and the offset value indicates the index of this block in the RRR table. Figure 7 shows an example of the RRR table. Each superblock boundary stores the sum of the ranks of all preceding blocks, which avoids iterating over the entire RRR sequence to answer a rank query: we only need to jump to the corresponding superblock and then iterate over the blocks within it. Although the RRR implementation incurs some performance penalty in practice, it can significantly reduce the index size. In our indexing scheme, the RRR structure is employed to represent the bit sequences in the wavelet matrix, with b set to 32 and f set to 4. The rank operation on a bit sequence is the key step in performing the three operations described in Section 3. To answer a rank query at position i, we first locate the superblock containing i and initialize the result to the precomputed sum stored at its boundary. Then, for each complete block in this superblock before position i, we compute its rank from the class-offset pair (c, o) and add it to the result. Finally, the partial rank within the block containing position i is added to obtain the final result. It is worth noting that the proposed indexing method supports indexing more packet fields, such as the flag fields in IP or TCP, the payload length and the application layer protocol. The alphabet sizes of these fields are small, so indexing them does not take up much storage space. In short, we can flexibly select fields on which to build wavelet-matrix indexes.
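The rank procedure above can be sketched as follows (illustrative only: plain blocks stand in for the (c, o) pairs of a real RRR structure, and summing a block stands in for reading its class value):

```python
# Simplified RRR-style rank: blocks of B bits grouped into superblocks of
# F blocks, with a cumulative rank stored at each superblock boundary.

B, F = 32, 4  # values used in the paper

def build_rrr(bits):
    blocks, super_rank, total = [], [], 0
    for i in range(0, len(bits), B):
        if (i // B) % F == 0:
            super_rank.append(total)   # rank of all preceding bits
        blocks.append(bits[i:i + B])
        total += sum(bits[i:i + B])
    return blocks, super_rank

def rank1(blocks, super_rank, i):
    """Number of 1s in bits[0:i]."""
    sb = i // (B * F)
    result = super_rank[sb]            # precomputed sum at the boundary
    full = i // B
    for bi in range(sb * F, full):     # whole blocks inside the superblock
        result += sum(blocks[bi])      # block popcount (the class value c)
    if i % B:
        result += sum(blocks[full][: i % B])  # partial block
    return result
```

With B = 32 and F = 4, a rank query touches at most one boundary value, three whole blocks and one partial block, regardless of the bitmap length.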


Index Query
In this part, we give the procedure of packet retrieval in detail through an example. To retrieve specific packets, we first need to query the index files on the disk. Here we consider two types of queries: single attribute query and multi-attribute query.
A single-attribute query refers to a query on a single index attribute. The value of the query attribute can be an exact value (e.g., SRCIP = "10.10.0.80") or a range represented by a wildcard (e.g., SRCIP = "10.10.8.*"). A multi-attribute query can be regarded as the intersection or union of several single-attribute queries, which means that it can be decomposed into several independent subqueries whose results are merged to obtain the final query result. Figure 8 presents the procedure of packet lookup and extraction for the query SRCIP = "10.10.*.*" AND DPORT = 80. This query is the intersection of a query on SRCIP and a query on DPORT, where the value of SRCIP is given in wildcard form. Obviously, if a part of a query attribute is represented by a wildcard, we do not need to query the corresponding index file. Therefore, for the example in Figure 8, we only need to consider three index attributes: the first byte of SRCIP, the second byte of SRCIP and the DPORT. The first step is to search the corresponding index file to obtain a list of matched packets for each attribute in the query. Given the wavelet matrix structure described in Section 3.2, the number of packets that satisfy a query attribute can be obtained by performing the rank() operation. Then, we execute the select() operation iteratively to determine the packet number of each matched packet. The list of matched packets for each attribute is represented by a binary array whose size equals the number of packets. The per-attribute results are then merged into a single consistent list of packet numbers by performing the AND operation. In the indexing phase, we maintain the mapping between packet numbers and offsets, so we can easily obtain the offsets of the matched packets, which are used to extract packets from the trace file.
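The lookup-and-merge procedure can be sketched as follows, with plain Python lists standing in for the wavelet matrix (rank() and select() on a symbol sequence behave the same way, only slower):

```python
# Sketch of the lookup-and-merge step (illustrative only; a real
# implementation performs rank/select on the wavelet matrix).

def rank(seq, sym, i):
    """Occurrences of sym in seq[0:i] (answered in O(lg sigma) by the matrix)."""
    return seq[:i].count(sym)

def select(seq, sym, k):
    """Position of the k-th (1-based) occurrence of sym."""
    seen = 0
    for pos, v in enumerate(seq):
        if v == sym:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")

def matches(seq, sym):
    """Binary array marking packets whose attribute equals sym."""
    n = len(seq)
    total = rank(seq, sym, n)        # number of matching packets
    out = [0] * n
    for k in range(1, total + 1):
        out[select(seq, sym, k)] = 1  # packet number of each match
    return out

# SRCIP = "10.10.*.*" AND DPORT = 80: query the first two SRCIP bytes
# and DPORT, then AND the three binary arrays (toy 4-packet data).
srcip_b0 = [10, 10, 192, 10]
srcip_b1 = [10, 20, 168, 10]
dport    = [80, 80, 80, 443]
merged = [a & b & c for a, b, c in zip(matches(srcip_b0, 10),
                                       matches(srcip_b1, 10),
                                       matches(dport, 80))]
# merged == [1, 0, 0, 0]: only the first packet satisfies all subqueries
```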
Generally speaking, timestamp information is one of the essential attributes of a query; without a limit on the timestamp range, it is difficult to guarantee retrieval efficiency for high-volume network traffic. Queries involving timestamps are handled differently because the timestamp information is not indexed by the wavelet matrix. As described in the packet storage format in Section 4.2, we create a timestamp index every N packets and append a custom timestamp index block (TIB) holding these indexes to the end of the trace file. The value of N is set to 10,000 empirically to keep the timestamp index data small. This means that we can quickly find the packet number range matching a given timestamp query through a binary search in the TIB. In this way, the scope of retrieval is significantly reduced, thereby reducing the number of select() operations performed for the other query attributes.
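The TIB lookup amounts to a binary search over (timestamp, packet number) entries; the sketch below is ours (with a toy N of 4 rather than the paper's 10,000):

```python
# Sketch of the timestamp lookup: the TIB stores one (timestamp, packet
# number) entry every N packets, so a binary search bounds the packet-number
# range for a time-range query.

import bisect

N = 4  # illustration only; the paper sets N = 10,000

def packet_range(tib, t_start, t_end):
    """Return a conservative [lo, hi) packet-number range for [t_start, t_end]."""
    stamps = [t for t, _ in tib]
    i = bisect.bisect_right(stamps, t_start) - 1  # last entry at or before t_start
    j = bisect.bisect_right(stamps, t_end)        # first entry after t_end
    lo = tib[max(i, 0)][1]
    hi = tib[j][1] if j < len(tib) else tib[-1][1] + N
    return lo, hi

tib = [(100, 0), (110, 4), (120, 8), (130, 12)]  # (timestamp, first packet no.)
lo, hi = packet_range(tib, 112, 121)             # only packets 4..11 need select()
```

Only packets inside [lo, hi) then need the per-attribute rank/select lookups, which is how the timestamp condition shrinks the retrieval scope.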

Performance Optimization
The packet storage format designed in this paper is used to speed up the extraction of index data, and our indexing approach focuses on reducing the storage space and accelerating packet retrieval. In this part, we present several other techniques that improve the overall performance of the system proposed in Section 4.1.
With the rapid growth of network traffic, the traditional kernel stack is facing the problem of packet loss on high-speed network links. Thus, a large number of solutions, either software-based or hardware-based, such as Netmap [31], PF_RING [32], Data Plane Development Kit (DPDK) [33] and PacketShader [34], have been proposed to accelerate network packet processing. In the collection module of our system, DPDK is employed to capture high-speed network traffic. DPDK runs in user space and receives packets through polling mode, which eliminates the overhead of memory copy between user mode and kernel mode. The performance is optimized through CPU affinity, huge pages, memory pool management, lock-free ring queues, prefetching and some other techniques. To reduce the overhead of dynamic memory allocation, the memory pool in DPDK is used to pre-allocate memory for our application. Moreover, we focus on making full use of the multi-core hardware to parallelize different modules in our system. Based on the network card RSS technology, we distribute network traffic to multiple cores for parallel processing. The lock-free ring queue proposed in DPDK is exploited for inter-core communication.
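As a simplified illustration of this flow distribution (ours, not DPDK code; real RSS computes a Toeplitz hash in NIC hardware), a hash of the 5-tuple can select the receive queue so that all packets of one flow are handled by the same core:

```python
# Simplified RSS-style flow distribution: a hash of the 5-tuple selects a
# receive queue, so every packet of a flow lands on the same core's queue.
# (Illustrative only; a stdlib hash stands in for the hardware Toeplitz hash.)

NUM_QUEUES = 4  # hypothetical number of per-core receive queues

def rss_queue(srcip, dstip, sport, dport, proto):
    """Map a 5-tuple (IPs as integers) to a queue index."""
    return hash((srcip, dstip, sport, dport, proto)) % NUM_QUEUES
```

The key property is determinism per flow: repeated packets of the same 5-tuple always map to the same queue, which keeps per-flow state local to one core and makes the lock-free inter-core rings sufficient for communication.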
The indexing module runs multiple threads, each of which builds the index for one index attribute.
To save storage space, it is necessary to dump the packet trace files and index files to the disk in a compressed format. Due to the additional performance penalties caused by compression and decompression, we need to make a tradeoff between space and time. A comparison of several widely used compression algorithms, including gzip, bzip2, lzma, xz, lz4 and lzop, is given in Reference [35]. It can be seen from the comparison results that the lz4 compression algorithm is far superior to other compression algorithms in terms of time consumption. For this reason, we choose the lz4 compression algorithm for compression, although it does not have an advantage in the compression ratio. Usually, the index file size is much smaller than the original trace file size, so the compression time of the index file is insignificant.

Performance Evaluation
In this section, we present a performance evaluation of the proposed indexing approach based on the following three metrics: time consumption for indexing, storage consumption for indexing and query response time. The platform used for our experiments is equipped with an Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40 GHz (Intel, Santa Clara, California, USA), 16 GB physical memory and a 20 TB SAS HDD (TOSHIBA MG04SCA20EN, Tokyo, Japan). The operating system is CentOS 7.4 (kernel version 3.10.0-693.el7.x86_64). The bitmap index is the most commonly used approach for packet indexing, and WAH [16] is a representative bitmap compression algorithm. Several algorithms, such as PLWAH [17], COMPAX [14] and SECOMPAX [19], have been proposed on the basis of WAH to achieve further compression. From the results of Reference [19], it can be concluded that SECOMPAX outperforms the other bitmap compression algorithms in terms of index size. Therefore, we choose WAH and SECOMPAX for performance comparison with the proposed IndexWM. FastBit [13] is a practical implementation of WAH that supports efficient data processing and flexible queries; its open-source code is available at https://sdm.lbl.gov/fastbit/. We also implement SECOMPAX in FastBit. These two indexing methods are called FastBit-WAH and FastBit-SECOMPAX, respectively. In addition, we discuss the performance comparison between our method and pcapIndex [12], which uses COMPAX [14] for bitmap compression.
In our experiment, we build indexes on five attributes: SRCIP, DSTIP, SPORT, DPORT and L4_Proto. Six packet trace files, namely P1, P2, P3, P4, P5 and P6, are used for the performance evaluation. P1 and P2 come from real-time communication data on the egress route of a research institution, P3, P4 and P5 are randomly generated by a Spirent traffic generator, and P6 is a public traffic dataset from [36]; the capture length of P6 is 96 bytes. The details of the six packet datasets are shown in Table 2. In particular, since P5 and P6 contain more than 100 million packets, they are used to evaluate the indexing performance as a function of the number of packets. We mainly focus on the performance of the indexing and retrieval modules of the proposed system, so we perform only offline experiments for the evaluation: the collection module loads the packets from the packet trace files stored on the local disk instead of obtaining real-time network traffic from the network interface.

Time Consumption for Indexing
First, we evaluate the time consumption for indexing of the proposed IndexWM, FastBit-WAH and FastBit-SECOMPAX. To eliminate the impact of disk I/O, we load the index data into memory in advance. Figure 9a illustrates the comparison of index building time among IndexWM, FastBit-WAH and FastBit-SECOMPAX on the four datasets P1 to P4. It can be seen that the time consumption of FastBit-WAH is slightly less than that of FastBit-SECOMPAX. Next, we use the results of FastBit-WAH to illustrate the advantages of IndexWM in terms of time consumption for indexing. The time consumption of IndexWM for indexing on datasets P1, P2, P3 and P4 is 1.42, 2.08, 6.85 and 10.05 s, respectively, while that of FastBit-WAH is 4.21, 8.32, 32.66 and 51.82 s, respectively. Figure 9b shows the time consumption of the three methods on datasets P5 and P6 as a function of the number of packets. In our experiment, the number of packets varies from 10 million to 100 million in increments of 10 million. For the simulated network traffic P5, the time consumption of IndexWM ranges from 6.81 s to 68.15 s, while that of FastBit-WAH ranges from 28.85 s to 289.96 s. For the actual network traffic P6, the time consumed by IndexWM ranges from 5.86 s to 59.12 s, whereas that of FastBit-WAH ranges from 24.12 s to 242.10 s. The time consumption of these methods increases almost linearly with the number of packets. However, for the same number of packets, all three methods take less time to build indexes on P6 than on P5, because the randomly generated network traffic P5 has larger alphabet sizes than P6 for the index attributes. It can be concluded that IndexWM takes about 68% to 76% less time than FastBit-WAH to build indexes for the packet trace files in Table 2.
All the results above demonstrate that the IndexWM consistently outperforms FastBit-WAH and FastBit-SECOMPAX in terms of the time consumption for indexing.

Storage Consumption for Indexing
Indexing can improve the efficiency of packet retrieval, but it also incurs additional storage consumption, namely the cost of dumping index files to disk. Figure 10a illustrates the storage consumption for indexing on the four packet datasets in Table 2. Compared with FastBit-WAH, FastBit-SECOMPAX reduces the index size for datasets P1, P2, P3 and P4 by about 50%, while our IndexWM saves about 58.8%, 63.9%, 74.3% and 74.4% of the disk space on datasets P1, P2, P3 and P4, respectively. Figure 10b presents the index size for datasets P5 and P6 as a function of the number of packets. Next, we take FastBit-SECOMPAX and IndexWM as examples to explain in detail. For the simulated network traffic P5, the disk space consumption of IndexWM ranges from 116 MB for 10 million packets to 1159 MB for 100 million packets, while that of FastBit-SECOMPAX ranges from 232 MB to 2318 MB. As for the actual network traffic P6, the disk space consumed by
FastBit-SECOMPAX is from 97 MB for 10 million packets to 967 MB for 100 million packets. Similar to the time consumption, the storage consumption for indexing also increases almost linearly with the increased number of packets. Moreover, these three methods achieve better performance on P6. This can be explained by the fact that the alphabet size is smaller, and the compression ratio of bitmaps is higher for actual network traffic. Based on the above results, IndexWM performs better than FastBit-WAH and FastBit-SECOMPAX in terms of storage space consumption.

Query Response Time
Query response time is one of the most important indicators of packet retrieval efficiency. In our experiment, the query response refers to the number and the corresponding offsets of the packets that meet the query conditions. As shown in Table 3, three types of queries are considered for the evaluation of query response time. To illustrate the effectiveness and practicality of the proposed indexing scheme for actual network traffic, P2 and P6 are selected for the query performance evaluation. For a single-attribute query, we generate a query on each index attribute according to the values of the attribute in our dataset. A multi-attribute query is obtained as the intersection or union of several single-attribute queries. As for the wildcard queries, two queries are generated using the values of SRCIP and DSTIP, respectively. We generate a total of ten queries, including five single-attribute queries, three multi-attribute queries and two wildcard queries.

These queries are denoted by A to J, where F = {A ∩ C}, G = {B ∩ D} and H = {D ∩ E}.

Table 3. Example of three types of queries.

Types | Description | Examples
Single | Exact match on a single attribute | SRCIP = "10.

In our experiment, the query time depends on the time to retrieve the indexes and merge the results, which is closely related to the specific encoding method. The main purpose of the various WAH-based bitmap compression algorithms, such as PLWAH, COMPAX and SECOMPAX, is to reduce the index size while maintaining comparable query efficiency, as shown in References [17,19]. Therefore, FastBit-WAH is selected as the representative comparison scheme for evaluating query performance. The results of query response time for the various queries on P2 are shown in Figure 11a. On the small dataset P2, queries A, B, C, D and E extract 164,982, 90,946, 158,777, 531,774 and 2,878,828 packets, respectively. Query H, which is the intersection of query D and query E, returns the same number of matched packets as query D. Figure 11b shows the percentage of matched packets relative to the total number of packets. It can be seen from Figure 11a that IndexWM responds faster than FastBit-WAH, especially for single-attribute queries that retrieve only a small number of packets, such as queries A, B and C. The performance gap becomes smaller when a query returns a large number of matched packets, because the more packets are extracted, the more select operations IndexWM must perform. This effect is more pronounced in our experiments on the large dataset P6. Figure 12a gives the evaluation results of query response time on P6, whose total number of packets exceeds 110 million. As shown in Figure 12b, the percentages of matched packets for single-attribute queries A, B, C, D and E on P6 are 23.38%, 0.48%, 18.88%, 6.70% and 86.2%, respectively. For queries A, C and E, more than 10 million packets are retrieved; as a result, IndexWM responds more slowly than FastBit-WAH.
For queries B, G and I, IndexWM is faster than FastBit-WAH because the percentage of matched packets is very low.
According to the above results, we can conclude that IndexWM outperforms FastBit-WAH in terms of response time when the number of extracted packets is small, whereas the result may be the opposite when a large number of packets are extracted. In our experiment, the timestamp attribute is not taken into consideration. However, in a practical network traffic collection and retrieval system, the timestamp is usually a necessary query condition, which reduces the retrieval range; in most cases, the number of packets within the timestamp range will not reach a million or more. Thus, in terms of query response time, IndexWM remains more efficient than FastBit-WAH in a practical system.

Performance Evaluation Discussion
According to Reference [26], the basic form of the wavelet matrix requires n lg σ + o(n lg σ) bits of space and can be built in O(n lg σ) time, where n is the number of packets and σ is the alphabet size of the index attribute. As we can see from Figures 9b and 10b, these two performance metrics are approximately linear in the number of packets for the same packet dataset. This is because the alphabet size changes very little when the number of packets varies from 10 million to 100 million within the same dataset. In addition, all the schemes perform better on actual network traffic. This phenomenon is closely related to the statistical characteristics of the index attributes in different packet traces. Taking the DPORT attribute as an example, the alphabet size of DPORT for 10 million packets in P6 is 32,871, while that for 10 million packets in P5 increases to 65,536; the smaller alphabet also leads to a better compression ratio for the bitmaps. The situation is similar for the other index attributes. Figure 9a,b shows that the wavelet matrix can be built within 1 s for one million packets, which, compared to the bitmap index schemes, meets the short construction time requirement of our network traffic collection and retrieval system. As illustrated in Figure 10a,b, the index data size of IndexWM amounts to about 1% to 3% of the volume of the original packet traces in Table 2.
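The leading n lg σ term can be checked with a quick back-of-the-envelope calculation (ours, using the alphabet sizes quoted above). Note that with whole-bit levels both DPORT alphabets round up to 16 bits per symbol, so the smaller alphabet of P6 pays off mainly through the better compressibility of the resulting bit sequences:

```python
# Back-of-the-envelope check of the n * lg(sigma) space bound for one attribute.
import math

def wm_bits(n, sigma):
    """Leading-order space of a basic wavelet matrix: n * ceil(lg sigma) bits."""
    return n * math.ceil(math.log2(sigma))

n = 10_000_000
# DPORT alphabet sizes reported in the text for P5 and P6
mb_p5 = wm_bits(n, 65_536) / 8 / 1_000_000   # 20.0 MB (16 bits per symbol)
mb_p6 = wm_bits(n, 32_871) / 8 / 1_000_000   # also 20.0 MB, since ceil(lg 32871) = 16
```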
As described in Section 4.3, retrieving specific packets from the indexes requires performing one rank operation and several select operations on the wavelet matrix. Both operations run in O(lg σ) time, so their cost depends only on the alphabet size. For a single-attribute query on a single field, the query response time depends on the number of extracted packets, which determines how many times the select operation is performed. In practice, however, query results are usually obtained by combining the results of multiple subqueries. For example, a query on SRCIP with an exact value requires retrieving the index files corresponding to the four bytes of SRCIP and merging the results, which incurs extra time overhead that grows with the number of packets. Figure 11a indicates that IndexWM takes less retrieval time than FastBit-WAH when a small number of packets are extracted. However, as shown in Figure 12a, the performance of IndexWM decreases significantly if the number of extracted packets increases dramatically; in this case, FastBit-WAH performs better than IndexWM. To avoid this, the timestamp is one of the essential query attributes in our implementation, serving to reduce the retrieval scope. Thus, IndexWM is a good solution for retrieving specific packets within a limited retrieval scope.
PcapIndex [12] is an efficient tool for packet indexing. Since the performance comparison results of WAH and COMPAX are provided in Reference [12], we can make an indirect comparison with pcapIndex. For storage consumption, pcapIndex is up to about 50% smaller than FastBit, and our IndexWM is about 60% to 75% smaller than FastBit. In terms of the time consumption for indexing, pcapIndex requires about 5% less time than FastBit, whereas IndexWM requires about 68% to 76% less time than FastBit. The query response time depends on the specific query, but IndexWM shows a remarkable performance improvement for retrieving a small number of packets compared to FastBit. Considering these three performance metrics, it is clear that our IndexWM also outperforms pcapIndex for packet indexing.
Compared to popular bitmap indexing algorithms, our indexing scheme IndexWM performs better, especially in terms of storage consumption and construction time. It is well known that bitmap indexing achieves high retrieval efficiency because it supports efficient bitwise operations between different bitmap indexes. The proposed IndexWM has comparable or even better performance when extracting a small number of packets. Typically, our system retrieves only a small number of packets within a small timestamp range, so it can achieve high retrieval efficiency in practice. From the above discussion, we conclude that IndexWM is well suited for packet indexing in a network traffic collection and retrieval system.

Conclusions
Nowadays, network attacks are becoming increasingly sophisticated, so they often cannot be detected in real time. It is therefore becoming increasingly necessary to record the full content of network traffic for retrospective analysis, and historical network traffic retrieval is becoming essential in network monitoring, network forensics and many other research fields. To quickly retrieve specific packets from high-volume network traffic, packet indexing has proven to be a highly effective solution, but it brings challenges in terms of storage consumption and index construction time. The wavelet matrix is a compact and efficient data structure for representing a sequence and answering queries on it. In this paper, we propose an efficient indexing scheme, IndexWM, based on the wavelet matrix to achieve short construction time, low storage consumption and high retrieval efficiency. Moreover, we design a packet storage format based on PcapNG that simplifies the extraction of index data from packet trace files and reduces the scope of retrieval by appending timestamp indexes to the end of the file. Based on these optimizations, we implement the network traffic collection and retrieval system. We use different types of network traffic, such as actual network traffic from a research institution and network traffic generated by Spirent, to evaluate the performance of the proposed IndexWM. The results indicate that IndexWM requires about 68% to 76% less time than FastBit-WAH and saves about 60% to 75% of the storage space compared to FastBit-WAH. Moreover, even for large datasets, IndexWM achieves excellent retrieval efficiency when only a small number of packets are extracted. We conclude that the proposed IndexWM performs significantly better than FastBit-WAH and FastBit-SECOMPAX in terms of indexing rate, storage consumption and retrieval efficiency.
In the future, we will select more fields in packets for indexing, enabling fine-grained and flexible queries. The corresponding wavelet matrix is expected to be small in size and has a simple structure because each of these fields has a small alphabet size.