2.1.1.  Perform Feature Extraction
The first step is to read the pcap file in binary mode and extract the link-layer data frames from it. Once the frames have been read into memory, processing continues with the next step. Because the program implemented in this article does not yet support the IPv6 protocol, IPv6 data frames are filtered out in this step.
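As a minimal sketch of this step, the code below parses a capture held in memory. It assumes the classic pcap layout (not pcapng) with microsecond timestamps, Ethernet framing, and no VLAN tags; the function names are hypothetical and are not taken from the program described in this article.

```python
import io
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD


def read_frames(stream):
    """Yield (timestamp, frame_bytes) for each record of a classic pcap stream."""
    global_header = stream.read(24)
    if len(global_header) < 24:
        raise ValueError("truncated pcap global header")
    # The magic number tells us the byte order used by the writer.
    magic = struct.unpack("<I", global_header[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"
    elif magic == 0xD4C3B2A1:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap stream")
    while True:
        record_header = stream.read(16)
        if len(record_header) < 16:
            break  # end of capture
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", record_header
        )
        frame = stream.read(incl_len)
        if len(frame) < incl_len:
            break  # truncated final record
        yield ts_sec + ts_usec / 1e6, frame


def is_ipv6(frame):
    """Check the EtherType field (bytes 12-13 of an untagged Ethernet frame)."""
    return len(frame) >= 14 and struct.unpack("!H", frame[12:14])[0] == ETHERTYPE_IPV6


def extract_frames(stream):
    """Read every frame and drop the IPv6 ones."""
    return [(ts, frame) for ts, frame in read_frames(stream) if not is_ipv6(frame)]


# Demo: an in-memory capture holding one IPv4 frame and one IPv6 frame.
def make_frame(ethertype):
    return b"\xaa" * 6 + b"\xbb" * 6 + struct.pack("!H", ethertype) + b"\x00" * 4


demo_capture = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
for ts, demo_frame in [(1, make_frame(ETHERTYPE_IPV4)), (2, make_frame(ETHERTYPE_IPV6))]:
    demo_capture += struct.pack("<IIII", ts, 0, len(demo_frame), len(demo_frame))
    demo_capture += demo_frame

kept_frames = extract_frames(io.BytesIO(demo_capture))
print(len(kept_frames))  # only the IPv4 frame remains
```

For a file on disk, the same functions could be applied to a handle opened with `open(path, "rb")`.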
The second step is to reassemble the extracted link-layer data frames. Link-layer fragmentation divides a large data packet into several smaller ones, which are transmitted separately. It usually occurs at the link layer, the second layer of network data transmission. Fragmentation is needed because a data packet may exceed the maximum size that the network can transmit. The fragments are recombined in order at the destination to restore the original packet. Link-layer fragmentation not only improves transmission efficiency but also helps prevent data loss: when the network is congested, smaller fragments pass through more easily, which reduces the risk of loss. Several fragments may therefore belong to the same network-layer data packet. In this step, fragments of the same packet are identified from the headers of the link-layer frames and reassembled into network-layer packets; for frames that were not fragmented, the frame header is simply removed. The result of this step is a set of network-layer packets.
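The reassembly logic can be illustrated with a short sketch. The fragment layout used here (a packet key, a byte offset, a more-fragments flag, and a payload) is an assumption modeled on IPv4-style fragmentation, not the header format used by the program in this article, and `reassemble` is a hypothetical name.

```python
from collections import defaultdict


def reassemble(fragments):
    """Rebuild packets from fragments.

    Each fragment is a tuple (key, offset, more_fragments, payload), where
    key identifies the original packet and offset is counted in bytes.
    Returns {key: payload} for packets whose fragments are all present.
    """
    groups = defaultdict(list)
    for key, offset, more_fragments, payload in fragments:
        groups[key].append((offset, more_fragments, payload))

    packets = {}
    for key, parts in groups.items():
        parts.sort(key=lambda part: part[0])  # order fragments by offset
        expected_offset = 0
        contiguous = True
        for offset, _more, payload in parts:
            if offset != expected_offset:  # gap or overlap: cannot rebuild
                contiguous = False
                break
            expected_offset += len(payload)
        last_is_final = not parts[-1][1]  # last fragment must clear the flag
        if contiguous and last_is_final:
            packets[key] = b"".join(part[2] for part in parts)
    return packets


demo_fragments = [
    ("p1", 4, False, b"efgh"),  # second half arrives first
    ("p1", 0, True, b"abcd"),
    ("p2", 0, False, b"xyz"),   # never fragmented
    ("p3", 0, True, b"lost"),   # final fragment missing
]
demo_packets_rebuilt = reassemble(demo_fragments)
print(sorted(demo_packets_rebuilt))
```

An unfragmented packet is handled as the special case of a single fragment at offset 0 with the flag cleared, so both branches described above go through the same function.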
The third step is to aggregate network-layer packets that belong to the same session. Here, a session refers to a series of peer-to-peer communication operations between two network devices, which establish a connection and exchange information to complete a specific task, such as a file transfer or a database query. This article does not extract features from each network-layer packet separately; instead, it aggregates packets into sessions. The reason is that a continuous session may contain an extremely large number of network-layer packets whose information is highly correlated, so extracting features packet by packet often produces overly discrete information. In addition, it is very difficult for the algorithm to recognize packets that belong to the same session but are scattered through the capture, and the time-series information in the dataset is difficult to embed into the algorithm model.
For TCP packets, since TCP is a connection-oriented transport-layer protocol, each connection begins with a three-way handshake and ends with a four-way termination; the packets within one connection are treated as one session. UDP and ICMP have no concept of a connection, so this article uses a timeout mechanism instead. When A and B first exchange a data packet, a timer with a fixed timeout is started for the pair. If another packet is exchanged between A and B before the timeout expires, the timer is reset. If no further packet is exchanged before the timeout expires, all packets from the first one up to the last one before the timeout are considered to belong to the same session. The protocol field in the IP header allows for 256 protocols in theory. However, the other 253 protocol types occur very rarely in most network environments, so this article does not consider them for the time being and filters their packets out directly.
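The timeout rule for UDP and ICMP can be sketched as follows. The packet tuple layout, the function name, and the 5-second timeout in the demo are illustrative assumptions; the article does not state the timeout value it uses.

```python
def aggregate_by_timeout(packets, timeout):
    """Group connectionless packets into sessions with an idle timeout.

    packets: iterable of (timestamp, src, dst, protocol), sorted by timestamp.
    A packet joins the open session of its endpoint pair unless the gap since
    that session's last packet exceeds `timeout`, in which case the old
    session is closed and a new one starts.
    """
    open_sessions = {}  # (endpoint pair, protocol) -> packets of current session
    finished = []
    for packet in packets:
        timestamp, src, dst, protocol = packet
        key = (frozenset((src, dst)), protocol)  # direction-independent
        current = open_sessions.get(key)
        if current is not None and timestamp - current[-1][0] > timeout:
            finished.append(current)  # timer expired: close the old session
            current = None
        if current is None:
            current = []
            open_sessions[key] = current
        current.append(packet)  # appending is what "resets the timer"
    finished.extend(open_sessions.values())
    return finished


demo_packets = [
    (0.0, "A", "B", "UDP"),
    (1.0, "B", "A", "UDP"),    # reply within the timeout: same session
    (10.0, "A", "B", "UDP"),   # gap of 9 s: new session
    (10.5, "A", "C", "ICMP"),  # different pair and protocol
]
demo_sessions = aggregate_by_timeout(demo_packets, timeout=5.0)
print([len(session) for session in demo_sessions])
```

Keying on an unordered endpoint pair means that packets in both directions land in the same session, which matches the description of A and B exchanging packets.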
2.1.2.  Feature Design and Selection
Feature extraction refers to extracting meaningful features from raw data for further analysis and processing. Feature extraction is often a preprocessing step in machine learning algorithms, aiming to select useful information from raw data to make it easier for the algorithm to identify patterns in the input data. In order to comprehensively and accurately describe network traffic features, this article designs five categories of features based on the extracted binary session data, as shown in Table 1.
Here, the sender of the first data packet in each session is defined as the client, and the receiver is defined as the server.
The first category of features is the identity features of the session. There are a total of five identity features, which are the network addresses and communication protocol types of the two network devices in each session.
The second category of features is transmission features. There are a total of 38 transmission features, which aim to describe the statistical characteristics of data packet transmission within each session.
The third category of features is TCP features. There are a total of 27 TCP features, which aim to describe the statistical characteristics of the fields contained in TCP sessions. For non-TCP sessions, these features are set to 0.
The fourth category of features is frequency features. There are a total of 19 frequency features, which aim to describe the frequency information of data packet transmission within each session.
The fifth category of features is duplicate features. There are a total of 19 duplicate features, which aim to describe the statistical characteristics of multiple sessions within a period of time.
There are a total of 108 features, which describe the state information of the sessions from various perspectives. Some of the features are shown in Table 2. Some of these features are clearly redundant or correlated. However, on the one hand, this article holds that providing direct information to the detection algorithm is a better choice than leaving the algorithm to dig hidden information out of the data itself. On the other hand, this article also performs feature selection to identify the best subset of features.
In addition, each session has a label feature, which is the category of the session. If neither manual labels nor labeling rules are available, this feature is empty.
The dataset constructed in this article has multiple types of features, and some of the features do not follow a normal distribution. Therefore, a combination of filter and wrapper methods is used for feature selection.
First, features that cannot be used for model training are removed, namely the four categorical features (client_ip, client_port, server_ip, server_port) and the two timestamp features (start_time, end_time). The meanings of these features are already reflected in the session aggregation and the sequence of sessions.
For categorical features, such as protocol, service, and state, one-hot encoding is performed, since their values have no inherent magnitude or order and the number of distinct values is not large. This means adding one binary feature for each distinct value of the variable.
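One-hot encoding can be illustrated with a small sketch. It assumes sessions are stored as dictionaries; the `one_hot` helper and the sample values are hypothetical.

```python
def one_hot(rows, column):
    """Replace a categorical column with one binary column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {name: value for name, value in row.items() if name != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


demo_rows = [
    {"protocol": "tcp", "packet_count": 10},
    {"protocol": "udp", "packet_count": 5},
    {"protocol": "tcp", "packet_count": 7},
]
demo_encoded = one_hot(demo_rows, "protocol")
print(demo_encoded[0])
```

Each distinct value adds one column, which is why the encoded feature count grows with the number of categories.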
When there are a large number of categories, the one-hot encoding vector can become very large, causing memory and computational difficulties. Therefore, in this experiment, a recursive feature elimination process is designed to filter out redundant, unrelated, and low-information features to improve model efficiency and performance.
The categorical label feature, type, is converted into 0 for normal sessions and 1 for abnormal sessions.
All other numerical features are standardized. Then, following the filter method, the variance of each feature is calculated, and features whose variance falls below a threshold are removed. This yields 121 features for selection.
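A minimal sketch of the variance filter follows. It assumes that scaling means rescaling each feature to the range [0, 1]; this is an assumption, chosen because z-score standardization would give every non-constant feature unit variance and leave nothing for a variance threshold to separate. The threshold of 0.01 and the feature names are likewise hypothetical.

```python
from statistics import pvariance


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def variance_filter(columns, threshold):
    """Keep the scaled columns whose population variance is at least `threshold`.

    columns: {feature_name: list of numbers}
    """
    kept = {}
    for name, values in columns.items():
        scaled = min_max_scale(values)
        if pvariance(scaled) >= threshold:
            kept[name] = scaled
    return kept


demo_columns = {
    "constant": [3, 3, 3, 3],      # variance 0: carries no information
    "rare_flag": [0] * 99 + [1],   # almost constant
    "spread": [0, 1, 0, 1],
}
demo_kept = variance_filter(demo_columns, threshold=0.01)
print(list(demo_kept))
```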
Finally, using the wrapper method, all numerical features, except type, are used as the initial feature set, with type as the label feature. Machine learning algorithms, such as stochastic gradient descent classification and XGBoost, are used, with the F1-score as the evaluation metric. Recursive feature elimination with cross validation is performed on the feature set, and the trend of the F1-score is observed under different feature set sizes to select the best feature set.
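For reference, the F1-score used as the evaluation metric is the harmonic mean of precision and recall. The sketch below computes it for binary labels, assuming that the abnormal class (1) is the positive class; the function name and sample labels are illustrative.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        return 0.0  # no correct positive prediction: define F1 as 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


demo_true = [1, 1, 0, 0, 1]
demo_pred = [1, 0, 0, 1, 1]
demo_f1 = f1_score(demo_true, demo_pred)
print(round(demo_f1, 3))
```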
Here, recursive feature elimination is a method of evaluating the importance of features to the model by recursively deleting features. The specific steps are shown in Algorithm 1.
          
| Algorithm 1: Recursive feature elimination algorithm |
| --- |
| Input: classification data set D with n features; a basic classification model (such as random forest or logistic regression); the number of features k (k < n) at which iteration stops; the number of features, step, eliminated in each iteration. |
| 1. Initialize the feature set F with all n features of D. |
| 2. Train the basic model: use the feature set F of the data set D to train the basic model. |
| 3. Calculate feature importance: for each feature in F, calculate its impact on the model's predictions according to the training results. |
| 4. Recursively delete features: delete the step least important features from F. |
| 5. Repeat steps 2 to 4 until the required number of features is reached, i.e., \|F\| = k. |
| 6. Preserve the selected features: use the feature set F to train the final model, and output the feature subset F and the model. |
| Output: a feature subset composed of k features. |
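The loop of Algorithm 1 can be sketched in a few lines. To keep the example self-contained, the "basic model" is replaced by a toy importance measure (the absolute gap between the class means of each feature); this stand-in, the function names, and the sample data are assumptions and do not correspond to the models used in this article.

```python
from statistics import mean


def centroid_importance(rows, labels, features):
    """Toy stand-in for training a model and scoring its features."""
    scores = {}
    for feature in features:
        positives = [row[feature] for row, y in zip(rows, labels) if y == 1]
        negatives = [row[feature] for row, y in zip(rows, labels) if y == 0]
        scores[feature] = abs(mean(positives) - mean(negatives))
    return scores


def recursive_feature_elimination(rows, labels, features, k, step,
                                  importance=centroid_importance):
    """Drop the `step` least important features per round until k remain."""
    selected = list(features)                    # step 1: F = all features
    while len(selected) > k:                     # step 5: stop when |F| = k
        scores = importance(rows, labels, selected)   # steps 2 and 3
        n_drop = min(step, len(selected) - k)    # never go below k features
        ranked = sorted(selected, key=lambda f: scores[f])
        for feature in ranked[:n_drop]:          # step 4: delete the weakest
            selected.remove(feature)
    return selected                              # step 6: final subset F


demo_data = [
    {"a": 0.0, "b": 1.0, "c": 0.2},
    {"a": 0.1, "b": 0.0, "c": 0.1},
    {"a": 1.0, "b": 1.0, "c": 0.4},
    {"a": 0.9, "b": 0.0, "c": 0.5},
]
demo_labels = [0, 0, 1, 1]
demo_top1 = recursive_feature_elimination(demo_data, demo_labels, ["a", "b", "c"], k=1, step=1)
demo_top2 = recursive_feature_elimination(demo_data, demo_labels, ["a", "b", "c"], k=2, step=2)
print(demo_top1, demo_top2)
```

Passing a different `importance` callable, for example one that trains a real classifier and returns its feature weights, would change the model without changing the elimination loop.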
In order to reduce the impact of randomness, K-fold cross validation is used: the dataset is divided into K parts, and in each round K-1 parts are used for training and the remaining part is used for testing. After K rounds of training, the test results are averaged.
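The K-fold procedure can be sketched as follows. The helper names are hypothetical, and the folds are contiguous for simplicity; in practice the samples would usually be shuffled or stratified before splitting.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, train_and_score):
    """Average the score over k rounds.

    train_and_score(train_idx, test_idx) trains on the first index list,
    evaluates on the second, and returns a number such as the F1-score.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # The other K-1 folds together form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


demo_folds = k_fold_indices(10, 3)
# Dummy scorer: reports the training-set size, to show the K-1 / 1 split.
demo_mean_train_size = cross_validate(10, 5, lambda train, test: len(train))
print(demo_folds, demo_mean_train_size)
```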