Fine-Grained High-Utility Dynamic Fingerprinting Extraction for Network Trafﬁc Analysis

Abstract: Previous network feature extraction methods used for network anomaly detection suffer from several problems: they cannot extract features from raw network traffic, they extract only coarse-grained features, and they depend heavily on manual analysis. To solve these problems, this paper proposes a fine-grained, highly practical dynamic application fingerprint extraction method. A fine-grained high-utility dynamic fingerprinting (Huf) algorithm builds a Huf-Tree based on the N-gram model (every fixed-length-n substring of a larger string) and combines it with network traffic segmentation by IP address transition (IAT) to achieve dynamic application fingerprint extraction. The utility of each fingerprint is then calculated to obtain the more valuable fingerprints, achieving fine-grained and efficient flow feature extraction and removing the dependence on manual analysis. The experimental results show that the Huf algorithm realizes dynamic application fingerprint extraction and solves the existing problems.


Introduction
Anomaly detection is an important network security method. It detects anomalies by extracting network features, establishing a behavioral baseline model, and calculating the offset of each sample. A real network environment generates huge volumes of raw traffic containing large numbers of data packets. These packets often carry valuable information and can be analyzed to determine what the terminal is doing. However, most previous anomaly detection methods can only analyze features generated by specific systems and cannot extract features directly from raw network traffic.
A feature that uniquely marks and identifies a particular network behavior is defined as an application fingerprint [1]. The fingerprint is a relatively broad concept in network security: any feature that distinguishes an application from others and can identify it may serve as a fingerprint. Application fingerprints are needed, for example, to identify smartphone applications; Taylor et al. [2] use them to identify information about smartphone applications.
From [3][4][5][6][7][8][9][10][11], it can be seen that the application fingerprint can be implemented in various ways, its purpose being to extract features that uniquely mark the communication behavior of an application. The target object of fingerprint extraction can be TCP messages, HTTPS encrypted packets, log events, application protocol data units, and others. The elements of a fingerprint can consist of message types, statistical data, or other features, and application fingerprints can be expressed as sequences, vectors, state graphs, or logical expressions. The extracted application fingerprints are mainly used to process encrypted traffic, and when network behavior changes, the data must be obtained in time. To meet this requirement for dynamic processing, the proposed algorithm uses a tree based on the N-gram model to save candidate sets and then extracts the application fingerprint from the structure of the tree.

Application Fingerprints with Typical Characteristics Need to Be Extracted
Network communication is conducted according to the corresponding protocol, which means that most application fingerprints share common structures, such as the "three-way handshake" with which almost all application communication begins. If features are extracted based on usage frequency alone, the result will be universal and useless for subsequent traffic classification. Therefore, the unique features that set each application apart from the others must be extracted.

Solutions
In response to the above problems, the solution in this paper is as follows: preprocess the original network traffic obtained directly from the network, focusing on finding "breakpoints" when dividing it; these "breakpoints" no longer depend only on time but are determined by the corresponding IP addresses, protocol types, and other related attributes, so that a complete, fine-grained application fingerprint can be extracted. The proposed algorithm handles dynamic processing by using a tree based on the N-gram model to save the candidate set, and then extracts the application fingerprint from the structure of this tree. The last step is to extract the unique features of each application that distinguish it from other applications.
The Huf algorithm proposed in this paper can solve the problem. The Huf algorithm is a fine-grained and efficient dynamic fingerprint extraction method. Its structure is shown in Figure 1. It can be seen that the process of extracting application fingerprints can be divided into two parts: network traffic preprocessing and frequent set mining to obtain application fingerprints. This paper performs data preprocessing in Section 3 and performs fine-grained and efficient dynamic fingerprint extraction in Section 4.

Data Processing
It is difficult to deal with unstructured data of huge volume and no typical characteristics. Network traffic is a continuously generated data flow; before extracting features, the necessary step is to divide the data flow into several small fragments and process them further. This section first analyzes the relevant content of the TLS protocol, takes the communication messages as the analysis object, marks each message, and describes the communication process between client and server. It then analyzes the characteristics of the domain names of various application categories and extracts representative domain names, providing relevant domain name data for exception analysis and further manual analysis of exceptions. Section 4 introduces the process of fine-grained, efficient dynamic fingerprint extraction. Frequent item set mining is an algorithm for discovering relationships between independent items. An improved frequent item set mining algorithm applied to fingerprint extraction is proposed, which has three characteristics: first, the method can process dynamic data, dynamically extracting feature fingerprints from continuously generated network data by finding the "breakpoints" that divide network traffic and extracting the entire application fingerprint; second, in data analysis, the sequential dependence of messages is the key content of application behavior, and the application fingerprint is established through the N-gram model; third, the utility of a message is used as the screening criterion.

SSL/TLS Encryption Protocol and Communication Process
This section analyzes TLS 1.2, a transport layer security protocol that works over TCP. TLS 1.2 consists of two layers: the TLS Record Protocol and the TLS Handshake Protocol. The TLS Record Protocol works beneath the TLS Handshake Protocol and provides basic functionality. It defines four specific protocol formats: the Handshake Protocol, the Alert Protocol, the Change Cipher Spec Protocol, and the Application Data Protocol. The upper-layer TLS Handshake Protocol is primarily used for session negotiation.
The TLS Handshake Protocol sits above the TLS Record Protocol. When the client and server initiate communication for the first time, the two parties agree on the protocol version, encryption algorithm, digital certificates, etc., completing the following functions. First, Hello messages are exchanged to agree on algorithms, random numbers, and other parameters, and to check for session resumption. Second, the two sides exchange the necessary key parameters. Third, digital certificates and related parameters are exchanged so that the client and server can authenticate each other. Fourth, the master key is generated from the pre-master key and the exchanged random numbers. Finally, it must be ensured that the client and server can verify each other's parameters and that these parameters cannot be tampered with by an attacker.
When the client and server need to use TLS Protocols to establish communication, they must follow the TLS Handshake Protocol to establish communication, which can be divided into three phases.

Establish Communication
The client sends a ClientHello Message to the server. After receiving the request, the server sends a ServerHello Message. Both the ClientHello Message and ServerHello Message contain the information necessary to establish communication.

Exchange Keys and Certificate
The two parties need to exchange keys and certificates. There are four message types related to keys: Server Certificate, Server Key Exchange, Client Certificate, and Client Key Exchange. In addition, a Change Cipher Spec message and a Certificate Request message complete some functions. Figure 2 describes the process in detail.

Data Transmission
In this phase, the data that the user needs to transfer are transferred in the application data format.
The communication process shown in Figure 2 is the first complete communication process between the client and server. The actual use of encrypted communication is a continuous process. Exchanging keys and certificates for each communication consumes resources and is unnecessary. The server will retain the session ID of communication with the client for a while and wait for the client to continue using the previous session. If the session ID of the client cannot be found in the existing list of the server, the server initiates a new session and performs a complete handshake. Figure 3 shows a simplified TLS communication process.
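The session-resumption decision described above can be sketched in a few lines of Python. The cache API, key type, and TTL value here are illustrative assumptions, not details from the paper; the sketch only shows how a cached session ID leads to an abbreviated handshake while an unknown ID forces a full one.

```python
import time

class SessionCache:
    """Hypothetical server-side cache of recent TLS session IDs."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._cache = {}  # session_id -> expiry timestamp

    def store(self, session_id):
        # Retain the session ID for a while, as the server does.
        self._cache[session_id] = time.time() + self.ttl

    def handshake_type(self, client_session_id):
        """Return 'abbreviated' if the client's session ID is still cached,
        otherwise 'full' (new session: complete key/certificate exchange)."""
        expiry = self._cache.get(client_session_id)
        if expiry is not None and expiry > time.time():
            return "abbreviated"
        return "full"

cache = SessionCache()
cache.store("session-01")
cache.handshake_type("session-01")  # cached -> abbreviated handshake
cache.handshake_type("unknown-id")  # not cached -> full handshake
```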


As shown in Figures 2 and 3, different messages are required at different stages of communication establishment. Each message has a specific format and performs certain functions. In practical applications, programmers generally implement the communication between client and server according to protocol requirements, but also make some adjustments according to function requirements, for example, omitting some unnecessary messages or combining multiple packets to improve communication efficiency. The diversity of messages in communication and the flexibility in practice make it possible to extract fingerprint features. Table 1 lists all message types and functions of the TLS Handshake Protocol.
The notations in the right-most column of Table 1 are used to describe the communication process. The research object of this paper is encrypted traffic, but in actual network data, common unencrypted protocols also complete some less important communication between client and server. The entire network data stream is therefore a mixture of encrypted and non-encrypted protocols. Figure 4 shows the hierarchy of common protocols in the network, and these protocols are labeled for description as shown in Table 2. This section has described the communication process between the client and server by analyzing the TLS protocol contents and the communication messages. After marking each message, one can prepare for fingerprint extraction.

Network Traffic Segment
Packets constitute the data in the network, and each packet has a fixed attribute. Figure 5 shows a segment of network traffic captured in the actual QQ communication process, which also contains some data packets from other applications.
In the actual process of obtaining traffic, the traffic packets sent and received by the terminal must be a mixture of multiple applications, which makes it possible to divide traffic based on IP address transition (IAT).
The network traffic obtained here contains a large number of encrypted packets, from which only six attributes can be directly obtained: time, source, destination, protocol, length, and info. Any packet is therefore defined as p_i = [t, IP_src, IP_dst, prt, l, Info]. Based on these directly obtainable attributes, this section analyzes the sequence of data packets to obtain encrypted traffic fingerprints.
Data packets generated when multiple applications installed on a terminal exchange data with the server are sent sequentially. Within a time interval T = [t start , t end ], define the data flow T f = (p 1 , p 2 , · · · , p N ) generated (N is the number of packets obtained). Any packet contains six attributes: p i = [t, IP src , IP dst , prt, l, In f o].
The object to be dealt with is the data flow defined above. Given a data flow T f of mixed data from multiple applications, the proposed method should accomplish two tasks:

•	Process the data stream T_f and divide the packet sequence seq_T = {p_1, p_2, ⋯, p_N} into several small data stream segments according to the relevant attributes and message analysis: S_{T_f} = {⟨p_1, p_2, ⋯, p_i⟩, ⟨p_{i+1}, p_{i+2}, ⋯, p_j⟩, ⋯, ⟨p_{j+1}, p_{j+2}, ⋯, p_N⟩}.
•	Based on the data stream fragments, dynamically extract fine-grained fingerprint features; the specific method is described in Section 4. The extracted application fingerprint is a message sequence containing encrypted messages: seq_M = ⟨msg_1, msg_2, ⋯, msg_M⟩.
This paper proposes a new method for the division of the data flow T_f, which is no longer based on the simple time interval but on the six basic attributes of packets, finding the "breakpoints" of the network traffic to achieve accurate division.
In the process of sending and receiving data packets, network devices complete an "action" in a time interval, similar to an operating system in which the CPU is divided into several time slices and each time slice processes one process. This "action" is equivalent to the application fingerprint that can be extracted. Therefore, in dividing data flows, the determination of "breakpoints" is based on this characteristic of how network devices process data. The "breakpoint" is determined according to the partitioning performance of each attribute of network traffic, that is, the optimal partitioning attribute. Among the six basic attributes shared by network packets, three can be used for network traffic partitioning: time, source IP address, and destination IP address. In this paper, network traffic is considered as a whole to determine the "breakpoints", and the partitioning performance of these three attributes is demonstrated from the perspective of information entropy in information theory.
This paper introduces information entropy to measure the information value contained in a sample. The basic idea is that if the elements in a sample follow a specific rule, they can be determined statistically and their changes predicted in advance, which means the information value of the sample is low and its information entropy is small. Conversely, if the elements of the sample are disordered and unpredictable, the amount of information it contains is large and its information entropy is naturally higher. Equation (1) is the definition of information entropy:

H(X) = -∑_i p(x_i) log p(x_i). (1)

The network traffic division process partitions the continuously generated packets, according to their attributes, into fragments with smaller information entropy. Each fragment then contains less information entropy, meaning its elements are regular and will repeat. This matches the nature of an application fingerprint: its information entropy is small, it repeats across the divided traffic, and it is representative, so it can be used as an application fingerprint to identify traffic. Information gain is used as the measurement standard when selecting the partitioning attribute. Information gain is the change in entropy before and after the dataset is partitioned, and (2) is its definition:

InfoG = H(X) − ∑_i (|X_i| / |X|) H(X)_i, (2)

where, describing the partitioning process as a tree structure, H(X) represents the entropy of the parent node's attribute and H(X)_i represents the entropy of the attributes of the several child nodes. The larger the information gain InfoG is, the more information is contained before classification: the more impure the parent sample elements are, the smaller the information entropy after classification, and the purer the sample elements of the child nodes.
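The entropy and information-gain definitions above can be computed directly from value frequencies. The following is a minimal Python sketch of Equations (1) and (2); the function names and the base-2 logarithm are illustrative choices, not from the paper.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) = -sum_i p_i * log2(p_i) over value frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(parent, partitions):
    """Information gain: parent entropy minus the size-weighted
    entropy of the child partitions (Equation (2))."""
    n = len(parent)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - weighted

labels = ["a", "a", "b", "b"]
entropy(labels)                               # maximally mixed 2-class sample
info_gain(labels, [["a", "a"], ["b", "b"]])   # a perfect split: pure children
```

A perfect split leaves each child pure (zero entropy), so the gain equals the parent entropy, which is the behavior the partition-attribute comparison relies on.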
Each sample is more representative of the applied fingerprint, so the division based on IP address is better than the division based on time.
In network traffic, packets are continuous, discrete samples: each packet carries the attribute pair (∆t_i, {IP_src, IP_dst}), where ∆t_i is the time difference between adjacent packets and {IP_src, IP_dst} is the unordered pair of IP addresses at the two ends of the communication, analyzed uniformly regardless of order. Therefore, according to the method of time-interval division, the information gain of the obtained samples is

InfoG(∆T) = H(∆T) − ∑_i (|T_i| / |T_f|) H(∆T)_i, (3)

where H(∆T), the information entropy of the network traffic time intervals, is shown in (4):

H(∆T) = −∑_i p(∆t_i) log p(∆t_i), (4)

and H({IP_src, IP_dst})_i, the information entropy in the ith partition, can be calculated as

H({IP_src, IP_dst})_i = −∑_j p({IP_src, IP_dst}_j) log p({IP_src, IP_dst}_j). (5)

The other division method is based on the IP address pair {IP_src, IP_dst}; its information gain is calculated analogously:

InfoG({IP_src, IP_dst}) = H({IP_src, IP_dst}) − ∑_i (|T_i| / |T_f|) H({IP_src, IP_dst})_i. (6)

In the actual division process, the data flow is simply divided into small fragments; the information entropy analysis only proves that the division by IP address (IAT) proposed in this paper is more complete and accurate than division by time interval. Specific results of the two division methods on actual network communication data are calculated in the experimental part according to Formulas (3) and (6).
The following defines IAT. Given a data stream seq_T = ⟨p_1, p_2, ⋯, p_i, p_j, ⋯, p_N⟩, any packet p_i has the time attribute and IP address pair (∆t_i, {IP_src, IP_dst}). If the IP address pair {IP_src, IP_dst} changes between two adjacent packets p_i and p_j, the data stream is divided into the two segments ⟨p_1, p_2, ⋯, p_i⟩ and ⟨p_j, p_{j+1}, ⋯, p_N⟩. An exchange of source and destination addresses is not considered a change of IP addresses.
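The IAT definition above translates directly into a single pass over the packet sequence. The following Python sketch assumes a simple record with the six observable attributes; the `Packet` class and field names are illustrative, not from the paper. Treating the address pair as an unordered set implements the rule that swapping source and destination is not a transition.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    # The six directly observable attributes: time, source IP,
    # destination IP, protocol, length, info.
    t: float
    ip_src: str
    ip_dst: str
    prt: str
    length: int
    info: str

def iat_segments(packets):
    """Split a packet sequence whenever the unordered IP address pair
    {IP_src, IP_dst} changes (IP address transition, IAT)."""
    segments, current = [], []
    prev_pair = None
    for p in packets:
        pair = frozenset((p.ip_src, p.ip_dst))  # unordered: a<->b == b<->a
        if prev_pair is not None and pair != prev_pair:
            segments.append(current)
            current = []
        current.append(p)
        prev_pair = pair
    if current:
        segments.append(current)
    return segments
```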
A segment of network traffic obtained by IP address division is represented with the N-gram model based on message types. The segment is first expressed as a sequence, with each message represented by a label defined above. For example, the 42 packets in Figure 3 can be divided into 17 segments, as shown in Figure 6.
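Once a segment is expressed as a sequence of message labels, the N-gram model enumerates its fixed-length subsequences. A minimal sketch (the label strings below are hypothetical handshake abbreviations, not the paper's actual notation from Table 1):

```python
def ngrams(labels, n):
    """All length-n contiguous subsequences of a message-label sequence
    (the N-gram model over message types)."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]

# Hypothetical label sequence for one traffic segment.
segment = ["CH", "SH", "CKE", "CCS", "F"]
ngrams(segment, 2)  # the bigrams that become candidate fingerprint elements
```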

DNS Domain Text Clustering
The Domain Name System (DNS) is a database that maps IP addresses to domain names. The DNS server performs this function. Among the obtained network data packets, DNS messages are a few unencrypted messages with certain meanings. The domain names of DNS packets are extracted for text analysis, which can realize automatic network traffic analysis and largely replace some manual analysis. In addition, the corresponding URL can be quickly obtained in traffic classification and anomaly detection, which is convenient for further manual analysis and control in the massive network data flow. Part of the name is derived from the actual meaning of natural language, and part can be used as a combination of letters to represent a certain meaning. By mining domain name information, one can extract effective text information from the domain name, which is of great help to the analysis of encrypted traffic.
This section analyzes the characteristics of the domain names of each application category and extracts representative domain names to provide domain name data for exception analysis and further manual exception analysis. Domain name analysis uses incremental clustering to process dynamic data streams, including RS extraction, related-word set generation, feature vector generation, and domain name clustering. The traffic segments themselves are labeled with the notations defined in Tables 1 and 2, as shown in Table 3.

Figure 7 shows the process.
Given a domain name "str_1.str_2.⋯.str_n", it is split at "." to extract the strings of "str_1.str_2.⋯.str_n" and generate a string set, defined as one item in the domain name correlation set RS.
Network data flows usually generate multiple DNS messages, so RS consists of multiple components, RS = {RS_1, RS_2, ⋯, RS_i, ⋯}, and because new DNS messages are generated continuously, RS continues to expand. Based on RS, a DNS graph is built whose vertices and edges capture the relationships between domain names: vertices represent the strings in domain names, and edges represent the relationships between them. Edges are added according to the domain name correlation set RS, directed from the higher-level domain name to the lower-level domain name, thus forming the directed DNS graph.
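The RS item extraction and directed-graph construction just described can be sketched as follows. This is a minimal reading of the text, under the assumption that "higher-level to lower-level" means an edge from each label to the label on its left (e.g. com → example → img); the function names are illustrative.

```python
def domain_item(domain):
    """Split a domain name at '.' into its component strings (one RS item)."""
    return domain.split(".")

def build_dns_graph(domains):
    """Directed DNS graph: vertices are the strings in domain names, edges
    run from each higher-level label to the adjacent lower-level label."""
    vertices, edges = set(), set()
    for domain in domains:
        parts = domain_item(domain)       # e.g. ["img", "example", "com"]
        vertices.update(parts)
        # Pair each label with the one to its right (its higher level):
        # edge direction is higher-level -> lower-level.
        for higher, lower in zip(parts[1:], parts[:-1]):
            edges.add((higher, lower))
    return vertices, edges

vertices, edges = build_dns_graph(["img.example.com", "api.example.com"])
# Shared suffixes collapse into shared vertices, so related domains
# concentrate edges around common labels like "example".
```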
Different from natural language, a domain name can be any combination of characters, so as network traffic is continuously produced, new domain names are continuously generated, which means new nodes continue to appear. To address this, and to reduce the computational complexity to a certain extent, a large number of DNS messages must be obtained and node information extracted from them. After the DNS graph is established, the importance of each node is used as the feature vector of DNS messages.
The PageRank algorithm was first used in the Google search engine: in order to rank web pages accurately, the number and quality of links to a page are used to measure its importance. The directed graph G = (V, E) constructed from DNS messages is similar to web-page ranking in search engines, the goal being to find the important nodes. Assume that n is the number of nodes and that V and E represent the vertices and edges; the PageRank value G(i) of vertex i is shown in (7):

G(i) = ∑_{j ∈ In(i)} G(j) / Out(j), (7)

where In(i) is the set of vertices linking to i and Out(j) is the out-degree of vertex j. The PageRank values of all nodes are collected in the vector G, as shown in (8):

G = [G(1), G(2), ⋯, G(n)]^T. (8)

Assume that A is the transition matrix of the graph, as shown in (9):

A_ij = 1/Out(j) if (j, i) ∈ E, and 0 otherwise. (9)

Then, (7) can be converted into (10):

G = AG. (10)

In reality, the process of users surfing the Internet is relatively complex, and the PageRank algorithm assumes it is a random process. In order for A to meet this condition, Equation (10) is improved as shown in (11):

A' = fA + ((1 − f)/n) E, (11)

where E is the all-ones matrix and f is the attenuation factor, so (11) can be converted into

G = fAG + ((1 − f)/n) 1, (12)

and therefore (7) can be expressed as

G(i) = f ∑_{j ∈ In(i)} G(j)/Out(j) + (1 − f)/n. (13)

The calculation of PageRank is a cyclic process: given an initial value, the PageRank values of the nodes are computed iteratively, and the calculation stops when the change falls below a threshold. Calculating the PageRank value of domain names extracts the important ones; the appearance of these important domain names often has a specific meaning, and new domain names, combined with manual analysis, can play an important role in anomaly detection.
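The iterative computation of Equation (13) is the standard power iteration. A minimal sketch, assuming nodes are numbered 0..n−1 and edges are (source, destination) pairs; the function signature, tolerance, and default attenuation factor are illustrative choices:

```python
def pagerank(n, edges, f=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank per Equation (13): each node receives
    f * (share of each in-neighbor's score) plus the uniform (1-f)/n term.
    Iterates until the total change falls below tol."""
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    g = [1.0 / n] * n                       # uniform initial value
    for _ in range(max_iter):
        new = [(1.0 - f) / n] * n           # the (1-f)/n teleport term
        for src, dst in edges:
            new[dst] += f * g[src] / out_deg[src]
        if sum(abs(a - b) for a, b in zip(new, g)) < tol:
            g = new
            break
        g = new
    return g
```

On a symmetric two-node cycle both nodes end with equal scores, which is a quick sanity check on the implementation.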
The generation of network DNS messages is continuous. In order to process newly generated messages in real time, the single-pass incremental clustering algorithm is adopted. First, the feature vectors of each class are averaged to form the cluster centers; whenever a new message is generated, the similarity to each center is used as the judgment criterion, as shown in (14):

sim(M, n) = (M · n) / (‖M‖ ‖n‖), (14)

where M is the clustering center vector and n is the feature vector of the DNS message. The classification of each DNS message is determined through this similarity calculation. After a certain period, once new DNS messages have been generated, the clustering centers are recalculated according to the clustering results.
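The single-pass scheme above can be sketched as follows, using the cosine similarity of Equation (14) as the judgment criterion. The threshold value and the running-mean center update are illustrative assumptions; the paper only specifies that similarity decides membership and that centers are recomputed from the clustering results.

```python
import math

def cosine(u, v):
    """Equation (14): cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold=0.8):
    """Single-pass incremental clustering: each new vector joins the most
    similar existing cluster if the similarity reaches the threshold,
    otherwise it starts a new cluster. Centers are running means."""
    centers, members = [], []
    for v in vectors:
        best, best_sim = None, threshold
        for i, c in enumerate(centers):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centers.append(list(v))
            members.append([v])
        else:
            members[best].append(v)
            k = len(members[best])
            centers[best] = [(c * (k - 1) + x) / k
                             for c, x in zip(centers[best], v)]
    return members
```

Because each vector is examined exactly once, the method satisfies the requirement that the continuously generated stream be processed in a single pass.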

Fine-Grained High-Utility Dynamic Fingerprinting Extraction
According to the preprocessing of the network traffic in Section 3, several segments of the network traffic are obtained. The N-gram model is used to represent the results after the partition, and the label is used to represent the results, as shown in Table 3. In this way, the partition results can retain the complete application fingerprint to the maximum extent. Network behavior is a constantly changing data flow. To obtain the changes in new network behavior in time, it is necessary to obtain application fingerprints dynamically online, which requires that network traffic can only be processed once.
In Section 4, the problem to be solved is how to dynamically obtain application fingerprints with typical features. Based on a frequent set mining algorithm, a fine-grained, efficient dynamic application fingerprint extraction method, Huf, is proposed. The Huf algorithm focuses on two parts: one is to calculate the utility of application fingerprints, and the other is to build an N-gram tree structure and mine the application fingerprints with high utility values from it.

The Utility of Fingerprinting
Capturing the most typical key exchange in the communication process allows the communication to be described more accurately. TCP three-way handshake messages account for more than 80% of the total number of packets and are difficult to analyze except for differences in length, while TLS-related messages make up only about 10%. Yet every time encrypted communication is established, the client and server must complete a full key and certificate exchange, which has a great impact on the analysis of communication behavior. That is, the less numerous message types have strong behavior-analysis capability. Therefore, the ability of such messages to differentiate data flows is defined as utility, which can be understood as the value of the message itself. The situation is similar to natural language processing: some words occur very frequently in natural languages, such as "de", "le", and "me", but are not helpful for the subsequent algorithm.
Here, drawing on how natural language processing algorithms evaluate the importance of vocabulary, utility is defined using the Term Frequency-Inverse Document Frequency (tf-idf) algorithm.
In the sequence representation s = ⟨p_l, p_m, ⋯, p_n⟩ of an N-gram model, the utility of a message p_m is defined as

u(p_m) = ϕ(n_{p_m}, l) × φ(n_d, n_{p_m,d}).

The number of occurrences of message p_m in sequence s is n_{p_m}, and the length of s is l. The proportion of p_m in sequence s is ϕ(n_{p_m}, l), defined as

ϕ(n_{p_m}, l) = n_{p_m} / l.

The total number of fingerprints is n_d, n_{p_m,d} is the number of fingerprints containing the message p_m, and φ(n_d, n_{p_m,d}) is defined as

φ(n_d, n_{p_m,d}) = log(n_d / n_{p_m,d}).

A large number of the data flow fragments in Table 3 are useless, for example, fragments containing only one packet. Table 3 shows only a small part of the data flow; to illustrate the utility calculation, assume that Table 3 lists all fragments of the data flow. Table 4 describes the utility calculation process. When network traffic is collected for long enough, large-scale fingerprint utilities can be calculated statistically, just as a large enough corpus ensures the accuracy of the tf-idf algorithm in natural language; the data stream fragments correspond directly to the transaction database in frequent item set mining. As an example, the utility of message 23:2 in Table 4 is calculated for the fingerprint with TID = 4. The average utility of a sequence s = ⟨p_l, p_m, ⋯, p_n⟩ is computed from the utilities of its messages and represents the utility of the whole fingerprint:

u_avg(s) = (1/l) ∑_{p_i ∈ s} u(p_i).

The average utility is likewise calculated for the entire application fingerprint with TID = 4.
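The tf-idf-style utility above can be sketched directly from the definitions. This is a minimal reading, assuming the natural logarithm and treating each fingerprint as a list of message labels; the function names are illustrative.

```python
import math

def term_frequency(msg, fingerprint):
    """phi(n_pm, l): proportion of message msg in the fingerprint sequence."""
    return fingerprint.count(msg) / len(fingerprint)

def inverse_fingerprint_frequency(msg, fingerprints):
    """psi(n_d, n_pm_d): log of total fingerprints over the number of
    fingerprints containing msg (the idf part of tf-idf)."""
    n_d = len(fingerprints)
    containing = sum(1 for fp in fingerprints if msg in fp)
    return math.log(n_d / containing) if containing else 0.0

def utility(msg, fingerprint, fingerprints):
    """u(p_m) = tf * idf: messages common in one fingerprint but rare
    across fingerprints score highest."""
    return (term_frequency(msg, fingerprint)
            * inverse_fingerprint_frequency(msg, fingerprints))

def average_utility(fingerprint, fingerprints):
    """Mean utility over all messages: the utility of the whole fingerprint."""
    return (sum(utility(m, fingerprint, fingerprints) for m in fingerprint)
            / len(fingerprint))
```

A message appearing in every fingerprint, like a three-way-handshake packet, gets idf log(1) = 0 and thus zero utility, which is exactly the filtering behavior the utility definition is after.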

Constructing N-gram Huf-Tree of Huf Algorithm
Applications on a terminal device need to send data, so seemingly unrelated, independent data packets are in fact connected. This paper uses an association-rule-based method to discover the relationships between independent packets. The number of data packets is huge, so this cannot be done manually. Therefore, the frequent-set mining method proposed in this paper acquires data packet sequences, that is, the behavior patterns of communication, automatically.
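The N-gram decomposition that underlies the Huf-Tree can be sketched as follows. The packet-type labels are illustrative, not taken from the paper, and `ngrams` is a hypothetical helper name.

```python
def ngrams(seq, n):
    """Every contiguous substring of length n of a sequence (the N-gram model)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# A toy packet-type sequence; recurring 3-grams become candidate fingerprints.
pkts = ["16:1", "16:2", "14:", "16:1", "16:2", "14:"]
grams = ngrams(pkts, 3)
```

Repeated N-grams across many data flow fragments are the candidate communication patterns that the frequent-set mining step then filters.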
Apriori is the first classical frequent item set mining algorithm and laid the foundation for many later classical algorithms. The main process of an Apriori-like algorithm is to find all candidate sets and then remove, by pruning, the candidate sets that do not meet the requirements [23]. The algorithm needs to scan the database many times and generates a large number of candidate sets, which gives it large space and time complexity and greatly affects its efficiency. To improve the process, scan the database only once, and reduce the space and time complexity, many studies have proposed improvements, which can generally be summarized as follows: after scanning the data once, an "intermediate storage" is created to temporarily hold the selected candidate sets, and all subsequent analysis is directed at this intermediate storage. These works differ in the structures used to represent the intermediate storage, such as LQS-Tree [24], EFP-Tree [25], MED-Tree [26], WMFP-SW-Tree [27], IHAUI-Tree [28], MAS-Tree [29], and HUWAS-Tree [30]. These are all variants of the FP-Tree in the FP-Growth algorithm, itself adapted from Apriori. Dynamic frequent episode mining algorithms generally use a tree structure to replace repeated scans of the sample data.
The Apriori algorithm contains two thresholds, which are important indicators of the algorithm and are used to select frequent item sets. One is minimum support, which represents the minimum probability of an item occurring in the transaction database; the other is minimum confidence, which defines the probability of another item occurring in a set of items when one item is present. Association rules use these two indicators to determine the relationship between item sets: when two items appear together with high probability, there is a certain relationship between them.
In general, frequent item set mining is defined as follows. The transaction database D = (I_1, I_2, ..., I_n) represents a set of transactions, each identified by a TID. The support of a frequent item set FI = (X_1, X_2, ..., X_m) in D is the proportion of transactions in D = (I_1, I_2, ..., I_n) that contain X_1, X_2, ..., X_m, that is, P_sup = P(X_1, X_2, ..., X_m). The confidence, used in the subsequent prune operation, is the conditional probability that a transaction contains item set X given that it contains item set Z, that is, P(X | Z). The Apriori algorithm completes the mining of frequent item sets through repeated connection (join) and pruning. The process of the Apriori algorithm is as follows:

1.	Scan the database to generate the candidate 1-item sets and remove those below the minimum support, yielding the frequent 1-item sets.
2.	Start the connection from the 2-item sets: new candidate item sets are connected from the existing frequent item sets. The confidence of each candidate k-item set FIS_k = (FI_k1, FI_k2, ..., FI_kl) is calculated, and item sets that do not meet the confidence requirement are pruned.
3.	Repeat the above steps to generate new item sets until no new frequent item set can be generated, then stop the loop.
One of the most important features of a network data flow is the continuous generation of data packets, which is an incremental process. Much of the existing literature performs feature extraction on static datasets after they are obtained; such methods obviously cannot process and analyze data packets dynamically online. Dynamic analysis can extract new features in time and monitor network data in real time, so the method proposed in this paper extracts features dynamically. Above, the communication process between client and server was extracted by establishing a sequential transaction database of the data flow. This section mainly proposes a method for processing the dynamic data flow by constructing the Huf-Tree. A threshold ε is first set to determine whether to retain a candidate, as defined below.
The minimum threshold ε of average utility determines whether a candidate joins the Huf-Tree and is expressed as a proportion. Assume the threshold ε = 50% is selected and take Table 6 as an example. The target data processed in this paper are network data flows, which can be regarded as an incremental process InP = (SDB_1, SDB_2, ..., SDB_n, ...). The process of constructing the Huf-Tree is shown in Figure 8; it generates the high-utility sequences of application fingerprints and can be divided into four phases:


•	The data are preprocessed and the incremental representation InP = (SDB_1, SDB_2, ..., SDB_n, ...) of the data flow is generated;
•	The initial Huf-Tree is built, taking SDB_1 as the initial database;
•	When a new SDB_n is generated, each item is gradually added to update the Huf-Tree;
•	New application fingerprints are generated at intervals, based on the threshold.
The utility list is used to store the data related to the Huf-Tree. In essence, it is an array used to retrieve fingerprint s =< p 1 , p 2 , · · · , p n >.
The construction process of Huf-Tree is briefly introduced according to Table 7.

1.	According to the application fingerprints generated in the data preprocessing part, each path is added gradually; the first path added is shown in Figure 9.
2.	Each application fingerprint is added step by step. If there are duplicate fingerprints, some paths need to be merged, as shown in Figure 10.
3.	At intervals, according to the preset threshold, the application fingerprints that meet the condition are extracted and the messages are sorted, as shown in Table 8. Based on the transaction database information shown in Table 8, it can be seen from Figure 10d that the structure of the Huf-Tree is already very complex, and this is only the Huf-Tree of a small section of the data stream. As new data flows are continuously processed, the generated Huf-Tree becomes very large, so its structure needs to be merged. Since each node of the Huf-Tree is based on the N-gram model, the order of nodes cannot be exchanged. Therefore, in the prune stage, this paper adopts two steps: removing low-utility nodes and merging paths. The process is shown in Figure 11. To preserve the integrity of an application fingerprint, nodes in the middle of a fingerprint cannot be deleted; deletion proceeds only successively from the root node. Figure 11a shows the result after deleting the node <11:01, 11:02, 11:03, 21:>, with the right-most path merged from the root node into the adjacent path. Nodes on the same path are merged, and differing nodes become children.

Dynamic Fingerprinting Extraction Algorithm Based on Huf-Tree
This section proposes a dynamic application fingerprinting extraction algorithm based on the Huf-Tree. The Huf algorithm consists of three parts: data preprocessing, building the Huf-Tree, and mining high-utility sequences, as shown in Figure 12.


Algorithm 2: Huf
Input: SDB_1, SDB_2, ..., SDB_n, ...; threshold ε
Output: Huf-Tree T_Huf
(1) For SDB_1 = <p_l, p_m, ..., p_n>, generate the utility list L_T;
(2) Create the ROOT node and insert nodes in sequence: ROOT → T_Huf;
(3) Insert each fingerprint P: <p_l, p_m, ..., p_n> → T_Huf;
(4) If a new fingerprint coincides with an existing path, merge it with that path and update L_T;
(5) Delete nodes whose utility is smaller than the threshold and connect their child nodes to ROOT;
(6) Merge the duplicate child nodes of ROOT into path P; differing nodes become child nodes;
(7) Repeat (1)-(6) to insert SDB_2, ..., SDB_n.
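Algorithm 2 can be sketched as a prefix tree over message types. The class names and the exact merge behavior below are illustrative assumptions, since the paper specifies the steps only at a high level.

```python
class HufNode:
    """One message type on a fingerprint path; node order is fixed by the N-gram model."""
    def __init__(self, msg):
        self.msg = msg            # message-type label, e.g. "16:1"
        self.utility = 0.0        # accumulated utility along this path
        self.children = {}        # msg -> HufNode

class HufTree:
    def __init__(self):
        self.root = HufNode("ROOT")

    def insert(self, fingerprint, utilities):
        """Steps (2)-(4): insert a fingerprint; a fingerprint that coincides
        with an existing path is merged into it."""
        node = self.root
        for msg, u in zip(fingerprint, utilities):
            node = node.children.setdefault(msg, HufNode(msg))
            node.utility += u

    def prune(self, threshold):
        """Step (5): delete low-utility children of ROOT and reconnect their
        child nodes to ROOT. Middle nodes are never removed, so deletion
        proceeds only from the root side."""
        for msg in list(self.root.children):
            child = self.root.children[msg]
            if child.utility < threshold:
                del self.root.children[msg]
                for m, sub in child.children.items():
                    # Step (6): merge with a duplicate child of ROOT if present.
                    if m in self.root.children:
                        self.root.children[m].utility += sub.utility
                        self.root.children[m].children.update(sub.children)
                    else:
                        self.root.children[m] = sub
```

Because inserts and prunes touch only the affected paths, new SDB_n batches can be folded in without rescanning earlier data, which is the point of the incremental design.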

Experimental Analysis
This section builds on the problem analysis, data preprocessing, and fingerprint extraction in Sections 2-4 above. The following contents are divided into four parts. The first part is the non-invasive model, which analyzes the situation of user equipment only by capturing network data, without requiring the installation of specific software. The second part is the IAT traffic segment experiment, which performs an information gain analysis of the data flow division. The third part is the fingerprinting integrity experiment, which analyzes whether the acquired fingerprints are complete and accurate. The fourth part is the high-utility fingerprinting extraction experiment.

Non-Invasive Model
Limited by hardware, mobile terminals represented by smartphones seldom install corresponding security software, and users sometimes doubt the monitoring software itself. Moreover, with the popularity of mobile payment, mobile terminals have greater risks. In order to complete the security monitoring of terminal devices without affecting the use of users and without requiring the installation of specific software, this paper proposes a non-invasive model that analyzes the situation of user devices only by capturing network data, as shown in Figure 13.
In the model shown in Figure 13, the monitor is connected to the Internet and to the mobile device. The monitor sends WiFi signals to enable the mobile device to communicate with the server, and it has a packet capture function, so it can obtain the communication data between the mobile terminal and the server. In practical application, the monitor is generally an industrial computer integrating wireless WiFi, a wired network card, and packet capture and analysis functions. In the laboratory experimental stage, however, a computer and a wireless network card with promiscuous mode are used, and packet capture is performed with Wireshark, which is also the packet capture tool used in most of the related literature. Two Android mobile phones serve as the mobile devices, forming the same structure as Figure 13. Wireshark is used for data processing, and the specific calculations are implemented in MATLAB and C.
The datasets adopted in the experiment are of two types. The first type is the public dataset, which comes from Netresec, a Swedish network security company. Netresec has published many datasets; one of them (WRCCDC-PCAPS, from the Western Regional Collegiate Cyber Defense Competition) is selected as the public dataset in this experiment.
The second type of dataset is network data captured by Wireshark in the actual environment. The captured experimental data come from two smartphones running the Android system, on which common software is installed. Following the classifications of the application markets, the four most common categories of software are installed on the smartphones, namely video, news, communication, and life, including QQ, WeChat, email clients, news clients, etc. This paper extracts application fingerprints for communication sessions through frequent item set mining. There was no prior research on this topic, so no comparison with previous methods is possible; instead, this section presents an experiment that analyzes the integrity and accuracy of the extracted fingerprints. The specific datasets are shown in Table 9.
Dataset1 refers to the data packets captured when the user does not run any installed program. A smartphone sends only a few data packets when idle, mainly message pushes and data sent by the operating system itself. The purpose of collecting such data is to observe the non-user data and measure their impact on the user data. There are only 1755 data points in 30 min, which is very few compared with the mixed data collected in 15 min, so this part of the data can be ignored. Dataset2, Dataset3, Dataset4, and Dataset5 collect traffic while running only software of the related category. Dataset6, Dataset7, and Dataset8 are mixed traffic collected while running all applications, the difference being the collection time: 5 min, 10 min, and 15 min, respectively. Dataset9 and Dataset10 are public datasets from online attack and defense competitions. The experiment consists of three parts: IAT traffic partition, application fingerprint extraction integrity analysis, and high-utility application fingerprint extraction results analysis.

IAT Traffic Segment Experiment
The first step is traffic partitioning. According to the IAT address partitioning method proposed above, Table 10 shows the number of application fingerprints after partitioning. As can be seen from Table 10, the average fingerprint length is related to the number of packets and the application category; for example, video-class fingerprints are long, because video applications send more traffic and users generally finish loading the video they are watching. When a larger amount of data is collected and more application categories are used, the advantage of IAT address division gradually grows and the average fingerprint length decreases rapidly. With a shorter fingerprint length, the extracted features are more typical and the traffic classification effect is better.
First, calculate the information entropy of the entire network traffic, as shown in Formula (20):

H(D) = −Σ_{m∈M} p(m) log₂ p(m), (20)

where M is the set of message types, including TCP, HTTP, POP, and TLS. The information entropy of the whole data stream is then obtained by calculating the probability of the different message types under the different protocols. Next, the data packets are divided according to the IAT method, and the information entropy of each divided data stream is calculated, as shown in Formula (21):

H_j({IP_src, IP_dst}) = −Σ_{m∈M} p_j(m) log₂ p_j(m), (21)

where H_j({IP_src, IP_dst}) is the information entropy of segment j in the network traffic and p_j(m) is the proportion of message type m in that segment. The corresponding information gain is the following:

IG_IAT = H(D) − Σ_j (n_j / n) · H_j({IP_src, IP_dst}), (22)

where n_j is the number of packets in segment j and n is the total number of packets. To compare with division by time interval, the number of fragments divided by the two methods should be as close as possible: if the number of fragments divided by IAT is N_IAT, the corresponding time interval is N / N_IAT. The information entropy of the data stream divided by the time interval is shown in Equation (23):

H_k(T) = −Σ_{m∈M} p_k(m) log₂ p_k(m), (23)

and the information gain of division by time interval is shown in Equation (24):

IG_T = H(D) − Σ_k (n_k / n) · H_k(T). (24)

The information gains of the two methods are calculated according to the above formulas, and the results are shown in Table 11; for comparability with time division, the same number of fragments is used for the same dataset. It can be seen from Table 11 that the information gain of data stream division by IAT is higher than that by time division. The reason is that a data stream divided by time has a lot of contingency: the repeatability of the divided slices is very small, the information entropy remains large, and the nodes are not pure. Because IAT accurately finds the "breakpoints" of the data stream, the repeatability of the application fingerprints is very high, and only dozens of application fingerprints are extracted from potentially large data streams. When the data volume is large enough, more applications are used and switching between addresses is frequent; the shorter the obtained fingerprints, the more likely they are to repeat, the purer the node information, and the greater the information gain. This indicates a better application fingerprint classification effect.
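The entropy and information gain calculation above can be sketched as follows. The size-weighted averaging over segments follows the standard decision-tree convention and is an assumption, as is the toy traffic sample; the function names are not from the paper.

```python
import math
from collections import Counter

def entropy(messages):
    """Shannon entropy over message types, in the style of Formula (20)."""
    counts = Counter(messages)
    n = len(messages)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(stream, segments):
    """Entropy of the whole stream minus the size-weighted entropy of its segments."""
    n = len(stream)
    weighted = sum(len(seg) / n * entropy(seg) for seg in segments)
    return entropy(stream) - weighted

# A partition that perfectly separates message types recovers the full entropy as gain.
stream = ["TCP"] * 4 + ["TLS"] * 4
gain = information_gain(stream, [["TCP"] * 4, ["TLS"] * 4])
```

Segments with pure message composition contribute zero entropy, which is why IAT division, which finds real session "breakpoints", yields higher gain than arbitrary time slicing.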

Fingerprinting Integrity Experiment
Network packets are mostly encrypted data. Because the amount of data is huge, manual marking can only cover part of the work; data marking is mainly achieved through text clustering of DNS messages. In the experiment, an application is used continuously and compared with existing DNS messages to determine the data type. Figure 14 shows the data packets when WeChat starts.
This part of the experiment verifies the accuracy of fingerprint extraction, mainly by comparing the divided data stream fragments with the standard application fingerprints. The data must first be marked manually; because the amount of data is huge, the marking cannot be completely accurate. Therefore, through repeated observation and analysis of the above datasets, typical application fingerprints of the various application categories were identified. Table 12 lists some of the highest-frequency fingerprints of the various applications, obtained by manual analysis. For example, the fingerprints shown in lines 13-18 are the data packets captured while users use video applications. A network device limits the size of the data packets it sends; if the size exceeds the limit, the data are divided into several packets and sent in sequence.
A video will therefore be divided by network devices into packets of different types, TCP or UDP. After the data flow segments are obtained, two evaluation criteria are put forward to analyze the effect of the address division. The first criterion, shown in (25), is the proportion of the divided segments that appear among the application fingerprints obtained by manual analysis; it analyzes the relationship between the divided data flow segments and the real application fingerprints.
DS refers to the number of data flow fragments after repetition is removed, and FP refers to the number of application fingerprints obtained by manual analysis. Figure 15 shows the division of data flows in each dataset. It can be seen from Figure 15 that when the number of packets is small or the application type is single, there are few types of segmented data flow fragments. When the dataset is mixed data, the more traffic that is captured, the closer the divided fragments are to the fingerprints obtained by manual analysis.
The next step is to analyze the portion of the captured fragments that does not overlap with the application fingerprints, using (26) as the evaluation criterion, where Σ_i FP_i is the proportion of the various application fingerprints in the divided data segments.
This evaluation index analyzes the data flow segments that differ from the application fingerprints in the traffic. This paper defines such segments, which are not application fingerprints, as "fragments" of the data stream, aiming to find the reasons for sending these irregular packets. Analyzing the packet fragments that do not belong to any fingerprint greatly helps the preprocessing of the data stream and can remove a lot of interference and noise. Figure 16 shows the "fragments" of the data stream after the traffic division of each dataset. As shown in Figure 16, FP_DS is relatively high when a single category of network traffic is used: network devices send the data packets of the same application in a centralized manner, packets are rarely out of order, and the few remaining fragments mainly come from TCP retransmissions. The fragments in Dataset6, Dataset7, and Dataset8 arise because multiple applications send a large number of packets at the same time and network devices send packets to different IP addresses, which leaves many fragments with only one or two packets. Another cause of fragments is the jump of IP addresses.
Due to server deployment and network structure, different resources are distributed on different servers, so even a single communication session within a period will be sent from different IP addresses, which also causes the data stream to be divided into short fragments.
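Since the extracted text does not reproduce Formulas (25) and (26), the two evaluation criteria can only be sketched under assumptions about their shape: a coverage ratio against the manually analyzed fingerprints, and a "fragment" ratio for segments matching no fingerprint. The helper names and data are hypothetical.

```python
def coverage(segments, manual_fps):
    """Criterion-(25) style: share of manually analyzed fingerprints that also
    appear among the deduplicated data flow segments (assumed form)."""
    ds = {tuple(s) for s in segments}     # DS: deduplicated segments
    fp = {tuple(f) for f in manual_fps}   # FP: manually analyzed fingerprints
    return len(ds & fp) / len(fp)

def fragment_ratio(segments, manual_fps):
    """Criterion-(26) style: share of segments matching no fingerprint,
    i.e. the "fragments" of the data stream (assumed form)."""
    fp = {tuple(f) for f in manual_fps}
    return sum(1 for s in segments if tuple(s) not in fp) / len(segments)

segs = [["16:1", "16:2"], ["14:"], ["16:1", "16:2"]]
fps = [["16:1", "16:2"], ["23:2", "17:"]]
```

A high fragment ratio flags datasets where mixed traffic or IP-address jumps split sessions into one- or two-packet slivers, which matches the Dataset6-Dataset8 observation above.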

High-Utility Fingerprinting Extraction Experiment
This experiment mainly analyzes utility-based application fingerprint extraction. First, the utility of different message types in different datasets is analyzed, along with the influence of data stream size and class on the evaluation of message utility. Then, the high-utility application fingerprints extracted by the Huf algorithm are analyzed. In this paper, the tf-idf algorithm is used to evaluate utility. Figure 17 shows the tf-idf values of the main packet types under the different datasets. Figure 17 shows that the category and number of packets have a great influence on packet utility: when the number of obtained packets is large enough, the values tend to be stable. This is similar to natural language processing, which requires a large amount of text, forming a huge corpus, to ensure the accuracy of the calculation.
The following analyzes how well the Huf algorithm automatically obtains application fingerprints. The Huf algorithm aims to take the place of manual analysis in extracting session-level application fingerprints from network traffic. First, the extracted fingerprints must be accurate: the requirement is to obtain as many as possible of the application fingerprints found by manual analysis. Second, a small but important set of application fingerprints must be retained. Compared with ordinary TCP and UDP messages, encrypted messages account for only a small share of network data flows, because they belong to the phase in which the client establishes a connection with the server, and the server then maintains that connection for a long time. Therefore, another criterion for evaluating the Huf algorithm is the retention of these few but important application fingerprints. Finally, fingerprints that do not appear in the manually analyzed samples should be extracted as rarely as possible; otherwise, erroneously extracted fingerprints will degrade future traffic classification. This paper evaluates the above three situations with three corresponding criteria. One first needs the manually analyzed application fingerprints, which can be divided into two parts: FP_e, which contains the encrypted messages, and FP_u, which does not. The evaluation index is shown in (27), where FP_Huf, FP_Huf^c, and FP_Huf^e are, respectively, the total number of application fingerprints, the number of correctly extracted fingerprints, and the number of encrypted-message fingerprints obtained by the Huf algorithm. The higher the value of Metric_Huf, the better the effect of the Huf algorithm. The experiment is conducted on the above datasets, and the results are shown in Figure 18.
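The three quantities entering the evaluation can be computed as follows. This is a hedged sketch: the fingerprint sets are hypothetical, only the counts are produced, and the exact combination of them is defined by Equation (27) in the paper, not here.

```python
def huf_counts(extracted, manual_all, manual_encrypted):
    """Count the three quantities used by the evaluation index.
    All arguments are sets of fingerprint identifiers (illustrative):
      extracted        -- fingerprints produced by the Huf algorithm
      manual_all       -- fingerprints found by manual analysis
      manual_encrypted -- the encrypted-message subset of manual_all"""
    fp_huf   = len(extracted)                    # FP_Huf: all extracted
    fp_c_huf = len(extracted & manual_all)       # FP_Huf^c: correctly extracted
    fp_e_huf = len(extracted & manual_encrypted) # FP_Huf^e: encrypted retained
    return fp_huf, fp_c_huf, fp_e_huf

extracted        = {"fpA", "fpB", "fpE1"}        # hypothetical fingerprints
manual_all       = {"fpA", "fpB", "fpC", "fpE1"}
manual_encrypted = {"fpE1"}

total, correct, encrypted = huf_counts(extracted, manual_all, manual_encrypted)
```

With these inputs all three extracted fingerprints are correct and the single encrypted fingerprint is retained; an extraction outside `manual_all` would raise `total` without raising `correct`, which is exactly the error case the third criterion penalizes.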

It can be seen from Figure 18 that for Dataset1-Dataset5, which contain few data packets or only a single application category, Metric_Huf is unstable and irregular. When the number of collected data packets is large and the collection period is long, Metric_Huf stabilizes within a certain range.

Conclusions
Extracting network traffic features is always the focus of network behavior analysis, and modeling a large number of independent data packets is the premise of traffic processing. The Huf algorithm proposed in this paper realizes dynamic application fingerprint extraction, solving two problems of existing anomaly detection, namely the inability to process original network traffic and the reliance on manual feature extraction, and it realizes dynamic data flow processing. The application fingerprints extracted by the Huf algorithm serve two functions in anomaly detection. The first is to provide features for traffic classification in the next stage; the second is to establish a profile model for anomaly detection together with the classified network traffic. In this paper, the original network data are taken as the analysis object to achieve fine-grained anomaly detection of mobile terminals, and the extracted application fingerprints form the basis of anomaly detection modeling.
In this paper, we used an association analysis algorithm from data mining to mine the relationships between independent data packets. We built a Huf-Tree to realize dynamic application fingerprint mining and calculated the utility of fingerprints to obtain more valuable fingerprints. Some problems remain. The IAT in this paper divides traffic based only on IP addresses, which leaves considerable room for improvement: if the data flow is divided more accurately, subsequent algorithms will obtain better results. For fine-grained anomaly detection methods, the analysis of outlier types also needs further research.