IoT Application-Layer Protocol Vulnerability Detection using Reverse Engineering

Fuzzing is regarded as the most promising method for discovering protocol vulnerabilities in Internet of Things (IoT) network security. However, one critical drawback of existing fuzzing methods is that a huge number of test files is required to maintain high test coverage. In this paper, a novel method based on protocol reverse engineering is proposed to reduce the number of test files needed for fuzzing. The proposed method uses protocol reverse engineering techniques to identify the message formats of an IoT application-layer protocol, and creates test files by generating messages with erroneous fields according to those formats. A protocol message, treated as a sequence of bytes, is assumed to obey a stochastic process whose change-points mark the boundaries of message fields. A multi-change-point detection procedure is then introduced to identify the change-points of byte sequences from their statistical properties and to divide the sequences into segments at those change-points. The message segments are further processed by a position-based occurrence probability test analysis to identify keyword fields, data fields and uncertain fields. Finally, a message generation procedure that mutates individual message fields is applied to construct test files for fuzzing. The results show that the proposed method can effectively find the message fields and significantly reduce the number of test files needed for fuzzing.


Introduction
Fuzzing is a widely used security technique for discovering vulnerabilities in network protocols. It sends a series of test files containing random or faulty data to a software system that implements a specific protocol, and observes software exceptions to detect vulnerabilities within the protocol.
Currently, there are mainly two kinds of fuzzing techniques, i.e., mutation-based and generation-based fuzzing [1]. The former generates test files by injecting random or faulty data into sample messages (a message is the basic data unit exchanged between processes of an application-layer protocol), while the latter constructs fault-injected messages as test files based on a specific protocol specification. Mutation-based fuzzers, such as FileFuzz (http://www.securiteam.com/tools/5PP051FGUE.html) and SPIKEfile (https://www.ee.oulu.fi/research/ouspg/SPIKEfile), suffer from a serious problem: too many fault-injected messages are required to maintain high test coverage. The number of possible fault-injected files is 256^L, where L is the length of the sample message in bytes, and handling so many fault-injected test files takes a tremendously long time, especially when L is large. In fact, a protocol's software system parses its inputs according to their format and treats any file that does not obey the format as invalid input; in that case the software throws an error and quits before it reaches the faulty segment(s). Therefore, many fault-injected test files are unnecessary for a successful fuzzing test.
Generation-based fuzzers, such as PROTOS (https://www.ee.oulu.fi/roles/ouspg/Protos), generate test files by taking the format of input messages into account. One advantage of such tools is that they greatly reduce the number of test files with almost no sacrifice in test coverage [2]. However, one has to figure out the message formats and configure the generation-based fuzzer accordingly. Currently, message formats are mainly collected or analyzed manually, which is time-consuming and error-prone. To address these issues, protocol reverse engineering [3] is introduced to obtain the protocol specification automatically. A protocol specification, which includes the message format, is a set of rules that describe or model a network protocol. A field-based fault-injected message generation procedure, driven by the recovered message format, is then applied to create the fuzzing test files.
A protocol message is treated as a byte sequence that can be divided into a sequence of fields. A keyword field usually holds a command, operator or state code of the protocol, while a data field is a variable subsequence whose content changes from message to message, such as the value of a communication parameter. Generally, the message format is recovered by identifying all fields in the byte sequences. However, locating field boundaries and identifying fields in a message is a great challenge, since a priori information about them is usually unavailable. The byte sequence of a protocol message is assumed to obey an underlying stochastic process in which different fields have their own symbol distributions and change-points are the boundaries between fields. Each change-point thus marks the end of one field and the start of the next. Under these assumptions, field boundary detection is essentially a multi-change-point detection problem, which can be addressed with the change-point detection techniques [4] widely used in time series analysis. Once the change-points are located, messages are divided into field sequences. However, the types of the fields remain uncertain. Thus, a further inference procedure, named position-based occurrence probability test analysis, is proposed to determine the field types (keyword fields and data fields). First, fields whose occurrence probability is approximately zero are classified as data fields. The remaining fields are then processed by a position-based statistical test: a reference position is selected for every field, and a binomial test at significance level α checks whether each field appears at its reference position with probability 1. Fields passing these tests are chosen as keyword fields, while the rest are considered uncertain fields.

Related Work
Recently, the security and privacy issues of the Internet of Things have attracted considerable research interest [5][6][7][8][9][10]. In particular, analyzing applications and protocols in real-time network traffic monitoring is a fundamental and critical building block of network management and security systems for IoT infrastructures [11][12][13][14][15]. In this section, we review recent work on application-layer network protocol vulnerability analysis and detection.
Fuzzing gives protocol vulnerability detection a higher benefit-to-cost ratio with little or no increase in computational complexity. It aims to reveal bugs in protocols that an adversary could exploit to launch attacks or activate malicious code. Network protocol fuzzing is currently a hot research topic in network security. AutoFuzz [16] identified the variable parts of sample messages and fuzzed protocol implementations by sending messages with invalid symbols or invalid messages. AspFuzz [17] leveraged the protocol specifications available in RFCs (Request for Comments) to generate fault-injected messages as test files, and then sent both anomalous and reordered messages to discover vulnerabilities. SecFuzz [18] focused on fuzzing security protocol implementations, but it did not consider the specification of the target protocol either. Zhao et al. [19] used a regression finite state machine to infer the state transition diagram of a protocol in order to reveal potential vulnerabilities in wireless protocols.
In recent years, a range of works on protocol reverse engineering [20] have been published. As early as 2005, Marshall A. Beddoe launched the Protocol Informatics project [21], which applied bioinformatics alignment algorithms to identify the fields in packets. Cui et al. went further than Beddoe and presented Discoverer [22], which recovers protocol message formats using both sequence alignment and recursive clustering. However, Discoverer needs some a priori information about the delimiters used by the protocol, such as spaces and commas, to support tokenization, i.e., breaking a message into a token sequence. More recently, Tao et al. [23] combined hierarchical clustering, multiple sequence alignment and a Bayesian decision model to determine the field boundaries of binary protocols at bit granularity. Chen et al. [24] introduced a deep learning algorithm to analyze mobile applications. Xiao et al. [25] proposed a heuristic-rule-based method for reverse analysis of incomplete flows. Our approach makes no assumption about delimiters: we treat the byte sequence of a message as a stochastic process and detect field boundaries according to their statistical properties.
As a parallel approach to understanding unknown protocols, binary analysis-based techniques, such as Polyglot [26], Tupni [27], AutoFormat [28], Prospex [29], Dispatcher [30] and so on, have also drawn much research attention. They are practical in special scenarios where the binary code is available and can be executed in a specific sandbox-like environment. Moreover, binary analysis fails if programs use techniques such as obfuscation to protect themselves from being reverse-engineered.
As in many other security application domains [31][32][33][34][35][36], data mining and machine learning techniques have been widely adopted in the domain of IoT security and IoT traffic analysis.One of the key challenges is the data privacy problem, especially in collaborative and cloud-based learning scenarios.Several recent studies have proposed novel data privacy preserving approaches for addressing the problem [37][38][39][40][41][42].

Problem Formulation
Suppose that the alphabet used by protocol messages is Σ = {0x00, 0x01, 0x02, ..., 0xFF}. A string ω is a finite sequence of letters from Σ, that is, ω = a_1 a_2 ... a_n (a_1, a_2, ..., a_n ∈ Σ). All strings over the alphabet Σ form the set Σ*. As the basic data unit used by an IoT protocol, a protocol message m is essentially a string made up of a sequence of message fields, and each message field is itself a string in Σ*.
In this paper, a protocol message is assumed to be a byte sequence generated by a hidden statistical process, denoted Θ, whose statistical features shift as the byte sequence moves from one message field to another. When Θ passes from one field (the i-th) to another (the j-th), its statistical characteristics change significantly, so a change-point occurs exactly at the boundary between the two fields. Inspired by this observation, the problem of message field identification can be transformed into a change-point detection problem in the statistical process underlying the protocol message.
Given a string ω_o = x_1 ... x_n, the q-length prefix of its last letter x_n, i.e., the q letters immediately preceding x_n, is denoted T(ω_o, q):

$$T(\omega_o, q) = x_{n-q} \cdots x_{n-1},$$

and all such prefixes whose lengths are no longer than Q are considered together. The prefix conditional probability of x_1 ... x_n is defined as

$$p_n = P(x_n \mid x_1 \cdots x_{n-1}).$$

Let m = x_1 x_2 x_3 ... be a Q-order Markov process. Then the likelihood of x_n given x_1, ..., x_{n−1} is

$$P(x_n \mid x_1 \cdots x_{n-1}) = P\big(x_n \mid T(x_1 \cdots x_n, Q)\big),$$

where Q is a positive integer and n > Q.
Supposing that the byte sequence of a protocol message obeys a Q-order Markov process, p_n can be rewritten as a weighted combination over prefix lengths:

$$p_n = \sum_{q=1}^{Q} \omega_q \, P\big(x_n \mid T(x_1 \cdots x_n, q)\big),$$

where ω_q is the weight of P(x_n | T(x_1 ... x_n, q)). Essentially, ω_q reflects the importance of T(x_1 ... x_n, q) for predicting the context of x_n: the larger q is, the more important T(x_1 ... x_n, q) is. For instance, P("e" | "xampl") is far more informative than P("e" | "pl") for predicting that the "e" belongs to "example" rather than "multiple". Accordingly, the weight ω_q in this paper increases with q. Additionally, P(x_n | T(x_1 ... x_n, q)) is calculated by

$$P\big(x_n \mid T(x_1 \cdots x_n, q)\big) = \frac{\nu\big(T(x_1 \cdots x_n, q)\, x_n\big)}{\nu\big(T(x_1 \cdots x_n, q)\big)},$$

where ν(ω) is the frequency of ω in the training dataset D.
As shown in Figure 1, the prefix conditional probability p_n is high when x_n and its prefix T(x_1 ... x_n, q) lie in the same field, and low otherwise.
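The computation of p_n can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: a dictionary of substring counts stands in for the Q-depth suffix trie, the weights ω_q are assumed, for illustration only, to be proportional to q (which respects the requirement that longer prefixes weigh more), and the first byte, which has no prefix, is assigned its unigram frequency. The helper names `substring_counts` and `prefix_cond_probs` are hypothetical, and positions are 0-based.

```python
from collections import Counter


def substring_counts(m, max_len):
    """Frequency nu(w) of every substring w of m with 1 <= len(w) <= max_len.

    A flat Counter stands in for the Q-depth suffix trie of Algorithm 1.
    """
    counts = Counter()
    for i in range(len(m)):
        for length in range(1, min(max_len, len(m) - i) + 1):
            counts[m[i:i + length]] += 1
    return counts


def prefix_cond_probs(m, Q):
    """Return [p_0, ..., p_{N-1}] with p_n = sum_q w_q * nu(T x_n) / nu(T).

    Assumption (hypothetical): w_q is proportional to q and normalised over
    the prefix lengths q = 1..min(Q, n) that exist at position n.
    """
    nu = substring_counts(m, Q + 1)
    probs = []
    for n in range(len(m)):
        if n == 0:
            # No prefix exists: fall back to the unigram frequency of x_0.
            probs.append(nu[m[0:1]] / len(m))
            continue
        qs = range(1, min(Q, n) + 1)
        total = sum(qs)
        p = 0.0
        for q in qs:
            prefix = m[n - q:n]                       # T(x_1..x_n, q)
            cond = nu[m[n - q:n + 1]] / nu[prefix]    # nu(T x_n) / nu(T)
            p += (q / total) * cond
        probs.append(p)
    return probs


probs = prefix_cond_probs(b"GET1GET2GET3", 2)
print(probs[2], probs[3])   # p is 1.0 inside "GET" and drops to 1/3 at the digit
```

In the usage example the byte after each "GET" is unpredictable, so p_n falls exactly where a new field begins, which is the behaviour the change-point detector relies on.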

Minmax Formulation for Field Detection
There are mainly two formulations of the change-point detection problem: the Bayesian formulation and the minmax formulation. The Bayesian formulation [43] assumes that the change-point γ obeys a known prior distribution, while the minmax formulation [44] assumes that both the change-point and its statistical distribution are unknown.
In this paper, the statistical distribution of change-points in protocol messages is unknown, so the change-point detection problem is represented in the minmax formulation. Page [45] proposed the cumulative sum (CUSUM) algorithm, which provides an optimal solution to minmax-formulated problems. Accordingly, a CUSUM-like algorithm is proposed in this paper to search for multiple change-points. Since the statistical features of the message fields are unknown in advance, the likelihood ratio of the post-change to the pre-change probability, L(X_n) = f_1(X_n)/f_0(X_n), cannot be calculated directly. Thus, L(X_n) is replaced in this paper with a new metric C_n built from the prefix conditional probabilities p_n and p_{n−1}.
Suppose γ is a change-point in a message and x_n is the n-th letter of the message. Let f_1(X_n) be the post-change distribution of x_n and f_0(X_n) its pre-change distribution. At n = γ, the prefix conditional probability p_n of x_n is much smaller than p_{n−1}, which results in a high, positive value of C_n. When n < γ and x_n and x_{n−1} lie in the same field, they obey the same distribution, so that |1 − p_n/p_{n−1}| ≤ ε, where ε is a small positive threshold; if x_n and x_{n−1} lie in different fields, then n − 1 is itself a change-point, which should have been detected before γ. On the other hand, C_n is likely to exceed the given threshold when n > γ. As a result, a detection indicator that is updated recursively, in the manner of CUSUM, is defined for multi-change-point detection, and the stopping condition is that detection stops at the first n at which this indicator reaches υ, where υ is the threshold of the detection indicator.
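One round of a CUSUM-like detector can be sketched as below. It is a sketch under stated assumptions rather than the paper's exact definitions: C_n is assumed to be 1 − p_n/p_{n−1}, which matches the tolerance |1 − p_n/p_{n−1}| ≤ ε used above, the recursive indicator is assumed to follow Page's classic form u_n = max(0, u_{n−1} + C_n − ε), and the defaults for ε and υ are purely illustrative. The function name `cusum_like` is hypothetical.

```python
def cusum_like(p, eps=0.1, upsilon=0.5):
    """Return the index of the first alarm in p, or None if none occurs.

    Assumptions (illustrative, not taken verbatim from the paper):
      C_n = 1 - p_n / p_{n-1}             (large when p_n << p_{n-1})
      u_n = max(0, u_{n-1} + C_n - eps)   (Page-style CUSUM recursion)
      stop at the first n with u_n >= upsilon
    """
    u = 0.0
    for n in range(1, len(p)):
        c = 1.0 - p[n] / p[n - 1]
        # Within a field C_n <= eps, so the increment is <= 0 and u stays near 0.
        u = max(0.0, u + c - eps)
        if u >= upsilon:
            return n
    return None


print(cusum_like([0.9, 0.9, 0.9, 0.1, 0.8, 0.8]))   # 3: the sharp drop in p_n
```

Subtracting ε plays the role of the negative drift that the log-likelihood ratio has before a change in Page's original test, so small fluctuations inside a field never accumulate into an alarm.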

Multi-Change-Point Detection
Since the problem of message field identification in this paper is actually a multi-change-point detection problem, the detection procedure has to be extended to a multi-round procedure presented in Section 3.1 and called MultiCUSUM.
A variable χ_n is defined to indicate the underlying state of x_n, and the detection statistic is defined from it accordingly, where u_0 is the initial condition of the statistic in each new round of detection, which starts once the previous change-point has been found. The stopping time in the k-th round, denoted τ*_k, is defined in terms of ρμ_k, where μ_k is the mean of {C_{τ*_{k−1}+1}, ..., C_{n−1}} and ρ is the coefficient of μ_k.
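The multi-round idea, restarting the statistic from u_0 after every detected change-point and adapting the reference value to the mean μ_k of the current round, can be sketched as follows. Several choices here are hypothetical and not confirmed by the text: C_n is clipped at zero so that only drops in p_n count, ρμ_k is used as the reference value subtracted in the recursion, and each round collects a short history (`min_history`) before it starts testing. The name `multi_cusum` is likewise illustrative.

```python
def multi_cusum(p, rho=3.0, upsilon=0.5, u0=0.0, min_history=2):
    """Return all change-points found by restarting a CUSUM-like statistic.

    Hypothetical choices (not confirmed by the paper):
      C_n  = max(0, 1 - p_n / p_{n-1})      only drops in p_n count
      mu_k = mean of C since the last change-point (round k)
      u_n  = max(0, u_{n-1} + C_n - rho * mu_k)
      alarm when u_n >= upsilon, then restart from u0 with an empty history
    """
    change_points = []
    u = u0
    history = []                        # C values of the current round
    for n in range(1, len(p)):
        c = max(0.0, 1.0 - p[n] / p[n - 1])
        if len(history) >= min_history:
            mu = sum(history) / len(history)
            u = max(0.0, u + c - rho * mu)
            if u >= upsilon:
                change_points.append(n)        # tau*_k
                u, history = u0, []            # start round k + 1
                continue
        history.append(c)
    return change_points


p = [0.9, 0.9, 0.9, 0.9, 0.1, 0.8, 0.8, 0.8, 0.8, 0.1, 0.9, 0.9]
print(multi_cusum(p))   # [4, 9]
```

Because μ_k is recomputed in every round, a field whose probabilities fluctuate more raises its own reference value, so mild noise such as alternating 0.5 and 0.45 produces no alarm.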

Message Segmenting Algorithm
A message segmenting algorithm, shown in Algorithm 1, is proposed to segment a protocol message m into a set of message fields. In Algorithm 1, all messages associated with a specific protocol in D are concatenated one by one, in order of appearance, to form a single message m. Then a Q-depth suffix trie T is built to store the substrings of m with a maximum length of Q + 1 (line 1). The prefix conditional probability p_n is calculated according to Equation (3) (line 2) for the multi-change-point detection procedure (MultiCUSUM()), and the identified change-points are put into P_1 (line 3).

Algorithm 1 Message Segmenting Algorithm
Input: Message m = x_1 ... x_N
Output: Segment set Ω
1: T ← QSufTrie(m);  # Create a Q-depth suffix trie
2: P ← condProb(m, T);  # Compute the conditional probabilities P = {p_n : n = 1, ..., N}
3: P_1 ← MultiCUSUM(m, P);  # Change-point detection; P_1 is the change-point set
4: m_R ← x_N x_{N−1} ... x_1;  # Reverse the message stream
5: T_R ← QSufTrie(m_R);
6: P_R ← condProb(m_R, T_R);
7: P_2 ← MultiCUSUM(m_R, P_R);
8: P ← P_1 ∪ P_2;
9: Ω ← MsgSeg(m, P);

In fact, not every change-point is sensitive enough to the prefix conditional probability P(x_n | x_1 ... x_{n−1}) to be detected by the above procedure; some are more sensitive to the postfix conditional probability P(x_n | x_{n+1} ... x_N), which is essentially the prefix conditional probability of x_n in the reversed message. Therefore, we reverse the letter order of m (i.e., m_R = x_N x_{N−1} ... x_1), perform the same detection procedure on m_R to search for this type of change-point (lines 4∼7), and put the results into P_2.
Finally, the two sets of change-points are merged by P = P_1 ∪ P_2, and the message m is divided into segments at the change-points in P (lines 8∼9).
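The control flow of Algorithm 1 (forward detection, detection on the reversed stream, union of the two change-point sets and final segmentation) can be sketched in Python as below. The detector is passed in as a function so that any implementation of lines 1∼3 (suffix trie, condProb and MultiCUSUM) can be plugged in; `segment_message` and the toy space-based detector are hypothetical. A change-point c here means that a new field starts at byte c, so a change-point j found in the reversed stream maps back to N − j in the original message.

```python
def segment_message(m, detect):
    """Split m at the union of forward and backward change-points.

    `detect(seq)` must return indices c (0 < c < len(seq)) at which a new
    field starts in `seq`; it stands in for lines 1-3 of Algorithm 1.
    """
    n = len(m)
    forward = set(detect(m))                          # P_1
    reversed_m = m[::-1]                              # m_R = x_N ... x_1
    backward = {n - j for j in detect(reversed_m)}    # P_2, mapped back to m
    cuts = sorted(c for c in forward | backward if 0 < c < n)   # P_1 ∪ P_2
    bounds = [0] + cuts + [n]
    return [m[a:b] for a, b in zip(bounds, bounds[1:])]         # Omega


def space_starts_field(seq):
    """Toy detector: a new field starts at every space byte."""
    return [i for i in range(1, len(seq)) if seq[i] == 0x20]


print(segment_message(b"GET 200", space_starts_field))   # [b'GET', b' ', b'200']
```

The toy detector only sees the boundary before the space when scanning forward; scanning the reversed stream adds the boundary after it, which mirrors why Algorithm 1 runs detection in both directions.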

Occurrence Probability Analysis
To relieve the burden of the position-based statistical test analysis, a pre-processing step called occurrence probability analysis is applied to filter out the obvious data fields, whose occurrence probability is very low. Given a dataset D of size M, the occurrence probability of a string ω ∈ Ω in D, denoted p_D(ω), is defined as the ratio between the number of messages containing ω, denoted ν_m(ω), and the size of the dataset:

$$p_D(\omega) = \frac{\nu_m(\omega)}{M}.$$
Data fields are variable: each particular value of a data field occurs with a very small probability, close to zero. Therefore, data fields can be found by searching for string segments whose occurrence probabilities are statistically zero. In this paper, the occurrences of message segments are assumed to follow a binomial distribution, and a binomial test is used to test whether the occurrence probability of each message segment is zero.
The corresponding binomial hypothesis test is carried out at a significance level α, and the strings in F whose occurrence probability is judged to be zero are chosen as data fields, forming the set F_d.
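The occurrence probability p_D(ω) and a binomial test can be sketched with the standard library as below. The hypotheses are an assumption made for illustration: H0 states that the segment occurs with probability at least p0 (a small, hypothetical tolerance standing in for "non-zero"), and the segment is labelled a data field when the exact lower-tail binomial probability of its observed count ν_m(ω) is at most α. The function names and the value of p0 are hypothetical.

```python
from math import comb


def occurrence_probability(segment, dataset):
    """p_D(w) = nu_m(w) / M: share of messages in the dataset that contain w."""
    return sum(segment in msg for msg in dataset) / len(dataset)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def is_data_field(segment, dataset, p0=0.05, alpha=0.05):
    """Label a segment as a data field if it occurs significantly less than p0.

    Hypothetical test: H0: p_D(w) >= p0 versus H1: p_D(w) < p0, an exact
    one-sided binomial test on nu_m(w) successes out of M messages.
    """
    M = len(dataset)
    nu_m = sum(segment in msg for msg in dataset)
    return binom_cdf(nu_m, M, p0) <= alpha


dataset = [b"USER alice%d" % i for i in range(100)]
print(occurrence_probability(b"USER", dataset))   # 1.0
print(is_data_field(b"USER", dataset))            # False: present everywhere
print(is_data_field(b"alice42", dataset))         # True: seen in one message only
```

An exact test on p = 0 itself would reject as soon as a segment occurs once, which is why the sketch tests against a small positive p0 instead.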

Position-Based Statistic Test Analysis
A keyword field typically appears frequently in messages with similar functions, and its position is relatively stable. Both frequency and position are therefore important features for inferring keyword fields from the segment set F. As a result, a position-based statistical test is introduced to select keyword fields from {ω : ω ∈ (F − F_d)} by testing whether the position of each segment is fixed or quasi-fixed in messages.
Specifically, four kinds of positions of ω are considered in our scheme:
• P_ω,1: the distance between the message head and the position of ω in the message.
• P_ω,2: the distance between the message tail and the position of ω in the message.
• P_ω,3: the distance between the head of the line containing ω and the position of ω in that line.
• P_ω,4: the distance between the tail of the line containing ω and the position of ω in that line.
Let P_ω,r = (p_1^(ω,r), ..., p_n^(ω,r)), r ∈ {1, 2, 3, 4}, and define the support rate of p_i^(ω,r), denoted N(p_i^(ω,r)), as the number of times ω occurs at position p_i^(ω,r) in D. Based on the binomial test (see Section 4.1), keyword fields are chosen with α as the significance level. Equation (15) infers keywords with fixed positions by searching for segments that attain max_{i,r} N(p_i^(ω,r)). It performs well for segments ω with one dominant position; for instance, "GET" in HTTP messages has one dominant position, namely the head of a request message. However, some keywords have more than one dominant position, so N(p_i^(ω,r)) has multiple peaks. To address this multi-peak issue, an algorithm based on the minimal description length (MDL) criterion [46], called MDL-PTA, is introduced for the position-based statistical test analysis, as shown in Algorithm 2.
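Extracting the four position types and testing a segment for a single dominant position can be sketched as follows. Several details are assumptions for illustration: the two tail distances are measured from the end of ω, lines are split on b"\n", the support rate counts occurrences, and the binomial check (H0: the dominant position is taken with probability at least θ0, a hypothetical tolerance standing in for "probability 1") is not the paper's Equation (15). All function names are hypothetical.

```python
from collections import Counter
from math import comb


def positions(msg, token, sep=b"\n"):
    """Yield (r, distance) for r = 1..4 at every occurrence of token in msg."""
    start = msg.find(token)
    while start != -1:
        end = start + len(token)
        line_head = msg.rfind(sep, 0, start) + 1     # 0 when no separator precedes
        line_tail = msg.find(sep, end)
        if line_tail == -1:
            line_tail = len(msg)
        yield 1, start                  # P_w,1: from the message head
        yield 2, len(msg) - end         # P_w,2: to the message tail
        yield 3, start - line_head      # P_w,3: from the head of the line
        yield 4, line_tail - end        # P_w,4: to the tail of the line
        start = msg.find(token, start + 1)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def has_fixed_position(dataset, token, theta0=0.9, alpha=0.05, sep=b"\n"):
    """True if one (r, distance) pair dominates the occurrences of token."""
    support = Counter()                 # N(p_i^(w,r)) for every (r, distance)
    for msg in dataset:
        support.update(positions(msg, token, sep))
    occurrences = sum(c for (r, _), c in support.items() if r == 1)
    if occurrences == 0:
        return False
    _, top = support.most_common(1)[0]  # max over i and r
    # H0: the dominant position is used with probability >= theta0;
    # keep the token as a keyword unless H0 is rejected at level alpha.
    return binom_cdf(top, occurrences, theta0) > alpha


http = [b"GET /a HTTP/1.1\r\nHost: x\r\n", b"GET /bb HTTP/1.1\r\n", b"GET / HTTP/1.1\r\n"]
print(has_fixed_position(http, b"GET"))   # True: always at the message head
```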
For each ω, k reference positions B_k = {b_1, ..., b_k}, whose support rates are the k largest values in {N(p_j^(ω,r)) : p_j^(ω,r) ∈ P_ω,r}, are selected (line 5), and P_ω,r is divided into k clusters, C_k = {c(b_1), ..., c(b_k)}, according to the distance between p_j^(ω,r) and each reference position b_m, m = 1, 2, ..., k (line 6). The entropy of C_k is then calculated. The model complexity of C_k is (log k)/2, and the total description length L_k of C_k, i.e., the sum of its entropy and its model complexity, is calculated in line 7. The k-th model in the model set Ψ is represented as {B_k, C_k, L_k}, and the optimal model, with the minimal description length, is selected from Ψ (line 11). The computational complexity would be very high if all models in Ψ were considered; moreover, a keyword should not have many reference positions. As a result, only the top K models in Ψ are considered in Algorithm 2 (lines 4∼10).
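The MDL-based choice of reference positions for one position type r can be sketched as follows. The entropy of C_k is taken, as an assumption, to be the size-weighted Shannon entropy (in bits) of the positions inside each cluster, log is taken base 2 in the complexity term (log k)/2, a position equidistant from two references joins the one with the higher support, and the final keyword decision of Algorithm 2 is not reproduced. The name `mdl_select` is hypothetical.

```python
from collections import Counter
from math import log2


def mdl_select(positions, K=3):
    """Return (k, reference_positions, description_length) minimising L_k.

    Assumptions: within-cluster entropy weighted by cluster size, plus the
    model complexity log2(k) / 2; only k = 1..K are tried.
    """
    support = Counter(positions)              # N(p) for every observed position
    n = len(positions)
    best = None
    for k in range(1, min(K, len(support)) + 1):
        refs = [p for p, _ in support.most_common(k)]        # B_k
        clusters = {b: [] for b in refs}                      # C_k
        for p in positions:
            nearest = min(refs, key=lambda b: abs(p - b))     # ties: higher support
            clusters[nearest].append(p)
        entropy = 0.0
        for members in clusters.values():
            size = len(members)
            counts = Counter(members).values()
            h = -sum((c / size) * log2(c / size) for c in counts)
            entropy += (size / n) * h
        length = entropy + log2(k) / 2                        # L_k
        if best is None or length < best[2]:
            best = (k, refs, length)
    return best


print(mdl_select([0] * 10 + [20] * 10)[:2])   # (2, [0, 20]): two dominant positions
print(mdl_select([5] * 9 + [6])[:2])          # (1, [5]): one dominant position
```

Adding a second reference position only pays off when it removes more entropy than the extra half-bit of model complexity it costs, which is how the MDL criterion keeps the number of reference positions small.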

Algorithm 2 MDL-PTA Algorithm
Input: K, D and ω ∈ (F − F_d)
Output: true if ω is a keyword field, or false otherwise.
The inferred message fields are then further refined, and some semantic information about them is determined. Specifically, consecutive data (or uncertain) segments are merged into a single data (or uncertain) field. Regular expressions representing specific semantic information, such as IP addresses, file names, URLs and timestamps, are applied to match the message fields so that their semantics can be inferred.
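The refinement step can be sketched as below. The regular expressions are illustrative assumptions and far simpler than production-grade patterns (for example, the IPv4 pattern does not reject octets above 255), and the (type, bytes) field representation is hypothetical.

```python
import re

# Illustrative patterns only (assumptions); checked in this order.
SEMANTIC_PATTERNS = {
    "ipv4": re.compile(rb"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "url": re.compile(rb"^https?://\S+$"),
    "timestamp": re.compile(rb"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}$"),
    "filename": re.compile(rb"^[\w-]+\.[A-Za-z0-9]{1,4}$"),
}


def merge_fields(fields):
    """Merge runs of consecutive data (or uncertain) fields; keywords stay separate."""
    merged = []
    for ftype, value in fields:
        if merged and ftype in ("data", "uncertain") and merged[-1][0] == ftype:
            merged[-1] = (ftype, merged[-1][1] + value)
        else:
            merged.append((ftype, value))
    return merged


def label_semantics(fields):
    """Attach the first matching semantic label (or None) to each merged field."""
    labelled = []
    for ftype, value in merge_fields(fields):
        label = next((name for name, pattern in SEMANTIC_PATTERNS.items()
                      if pattern.match(value)), None)
        labelled.append((ftype, value, label))
    return labelled


fields = [("keyword", b"CONNECT"), ("data", b"192.168."), ("data", b"0.1"),
          ("keyword", b"FILE"), ("uncertain", b"report"), ("uncertain", b".pdf")]
for row in label_semantics(fields):
    print(row)
```

Merging before matching matters: neither "192.168." nor "0.1" looks like an address on its own, but the merged data field does.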

Evaluation
In this section, experiments are performed to evaluate the effectiveness of the proposed method. The experiments comprise two parts: message segmentation evaluation and a fuzzing test. The proposed message segmentation approach is implemented in a system called QCD-PInfer, whose architecture is shown in Figure 2.
Six typical, widely used application-layer protocols (HTTP, FTP, SMTP, POP, DNS and QQ) are selected to test the effectiveness and efficiency of message segmentation. The recall and precision of keyword inference are shown in Tables 1 and 2. Note that the ground-truth keywords are those that occur in the test set. DNS and QQ are not taken into account when evaluating the quality of the keyword set, since both are binary protocols, for which the concept of a keyword is not defined. By comparison, QCD-PInfer has a higher recall rate than Discoverer and PI. In particular, PI's recall rate is very low: below 10% for HTTP, FTP, SMTP and POP. Discoverer tends to infer too many segments as keywords, so its precision is much lower than that of the proposed system. Although PI's recall rate is very low, its precision for HTTP and FTP is extremely high; however, its precision for the other protocols is still very low. It is worth mentioning that PI infers too few keywords, always fewer than 5 for every protocol considered. The F-scores of the experimental results are shown in Figure 3. The proposed system has the highest F-score for all six protocols, which means our method performs well in keyword inference.
In the fuzzing test, QCD-PInfer is extended with a fuzzing function to implement an automatic fuzzing tool, APREFuzz. APREFuzz is used to identify vulnerabilities in a system under test that is designed to bring information-centric networking into IoT devices to enable their caching capability. The protocol used by the target system comprises 5 types of messages, responsible for sending interests, distributing data, pushing data, responding with target data and responding with no answer, respectively.
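Assuming that the reported F-score is the balanced F1 score, i.e., the harmonic mean of precision and recall, the three metrics for an inferred keyword set can be computed as below. The example sets are made up for illustration and are not data from Tables 1 and 2; `keyword_scores` is a hypothetical name.

```python
def keyword_scores(inferred, ground_truth):
    """Precision, recall and balanced F1 of an inferred keyword set."""
    true_positives = len(inferred & ground_truth)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


inferred = {"GET", "POST", "Host", "abc"}
truth = {"GET", "POST", "Host", "HTTP/1.1", "Connection"}
print(keyword_scores(inferred, truth))   # precision 0.75, recall 0.6, F1 about 0.667
```

Because F1 is a harmonic mean, a system like PI with very high precision but recall below 10% still receives a low F-score.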
First, message fields are identified using the QCD-PInfer system, and the message format is reconstructed. Second, test files are generated by inserting fault data into one field according to the message format. Note that in a real fuzz test, fault data may be inserted into more than one field; however, as a proof-of-concept system, APREFuzz currently considers only scenarios in which a single field is fault-injected. It is not difficult to extend the system to inject faults into multiple fields. When inserting fault data, keyword fields are replaced only with inferred keywords according to the message formats, data fields are replaced with random data, and uncertain fields are replaced with either inferred keywords or random data. In our experiments, uncertain fields are treated the same as data fields.
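Single-field fault injection as described above can be sketched as a generator. For determinism, the sketch replaces data (and uncertain) fields with a supplied list of abnormal strings instead of random data, and replaces a keyword field with every other inferred keyword; the field representation, the function name and the example values are hypothetical.

```python
def generate_test_files(fields, inferred_keywords, abnormal_strings):
    """Yield messages in which exactly one field has been fault-injected.

    fields: list of (type, bytes) pairs describing one sample message, where
    type is "keyword", "data" or "uncertain" (hypothetical representation).
    """
    values = [value for _, value in fields]
    for i, (ftype, value) in enumerate(fields):
        if ftype == "keyword":
            # Keyword fields only receive other inferred keywords.
            candidates = [k for k in inferred_keywords if k != value]
        else:
            # Data and uncertain fields are treated alike, as in the experiments.
            candidates = abnormal_strings
        for candidate in candidates:
            mutated = values.copy()
            mutated[i] = candidate
            yield b"".join(mutated)


fields = [("keyword", b"PUSH "), ("data", b"/sensor/1")]
for test_file in generate_test_files(fields, [b"PUSH ", b"GET ", b"ACK "], [b"%n%n"]):
    print(test_file)
```

Every generated file keeps all other fields intact, so it still parses up to the injected field instead of being rejected at the first malformed byte.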
Finally, the target system is treated as a black box, i.e., it is assumed to be unknown to us. APREFuzz sends the test files to the target system one by one and monitors the system's reaction by analyzing its responses.
In our experiments, APREFuzz extracted 7 keyword fields and inferred 7 data fields in the sample message, one of which contains only digits. The number of inferred keywords is 12. Eleven abnormal strings are used to inject fault data into the data fields other than the digit-only field, and 21 boundary numbers are injected into the digit-only field. As a result, APREFuzz generates 248 (= (12 + 11) × 7 + 21 × 1 + 11 × 6) fault-injected files. By contrast, FileFuzz generates 393,216 (= 1.5 × 1024 × 2^8) fault-injected files by replacing each byte with every value from 0x00 to 0xFF. When the test files were sent to the target system, APREFuzz detected one exception in which the system failed to respond, while FileFuzz detected none. The exception may indicate a vulnerability that could be leveraged to launch a DoS attack or other attacks that compromise the system's availability. Other tools are needed to analyze the exception in depth and determine its type and impact; however, that work is beyond the scope of this paper and is not presented here.
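The two test-file counts above can be checked with a few lines of arithmetic. The grouping of the factors follows the formula as written (the 7 keyword fields each receive the 12 inferred keywords plus the 11 abnormal strings, the digit-only field receives 21 boundary numbers, and the remaining 6 data fields receive the 11 abnormal strings), and the FileFuzz figure corresponds to a 1.5 × 1024 = 1536-byte sample message with all 256 values tried at every byte.

```python
inferred_keywords = 12
abnormal_strings = 11
boundary_numbers = 21
keyword_fields = 7
digit_only_fields = 1
other_data_fields = 7 - digit_only_fields     # 7 data fields in total

aprefuzz_files = ((inferred_keywords + abnormal_strings) * keyword_fields
                  + boundary_numbers * digit_only_fields
                  + abnormal_strings * other_data_fields)

message_bytes = 1536                          # 1.5 x 1024
filefuzz_files = message_bytes * 2**8         # every byte x every value 0x00-0xFF

print(aprefuzz_files, filefuzz_files, filefuzz_files // aprefuzz_files)
# 248 393216 1585
```

The field-based generator therefore produces roughly 1,585 times fewer test files than byte-wise mutation for this sample message.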

Conclusions
The proposed method applies protocol reverse engineering to improve IoT protocol fuzzing by creating valid and effective test files based on the protocol message format, greatly reducing the number of test files. It uses the statistical attributes of message fields to locate their boundaries by searching for change-points in messages, and reconstructs the message format. A CUSUM-like algorithm is presented to address the multi-change-point detection problem. Additional procedures, including the occurrence probability test and the position test, are further employed to classify message segments into keyword fields, data fields and uncertain fields. The results show that the extracted message formats are useful for generating test files for network protocol fuzzing.
In the future, with sufficient further improvement, APREFuzz could become a practical and powerful tool for automatically generating test files to fuzz IoT protocols or devices and reveal their hidden vulnerabilities. It could also contribute to strengthening IoT security effectively and efficiently, and even serve as a tool for improving protocol fuzzing in many other types of networks.

Figure 3. The F-score values of keyword inference.

Table 1. The recall of keyword inference.

Table 2. The precision of keyword inference.