We further measure the identification probability of different protocol banner combinations in Figure 5. We find that different combinations exhibit different degrees of protocol complementarity for identifying device attributes. For example, the Http_80 and RTSP_554 protocol banners identify the device model with probabilities of 66.2% and 58.7%, respectively, while combining the two protocols raises the probability to 71%. Although the Onvif_3702 protocol banner alone identifies the device model with a probability of only 34.2%, far less than that of the RTSP_554 banner, it achieves a device model identification probability of 83% when combined with the Http_80 banner. In other words, the complementarity between Http_80 and Onvif_3702 is much higher than that between Http_80 and RTSP_554. The reason for this difference is that the device model identification results from the Http_80 and RTSP_554 banners cover many of the same devices. Therefore, by optimizing the combination of multiple protocols and forming the optimal multi-protocol probe scheduling sequence, the communication overhead and identification time can be significantly reduced.
In this section, we discuss how to generate the optimal multi-protocol probe sequence for type-known IoT devices. We first analyze the whole banner-based device identification process and transform the generation of the multi-protocol probe sequence into a Markov decision process, following the reinforcement learning framework. Finally, we optimize the strategy generation method in the value iteration algorithm and use it to generate an optimal multi-protocol probe sequence for type-known IoT devices.
5.1. Scheduling Model of Multi-Protocol Probe Sequence
Constructing the optimal multi-protocol probe sequence of IoT devices is a scheduling problem. In the process of identifying device brand and model information based on banners, according to the findings in Section 4.3, we need to send $n$ different types of protocol probe packets one by one to obtain the corresponding protocol banners. The optimal protocol probe sequence obtains the protocol banners that carry more attribute information earlier, so as to identify the brand and model of the device faster and reduce the time cost and communication overhead. Each protocol banner yields a different benefit for device identification, and different orders of sending protocol probe packets lead to dynamic changes in the identification benefit.
We use a reinforcement learning method to model the device identification process. Reinforcement learning is a decision-based learning method developed from adaptive control theory. In the decision process, an agent accumulates the learning results in the current interactive context and takes these results as feedback for continuous learning, in order to obtain the optimal decision process. The goal of reinforcement learning is, given a Markov decision process (MDP), to find the optimal strategy. In this paper, the action of the MDP model is to send a protocol probe packet to the device. Firstly, based on the Markov decision process, we formally analyze the device identification process. In the MDP, the agent can perceive the state set $S$ and has an executable action set $A$, which refers to the set of all protocol probe packets. At each discrete time $t$, the agent perceives the current state $s_t$, chooses to perform the current action $a_t$, obtains the return $r_t$, and then generates a successor state $s_{t+1}$. The generation of the successor state is only related to the current state. In this paper, the symbol $S$ represents the set of identification states corresponding to all times. There are only four types of identification states, $S = \{s_0, s_1, s_2, s_3\}$, where $s_0$ indicates that no attribute information of the device has been identified, $s_1$ indicates that only the device brand has been identified, $s_2$ indicates that only the device model has been identified, and $s_3$ means that both the brand and model of the device have been identified and the identification process stops. The state transition process is shown in Figure 6.
The symbol $a$ represents the currently sent protocol probe packet. The symbol $r$ denotes the immediate reward determined by the change of the device identification state after the protocol probe packet is sent. If the device identification state does not change, $r = -1$. If exactly one attribute of the device (brand or model) is newly identified, $r = 1$. If both the brand and model attributes of the device are identified, $r = 2$.
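To make the state and reward definitions concrete, the following Python sketch encodes them. The names IdState and reward are our own illustrative choices, not identifiers from this paper:

from enum import IntEnum

class IdState(IntEnum):
    """The four identification states described above."""
    NONE = 0        # s0: no attribute information identified
    BRAND_ONLY = 1  # s1: only the device brand identified
    MODEL_ONLY = 2  # s2: only the device model identified
    BOTH = 3        # s3: brand and model identified (terminal)

def reward(prev: IdState, nxt: IdState) -> int:
    """Immediate reward r for the identification-state change prev -> nxt."""
    if nxt == prev:
        return -1   # no identification progress
    if prev == IdState.NONE and nxt == IdState.BOTH:
        return 2    # one probe reveals both brand and model
    return 1        # exactly one new attribute identified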
The symbol $\delta$ represents the state transition function. The task of the agent is to learn a strategy $\pi: S \rightarrow A$, which selects the action $a_t$ to perform based on the currently observed state $s_t$, that is, $\pi(s_t) = a_t$. Meanwhile, for any state $s$ and action $a$, there is a probability $P(s' \mid s, a)$ that taking action $a$ in state $s$ transfers the process to state $s'$; this probability is calculated from the acquisition rate of each protocol banner and the proportions of device attributes it contains. For device identification based on each protocol banner, the corresponding state transition probability matrix $P$ is computed as shown in Table 2. In the table, to simplify notation, $g$ denotes the protocol banner acquisition rate, $p_0$ denotes the probability that the protocol banner contains no device attribute, $p_b$ and $p_m$ denote the probabilities that the protocol banner contains only the device brand or only the device model, respectively, and $p_{bm}$ denotes the probability that the protocol banner contains both the device brand and model. Therefore, device identification based on the $n$ protocol banners yields $n$ state transition probability matrices.
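As one plausible reading of Table 2 (a sketch based on the definitions above, not necessarily the paper's exact entries), with rows and columns ordered $s_0, s_1, s_2, s_3$, the transition matrix of a single protocol probe would be:

$P = \begin{pmatrix} 1-g+g\,p_0 & g\,p_b & g\,p_m & g\,p_{bm} \\ 0 & 1-g+g(p_0+p_b) & 0 & g(p_m+p_{bm}) \\ 0 & 0 & 1-g+g(p_0+p_m) & g(p_b+p_{bm}) \\ 0 & 0 & 0 & 1 \end{pmatrix}$

Here the $(1-g)$ term covers the case in which the banner is not obtained at all, so the identification state cannot change, and the termination state $s_3$ is absorbing; each row sums to 1.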
5.2. Value Iteration Algorithm
Since both the transition probability $P$ of the identification states and the immediate reward $R$ of the actions are known in our model, this is model-based reinforcement learning. Therefore, for any strategy $\pi$, the expected cumulative reward brought by the strategy can be estimated. Let the value function $V^{\pi}(s)$ denote the cumulative reward obtained by starting from state $s$ and following strategy $\pi$. Because the MDP has the Markov property, that is, the next state of the system is determined only by the current state, there is a recursive expression as shown in Equation (2):

$V^{\pi}(s) = \sum_{a \in A} \pi(s, a) \sum_{s' \in S} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$ (2)
where $0 \le \gamma < 1$ is a constant called the discount factor, which determines the relative weight of delayed rewards versus immediate rewards, and $\pi(s, a)$ denotes the probability of choosing action $a$ in state $s$ under strategy $\pi$. When $\gamma = 0$, only the immediate reward is considered. The current optimal multi-protocol probe strategy $\pi^{*}$ can be selected by calculating the maximum discounted cumulative reward, i.e., by maximizing $V^{\pi}(s)$. The optimal strategy $\pi^{*}$ can be expressed by Equation (3):

$\pi^{*} = \underset{\pi}{\arg\max}\, V^{\pi}(s), \quad \forall s \in S$ (3)
The value function corresponding to the optimal strategy is $V^{\pi^{*}}$, abbreviated as $V^{*}$, which means the maximum discounted cumulative reward obtained starting from state $s$, that is, the discounted cumulative reward obtained by following the optimal multi-protocol probe strategy from state $s$.
Starting from state $s$, taking an action $a$, and thereafter executing the optimal strategy $\pi^{*}$, we use the state-action value function $Q(s, a)$ to express the cumulative reward. Its recursive form is given in Equation (4):

$Q(s, a) = \sum_{s' \in S} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a' \in A} Q(s', a') \right]$ (4)

For a state $s$, executing different actions obtains different $Q$ values, so the optimal action $a^{*}$ starting from state $s$ can be calculated by Equation (5):

$a^{*} = \underset{a \in A}{\arg\max}\, Q(s, a)$ (5)
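As a minimal sketch of the backup in Equation (4), assuming the per-action transition probabilities are stored as dictionaries mapping each state to its successor distribution (the names trans, rew, and the default gamma are our own):

def q_value(s, a, V, trans, rew, gamma=0.9):
    """Bellman backup of Equation (4): expected immediate reward plus the
    discounted value of the successor state (max_a' Q(s', a') = V*(s'))."""
    return sum(p * (rew(s, s2) + gamma * V[s2])
               for s2, p in trans[a][s].items())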
Using the recursive Equations (2) and (4) to calculate the value functions $V$ and $Q$ is in fact a dynamic programming algorithm. For $V$, since $\gamma^{t}$ tends to 0 as $t$ grows, continuing the recursion back to the starting point is a finite computation. In addition, obtaining $V^{*}$ may require many iterations, so we set a threshold $\theta$ to limit them: if the value function changes by less than $\theta$ after one iteration, the iteration is stopped, as expressed in Equation (6):

$\max_{s \in S} \left| V(s) - V'(s) \right| < \theta$ (6)
In the process of generating the optimal protocol probe sequence, the $V^{*}(s)$ of each state is calculated iteratively to maximize the cumulative reward of the optimal value function. In this way, the sum over actions in Equation (2) is transformed into the maximum in Equation (7):

$V^{*}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]$ (7)
The traditional value iteration algorithm only focuses on the optimal action in each state. If it were applied directly to the generation of the optimal protocol probe sequence, it could yield the same optimal action for every device identification state, so that the final probe sequence would contain only one protocol. The reason is that whenever an identification state has not reached the termination state, an optimal action must be selected from the action set to continue probing. Since every state has the same action set, the selected optimal action is easily identical across states, or is an action that has already been executed, which leads to a dead loop.
To avoid this situation, we optimize the strategy generation method in the value iteration algorithm. After calculating $Q(s, a)$ for every action in each state, the actions in the action set are sorted in descending order of their $Q$ values, which yields the optimal protocol probe sequence $seq$ for each identification state. Finally, according to the sequence of identification states $S'$ and the per-state sequences $seq$, the corresponding actions in each state's $seq$ are selected in turn and repeatedly, and the optimal multi-protocol probe sequence of a type-known IoT device is obtained. According to the related works [25,26], it can be proved that the algorithm converges, that is, the optimal protocol probe sequence obtained is unique. Due to space limitations, the proof is not given here. The specific algorithm is described in Algorithm 1.
Algorithm 1 Value iteration algorithm
Input: E(S, A, P, R); S′; γ; θ
//E(S, A, P, R): MDP quadruple
//S′: special identification state set
//γ: discount factor
//θ: convergence threshold
Output: Optimal_Sequence
1  ∀s∈S: V(s) = 0;
2  for t = 1, 2, … do
3      ∀s∈S: V′(s) = max_{a∈A} Σ_{s′∈S} P(s′|s, a)[R(s, a, s′) + γV(s′)];
4      if max_{s∈S} |V(s) − V′(s)| < θ then
5          break;
6      else
7          V = V′;
8      end if
9  end for
10 for i∈S′ do
11     for j∈A do
12         Q(i, j) = Σ_{s′∈S} P(s′|i, j)[R(i, j, s′) + γV(s′)];
13     end for
14     //sort the actions in A by Q(i, ·) in descending order
15     seq_i = sort(A, Q(i, ·));
16 end for
17 Sequence = combine(seq);
18 return Sequence;
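As an illustrative Python rendering of Algorithm 1 (a sketch, not the authors' implementation), reusing the hypothetical q_value, trans, and rew helpers assumed above:

def value_iteration(states, actions, trans, rew, terminal, gamma=0.9, theta=1e-6):
    """Lines 1-9 of Algorithm 1: repeated Bellman optimality backups.
    The terminal state keeps V = 0, since probing stops there."""
    V = {s: 0.0 for s in states}
    while True:
        V2 = {s: 0.0 if s in terminal
              else max(q_value(s, a, V, trans, rew, gamma) for a in actions)
              for s in states}
        if max(abs(V2[s] - V[s]) for s in states) < theta:
            return V2
        V = V2

def probe_sequences(special_states, actions, V, trans, rew, gamma=0.9):
    """Lines 10-16: for each identification state, sort all probe actions
    in descending order of Q(s, a)."""
    return {s: sorted(actions,
                      key=lambda a: q_value(s, a, V, trans, rew, gamma),
                      reverse=True)
            for s in special_states}

Interleaving the per-state sequences then corresponds to the combine step in line 17.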
Besides, in order to strike a balance between identification time and identification fineness, we set an accuracy-gain threshold to truncate the sequence. In practice, if the accuracy gain on brand or model brought by one additional protocol probe is less than the threshold of 1%, the subsequent protocol probe packets are discarded. Finally, we generate a segment of the optimal multi-protocol probe sequence. For other IoT device types, such as routers and printers, the optimal protocol probe sequence segment can be generated in the same way.
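A minimal sketch of this truncation, assuming a hypothetical accuracy_gain(prefix, probe) estimator that returns the marginal brand/model accuracy gain of appending one more probe:

def truncate(sequence, accuracy_gain, threshold=0.01):
    """Cut the probe sequence at the first probe whose marginal accuracy
    gain on brand or model falls below the 1% threshold."""
    prefix = []
    for probe in sequence:
        if accuracy_gain(prefix, probe) < threshold:
            break
        prefix.append(probe)
    return prefix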