1. Introduction
The Internet of Vehicles (IOV), a new Ad Hoc Network composed of the basic communication units of mobile vehicles and their surroundings, with the ability of perception, computing, storage and wireless communication running on the road, is an effective measure to improve safe driving. It achieves the communication between vehicle and vehicle, vehicle and road, as well as vehicle and Internet, which is a typical application of Internet of Things (IoT) technology in the field of transportation systems [
1]. Multi-vehicle collaboration is a critical technology for unmanned and assisted driving and for improving driving safety in the IOV, and it is among the most widely used IOV applications. In assisted-driving scenarios, the most important requirement is to ensure coordination between vehicles. Data for IOV applications are mainly divided into three categories: critical safety, traffic efficiency, and non-safety. Safe-driving data (e.g., vehicle distance, speed, and vehicle control commands), as the basic data of critical safety applications, are the foundation of IOV applications. Guaranteeing the validity of safe-driving data is therefore one of the main challenges facing IOV applications.
Unlike traditional data security on the Internet, the data security problems facing the IOV arise from both internal and external causes. On the one hand, the internal security defects of vehicles are mainly reflected in the existing Internet communication protocols [
2], that is, the vehicle lacks an effective verification mechanism for the data transmitted on the Internet, such as Controller Area Network (CAN) protocol. On the other hand, data attacks are diversified due to the open architecture and application of the IOV [
3]. For example, in January 2015, hackers exploited a security flaw to remotely control BMW’s onboard system, Connected Drive. In 2016, Toyota chairman Takeshi Uchiyamada claimed that two hackers had used a computer to access the Prius’ control system and then removed the car from the driver’s control completely. In June 2018, GCN published a report claiming that connected cars can lie [
4], posing a new threat to smart cities.
Data anomaly detection, combined with the characteristics of IOV data and driving style analysis, is the key technology for ensuring the validity and authenticity of the data exchanged during multi-vehicle collaboration. A number of studies on data anomaly detection already exist, such as the anomaly detection of traffic flow data based on the Tukey smoothing algorithm proposed by Xu et al. [
5], and the data cleaning algorithm for redundant data proposed by Wang et al. [
6]. However, due to the particularity of IOV application scenario, human behavior is an important parameter that cannot be ignored in the IOV, and driving style has a great influence on the data of IOV. Klauer et al. collected the driving data of 100 incidents; after a deep analysis of the driver’s behavior, they found there was a direct relationship between driving style and vehicle data [
7]. Furthermore, different driving styles show different characteristics in the data, which helps to detect data anomalies. For example, conservative drivers generally produce stable data, so frequent data changes are abnormal for them, whereas for aggressive drivers such frequent changes are normal behavior. Therefore, this paper proposes an anomaly detection algorithm combined with driving style.
Based on the WAVE protocol standard, which defines three types of data for IOV [
8]: critical safety, traffic efficiency, and non-safety, this paper focuses on the critical safety data, because real-time anomaly detection of all data in a data packet would incur a large computing cost. A traffic cellular automata (TCA) model is therefore constructed to preprocess the data, in order to achieve the desired anomaly detection effect with limited computing resources. On the basis of this model, the Anomaly Detection based on Driving style (ADD) algorithm is proposed. To quantify driving style better, a new driving style quantization model is proposed, which quantifies driving style more comprehensively from the data. The driving style parameter (
e) obtained from aforementioned model, velocity (
v), acceleration (
a), and distance (
d) are detected via the Gaussian mixture model (GMM). The experiments show that the ADD algorithm proposed in this paper has good performance in data anomaly detection (this work has been published on GitHub; the URL is
https://github.com/IoTLabDLUT/Data-Anomaly-Detection-for-Vehicular-Ad Hoc-Networks).
4. ADD Algorithm: Anomaly Detection Based on Driving Style
Some attack data in practice do not affect driving safety, and treating such data as normal effectively reduces the computing cost without affecting accuracy. This paper therefore preprocesses all data as follows:
C is the preliminary screening index for anomaly, and
C is 1 for possible anomaly, while
C is 0 for normal data.
F is the rule set of cellular automata traffic flow model,
f is one of these rules,
a is the decision of the vehicle, and
d is the data of the vehicle. Equation (
11) shows that, if the vehicle data can keep the TCA system steady, we believe that the data are normal data; otherwise, it is considered suspected abnormal data and proceeds to the next step for further judgment.
The formula for the driving style recognition coefficient
is given in
Section 3.3.
After obtaining
, this paper converted it into the corresponding driving style coefficient
e, in order to conduct anomaly detection via GMM later. The corresponding relation of
and
e is shown as
Table 2:
where Norm is the normal threshold and Agg is the aggressive threshold (according to the experimental results, the suggested values of the two thresholds are 0.5 and 1.5, respectively).
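The threshold comparison can be sketched as follows. This is only a minimal illustration under stated assumptions, not the paper's implementation: the function name `style_coefficient`, the argument `k_style` (standing for the driving style recognition coefficient), the class labels 0, 1 and 2, and the handling of values exactly on a threshold are all hypothetical. The actual values of e are those listed in Table 2; only the suggested thresholds 0.5 and 1.5 come from the text.

```python
NORM = 0.5  # normal threshold suggested in the text
AGG = 1.5   # aggressive threshold suggested in the text


def style_coefficient(k_style, norm=NORM, agg=AGG):
    """Map a recognition coefficient to a discrete driving style class.

    The return values 0, 1, 2 are placeholders for the entries of Table 2,
    and the boundary handling (strict "<") is an assumption.
    """
    if k_style < norm:
        return 0  # below the normal threshold (assumed: conservative)
    if k_style < agg:
        return 1  # between the two thresholds (assumed: normal)
    return 2      # at or above the aggressive threshold (assumed: aggressive)


labels = [style_coefficient(k) for k in (0.2, 1.0, 2.0)]
```

Keeping the thresholds as parameters makes it easy to try other values than the suggested pair.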
4.1. Gaussian Mixture Model (GMM)
After obtaining driving style parameter e, data anomaly detection is conducted. For the sake of research, this paper extracts the velocity v, acceleration a, distance d, and quantized driving style parameter e from the data packet for anomaly detection.
Since Gaussian mixture model (GMM) is applicable to continuous variables and can reflect the correlation between dimensions, this paper uses the GMM to carry out this work, as shown in
Figure 3:
GMM is often used in clustering. Drawing a random point from a GMM distribution can be divided into two steps: first, randomly choose one of the K components, where the probability of selecting each component is its mixture coefficient $\pi_k$; then, draw a point from the distribution of the selected component, which is an ordinary Gaussian distribution and therefore a known problem. Thus, to use GMM for clustering, we only need to infer the probability distribution of the GMM from the data.
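The two-step sampling view described above can be sketched in a few lines. This is an illustrative one-dimensional example; the function name `sample_gmm` and the weights, means and standard deviations are made up for the demonstration and do not come from the paper.

```python
import random


def sample_gmm(weights, means, stds, rng=random):
    """Draw one point from a 1-D Gaussian mixture in two steps."""
    # Step 1: pick a component index with probability equal to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: draw from the ordinary Gaussian of the chosen component.
    return k, rng.gauss(means[k], stds[k])


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_gmm([0.3, 0.7], [0.0, 10.0], [1.0, 1.0], rng)
         for _ in range(10000)]
# Fraction of draws that came from the second component (weight 0.7).
share_of_second = sum(1 for k, _ in draws if k == 1) / len(draws)
```

With enough draws, the fraction of points from each component approaches its mixture coefficient.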
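Let $x$ be a random variable; then, the Gaussian mixture model can be expressed as follows:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{12}$$
where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the kth component of the mixture model. Given two clusters, which can be represented by two two-dimensional Gaussian distributions, the number of components is K = 2. $\pi_k$ is the mixture coefficient, which satisfies:
$$\sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1 \tag{13}$$
where $\pi_k$ is equivalent to the weight of each component $\mathcal{N}(x \mid \mu_k, \Sigma_k)$; the two-cluster GMM then takes the form:
$$p(x) = \pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + \pi_2 \, \mathcal{N}(x \mid \mu_2, \Sigma_2) \tag{14}$$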
The problem of GMM parameter estimation is how to determine the parameter values automatically from the data. To solve this problem, we can use the Expectation–Maximization (EM) algorithm, which iteratively calculates ($\pi_k$, $\mu_k$, $\Sigma_k$) in the GMM.
The EM algorithm has two steps. The first step obtains a rough value of the parameters to be estimated; the second step uses that value to maximize the likelihood function. Thus, the likelihood function of the GMM should be obtained first. There are three parameters to estimate in the GMM: π, μ and Σ. For N observed points $x_1, \dots, x_N$, Equation (12) can be rewritten as:
$$p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \tag{15}$$
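To estimate the three parameters, the likelihood must be maximized with respect to each of them in turn. First, consider $\mu_k$. Taking the logarithm of Formula (15), differentiating with respect to $\mu_k$ and setting the derivative to 0 gives:
$$0 = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k) \tag{16}$$
Multiplying both sides by $\Sigma_k$ and rearranging gives:
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, x_n \tag{17}$$
where:
$$N_k = \sum_{n=1}^{N} \gamma_{nk} \tag{18}$$
$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{19}$$
As shown in Formula (19), $\gamma_{nk}$ is defined as the posterior probability of component k. In Formulas (17) and (18), N is the number of points, $\gamma_{nk}$ is the posterior probability that point n ($x_n$) belongs to cluster k, $\mu_k$ is the weighted average of all points, and $\gamma_{nk}$ is the weight of each point, which depends on the cluster.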
Maximizing with respect to $\Sigma_k$ in the same way, we obtain:
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^{T} \tag{20}$$
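Finally, the mixture coefficient $\pi_k$ remains; it can be regarded as the prior probability of component k. Because $\pi_k$ is subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$, a Lagrange multiplier $\lambda$ must be added:
$$\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \tag{21}$$
Differentiating with respect to $\pi_k$ and setting the derivative to 0 in the same way, we obtain:
$$0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda \tag{22}$$
Multiplying both sides by $\pi_k$ and summing over k, we obtain $\lambda = -N$, which leads to a more concise expression for $\pi_k$, with $N_k$ as in Formula (18):
$$\pi_k = \frac{N_k}{N} \tag{23}$$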
Estimating the GMM parameters with the EM algorithm amounts to iterating Formulas (17), (19), (20) and (23). First, assign initial values to π, μ and Σ and substitute them into Formula (19) to obtain $\gamma_{nk}$. Then, substitute $\gamma_{nk}$ into Formulas (17), (20) and (23), with $N_k$ from Formula (18), to obtain new $\mu_k$, $\Sigma_k$ and $\pi_k$. Subsequently, substitute these values into Formula (19) to obtain new $\gamma_{nk}$, and substitute that again into Formulas (17), (20) and (23). Repeat these steps until the algorithm converges.
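The iteration just described can be sketched for the one-dimensional case, where a scalar variance plays the role of the covariance matrix. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the synthetic data, the initial values, the fixed iteration count used in place of a convergence test, and the small variance floor are all choices made for this example.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, pis, mus, variances, iterations=50):
    """EM for a 1-D Gaussian mixture; returns updated (pis, mus, variances)."""
    pis, mus, variances = list(pis), list(mus), list(variances)
    n, k_count = len(xs), len(pis)
    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        gammas = []
        for x in xs:
            weighted = [pis[k] * normal_pdf(x, mus[k], variances[k])
                        for k in range(k_count)]
            total = sum(weighted)
            gammas.append([w / total for w in weighted])
        # M-step: re-estimate the mean, variance and weight of each component.
        for k in range(k_count):
            n_k = sum(g[k] for g in gammas)  # effective number of points
            mus[k] = sum(g[k] * x for g, x in zip(gammas, xs)) / n_k
            var = sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, xs)) / n_k
            variances[k] = max(var, 1e-6)  # assumed floor against collapse
            pis[k] = n_k / n
    return pis, mus, variances


# Synthetic data: 300 points around 0 and 700 points around 10.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(300)]
        + [rng.gauss(10.0, 1.0) for _ in range(700)])
pis, mus, variances = em_gmm_1d(data, [0.5, 0.5], [1.0, 8.0], [1.0, 1.0])
```

On well-separated data such as this, the estimated means settle near the two cluster centres and the weights near the cluster proportions. A production version would stop when the log-likelihood no longer improves rather than after a fixed number of passes.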
4.2. ADD Algorithm
Anomaly detection algorithm of IOV data, ADD, is proposed based on driving style. The state transition diagram of ADD algorithm is shown in
Figure 4.
The main process is as follows:
S0: Pre-detection by traffic flow model. If the vehicle data conforms to the TCA model (C = 0), determine it as normal data and continue S0; if the vehicle data does not conform to the TCA model (C = 1), anomalies may exist and further comprehensive determination is needed, and proceed to S1.
S1: Data preprocessing. Speed v, acceleration a and distance d are extracted from the data packet. The distance d and the minimum safe distance s are compared. If , proceed to S2; otherwise, proceed to S3.
S2: The driving style recognition coefficient is calculated according to the Formula (10).
S3: The driving style coefficient e is obtained according to the comparison table of driving style parameters.
S4: The obtained driving style coefficient
e, velocity
v, acceleration
a, and distance
d are used to obtain the anomalies list via the Gaussian mixture model (GMM). For a given m-dimensional data set, the Gaussian mixture model is used to calculate the mathematical expectation μ and to build the covariance matrix Σ of all the features, as shown in Formulas (24) and (25):
The probability density function is as shown in Formula (26):
The probability density calculated with Formula (26) is used to judge new data: it is compared with an adaptive threshold to detect anomalous data, and the resulting anomalies list is output.
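The density-and-threshold step can be illustrated with a single multivariate Gaussian fitted to feature vectors (e, v, a, d). This is a hedged sketch rather than the paper's method: it assumes the sample mean and a 1/m-normalised covariance, works with log-densities for numerical stability, and replaces the adaptive threshold with a simple fixed one (the lowest log-density seen in training). The training rows are synthetic and every function name is hypothetical.

```python
import math
import random


def fit_gaussian(rows):
    """Sample mean vector and covariance matrix (1/m normalisation)."""
    m, dim = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(dim)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / m
            for j in range(dim)] for i in range(dim)]
    return mu, cov


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_density(x, mu, cov):
    """Log of the multivariate normal density at x."""
    low = cholesky(cov)
    n = len(x)
    # Forward substitution solves L y = x - mu; y.y is the Mahalanobis term.
    y = []
    for i in range(n):
        y.append((x[i] - mu[i] - sum(low[i][k] * y[k] for k in range(i)))
                 / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in y))


# Synthetic training rows (e, v, a, d); the numbers are illustrative only.
rng = random.Random(0)
train = []
for _ in range(200):
    e = rng.choice([0, 1, 2])
    train.append([e, rng.gauss(15 + 2 * e, 1.0),
                  rng.gauss(0.0, 0.3 + 0.2 * e), rng.gauss(30 - 5 * e, 2.0)])

mu, cov = fit_gaussian(train)
# Assumed fixed threshold: the lowest log-density among the training rows.
threshold = min(log_density(row, mu, cov) for row in train)


def is_anomaly(x):
    """Flag x when its log-density falls below the threshold."""
    return log_density(x, mu, cov) < threshold


suspect = [0, 40.0, 5.0, 2.0]  # very fast, hard acceleration, tiny gap
suspect_flagged = is_anomaly(suspect)
```

The Cholesky factor is used instead of an explicit matrix inverse so that the determinant and the quadratic form come from the same factorisation. Note that the covariance must be non-singular, so a feature that is constant across all rows (for example a single driving style class) would need to be dropped or regularised.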
5. Experiment and Analysis
In this paper, the ADD algorithm is proposed to analyze the validity of IOV data. Two data sets are used in the simulation experiment.
Data set 1: NGSIM data set for experimental simulation [
33]. Researchers collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, USA; eastbound I-80 in Emeryville, CA, USA; and Peachtree Street in Atlanta, GA, USA. Data were collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program (Dataset 1.1; USDOT; Los Angeles, California; Emeryville, California; Atlanta, Georgia; America), transcribed the vehicle trajectory data from the video. These vehicle trajectory data provide the precise location of each vehicle within the study area every 0.1 s, resulting in detailed lane positions and locations relative to other vehicles.
Data set 2: The self-made data set simulated by Simulation of Urban MObility (SUMO), using a “five-car model” and custom vehicle to generate autonomous vehicles required by the experiment. In order to distinguish the two different types of vehicles, we use red vehicles to represent autonomous vehicles and yellow vehicles to represent environmental vehicles, as shown in
Figure 5. Then, collect the simulated data and analyze it as the interactive data between vehicles.
This paper only changes the speed to simulate anomaly data. According to the actual situation and the investigation of Chen et al. [
34], four types of anomaly data are defined, as shown in
Figure 6:
1. The acceleration is 0, but the speed changes;
2. The acceleration is not 0, but the speed remains unchanged;
3. The distance (d) is too small when the speed or acceleration is large;
4. A step occurs in speed or acceleration.
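The four anomaly types listed above can be expressed as simple checks on two consecutive samples. The sketch below is only an illustration: the function name `classify_anomaly` and every numeric tolerance (`eps`, `min_gap`, `high_speed`, `high_accel`, `step`) are hypothetical values chosen for the example, since the text does not state the thresholds used in the experiments.

```python
def classify_anomaly(prev, curr, eps=1e-6, min_gap=5.0, high_speed=20.0,
                     high_accel=3.0, step=5.0):
    """Return the anomaly types (1 to 4) triggered by the current sample.

    prev and curr are (v, a, d) tuples for consecutive time steps.
    All thresholds are assumed values, not taken from the paper.
    """
    v0, a0, _ = prev
    v1, a1, d1 = curr
    dv = v1 - v0
    found = []
    if abs(a1) <= eps and abs(dv) > eps:
        found.append(1)  # acceleration is 0, but the speed changes
    if abs(a1) > eps and abs(dv) <= eps:
        found.append(2)  # acceleration is not 0, but the speed is unchanged
    if d1 < min_gap and (v1 > high_speed or abs(a1) > high_accel):
        found.append(3)  # distance too small at high speed or acceleration
    if abs(dv) > step or abs(a1 - a0) > step:
        found.append(4)  # step in speed or acceleration
    return found


examples = {
    "type1": classify_anomaly((10.0, 0.0, 30.0), (12.0, 0.0, 30.0)),
    "type2": classify_anomaly((10.0, 1.0, 30.0), (10.0, 1.0, 30.0)),
    "type3": classify_anomaly((25.0, 0.5, 30.0), (25.05, 0.5, 3.0)),
    "type4": classify_anomaly((10.0, 1.0, 30.0), (20.0, 1.0, 30.0)),
    "normal": classify_anomaly((10.0, 1.0, 30.0), (10.1, 1.0, 30.0)),
}
```

Returning a list rather than a single label allows one sample to trigger several types at once, which matches the observation that the anomalies can be related to each other.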
In order to prove the performance of the algorithm proposed in this paper, two algorithms are adopted for comparison: (1) HTM algorithm: Hole et al. [
35] carried out anomaly detection on the time series data, whose data set was derived from the collected time series data of various industries, including the vehicle speed in the IOV. In this paper, the HTM algorithm is only used for anomaly detection of one-dimensional data, velocity v, for comparison. (2) LSTM algorithm: Filonov et al. [
36] adopted a method based on LSTM neural network to monitor and detect anomalies in multivariable time series data. In this paper, the LSTM algorithm is used to detect the anomalies of the three-dimensional IOV data (speed, acceleration, distance) without driving style for comparison.
In this paper, ten cars in the NGSIM data set are randomly selected as the GMM training data set.
Figure 7 shows the speed trend of the training set, where the velocity step between two vehicles is marked as anomaly 4.
There are 11 anomaly intervals in the figure above, as follows:
Anomaly intervals (5) and (8): the acceleration is 0, but the velocity changes (anomaly 1);
Anomaly intervals (6) and (9): the acceleration is not 0, but the velocity remains unchanged (anomaly 2);
Anomaly intervals (1) and (11): the vehicle distance d is small while the speed or acceleration is large (anomaly 3);
Anomaly intervals (2), (3), (4), (7), and (10): an abnormal step occurs in velocity or acceleration (anomaly 4).
The GMM test data set is composed of two parts. Test set 1 consists of vehicles randomly selected from the NGSIM data set, from which the following three states (situations) are obtained:
State 1 selects the vehicles as 14, 233, 999, and 2333. The anomaly data include all four anomaly conditions. The schematic diagram is shown in
Figure 8a.
Figure 8b is the preliminary screening index of anomalies obtained via the cellular automata traffic flow model.
State 2 selects the vehicles as 28 and 78, and the anomaly data includes anomaly 1, 2, and 3. All these three anomalies are considered to be related to each other; the schematic diagram is shown in
Figure 9a.
Figure 9b is the preliminary screening index of anomalies obtained via the cellular automata traffic flow model.
State 3 selects the vehicles as 59 and 1202, and there is only anomaly 4. In this case, there was only step anomaly of speed, as shown in
Figure 10a.
Figure 10b is the preliminary screening index of anomalies obtained via the cellular automata traffic flow model.
Test set 2 is obtained by adding the anomaly data (including all four anomalies, in the first situation) into the SUMO simulation data set, as shown in
Figure 11.
5.1. Experimental Results and Analysis
According to the actual situation and the study of Murphey et al. [
26], a change of driving style does not appear as an abrupt jump in the data; thus, this paper treats the driving style
e as a transient quantity, which is quantified over a sliding time window (it is suggested that the time window should be consistent with
). The driving style quantification result of a vehicle (Id: 562) is shown in
Figure 12.
When comparing the experimental results on the test sets, three measures commonly used in the field of data anomaly identification, Precision, Recall and the $F_1$ score, are selected as the evaluation criteria. They are calculated as shown in Formulas (27) to (29):
$$\text{Precision} = \frac{TP}{TP + FP} \tag{27}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{28}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{29}$$
where TP represents the number of correctly detected anomalies, FP represents the number of false positives, and FN represents the number of false negatives. As shown in Formulas (27)–(29), the numbers TP, FP and FN determine the Precision and Recall. Precision indicates how reliable the reported anomalies are, and Recall reflects the ability of the algorithm to detect the anomalies that exist. The Precision and Recall values together determine the final $F_1$ score, which represents the overall performance of the anomaly detection algorithm.
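As a small worked example of these measures, the sketch below computes Precision, Recall and the combined score from raw counts, taking the combined score to be the usual $F_1$ (the harmonic mean of Precision and Recall). The function name `detection_scores` and the counts are hypothetical and are not results from the paper; returning 0 when a denominator is empty is a convention assumed here.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 anomalies found, 2 false alarms, 8 anomalies missed.
precision, recall, f1 = detection_scores(tp=8, fp=2, fn=8)
```

With these counts Precision is 0.8 and Recall is 0.5, so a detector that raises few false alarms but misses half of the anomalies still ends up with a modest combined score of about 0.62.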
5.1.1. Experimental Results Analysis of the First Situation
In the first situation, data anomaly detection results are shown in
Table 3 and
Table 4, where
Table 3 is the result of test set 1 (NGSIM data set) and
Table 4 is the result of test set 2 (SUMO simulation data set).
It can be seen from the tables that the HTM algorithm, which performs anomaly detection only on one-dimensional speed data, generally has high Precision but very low Recall. Its Precision is high because it detects step anomalies well and produces few false positives (FP). However, it cannot take the correlation between multidimensional data into account, which results in many false negatives (FN). Therefore, the Recall of each group is very low, and the final $F_1$ score is also very low.
The Precision of the LSTM algorithm is not much different from that of the ADD algorithm; however, the Recall of the ADD algorithm is significantly higher than that of the LSTM algorithm. Because the ADD algorithm takes driving style into account, its detection is more comprehensive, its number of false negatives (FN) is very low, and its final $F_1$ score is also significantly higher than those of the other two algorithms.
5.1.2. Experimental Results Analysis of the Second Situation
In the second situation, the results of anomaly data detection are shown in
Table 5. The HTM algorithm has low Precision and Recall, and its final $F_1$ score is far lower than those of the other two algorithms. The results of the LSTM and ADD algorithms are not much different from those in the first situation: the Precision of the ADD algorithm is similar to that of the LSTM algorithm, but its Recall is higher, and its final $F_1$ score is also higher.
5.1.3. Experimental Results Analysis of the Third Situation
In the third situation, the results of anomaly data detection are shown in
Table 6. Since there is only the velocity step anomaly, every algorithm achieves high Precision and Recall, the final $F_1$ scores are close to each other, and the detection performance of the three algorithms does not differ much.
Comparing the three situations, validity analysis based on multidimensional data clearly performs much better than analysis using only one-dimensional speed data, because the latter does not consider the correlation between multidimensional data, which leads to many false negatives and a very low final score. In the same situations, the method that adds the driving style parameter e performs better than the method without it. Although it introduces some unnecessary errors and increases false positives (FP), the more comprehensive calculation detects more anomalies, decreases false negatives (FN), and improves Recall; as a result, the score combining Precision and Recall is significantly higher than that of the method without driving style.