An Optimized Algorithm for Dangerous Driving Behavior Identification Based on Unbalanced Data

Abstract: Identifying dangerous driving behavior from vehicle trajectories extracted by video monitoring is of great significance for highway traffic safety. At present, there is no suitable method to accurately identify dangerously driven vehicles based on trajectory data. This paper develops a detection algorithm for identifying dangerous driving behavior based on the road scene. The algorithm consists of three parts: detection and labeling of dangerous drivers in imbalanced data, extraction of driving behavior characteristics, and establishment of a recognition model for dangerous driving behavior. Firstly, this paper defines vehicle risk indices for five types of dangerous driving behavior: dangerous following, lateral deviation, frequent acceleration and deceleration, frequent lane change, and forced insertion. Then, several methods, including K-means clustering, the local outlier factor (LOF) algorithm, isolation forest and OneClassSVM, are used to carry out anomaly detection on the drivers' risk indicators, and the optimal method for identifying dangerous drivers is determined. Next, the speed and acceleration of each vehicle are Fourier transformed to obtain the characteristics of the driver's driving behavior. Finally, because the analyzed dataset is imbalanced, with a very small proportion of dangerous drivers, this paper compares a variety of imbalanced classification algorithms to optimize the recognition performance for dangerous driving behavior. The results show that the OneClassSVM detection algorithm can be effectively applied to the identification of dangerous driving behavior, and that the improved Xgboost algorithm performs best on the extremely imbalanced data of dangerous drivers.


Introduction
Road traffic safety has always been a serious social problem. Road traffic accidents cause huge property losses and seriously threaten personal safety. The causes of road accidents include congestion-related collisions, distracted driving, improper taking of bends, unsafe changing of lanes, environmental disasters, and pedestrian or animal crossing [1]. Studies have shown that human factors are the most important cause of road traffic accidents [2]. Among them, the driver's dangerous driving behavior is an important cause of traffic accidents. Dangerous driving behavior refers to driving that falls far below the standard of a qualified driver and obviously has the potential to cause serious personal injury or property damage.
With the development of autonomous vehicles and advanced driver assistance systems (ADAS), more and more researchers have used smart sensors and artificial intelligence algorithms to identify driving risks and prevent traffic accidents. In recent years, many scholars have studied driver behavior through naturalistic driving experiments. For example, some scholars obtain vehicle condition information (such as vehicle speed and engine speed) through the vehicle's on-board diagnostics (OBD) interface to identify the driver's dangerous driving behavior [3]. Wu et al. used specific sensors (such as CCD cameras and gyroscopes) to obtain data such as vehicle speed and acceleration, and identified driving behavior by extracting relevant features [4]. Many scholars also use driving simulators to carry out research on recognition modeling of dangerous driving behaviors [5,6]. Traffic accident data are often difficult to observe in actual traffic flow. Some researchers use highway surveillance cameras to monitor traffic flow from high altitude and extract vehicle trajectory data, as well as yaw rate, to classify dangerous driving behavior [7][8][9]. Some scholars use vehicle trajectory data to classify driving behavior through statistical methods or other data mining methods [10][11][12]. Many scholars have used clustering methods to classify dangerous driving behaviors [13][14][15].
In the fields of outlier detection and abnormal data detection, unsupervised algorithms such as the local outlier factor (LOF) algorithm, OneClassSVM and isolation forest are widely used. However, there are relatively few applications of such anomaly detection algorithms in research on identifying driving behavior. Ramyar et al. used OneClassSVM to distinguish between normal and abnormal lane changes, but only for lane change scenarios [16]. Matousek et al. used anomaly detection methods to identify aggressive driving behaviors, but the recognition results were not significant [17]. In addition, their methodology is based only on data collected in driving simulator experiments.
It is noted that dangerous driver data obtained through clustering or anomaly detection often account for only a small proportion of the samples. Imbalanced datasets lead to imbalanced recognition and classification performance in the resulting prediction models, with poorer prediction accuracy for categories with fewer data. This research focuses on identifying dangerous driving behavior, which is such a category with a small amount of data. Therefore, it is necessary to improve the classification performance of the prediction model for dangerous driving behavior through optimized methods. For imbalanced datasets, resampling is usually applied during data preprocessing to reduce the degree of imbalance and thereby improve the recognition performance for the minority samples. Commonly used resampling methods include random oversampling, random undersampling (RUS) and SMOTE (Synthetic Minority Oversampling Technique). SMOTE was proposed by Chawla et al. [18]. By synthesizing more instances of the minority class in the feature space, it expands the decision region of the minority class and balances the class proportions. Some researchers have also proposed imbalanced boosting algorithms to process imbalanced data, such as Smoteboost and Rusboost, but these recent imbalanced boosting algorithms are rarely applied in the transportation field [19,20]. Adaboost (Adaptive Boosting) is a commonly used boosting algorithm, proposed by Yoav Freund and Robert Schapire in 1995. In recent years, the Xgboost algorithm has been proposed, has been widely used in various scenarios and has shown good recognition performance [21]. For imbalanced data, scholars in other fields have proposed an imbalance-improved Xgboost [22]. The recognition performance of the improved algorithm for dangerous driving behaviors needs further research.
Based on vehicle trajectory data from high-altitude monitoring of highways, this paper defines five risk evaluation indicators for dangerous driving behaviors on highways: dangerous car following, lateral deviation, frequent acceleration and deceleration, frequent lane changes, and forced insertion. These indicators are used to identify dangerous driving behavior. This paper analyzes and compares four unsupervised anomaly detection methods for detecting and labeling dangerous driving behavior, and proposes an evaluation method to verify the effectiveness and practicability of the anomaly detection algorithms. After that, frequency-domain features are extracted from the trajectory parameters of the labeled vehicles, and multiple recognition and classification models are compared to obtain the model with the best prediction performance. The innovation of this paper is twofold. First, for driving behavior recognition based on video surveillance, a dangerous driving behavior detection method based on risk indices is proposed for unsupervised anomaly detection algorithms, to ensure the effectiveness of dangerous vehicle trajectory detection. Second, combined with imbalance processing techniques, this paper identifies and analyzes the extremely unbalanced dangerous driving behavior data, which can effectively improve the accuracy of dangerous driving behavior recognition.
Electronics 2022, 11, 1557

Methodology
The modeling algorithm of this article can be divided into four parts. The first part elaborates how the indices that define dangerous driving behavior are calculated. The second part describes how to establish an anomaly detection model to label dangerous drivers. The third part uses the Discrete Fourier Transform (DFT) to convert the given time series into signal amplitudes in the frequency domain, thereby revealing the driving characteristics hidden in the vehicle trajectory data. The fourth part compares and analyzes a variety of imbalanced recognition and classification algorithms and their performance indicators. The specific technical route is shown in Figure 1.

Dangerous Driving Behavior Indicators
This article defines the following five dangerous driving behaviors based on ordinary driving scenarios on the highway: dangerous car following, lateral deviation, frequent acceleration and deceleration, frequent lane changes, and forced insertion. As shown in Table 1, the measure of driving risk (MOR) of the five dangerous driving behaviors is quantified by the values MOR1~MOR5. According to the formula definitions in the table, the greater the MOR value, the greater the risk of the vehicle. Because the ranges of the MOR indicators differ greatly across the dangerous driving behaviors, this article applies dispersion (min-max) standardization to each MOR indicator, converting the values into dimensionless data. After standardization, the five MOR values range from 0 to 1.
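The standardization step above can be illustrated with a minimal sketch. It assumes plain min-max scaling per indicator; the function name `min_max_scale` and the sample values are hypothetical and do not come from the paper's dataset or from Table 1.

```python
def min_max_scale(values):
    """Rescale a list of raw indicator values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant indicator carries no ranking information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw values of one indicator (e.g., MOR1) for five vehicles.
raw_mor1 = [0.2, 1.5, 0.9, 3.1, 0.2]
scaled_mor1 = min_max_scale(raw_mor1)
```

Scaling each indicator separately keeps an indicator with a large raw range from dominating the distance computations used by the clustering and anomaly detection methods that follow.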

Anomaly Detection Method
Previous studies have shown that the commonly used methods for determining threshold values of dangerous driving behavior indicators include statistical methods based on data distribution, clustering methods and anomaly detection methods.
In this paper, four methods (K-means clustering, the local outlier factor algorithm, Isolation Forest and OneClassSVM) are employed to analyze dangerous driving behavior. Based on the five dangerous driving behavior indicators of each vehicle, normal drivers and dangerous drivers can be distinguished either by clustering samples with similar data characteristics or by finding outlying samples with an anomaly detection method.
K-means clustering is a commonly used unsupervised learning method and can be used to identify and detect abnormal points. Consider a set of observations $(x_1, x_2, \dots, x_n)$, where each $x_j$ is a d-dimensional real vector. The purpose of K-means clustering is to divide the n observations into k clusters $\{C_1, C_2, \dots, C_k\}$ so as to minimize the within-cluster sum of squares. The objective function of K-means clustering is as follows:
$$\min_{C_1, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2$$
where $\bar{x}_i$ represents the vector mean of cluster $C_i$.
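As an illustration of the K-means objective, the following sketch runs Lloyd's iteration (the usual heuristic for this objective) from given initial centers and reports the within-cluster sum of squares. The toy points, the initial centers and all function names are assumptions made for the example, not the paper's data or implementation.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(points):
    return [sum(column) / len(points) for column in zip(*points)]


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]


def kmeans(points, initial_centers, n_iter=20):
    """Lloyd's algorithm; returns (labels, centers, objective value)."""
    centers = [list(c) for c in initial_centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean_vector(members)
    labels = assign(points, centers)
    objective = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, objective


# Toy two-dimensional samples: three low-risk points and two high-risk points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers, objective = kmeans(points, [points[0], points[3]])
```

In a two-cluster setting like the one studied here, one cluster would then be interpreted as the dangerous-driver group; which cluster that is has to be decided from the MOR values, since K-means itself attaches no meaning to the labels.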
Isolation Forest is an efficient anomaly detection algorithm that has been applied successfully in a variety of anomaly detection scenarios. The basic principle is to construct multiple random binary trees and find outliers by calculating the average path length from the sample to the leaf nodes over all trees. The anomaly score of a sample in the isolation forest is as follows [22]:
$$S(x, \phi) = 2^{-\frac{E(h(x))}{c(\phi)}}$$
where $E(h(x))$ represents the average path length of sample $x$ to the leaf node over the trees, $\phi$ represents the number of training samples of a single binary tree, and $c(\phi)$ represents the average path length of the training samples. The larger the $S$ value, the more abnormal the data.
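To make the score formula concrete, the sketch below evaluates it for a given mean path length. It assumes the normalizing term $c(\phi)$ takes the form used in the original isolation forest formulation, $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(i) \approx \ln(i) + 0.5772$; the function names and the subsample size of 256 are illustrative choices, not values reported in this paper.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n) = 2 H(n-1) - 2 (n-1) / n, with H(i) approximated by ln(i) + gamma."""
    if n < 2:
        raise ValueError("need at least two training samples per tree")
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """S = 2 ** (-E(h(x)) / c(n)); values close to 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n_samples = 256  # illustrative subsample size per tree
typical = anomaly_score(average_path_length(n_samples), n_samples)
isolated_early = anomaly_score(2.0, n_samples)
```

A sample whose mean path length equals $c(\phi)$ scores exactly 0.5, while samples that are isolated after only a few splits score closer to 1.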
The local outlier factor (LOF) algorithm [23] is also an unsupervised anomaly detection method, which identifies abnormal data points as those whose local density deviates from that of their neighbors [24]. OneClassSVM is an unsupervised anomaly detection method [25]. Its main principle is to construct a hypersphere boundary around the data in the feature space, taking the minimization of the hypersphere volume as the objective function, so as to identify the outliers.
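Suppose the training set is $X = \{x_1, x_2, \dots, x_n\}$, $\omega$ is the normal vector of the classification hyperplane, $\nu$ is the trade-off parameter, $\delta_i$ are the slack variables, $L$ is the number of samples, $\rho$ is the offset of the classification surface relative to the origin, and $\phi(x_i)$ is the mapping function. The objective function is as follows:
$$\min_{\omega, \delta, \rho} \; \frac{1}{2} \lVert \omega \rVert^2 + \frac{1}{\nu L} \sum_{i=1}^{L} \delta_i - \rho \quad \text{s.t.} \quad \omega \cdot \phi(x_i) \ge \rho - \delta_i, \; \delta_i \ge 0$$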

Evaluation of Classification Method
For the evaluation of classification methods, this paper applies the silhouette (contour) coefficient, the correlation coefficient and category feature analysis.

• Silhouette (contour) coefficient
The silhouette coefficient is usually used in clustering situations where the actual category information is unknown. For a single sample, suppose $a$ is its average distance from the other samples in the same category, and $b$ is its average distance from the samples in a different category. The silhouette coefficient is:
$$s = \frac{b - a}{\max(a, b)}$$
For a sample set, the silhouette coefficient is the average of the silhouette coefficients of all samples. The larger the coefficient, the closer the samples of the same type are to each other and the farther apart the samples of different types are.

• Correlation coefficient
This article uses Spearman's rank correlation coefficient to measure the correlation between the driving risk indices and the driving behavior categories. Assuming that two data vectors in the sample are $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$, with $R(x_i)$ and $R(y_i)$ denoting the ranks of $x_i$ and $y_i$ and $\bar{R}_x$, $\bar{R}_y$ their means, the correlation coefficient is calculated as follows:
$$\rho_{xy} = \frac{\sum_{i=1}^{n} (R(x_i) - \bar{R}_x)(R(y_i) - \bar{R}_y)}{\sqrt{\sum_{i=1}^{n} (R(x_i) - \bar{R}_x)^2 \sum_{i=1}^{n} (R(y_i) - \bar{R}_y)^2}}$$
When the p-value is less than 0.05, the correlation between the two vectors is considered significant. The larger the absolute value of the correlation coefficient $\rho_{xy}$, the stronger the correlation, and a positive value of $\rho_{xy}$ indicates a positive correlation between the two vectors.

• Category feature analysis
Unsupervised algorithms usually divide the data into several parts through clustering or anomaly detection, and evaluating the algorithm often requires examining the data characteristics of these parts. In the dangerous driving behavior detection scenario, the detection algorithms used in this paper (both clustering and anomaly detection methods) divide the data into normal driving behavior and dangerous driving behavior, and this paper proposes a detection effect index for dangerous drivers. Suppose that an unsupervised detection algorithm yields two driver sample categories: the normal driver sample $X = (x_1, x_2, \dots, x_n)$ and the dangerous driver sample $Y = (y_1, y_2, \dots, y_m)$. As shown in Formula (6), the detection effect index of the algorithm is composed of the detection effect indices of the data in MOR1 to MOR5, where $F_{ie}$ represents the detection effect index of a given detection algorithm in the i-th MOR. As shown in Formula (7), the detection performance in the i-th MOR (i = 1, 2, . . ., 5) mainly considers three factors: the normal drivers' maximum MOR feature $MOR_{ix}$, the dangerous drivers' distribution shape feature $d_{iy}$ and the normal drivers' distribution shape feature $d_{ix}$.
Here, $k_1$, $k_2$ and $k_3$ are the weights of the maximum MOR feature of normal drivers, the distribution feature of dangerous drivers, and the distribution feature of normal drivers, respectively. This research considers the three to be equally important, so this article takes $k_1 = 1$, $k_2 = 1$, $k_3 = 1$.
The normal drivers' maximum MOR feature ($MOR_{ix}$) represents the maximum of the i-th MOR values in the normal driver sample obtained by the detection algorithm. $d_{ix}$ represents the distribution of the normal drivers' MOR values obtained by the detection algorithm over the value range of the i-th MOR. The specific definition is shown in Formula (10). $d_{iy}$ represents the distribution of the dangerous drivers' MOR values obtained by the detection algorithm over the value range of the i-th MOR. The specific definition is shown in Formula (11).
Here, $N_{iy \ge 0.5}$ represents the number of drivers in the dangerous driver sample whose i-th MOR value is greater than or equal to 0.5, and $N_{iy < 0.5}$ represents the number of drivers in the dangerous driver sample whose i-th MOR value is less than 0.5. Assuming $x_i \in X(x_1, x_2, \dots, x_n)$, $N(i)_k$ is defined as shown in Formula (12).
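Of the three evaluation methods above, the silhouette coefficient and Spearman's rank correlation have standard definitions that can be sketched directly. The code below is an illustration only: it assumes Euclidean distance for the silhouette coefficient and average ranks for ties in the Spearman coefficient, and the sample points, MOR values and labels are invented. The detection effect index of Formulas (6) to (12) is not reproduced here.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [euclidean(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:  # singleton cluster: use the common convention s = 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(euclidean(p, q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented example: two normal and two dangerous samples.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
good_labels = [0, 0, 1, 1]
sil_good = silhouette_score(points, good_labels)
sil_bad = silhouette_score(points, [0, 1, 0, 1])
rho = spearman([0.1, 0.2, 0.8, 0.9], good_labels)  # MOR value vs. label
```

A labeling that separates the two groups gives a silhouette coefficient near 1, a labeling that mixes them gives a negative value, and a positive `rho` indicates that larger MOR values go with the dangerous label.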

Extraction of Characteristics of Driving Behavior
Driving trajectory data, such as speed and acceleration, are time series data. The purpose of this research is to identify dangerous driving behaviors by extracting the characteristics of driving trajectory data. In behavior recognition research, the characteristics of time series data are mainly extracted in the time domain or in the frequency domain. Because the number of observed frames differs from vehicle to vehicle in the experimental data, the time series of the speed and acceleration of each vehicle cannot be used directly as the input of the recognition model for identifying dangerous driving behavior.
In this paper, the time series of each driving characteristic is converted into signal amplitudes in the frequency domain, and the first 15 frequency-domain components obtained in this way are used as the new input features of the model.
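The feature extraction described above can be sketched as follows. This is a minimal illustration that evaluates the discrete Fourier transform directly and keeps the amplitudes of the first 15 components; the function name, the synthetic speed series and the choice to leave the amplitudes unnormalized are assumptions, since the paper does not state these details.

```python
import cmath
import math


def dft_amplitudes(series, n_components=15):
    """Amplitudes |DFT_k| for k = 0 .. n_components - 1 of a real-valued series."""
    n = len(series)
    if n_components > n:
        raise ValueError("series is shorter than the requested number of components")
    amplitudes = []
    for k in range(n_components):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series))
        amplitudes.append(abs(coeff))
    return amplitudes


# Two synthetic speed series with different numbers of observed frames.
speed_a = [20.0 + math.sin(2 * math.pi * 3 * t / 300) for t in range(300)]
speed_b = [22.0 for _ in range(450)]
features_a = dft_amplitudes(speed_a)
features_b = dft_amplitudes(speed_b)
```

Both vehicles end up with a feature vector of length 15 even though one was observed for 300 frames and the other for 450, which is what allows the series to be fed to a single recognition model.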
The DFT of a given time series $(x_0, x_1, \dots, x_{N-1})$ is defined as N complex numbers $(DFT_0, DFT_1, \dots, DFT_{N-1})$:
$$DFT_k = \sum_{n=0}^{N-1} x_n e^{-i 2 \pi k n / N}, \quad k = 0, 1, \dots, N-1$$
where $i$ is the imaginary unit and $e$ is the base of the natural logarithm.
Based on the above modeling method, the driving behavior of each driver was labeled and the driving behavior characteristics of each driver were obtained. This section analyzes and compares several recognition models to improve the prediction accuracy. Analysis of the proportion of dangerous driving behaviors shows that the driving behavior recognition dataset studied in this paper is extremely imbalanced, so this paper compares multiple imbalanced recognition models to find the optimal algorithm.
All models of dangerous driving behavior recognition employed in this paper are shown in Table 2. Among them, Adaboost and Xgboost are commonly used machine learning algorithms, and they are applied directly to the driving behavior data without imbalance processing. "Smote+" means that the data are processed with the SMOTE method before the model is trained on the driving behavior data. "Rus+" means that the data are processed with the random undersampling method before the machine learning model is trained on the driving behavior data. For example, "Smote + Adaboost" means that the training dataset of the Adaboost recognition model is first processed by the SMOTE method, so that the amount of data in each category of the training set is the same. Smoteboost and Rusboost are imbalanced boosting algorithms. The standard Adaboost boosting algorithm assigns the same weight to all misclassified samples in each iteration, whereas the Smoteboost algorithm applies SMOTE sampling in each iteration to gradually improve the proportion of minority samples. In the same way, Rusboost applies random undersampling in each iteration.
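The two resampling strategies named above can be sketched in a few lines. The code is a simplified illustration under stated assumptions: `random_undersample` keeps a random majority subset of the same size as the minority class, and `smote_oversample` follows the basic SMOTE idea of interpolating between a minority sample and one of its k nearest minority neighbors. The function names, the toy samples and the seed are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class as large as the minority class."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_oversample(minority, n_new, k, rng):
    """Create n_new synthetic minority samples by interpolation (needs >= 2 samples)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: squared_distance(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


rng = random.Random(0)
normal = [[float(i), 0.0] for i in range(20)]       # majority class
dangerous = [[0.0, 5.0], [1.0, 5.0], [0.0, 6.0]]    # minority class

kept_normal, kept_dangerous = random_undersample(normal, dangerous, rng)
new_dangerous = smote_oversample(dangerous, len(normal) - len(dangerous), 2, rng)
```

Undersampling discards most of the normal drivers, whereas SMOTE keeps all of them and fills the minority region with interpolated points; either way, the two classes end up the same size in the training set.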
To the best of the authors' knowledge, an imbalance-based improvement of Xgboost has not yet been tried in the field of dangerous driving behavior recognition. In the improved algorithm, an imbalance factor is introduced into the loss function of Xgboost, and the classification effect is adjusted by optimizing the value of the imbalance factor. The model can thereby better improve the recognition performance for the minority category, namely dangerous driving behavior.
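The exact form of the modified loss is not given in this section, so the following is only a sketch of one common way to introduce an imbalance factor: a weighted cross-entropy in which the positive (dangerous) class is multiplied by a factor `alpha`. This choice, and the names `alpha` and `weighted_logloss_grad_hess`, are assumptions. For the loss $-[\alpha y \log p + (1 - y) \log(1 - p)]$ with $p = \sigma(z)$, the first and second derivatives with respect to the raw score $z$ are $w (p - y)$ and $w\,p(1 - p)$, where $w = \alpha$ for $y = 1$ and $w = 1$ otherwise.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_logloss_grad_hess(raw_scores, labels, alpha):
    """Gradient and Hessian of the weighted cross-entropy w.r.t. the raw scores.

    alpha > 1 makes errors on the positive (dangerous) class more costly.
    """
    grads, hessians = [], []
    for z, y in zip(raw_scores, labels):
        p = sigmoid(z)
        weight = alpha if y == 1 else 1.0
        grads.append(weight * (p - y))
        hessians.append(weight * p * (1.0 - p))
    return grads, hessians


# One dangerous and one normal sample, both with raw score 0 (p = 0.5).
grads, hessians = weighted_logloss_grad_hess([0.0, 0.0], [1, 0], alpha=5.0)
```

Gradient boosting implementations that accept a user-defined objective typically consume exactly these two per-sample quantities, so tuning `alpha` would correspond to optimizing the imbalance factor described above.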

Evaluation Index of Identification Model
To evaluate the performance of the dangerous driving behavior recognition model, this paper employs four performance indicators: precision (correct rate), recall, F1 score and AUPRC.
Precision is defined as follows:
$$Precision = \frac{TP}{TP + FP}$$
where TP is the number of dangerous drivers who are correctly identified, and FP is the number of normal drivers who are mistakenly identified as dangerous drivers.
Recall is defined as follows:
$$Recall = \frac{TP}{TP + FN}$$
where FN is the number of dangerous drivers mistakenly identified as normal drivers.
The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst value at 0. The F1 score is defined as follows:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
In general, when the numbers of observations in each category are approximately equal, the Receiver Operating Characteristic curve (ROC curve) should be used. When the classes are imbalanced, the precision-recall curve should be used, because the ROC curve of an imbalanced dataset may be deceptive and lead to misinterpretation of model performance [29]. Therefore, this research uses the area under the entire precision-recall curve (AUPRC) as the evaluation index to compare the performance of the algorithms.
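The four indicators can be computed as in the sketch below. Precision, recall and F1 follow the definitions above; for the area under the precision-recall curve, the sketch assumes the step-wise estimate usually called average precision and does not treat tied scores specially. The labels, predictions and scores are invented.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 with 1 = dangerous driver, 0 = normal driver."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties not handled)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("no positive samples")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, area = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            area += tp / rank  # precision at the point where recall increases
    return area / n_pos


# Invented example: 2 dangerous drivers among 10 vehicles.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auprc = average_precision(y_true, scores)
```

In this toy case, a model that labels every vehicle as normal would still be right for 8 of 10 vehicles while having zero recall, which is why indicators centered on the dangerous class are used here.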

Data Description
The dataset consists of naturalistic driving vehicle trajectory data taken by drones at high altitude over an expressway, with a positioning error of less than 10 cm. The road scene is a straight two-way four-lane highway, and during data preprocessing the vehicle trajectory parameters of the two directions of the road are processed separately. In the video detection process, the length of time for which each vehicle is continuously detected and identified differs; that is, the number of frames is not the same for every vehicle. Therefore, this article first conducts a statistical analysis of the number of frames in which each vehicle appears in the video data. In the collected data, most of the vehicles were identified and observed for more than 10 s (1 s = 10 frames) and the number of observed frames per vehicle is mostly between 250 and 550. To better support the research on driver behavior recognition, this study focuses on the vehicles with more than 300 observed frames. The research objects comprise a total of 8917 vehicles, including 836 large cars and 8081 small cars.

Results

Test Results of Dangerous Driving Behavior
The clustering method usually does not impose a mandatory requirement on the number of clustered samples, but using an anomaly detection algorithm to detect dangerous drivers requires the proportion of dangerous drivers to be determined in advance. Based on the abovementioned modeling methods, this paper calculates the evaluation index of each dangerous driving behavior detection method to determine the proportion of dangerous driving behavior used in the detection algorithm. Subgraphs a, b, and c of Figure 2 show the curves of the detection effect of three algorithms (OneClassSVM, isolation forest and local outlier factor) under different proportions of dangerous drivers. The results show that the classification effect of OneClassSVM is best when the proportion of dangerous drivers is 0.02. The optimal proportion of dangerous drivers is 0.053 for isolation forest and 0.024 for the local outlier factor. Figure 2d shows the optimal detection effect obtained by the various detection methods, and the results show that OneClassSVM performs best for dangerous driving behavior detection.
Figure 2. (d) Comparison of optimal detection effect.
The various detection methods yield different proportions of dangerous driving behaviors. As shown in Figure 3, the K-means clustering method judges the largest number of drivers to be dangerous: a total of 776 vehicles are judged as engaging in dangerous driving behaviors. The OneClassSVM algorithm yields the fewest dangerous drivers, with a total of 178 drivers judged to exhibit dangerous driving behavior.
Next, this paper uses the silhouette (contour) coefficient, which is commonly used for clustering methods, to compare the dangerous driver detection methods. As shown in Figure 4, the K-means clustering method has the largest silhouette coefficient, followed by isolation forest and OneClassSVM. The silhouette coefficient obtained by the local outlier factor is lower than those of the other three methods. This shows that, compared with the local outlier factor, the samples of dangerous drivers obtained by the other three methods are closer to each other and farther from the samples of normal drivers, so the classification performance is relatively better. However, the samples of dangerous drivers do not necessarily have similar MOR values, so further correlation analysis of the results and analysis of the MOR distribution of dangerous drivers are needed.
In order to verify the effectiveness of the detection algorithms, this paper uses Spearman's correlation coefficient to analyze the correlation between category labels and feature variables. Figure 5 shows the heat map of the correlation coefficients between the driving risk indicators (MOR1~MOR5) and the labels obtained by the various detection methods. The legend on the right side of the heat map shows the color depth corresponding to different correlation coefficients. The labels are denoted Isolation_label (obtained through isolation forest), kmean_label (obtained through k-means), LOF_label (obtained through the local outlier factor) and OneClassSVM_label (obtained through OneClassSVM).
The data label is defined as a two-class label, where 1 represents dangerous driving and 0 represents normal driving. In the heat map, the closer the color is to white, the weaker the correlation, and the closer it is to blue, the stronger the correlation. It can be seen from the figure that the correlation coefficients between the labels obtained by each detection method and the MOR values of the driving risk indicators are all positive, indicating that the greater the degree of driving risk, the more likely the driving is to be labeled dangerous; the lowest correlation is 0.014. In this sense, all of these detection methods are effective. From the color distribution of the heat map, the correlation between most labels and the MOR values of the driving risk indices is not strong, for example, between MOR3 and the labels of all four algorithms. The data labels obtained by isolation forest and k-means clustering have a strong correlation with MOR4 and MOR5, and the data labels obtained by the local outlier factor and OneClassSVM correlate more strongly with MOR1, but the labels obtained by the local outlier factor correlate only very weakly with MOR1~MOR5 overall. The p-values of all correlation coefficients are less than 0.01.
Finally, this paper compares the data characteristics of the two types of samples (normal drivers and dangerous drivers) obtained by the various detection methods. Figure 6 shows the feature distribution maps obtained by the four detection methods. As shown in Figure 6a, the blue dots represent samples of normal drivers, and the orange dots represent samples of dangerous drivers. The abscissa represents the five driving risk indicators (MOR1~MOR5) and the ordinate represents the value of MOR. Each driving risk indicator on the abscissa corresponds to two columns of distribution maps (normal driving samples and dangerous driving samples). According to the definition of the dangerous driving risk indicators given above, dangerous driving behavior should be characterized by relatively large values of some or all of the risk indicators, whereas the MOR values of normal driver samples should be relatively small. Therefore, this research suggests that the range of feature values for dangerous driving behavior should be wider than that for normal driving behavior. The values for normal driving behavior should be confined to a limited range and should not be too large. It can be seen from Figure 6c that, in the classification results obtained by k-means clustering, the feature values MOR1 and MOR3 of the normal driver samples are distributed almost evenly across the entire value range. Since a greater MOR value means a greater risk of dangerous driving, the MOR values in the normal driver samples obtained by this method are too high. Therefore, this research suggests that this classification method is not practical for identifying dangerous driving behavior. It can be seen from Figure 6b that in the normal driver category obtained by the isolation forest algorithm, the feature value of MOR5 is basically 0.
Evidently, this method classifies drivers mainly by whether a forced insertion during a lane change occurs. An MOR5 feature value greater than 0 means that the vehicle has inserted itself into another lane. Normal drivers may also perform such insertions, and dangerous driving behavior should not be classified on the basis of a single feature. Therefore, the results obtained by this method cannot truly satisfy the needs of dangerous driving behavior recognition. The feature value distributions in Figure 6a and Figure 6d are relatively similar. However, the maximum MOR3 value of the normal drivers in Figure 6a is higher than that in Figure 6d. In addition, both subfigures clearly show that the range of feature values for dangerous driving behavior is wider than that for normal driving behavior, and that the MOR values of normal driving behavior samples are mainly distributed below 0.6.
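The visual comparison of Figure 6 can also be expressed as a simple numerical check. The following is a minimal sketch that assumes each sample is a tuple of the five MOR values with a 0/1 label. The hard cap of 0.6 for normal drivers is an assumption derived from the observation above that normal samples lie mainly below 0.6; the function names and toy values are hypothetical.

```python
def mor_ranges(samples, labels):
    """Return {label: [(min, max) per MOR indicator]} for each class label."""
    ranges = {}
    for cls in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        columns = list(zip(*rows))  # transpose: one tuple per indicator
        ranges[cls] = [(min(c), max(c)) for c in columns]
    return ranges


def passes_range_check(ranges, normal_cap=0.6):
    """True if normal drivers stay below the cap and dangerous ranges are wider.

    normal_cap is an assumed threshold, not a value prescribed by the paper.
    """
    normal, dangerous = ranges[0], ranges[1]
    normal_is_low = all(hi <= normal_cap for _, hi in normal)
    dangerous_is_wider = all(
        (d_hi - d_lo) >= (n_hi - n_lo)
        for (n_lo, n_hi), (d_lo, d_hi) in zip(normal, dangerous)
    )
    return normal_is_low and dangerous_is_wider


# Hypothetical toy samples: (MOR1, MOR2, MOR3, MOR4, MOR5) per driver.
samples = [
    (0.10, 0.05, 0.20, 0.00, 0.00),
    (0.30, 0.15, 0.40, 0.10, 0.00),
    (0.25, 0.10, 0.55, 0.05, 0.10),
    (0.90, 0.20, 0.30, 0.70, 0.00),
    (0.20, 0.85, 0.95, 0.10, 0.60),
]
labels = [0, 0, 0, 1, 1]  # 0 = normal driver, 1 = dangerous driver

ranges = mor_ranges(samples, labels)
print(passes_range_check(ranges))
```

A labeling in which normal drivers span the whole value range of an indicator, as described for k-means, would fail the first condition of this check.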
Combining the above evaluation methods, this paper recommends using the OneClassSVM method to label dangerous drivers. This conclusion is consistent with the one obtained from the analysis of the detection performance indices proposed in this paper.

Results of Dangerous Driving Behavior Recognition
Based on the abovementioned modeling methods, OneClassSVM anomaly detection is performed on all the vehicle trajectory data to obtain the labels of dangerous drivers. After that, the speed and acceleration of each vehicle are Fourier transformed to extract the parameters of driver behavior characteristics. The results of the comparative analysis of the recognition and classification models are shown in Table 3. Each recognition model is evaluated with 5-fold stratified cross-validation. The dataset is divided into five equal parts. Each part is selected, in turn, as the test set, and the remaining four parts are used as training data. After the five rounds of cross-validation, the precision (correctness rate), recall, F1 value and AUPRC value obtained on the five test sets are averaged to represent the overall recognition performance of the model. As shown in Table 3, the precision of the improved Xgboost algorithm for imbalanced data is 83.5%, second only to the 84% of Xgboost, indicating that only about 16% of the drivers identified as dangerous by these models are misjudged. RusBoost has the highest recall, which means that 94.6% of dangerous drivers are recognized. The recall of the improved Xgboost is higher than that of Xgboost and Adaboost. The results show that the F1 value and AUPRC value obtained by the improved Xgboost are the highest. In addition, given the imbalanced nature of the data, simply preprocessing the data through sampling methods cannot significantly improve the performance of the algorithm in identifying dangerous driving behaviors.
Other approaches, namely imbalanced boosting algorithms and methods that change the loss function, are more helpful for improving recognition performance. They adjust the weights of the model through iterative training and strengthen the learning of the characteristics of minority samples, which works better than resampling the data in advance.
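The evaluation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: folds are built by dealing the samples of each class round-robin so that every fold keeps roughly the same share of dangerous drivers, and precision, recall and F1 are computed for the dangerous class (label 1) and averaged over the five test folds. AUPRC is omitted for brevity. The threshold classifier and the toy data are hypothetical stand-ins for the compared models and the trajectory features; a library routine such as scikit-learn's StratifiedKFold would normally be used for the split.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that keep the class ratio roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin
    return folds


def precision_recall_f1(y_true, y_pred):
    """Metrics for the positive (dangerous, label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cross_validate(features, labels, fit, k=5, seed=0):
    """Average test-fold metrics; fit(X, y) must return a predict function."""
    scores = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predict = fit([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        y_pred = [predict(features[i]) for i in test_idx]
        y_true = [labels[i] for i in test_idx]
        scores.append(precision_recall_f1(y_true, y_pred))
    return tuple(sum(s[m] for s in scores) / k for m in range(3))


def fit_threshold(X, y):
    """Hypothetical stand-in model: cut halfway between the two class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


# Toy imbalanced data: 45 normal drivers and 5 dangerous drivers.
features = [0.01 * i for i in range(45)] + [0.9 + 0.01 * i for i in range(5)]
labels = [0] * 45 + [1] * 5
precision, recall, f1 = cross_validate(features, labels, fit_threshold)
print(precision, recall, f1)
```

Stratification matters here because, with so few dangerous drivers, a purely random split could leave a test fold without any positive sample, in which case recall for that fold would be undefined.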

Conclusions
Using real vehicle trajectory data collected from different highway scenes, this paper defined crash risk indicators for five types of dangerous driving behavior: dangerous car following, lateral deviation, frequent acceleration and deceleration, frequent lane changes, and forced insertion. Based on these risk indicators, a variety of methods were employed and compared to detect anomalies in the vehicle trajectory data. An evaluation method for anomaly detection results was proposed to analyze and evaluate the classification results of dangerous drivers. The results show that the spatial structure of the data in the dangerous driver category obtained by OneClassSVM is more accurate, and that its MOR range is wider than that of the normal driver category. The MOR values of the normal driver category are basically distributed in the low range of 0 to 0.6. OneClassSVM was therefore used to detect and label dangerous drivers. To address the extreme imbalance of the dangerous driver dataset, this paper compared a variety of processing methods. The results show that the improved Xgboost algorithm performs best in identifying dangerous drivers, followed by the Xgboost algorithm and the RusBoost imbalanced boosting algorithm. In summary, this paper proposes an algorithm for detecting dangerous driving behaviors based on vehicle trajectories, which can effectively identify dangerous driving behaviors in advance.
However, in real life, driving behaviors are complex and diverse. In this paper, only the MOR value is used to identify different driving behaviors, and future work could consider more indicators. In addition, this paper only identifies and analyzes dangerous driving behavior based on video surveillance. Follow-up research could also use on-board equipment and study the two scenarios in combination.