Complementary Deep and Shallow Learning with Boosting for Public Transportation Safety

To monitor road safety, billions of records can be generated each day by the Controller Area Network (CAN) bus on public transportation. Automatically determining whether the driving behaviour of public transportation drivers is safe using artificial intelligence and machine learning techniques for big data analytics has recently become feasible. Because current methods suffer from high misclassification rates, our goal is to build a practical and accurate method for road safety prediction that automatically determines whether driving behaviour on public transportation is safe. Our main contributions are (1) a novel feature extraction method that compensates for the lack of informative features in raw CAN bus data, (2) a novel boosting method for driving behaviour classification (safe or unsafe) that combines the advantages of deep learning and shallow learning with much improved performance, and (3) an evaluation of our method on a real-world dataset with accurate labels provided by domain experts in the public transportation industry for the first time. The experiments show that the proposed boosting method with the proposed features outperforms seven other popular methods on the real-world dataset by 5.9% and 5.5%.


Motivations
Halving the current number of global deaths and injuries from road traffic accidents is one of the important Sustainable Development Goals of the 2030 Agenda for Sustainable Development adopted by the United Nations General Assembly. Traffic accidents not only bring huge financial losses to society but also cause great physical and mental harm [1,2]. Millions of people died in traffic accidents in 2018 [3], and most traffic accidents are caused by human error [4]. Analyzing the behavior of drivers, especially public transportation drivers, is important for protecting road safety [5][6][7]. To ensure safety, public transportation operators can be required to evaluate drivers and to identify dangerous drivers for retraining. For public transportation fleet management and monitoring, massive data is collected from vehicles using state-of-the-art sensor technologies such as MobilEye. In the control center, thousands of real-time events and alarms arrive every day from the sensors of the vehicles through wireless networks. Although it is virtually impossible to handle such a huge amount of data manually, accurate machine learning predictions for analyzing vehicle behavior have recently become feasible. Machine learning techniques have been applied to behavior analysis in different tasks with various kinds of data collected by sensors in moving vehicles [8][9][10][11]. We investigate efficient and accurate machine learning methods to classify whether driving behavior is safe, as drivers with unsafe behavior should be warned and offered retraining.

Challenges
Classifying a driver's behavior (safe/unsafe) is challenging. First, the industrial need for high classification performance cannot be satisfied by existing methods alone, as their misclassification rates can be high. High misclassification rates have been pointed out in previous literature [12,13], where it is suggested that modeling a driver's behavior is difficult. For example, the performance of k-nearest neighbour (KNN) models is greatly affected by unbalanced training data, and traffic safety data contains far fewer labels for accidents. Second, the data collected from the Controller Area Network (CAN) bus offers few features and therefore little information for driving behavior analysis, making it hard to train an accurate machine learning classifier. No existing method for road safety prediction extracts extra useful information from the features of CAN bus data. Last, because of the high cost of labeling and privacy issues with public transportation data, to the best of our knowledge there is no publicly available labeled dataset for predicting safe driving behavior on public transportation. This lack of urgently needed labels makes it hard to build and evaluate machine learning models.

Contributions
In this paper, our contributions include (1) a boosting method to make deep learning and statistical learning complement each other, (2) a novel method to compute extra time-series features to extract richer information, (3) extensive evaluation on a new real-world dataset with labels from experts in the public transportation domain.
Our motivation for using CAN bus data is that the CAN standard is one of the most important bus standards for vehicles [14][15][16][17]. The infrastructure already built into most vehicles bought by transportation companies records data efficiently from in-vehicle devices. The CAN standard is a serial data bus standard designed to enable electronic devices to communicate with each other, so CAN data can be easily obtained by transportation companies. However, only a small number of features are available in the data collected with the standard CAN bus system, and without sufficient useful features it is hard to find patterns that determine the driver's behavior.
The novel boosting method is proposed to combine the advantages of statistical learning and deep learning. Our experiments show that the ensemble with our proposed features outperforms every single state-of-the-art method we considered. The boosting method combines seven state-of-the-art machine learning methods: support vector machine (SVM), random forest (RF), k-nearest neighbour (KNN), discriminant analysis, the naive Bayes classifier, adaptive boosting (AdaBoost), and a deep learning neural network, the Long Short-Term Memory (LSTM) network. The proposed boosting method also outperforms the seven methods when the proposed feature extraction method is not used.
Extra features are computed using our method. Feature engineering is an important tool for extracting useful information from time series when the number of features is small. We therefore propose a method to compute extra time-series features from the raw data of the CAN bus system, and our experiments show that these features significantly improve the performance of the classifiers.
To completely evaluate methods in the real world, the experiments are conducted using a real-world dataset collected using the CAN bus system. Because of the high cost of sample labeling, there is no published real-world dataset with labels for analyzing the driving behavior of drivers on public transportation. The samples in the dataset are labeled by the experts of Transportes Urbanos de Macau (TransMac), which is one of the largest bus companies in Macao.
We divide the rest of this paper into five sections. In Section 2, we describe related work. Our dataset and the proposed method are described in detail in Sections 3 and 4. In Section 5, the proposed method is evaluated on two datasets with a comparison to various state-of-the-art machine learning methods. Section 6 concludes this work.

Related Work
In this work, we focus on safety classification with data from the Controller Area Network (CAN) bus. The development of the CAN bus started in 1983 and was released in 1986 [18]. This standard has recently become available in most embedded systems for vehicles. The CAN bus is one of five protocols used in the mandatory onboard diagnostic (OBD-II) standard. The OBD-II standard has been mandatory for all cars and light trucks sold in the United States since 1996, and the EOBD standard has been mandatory for all petrol vehicles sold in the European Union since 2001 and all diesel vehicles since 2004 [19,20]. A lot of recent research focuses on the analysis of sensor data from the CAN bus system [5,[21][22][23][24][25][26][27][28].
Machine learning plays an important role in building data analytics models to handle massive data [29,30]. Statistical learning and deep learning are important areas of machine learning. Methods based on statistical learning have been successfully applied to many related behavior analytics problems. In [11,31], Bayesian methods are used to predict braking behavior and model vehicle speeds. k-nearest neighbors (KNN) is employed to classify driving styles in [10]. Support vector machines (SVM), deep learning (DL), and decision trees (DT) have been applied to driving behavior prediction and accident risk prediction [8,9]. Deep learning techniques have become popular for many tasks in recent years [32,33]. Long Short-Term Memory (LSTM) is one of the most popular deep learning techniques for time-series problems [34]. LSTM networks are a special kind of recurrent neural network (RNN) that aims at learning long-term dependencies [35]. In contrast to the standard RNN, the repeating module of an LSTM network contains four interacting layers that enable it to change the information of the cell state. LSTM has been used to detect driver distraction [36]. Ensembles of machine learning methods have been shown to outperform any one particular technique most of the time [37][38][39], and we propose a heterogeneous boosting method to obtain better performance. In our experiments, our boosting method is compared with seven state-of-the-art methods.

Evaluation of Road Safety Predictions
For any application domain of machine learning, one of the most objective evaluations is how prediction models perform on real-world datasets. However, to the best of our knowledge, there is no published real-world labeled dataset for behavior analysis of public transportation drivers. In this work, we build a new dataset collected with the CAN bus system from TransMac, one of the two public bus companies in Macao.
One record is produced every three seconds from the readings of the sensors in a moving public vehicle, and there are 6451 records with 24 features in the new dataset. All 6451 records are labeled by domain experts: 507 unsafe cases and 5944 safe cases. In total, the recording time of our CAN bus data is 6451 × 3 = 19,353 s, which is 5.38 h of driving by professional bus drivers in the company. Although each sample contains 24 features, some features are irrelevant for training the machine learning model (see Table 1). As shown in Table 1, features like the vehicle identification are meaningless to the machine learning model. Features containing too much missing data cannot be useful either; for example, most entries of the "CANALARMSTATE" feature are N/A (Not Available). The features used are listed in Table 1 as a reference for the training of our method. In addition, descriptive statistics on the feature set used for training are shown in Table 2.

The Proposed Method
We propose a feature extraction method (see Algorithm 1) that extracts richer information from the change of feature vectors over time, and a boosting method (see Algorithm 2) that classifies whether the driving behavior of drivers on public transportation can be considered safe on the road. The feature extraction method is general and can be used with any machine learning classification method. The experiments show that our feature extraction method improves the performance of every classification method considered, and that our boosting method outperforms the other seven machine learning methods whether or not our feature extraction method is used.

Our Method for Richer Information With Feature Extraction
Missing data is common in industrial data collected from CAN bus systems. Features with too many N/A (Not Available) entries cannot be used to train the machine learning methods. In addition, features irrelevant to driving behavior analysis, like the identifier of the vehicle, are excluded. Therefore, only a few useful features are left for classification (see Table 1). The low dimensionality of the feature space severely limits the descriptive power of the samples, which makes it very difficult to obtain accurate machine learning models. We argue that richer information can be extracted from the change of feature values over time, and we hence propose a feature extraction method that provides extra useful time-series features to compensate for the lack of information in the original features. For example, the acceleration of the bus is important for driver behavior analysis, but it is not recorded in the original data; it can be obtained by computing the gradient of the velocity of the bus. The proposed feature extraction method is shown in Algorithm 1.

Algorithm 1 Our Method for Richer Information With Feature Extraction
Divide n samples into p periods, P 1 , ..., P p , by the recording time.
The input data of Algorithm 1 contains n samples, each with m features; the features irrelevant for training are excluded from these m features. For example, in our proposed dataset, the m features are the twelve features used to train the classification model (see Table 1). It is common in time-series analysis to use moving averages to filter out noise; this is a standard signal preprocessing step for time-series data when the data is noisy [40][41][42]. We noticed noise in the CAN bus data, so signal processing filtering techniques are employed in our work to achieve better training and classification.
For example, the average driving speed of a driver over two minutes (one period) is useful information for analyzing their driving behavior. Motivated by this, some time-series features are calculated per period (see Algorithm 1). The n samples are divided into p periods by the recording time. The value of p is a tunable parameter, and it depends on the time interval between two samples.
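As a concrete illustration, the period division and moving-average filtering described above can be sketched in Python (a minimal sketch; the paper's implementation is in MATLAB, and the window size and function names here are our own illustrative assumptions):

```python
import numpy as np

def moving_average(values, window=5):
    # Smooth a noisy 1-D signal with a simple moving average;
    # mode="same" keeps the output aligned with the input length.
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="same")

def assign_periods(timestamps, period_seconds=120):
    # Map each sample to one of the p periods P_1, ..., P_p by
    # recording time; one period covers two minutes as in the paper.
    start = timestamps[0]
    return [int((t - start) // period_seconds) for t in timestamps]
```

With records arriving every three seconds, a two-minute period groups 40 consecutive samples.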
In our boosting method, one period covers two minutes. For sample s i , m time-series features, {t i,1 , ..., t i,m }, are extracted from the raw features. The latitude and longitude features, f latitude and f longitude , are obtained from GPS information. The difference in the latitude/longitude values of adjacent samples can be used to measure the velocity of the bus. In Algorithm 1, t i,m+1 and t i,m+2 of sample s i are the differences between the values of sample s i and sample s i−1 for the latitude and longitude features, respectively. The gradient of a feature describes how fast its values change. The velocity, the mileage, the tire pressure, the engine speed, and the engine temperature are important for accurate classification. These features can reflect the different behaviors of drivers, and {t i,m+3 , ..., t i,m+7 } are calculated to find their rates of change.
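A sketch of these extra time-series features, assuming the fixed three-second recording interval mentioned earlier (the function name and column layout are our assumptions, not the paper's code):

```python
import numpy as np

def extract_time_series_features(lat, lon, velocity, interval=3.0):
    """Positional differences of adjacent samples plus gradients
    (rates of change) of a raw signal, as extra features.

    lat, lon, velocity: 1-D arrays ordered by recording time.
    interval: seconds between two consecutive CAN bus records.
    """
    d_lat = np.diff(lat, prepend=lat[0])   # t_{i,m+1}
    d_lon = np.diff(lon, prepend=lon[0])   # t_{i,m+2}
    # The gradient approximates how fast a signal changes per second;
    # e.g., the gradient of the velocity is the acceleration.
    accel = np.gradient(velocity, interval)
    return np.column_stack([d_lat, d_lon, accel])
```

The same gradient computation would apply to the mileage, tire pressure, engine speed, and engine temperature signals.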
Features are irrelevant if their changes are due to reasons other than driving behavior. In most cases, data used to train machine learning models includes irrelevant or redundant features. Certain machine learning models automatically pick useful features during training. We use techniques that effectively map useful features to the labels (driving behavior in our case), such as random forests (RF), which generate importance scores of features that can be used to ignore irrelevant ones. The RF model is then trained using a subset of features with high importance scores [43][44][45]. When the dimensionality of the feature space is large, performance could suffer from the curse of dimensionality [46,47]. However, as the number of features in our case is not exceedingly high, our experiments show that these techniques perform well without suffering from the curse of dimensionality.
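The importance-based selection can be sketched with scikit-learn (the paper's models are implemented in MATLAB; the `keep` parameter and estimator settings below are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_by_importance(X, y, keep=5, random_state=0):
    # Train an RF, rank features by impurity-based importance scores,
    # and keep only the `keep` highest-scoring columns.
    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    rf.fit(X, y)
    ranked = np.argsort(rf.feature_importances_)[::-1]
    selected = np.sort(ranked[:keep])
    return X[:, selected], selected
```

A final RF (or any other classifier) would then be retrained on the reduced feature matrix.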

Weak Learners
In our heterogeneous boosting method, all seven methods in Table 3 are used. Different machine learning methods have their own characteristics and are suitable for different tasks. SVM, traditionally a statistical machine learning method, is one of the most popular methods for two-class classification. In SVM, a hyperplane is constructed for classification after the data in the low-dimensional space is mapped to a high-dimensional space using a kernel function. In our boosting method, the Radial Basis Function (RBF) kernel is used as the kernel function of the SVM: K(x i , x j ) = exp(−‖x i − x j ‖ 2 / (2σ 2 )), where x i , x j are the feature vectors of sample i and sample j. k-nearest-neighbors (KNN) is a nonparametric classification method, which is simple but effective in machine learning.
The assumption underlying KNN is that a sample's output (e.g., its class in classification problems) is similar to that of samples with similar characteristics.
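As a small numerical sketch of the SVM kernel and the chi-square distance used by our KNN component (the σ bandwidth and the unnormalized chi-square form, which assumes non-negative feature values, are our assumptions):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # RBF kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def chi_square_distance(xi, xj, eps=1e-12):
    # Chi-square distance sum_f (x_if - x_jf)^2 / (x_if + x_jf);
    # eps guards against division by zero for all-zero features.
    return np.sum((xi - xj) ** 2 / (xi + xj + eps))
```

Both are per-pair functions; a KNN implementation would apply the distance between a query sample and every training sample to find the nearest neighbours.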
In the proposed boosting method, KNN uses the chi-square distance to measure the distance between samples. The chi-square distance between sample i and sample j is obtained as d(x i , x j ) = Σ f=1..F (x i,f − x j,f ) 2 / (x i,f + x j,f ), where x i , x j are the feature vectors and F is the dimensionality of the samples. RF, a method based on bagging, contains a number of decision trees for classification. In the RF method, Q decision trees are trained, and the classification result of RF is determined by majority vote: given a sample x, its result y produced by RF is y = mode{ f q (x) | q = 1, ..., Q }, where f q (x) is the result given by the q-th decision tree. Discriminant Analysis is also known as Fisher Discriminant Analysis; a linear combination of the features is obtained to classify the samples. AdaBoost, used as one of the component learners in our boosting method, is a homogeneous boosting method whose base learners are depth-1 decision trees. Naive Bayes applies Bayes' theorem with naive independence assumptions between the features. The LSTM network used in the proposed boosting method is shown in Figure 1. There are five layers in the LSTM network: the input layer, the LSTM layer, the fully connected layer, the softmax layer, and the classification layer. The network starts by feeding the CAN bus data into the input layer. Using the input data as the training set, the LSTM layer learns long-term dependencies between samples. To handle time-series data, the LSTM layer contains many LSTM blocks. One LSTM block takes an input sample and the output of the previous block as its input, and outputs a cell state and a hidden state. Finally, the classification results are generated by the fully connected layer, the softmax layer, and the classification layer through analyzing the long-term dependencies. In the LSTM network, the vanishing gradient problem is avoided by adding four components to the RNN (see Table 4 and Figure 1): the input gate, the output gate, the forget gate, and the cell candidate.
The interaction of the four components is shown in Table 4 and Figure 1. c t and h t denote the cell state and the hidden state produced by the t-th LSTM block. i t , o t , f t , and c̃ t denote the outputs of the input gate, the output gate, the forget gate, and the cell candidate. As shown in Figure 1, the outputs of the t-th LSTM block are the cell state c t and the hidden state h t . The cell state c t is obtained by c t = f t ⊙ c t−1 + i t ⊙ c̃ t , where ⊙ denotes the Hadamard product. The hidden state h t is obtained by h t = o t ⊙ σ(c t ), where σ denotes the state activation function of the LSTM block. It is popular to use two well-known techniques, principal component analysis and t-SNE, with deep learning; we apply these methods to process the raw CAN bus data before training the LSTM network. Table 3. Seven state-of-the-art machine learning methods used in the proposed boosting method.
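The cell-state and hidden-state updates of one LSTM block can be sketched numerically as follows (a minimal numpy sketch; the stacked gate parameters and the use of tanh for the candidate and state activations are the common convention, assumed here rather than stated in the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM block: gates i, f, o and cell candidate g, then
    c_t = f * c_{t-1} + i * g and h_t = o * tanh(c_t) (elementwise).
    W, U, b hold the stacked parameters of the four components."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g       # forget old state, admit candidate
    h_t = o * np.tanh(c_t)         # expose gated activation as output
    return h_t, c_t
```

Iterating this step over the samples of a period yields the long-term dependencies that the fully connected, softmax, and classification layers then turn into a safe/unsafe decision.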

Our Method with Boosting
Ensemble learning is a machine learning technique that combines multiple methods to obtain better performance than any particular one. The proposed boosting method combines seven state-of-the-art machine learning methods: support vector machine (SVM) [48,49], k-nearest neighbour (KNN) [50], random forest (RF) [43,51], naive Bayes [52], discriminant analysis [53], adaptive boosting (AdaBoost) [54], and Long Short-Term Memory (LSTM) [34]. The proposed boosting method is shown in Algorithm 2.

Algorithm 2 Our Method with Boosting
Output: Final strong classifier H(x).
1: Initialize weights w 1,i = 1/(2c) for safe samples and w 1,i = 1/(2(n−c)) for unsafe samples, where c is the number of safe samples.
2: for t = 1 to U do
3: Normalize the weights, w t,i = w t,i / ∑ j w t,j .
4: Train all g weak classifiers, l 1 (x), ..., l g (x), using the training data D with our time-series features.
5: Predict with the g classifiers, and set h t (x) to the classifier l u (x) with the highest weighted accuracy a.
As shown in Algorithm 2, there are n samples in the training data, and their dimensionality is m. There are g weak learners used in the algorithm; in the proposed boosting method, g equals seven (see Table 3). U is the number of weak classifiers chosen to form the final strong classifier H(x). The value of U is a tunable parameter, and it equals five in our method. Each of the g classifiers is trained with one particular machine learning method.
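A minimal Python sketch of the boosting loop (the paper's implementation is in MATLAB; Algorithm 2 does not state the weight update applied after step 5, so the AdaBoost-style update and the helper names below are our assumptions):

```python
import numpy as np

def heterogeneous_boost(learners, X, y, rounds=5):
    """Each round, every candidate learner is trained on the weighted
    data and the one with the lowest weighted error (i.e., highest
    weighted accuracy) is kept. `learners` is a list of factories
    returning fresh scikit-learn-style classifiers; y is in {0, 1}
    with 0 meaning safe."""
    n = len(y)
    c = int(np.sum(y == 0))                       # number of safe samples
    w = np.where(y == 0, 1.0 / (2 * c), 1.0 / (2 * (n - c)))
    ensemble = []
    for _ in range(rounds):
        w = w / w.sum()                           # step 3: normalize
        best, best_err = None, np.inf
        for make in learners:                     # steps 4-5
            clf = make().fit(X, y, sample_weight=w)
            err = np.sum(w * (clf.predict(X) != y))
            if err < best_err:
                best, best_err = clf, err
        # AdaBoost-style update (our assumption): shrink the weights
        # of correctly classified samples.
        beta = max(best_err, 1e-12) / max(1.0 - best_err, 1e-12)
        alpha = np.log(1.0 / beta)
        miss = best.predict(X) != y
        w = w * np.where(miss, 1.0, beta)
        ensemble.append((alpha, best))
    return ensemble

def boost_predict(ensemble, X):
    # Weighted vote of the U kept classifiers, mapped to {-1, +1}.
    score = sum(a * (clf.predict(X) * 2 - 1) for a, clf in ensemble)
    return (score > 0).astype(int)
```

In the paper the candidate pool contains the seven heterogeneous learners of Table 3 rather than a single stump type as in this toy usage.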

Experiments
Our experiments are conducted on two datasets: the public Warrigal dataset [55] and our own dataset provided by TransMac. The Warrigal dataset can be downloaded from http://its.acfr.usyd.edu.au/datasets/warrigal/. In the experiments, the proposed boosting method is compared with seven other popular machine learning methods: SVM, KNN, RF, Naive Bayes, Discriminant Analysis, AdaBoost, and LSTM.

Evaluation Metrics
The classification accuracy, sensitivity, specificity, and Area Under the Curve (AUC) are four of the most popular evaluation metrics for a binary classifier. The sensitivity is the probability of detection, while the specificity is one minus the probability of false alarm. The AUC value equals the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. The sensitivity and specificity are also referred to as the True Positive Rate (TPR) and the True Negative Rate (TNR), respectively. They are computed from the numbers of true negatives (TN), false negatives (FN), true positives (TP), and false positives (FP): Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP). The average accuracy is taken over ten repeated experiments to better evaluate the methods, and it is obtained using Accuracy = (TP + TN) / (TP + TN + FP + FN). (8)
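The three formulas can be computed directly from the confusion-matrix counts, e.g.:

```python
def binary_metrics(tp, tn, fp, fn):
    # Sensitivity (TPR), specificity (TNR), and accuracy from the
    # confusion-matrix counts, as in the equations above.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```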

Experimental Setup
We use MATLAB to implement the seven methods and the proposed methods. Grid search with a validation set is used to determine all the hyperparameters of our classifiers. The predicted dependent variable is safety, from 0 (safe) to 1 (unsafe). For our dataset, the training and test labels were determined by domain experts. For the Warrigal dataset, Label 1 indicates ongoing communications by the driver, which can violate safety guidelines, and 0 indicates time without communications by the driver. In our dataset, there are 507 unsafe cases and 5944 safe cases. In the Warrigal dataset, there are 17,716 unsafe cases and 205,722 safe cases.
To clearly compare all methods, the experiments on each dataset are divided into two parts. In the first part, to demonstrate that the proposed boosting method can outperform the other machine learning methods (see Table 3), all methods are trained on the raw data without our feature extraction method (see Algorithm 1). In the second part, to determine whether our feature extraction method improves the performance of machine learning methods, it is used to compute time-series features, and the data with these time-series features is used to train the methods. Comparing the accuracies of the two sets of experiments shows that the performance of the methods is improved by our feature extraction method. To avoid overfitting, in both sets of experiments two training set sizes are used: 70% and 90% of the dataset. The samples in the training set are randomly drawn from the whole dataset.

Safety Classification on the Warrigal Dataset
The Warrigal dataset [55] is a large dataset recording the interactions of large trucks and smaller vehicles (thirteen vehicles in total) in a large quarry-type environment. The data contains vehicle state information (such as positions, speeds, and headings) and information on vehicle-to-vehicle communications. Due to its large size, the published dataset is divided into many subsets, one per day of recording. Because of the large size of the dataset and our limited computational resources, only one subset (the data recorded on 1 February 2009, the first day in the dataset) is used in our experiments. There are 223,438 samples with twelve features in this subset. The dataset has no labels for safety predictions. We consider constant communication through wireless devices a distraction that could potentially lead to unsafe driving behavior. Samples recorded during constant verbal communication are labeled as potentially unsafe (or inattentive), while the other samples are labeled safe (or attentive). More specifically, Label 1 indicates ongoing communications by the driver, which can violate safety guidelines, and 0 indicates driving without wireless communication engaged. 17,716 samples are labeled potentially unsafe and 205,722 are labeled safe.
The comparison of prediction models trained with the raw features of the Warrigal dataset is shown in Table 5. In terms of classification accuracy, the proposed method outperforms the other seven methods by 1.5% and 1.1% with 70% and 90% of the dataset used for training, respectively. The proposed method also achieves the highest AUC. We further demonstrate the effectiveness of our feature extraction method: the performance of the models trained with our features is shown in Table 6. As observed in Tables 5 and 6, the proposed boosting method outperforms the other seven state-of-the-art approaches whether or not our feature extraction method is used. In addition, comparing the accuracies in Tables 5 and 6 shows that the performance of all machine learning methods is improved by our feature extraction method. Our method with feature extraction outperforms the other methods using raw features by 2.7% and 3.2%, with 70% and 90% of all samples randomly selected for training, respectively.

Safety Classification with Our Dataset
The previous experiments show that our method improves classification performance. We further apply our methods to a real-world industrial problem with TransMac, a public bus company in Macao. Since there is no real-world public labeled dataset for safety classification, the experiments are conducted on a dataset built from data collected from TransMac.
In this subsection, the experimental setup follows that of the previous subsection on a different dataset, with one set of experiments without our feature extraction method and one with it. The performance comparison of the methods is given in Tables 7 and 8. As shown in the two tables, our boosting method outperforms the other methods in all cases, and the performance of all methods is improved by our feature extraction method. Our boosting method with our feature extraction method outperforms the other methods using raw features by 5.9% and 5.5%, with 70% and 90% of the whole dataset randomly selected for training, respectively. Comparing Table 8 with Table 7 shows that the performance of every method is improved by our feature extraction method.

Further Evaluation
The CAN bus data is time-series data. To see whether predictions remain accurate when training is done on a very different time period, we conduct another set of experiments to further evaluate the performance of our method. In this experiment, the samples of the training set and the test set are collected at different times. As our real-world dataset was collected in four different periods, the samples are grouped into four subsets corresponding to these periods. A four-fold cross-validation is conducted with the four subsets, and the average classification accuracies of the different methods are compared in Table 9. As in Tables 7 and 8, the proposed method achieves the highest accuracy with or without our features. Boosting methods tend to be more expensive during training; we summarize the training times of the methods in Table 10. We also observe that performance does not improve when the number of weak learners in the final strong classifier, U, is larger than 5, although the total number of base learners built with state-of-the-art classification techniques in our heterogeneous ensemble is seven. The empirical result verifying this is shown in Table 11.
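The period-grouped cross-validation described above can be sketched as follows (a small Python sketch; the function name and the pluggable `train_and_score` callback are our assumptions):

```python
import numpy as np

def time_grouped_cv(X, y, period_ids, train_and_score):
    # Four-fold style cross-validation by recording period: each fold
    # tests on one period and trains on the remaining ones, so training
    # and test samples come from different times.
    accs = []
    for p in np.unique(period_ids):
        test = period_ids == p
        accs.append(train_and_score(X[~test], y[~test], X[test], y[test]))
    return float(np.mean(accs))
```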

Conclusions
Automatically determining whether the driving behavior of drivers on public transportation can be considered safe on the road using AI or machine learning techniques has recently become feasible. However, the industrial need for high classification performance cannot be satisfied by existing methods, including those based on computer vision, as their misclassification rates are too high. These high misclassification rates make it hard to compare and evaluate the performance of drivers on public transportation.
Our goal is to build a practical and accurate method for road safety predictions that automatically determines whether driving behavior on public transportation is safe. In this paper, our main contributions include (1) a novel feature extraction method that compensates for the lack of informative features in the data, (2) a novel boosting method for driving behavior classification (safe or unsafe) that combines the advantages of deep learning and traditional statistical learning with much improved performance, and (3) an evaluation on real-world data with accurate labels from experts in the public transportation industry for the first time. The experiments show that the proposed boosting method with the proposed features outperforms seven other popular methods on the real-world dataset by 5.9% and 5.5%.