Transportation Modes Classification Using Sensors on Smartphones

This paper investigates the classification of transportation and vehicular modes using big data from smartphone sensors. Three types of sensors are used: the accelerometer, magnetometer, and gyroscope. This study proposes improved features and uses three machine learning algorithms, namely decision trees, K-nearest neighbor, and support vector machine, to classify the user’s transportation and vehicular modes. In the experiments, we discuss and compare the performance from different perspectives, including the accuracy for both modes, the execution time, and the model size. Results show that the proposed features enhance the accuracy; the support vector machine provides the best classification accuracy but requires the largest prediction time. This paper also investigates vehicle mode classification and compares the results with those of the transportation modes.


Introduction
In recent years, smartphones have become increasingly popular. Each phone typically contains a variety of sensors, such as a GPS (Global Positioning System) sensor, a magnetometer, and a gyroscope. Therefore, it is easy to obtain a large amount of sensor data from smartphones. This paper uses the information from such sensors to detect different types of transportation modes. Classifying a person's transportation mode plays a crucial role in context-aware applications, and using sensors embedded in smartphones has been recognized as a good approach.
This issue has been studied extensively. For example, Elhoushi et al. [1] proposed an algorithm for detecting indoor motion such as walking, sitting, and standing. They used the accelerometer triad, the gyroscope triad, the magnetometer triad, and the barometer as the input sensors. Hemminki et al. [2] proposed an algorithm that uses smartphones to detect five transportation modes: bus, train, metro, tram, and car. They used kinematic motion classifiers to distinguish whether users were walking or not. Once motorized transportation was detected, the motorized classifier could classify the current transportation activity. Sasank et al. [3] used GPS and accelerometer data as the input data. After filtering out the noise, they built an instance-based decision tree as the classifier and used a discrete hidden Markov model to make the final decision. Ben et al. [4] collected accelerometer data. They used the magnitudes of the 250 FFT (Fast Fourier Transform) components and the statistics of the signal as features and used genetic data analysis and SVM (Support Vector Machine).
The transportation state includes 10 modes: still, walk, run, bike, motorcycle, car, bus, metro, train, and high speed rail (HSR). Compared to other similar studies which use small-scale data (several or dozens of hours) [1,19], such big data makes the results of this paper more convincing and general.
The database for the five transportation modes is summarized in Table 1. This paper treats the vehicular modes (i.e., motorcycle, car, bus, metro, train, and HSR) as a single mode: on a vehicle. These data are then separated into training and testing data for the performance evaluation. In this paper, we attempt to visualize the big data and the corresponding features from different perspectives. Figures 1 and 2 show the distribution of the raw data and averaged data, respectively, from the x-axis of the three sensors in the transportation mode. The raw data is a randomly selected 10 s segment, while the averaged data is obtained by computing the absolute value of the 1000 min average from the large dataset. These figures show that the long-term statistic is different from that of the raw data, verifying the importance of temporal processing in the features.
Similarly, Figures 3 and 4 show the vehicular cases. These figures show that the measurements are not as discriminative as those in Figures 1 and 2, which demonstrates the difficulty of vehicle mode detection.

Feature Extraction
With the database from Section 2.1, this paper integrates these data into diversified features. In this paper, 512 samples are grouped into a frame, and a moving window with 75% overlap is used to generate the next frame. In this setup, the monitoring period of each frame is 17.06 s. The 75% overlap means that 17.06 × 75% = 12.8 s of data are reused in the next frame to smooth the data continuity and to reduce the system delay. These frames are then transformed into various features. Because we select Reference [5] as a baseline, the seven features used in Reference [5] are listed below:
(1) Average of the accelerometer's magnitude.
(2) Standard deviation of the accelerometer's magnitude.
Figure 5 shows that the fourth feature outperforms the fifth one in transportation mode classification. Figures 5 and 6 again show that vehicular mode classification is more difficult than transportation mode classification. This motivates us to use more features. To improve accuracy for both transportation and vehicular modes, we select and combine useful features from existing works. However, because power and resources are constrained, the modifications and the increase in feature dimension should be minor.
The aim of this study is not to propose a new statistical feature, a topic which has been well investigated. Instead, we try to select and combine useful features from existing works under the power and dimension constraints for both the transportation and vehicular mode classification tasks. In fact, we heuristically tried thousands of subset combinations in the experiments of this study and report the best one as the proposed feature set. Next, this paper took six notable features based on the above-mentioned features and Liu et al. [20], and then devised eight other features. This paper combines these 14 features to classify training and testing data and to evaluate the accuracy. Note that among the proposed features, the first four features were proposed by [5], and the fifth and sixth were derived from [20]. The proposed features are described as follows:
(1) Average of the accelerometer's magnitude.
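The framing scheme and the first two accelerometer features can be illustrated with a minimal sketch. It assumes a sampling rate of about 30 Hz, which is only inferred from 512 samples spanning 17.06 s and is not stated in the text; the function names are hypothetical and the code is not the authors' implementation.

```python
import math

FRAME_SIZE = 512        # samples per frame (from the text)
OVERLAP = 0.75          # fraction of a frame reused by the next frame (from the text)
SAMPLE_RATE_HZ = 30.0   # assumption: inferred from 512 samples / 17.06 s

def frame_starts(n_samples, frame_size=FRAME_SIZE, overlap=OVERLAP):
    """Start index of every complete frame in a recording of n_samples."""
    hop = int(frame_size * (1.0 - overlap))  # 128 new samples per frame
    return list(range(0, n_samples - frame_size + 1, hop))

def magnitude(sample):
    """Euclidean norm of one (x, y, z) accelerometer reading."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)

def mean_and_std(values):
    """Average and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def accel_features(samples, frame_size=FRAME_SIZE, overlap=OVERLAP):
    """Features (1) and (2) per frame: mean and std of the magnitude."""
    mags = [magnitude(s) for s in samples]
    return [mean_and_std(mags[i:i + frame_size])
            for i in frame_starts(len(mags), frame_size, overlap)]
```

Under the 30 Hz assumption, each frame advances by 128 samples (about 4.27 s) and shares 384 samples (12.8 s) with the previous frame, which matches the overlap duration given above.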

Machine Learning Algorithms
With the training data from Section 2.2, this paper then uses three machine learning algorithms, namely decision tree, K-nearest neighbor, and support vector machine, to train classifiers. Each algorithm is introduced below.

Decision Tree (DT)
The DT algorithm lays out every possible outcome of a decision by categorizing the data at each step, and it can be used for both regression and classification. The resulting tree is a supportive tool that simplifies a given set of complex data. The decision tree consists of nodes and branches based on a rule. Each node represents a decision, whilst the branches that spread to the left or right from a node show that the data is being categorized further. Each time a decision is made, a new node is formed. This in effect forms the 'tree-like' graph that helps individuals visually analyze the data so that an accurate and meaningful decision can be derived.
A tree searches through the variables to find a value of a variable which splits the data into two or more groups. The best split minimizes the error (impurity) in the resulting subsets. To find the best split, we have to measure the degree of impurity of the child nodes [21]. The higher the impurity, the less skewed the class distribution will be. There are several ways to measure the impurity of a split. Some of the impurity measures are:
• Entropy: $H = -\sum_i p_i \log_2 p_i$
• Gini Impurity: an impurity-based metric that measures how often an element from a set would be labeled incorrectly. It can be measured as: $G = 1 - \sum_i p_i^2$
In these equations, $p_i$ is the probability mass of the i-th class, i.e., the fraction of samples at the node that belong to class i. Compared to other classification algorithms, decision trees are simple to understand, easy to interpret, and robust against skewed distributions, but a small change in the data can alter the results drastically. One more problem with decision trees is that they can overfit easily [22].
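The split search can be illustrated with a minimal sketch that scores candidate thresholds on a single feature by the weighted impurity of the two child nodes. It uses the standard definitions of entropy and Gini impurity over class frequencies; the helper names and the single-feature, binary-split setup are assumptions for illustration, not the configuration used in the experiments.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of the samples that belongs to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini impurity: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def best_threshold(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    n = len(values)
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

For example, with feature values `[1, 2, 3, 4]` and labels `["walk", "walk", "run", "run"]`, the threshold 2 yields two pure child nodes and therefore a weighted impurity of zero.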

K-Nearest Neighbor (KNN)
The KNN algorithm is a non-parametric method used for classification or regression, and the form of the output depends on which of the two it is used for. In classification, an object is assigned to the class that receives the majority of the votes of its neighbors; in regression, the output is based on the property values of its neighbors. KNN is a lazy learning algorithm: it does not build a model from the training data in advance but classifies new instances based on a similarity measure (i.e., a distance measure) to the stored training instances. It assigns an unlabeled instance to the most common class amongst its nearest neighbors based on the distance. Since no prior knowledge is used in KNN, its decision rule depends on the distance metric. A simple case of KNN is shown in Figure 7, where a new instance is classified based on the value of K.
The performance depends heavily on the way the distances are computed. The distance can be computed using one of three metrics, where $D(x, y)$ is the distance between two samples $x$ and $y$. The most commonly used distance metric is the Euclidean distance, $D(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. It should also be noted that these three distance metrics are only used for continuous variables; in the discrete or categorical case, the Hamming distance is used. Although KNN is robust and effective for coping with large training data, its weakness lies in the run-time performance: prediction is slow and computationally expensive when the training set is large.
Figure 7. A K-nearest neighbor model.
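The following minimal sketch shows majority-vote KNN classification with interchangeable distance functions. The text names only the Euclidean and Hamming distances explicitly; the Manhattan and Minkowski functions are included on the assumption that they are the other two continuous metrics, which the extracted text does not confirm. All function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    """Generalization with exponent q (assumed metric); q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def hamming(x, y):
    """Number of positions at which two categorical vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def knn_predict(train, query, k=3, distance=euclidean):
    """Majority vote among the k training items closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because every prediction sorts the whole training set by distance, the cost of a single query grows with the number of stored training vectors, which is the run-time weakness noted above.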

Support Vector Machine (SVM)
SVM is a very popular method, capable of performing classification and regression. It offers very promising results and can capture complex relationships without requiring difficult transformations. SVM constructs a set of hyperplanes in a high-dimensional space to separate categories of examples. With these separated categories, the differences between categories become clear, and unknown examples can be classified into a specific group more accurately. A good separation is achieved by the hyperplane that has the largest functional margin, which in turn lowers the generalization error. In SVM, the goal is to find a decision surface that is as far as possible from any data point. A simple scenario for support vectors and margin is shown in Figure 8, where the support vectors are the points that fall on or within the margin.
To maximize the margin for a given set of training data, the following optimization problem needs to be solved:
$\min_{w, b, \epsilon} \ \frac{1}{2}\|w\|^2 + C \sum_i \epsilon_i$ subject to $y_i (w^T x_i + b) \ge 1 - \epsilon_i$, $\epsilon_i \ge 0$,
where $y_i$ is either 1 or −1, indicating the class to which the point $x_i$ belongs. The parameter $w$ is the (not necessarily normalized) normal vector to the hyperplane. The parameter $C$ is the regularization parameter used to prevent overfitting. The parameter $b$ determines the offset of the hyperplane from the origin along the normal vector $w$.
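As a minimal sketch of the soft-margin formulation, the code below evaluates the objective for a given candidate hyperplane rather than solving the optimization. It assumes the constraint is written as y_i (w · x_i + b) ≥ 1 − ε_i with ε_i ≥ 0; the sign convention for b is an assumption, since the original equation did not survive extraction, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, y):
    """Smallest eps >= 0 that satisfies y * (w . x + b) >= 1 - eps."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 + C * sum of slacks for a candidate hyperplane (w, b)."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * dot(w, w) + C * total_slack

def predict(w, b, x):
    """Class label (+1 or -1) given by the side of the hyperplane x lies on."""
    return 1 if dot(w, x) + b >= 0 else -1
```

A point outside the margin on its correct side contributes zero slack, so only points on the wrong side of the margin boundary add to the penalty term; a larger C penalizes those violations more heavily.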

Transportation Mode Classification
This paper extracts 90,000 feature vectors for each mode and uses three machine learning algorithms to train classification models. This paper compares the results between two sets of features: one is the seven features based on Reference [5], and the other is the proposed features. First, two tables show the performance of each algorithm. Table 2 shows the general accuracy, prediction time, and model size of each algorithm with the seven features based on Reference [5], while Table 3 shows those with the proposed 14 features. In these tables, the general accuracy is the ratio of the correct results to the total number of test samples, the prediction time is how long each prediction takes in microseconds (i.e., $10^{-6}$ s), and the model size is the size of each model in megabytes (MB). The results show that DT reports the lowest prediction time and the smallest model size. On the other hand, SVM provides the best accuracy, whereas it incurs the largest prediction time. More importantly, the tables show that the proposed features significantly enhance the accuracy in the three machine learning algorithms. Specifically, DT improves from 74.65% to 79.59%, KNN improves from 77.33% to 86.86%, and SVM improves from 81.60% to 86.94%. With the proposed features, KNN shows a comparable performance to SVM and a slightly larger model size.
Next, Figure 9 compares the two feature sets on accuracy more clearly, showing that when the number of features increases from seven to 14, the accuracy clearly improves. The improvement is the most significant with the KNN method. Based on Figure 9, we can see that SVM has the best general accuracy. Nevertheless, from other operating points of view, KNN would also be a good choice because of its comparable accuracy and lower prediction time.
To analyze the results more intuitively, this paper constructs the confusion matrices of each algorithm with the proposed 14 features. In these confusion matrices (i.e., Tables 4-6), the column headers are the actual labels, and the row headers are the predicted labels. For instance, if a prediction result is the still mode and its actual label is also still, then this prediction is correct. If a prediction result is the biking mode but its actual label is walking, then the walking data was misattributed to the bicycle mode. In these confusion matrices, we find that in many cases the "in vehicle" data were misjudged as "on bike" and "still". Besides, the running mode always produces the most accurate result. This is because running shakes the smartphone severely, making the classification easy.
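The way the general accuracy and the per-mode accuracy follow from such a confusion matrix can be sketched as follows. The layout matches the description above (rows are predicted labels, columns are actual labels); the function names are hypothetical, and the counts in the usage note are made-up toy values, not results from Tables 4-6.

```python
def general_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all test samples.

    matrix[predicted][actual] holds the number of samples in each cell.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_mode_accuracy(matrix):
    """Fraction of each actual mode (one column) that was predicted correctly."""
    n = len(matrix)
    return [matrix[j][j] / sum(matrix[i][j] for i in range(n))
            for j in range(n)]
```

With a toy matrix `[[8, 1, 0], [2, 9, 0], [0, 0, 10]]` for the hypothetical modes still, walk, and run, the general accuracy is 27/30 = 0.9, and the per-mode accuracies are 0.8, 0.9, and 1.0.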

Figure 9. Comparison between two different feature sets on accuracy of three machine learning algorithms.

Vehicle Mode Classification
The vehicle mode includes HSR, metro, bus, car, and train. Tables 7 and 8 compare the results between the two feature sets; one is the seven features based on Reference [5] (Table 7), and the other is the proposed features (Table 8). The tables show that the proposed features significantly enhance the accuracy in the three machine learning algorithms. These results from the vehicular mode classification are consistent with those of transportation mode detection. Again, DT reports the lowest prediction time and the smallest model size. The only difference is that KNN provides the best accuracy, whereas it also incurs the largest model size. Figure 10 compares the two feature sets on accuracy more clearly, verifying that the accuracy still clearly improves when using the proposed features.
Similar to the previous transportation mode detection results, Tables 9-11 provide the confusion matrices of each algorithm with the proposed features. The results show that among the five vehicular modes, detecting the car mode reports the highest accuracy (89.21% using KNN). On the other hand, the most significant errors occur while classifying the car and train (11.27% using KNN). Figure 11 compares the performance between the transportation and vehicle mode classification. This figure shows that classifying the vehicle mode is more difficult than the transportation mode. The general accuracy drops from 86.94% to 78.59% with SVM and from 86.86% to 83.57% with KNN. This is because the behaviors of the car-bus and the train-metro pairs are very similar, making the mode classification difficult.
Figure 11. Performance comparison between transportation and vehicle mode classification using three machine learning algorithms.
Figure 13c,d show the mean of the sixth feature (horizontal section (X-Z plane) of the accelerometer's magnitude) and the 14th added feature (average of magnetic instant change), respectively. This figure shows that the added feature can assist the task because of its different properties. More importantly, the added feature can improve the performance of the vehicular mode detection task, as indicated in Figure 8. From this figure, we can see that the first original features are almost the same in the five modes whereas the sixth added feature can separate the data into two categories. These figures again verify the ability of the added features to enhance the accuracy in both tasks.

Conclusions
This paper studies transportation mode classification using big data from three smartphone sensors, based on three machine learning algorithms and two different feature vectors. From the feature perspective, the results show that the proposed features significantly enhance the accuracy in the three machine learning algorithms, as compared to traditional features. From the classifier perspective, SVM has the best prediction accuracy for the transportation modes, whereas it incurs the largest prediction time. When the proposed features are used to predict the transportation modes, KNN shows a comparable performance to SVM and a slightly larger model size. This paper also investigates vehicle mode classification and compares the results with those of the transportation modes. In the vehicle mode detection tasks, KNN outperforms SVM with a shorter prediction time but has the largest model size. Future work will study different features and models to overcome the problem of misattributed results.