Analyzing the Importance of Sensors for Mode of Transportation Classification

The broad availability of smartphones, and of Inertial Measurement Units in particular, brings them into the focus of recent research. Inertial Measurement Unit data are used for a variety of tasks. One important task is the classification of the mode of transportation. In the first step, we present a deep-learning-based algorithm that combines long short-term memory (LSTM) and convolutional layers to classify eight different modes of transportation on the Sussex–Huawei Locomotion-Transportation (SHL) dataset. The inputs of our model are the accelerometer, gyroscope, linear acceleration, magnetometer, gravity and pressure values as well as the orientation information. In the second step, we analyze the contribution of each sensor modality to the classification score and to the different modes of transportation. For this analysis, we subtract the baseline confusion matrix from the confusion matrix of a network trained with one sensor modality left out (the difference confusion matrix), and we visualize the low-level features from the LSTM layers. This approach provides useful insights into the properties of the deep-learning algorithm and indicates the presence of redundant sensor modalities.


Introduction
The broad acceptance of smartphones holds the potential for large-scale human-centered sensing and research. Smartphones contain a variety of sensors, e.g., for global localization and for measuring the forces acting on the body. The data derived from smartphones enhance research focused on the challenges arising with the growing number of people in urban and major metropolitan areas. One important challenge is traffic management in urban areas, since traffic congestion occurs naturally during rush hours. Information on people's transport behavior can result in better routing and less congestion. Most smartphones can position themselves in a global frame of reference, e.g., via GPS, but the accuracy depends on the signal quality and the line of sight between the sensor and the satellites. The accuracy decreases significantly indoors or underground, and so does the quality of the features derived from the measurements. Inertial Measurement Units (IMUs) do not rely on external infrastructure. On the one hand, the data quality of the IMU does not depend on whether the sensor is underground or not; on the other hand, the IMU data depend on the kinematic chain between the sensor and the source of the force applied to the sensor.
As part of the SHL recognition challenge 2020 [1], we proposed a deep-learning-based algorithm that combines augmentation and LSTM layers as well as several convolutional and fully connected layers to perform transportation mode classification using IMU data from smartphone sensors [2]. However, decisions made by deep neural networks are difficult to understand and interpret due to their black-box character. Explainable Artificial Intelligence (XAI) tackles this problem and allows more transparent decisions that can be explained at a certain level of detail. This is essential, since explanations can be used to ensure algorithmic fairness, to identify potential bias and problems in the training data, and to validate that the algorithms work as expected [3]. Compared to deep neural networks for image classification, where learned features can be visualized more intuitively, and thus be interpreted more easily by humans, visualizations in the time series domain are challenging. This is because the input as well as the features are more abstract and include a time dimension. In this context, this paper targets the basic understanding of what a deep neural network learns and which inputs have the greatest influence on accuracy. For this purpose, we trained our network for the SHL Challenge 2020 with a leave-one-sensor-out strategy and computed the difference confusion matrices between these networks and the baseline trained with all sensors. Moreover, we used autoencoders to visualize the low-level features learned by the LSTM layers of each sensor. By combining the two results, we were able to identify the contribution of individual sensors to the classification accuracy and to detect redundancies.
The paper is structured as follows: Section 2 gives a brief overview of the state of the art. Afterwards, in Section 3, we provide details regarding the dataset used, the pre-processing pipeline and the algorithm. Section 4 summarizes the results of our analysis, and we discuss the obtained results in the final Section 5.

State of the Art
For several years, extensive work has been carried out on understanding and sensing the mobility behavior of people. This section first introduces the state of the art regarding mode of transportation classification using machine learning approaches based on smartphone sensor data. Subsequently, we examine explanatory visualization techniques that provide a better understanding of deep neural networks and their decisions, as well as methods that can be used to shed light on the influence of the inputs.
Approaches that include contextual information are not considered, since this research focuses on information derived from smartphone sensors. A common approach is to cast the detection of the mode of transportation as a classification problem. We have assigned related works to the following two categories: 1. traditional machine-learning-based classification and 2. deep-learning-based classification. Antar et al. [4] and Liono et al. [5] proposed random forest (RF) classifiers that achieved accuracies of 92% and 91% on the SHL dataset and the CrowdSignals dataset, respectively. Yu et al. [6] extracted features from three sensors (accelerometer, magnetometer and gyroscope) and proposed support vector machines (SVM) as the best classifier for detecting a person's mode of transportation (i.e., standing still, walking, running, cycling, and in a vehicle). Similar findings were also made by Fang et al. [7]. Another traditional approach, based only on acceleration data, was proposed by Hemminki et al. [8] to detect five different modes of transportation (i.e., bus, train, underground, tram and car). Recently, large-scale datasets have been made available that enable the application of deep-learning techniques. Deep-learning algorithms outperform the traditional approaches that use handcrafted features. Jeyakumar et al. [9] proposed a deep convolutional bidirectional-LSTM ensemble trained directly on raw sensor data of the SHL dataset. Using this approach, an F1-score of 96% was achieved for transportation mode classification. Qin et al. [10] introduced a deep-learning-based algorithm that combines a CNN and an LSTM network. By using CNN-extracted and handcrafted features (i.e., segment and peak features), the algorithm can distinguish the transportation modes with an accuracy of 98.1% on the SHL dataset. Vu et al. [11] proposed a gate-based recurrent neural network to detect the transportation mode on the HTC dataset.
This accelerometer-based approach achieved an accuracy of 94.72%. Tambi et al. [12] presented a CNN that distinguishes four transportation modes (bus, car, subway, train) by using mobile sensor data derived from an accelerometer and a gyroscope in the spectral domain. An accuracy of 91% was achieved.
Although there is a lot of work done on the development and modification of LSTM architectures, the decisions made by deep neural networks are still difficult to understand and interpret, due to their black box character. To provide a better understanding, different explanatory techniques have been proposed.
One technique that visualizes decision-making in CNNs is the Class Activation Map (CAM) [13]. It indicates the discriminatory image region used to identify a specific class. Grad-CAM is a more versatile version of CAM. Using gradients applied to the last convolution layer of a CNN, Grad-CAM tries to find salient regions in the input space [14]. However, CAM is not only applied for a deeper understanding of the decision-making process for image classification, but also for the classification of time series. In this context, Wang et al. [15] introduced a one-dimensional CAM that highlights class-specific regions that have contributed most to a particular classification of time series. This method gives insights into the properties of the deep-learning algorithm or its decision-making process, but does not provide the possibility to identify redundant features.
Several works have focused on interpreting the hidden states of LSTMs or hidden layers of CNNs. Karpathy et al. [16] showed the existence of interpretable cells in LSTMs that kept track of dependencies, such as line length, quotes and brackets in character-level language models. Moreover, the hidden state of LSTMs on different inputs can be explored interactively by the visual analysis tool LSTMVis [17] for recurrent neural networks. To intercept the hidden layers of deep neural networks, Moreira et al. recently employed autoencoders to provide information for the interpretation of classifiers, and to enable the investigation of misclassifications in the dataset from emerging clusters [18]. These methods offer the possibility to increase the explainability of the functionality of models, whereas in this paper we mainly focused on the explainability of the input and on the low-level features. Another area of research is feature selection, which not only aims to gain a better understanding of the features, but also to improve the prediction accuracy and speed of classification. With the intention of improving prediction accuracy, Liu et al. [19] proposed a leave-one-feature-out wrapper method. The leave-one-covariate-out method [20] aims at estimating the importance of local features. Furthermore, Azarbayejani et al. [21] introduced an approach for the evaluation of the redundancy of sensor networks, which is based on a leave-one-sensor-out analysis. These methods, which can be applied in a straightforward manner, improve the explainability of the features used and allow identification of redundant features or sensor modalities. At this point we see the potential to extend these methods and to introduce them into the area of explainability.

Materials and Methods
The provided part of the Sussex-Huawei Locomotion-Transportation (SHL) dataset [22,23] contains data from smartphones carried on the body in various positions. The dataset was collected with three participants over 31.6 days, each of them carrying four phones positioned at the four different locations hand, bag, hips, and torso. The values of the hardware sensors accelerometer, gyroscope, magnetometer, and pressure were recorded, as well as the values of the software sensors linear acceleration, gravity and orientation. A virtual, i.e., software, sensor is constructed by using the values of one or more hardware sensors to compute its value. The measurement frequency was 100 Hz. Each individual sensor value was labelled, i.e., 100 labels are available per 1 s.
The dataset includes eight different modes of transportation: still, walk, run, bike, car, bus, train, and subway.

Pre-Processing
Before pre-processing, we performed some data integrity checks. We found that the labels of some samples are not uniform, i.e., these samples contain transitions between modes of transportation. Since the number of samples containing a transition was less than 1%, we assigned the label by majority decision. Thus, each sample in our dataset has only one label instead of 500. Then, the training set was merged with the validation set. To overcome the class imbalance, we followed a simple approach and oversampled by copying random samples and undersampled by deleting random samples. We used 30,000 samples per class, because the number of classes in which samples had to be deleted equals the number of classes in which samples had to be copied. Before balancing, the full dataset was split into new training, validation, and test sets in a stratified way. Since the samples in the challenge test set are not in consecutive order, the samples were chosen at random. 75% of the full dataset was assigned to the training set, 15% to the validation set, and the remaining 10% to our private test set. Finally, the data from all phone locations were merged. The training set contains 720,000 samples, the validation set 144,000 samples, and the test set 96,000 samples.
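The balancing step described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper name, the use of NumPy, and the random-generator handling are our own; the paper balances each class to 30,000 samples.

```python
import numpy as np

def balance_classes(X, y, n_per_class, seed=None):
    """Balance a labelled dataset by randomly deleting samples from
    over-represented classes (undersampling) and randomly copying samples
    from under-represented classes (oversampling)."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) >= n_per_class:
            # undersample: keep a random subset, no duplicates
            chosen.append(rng.choice(idx, n_per_class, replace=False))
        else:
            # oversample: draw with replacement, i.e., copy random samples
            chosen.append(rng.choice(idx, n_per_class, replace=True))
    chosen = np.concatenate(chosen)
    return X[chosen], y[chosen]
```

After this step every class contributes exactly `n_per_class` samples, which is what makes the subsequent stratified training meaningful.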
Two pre-processing steps were applied to the balanced dataset. The first one was to apply a low-pass filter to all data. We used a second-order filter with a cut-off frequency of 25 Hz. The second step was standard scaling, i.e., subtracting the mean and dividing by the standard deviation. Standard scaling was applied to each feature in each dimension separately. Augmentation was applied with a probability of 50% at runtime. Figure 3 shows the difference in the acceleration between a raw sample and the augmented sample of the class still. The activations of the LSTM layers were not pre-processed. The activation function of the LSTM layers was tanh, and therefore the output was already scaled to [−1, 1]. The output dimensions were (batch size, 500, 64), and the last 64 values of the output sequences were used. For visualization, we transformed each encoded value x by f(x) = sign(x) · ln(1 + x), where sign denotes the signum function, ln the natural logarithm and x the input variable. The natural logarithm is not defined for 0, but if the activation is −1, the input variable to the natural logarithm is 0. Our implementation returns the input value if the input value is so small that the natural logarithm cannot compute the result. This happens for 0 and for values very close to 0, because computers have only a limited number of bytes for storing values (floating-point arithmetic).
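The two pre-processing steps can be sketched as below. Note the assumptions: the paper only specifies a second-order filter with a 25 Hz cut-off at a 100 Hz sampling rate, so the Butterworth family and the zero-phase (forward-backward) application used here are our own choices, and the function name is illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 100.0      # SHL sensor sampling frequency (Hz)
CUTOFF = 25.0   # low-pass cut-off frequency (Hz)

# Second-order low-pass filter; Butterworth design is an assumption.
SOS = butter(2, CUTOFF, btype="low", fs=FS, output="sos")

def preprocess_channel(x):
    """Low-pass filter one sensor channel (shape: n_samples x 500), then
    standard-scale it by subtracting the mean and dividing by the
    standard deviation."""
    filtered = sosfiltfilt(SOS, x, axis=-1)  # zero-phase filtering
    return (filtered - filtered.mean()) / filtered.std()
```

In the paper, scaling is applied to each feature in each dimension separately, so this function would be called once per sensor channel.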

Algorithm
To find the best architecture, we started with a very small neural network and followed a greedy approach: we subsequently added layers and adjusted parameters. If the result improved, the adjustments were kept; if the result was worse, the adjustments were reverted. The architecture we propose (Figure 4) combines an augmentation and an LSTM layer, as well as several convolutional and fully connected layers, to perform transportation mode classification. The input data are split into seven streams, one stream per sensor. To artificially increase the number of training samples, an augmentation layer is implemented, which augments four windows of size 50 of each sample with a factor of 2. This is followed by an LSTM layer that can store information over time to find temporal correlations in the input sequences. The LSTM layer comprises 64 neurons, sigmoid recurrent activation and tanh activation. It is followed by a dropout layer with a dropout rate of 0.25, which is used to avoid overfitting, a convolutional layer, and, at the end of each stream, a maximum pooling layer. The convolutional layer consists of 128 filters with a kernel size of 8, a stride length of 2 and a Leaky ReLU activation function with α = 0.001. Maximum pooling was performed with a stride length of 2. Then, the seven streams are merged via a concatenation layer, which allows us to combine all features to extract meaningful information. Afterwards, a convolutional layer and a maximum pooling layer are used four times in a row, whereupon a flatten layer completes the second block (see Figure 5). In all type 2 blocks, the maximum pooling, the convolutional stride and the Leaky ReLU activation with α = 0.001 were the same. The number of filters and the filter sizes were arranged in ascending order: 64, 64, 128, 128, and 16, 32, 64, 64. The subsequent fully connected layers, each followed by a dropout layer, recombine the representations learned by the convolutional layers and reduce the dimension.
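The greedy search over architecture adjustments can be expressed generically. This is a schematic sketch: `evaluate` stands in for training a candidate network and returning its validation score, and the configuration dictionary is purely illustrative.

```python
def greedy_search(base_config, candidate_adjustments, evaluate):
    """Greedy architecture search: apply each candidate adjustment in turn,
    keep it only if the evaluation score improves, otherwise revert it."""
    config = dict(base_config)
    best_score = evaluate(config)
    for adjust in candidate_adjustments:
        trial = adjust(dict(config))   # apply the adjustment to a copy
        score = evaluate(trial)
        if score > best_score:         # improvement: keep the adjustment
            config, best_score = trial, score
        # otherwise the adjustment is reverted by discarding `trial`
    return config, best_score
```

Note that such a greedy procedure only guarantees a local optimum: an adjustment that hurts on its own but helps in combination with a later one will be discarded.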
Both blocks of type 3 used the same parameters. The dense layer had 256 neurons, the dropout rate was 0.25 and Leaky ReLU was used as activation function, as before. In the last step, the classification layer uses the SoftMax activation function for the mode of transportation classification. We used categorical cross-entropy as loss and the F1 score as metric. The Adam optimization algorithm was used for gradient optimization, and we used a learning rate schedule with exponential decay after the first 10 epochs and an initial learning rate of 0.001.

For dimensionality reduction and visualization we used a common autoencoder. The basic idea of an autoencoder is to find the best representation of high-dimensional data in a low-dimensional latent space, i.e., the representation from which the input can be reconstructed with minimal error. The autoencoder was trained on the last activations of the LSTM layer of all samples. For each sensor modality, a separate autoencoder was trained. The autoencoders comprised five layers with 600, 150, 2, 150, and 600 neurons, reducing the dimensionality from 600 dimensions to two. The architecture is shown in Figure 6: the upper part of the network is the encoder, the lower part is the decoder, and the first and second neurons of the latent space give the x- and y-values used for the visualizations. The latent layer with two neurons and the output layer were activated by a linear activation function, and all other layers by the ReLU function. The optimizer used was Adam with a learning rate of 0.001, and the mean squared error served as loss function. The number of training epochs was not uniform, because we used an early stopping criterion with a minimum delta of 0 for the validation loss and a patience of 5 epochs, i.e., the training was stopped after 5 epochs without any improvement.
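The early stopping criterion (minimum delta of 0, patience of 5) behaves as in the following sketch; the class is our own minimal re-implementation for illustration, not the framework callback used in the paper.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without a validation-loss
    improvement of more than `min_delta` (here 0, i.e., any strict
    decrease counts as an improvement)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.wait = val_loss, 0   # improvement: reset counter
        else:
            self.wait += 1                        # no improvement this epoch
        return self.wait >= self.patience
```

With a minimum delta of 0, a validation loss that merely stays constant counts as "no improvement", so training stops after five such epochs in a row.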

Difference Confusion Matrix
The difference confusion matrices are computed by subtracting the confusion matrix of the network trained with all sensor modalities from the confusion matrix of the network trained without a certain sensor modality. A positive value in a cell means that the value is larger for the network with the left-out sensor modality, a negative value means that it is smaller, and 0 means that the values are equal. On the diagonal, a positive value means that the network without the sensor modality classifies the corresponding class better, and a negative value means that it classifies the class worse. In the non-diagonal cells, a positive value means that the network without the sensor modality is worse at distinguishing the corresponding classes, and a negative value means that it is better.
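The computation is a plain element-wise subtraction; a minimal sketch (function name and the toy matrices are illustrative):

```python
import numpy as np

def difference_confusion_matrix(cm_without, cm_all):
    """Confusion matrix of the network trained without one sensor modality
    minus the confusion matrix of the baseline trained with all sensors.

    Diagonal cells: negative values mean the reduced network classifies
    that class worse. Off-diagonal cells: positive values mean more
    confusions, i.e., the reduced network distinguishes the classes worse."""
    return np.asarray(cm_without) - np.asarray(cm_all)
```

For example, if leaving a sensor out moves 10 samples of class 0 from the diagonal cell into the (0, 1) confusion cell, the difference matrix shows −10 on the diagonal and +10 off the diagonal.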

Results
After some preliminary experiments, we found that the model has difficulties distinguishing between the classes train and subway. Therefore, we put a higher weight (3×) on the gradient update for the class train. Figures 7 and 8 show the graphs of the F1 score and the loss of the final training. In the beginning, the score and the loss have a high slope; later, the slope asymptotically approaches 0. During the first 10 epochs, the validation score is slightly better than the training score and the validation loss is slightly smaller than the training loss. The confusion matrix shows that the model performs best on the classes walk and run and worst on the classes still and subway. An overview of the best epochs, the score on our private test set, and the score on the challenge test set is given in Tables 1 and 2. The best epoch was epoch 77, with a validation score of 98.93% and a score of 98.96% on our private test set. The largest score difference for our private test set (0.77%) is obtained by subtracting the score of the network trained without pressure from the score of the network trained with all sensor modalities. Furthermore, the largest score difference for the challenge test set (8.33%) is found by subtracting the score of the network trained without orientation from the score of the network trained without linear acceleration. Only two networks with one left-out sensor, gyroscope and orientation, have a worse score on the challenge test set than the network trained with all sensor modalities. The difference confusion matrices are shown in Tables 3-10 and the plots of the encoded last activations of the LSTM layers are shown in Figures 9-15. The color codes are blue for the class still, orange for walk, green for run, red for bike, violet for car, brown for bus, pink for train, and grey for subway.
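Weighting the gradient update for one class can be implemented as a class-weighted categorical cross-entropy, as in the following NumPy sketch. The index assignment (train = index 6, following the class order still, walk, run, bike, car, bus, train, subway) and the function itself are illustrative assumptions; the paper does not detail its implementation.

```python
import numpy as np

# Per-class loss weights; class `train` (assumed index 6) gets weight 3.
CLASS_WEIGHTS = np.ones(8)
CLASS_WEIGHTS[6] = 3.0

def weighted_categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy in which each sample's loss is scaled by
    the weight of its true class, so that gradients for the class `train`
    count three times as much as those of the other classes."""
    y_pred = np.clip(y_pred, eps, 1.0)                       # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)   # standard CE
    weights = CLASS_WEIGHTS[np.argmax(y_true, axis=-1)]      # weight by label
    return np.mean(weights * per_sample)
```

For two samples with identical prediction confidence, one labelled train and one labelled still, the train sample contributes exactly three times the loss.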
The diagonal values of the difference confusion matrix without acceleration are negative, except for the class bus (min. −98, max. 12). The maximum difference is found in the cell (train, train) and the minimum difference in the cell (run, run). Eleven of the remaining 56 non-diagonal cells contain a negative value (min. −18, max. −1), nine cells contain 0 and 36 cells contain positive values (min. 1, max. 55). The plot of the last activations of the acceleration LSTM layer shows that the activations of the classes bike (red) and run (green) overlap the least. The activations of the class walk (orange) do not overlap in the area centered at (0, 1). Acceleration is thus an important feature for classifying run and bike, but the largest drop in accuracy is found for the class train. In contrast, more bus samples (12) were classified correctly without the acceleration features. Considering the scores in Table 1, we see that the loss in the score on our private test set is the second smallest (0.19%), and there is a very small increase of 0.03% in the score on the challenge test set.
The diagonal values of the difference confusion matrix without gravity are negative, except for the class bike (min. −86, max. 1). The maximum difference is found in the cell (train, train) and the minimum difference in the cell (bike, bike), which is the only diagonal cell with a positive value (Table 5). The score on the challenge test set increases by 3.69% and the score on the private test set decreases by 0.31%.
The diagonal values of the difference confusion matrix without gyroscope are all negative (min. −104, max. −11). The maximum difference is found in the cell (train, train) and the minimum difference in the cell (run, run). The gyroscope is thus a sensor that is redundant for the classes run and bike: the difference confusion matrix in Table 6 shows that the largest drop in performance is found in the classes train and subway, not in run and bike. However, the gyroscope is important for the other classes, because the test score shows the second largest decrease (0.74%) on our private test set. The same holds for the challenge test set (0.29%).
The diagonal values of the difference confusion matrix without linear acceleration are negative, except for the class bike (0). The maximum difference is found in the cell (train, train) and the minimum difference in the cell (run, run). Sixteen of the remaining 56 non-diagonal cells contain negative values, eight cells contain 0 and 42 cells contain positive values (min. 1, max. 59). The effect of leaving out the linear acceleration is similar to that of leaving out the acceleration: the plots in Figures 9 and 12 look similar, and the distribution of the loss in accuracy in Tables 4 and 7 is similar as well. However, the score on the challenge test set is the highest among all left-out sensors, increasing by 7.32%, while the score on the private test set decreases by 0.28%.
The diagonal values of the difference confusion matrix without magnetometer are all negative. The maximum difference is found in the cell (still, still) and the minimum difference in the cell (run, run).

The diagonal values of the difference confusion matrix without orientation are all negative. The maximum difference is found in the cell (train, train) and the minimum difference in the cell (run, run). Ten of the remaining 56 non-diagonal cells contain negative values (min. −15, max. −1), seven cells contain 0 and 39 cells contain positive values (min. 1, max. 108). The plot of the activations of the orientation LSTM layer shows overlapping activations for all classes and does not indicate any substantial contribution of the orientation to any class, but the difference confusion matrix shows a large loss in accuracy for the classes train and subway. Comparing the losses in performance in Table 1 shows that the overall losses are moderate, with 0.51% and 1.01% for the private test set and the challenge test set, respectively.
The diagonal values of the difference confusion matrix without pressure are all negative. The maximum difference is found in the cell (subway, subway) and the minimum difference in the cell (run, run). Five of the remaining 56 non-diagonal cells contain negative values (min. −6, max. −1), nine cells contain 0 and 44 cells contain positive values (min. 1, max. 167). The plot of the activations of the pressure sensor (Figure 15) is remarkable because its structure is totally different from the other sensor activation plots. All other activations start around (0, 0) and then evolve in all directions, whereas the pressure activations look like a line with bulges. The lower right part of the plot shows that the pressure features are useful for distinguishing bike, car, and train. The difference confusion matrix in Table 10 contradicts the plot: the false classifications of these three classes differ only slightly compared to using the sensor. The largest loss in performance is found in the classes train and subway. Leaving out the pressure sensor results in a loss in accuracy of 0.85% on the private test set and an increase in accuracy of 4.88% on the challenge test set.

Discussion and Conclusions
Considering all results, the classes train and subway are most affected by removing one sensor modality. In six out of seven cases, these two classes have the highest loss in performance. The sensors acceleration, gyroscope, and linear acceleration are redundant for the two classes run and bike, and the least important sensor seems to be the gravity sensor. Furthermore, the pressure sensor seems to be the most important sensor, according to Table 2 and the shape of the activations in Figure 15. The results also showed that the software sensors linear acceleration and orientation do not contribute substantially to the performance; the network can internally learn the important information from the hardware sensors. Moreover, the difference confusion matrices and the activation plots helped to identify redundancies regarding the sensors and certain classes. Even though the plots support the findings in the confusion matrices, their use is limited. The plots are useful to visualize the magnitude of the activations of the different classes, and the general structure of the plots can be used to identify sensors that should be investigated further. However, we showed that the difference confusion matrices are applicable in cases where visualization methods are only partially useful.
Our contribution to explainable machine learning is the introduction of difference confusion matrices as a tool for analyzing deep neural networks. We showed that the insights match the visualizations and that the difference confusion matrices can be used when visualization is limited. We also identified sensor redundancies and revealed that the network internally learns most of the information provided by the software sensors.

Funding: This research received no external funding.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: http://www.shl-dataset.org/download/.