Deep Transfer Learning for Vulnerable Road Users Detection using Smartphone Sensors Data

As the Autonomous Vehicle (AV) industry rapidly advances, the classification of non-motorized (vulnerable) road users (VRUs) becomes essential to ensure their safety and the smooth operation of road applications. The typical practice of non-motorized road users’ classification usually takes significant training time and ignores the temporal evolution and behavior of the signal. In this research effort, we attempt to detect VRUs with high accuracy by proposing a novel framework that uses Deep Transfer Learning, which saves training time and cost, to classify images constructed from Recurrence Quantification Analysis (RQA) that reflect the temporal dynamics and behavior of the signal. Recurrence Plots (RPs) were constructed from low-power smartphone sensors without using GPS data. The resulting RPs were used as inputs for different pre-trained Convolutional Neural Network (CNN) classifiers: 227 × 227 images were constructed for AlexNet and SqueezeNet, and 224 × 224 images for VGG16 and VGG19. Results show that the classification accuracy of Convolutional Neural Network Transfer Learning (CNN-TL) reaches 98.70%, 98.62%, 98.71%, and 98.71% for AlexNet, SqueezeNet, VGG16, and VGG19, respectively. Moreover, we trained resnet101 and shufflenet for a very short time using one epoch of data and then used them as weak learners, which yielded 98.49% classification accuracy. The results of the proposed framework outperform other results in the literature (to the best of our knowledge) and show that using CNN-TL is promising for VRU classification. Because of its relative straightforwardness, ability to be generalized and transferred, and potential high accuracy, we anticipate that this framework might be able to solve various problems related to signal classification.


Introduction
With the ongoing growth of the automated vehicle (AV) industry, the classification of non-motorized road users such as pedestrians is becoming crucial in developing safety applications for the Cooperative Intelligent Transportation System (C-ITS) to enhance the safety of non-motorized road users [1,2]. C-ITS has been widely investigated and used due to its ability to utilize data and better manage transportation networks. C-ITS attempts to advance health, performance, and comfort through various connectivity technologies, such as vehicle-to-vehicle (V2V) communication. C-ITS shares various forms of information, including knowledge about non-motorized road users, traffic congestion, accidents, and road threats [3]. This helps C-ITS to create an integrated network of people, routes, infrastructure, and vehicles by implementing communications and other transportation technologies. Taking advantage of new technology and available big datasets can create fully functional, real-time, precise, and secure transport facilities [2,4]. While C-ITS is now receiving attention globally, the academic community has focused primarily on motorized road users such as motor vehicles, expressing less interest in non-motorized road users [5]. A major planning challenge in deploying AVs is intermodal traffic control, where AV regulations and programming should be structured to value human life by reducing the likelihood of crashes and protecting non-motorized road users [6,7].
Non-motorized road users are known to be one example of people intervening with AVs: "humans that do not explicitly interact with the automated vehicle but still affect how the vehicle accomplishes its task by observing or interfering with the actions of the vehicle" [1]. Non-motorized road users, who typically lack protection, can be described as 'vulnerable.' They were identified in this research effort by the amount of traffic safety they lack. The lethality for non-motorized road users, in particular pedestrians and cyclists, is greater than the norm. This is because the discrimination factor for non-motorized road users' collisions is small compared to motorized road users [8].
Meanwhile, the use of smartphones in data analysis has recently gained the attention of academics and policymakers. Smartphone applications (apps) have been designed and used successfully in several fields to gather data from smartphones. In the transport industry, researchers can use smartphones to track and gather motion information, such as velocity and motion vectors from the integrated Global Positioning System (GPS). This information has the potential to identify the travel mode of the individual, which can be used in a variety of ways and may substantially decrease the time and expense of traditional travel surveys.
An established practice in non-motorized road users' classification and transportation mode recognition tasks is to feed state-of-the-art classification algorithms with frame-level features integrated over some period. The common approach is to use traditional statistical techniques, such as the mean and standard deviation, which results in a lower-resolution dataset, loses valuable information such as the temporal evolution of the signal, and needs significant training time and expense. In this study, we explore the possibility of using Transfer Learning with Convolutional Neural Networks (CNN-TL), which saves training time and cost, to classify non-motorized road users with high precision from images constructed from Recurrence Quantification Analysis (RQA). This approach has the potential to become popular in the transportation mode recognition field due to its potential high accuracy and ease of implementation.

Related Work
Scientists have established many methods to effectively differentiate between modes of transport. Machine learning and artificial intelligence algorithms have shown outstanding performance in creating classification models with high precision, in particular for transportation mode classification. Supervised learning models, such as Support Vector Machines (SVMs) [9][10][11], Random Forests (RFs) [11,12], and Decision Trees [11,13,14,15,16,17,18], have been utilized in different research efforts.
These research efforts have obtained various levels of classification accuracy. Many variables influence the precision of detecting modes of transport, for example the monitoring time, the source of the data, and the number of modes, among others [12,19]. A major factor influencing the precision of a transportation mode recognition approach, however, is the classifier used. In most past research, only one classification algorithm layer was used [12,14,15]. This is called a conventional framework. On the other hand, a few researchers have used more than one classification algorithm layer, which is called a hierarchical framework [11]. Rasti et al. have provided an overview of deep feature extraction approaches and their application to hyperspectral image classification, covering a wide range of techniques with different classification layers.
In addition to the number of classification layers, the domain of the extracted features is another important factor that needs to be considered in the transportation mode recognition approach. The domain of the features can be classified into two categories: time and frequency. Several research efforts were conducted using both, as in [9,10,11,12,14,15,20] [9,11,15]. Both achieved high accuracy.
Zadeh et al. [21] proposed a geometric approach to detect risky circumstances, so that their built-in alert system on smartphones can protect non-motorized road users. This approach can estimate the probability of a crash with the use of fuzzy inference. In addition, a three-dimensional (3D) photonic mixer camera was developed to provide pedestrian identification using a sensor device to meet unique criteria for pedestrian safety in [22]. Anaya et al. [5] have used V2V communication to develop a novel Advanced Driver Assistance Program to prevent collisions between motorcyclists and cyclists. A multi-sensor approach for non-motorized road users' security was developed as part of the PROTECTOR project by detecting and identifying non-motorized road users from vehicles in motion in [23]. In this study, we explore the impact of using CNN-TL on the precision of non-motorized road users' classification, which is the first effort in this respect to the best of our knowledge. We aim at precisely detecting non-motorized road users through data obtained from low-power sensors on smartphones. High-level C-ITS protection relies on a specific classification of the non-motorized road users. A binary classifier was introduced to discriminate non-motorized road users' modes (i.e., cycling, running, and walking) from motorized road users' modes (passenger car and taking a bus). A binary classifier is useful in situations where there are higher threats to non-motorized road users. For instance, at an intersection, all subjects' smartphones detect non-motorized road users and report them to the C-ITS roadside unit. The C-ITS roadside unit also receives messages from vehicles, if any, and then transmits a message to a warning sign if it detects a potential conflict between non-motorized road users and a vehicle.
We should emphasize that most of the methods proposed in the latest research efforts did not consider the shortcomings of GPS data such as signal failure or data loss, resulting in unreliable location information. In addition, turning on the GPS service in smartphones might quickly drain the battery; thus, this effort attempted to use collected data of various sensors in a smartphone without GPS information.

Data Collection
The dataset was obtained in Blacksburg, VA using a smartphone app by Jahangiri and Rakha [12]. Ten travelers were provided with the app to track their movement in five different modes of transportation, namely: car, bicycle, bus, running, and walking. Data were gathered from four different sensors in the smartphone: the Global Positioning System (GPS), accelerometer, gyroscope, and rotation vector. Data were stored at the maximum viable frequency. Data gathering took place on working days (Mondays to Fridays) and during working hours (from 8:00 AM to 6:00 PM). Several variables were considered in order to gather meaningful data that represent natural behaviors. To ensure that sensor positioning had no impact on the data collected, travelers (i.e., participants) were asked to hold the smartphone in various positions with no limitations. The data were gathered on various road types and included some periods of congestion that occur in real-life circumstances. Collecting 30 min of data per person for each mode during the study period was considered appropriate.
To compare the results of the analysis with the results of previous research efforts [11,12], the selected features extracted from the signals were assumed to have a significant association with the modes of travel for the study. In addition, features that could be derived from the rotation-vector values were omitted for the same purpose. Furthermore, GPS features were ignored in this study, allowing this system to be applied in circumstances in which GPS data are unavailable and resolving the issue of battery depletion when the GPS service is turned on.

RQA Features
Extracting features from the signal is the standard approach to solving mode classification problems in the literature; the features can then be used as inputs to various classification algorithms. The traditional method creates features mainly by using statistics such as the mean, median, and standard deviation. This process might lose the temporal evolution and behavior of the signal, which is valuable information. Extracting features that represent this behavior improves classification precision and accuracy, but it has not yet been deeply investigated in the literature. In [24], we proposed extracting features using Recurrence Quantification Analysis (RQA), which we showed captures extensive temporal behavior of the obtained signal. RQA is a nonlinear method for analyzing complex dynamic systems by quantifying the recurrence properties of the signal. Eckmann et al. [25] introduced recurrence plots as a visual tool for finding hidden recurring patterns, non-stationarity, and systemic shifts. RQA has proven to be a robust method for analyzing dynamic systems and is capable of quantitatively characterizing the magnitude and complexity of nonlinear, non-stationary, and small signals [26][27][28][29][30][31][32]. It seems that RQA may yield features that are more sensitive to variations in the signal and more robust against noise in the signal data [30,31].
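For reference, the recurrence matrix underlying an RP is commonly written as below. This is the standard textbook form rather than a formula taken from this study. The symbols are assumptions introduced for illustration: x_i denotes the i-th embedded state, N the number of embedded states, ε the distance threshold (the role played by the threshold parameter T in this paper), and Θ the Heaviside step function. The recurrence rate RR is shown as one example of a quantity that RQA derives from the matrix.

```latex
R_{i,j} = \Theta\left(\varepsilon - \lVert x_i - x_j \rVert\right), \qquad i, j = 1, \dots, N

\mathrm{RR} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} R_{i,j}
```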
In this study, we used the features extracted using RQA to create images (which we call RQA images) that could then be used as inputs to a classification algorithm instead of many numerical features. This has many benefits, including the ability to use pretrained deep learning algorithms and to represent the various features in one single image, which saves a significant amount of computing time and reduces the complexity of the system. However, before we introduce the proposed framework, the following is a brief description of how we extracted the features by quantifying the patterns that occur in Recurrence Plots (RPs); more details can be found in [24]. Extraction of RQA features involves setting three essential parameters: the delay (τ) (i.e., lag), the phase space dimension (D), and the threshold parameter (T). The delay is chosen as the minimum of the Average Mutual Information (AMI) function. We averaged the average mutual information function over all participants and modes in order to calibrate the delay parameter for each channel, as can be seen in Figure 1a. The phase space dimension is calculated using the False Nearest Neighbor (FNN) test, as seen in Figure 1b. To calculate the value of T, the phase space dimension and the delay were used to create the RP and to extract RQA features at various T values. The resulting RQA features of each channel were used as inputs to the RF algorithm, and T was then selected for each channel based on the precision of the classification. More details on how we found these parameters can be found in [24]. Jahangiri and Rakha [12] obtained measurements at a frequency of approximately 25 Hz from the various sensors. As the sensor output samples were not synchronized, a linear interpolation was implemented by the authors to generate continuous signals from the discrete samples.
Subsequently, they sampled the designed sensor signals at 100 Hz and divided each sensor's output in each direction ($x$, $y$, and $z$) into 1-s long, non-overlapping windows ($t$). Using RQA, each point ($V_i$) in the dimensional space is $V_i = p_i + p_{i+\tau} + p_{i+2\tau} + \cdots + p_{i+(D-1)\tau}$, which means that each 1-s window (i.e., 100 samples) results in a 70 × 70 RP for $D = 10$ and $\tau = 4$, and a 40 × 40 RP for $D = 30$ and $\tau = 3$. As a result, six RPs of 70 × 70 and three RPs of 40 × 40 were extracted to be used in image classification. Table 1 shows some information on the structure of the dataset used. In addition, examples of the extracted RQA images of VRU and non-VRU modes are shown in Figure 2.
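The preprocessing and RP construction described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Euclidean distance, the threshold value, and the synthetic sine signal are all hypothetical. The sketch uses the standard vector form of time-delay embedding, in which a window of N samples yields N − (D − 1)τ embedded points, so the RP sizes it produces need not match those quoted above, which depend on the authors' exact construction.

```python
import math

def resample_linear(times, values, rate_hz):
    """Linearly interpolate irregular samples onto a uniform grid at rate_hz."""
    n_out = int((times[-1] - times[0]) * rate_hz) + 1
    out, j = [], 0
    for k in range(n_out):
        t = times[0] + k / rate_hz
        # Advance to the input segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out

def split_windows(samples, window_len):
    """Cut a signal into non-overlapping windows, dropping any remainder."""
    last_start = len(samples) - window_len
    return [samples[i:i + window_len] for i in range(0, last_start + 1, window_len)]

def delay_embed(window, dim, tau):
    """Standard time-delay embedding: one dim-length vector per valid start index."""
    n_points = len(window) - (dim - 1) * tau
    return [[window[i + k * tau] for k in range(dim)] for i in range(n_points)]

def recurrence_plot(window, dim, tau, threshold):
    """Binary recurrence matrix: 1 where two embedded points lie within threshold."""
    pts = delay_embed(window, dim, tau)
    return [[1 if math.dist(a, b) <= threshold else 0 for b in pts] for a in pts]

# Synthetic stand-in for one sensor channel: 2 s of a 1.5 Hz sine sampled at 25 Hz.
times = [k / 25 for k in range(51)]
values = [math.sin(2 * math.pi * 1.5 * t) for t in times]

signal = resample_linear(times, values, rate_hz=100)   # upsample to 100 Hz
windows = split_windows(signal, window_len=100)        # 1-s windows
rp = recurrence_plot(windows[0], dim=10, tau=4, threshold=0.5)  # threshold is arbitrary
print(len(signal), len(windows), len(rp))
```

Each resulting matrix can be treated as a small grayscale image; in the framework described in this paper, several such RPs are resized and concatenated into a single input image for the CNN.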

Convolutional Neural Network Transfer Learning (CNN-TL)
A convolutional neural network (CNN) is a Deep Learning algorithm that has recently shown outstanding performance in many computer vision applications, such as image classification, object classification, and face recognition [33]. In this study, we used CNNs because they take images as inputs and have been shown to process and classify them well. Technically, each input image passes through a series of convolutional hidden layers with certain filters and is classified with a probabilistic value between 0 and 1. However, because training CNNs requires a relatively large number of input images and parameters, Transfer Learning (TL) was introduced as a way to reuse pretrained models, which expedites training and improves the performance of the CNN models. TL is defined as "a machine learning method where a model developed for a task is reused as the starting point for a model on a second task, which can be used in computer vision and natural language processing aiming to transfer knowledge between related source and target domains" [34,35]. There are many benefits to using TL; namely, it "overcomes the deficit of training samples for some categories by adapting classifiers trained for other categories and to cope with different data distributions in the source and target domains for the same categories" [33,34].

Proposed Framework and Results
In order to use CNN-TL for classifying VRUs and non-VRUs, we proposed the framework using CNN-TL and RF shown in Figure 3. Following the extraction of RPs using RQA, we resized and concatenated the nine resulting RPs to construct 227 × 227 images for AlexNet and SqueezeNet, and 224 × 224 images for VGG16, VGG19, shufflenet, and resnet101. For each CNN, we used 47.5% of the images for training the pre-trained deep neural network using transfer learning, 2.5% for validation, and the remaining 50% for testing. As a key advantage of the proposed framework, we used RF with the number of trees varying from 10 to 200 to capture the temporal dependencies between consecutive non-overlapping windows (t) of 1-s width and return the probability of a window/image being a VRU. As this type of neural network does not model time dependency, RF models the temporal relationship by concatenating the VRU probabilities of consecutive windows to form a vector of probabilities. In this study, we chose 3, 5, and 7 consecutive windows, which correspond to 3, 5, and 7 s, respectively.
Remote Sens. 2020, 12, 3508
CNN-TL was trained and tested as a binary classifier (i.e., classifying whether the class is a VRU or a non-VRU). The RQA images resulted from analyzing data collected using different smartphone sensors, namely the accelerometer, gyroscope, and rotation vector. As Figure 3 shows, the classification results reached the highest accuracy of 98.70%, 98.62%, 98.71%, and 98.71% using only seven consecutive non-overlapping windows for AlexNet, SqueezeNet, VGG16, and VGG19, respectively. Figure 4 shows the results of the different CNN-TL methods.
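The step of concatenating the VRU probabilities of consecutive windows into feature vectors for the RF can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the made-up probability values, and the choice to label each vector with the class of its last window are all assumptions, since the text does not specify how a vector spanning several windows is labeled.

```python
def stack_consecutive(probs, labels, k):
    """Pair each run of k consecutive window probabilities with a label.

    Assumption: each vector takes the label of its last window.
    """
    features, targets = [], []
    for end in range(k, len(probs) + 1):
        features.append(probs[end - k:end])
        targets.append(labels[end - 1])
    return features, targets

# Hypothetical per-window VRU probabilities from a fine-tuned CNN (made-up values),
# with 1 = VRU and 0 = non-VRU as the true class of each 1-s window.
probs = [0.91, 0.88, 0.35, 0.93, 0.97, 0.12, 0.08, 0.40, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]

for k in (3, 5, 7):  # 3, 5, and 7 consecutive 1-s windows
    X, y = stack_consecutive(probs, labels, k)
    print(k, len(X), X[0], y[0])
```

The resulting vectors could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier` with `n_estimators` set between 10 and 200; the choice of library is an assumption, as the paper does not name one.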

Finally, we used resnet101 and shufflenet, which are deeper networks than the previously used pre-trained networks. The two deeper networks were trained using the same data but for only one epoch, and the fine-tuned networks were then used as weak learners whose output probabilities were fed into the RF classifier. Figures 5 and 6 show the classification accuracy of both networks after one training epoch and the classification accuracy after feeding their outputs into the RF, respectively.
As shown in Figures 5 and 6, the classification accuracies of resnet101 and shufflenet are 98.36% and 98.44%, respectively. Moreover, the classification accuracy after feeding their outputs together into the RF becomes 98.49%.

Figure 6. The classification accuracy after using resnet101 and shufflenet as weak learners.

Conclusions
As the AV industry rapidly advances, the classification of non-motorized road users (i.e., VRUs) has become key to enhancing their safety on the road. In this research, we investigated the impact of extracted RPs, which capture the temporal evolution of the signal, on the precision of non-motorized road users' classification using a novel CNN-TL image classification framework. We extracted RPs using data from smartphone sensors such as the gyroscope, accelerometer, and rotation vector, without GPS data (which we assumed might have some possible issues, such as quickly depleting the smartphone's battery if the service is turned on). We proposed a framework that uses CNN-TL with pretrained networks to reduce training time and increase the classification accuracy. We also applied the RF algorithm to capture the temporal relationships between non-overlapping windows. We applied different CNNs including AlexNet, SqueezeNet, VGG16, and VGG19. The classification accuracy reached 98.70%, 98.62%, 98.71%, and 98.71% using seven consecutive windows for AlexNet, SqueezeNet, VGG16, and VGG19, respectively. Moreover, we trained two networks, resnet101 and shufflenet, for a shorter time using one epoch of data and considered them weak learners. The outputs of the weak learners were fed into the RF for final classification.
The results show that the proposed framework is promising and that it outperformed the results in the literature. Our experimental results show that CNN-TL applied to the extracted RQA images has a significant discriminating ability for VRU classification, which seems not to be captured by other classification algorithms. Unlike other methods, images resulting from RQA relax the assumptions about linearity, multicollinearity, or stationarity of the data that would be required when using other features. Because of its relative straightforwardness, its ability to be generalized and transferred, and its potential high accuracy, we anticipate that this framework might be able to solve various problems related to signal classification and could become a popular choice in the future.

Data Availability
The dataset used to support the findings of this study is owned by Virginia Tech Transportation Institute (VTTI), https://www.vtti.vt.edu/index.html, and available upon request.