Traffic Control Recognition with Speed-Profiles: A Deep Learning Approach

Abstract: Accurate information on traffic regulators at junctions is important for navigating and driving in cities. However, such information is often missing, incomplete or out of date in digital maps because of the high cost, in both time and money, of data acquisition and updating. In this study we propose a crowdsourced method that harnesses light-weight GPS tracks from commuting vehicles as Volunteered Geographic Information (VGI) for traffic regulator detection. We explore the novel idea of detecting traffic regulators by learning the movement patterns of vehicles at regulated locations. Vehicles' movement behavior is encoded in the form of speed-profiles, where both the speed values and their sequential order during the movement are used as features in a three-class classification problem for the most common traffic regulators: traffic-lights, priority-signs and uncontrolled junctions. The method provides an average weighting function and a majority voting scheme to tolerate errors in the VGI data. The sequence-to-sequence framework requires no extra overhead for data processing, which makes the method applicable to real-world traffic regulator detection tasks. The results show that the deep-learning classifier, a Conditional Variational Autoencoder, can predict regulators with 90% accuracy, outperforming a random forest classifier (88% accuracy) that uses summarized statistics of movement as features. In future work, images and augmentation techniques can be leveraged to extend the method to a greater variety of traffic regulator classes.


Introduction
Under the umbrella concept of the smart city, the development of smart transportation and mobility has been on the agenda of many government departments and institutions [1]. As urbanization [2] and traffic congestion in European cities [3] increase, the need for fast daily commuting between one's place of residence and place of work, or for less regular travel, has motivated a lot of research on how this demand can be met efficiently [4,5]. For this navigation task, maps are an elementary basis. These maps include the geometry and semantics of the route segments, as well as additional information such as restrictions or traffic regulators. Although mapping with aerial images or surveying equipment is an accurate and effective way of collecting features of a city, or of the Earth's surface in general, this procedure is highly time-consuming and costly. This is especially relevant because these data are subject to frequent changes.
A solution to the above problem has been given through the concept of citizens as sensors [6], where geographic information can be created and shared through individuals that can act as sensors of their environment. Under this concept, citizens can collect various measurements or data of their activity environment, such as temperature and noise values or images. These data coupled or center and a sliding window along the GPS track to automatically learn the timing and location of the approaching vehicle. To the best of our knowledge, we are the first to use a sequence-to-sequence deep learning model, as suggested and applied in this paper, for traffic regulator detection.
(a) GPS tracks. Each point represents a location that a vehicle had at a certain time.

Related Work
Recently, Zourlidou and Sester [21] conducted a systematic literature review on traffic-control detection from GPS tracks, highlighting (1) the importance of the topic itself, (2) the need for open data (GPS tracks and ground truth map) that researchers can use to test their methods and compare their results with others, (3) the low diversity of the predicted classes within each study and (4) the low percentage of studies that examine the cross-city applicability of their proposed methods (i.e., trained in city A and tested in city B). In this section, we briefly describe GPS-based methods for traffic regulator detection and identification. All the studies, along with the regulator classes they examine, are listed in Table 1.
The first category of methods comprises the studies that use map-extracted features. Such intersection-related features can be derived from open maps, e.g., OSM. The study by Saremi and Abdelzaher is unique in this regard [33]. They extract features such as the speed rating of road segments, the distance of one junction to the closest one, the end-to-end distance of the road that a junction belongs to, the semi-distances of a junction to the two ends of the road that it belongs to and the category of the street segments. They also test the combination of map-based features with trace-derived attributes (number of stops, traverse speed and stop duration), achieving an improved classification accuracy of 97%.
The second category includes approaches that use various features mainly related to stop or deceleration events. Here we find the earliest published study on the topic, proposed by Pribe and Rogers [25]. A Neural Network (NN) is trained to associate driver behavior with two types of traffic rules, traffic-lights and stop-signs. As input data, they use the average and standard deviation of stop-event related features: the number of times a vehicle stops before crossing a junction, the total duration of all stops and the last three stops that are closest to the junction. An additional feature is the percentage of traversals that include at least one stop for each road segment. A similar approach has been suggested by Hu et al. [5]. They use two types of features, namely physical and statistical. As physical features, they compute the final stop duration, the minimum crossing speed, the number of times the vehicle decelerates, the number of stops and the distance of the last stop event from the intersection. By computing the minimum, maximum, mean and variance of the physical features, four new statistical features are defined for each physical one. They test a random forest classifier, as well as spectral clustering, for a three-class classification problem (stop-sign, traffic-light and uncontrolled junctions). Their method reports an accuracy of over 90% for various features used for training and testing. Carisi et al. [29] propose a simple heuristic method for a binary classification problem (stop-signs and traffic-signals) using the slowdown and standstill events observed in the traces. They explain how to enrich digital maps with the location and timing of the aforementioned regulators. Their method achieves over 90% accuracy for the binary classification problem on a small dataset. Qui et al. [27] detect stop-signs based on a prevalent characteristic of stopping at a stop-sign: a deceleration followed by an acceleration. In addition, they apply heuristic rules to crowdsourced GPS tracks to distinguish between four-way and two-way stop-signs and between stop-signs and traffic-lights: there is no stop-sign if only one stop segment at an intersection is detected in a single trace while other traces do not have such segments. Méneroux et al. [35] present a supervised-learning algorithm (random forests and regression) for detecting and localizing traffic-signals based on the spatial distribution of vehicle stop-points along the road. One month of GPS traces collected in a city in Japan is used to train the algorithm, and their method reaches up to 85% in detection score and a positional accuracy of approximately 5 m. A similar method for intersection and stop bar position extraction has been proposed in [34]. Aly and Basalamah [28] harness pedestrians' trajectories for detecting stop-signs and traffic-lights. They recognize locations where pedestrians stay longer than a time threshold (dwelling time) and assign the regulators to the two categories accordingly. Similar to the study in [5], the most recent work by Golze et al. [26] uses speed-related statistics (e.g., mean and maximum crossing speed) extracted from GPS tracks for traffic regulator classification.
The third category includes approaches that use speed-profiles as classification features. For example, Zourlidou et al. [31] and Kuntzsch et al. [30] explore the effect of high-quality speed-profiles derived from CAN-Bus for training a tree classifier to distinguish traffic-light controlled junctions from priority and yield controlled junctions. The study of [31] is the first one to use speed-profiles for regulator detection. It reports high recall but low precision and F-measure for predicting the traffic-light regulator. Similarly, Méneroux et al. [32] detect traffic-signals by using speed-profiles. They test three different ways of deriving features: functional analysis of speed logs, raw speed measurements and an image recognition technique. They demonstrate the functional description of speed-profiles with wavelet transforms. Among different classifiers, random forests scored the best accuracy (95%). Last, Munoz-Organero et al. [36] detect various street infrastructure elements in real time, such as traffic-lights, street crossings and roundabouts, by classifying speed and acceleration time series with a deep-learning approach. Although the combined precision and recall are relatively high, the performance score of the traffic-light regulator, compared to the other two classes, exhibits a clear limitation in all the tested classification settings.
This article proposes a novel mechanism for exploiting the descriptive ability of speed-profiles that represent vehicles' movement behavior at regulated locations. Different from all the aforementioned approaches, we present a sequence-to-sequence conditional generative classifier for traffic regulator recognition that uses GPS tracks crowdsourced from vehicles in the form of time series with varying sequence lengths.
The rest of the paper is structured as follows: In Section 2, we describe the proposed method and the data we used for experimenting and in Section 3, we present the results. We discuss the results and future directions in Section 4 and summarize all findings in Section 5.

Dataset
The GPS tracks were collected using the mobile phone application Geo Tracker for the Android operating system. Data acquisition started in December 2017 and ended in March 2019. The data were naturalistic, in the sense that no external person gave the driver instructions on where to go or how to drive. All the trajectories/trips were part of drivers' daily travel, e.g., from home to work or a shopping center, and vice versa. In total, 1204 trajectories that cross 1064 junctions (Figure 2) regulated by 3538 junction arm regulators were accumulated. For the classification task, we considered one regulator for each junction arm. Table 2 gives a description of the dataset that we used for testing the proposed method.
Most of the trajectories have a length between 1 and 12 km (Appendix A Figure A1a), most trips last between 0 and 28 min (Appendix A Figure A1b) and the most common junction types are three-way and four-way (Appendix A Figure A1d). Regarding the number of trajectories per regulator type, traffic-lights, priority-signs and uncontrolled rules have the largest number of crossings (Appendix A Figure A1e), while yield-signs, stop-signs and roundabouts are excluded from the classification due to data limitations, as discussed in the next section. Last, most junctions (689 out of 1064) are sampled by 1-10 trajectories, followed by 141 junctions with between 11 and 20 trajectories, and only two junctions have between 421 and 460 trajectories (Appendix A Figure A1f).

Problem Formulation
The task of traffic regulator recognition is defined as a classification problem. For a given junction arm, the recognition task is mathematically formulated as $Y_n = f(X_n^{(T)})$, where $Y_n$ is the traffic regulator (one of the classes priority-signs, traffic-lights and uncontrolled) regulating the given GPS track $X_n$, and $n \in \{1, \ldots, N\}$ with $N$ denoting the total number of GPS tracks recorded on the given junction arm. The GPS track stores the time-ordered observed signals $X_n = \{x_1, \ldots, x_T\}$, where $x_i \in \mathbb{R}^d$ and $d$ denotes the dimension of the feature vector, which contains the location and speed information at each signal point. By the definition above, $f(\cdot)$ is a sequence-to-one classifier. We slightly change the form of $Y_n$ to a sequence of signal-wise labels $\{y_1, \ldots, y_T\}$, together with a weighting function that summarizes the signal-wise predictions into the sequence-wise prediction of the traffic regulator for the given GPS track. Hence, the sequence-to-one classification turns into a sequence-to-sequence classification.
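The summarizing step can be illustrated with a minimal sketch. It assumes that the signal-wise classifier outputs one probability vector per GPS signal and that the average weighting function is the plain mean of those vectors; the names `average_weighting` and `track_label` and the class order are illustrative and not taken from the authors' code.

```python
from typing import List, Sequence

CLASSES = ["priority-sign", "traffic-light", "uncontrolled"]  # assumed class order


def average_weighting(signal_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-signal class probabilities into one track-wise vector."""
    if not signal_probs:
        raise ValueError("a track needs at least one signal-wise prediction")
    totals = [0.0] * len(signal_probs[0])
    for probs in signal_probs:
        for k, p in enumerate(probs):
            totals[k] += p
    return [t / len(signal_probs) for t in totals]


def track_label(signal_probs: Sequence[Sequence[float]]) -> str:
    """Return the class with the highest averaged probability."""
    avg = average_weighting(signal_probs)
    return CLASSES[max(range(len(avg)), key=avg.__getitem__)]


# Three signals: two favour traffic-light, one favours priority-sign.
probs = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
print(track_label(probs))
```

A single misclassified signal (the second one here) does not change the track-wise label, which is the error tolerance the averaging is meant to provide.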
In most cases, $N \geq 1$ GPS tracks are available for the given junction arm. The arm regulator is the majority vote over all the GPS tracks that traverse the given arm. Equations (1) and (2) denote the prediction process using the traversed GPS tracks. Note that the signal-wise classification provides fine-grained feedback at each signal point for a single track, while the arm-regulator classification result is the crowdsourced feedback from all the GPS tracks.
One remaining problem that the sequence-to-sequence model has to tackle is the varying sequence length. In other words, $T$ is not fixed due to the different duration of the tracks and the availability of GPS signals. We propose to use a sliding window with a fixed window size ($w$) for varying sequence lengths [37]. First, a sequence is divided into small sub-sequences, which capture both long and short dependencies and circumvent the problem of varying sequence lengths across different GPS tracks. Second, as discussed in the previous section, the location and timing of a track relative to the traffic control are important. However, it is not known exactly where and when the track is regulated by the traffic control. Besides, the exact location and timing may differ from one track to another. The sliding window passes over every time step and automatically learns the location and timing affected by the traffic control. Compared with a fixed location or timing, this method ($w \ll T$) is less likely to overfit to a particular junction. Equation (3) denotes the sliding window method with a stride equal to the sliding window size. Overlap between two consecutive windows is allowed when the stride is set to be smaller than $w$ (see the left part of Figure 3).
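The sliding window and the arm-level vote can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, trailing signals that do not fill a whole window are simply dropped, and ties in the vote are resolved in favour of the label seen first.

```python
from collections import Counter
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(seq: Sequence[T], w: int, stride: int) -> List[Sequence[T]]:
    """Cut a sequence into windows of length w, moving `stride` steps each time.

    Signals at the end that do not fill a whole window are dropped (assumption);
    padding them would be an equally plausible choice.
    """
    if w <= 0 or stride <= 0:
        raise ValueError("w and stride must be positive")
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, stride)]


def majority_vote(track_labels: Sequence[str]) -> str:
    """Arm-level label: the most frequent track-wise prediction."""
    return Counter(track_labels).most_common(1)[0][0]


signals = list(range(12))                         # stand-in for 12 GPS signals
print(sliding_windows(signals, w=8, stride=2))    # stride < w: overlapping windows
print(sliding_windows(signals, w=4, stride=4))    # stride == w: no overlap
print(majority_vote(["traffic-light", "priority-sign", "traffic-light"]))
```

With a stride smaller than the window size, each signal contributes to several windows, so a track of any length above `w` yields a set of fixed-length inputs.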

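[Figure 3. Pipeline of the CVAE model with sliding window and weighting function, ending in the track-wise prediction. Legend: only in training; both in training and inference; concatenation.]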

Conditional Generative Model
We propose to use a conditional generative model parameterized by neural networks for the classification function $f(\cdot)$, namely the Conditional Variational Auto-Encoder (CVAE) [38,39]. The CVAE framework has proven very successful for solving many complex problems, for instance, image classification and generation [39,40] and trajectory prediction [41]. This model was chosen for the following reason: the behavior of GPS tracks with respect to traffic regulators is stochastic due to (1) the uncertain driving behavior of the drivers and (2) the varying location and timing of traversing the given junction arm. The CVAE learns a recognition model that encodes the input into stochastic variables, the so-called latent random variables, which follow a prior distribution such as a Gaussian distribution. Then, it learns a generative model that is conditioned on the stochastic variables for the probabilistic prediction task [39].
In the following, we briefly revisit the CVAE framework. Given the input $X$ and the output $Y$, the CVAE model is defined by Equation (4). The conditional probability of the output is an isotropic Gaussian distribution whose mean $\mu = f(z, X)$ is a function of the input $X$ and the latent variables $z$, and whose covariance matrix $\Sigma = \sigma^2 I$ is the identity matrix $I$ multiplied by a scalar $\sigma^2$. Because the true posterior $p_\theta(z \mid X, Y)$ is intractable, the equation cannot be solved analytically. A variational posterior $q_\phi(z \mid X, Y)$ is introduced to approximate the true posterior. The model can then be trained using Stochastic Gradient Variational Bayes (SGVB) [38] by maximizing a variational lower bound, denoted by Equation (5).
Here, $p_\theta(z)$ is the prior, which can be made independent of the input $X$ [40] and is set to $\mathcal{N}(0, I)$. For the complete derivation of the lower bound, we refer readers to [38,39]. Equation (5) can be interpreted as an auto-encoder: the first term on the right side "encodes" both the input and the output into the latent variables, and the second term "decodes" the output from the input and the latent variables. The decoder is also called the generative model. Note that, compared to a traditional auto-encoder, the CVAE model here predicts the output $Y$ rather than reconstructing the input $X$.
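For reference, the lower bound described above can be written in the standard CVAE form of [38,39]. This is a reconstruction under the notation of this section (variational posterior $q_\phi$, generative model $p_\theta$, prior independent of $X$); the exact typesetting of the original Equation (5) is assumed rather than quoted.

```latex
\log p_\theta(Y \mid X) \;\geq\;
  -\,\mathrm{KL}\big(q_\phi(z \mid X, Y)\,\|\,p_\theta(z)\big)
  \;+\; \mathbb{E}_{q_\phi(z \mid X, Y)}\big[\log p_\theta(Y \mid X, z)\big]
```

The first term pulls the encoder's posterior toward the prior, and the second term rewards the decoder for predicting $Y$ from $X$ and the sampled $z$, which matches the "encode" and "decode" reading given in the text.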
The Kullback-Leibler divergence $\mathrm{KL}(\cdot)$ between the approximated posterior and the prior distribution (both Gaussian) can be solved analytically. The reconstruction loss $\mathbb{E}_{q_\phi(z \mid X, Y)}(\cdot)$ can be approximated by Monte Carlo sampling [38]. The non-linear mapping functions of both $\theta$ and $\phi$ are parameterized by neural networks. To allow gradients to pass through the sampling process, a re-parameterization trick [42] is applied for back-propagation, where $z^{(l)} = g_\phi(X, Y, \epsilon^{(l)}) = \mu^{(l)} + \sigma^{(l)} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, 1)$.
Then the loss of the CVAE model is optimized via stochastic gradient descent. Equation (5) denotes the optimization process.
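The two ingredients of this loss that have closed forms can be sketched in a few lines. The sketch assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance per latent dimension, which is the common convention but is not stated in the text; the function names are illustrative.

```python
import math
import random
from typing import List, Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element by element."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]    # toy values for a 2-D latent variable
z = reparameterize(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```

Because the randomness enters only through `eps`, the sample `z` is a deterministic function of `mu` and `log_var`, which is what lets gradients flow back to the encoder in an automatic-differentiation framework.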

Framework Pipeline and Input Features
In this sub-section, we introduce the overall pipeline of the CVAE model for the sequence-to-sequence classification task using GPS tracks with a sliding window and weighting function, demonstrated in Figure 3.
The CVAE model has two different information flows, one for training and one for inference/prediction. In the training process, both the GPS signals and the corresponding arm regulator are available. First, the label of the arm regulator is duplicated to align with the signal time steps. Then a sliding window is applied to pass over the GPS signal and arm regulator sequences in parallel. After that, both sequences are concatenated into a complete input for training a variational encoder for the latent variables. Finally, the decoder is trained using the GPS track information and the latent variables to predict the signal-wise arm regulators, which are later summarized by the weighting function to obtain the track-wise prediction. In the inference process, only the GPS signals are available. In order to predict the arm regulator, the GPS signals are concatenated with latent variables sampled directly from the Gaussian distribution. We used Long Short-Term Memory networks (LSTMs) [43] for both the encoder and the decoder.
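The difference between the two information flows can be illustrated with a simplified sketch. It works on raw feature lists instead of LSTM encodings, and the one-hot label encoding, class order and function names are assumptions made for illustration only.

```python
import random
from typing import List

CLASSES = ["priority-sign", "traffic-light", "uncontrolled"]  # assumed class order


def one_hot(label: str) -> List[float]:
    """Encode a regulator label as a one-hot vector (assumed encoding)."""
    return [1.0 if c == label else 0.0 for c in CLASSES]


def training_input(window: List[List[float]], arm_label: str) -> List[List[float]]:
    """Duplicate the arm label over all time steps and append it to each signal."""
    y = one_hot(arm_label)
    return [features + y for features in window]


def inference_latent(latent_dim: int, rng: random.Random) -> List[float]:
    """At inference no label is available, so z is drawn from the prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


window = [[1.0, 2.0], [1.5, 2.5]]                # two signals with two toy features
print(training_input(window, "traffic-light"))   # each row gains a 3-D label
print(inference_latent(2, random.Random(0)))     # latent sample used at inference
```

In training, the encoder sees signals and labels together; at inference, the label path is replaced by a sample from the prior, so the decoder receives inputs of the same shape in both modes.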
The GPS signals are extracted as the relative x and y positions of the UTM (Universal Transverse Mercator) coordinates in relation to the center of the given junction. First, we used a predefined distance to select the relevant GPS tracks. Only the tracks within the threshold are of interest, because a large distance may cause a track to traverse multiple different junctions. In addition, considering that the signals after the junction, when the vehicle is leaving, are not as important as the ones before the junction, we set another threshold (at most one window size) for the GPS signals after the junction center. Figure 4 exemplifies the GPS tracks extracted from some junctions regulated by priority-signs, traffic-lights and uncontrolled, respectively.
Second, after the extraction, we enriched the GPS signals by calculating the distance $d$ to the junction center, the x- and y-offsets between two consecutive GPS signals, denoted as $\Delta x$ and $\Delta y$, and the speed $v$. Note that because the raw GPS signals are not evenly distributed over time, we also added the time interval between two consecutive GPS signals, $\Delta t$, as an input feature. The enriched GPS feature vector is denoted as $X^{(T)} = \{x, y, d, \Delta x, \Delta y, v, \Delta t\}_{t=1}^{T}$.
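The extraction and enrichment steps can be sketched as below. The sketch makes several assumptions that the text does not specify: each raw fix is a `(t, x, y)` tuple with coordinates already relative to the junction center, the first signal receives zero offsets, and the speed is estimated from consecutive positions (a recorded device speed could be used instead). The default threshold of 65 m is the value reported in the experimental settings.

```python
import math
from typing import List, Tuple

# One raw fix: (t in seconds, x in metres, y in metres), x/y relative to the junction center.
Fix = Tuple[float, float, float]


def within(track: List[Fix], threshold_m: float = 65.0) -> List[Fix]:
    """Keep only fixes whose distance to the junction center is within the threshold."""
    return [f for f in track if math.hypot(f[1], f[2]) <= threshold_m]


def enrich(track: List[Fix]) -> List[List[float]]:
    """Build [x, y, d, dx, dy, v, dt] per signal; the first signal gets zero offsets."""
    rows = []
    prev = None
    for t, x, y in track:
        d = math.hypot(x, y)                     # distance to the junction center
        if prev is None:
            dx = dy = dt = v = 0.0
        else:
            dx, dy, dt = x - prev[1], y - prev[2], t - prev[0]
            v = math.hypot(dx, dy) / dt if dt > 0 else 0.0   # assumed speed estimate
        rows.append([x, y, d, dx, dy, v, dt])
        prev = (t, x, y)
    return rows


track = [(0.0, -100.0, 0.0), (4.0, -30.0, 0.0), (6.0, -24.0, 0.0)]
print(enrich(within(track)))
```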

Experimental Settings
We ran the CVAE model multiple times using different parameters and set the values based on the best performance. The most important parameters resulting from this process are listed as follows:
• For the GPS track extraction, the distance threshold for selecting the relevant GPS tracks regarding the given junction is set to 65 m, the sliding window size is set to 8 and the stride to 2;
• For the data partitioning, the GPS tracks are randomly split 70:30 into training and test sets, respectively;
• For the neural networks of the CVAE model, the dimension of the latent variables z is set to 2 and the dimension of the LSTM hidden state for both the encoder and the decoder is set to 128;
• For the training hyper-parameters, the batch size is set to 256 and the number of training epochs to 500, with early stopping at a patience of 50 epochs. The learning rate is set to 1e-3 using the Adam optimizer [44] with a decay rate of 1e-8;
• We use the average weighting function for summarizing the signal-wise prediction to the track-wise prediction.
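The settings listed above can be collected in one place, for example as a small configuration object. The field names, the `split_tracks` helper and the fixed random seed are hypothetical and serve only as an illustration; only the values come from the list.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Settings:
    distance_threshold_m: float = 65.0   # track selection radius around the junction
    window_size: int = 8
    stride: int = 2
    train_fraction: float = 0.7          # 70:30 train/test split
    latent_dim: int = 2
    lstm_hidden: int = 128
    batch_size: int = 256
    max_epochs: int = 500
    patience: int = 50                   # early stopping
    learning_rate: float = 1e-3
    decay: float = 1e-8


def split_tracks(track_ids: Iterable[int], train_fraction: float,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split track identifiers into training and test sets."""
    ids = list(track_ids)
    random.Random(seed).shuffle(ids)     # the seed is an assumption, for repeatability
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


cfg = Settings()
train, test = split_tracks(range(10), cfg.train_fraction)
print(len(train), len(test))
```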
More details of the settings and the code of the CVAE model can be found at the repository (https://github.com/haohao11/Traffic_Control_Recognition).

Comparison Model
The performance of the proposed CVAE model was compared with that of a random forest model [26] using the same dataset as in [26] (1328 regulators in total), enlarged by an additional 1609 regulators from the same city/road network (2937 regulators in total). Different from the sequential features mentioned above, the random forest model uses two types of features summarized from the GPS tracks: physical and statistical features. The physical features are, for example, the number, percentage, duration and distance of the standstill phases, the duration and distance of the last standstill phase relative to the given junction, and the mean and maximum speed of each GPS track. The statistical features are statistics, such as the minimum, maximum, mean and variance, of all the aforementioned physical features. Different strategies were leveraged to boost the performance of the random forest model, such as random oversampling and bagging or AdaBoost [26].
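The step from physical to statistical features can be sketched as follows. The feature names and the use of the population variance are assumptions for illustration; the sketch does not reproduce the feature set of [26].

```python
from statistics import mean, pvariance
from typing import Dict, List


def statistical_features(physical: Dict[str, List[float]]) -> Dict[str, float]:
    """Summarize each physical feature over all tracks of one junction arm."""
    out: Dict[str, float] = {}
    for name, values in physical.items():
        out[f"{name}_min"] = min(values)
        out[f"{name}_max"] = max(values)
        out[f"{name}_mean"] = mean(values)
        out[f"{name}_var"] = pvariance(values)   # population variance (assumed)
    return out


# Hypothetical physical features from four tracks crossing the same arm.
arm = {"n_stops": [0.0, 1.0, 1.0, 2.0], "max_speed": [12.0, 10.0, 11.0, 9.0]}
print(statistical_features(arm))
```

Each arm is thereby reduced to one fixed-length vector, in contrast to the CVAE model, which consumes the signals of every track in their original order.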

Results
In this section, we present the empirical results for the CVAE model and the random forest model with different boosting strategies. The performance was measured by the accuracy of classifying the three majority traffic regulators, i.e., priority-signs, traffic-lights and uncontrolled, on the test dataset. Table 3 shows the evaluation results for the random forest model and the CVAE model. The basic random forest classifier achieved an accuracy of 0.83 on the test GPS tracks, including both non-turning and turning tracks. The performance was improved by removing the turning tracks, which were more difficult to classify. However, the removal reduces the size of the dataset. Random oversampling was leveraged to increase the samples of the minority class, and the performance increased in this way. The best performance (0.88 accuracy) was achieved by the classifier using oversampling and the AdaBoost strategy.
The CVAE model was trained to classify all the GPS tracks of the complete dataset, including both non-turning and turning tracks. The CVAE model first predicts the traffic regulator for each single GPS signal and then summarizes the weighted signal-wise predictions into the track-wise prediction for each GPS track (see Section 2.2.1). The final classification is a majority vote over the classified results of the tracks that traverse the given junction arm. Overall, the CVAE model outperformed the random forest model using all the tracks (0.90 vs. 0.83), as well as the random forest model using only non-turning tracks with oversampling and the AdaBoost strategy (0.90 vs. 0.88). Figure 5 shows the confusion matrices for the best random forest classifier with oversampling and AdaBoost (Figure 5a) and the CVAE model (Figure 5b).
From the above results, in comparison with the random forest model, the CVAE model achieved better accuracy for junction arm rule predictions. Additionally, the CVAE model was generalized to both non-turning and turning tracks and validated on a larger dataset. Moreover, the CVAE model did not use any advanced boosting strategies. The empirical results confirm that the proposed framework, a sequence-to-sequence classifier with a sliding window and an average weighting function, is suitable for dealing with both linear and non-linear GPS tracks of varying sequence length for traffic regulator classification.

Discussion
In this section, we first analyze the performance of the sequence-to-sequence CVAE model in terms of signal-wise prediction and the importance of the features used for the classification task, then we discuss the applicability of the model based on GPS signals for real-world traffic regulator detection.
The detailed results for the signal-wise predictions are listed in Table 4. Overall, the accuracy at signal level for the three-class classification task is 0.73. This indicates that predicting traffic regulators for each GPS signal is very challenging, because the motion of the vehicles changes over time, e.g., decelerating, stopping and accelerating when they approach junctions. However, the accuracy is much higher than a random guess, which sheds light on the overall junction arm rule classification. The signal-wise predictions accumulated by the average weighting function and the track-wise predictions accumulated by the majority voting scheme lead to a very accurate junction arm rule classification (0.90 accuracy, see Table 3). This shows that the weighting function and the voting scheme can tolerate low-level errors (wrong classification of GPS signals) and, most importantly, that the accuracy of individual information can be enhanced by accumulated crowdsourced information (multiple GPS signals and tracks). Looking closely at the results for each class, the detailed precision and F-measure indicate that predicting the regulators for priority-signs and uncontrolled at the signal-wise level is more difficult than predicting the regulator for traffic-lights. First, the respective sample size (support) of the priority-sign and the uncontrolled classes is smaller than the sample size of the traffic-light class. Note that we did not use any boosting strategy to increase the samples in the minority classes. The priority-sign and uncontrolled classes might therefore not be as well trained as the traffic-light class. Second, the regulators of priority-signs and uncontrolled are similar with regard to enforcement; drivers need to practice courtesy, which depends strongly on the individual. On the other hand, the traffic-light regulator is clearly defined and has a stronger impact on the drivers' behavior than the other two regulators, which makes the classification more accurate.
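For readers who want to recompute the per-class measures discussed above from their own predictions, a minimal sketch is given below. The function name and the toy labels are illustrative and do not reproduce the values of Table 4.

```python
from typing import Sequence, Tuple


def per_class_scores(y_true: Sequence[str], y_pred: Sequence[str],
                     label: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy signal-wise labels: TL = traffic-light, PS = priority-sign, UC = uncontrolled.
y_true = ["TL", "TL", "PS", "UC"]
y_pred = ["TL", "PS", "PS", "UC"]
for cls in ("TL", "PS", "UC"):
    print(cls, per_class_scores(y_true, y_pred, cls))
```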
The CVAE model was trained using different feature combinations in order to analyze how these features contribute to the classification performance. Namely, the features identified in Section 2.2.3 are the relative x and y coordinates and the distance d to the junction center, the x- and y-offsets ∆x and ∆y between two consecutive GPS signals with the time interval ∆t, and the speed v. Table 5 lists the detailed results. From the table we can see that the CVAE model (A) using only the x and y coordinates had a very limited performance compared to the other models using different or more features. This is because the coordinates are not evenly distributed over time and no accurate dynamic information (e.g., speed) is provided to indicate how vehicles approach junctions. The model (B) performed slightly better by only adding the distance feature d, but for the same reason its performance was rather limited. On the other hand, the model (C) using the offsets ∆x and ∆y achieved significantly better performance on all evaluation metrics compared to the previous two models. This is because the offset feature indicates how fast a vehicle crossing the junction changes its position between two consecutive GPS signals. When the coordinates, distance and offset features were combined, the model (D) achieved a further improved performance. Interestingly, adding the speed feature v did not provide a positive contribution, i.e., model (E) vs. model (D), but the time feature ∆t contributed to a slightly improved performance, i.e., model (F) vs. model (D). When all the aforementioned features were used, the model (G) achieved the best results on all the evaluation metrics. It becomes clear from the above analyses that the sequence-to-sequence CVAE model detects traffic regulators by learning the motion information (speed-profiles) of the vehicles driving through junctions.
The motion information captured by GPS tracks can be easily acquired with a mobile phone application, which is cheaper than acquiring images that require, e.g., larger storage and communication bandwidth. In addition, GPS signals are used directly as sequences, so there is no extra computational overhead for pre-processing the GPS signals for feature extraction, either locally on the mobile phone or remotely on the server side [5]. These advantages make the model light-weight and applicable to real-world traffic regulator detection tasks. Most importantly, as analyzed above, the model provides a solution to tolerate the errors in the VGI (crowdsourced GPS tracks) generated by commuters. One single GPS signal may not correctly represent the traffic regulator, but a sequence of GPS signals and the GPS tracks aggregated via the majority voting scheme yield a highly accurate detection of the traffic regulator.

Conclusions and Future Work
In this paper, we propose a conditional generative framework for traffic control recognition using crowdsourced GPS track data. First, we discuss the advantages of using light-weight GPS data compared to image-based data. Second, we explain how our proposed framework differs from previously suggested methods, which normally use statistical features summarized from GPS tracks. We propose to use the fine-grained GPS signals as sequences and train a sequence-to-sequence classifier based on the Conditional Variational Auto-Encoder (CVAE). A sliding window mechanism is applied to process sequences of varying length, and an average weighting function summarizes the signal-level predictions into the track-wise prediction. The proposed CVAE model outperformed a random forest model and achieved 0.90 accuracy on the mobile phone GPS data collected in the German city of Hannover for both non-turning and turning tracks. The sequence-to-sequence method with the average weighting function and the majority voting scheme provides a solution to tolerate the errors generated by individual users. The usage of GPS signals as sequences makes our model easily applicable to real-world traffic regulator detection tasks.
In the future, different strategies will be investigated to further increase the prediction accuracy, for instance, using augmentation techniques to increase the number of GPS tracks and interpolation techniques to smooth the GPS signals. In addition, road images extracted from maps or satellite imagery can be fed to the CVAE framework for more accurately solving the task of traffic regulator detection. We will extend our model not only for detecting the most common traffic regulator classes, but also for more generalized classes.