Abstract
Even with the ubiquitous sensing data in intelligent transportation systems, such as the mobile sensing of vehicle trajectories, traffic estimation is still faced with the data missing problem due to the detector faults or limited number of probe vehicles as mobile sensors. Such data missing issue poses an obstacle for many further explorations, e.g., the link-based traffic status modeling. Although many studies have focused on tackling this kind of problem, existing studies mainly focus on the situation in which data are missing at random and ignore the distinction between links of missing data. In the practical scenario, traffic speed data are always missing not at random (MNAR). The distinction for recovering missing data on different links has not been studied yet. In this paper, we propose a general linear model based on probabilistic principal component analysis (PPCA) for solving MNAR traffic speed data imputation. Furthermore, we propose a metric, i.e., Pearson score (p-score), for distinguishing links and investigate how the model performs on links with different p-score values. Experimental results show that the new model outperforms the typically used PPCA model, and missing data on links with higher p-score values can be better recovered.
1. Introduction
Traffic data generated by loop detectors or floating cars in urban road networks serve as the foundation for various data-driven applications in intelligent transportation systems, including traffic forecasting and traffic control [1,2,3]. However, even with ubiquitous sensing data, the missing data problem is almost inevitable due to either detector faults or a limited number of probe vehicles operating as mobile sensors in road networks, which means not each road in the network is covered by a detector or traveled by a probe vehicle in each minute [4,5,6]. Such an issue of missing traffic data poses obstacles for many further data-driven explorations in both academic and industrial fields, e.g., the link-based traffic status modeling, and network-wise traffic dynamics capturing [7,8]. Hence, accurate and reliable imputation is a basic need for such kind of incomplete data for the downstream explorations.
Many efforts have been made for estimating the missing traffic data on multiple traffic datasets, resulting the generative probabilistic model [9], the matrix decomposition and tensor factorization models [10,11,12], the autoencoder model [13], the fusion models [14]. Some basic mathematical models are also adopted, including the autoregressive integrated moving average (ARIMA) model, the Bayesian networks (BNs) method, the Markov chain Monte Carlo (MCMC) method, and the K-nearest neighbors (KNN) model, which are all tested in [15] for traffic missing data imputation.
The studies in [16] have validated that the matrix decomposition-based method is not capable for recovering missing data when the missing percentage large. The tensor models are based on the global structure capturing, and it is faced with challenging to permutation in the spatial and temporal dimension [17]. The probabilistic principal component analysis (PPCA) model [18] also plays a major role in missing data completion due to its generative feature [19]. Observations are assumed to be generated from a low dimensional space, with which the missing data can be recovered by optimizing the generative parameters using the observations [20].
Although many studies have focused on tackling this kind of problem, existing studies focus on the situation that the data are missing at random. Specifically, missing data can be classified into missing at random, missing completely at random, and missing not at random (MNAR) [16]. MNAR data always exist in the practical scenarios, and it is more challenging to estimate the missing values, which is the target of this paper.
The studies in [15] demonstrate that the PPCA model yields best performance among ARIMA, BNs, MCMC, and the KNN models, and in the research in [21], it has been certified that the PPCA model outperforms the basic tensor decomposition method. Hence, in this study, we set the PPCA model as a basis and further improve the PPCA model for tackling the MNAR traffic data. Additionally, the missing data on different links or sensors may be of different levels of challenges for data completion. Hence, there is also a need to distinguish different scenarios that missing data are on different links or sensors. Instead of the centrality of a sensor in the network, we utilize the time series correlations to define a metric for distinguishing the role of a link in the traffic network. Such a metric is adaptive to the scenario that sensors or links are anonymous. Contributions of this work are summarized as below:
- We design a metric, p-score to denote the relative importance of links in terms of time series observations, which is used to distinguish the links with missing values.
- We propose a linear model for the MNAR traffic data imputation, which is based on the probabilistic principal component analysis.
- We conduct experiments on a real-world traffic dataset using the model and the proposed metric. Experimental results show missing data on links with higher p-score values can be better recovered. Moreover, testing on the real-world dataset, the results of the proposed model on links with the lowest p-score value also outperforms the typically used PPCA model.
The remainder of the paper is structured as follows. Section 2 presents the problem statement of the missing traffic data imputation. Section 3 the details of the proposed model. Section 4 shows the outcome of the experimental evaluation results, Section 5 presents a short discussion of the potential application scenarios of the proposed method, and finally, Section 6 gives the conclusions of this paper and the directions for future studies.
2. Problem Statement
Let be a traffic data organization matrix with each element denoting the observation of a link .
We assume that the traffic data are missing and links with missing values are organized as , which is indexed by with are supposed to have missing values. Here, is the link set that has missing values. Other values in are observed.
We label the missing status of with another variable, written as
Traffic sensing in urban road networks is faced with the missing data, or data sparsity problem. We construct the traffic matrix with all missing values in columns . The missing data imputation problem is to estimate these missing values, i.e., , where .
3. Methodology
3.1. PPCA
Assuming that the target variable is organized as a matrix Y, and it can be drawn from of a low rank by linear combination, written as
here, , where is the sample number and is the number of variables in the determination system. Specifically, in our link-based missing data imputation problem, is the total number of links.
is the latent variable. , and the row is drawn from a Gaussian distribution with zero mean, i.e., . Here, , indicating a lower dimension. is the loading matrix of rank . is a model error, and each row , which also has a zero mean. , . Given the linear expression above, the mean value of each sample of is .
3.2. Missing Variables Differentiation Based on Time Series
Assume that we have missing data on two different variables, , with the same percent, the imputation accuracy can be different due to the variable’s role in the whole variable set. In the traffic missing data imputation problem, two links, , may have different correlations to other links. In this section, we prose a metric to differentiate the role of each link.
The observation of each link is also a time series. We first adopt the Pearson correlation coefficient to estimate the correlation between each pair of time series, which is calculated as
By calculate the Pearson correlation among each pair of variables, we can obtain a correlation matrix, written as
We define a Pearson score (p-score) for each variable to differentiate the variables in , which is calculated as
The variable that obtains a higher p-score value than denotes it has higher correlation to other links. Such a metric can differentiate the variables in terms of the time series observations. When , and the two links have the same missing data percentage, the imputation accuracy for should be higher than that of .
3.3. Preliminaries and Assumptions
Assume that we have missing data on two different variables, with the same percent, the imputation accuracy can be different due to the variable’s role in the whole variable set. In the traffic missing data imputation problem, two links may have different correlations to other links. In this section, we prose a metric to differentiate the role of each link.
Assume that we have a dimensional Gaussian distribution, written as
where is a D-dimensional mean vector, is a covariance matrix, denotes the determinant of . Then, we partition the D-dimensional vector into two parts, written as
Correspondingly, the mean vector and the covariance matrix are, respectively partitioned into two parts and four parts, written as below.
We further utilize another variable to denote the inverse matrix of the covariance matrix, defined as
Note that, we have the theory of matrix inverse as
Hence, for the inverse of the covariance matrix, we have
where we care about the expression of and , written as
For the Gaussian distribution, the exponent parts can be expanded as
when we partition the D-dimensional vector into two parts , then the exponent part of the Gaussian distribution can be expanded into
We assume that is known in advance, so it can be regarded as a constant. Hence, the first order of is written as
which should have the same expression of the original expression for the first order part written as . For , when we consider the is known, then can be written as , which should be equal to , written as
Hence, we have the expression the estimated value of written as conditional Gaussian distribution
where and are already known as above.
Based on the conditional Gaussian distribution, we replace the part with , which is assumed to known observations, and replace the part as the unknown part , which is to be estimated because that the data are missing. Then, we have the expectation of the estimation as
Then, the estimation of the missing data is calculated as
Hence, the missing data estimations depend on the estimations of , and . Below are assumptions for estimating the model parameters.
Assumption 1:
is invertible..
Assumption 2:
.
Assumption 1 denotes that the matrix is of full rank. Assumption 2 denotes that, given the values in , the column is independent with the column .
The missing data imputation for MNAR is to estimate the value of with for such that . Assumption 2 leads to
3.4. Estimation of
We first define the regression coefficients of on and , for in the complete case, that will be used to express the mean of a variable with MNAR values.
Considering the model , with an assumption that matrix is of full rank . Therefore, the expression of can be reduced to the following linear system.
Here, and are the reduced matrix of and . denotes the reduced matrix of .
Given Assumption 1, the is invertible, and the inverted matrix is denoted as . The latent matrix of full rank can be written as
Using the original model , the expression of is then can be written as
where we can get the intercept and the coefficients of on .
For , let be the intercept, and be the coefficients standing for the effects of on in the complete case, i.e., when . Then, we have
where the coefficients are calculated as below equations.
Here, the arrow indicates the regression model of on , and the squared bracket indicates the coefficient. Based on the model setting, we have , hence .
Assumption 2 leads to
By taking the expectation of the left and right parts of the equality above given for , we have
Above two equalities are identical. So, we have
3.5. Estimation of Variance and Covariance
Let , for the variance , we have
Assumption 2 leads to . According to the conditional variance for a Gaussian vector, we have
Then, we have the first term of as
For the second term of , we have
where .
For the covariances between , , let , we have
For the first term, we have
According to the derivation in [22], is calculated as
Note that for the second term, , it can be directly calculated.
4. Experiment
4.1. Dataset and Preprocessing
We utilize a road traffic speed dataset published by [23]. Road segments are anonymous, covering the main urban expressways within two months from 1 August to 30 September 2016, (a total of 61 days).
The time interval is 10 min. From the original dataset, we select twenty links whose speed are generated in the morning rush hours (i.e., 7:00 A.M. to 9:00 A.M.) for evaluating the proposed method. The speed of each link is transformed to the congestion index, which is calculated as , denotes the speed value of link and denotes all observations of link . For each link , the time series length of speed observations is 732 (12 observations in two hours 61 days). Hence, the dimension of is is the number of links.
The basic assumption of the proposed model is that the observations of each link are drawn from a Gaussian distribution, Hence, we adopt the quantile–quantile plot (QQ Plot) to display the quantiles of the data (after normalizing) versus the theoretical quantile values from a normal distribution. If the distribution of the data is normal, then the data plot is linear. As shown in the Figure 1, the plot closely follows the straight lines, suggesting that the data after normalizing the congestion data have an approximately normal distribution.
Figure 1.
Plot of the Data Quantitles and Standard Normal Quantitles.
4.2. Metrics for Missing Data Imputation Accuracy
For evaluating the performance of missing data imputation, we adopt the below four metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE), and R2. Note that a higher R2 value denotes better accuracy.
4.3. Benchmark and Experiment Settings
We compare the new model with the typically used PPCA model, where and is estimated by an expectation-maximum (EM) algorithm. We name it ppca-em in this section. We further use the estimated by EM as the known inputs of the new model in this study. As to the rank in the model, the best value is determined by cross-validation on the dataset. In this section, we further detail the experiment settings in terms of the MNAR data generation and the settings of link set .
4.3.1. Generating MNAR
Note that the model targets at solving the imputation for MNAR data. We utilize the mechanism of generating MNAR in [22]. Specifically, a logistic regression function is adopted as , where is an observation, and is set for selecting different missing percentage. The function transforms the observation to a value in . The observation with , is set to be the MNAR data, where is a random threshold. We set the parameters as below Table 1, which is corresponding to a specific missing percentage.
Table 1.
Settings for Generating MNAR Data in the Experiments.
4.3.2. Settings of Link Set
Missing data on different links may obtain different recover accuracy, even with the same missing percentage. For evaluating this proposition, we first test the missing data imputation accuracy with different p-score values of the links. When a link observation is set to be , all other links are set to be , where . Further, we test the missing data imputation accuracy of several select links (or link combination) compared with the ppca-em model, to evaluate the advantage of the new model.
4.4. Results and Analysis
We first examine the relationship between the missing data imputation accuracy and the proposed metric, i.e., p-score value. We select two links with the highest p-score value and the lowest p-score value in the dataset. The p-score values of two selected links, i.e., link 6 and link 18, are shown in Figure 2, where the color map denotes the values between the selected link and all links in link set .
Figure 2.
ρ Value of Two Links with the Highest and Lowest p-score.
Accordingly, we calculate the absolute errors of the model on these two selected links. Figure 3 shows the results missing data imputation results on these links in terms of different missing data percentages. We can see that missing data on link 18, which is with a higher p-score value than that of link 6, are better recovered regarding all scenarios of missing data percentages (25%, 50%, 75%).
Figure 3.
Performance of the Algorithm for Links with Highest p-Score and Lowest p-Score.
We further examine the relationship between the p-score value and the accuracy metrics on the traffic dataset, which is shown in below Figure 4. The accuracy measured by four metrics presents a positive correlation with the p-score value on different links, meaning that missing data on the links with higher p-score values can be better recovered.
Figure 4.
Scatter Plot between the Accuracy Metrics and the p-score Values.
We also compare the new model with the ppca-em model on other links or link combinations in the dataset. The settings and corresponding missing data imputation results measured by the four metrics are shown in Table 2. Except the above four metrics, we further added the accuracy as another metric for directly representing the estimation accuracy results and better understanding the accuracy comparison between the new model and the baseline. Here, the accuracy is calculated as
Table 2.
Experiment Setting and performance of the algorithms with different Percent of MNAR Data on Links.
Note that Figure 3 already shows that the new model obtains the worst accuracy on link 6. Hence, we further compare two models on this link to compare the new model with the ppca-em model. The results are shown in Figure 5. It shows that even on link 6, the absolute errors of the new model are still lower than the ppca-em model for three missing ratios.
Figure 5.
Performance of Models on the Link with Lowest p-Score Values.
The experiment results in Table 2 and Figure 5 demonstrate that the new model performs better than the typically used ppca-em model in terms of four accuracy metrics and computing time. The results indicate that the new model is more effective and efficient for the MNAR traffic data imputation problem, which is the target of this study. The typical ppca-em method is usually used for imputation of data missing at random, whereas the new model is more general and is capable of MNAR data imputation.
5. Discussion
Our improved linear probabilistic principal component analysis method can be applied to a variety of missing traffic data imputation applications, such as missing traffic speed estimation, or other traffic indicators. Notably, because the proposed missing data imputation method is a linear and interpretable model, which is naturally of high computing efficiency, it can be utilized in the systems where real-time missing data estimation is required. Additionally, the time-series based metric, p-Score value, is proposed to distinguish variables, e.g., links with missing traffic speed data, for estimating the missing values. Such a method can be applied to the applications of traffic surveillance systems to identify which sensors should be of high priority to maintained in the systems to ensure the full surveillance, or which links should be equipped with sensor for traffic surveillance.
6. Conclusions
In this study, we propose a general linear model based on the PPCA to tackle the MNAR traffic data imputation problem. We also propose a time series-based metric, i.e., the p-score, to distinguish links that are of missing data. Experimental results on a real-world traffic dataset show that the proposed model performs better than the typically used ppca-em model in terms of missing data imputation accuracy and computing time. Furthermore, we test the model on links with different p-score values. The experiment results show that the missing data on links with higher p-score values are better recovered. Such an observation helps us understand the data recovering distinction for different links in the road network, which has not been studied in any research to our best knowledge. In future work, we will further compare the model with other methods on more traffic datasets.
Author Contributions
Conceptualization, L.H. and R.S.; methodology, L.H.; software, Z.L.; validation, R.L.; formal analysis, R.S.; investigation, L.H.; resources, R.S.; data curation, Z.L.; writing—original draft preparation, L.H.; writing—review and editing, L.H. and R.S.; visualization, L.H.; supervision, R.S.; project administration, R.S.; funding acquisition, R.S. All authors have read and agreed to the published version of the manuscript.
Funding
This study is supported under the RIE2020 Industry Alignment Fund—Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s), and A*STAR under its Industry Alignment Fund (LOA Award I1901E0046).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset used in this paper is published by [23], which can be found at https://doi.org/10.5281/zenodo.1205229, Access on 18 February 2022.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Yuan, H.; Li, G. A survey of traffic prediction: From spatio-temporal data to intelligent transportation. Data Sci. Eng. 2021, 6, 63–85. [Google Scholar] [CrossRef]
- Neelakandan, S.; Berlin, M.A.; Tripathi, S.; Devi, V.B.; Bhardwaj, I.; Arulkumar, N. IoT-based traffic prediction and traffic signal control system for smart city. Soft Comput. 2021, 25, 12241–12248. [Google Scholar] [CrossRef]
- Tan, H.C.; Wu, Y.K.; Feng, J.S.; Wang, W.H.; Ran, B. Traffic missing data completion with spatial-temporal correlations. In Proceedings of the 93rd Annual Meeting of the Transportation Research Board, Washington, DC, USA, 12–16 January 2014. [Google Scholar]
- Li, H.P.; Wang, Y.H.; Li, M. Modified GAN Model for Traffic Missing Data Imputation. In CICTP 2020, Proceedings of the 20th COTA International Conference of Transportation Professionals, Xi’an, China, 14–16 August 2020; American Society of Civil Engineers: Reston, VA, USA, 2020; pp. 3013–3023. [Google Scholar]
- Yang, F.; Liu, G.; Huang, L.; Chin, C.S. Tensor Decomposition for Spatial—Temporal Traffic Flow Prediction with Sparse Data. Sensors 2020, 20, 6046. [Google Scholar] [CrossRef] [PubMed]
- Huang, L.P.; Zhao, S.D.; Luo, R.K.; Su, R.; Sindhwani, M.; Chan, S.K.; Dhinesh, G.R. An incremental map matching approach with speed estimation constraints for high sampling rate vehicle trajectories. In Proceedings of the IEEE 17th International Conference on Control & Automation (ICCA), Naples, Italy, 27–30 June 2022; pp. 758–765. [Google Scholar]
- Huang, L.P.; Yang, Y.J.; Chen, H.C.; Zhang, Y.; Wang, Z.; He, L. Context aware road travel time estimation by coupled tensor decomposition based on trajectory data. KBS 2022, 245, 108596. [Google Scholar] [CrossRef]
- Huang, L.; Li, Z.; Zhao, S.; Luo, R.; Su, R.; Guan, Y. Coupling Urban Road Travel Time and Traffic Status from Vehicle Trajectories by Gaussian Distribution. In Proceedings of the IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 4056–4061. [Google Scholar]
- Huang, L.P.; Yang, Y.J.; Zhao, X.H.; Ma, C.; Gao, H. Sparse data-based urban road travel speed prediction using probabilistic principal component analysis. IEEE Access 2018, 6, 44022–44035. [Google Scholar] [CrossRef]
- Asif, M.T.; Mitrovic, N.; Garg, L.; Dauwels, J.; Jaillet, P. Low-dimensional models for missing data imputation in road networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
- Jia, X.; Dong, X.; Chen, M.; Yu, X. Missing data imputation for traffic congestion data based on joint matrix factorization. Knowl.-Based Syst. 2021, 225, 107114. [Google Scholar] [CrossRef]
- Asif, M.T.; Mitrovic, N.; Dauwels, J.; Jaillet, P. Matrix and tensor-based methods for missing data estimation in large traffic networks. IEEE Trans. Intell. Transp. Syst. 2016, 17, 1816–1825. [Google Scholar] [CrossRef]
- Jiang, B.; Siddiqi, M.D.; Asadi, R.; Regan, A. Imputation of missing traffic flow data using denoising autoencoders. Procedia Comput. Sci. 2021, 184, 84–91. [Google Scholar] [CrossRef]
- Shang, Q.; Yang, Z.; Gao, S.; Tan, D. An imputation method for missing traffic data based on FCM optimized by PSO-SVR. J. Adv. Transp. 2018, 2018, 2935248. [Google Scholar] [CrossRef]
- Li, Y.B.; Li, Z.H.; Li, L. Missing traffic data: Comparison of imputation methods. IET Intell. Transp. Syst. 2018, 8, 51–57. [Google Scholar] [CrossRef]
- Wu, P.; Xu, L.; Huang, Z. Imputation methods used in missing traffic data: A literature review. In Proceedings of the International Symposium on Intelligence Computation and Applications, Guangzhou, China, 20–21 November 2019. [Google Scholar]
- Chen, X.; Lei, M.; Saunier, N.; Sun, L. Low-rank autoregressive tensor completion for spatiotemporal traffic data imputation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 12301–12310. [Google Scholar] [CrossRef]
- Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1999, 61, 611–622. [Google Scholar] [CrossRef]
- Ilin, A.; Raiko, T. Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 2010, 11, 1957–2000. [Google Scholar]
- Audigier, B.; Husson, F.; Josse, J. Multiple imputation for continuous variables using a Bayesian principal component analysis. J. Stat. Comput. Simul. 2016, 86, 2140–2156. [Google Scholar] [CrossRef]
- Qu, L.; Li, L.; Zhang, Y.; Hu, J. PPCA-based missing data imputation for traffic flow volume: A systematical approach. IEEE Trans. Intell. Transp. Syst. 2009, 10, 512–522. [Google Scholar]
- Sportisse, A.; Boyer, C.; Josse, J. Estimation and imputation in probabilistic principal component analysis with missing not at random data. Adv. Neural Inf. Process. Syst. 2020, 33, 7067–7077. [Google Scholar]
- Chen, X.; Yang, J.; Sun, L. A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation. Transp. Res. Part C Emerg. Technol. 2020, 117, 102673. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).