Missing Traffic Data Imputation with a Linear Generative Model Based on Probabilistic Principal Component Analysis

Even with the ubiquitous sensing data in intelligent transportation systems, such as the mobile sensing of vehicle trajectories, traffic estimation is still faced with the data missing problem due to the detector faults or limited number of probe vehicles as mobile sensors. Such data missing issue poses an obstacle for many further explorations, e.g., the link-based traffic status modeling. Although many studies have focused on tackling this kind of problem, existing studies mainly focus on the situation in which data are missing at random and ignore the distinction between links of missing data. In the practical scenario, traffic speed data are always missing not at random (MNAR). The distinction for recovering missing data on different links has not been studied yet. In this paper, we propose a general linear model based on probabilistic principal component analysis (PPCA) for solving MNAR traffic speed data imputation. Furthermore, we propose a metric, i.e., Pearson score (p-score), for distinguishing links and investigate how the model performs on links with different p-score values. Experimental results show that the new model outperforms the typically used PPCA model, and missing data on links with higher p-score values can be better recovered.


Introduction
Traffic data generated by loop detectors or floating cars in urban road networks serve as the foundation for various data-driven applications in intelligent transportation systems, including traffic forecasting and traffic control [1][2][3]. However, even with ubiquitous sensing data, the missing data problem is almost inevitable due to either detector faults or a limited number of probe vehicles operating as mobile sensors in road networks, which means not each road in the network is covered by a detector or traveled by a probe vehicle in each minute [4][5][6]. Such an issue of missing traffic data poses obstacles for many further data-driven explorations in both academic and industrial fields, e.g., the link-based traffic status modeling, and network-wise traffic dynamics capturing [7,8]. Hence, accurate and reliable imputation is a basic need for such kind of incomplete data for the downstream explorations.
Many efforts have been made for estimating the missing traffic data on multiple traffic datasets, resulting the generative probabilistic model [9], the matrix decomposition and tensor factorization models [10][11][12], the autoencoder model [13], the fusion models [14]. Some basic mathematical models are also adopted, including the autoregressive integrated moving average (ARIMA) model, the Bayesian networks (BNs) method, the Markov chain Monte Carlo (MCMC) method, and the K-nearest neighbors (KNN) model, which are all tested in [15] for traffic missing data imputation.
The studies in [16] have validated that the matrix decomposition-based method is not capable for recovering missing data when the missing percentage large. The tensor models are based on the global structure capturing, and it is faced with challenging to permutation in the spatial and temporal dimension [17]. The probabilistic principal component analysis (PPCA) model [18] also plays a major role in missing data completion due to its generative Sensors 2023, 23, 204 2 of 13 feature [19]. Observations are assumed to be generated from a low dimensional space, with which the missing data can be recovered by optimizing the generative parameters using the observations [20].
Although many studies have focused on tackling this kind of problem, existing studies focus on the situation that the data are missing at random. Specifically, missing data can be classified into missing at random, missing completely at random, and missing not at random (MNAR) [16]. MNAR data always exist in the practical scenarios, and it is more challenging to estimate the missing values, which is the target of this paper.
The studies in [15] demonstrate that the PPCA model yields best performance among ARIMA, BNs, MCMC, and the KNN models, and in the research in [21], it has been certified that the PPCA model outperforms the basic tensor decomposition method. Hence, in this study, we set the PPCA model as a basis and further improve the PPCA model for tackling the MNAR traffic data. Additionally, the missing data on different links or sensors may be of different levels of challenges for data completion. Hence, there is also a need to distinguish different scenarios that missing data are on different links or sensors. Instead of the centrality of a sensor in the network, we utilize the time series correlations to define a metric for distinguishing the role of a link in the traffic network. Such a metric is adaptive to the scenario that sensors or links are anonymous. Contributions of this work are summarized as below:

•
We design a metric, p-score to denote the relative importance of links in terms of time series observations, which is used to distinguish the links with missing values.

•
We propose a linear model for the MNAR traffic data imputation, which is based on the probabilistic principal component analysis.

•
We conduct experiments on a real-world traffic dataset using the model and the proposed metric. Experimental results show missing data on links with higher p-score values can be better recovered. Moreover, testing on the real-world dataset, the results of the proposed model on links with the lowest p-score value also outperforms the typically used PPCA model.
The remainder of the paper is structured as follows. Section 2 presents the problem statement of the missing traffic data imputation. Section 3 the details of the proposed model. Section 4 shows the outcome of the experimental evaluation results, Section 5 presents a short discussion of the potential application scenarios of the proposed method, and finally, Section 6 gives the conclusions of this paper and the directions for future studies.

Problem Statement
Let Y ∈ R n×p be a traffic data organization matrix with each element Y ij denoting the i observation of a link j.
We assume that the traffic data are missing and links with missing values are organized as Y ·m 1 , Y ·m 2 , . . . , Y ·m d , which is indexed by M := {m 1 , m 2 , . . . , m d } ⊂ {1, . . . , p} with d < p are supposed to have missing values. Here, M is the link set that has missing values. Other values in Y are observed.
We label the missing status of Y ij with another variable, written as Traffic sensing in urban road networks is faced with the missing data, or data sparsity problem. We construct the traffic matrix Y with all missing values in columns M. The missing data imputation problem is to estimate these missing values, i.e.,Ŷ im , m ∈ M, where Ω im = 0. Assuming that the target variable is organized as a matrix Y, and it can be drawn from X of a low rank by linear combination, written as here, Y ∈ R n×p , where n is the sample number and p is the number of variables in the determination system. Specifically, in our link-based missing data imputation problem, p is the total number of links. X = (X 1· |· · ·|X n· ) T is the latent variable. X ∈ R n×r , and the row is drawn from a Gaussian distribution with zero mean, i.e., X i· ∼ N (0 r , Id r×r ). Here, r < min{n, p}, indicating a lower dimension. A is the loading matrix of rank r, A ∈ R r×p . = ( 1· |· · ·| n· ) T is a model error, and each row i· ∼ N 0 p , σ 2 Id p×p ∈ R p , which also has a zero mean. 1 = (1, . . . , 1) T ∈ R n , α = α 1 , · · · , α p ∈ R p . Given the linear expression above, the mean value of each sample of Y is α.

Missing Variables Differentiation Based on Time Series
Assume that we have missing data on two different variables, Y ·j , Y ·k , with the same percent, the imputation accuracy can be different due to the variable's role in the whole variable set. In the traffic missing data imputation problem, two links, Y ·j , Y ·k , may have different correlations to other links. In this section, we prose a metric to differentiate the role of each link.
The observation of each link is also a time series. We first adopt the Pearson correlation coefficient to estimate the correlation between each pair of time series, which is calculated as By calculate the Pearson correlation among each pair of variables, we can obtain a correlation matrix, written as We define a Pearson score (p-score) for each variable to differentiate the variables in M, which is calculated as The variable Y ·j that obtains a higher p-score value than Y ·k denotes it has higher correlation to other links. Such a metric can differentiate the variables in terms of the time series observations. When P score Y ·j > P score (Y ·k ), and the two links have the same missing data percentage, the imputation accuracy for Y ·j should be higher than that of Y ·k .

Preliminaries and Assumptions
Assume that we have missing data on two different variables, Y ·j , Y ·k with the same percent, the imputation accuracy can be different due to the variable's role in the whole variable set. In the traffic missing data imputation problem, two links Y ·j , Y ·k may have different correlations to other links. In this section, we prose a metric to differentiate the role of each link.
Assume that we have a D dimensional Gaussian distribution, written as where u is a D-dimensional mean vector, Σ is a D × D covariance matrix, |Σ| denotes the determinant of Σ. Then, we partition the D-dimensional vector into two parts, written as Correspondingly, the mean vector and the covariance matrix are, respectively partitioned into two parts and four parts, written as below.
We further utilize another variable Λ to denote the inverse matrix of the covariance matrix, defined as Note that, we have the theory of matrix inverse as Hence, for the inverse of the covariance matrix, we have where we care about the expression of Λ aa and Λ ab , written as For the Gaussian distribution, the exponent parts can be expanded as when we partition the D-dimensional vector into two parts x = (x a , x b ) T , then the exponent part of the Gaussian distribution can be expanded into We assume that x b is known in advance, so it can be regarded as a constant. Hence, the first order of x a is written as which should have the same expression of the original expression for the first order part written as x T Σ −1 u. For x T Σ −1 u, when we consider the x b is known, then Σ −1 u can be written as Σ a|b −1 u a|b , which should be equal to Λ aa u a − Λ ab (x b − u b ), written as Hence, we have the expression the estimated value of u a|b written as conditional Gaussian distribution where Λ aa and Λ ab are already known as above.
Based on the conditional Gaussian distribution, we replace the x b part with (Y ·k ) k∈M , which is assumed to known observations, and replace the x a part as the unknown part Y ·m , which is to be estimated because that the data are missing. Then, we have the expectation of the estimation as Then, the estimation of the missing data is calculated aŝ Hence, the missing data estimations depend on the estimations ofα m ,Σ m,M and Below are assumptions for estimating the model parameters.

Estimation of α
We first define the regression coefficients of Y ·j on Y ·m and Y ·k , for k ∈ I −j in the complete case, that will be used to express the mean of a variable with MNAR values.
Considering the model Y = 1α + XA + , with an assumption that matrix A ∈ R r×p is of full rank r. Therefore, the expression of Y ·j can be reduced to the following linear system.
Here, α |r and |r are the reduced matrix of α and . A |r ∈ R r×r denotes the reduced Given Assumption 1, the A |r is invertible, and the inverted matrix is denoted asÁ −1 . The latent matrix of full rank r can be written as Using the original model Y = 1α + XA + , the expression of Y ·j is then can be written as where we can get the intercept and the coefficients of Y ·j on Y ·m , (Y ·k ) k∈I −j .
where the coefficients are calculated as below equations.
Here, the arrow j → m, I −j indicates the regression model of Y ·j on Y ·(m,I −j ) , and the squared bracket [k] indicates the coefficient. Based on the model setting, we have E By taking the expectation of the left and right parts of the equality above given E[ ·k ] = 0 for ∀k ∈ {m} ∪ I −j , we have Above two equalities are identical. So, we havê

Estimation of Variance and Covariance
Let Z = (Y ·k ) k∈{ j } , for the variance Σ M,M , we have Assumption 2 leads to Var Y ·j Z = Var Y ·j Z, Ω ·m = 1 . According to the conditional variance for a Gaussian vector, we have Then, we have the first term of Var Y ·j as For the second term of Var Y ·j , we have For the first term, we have According to the derivation in [22] Note that for the second term, E[Y ·m ]E[ Y ·k ], it can be directly calculated.

Dataset and Preprocessing
We utilize a road traffic speed dataset published by [23]. Road segments are anonymous, covering the main urban expressways within two months from 1 August to 30 September 2016, (a total of 61 days).
The time interval is 10 min. From the original dataset, we select twenty links whose speed are generated in the morning rush hours (i.e., 7:00 A.M. to 9:00 A.M.) for evaluating the proposed method. The speed of each link is transformed to the congestion index, which is calculated as v ij/max v ·j , v ij denotes the i speed value of link j and v ·j denotes all observations of link j. For each link j, the time series length of speed observations is 732 (12 observations in two hours 61 days). Hence, the dimension of Y is n = 732, p is the number of links.
The basic assumption of the proposed model is that the observations of each link are drawn from a Gaussian distribution, Hence, we adopt the quantile-quantile plot (QQ Plot) to display the quantiles of the data (after normalizing) versus the theoretical quantile values from a normal distribution. If the distribution of the data is normal, then the data plot is linear. As shown in the Figure 1, the plot closely follows the straight lines, suggesting that the data after normalizing the congestion data have an approximately normal distribution.

Metrics for Missing Data Imputation Accuracy
For evaluating the performance of missing data imputation, we adopt the below four metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE), and R 2 . Note that a higher R 2 value denotes better accuracy.

Benchmark and Experiment Settings
We compare the new model with the typically used PPCA model, where 2 and is estimated by an expectation-maximum (EM) algorithm. We name it ppca-em in this section. We further use the estimated 2 by EM as the known inputs of the new model in this study. As to the rank in the model, the best value is determined by cross-validation on the dataset. In this section, we further detail the experiment settings in terms of the MNAR data generation and the settings of link set ℳ.

Generating MNAR
Note that the model targets at solving the imputation for MNAR data. We utilize the mechanism of generating MNAR in [22]. Specifically, a logistic regression function is adopted as ( ) = 1/(− ( − )), where is an observation, and ( , ) is set for selecting different missing percentage. The function transforms the observation to a value in (0,1). The observation with ( ) > , is set to be the MNAR data, where is a random

Metrics for Missing Data Imputation Accuracy
For evaluating the performance of missing data imputation, we adopt the below four metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE), and R 2 . Note that a higher R 2 value denotes better accuracy.

Benchmark and Experiment Settings
We compare the new model with the typically used PPCA model, where σ 2 and A is estimated by an expectation-maximum (EM) algorithm. We name it ppca-em in this section. We further use the estimated σ 2 by EM as the known inputs of the new model in this study. As to the rank r in the model, the best value is determined by cross-validation on the dataset. In this section, we further detail the experiment settings in terms of the MNAR data generation and the settings of link set M.

Generating MNAR
Note that the model targets at solving the imputation for MNAR data. We utilize the mechanism of generating MNAR in [22]. Specifically, a logistic regression function is adopted as f (x) = 1/(−a(x − b)), where x is an observation, and (a, b) is set for selecting different missing percentage. The function transforms the observation x to a value in (0, 1). The observation x with f (x) > µ, is set to be the MNAR data, where µ is a random threshold. We set the parameters (a, b) as below Table 1, which is corresponding to a specific missing percentage. Further, we test the missing data imputation accuracy of several select links (or link combination) compared with the ppca-em model, to evaluate the advantage of the new model.

Results and Analysis
We first examine the relationship between the missing data imputation accuracy and the proposed metric, i.e., p-score value. We select two links with the highest p-score value and the lowest p-score value in the dataset. The p-score values of two selected links, i.e., link 6 and link 18, are shown in Figure 2, where the color map denotes the ρ values between the selected link and all links in link set I. threshold. We set the parameters ( , ) as below Table 1, which is corresponding specific missing percentage. Missing data on different links may obtain different recover accuracy, even with same missing percentage. For evaluating this proposition, we first test the missing d imputation accuracy with different p-score values of the links. When a link observa • is set to be ℳ = { }, all other links are set to be ℐ − = ℐ\{ }, where ℐ = {1,2, … , Further, we test the missing data imputation accuracy of several select links (or link c bination) compared with the ppca-em model, to evaluate the advantage of the new mo

Results and Analysis
We first examine the relationship between the missing data imputation accuracy the proposed metric, i.e., p-score value. We select two links with the highest p-score v and the lowest p-score value in the dataset. The p-score values of two selected links, link 6 and link 18, are shown in Figure 2, where the color map denotes the values tween the selected link and all links in link set ℐ. Accordingly, we calculate the absolute errors of the model on these two sele links. Figure 3 shows the results missing data imputation results on these links in te of different missing data percentages. We can see that missing data on link 18, whic with a higher p-score value than that of link 6, are better recovered regarding all scena of missing data percentages (25%, 50%, 75%). Accordingly, we calculate the absolute errors of the model on these two selected links. Figure 3 shows the results missing data imputation results on these links in terms of different missing data percentages. We can see that missing data on link 18, which is with a higher p-score value than that of link 6, are better recovered regarding all scenarios of missing data percentages (25%, 50%, 75%).
We further examine the relationship between the p-score value and the accuracy metrics on the traffic dataset, which is shown in below Figure 4. The accuracy measured by four metrics presents a positive correlation with the p-score value on different links, meaning that missing data on the links with higher p-score values can be better recovered.
We also compare the new model with the ppca-em model on other links or link combinations in the dataset. The settings and corresponding missing data imputation results measured by the four metrics are shown in Table 2. Except the above four metrics, we further added the accuracy as another metric for directly representing the estimation accuracy results and better understanding the accuracy comparison between the new model and the baseline. Here, the accuracy is calculated as  We further examine the relationship between the p-score value and the accuracy metrics on the traffic dataset, which is shown in below Figure 4. The accuracy measured by four metrics presents a positive correlation with the p-score value on different links, meaning that missing data on the links with higher p-score values can be better recovered. We also compare the new model with the ppca-em model on other links or link combinations in the dataset. The settings and corresponding missing data imputation results measured by the four metrics are shown in Table 2. Except the above four metrics, we further added the accuracy as another metric for directly representing the estimation accuracy results and better understanding the accuracy comparison between the new model We further examine the relationship between the p-score value and the accuracy metrics on the traffic dataset, which is shown in below Figure 4. The accuracy measured by four metrics presents a positive correlation with the p-score value on different links, meaning that missing data on the links with higher p-score values can be better recovered. We also compare the new model with the ppca-em model on other links or link combinations in the dataset. The settings and corresponding missing data imputation results measured by the four metrics are shown in Table 2. Except the above four metrics, we further added the accuracy as another metric for directly representing the estimation accuracy results and better understanding the accuracy comparison between the new model and the baseline. Here, the accuracy is calculated as  Note that Figure 3 already shows that the new model obtains the worst accuracy on link 6. Hence, we further compare two models on this link to compare the new model with the ppca-em model. The results are shown in Figure 5. It shows that even on link 6, the absolute errors of the new model are still lower than the ppca-em model for three missing ratios. Note that Figure 3 already shows that the new model obtains the worst accuracy on link 6. Hence, we further compare two models on this link to compare the new model with the ppca-em model. The results are shown in Figure 5. It shows that even on link 6, the absolute errors of the new model are still lower than the ppca-em model for three missing ratios. The experiment results in Table 2 and Figure 5 demonstrate that the new model performs better than the typically used ppca-em model in terms of four accuracy metrics and computing time. The results indicate that the new model is more effective and efficient for the MNAR traffic data imputation problem, which is the target of this study. The typical The experiment results in Table 2 and Figure 5 demonstrate that the new model performs better than the typically used ppca-em model in terms of four accuracy metrics and computing time. The results indicate that the new model is more effective and efficient for the MNAR traffic data imputation problem, which is the target of this study. The typical ppca-em method is usually used for imputation of data missing at random, whereas the new model is more general and is capable of MNAR data imputation.

Discussion
Our improved linear probabilistic principal component analysis method can be applied to a variety of missing traffic data imputation applications, such as missing traffic speed estimation, or other traffic indicators. Notably, because the proposed missing data imputation method is a linear and interpretable model, which is naturally of high computing efficiency, it can be utilized in the systems where real-time missing data estimation is required. Additionally, the time-series based metric, p-Score value, is proposed to distinguish variables, e.g., links with missing traffic speed data, for estimating the missing values. Such a method can be applied to the applications of traffic surveillance systems to identify which sensors should be of high priority to maintained in the systems to ensure the full surveillance, or which links should be equipped with sensor for traffic surveillance.

Conclusions
In this study, we propose a general linear model based on the PPCA to tackle the MNAR traffic data imputation problem. We also propose a time series-based metric, i.e., the p-score, to distinguish links that are of missing data. Experimental results on a real-world traffic dataset show that the proposed model performs better than the typically used ppcaem model in terms of missing data imputation accuracy and computing time. Furthermore, we test the model on links with different p-score values. The experiment results show that the missing data on links with higher p-score values are better recovered. Such an observation helps us understand the data recovering distinction for different links in the road network, which has not been studied in any research to our best knowledge. In future work, we will further compare the model with other methods on more traffic datasets. Funding: This study is supported under the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s), and A*STAR under its Industry Alignment Fund (LOA Award I1901E0046).