# Missing Traffic Data Imputation with a Linear Generative Model Based on Probabilistic Principal Component Analysis


## Abstract


## 1. Introduction

- We design a metric, the p-score, that quantifies the relative importance of links in terms of their time series observations and is used to differentiate the links with missing values.
- We propose a linear model for imputing missing not at random (MNAR) traffic data, based on probabilistic principal component analysis (PPCA).
- We conduct experiments on a real-world traffic dataset using the model and the proposed metric. The results show that missing data on links with higher p-score values are recovered more accurately. Moreover, even on the links with the lowest p-score value, the proposed model outperforms the commonly used PPCA model.

## 2. Problem Statement

## 3. Methodology

#### 3.1. PPCA
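
As background, standard PPCA (Tipping and Bishop, 1999) models each observation as $y = Wz + \mu + \epsilon$ with $z \sim \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \sigma^{2} I)$. The sketch below shows the closed-form maximum-likelihood fit for fully observed data only. It is not the MNAR-aware estimator proposed in this paper; the function name `fit_ppca` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_ppca(Y, q):
    """Closed-form ML fit of standard PPCA on complete data (hypothetical helper)."""
    mu = Y.mean(axis=0)
    S = np.cov(Y - mu, rowvar=False, bias=True)    # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[q:].mean()                    # noise variance: mean discarded eigenvalue
    # Loading matrix: top-q eigenvectors scaled by sqrt(eigenvalue - sigma2)
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, mu, sigma2

# Synthetic example: 6 observed variables driven by 2 latent factors
rng = np.random.default_rng(0)
n, p, q = 5000, 6, 2
W_true = rng.standard_normal((p, q))
Z = rng.standard_normal((n, q))
Y = Z @ W_true.T + 0.3 * rng.standard_normal((n, p))

W, mu, sigma2 = fit_ppca(Y, q)
```

On this synthetic data the estimated `sigma2` should land near the true noise variance of $0.3^{2} = 0.09$. With missing entries, the closed form no longer applies directly, which is why EM-style or MNAR-specific estimators are used instead.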

#### 3.2. Missing Variables Differentiation Based on Time Series

#### 3.3. Preliminaries and Assumptions

**Assumption 1:**

**Assumption 2:**

**Assumption 2** states that, given the values in $(Y_{\cdot k})_{k\in \overline{\{j\}}}$, the column $Y_{\cdot j}$ is independent of the column $\Omega_{\cdot m}$.

#### 3.4. Estimation of $\alpha $

#### 3.5. Estimation of Variance and Covariance

## 4. Experiment

#### 4.1. Dataset and Preprocessing

#### 4.2. Metrics for Missing Data Imputation Accuracy

Note that a higher $R^{2}$ value denotes better accuracy.
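
The error metrics reported in the results table (RMSE, MAE, SMAPE, and $R^{2}$) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `imputation_metrics` is hypothetical, and the SMAPE form used here, the mean of $2|\hat{y}-y| / (|y|+|\hat{y}|)$, is one common definition that may differ from the one used in the paper.

```python
import numpy as np

def imputation_metrics(y_true, y_pred):
    """Compare imputed values with held-out ground truth (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # Assumed SMAPE definition; undefined if a pair is (0, 0)
    smape = np.mean(2.0 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred)))
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "SMAPE": smape, "R2": r2}

# Three held-out speeds and their imputed values
scores = imputation_metrics([30.0, 40.0, 50.0], [32.0, 38.0, 50.0])
```

RMSE, MAE, and SMAPE are errors, so lower is better, whereas $R^{2}$ approaches 1 as the imputed values explain more of the variance in the true values.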

#### 4.3. Benchmark and Experiment Settings

#### 4.3.1. Generating MNAR

#### 4.3.2. Settings of Link Set $\mathcal{M}$

#### 4.4. Results and Analysis

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


| $a$ | $b$ | Missing percentage |
|---|---|---|
| −1 | −1.3 | 25% |
| 3 | 0 | 50% |
| 1 | −1.3 | 75% |
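
The sketch below shows one way parameters such as $a$ and $b$ could generate MNAR data. It assumes a logistic self-masking rule, $P(\text{missing} \mid y) = 1/(1 + e^{-a(y-b)})$, applied to standardized values. This form is an assumption, not confirmed by the text, although it reproduces missing rates close to those listed in the table. The function name `mnar_mask` is hypothetical.

```python
import numpy as np

def mnar_mask(y, a, b, rng):
    """Boolean mask (True = missing) from an assumed logistic self-masking rule."""
    p_missing = 1.0 / (1.0 + np.exp(-a * (y - b)))   # depends on the value itself
    return rng.random(y.shape) < p_missing

rng = np.random.default_rng(0)
y = rng.standard_normal(200_000)   # stand-in for standardized link speeds

# Empirical missing rate for each (a, b) pair from the table
rates = {
    (a, b): mnar_mask(y, a, b, rng).mean()
    for a, b in [(-1, -1.3), (3, 0), (1, -1.3)]
}
```

Because the probability of being masked depends on the unobserved value itself, the mechanism is MNAR: with $a > 0$ larger values are more likely to be missing, and with $a < 0$ smaller values are.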

**Table 2.** Experiment settings and performance of the algorithms with different percentages of MNAR data on links.

| Experiment setting: missing rate (%) @ $\mathcal{M}$ | 50 @ $\{1\}$ | 50 @ $\{3\}$ | 75 @ $\{1\}$ | 75 @ $\{1,3\}$ | 75 @ $\{3,5\}$ |
|---|---|---|---|---|---|
| p-score | 10.62 @ $\{1\}$ | 13.26 @ $\{3\}$ | 10.62 @ $\{1\}$ | − | 9.42 @ $\{5\}$ |

Performance comparison and computing time (each setting reports ppca-em, then the new model):

| Metric | 50 @ $\{1\}$, ppca-em | 50 @ $\{1\}$, New | 50 @ $\{3\}$, ppca-em | 50 @ $\{3\}$, New | 75 @ $\{1\}$, ppca-em | 75 @ $\{1\}$, New | 75 @ $\{1,3\}$, ppca-em | 75 @ $\{1,3\}$, New | 75 @ $\{3,5\}$, ppca-em | 75 @ $\{3,5\}$, New |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE | 0.992 | 0.746 | 0.559 | 0.595 | 1.069 | 0.746 | 0.835 | 0.871 | 0.942 | 0.627 |
| MAE | 0.810 | 0.564 | 0.458 | 0.448 | 0.789 | 0.564 | 0.598 | 0.625 | 0.665 | 0.468 |
| SMAPE | 0.340 | 0.223 | 0.216 | 0.157 | 0.289 | 0.223 | 0.231 | 0.228 | 0.253 | 0.201 |
| $R^{2}$ | 0.150 | 0.688 | 0.595 | 0.681 | 0.545 | 0.688 | 0.208 | 0.677 | 0.115 | 0.740 |
| Accuracy | 83.0% | 88.9% | 89.2% | 92.2% | 85.5% | 88.9% | 88.4% | 88.6% | 87.3% | 89.9% |
| Computing time (s) | 6.54 | 2.03 | 6.29 | 2.03 | 6.73 | 2.64 | 6.06 | 4.06 | 11.32 | 4.11 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Huang, L.; Li, Z.; Luo, R.; Su, R.
Missing Traffic Data Imputation with a Linear Generative Model Based on Probabilistic Principal Component Analysis. *Sensors* **2023**, *23*, 204.
https://doi.org/10.3390/s23010204
