# Missing Traffic Data Imputation with a Linear Generative Model Based on Probabilistic Principal Component Analysis

## Abstract

## 1. Introduction

- We design a metric, the p-score, that quantifies the relative importance of each link in terms of its time series observations and is used to differentiate the links with missing values.
- We propose a linear model for imputing traffic data that are missing not at random (MNAR), based on probabilistic principal component analysis (PPCA).
- We conduct experiments on a real-world traffic dataset using the proposed model and metric. The results show that missing data on links with higher p-score values are recovered more accurately. Moreover, on the same real-world dataset, the proposed model also outperforms the commonly used PPCA model on links with the lowest p-score value.
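
For orientation, the PPCA baseline mentioned in the contributions can be sketched as follows. This is a simplified illustration, not the exact `ppca-em` routine evaluated in the experiments: it alternates a closed-form PPCA fit with conditional-mean imputation and ignores the posterior covariance of the missing entries that a full EM step would include. The function names `fit_ppca` and `impute_ppca`, the latent dimension `q`, and the iteration count are assumptions made for this example.

```python
import numpy as np

def fit_ppca(Y, q):
    """Closed-form maximum-likelihood PPCA fit on complete data (requires q < d)."""
    mu = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False, bias=True)           # sample covariance (d x d)
    eigval, eigvec = np.linalg.eigh(S)               # ascending eigenvalues
    order = np.argsort(eigval)[::-1]                 # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    sigma2 = eigval[q:].mean()                       # noise variance: mean of discarded eigenvalues
    W = eigvec[:, :q] * np.sqrt(np.maximum(eigval[:q] - sigma2, 0.0))
    return mu, W, sigma2

def impute_ppca(Y_obs, q, n_iter=30):
    """Fill NaN entries by iterating PPCA fits and conditional-mean imputation."""
    mask = np.isnan(Y_obs)
    Y = np.where(mask, np.nanmean(Y_obs, axis=0), Y_obs)   # start from column means
    d = Y.shape[1]
    for _ in range(n_iter):
        mu, W, sigma2 = fit_ppca(Y, q)
        C = W @ W.T + sigma2 * np.eye(d)             # model covariance of an observation
        for i in np.flatnonzero(mask.any(axis=1)):
            m, o = mask[i], ~mask[i]
            if not o.any():                          # nothing observed in this row
                Y[i, m] = mu[m]
                continue
            # Gaussian conditional mean of the missing block given the observed block
            resid = Y[i, o] - mu[o]
            Y[i, m] = mu[m] + C[np.ix_(m, o)] @ np.linalg.solve(C[np.ix_(o, o)], resid)
    return Y
```

Because this scheme treats the missing entries as ignorable, it is expected to be biased when the data are MNAR, which is the situation the proposed model targets.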

## 2. Problem Statement

## 3. Methodology

#### 3.1. PPCA

#### 3.2. Missing Variables Differentiation Based on Time Series

#### 3.3. Preliminaries and Assumptions

**Assumption 1:**

**Assumption 2:**

**Assumption 2** states that, given the values in ${\left({Y}_{\cdot k}\right)}_{k\in \overline{\left\{j\right\}}}$, the column ${Y}_{\cdot j}$ is independent of the column ${\Omega}_{\cdot m}$.
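
Read literally, this is a conditional independence relation. The display below is only a restatement in symbols, assuming that $\overline{\{j\}}$ denotes the set of column indices other than $j$:

$$
{Y}_{\cdot j} \;\perp\!\!\!\perp\; {\Omega}_{\cdot m} \;\Big|\; {\left({Y}_{\cdot k}\right)}_{k\in \overline{\{j\}}}
$$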

#### 3.4. Estimation of $\alpha $

#### 3.5. Estimation of Variance and Covariance

## 4. Experiment

#### 4.1. Dataset and Preprocessing

#### 4.2. Metrics for Missing Data Imputation Accuracy

Note that a higher R$^{2}$ value denotes better accuracy.
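
As a rough illustration of how such accuracy metrics are typically computed on the imputed entries, the sketch below evaluates RMSE, MAE, SMAPE, and R$^{2}$. The formulas are the common textbook definitions and are an assumption here, since this section does not spell them out; in particular, SMAPE has several variants, and the "Accuracy" measure reported in Table 2 is omitted because its definition is not given. The function name `imputation_metrics` is hypothetical.

```python
import numpy as np

def imputation_metrics(y_true, y_pred):
    """Compute RMSE, MAE, SMAPE and R^2 between true and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # Assumed SMAPE variant: |error| divided by the mean of |true| and |pred|
    # (undefined when both values are exactly zero).
    smape = np.mean(np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2.0))
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "smape": smape, "r2": r2}
```

Lower RMSE, MAE, and SMAPE indicate better imputation, whereas R$^{2}$ approaches 1 for a perfect fit and can become negative when the imputed values are worse than predicting the mean.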

#### 4.3. Benchmark and Experiment Settings

#### 4.3.1. Generating MNAR

#### 4.3.2. Settings of Link Set $\mathcal{M}$

#### 4.4. Results and Analysis

## 5. Discussion

## 6. Conclusions

$\mathit{a}$ | $\mathit{b}$ | Missing Percentage |
---|---|---|
−1 | −1.3 | 25% |
3 | 0 | 50% |
1 | −1.3 | 75% |
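
The table lists the parameters $a$ and $b$ but not the masking rule itself. One mechanism consistent with the listed percentages, assuming standardized (zero-mean, unit-variance) observations, is a logistic self-masking rule in which a value $y$ is dropped with probability $1/(1+e^{-a(y-b)})$. The sketch below is written under that assumption and is not confirmed by this section; `mnar_mask` is a hypothetical helper name.

```python
import numpy as np

def mnar_mask(y, a, b, rng):
    """Return a boolean mask (True = missing) from an assumed logistic self-masking rule."""
    p_missing = 1.0 / (1.0 + np.exp(-a * (y - b)))   # missingness depends on the value itself
    return rng.random(y.shape) < p_missing

rng = np.random.default_rng(0)
y = rng.standard_normal(200_000)                     # stand-in for standardized traffic values

# Parameter pairs taken from the table above
for a, b in [(-1.0, -1.3), (3.0, 0.0), (1.0, -1.3)]:
    rate = mnar_mask(y, a, b, rng).mean()
    print(f"a={a:+.1f}, b={b:+.1f}: empirical missing rate = {rate:.3f}")
```

Under these assumptions the empirical rates come out close to the 25%, 50%, and 75% listed in the table. The sign of $a$ determines whether large or small values are more likely to be missing.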

**Table 2.** Experiment settings and performance of the algorithms with different percentages of MNAR data on links.

Experiment setting: missing rate (%) @ $\mathcal{M}$ | 50 @ $\{1\}$ | | 50 @ $\{3\}$ | | 75 @ $\{1\}$ | | 75 @ $\{1,3\}$ | | 75 @ $\{3,5\}$ | |
---|---|---|---|---|---|---|---|---|---|---|
p-score | 10.62 @ $\{1\}$ | | 13.26 @ $\{3\}$ | | 10.62 @ $\{1\}$ | | $-$ | | 9.42 @ $\{5\}$ | |
**Performance comparison** | | | | | | | | | | |
Metrics | ppca-em | New | ppca-em | New | ppca-em | New | ppca-em | New | ppca-em | New |
RMSE | 0.992 | 0.746 | 0.559 | 0.595 | 1.069 | 0.746 | 0.835 | 0.871 | 0.942 | 0.627 |
MAE | 0.810 | 0.564 | 0.458 | 0.448 | 0.789 | 0.564 | 0.598 | 0.625 | 0.665 | 0.468 |
SMAPE | 0.340 | 0.223 | 0.216 | 0.157 | 0.289 | 0.223 | 0.231 | 0.228 | 0.253 | 0.201 |
R$^{2}$ | 0.150 | 0.688 | 0.595 | 0.681 | 0.545 | 0.688 | 0.208 | 0.677 | 0.115 | 0.740 |
Accuracy | 83.0% | 88.9% | 89.2% | 92.2% | 85.5% | 88.9% | 88.4% | 88.6% | 87.3% | 89.9% |
**Computing time** | | | | | | | | | | |
Sec | 6.54 | 2.03 | 6.29 | 2.03 | 6.73 | 2.64 | 6.06 | 4.06 | 11.32 | 4.11 |
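
To make the comparison in Table 2 easier to read, the short script below derives the relative RMSE reduction and the computing-time ratio of the proposed model ("New") against `ppca-em` for each setting. The numbers are copied from the table; the helper names `relative_reduction` and `speedup` are introduced only for this illustration.

```python
# Values copied from Table 2, in column order of the five experiment settings
settings = ["50@{1}", "50@{3}", "75@{1}", "75@{1,3}", "75@{3,5}"]
rmse_ppca = [0.992, 0.559, 1.069, 0.835, 0.942]
rmse_new = [0.746, 0.595, 0.746, 0.871, 0.627]
time_ppca = [6.54, 6.29, 6.73, 6.06, 11.32]
time_new = [2.03, 2.03, 2.64, 4.06, 4.11]

def relative_reduction(baseline, proposed):
    """Fractional reduction of an error metric relative to the baseline (negative = worse)."""
    return [(b - p) / b for b, p in zip(baseline, proposed)]

def speedup(baseline, proposed):
    """Ratio of baseline computing time to proposed computing time."""
    return [b / p for b, p in zip(baseline, proposed)]

rmse_gain = relative_reduction(rmse_ppca, rmse_new)
time_ratio = speedup(time_ppca, time_new)
for s, g, t in zip(settings, rmse_gain, time_ratio):
    print(f"{s:>9}: RMSE reduction {g:+.1%}, time ratio {t:.2f}x")
```

A negative reduction marks a setting where `ppca-em` attains the lower RMSE, which the table shows for 50 @ $\{3\}$ and 75 @ $\{1,3\}$, even though the R$^{2}$ and accuracy rows favor the proposed model in every setting.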

