# Impact of Data Loss on Multi-Step Forecast of Traffic Flow in Urban Roads Using K-Nearest Neighbors


## Abstract


## 1. Introduction

## 2. K-Nearest Neighbors Model

#### 2.1. Basic KNN and Notations

#### 2.2. Enhanced KNN

- Detector-wise: although the flow differs from one detector to another, as shown in Figure 1, patterns from different detectors may be partially similar. Because such candidates come from different detectors, the remaining parts of their profiles may differ considerably, which degrades forecasting accuracy. Thus, when a prediction is made for a given detector, the model inspects only the data of that detector.
- Weekday-wise: when our model searches the historical data for state vector profiles, it only considers the same weekday. For instance, if the current state vector is taken from a Wednesday, all profiles are constructed from historical Wednesdays. Observations showed a clear difference between working-day and weekend flows; moreover, even days of the same category differ in their flow patterns, which justifies this choice. In preliminary experiments, this weekday-wise pre-selection of state vector profiles significantly improved the model’s performance.
- State vector length: the length of state vectors, denoted by $l$, indicates how far back from a given instant $t$ the data remain relevant for accurate predictions. Hence, a state vector of length $l$ is given by: $$v(t-l,t)=\left[f(t-l),f(t-l+1),\dots ,f(t-1),f(t)\right]$$ The length of state vectors also affects the prediction quality. If $l$ is relatively small, the information provided by the state vector may be insufficient to make accurate predictions. However, a longer state vector might introduce irrelevant information. To choose the best value of $l$, preliminary experiments were run with $l\in \{20,30,40,50,60,90,120,150\}$ min. Tests showed that $l=60$ min (six timestamps) is the best choice for our dataset.
- Search radius: to ensure that state vector profiles share similar characteristics with the current state vector, we only consider profiles within a certain radius, denoted by $R$ when expressed in minutes and by $r$ when expressed in timestamps. The model selects profiles lying no further than $r$ timestamps before or after the current instant $t$, so the search space is constrained to the interval between $t-r$ and $t+r$. As $R$ decreases, the share of profiles whose characteristics are close to those of the current state vector increases, and vice versa. However, a smaller $R$ also yields fewer profiles, which may in turn affect the prediction accuracy. Consequently, a trade-off value of $R$ has to be determined. Experiments covered $R\in \{40,50,60,90,120,150,200,300\}$ min and showed that $R=90$ min ($r=9$ timestamps) is the best search radius for our experiments. Note that above 200 min (20 timestamps) the efficiency decreases drastically, which indicates that imposing a search radius is worthwhile.
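The four restrictions above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `candidate_profiles` and `eknn_forecast`, the 10-minute sampling constant, the Euclidean distance, the choice of `k`, and the plain averaging of the neighbors' subsequent values are all assumptions made for illustration, since the basic KNN definitions are not reproduced in this section.

```python
import numpy as np

STEPS_PER_DAY = 144  # assumption: one reading every 10 min (60 min = six timestamps)


def candidate_profiles(flow, t, l=6, r=9, horizon=1):
    """Collect same-weekday state vectors ending within r timestamps of t's time of day.

    flow holds the series of a single detector (detector-wise restriction);
    index 0 is assumed to be the first reading of a day.
    """
    day_t, slot_t = divmod(t, STEPS_PER_DAY)
    profiles, futures = [], []
    # Earlier days that share the weekday of day_t (weekday-wise restriction).
    for day in range(day_t % 7, day_t, 7):
        # Search radius: candidate end points between slot_t - r and slot_t + r.
        for slot in range(slot_t - r, slot_t + r + 1):
            end = day * STEPS_PER_DAY + slot
            if end - l < 0 or end + horizon > t:
                continue  # the window or its subsequent values are unavailable
            profiles.append(flow[end - l:end + 1])           # v(end - l, end)
            futures.append(flow[end + 1:end + horizon + 1])  # values that followed
    return np.array(profiles), np.array(futures)


def eknn_forecast(flow, t, k=5, l=6, r=9, horizon=1):
    """Average the values that followed the k profiles closest to v(t - l, t)."""
    current = flow[t - l:t + 1]
    profiles, futures = candidate_profiles(flow, t, l, r, horizon)
    if len(profiles) == 0:
        raise ValueError("no historical profile satisfies the restrictions")
    # Euclidean distance is an assumption; other metrics could be substituted.
    distances = np.linalg.norm(profiles - current, axis=1)
    nearest = np.argsort(distances)[:k]
    return futures[nearest].mean(axis=0)
```

With `l=6` and `r=9`, each earlier day of the same weekday contributes at most 19 candidate profiles, which shows how strongly the search radius shrinks the search space compared with scanning the whole history.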

## 3. Imputation Techniques

#### 3.1. Mean

#### 3.2. Mean per Weekday

#### 3.3. Linear Regression

## 4. Data and Reconstructed Data

#### 4.1. Data Description

#### 4.2. Reconstructed Data

#### 4.3. Performance of Imputation Methods

- Completeness ratio: The experiments reported in Table 1 and Table 2, plotted in Figure 5 and Figure 6, respectively, used different completeness levels to investigate the impact of different proportions of missing data. The results showed that the completeness ratio influences the accuracy of the imputation methods. As the number of missing entries increases, the performance of the three imputation methods degrades: the MAE rises from around 51 at $90\%$ completeness to around 91 at $50\%$ completeness. This is clearly apparent in Figure 6, where the deletions use a list of random gap lengths. In contrast, Figure 5 shows that the completeness ratio has only a slight impact when deletions are based on fixed gap lengths. This behavior is mainly due to the large deletion gaps (week and month): in this case, deletions sometimes fall mostly in the training set and sometimes in the test set, which makes the performance fluctuate.
- Gap lengths: The results in Table 3 and Figure 4 suggest that the imputation methods perform worse for small gap deletions than for larger ones. When the gap length is between 10 min and 1 day, the MAE lies roughly between 76 and 80; it drops to around 72 for a one-week gap and to around 58 for a one-month gap with the mean-per-weekday and linear regression methods. This suggests, first, that deleting whole consecutive days has a smaller impact on performance than deleting shorter segments within a day. Second, it means that training on a smaller but complete dataset is better than training on a larger one with many missing segments shorter than one day. The performance improves further as the gap grows to one week and one month. In these cases, two possibilities have to be considered. The first is that the deletions fall mostly in the training set (because of their length, a week or a month), so only a few entries of the test set have to be imputed, which explains the low errors (MAE). The other is that most missing entries are located in the test set, so the training set is nearly complete, which benefits the imputation of the missing values in the test set.
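The experimental setup discussed above (delete gaps until a target completeness ratio is reached, impute, then score the imputed entries with the MAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the helper names, the random placement of gaps, the 10-minute sampling, and the exact form of the mean and mean-per-weekday imputations are hypothetical, because the corresponding definitions are not reproduced in this section.

```python
import numpy as np


def delete_gaps(flow, completeness, gap_length, rng):
    """Blank out random gaps of gap_length timestamps until the target ratio is reached.

    The result may miss up to gap_length - 1 entries more than requested.
    """
    out = np.asarray(flow, dtype=float).copy()
    target_missing = int(round((1.0 - completeness) * len(out)))
    while np.isnan(out).sum() < target_missing:
        start = rng.integers(0, len(out) - gap_length + 1)
        out[start:start + gap_length] = np.nan
    return out


def impute_mean(series):
    """Fill every missing entry with the mean of the observed entries."""
    filled = series.copy()
    filled[np.isnan(series)] = np.nanmean(series)
    return filled


def impute_mean_per_weekday(series, steps_per_day=144):
    """Fill missing entries with the observed mean of the same weekday.

    Assumes index 0 is the first reading of a day and 10-minute sampling.
    """
    filled = series.copy()
    weekday = (np.arange(len(series)) // steps_per_day) % 7
    for d in range(7):
        same_day = weekday == d
        observed = series[same_day & ~np.isnan(series)]
        # Fall back to the global mean if a weekday has no observed entry.
        value = observed.mean() if observed.size else np.nanmean(series)
        filled[same_day & np.isnan(series)] = value
    return filled


def mae_on_gaps(original, corrupted, filled):
    """Mean absolute error restricted to the entries that were deleted."""
    gaps = np.isnan(corrupted)
    return np.abs(original[gaps] - filled[gaps]).mean()
```

Calling `delete_gaps(flow, 0.8, 6, np.random.default_rng(0))` would, for example, remove about 20% of a series in one-hour blocks; passing the result to either imputation function and then to `mae_on_gaps` gives one error figure for that combination of completeness ratio and gap length.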

#### 4.4. Deviation between Original and Reconstructed Data

## 5. Results and Discussion

#### 5.1. Under Original Data

#### 5.2. Under Artificial Datasets

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 7.** Deviation between original and reconstructed datasets with different missing portions: Detector MS219.

**Figure 8.** Distribution of original and reconstructed data with $50\%$ completeness and gaps of length 1.

**Figure 9.** Distribution of original and reconstructed data with $50\%$ completeness and gaps of length 72.

**Figure 16.** Performance of E-KNN on incomplete and completed datasets with a completeness level of $90\%$.

**Table 1.** Performance (MAE) of imputation methods as a function of completeness ratio on datasets with fixed gap lengths.

| Completeness Ratio | $50\%$ | $60\%$ | $70\%$ | $80\%$ | $90\%$ |
|---|---|---|---|---|---|
| Mean | 77.28 | 78.71 | 76.24 | 79.11 | 79.09 |
| Mean Weekday | 78.26 | 75.80 | 74.04 | 79.11 | 76.46 |
| Linear Regression | **76.95** | **74.82** | **73.26** | **78.42** | **76.01** |

**Table 2.** Performance (MAE) of imputation methods as a function of completeness ratio on datasets with a list of gap lengths.

| Completeness Ratio | $50\%$ | $60\%$ | $70\%$ | $80\%$ | $90\%$ |
|---|---|---|---|---|---|
| Mean | 92.13 | 69.48 | 71.62 | 64.90 | 51.62 |
| Mean Weekday | 91.56 | 57.82 | 66.50 | 61.83 | 51.60 |
| Linear Regression | **91.22** | **57.27** | **66.50** | **61.44** | **51.50** |

| Gap Length (timestamps) | Mean | Mean Weekday | Linear Regression |
|---|---|---|---|
| 1 | 78.50 | 77.89 | **76.85** |
| 3 | 78.62 | 79.70 | **78.33** |
| 6 | **78.33** | 79.42 | 78.40 |
| 36 | 79.90 | 80.47 | **79.65** |
| 72 | 80.45 | 79.98 | **78.85** |
| 144 | 77.49 | 77.03 | **76.39** |
| 288 | 78.13 | 77.16 | **76.56** |
| 1008 | 76.51 | 72.55 | **71.99** |
| 4320 | 72.12 | 58.83 | **58.10** |

| Detector ID | 06:00–09:00 MAE | 06:00–09:00 MAPE | 16:00–19:00 MAE | 16:00–19:00 MAPE | 06:00–22:00 MAE | 06:00–22:00 MAPE | All Day MAE | All Day MAPE |
|---|---|---|---|---|---|---|---|---|
| MS217 | 63.58 | 14.30 | 100.50 | 12.10 | 75.08 | 11.97 | 62.87 | 16.64 |
| MS218 | 23.96 | 28.86 | 55.94 | 20.29 | 36.11 | 22.78 | 28.80 | 30.57 |
| MS219 | 61.54 | 15.34 | 92.90 | 11.98 | 71.28 | 12.35 | 60.41 | 18.80 |
| MS220 | 73.98 | 21.12 | 80.82 | 17.16 | 74.10 | 18.30 | 62.32 | 27.12 |
| MS221 | 26.24 | 25.25 | 53.18 | 16.99 | 45.87 | 21.58 | 38.24 | 31.30 |
| MS222 | 38.42 | 26.63 | 57.00 | 17.77 | 47.59 | 19.31 | 37.11 | 30.54 |
| MS223 | 47.54 | 23.98 | 49.78 | 17.45 | 43.80 | 19.09 | 35.06 | 29.68 |
| Average | 47.89 | 22.21 | 70.02 | 16.25 | 56.26 | 17.91 | 46.40 | 26.38 |

**Table 5.** E-KNN’s performance for imputation methods as a function of completeness ratio on datasets with a list of gap lengths.

| Method | $50\%$ MAE | $50\%$ MAPE | $60\%$ MAE | $60\%$ MAPE | $70\%$ MAE | $70\%$ MAPE | $80\%$ MAE | $80\%$ MAPE | $90\%$ MAE | $90\%$ MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Incomplete | 157.19 | 158.73 | 150.03 | 127.85 | 102.47 | 114.57 | 120.49 | 112.31 | 58.73 | 37.23 |
| Mean | **53.56** | 32.32 | 54.09 | 34.57 | 53.10 | 33.00 | 51.16 | 30.14 | 46.97 | 27.93 |
| Mean Weekday | 53.76 | 29.24 | **51.65** | 29.45 | **52.12** | 29.81 | **50.29** | 27.39 | **46.69** | 26.35 |
| Linear Regression | 53.79 | **29.10** | 51.72 | **29.38** | 52.25 | **29.72** | 50.46 | **27.21** | 46.73 | **26.30** |

**Table 6.** E-KNN’s performance for imputation methods as a function of completeness ratio on datasets with fixed gap lengths.

| Method | $50\%$ MAE | $50\%$ MAPE | $60\%$ MAE | $60\%$ MAPE | $70\%$ MAE | $70\%$ MAPE | $80\%$ MAE | $80\%$ MAPE | $90\%$ MAE | $90\%$ MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Incomplete | 124.91 | 111.74 | 167.87 | 139.26 | 159.17 | 159.63 | 142.12 | 131.63 | 111.20 | 93.35 |
| Mean | 60.35 | 38.45 | 56.99 | 35.66 | 53.69 | 33.49 | 50.79 | 30.55 | 48.09 | 28.01 |
| Mean Weekday | 60.15 | 33.34 | **56.58** | 31.61 | **53.18** | 29.95 | 50.63 | 28.40 | **47.87** | 26.83 |
| Linear Regression | **60.09** | **33.17** | 56.62 | **31.48** | 53.28 | **29.85** | **50.27** | **28.34** | 47.91 | **26.78** |

| Gap Length (timestamps) | Incomplete MAE | Incomplete MAPE | Mean MAE | Mean MAPE | Mean Weekday MAE | Mean Weekday MAPE | Linear Regression MAE | Linear Regression MAPE |
|---|---|---|---|---|---|---|---|---|
| 1 | 58.73 | 32.74 | 52.16 | 32.03 | 52.41 | 29.71 | 52.36 | 29.58 |
| 3 | 57.57 | 32.29 | 53.22 | 32.75 | 53.74 | 30.30 | 53.62 | 30.16 |
| 6 | 174.61 | 207.74 | 53.34 | 32.62 | 53.92 | 30.41 | 53.78 | 30.27 |
| 36 | 235.41 | 179.15 | 55.14 | 34.36 | 54.98 | 31.06 | 55.08 | 30.98 |
| 72 | 219.87 | 215.39 | 55.57 | 34.62 | 54.89 | 31.07 | 55.02 | 31.01 |
| 144 | 172.62 | 166.38 | 55.12 | 33.47 | 54.71 | 30.01 | 54.84 | 29.95 |
| 288 | 151.30 | 157.05 | 55.37 | 33.92 | 54.11 | 29.60 | 54.24 | 29.51 |
| 1008 | 144.13 | 139.71 | 56.00 | 35.25 | 55.41 | 30.79 | 55.64 | 30.74 |
| 4320 | 70.35 | 47.85 | 49.92 | 30.08 | 48.97 | 27.27 | 48.95 | 27.12 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mallek, A.; Klosa, D.; Büskens, C.
Impact of Data Loss on Multi-Step Forecast of Traffic Flow in Urban Roads Using K-Nearest Neighbors. *Sustainability* **2022**, *14*, 11232.
https://doi.org/10.3390/su141811232
