Traffic Missing Data Imputation: A Selective Overview of Temporal Theories and Algorithms

Abstract: A great challenge for intelligent transportation systems (ITS) is missing traffic data. Traffic data are inputs to various transportation applications. In the past few decades, several methods for traffic temporal data imputation have been proposed. A key issue is that temporal information collected by neighbor detectors can make traffic missing data imputation more accurate. This review analyzes traffic temporal data


Introduction
With the development of smart cities, a lot of data can be collected to make many aspects of a city better, such as transportation [1]. The mining of traffic data can be used in the society and city analysis [2][3][4]. Tremendous traffic data can be collected by various sensors, including loop detectors, probe vehicles, infrared radars, and cameras. These datasets, such as volume, delays, travel time reliability, and emissions, not only capture the dynamic operating status of transportation systems but are also expensive in terms of the human and material resources needed to install the required devices and store the data.
Furthermore, the collected data face two problems. One problem is missing data. Missing data can be caused by sensor malfunctions, communication network problems, restricted power supply conditions, scheduled maintenance, severe weather, or aging problems [5][6][7]. For instance, the missing ratio of traffic data in Beijing was nearly 10%, and 4% was caused by the malfunction of sensors. Under extreme conditions, the missing ratio could reach 20% [8]. For the California performance measurement system (PeMS), the missing ratio could also reach 10%. The other problem is data sparseness, which usually happens when the detector coverage is low. Many applications of intelligent transportation systems (ITSs) rely on complete data, so the missing data problem influences the promotion of ITSs and their applications [9][10][11][12][13]. For example, on congested roads or during peak hours, missing data can mislead traffic analysis and traffic management. Additionally, the statistical analysis can also be misled. Generally, due to missing data, the dataset becomes

Research Methods
The common imputation methods can be divided into two groups: simple methods and complex methods. Simple methods use simple operations; for instance, a missing value can be set to the mean or median. Simple methods may be a good choice when the missing percentage is low, because little information is lost and the computational burden is small. Many studies suggest simple methods such as all possible values imputation (APV) [16] and historical mean, median, and mode values imputation (MMMI) [17,18]. In addition, the common factor method (CFM) is a temporal imputation method that calculates impact factors as coefficients from historical data and estimates each missing value as a weighted average using those factors [19,20]. Simple methods usually assume that traffic data evolve regularly. However, they cannot perform well once the amount of missing data is large, especially when the missing data are consecutive. When the number of missing items is large, the features of the data cannot be extracted, which means that the model cannot classify data sensitively. Furthermore, when various stochastic factors are at work, the fluctuations and randomness of traffic data should not be ignored. As a result, simple methods that rely on historical data or factors are not suitable for imputing this kind of traffic dataset. To solve this problem, complex methods have been proposed. Complex methods can fit complex patterns of traffic data and perform better under some special practical conditions [21][22][23][24][25][26][27][28]. Complex methods mainly include three categories: prediction methods, interpolation methods, and statistical learning methods.
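The historical-mean idea behind the simple methods can be sketched as follows. This is a minimal illustration, assuming that flow readings are stored as one list per day with `None` marking a missing interval; the function name `historical_mean_impute` and the data layout are hypothetical and not taken from the cited works.

```python
from statistics import mean


def historical_mean_impute(days):
    """Fill None entries with the mean of the same interval on other days.

    `days` is a list of equal-length lists (one list per day); None marks a
    missing reading. This layout is an assumption made for illustration.
    """
    n_intervals = len(days[0])
    filled = [list(day) for day in days]
    for t in range(n_intervals):
        observed = [day[t] for day in days if day[t] is not None]
        if not observed:
            continue  # no history for this interval: leave the gap as-is
        fill = mean(observed)
        for day in filled:
            if day[t] is None:
                day[t] = fill
    return filled


# Made-up flow counts for three days and four intervals.
flows = [
    [120, 135, None, 160],
    [118, None, 150, 158],
    [122, 131, 154, None],
]
print(historical_mean_impute(flows))
```

If a whole interval is missing on every day, this sketch leaves the gap in place, which mirrors the weakness of simple methods under consecutive or heavy data loss.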
Prediction methods impute the dataset by predicting the value of missing data via prediction models, which are usually based on historical data [29]. Previously, one-way prediction methods, including heuristic techniques (HT) (historical average, weighted average) [30], Kalman filter (KF) [31], autoregressive integrated moving average (ARIMA) [32], data augmentation (DA) [30], seasonal ARIMA [33,34], feed-forward neural network (FFNN) [35] and fractionally integrated vector autoregressive and moving average (VARFIMA) [36], have been based mostly on temporal neighboring information. However, these methods assume that the value of the missing interval of a day is similar to the latest intervals or the same interval in neighboring days. This assumption does not consider the random fluctuation of traffic flows among days. Actually, two-way spatial or temporal data can improve the prediction accuracy. Considering spatial and temporal dependency, more and more studies use matrix-based methods for the prediction. Matrix-based methods utilize the full information of different days at the same detector or multiple detectors on the same day. Different extended ARIMA models have been studied to consider the spatial correlation between different detectors [37][38][39][40][41]. After the multivariable state-space model was tested [42], an extended Kalman filter approach was developed, which shares errors at two adjacent road links for the traffic flow estimation [43]. Moreover, various Bayesian networks (BN) have been designed and examined [44][45][46]. In addition, the adaptive least absolute shrinkage and selection operator (LASSO) has been used to select the model and estimate coefficients simultaneously when forecasting missing data [47,48].
Additionally, the full consideration of the data collected near the missing data can improve the imputing accuracy. To utilize the data collected near the missing data and retain appropriate training time costs for the online imputing procedure, interpolation methods have been proposed. They replace the value of the missing data with the average or the weighted average of known multidimensional data in the same site or neighboring states of adjacent sites by regression and clustering with temporal and spatial information [49]. Similar to prediction methods, temporal neighboring interpolation methods (e.g., exponential smoothing (ES) [50], splines interpolation (SI) [51], hot (cold) deck imputation (HDI) [52], Bayesian iteration imputation (BII) [53], regression methods (RM) [54][55][56][57][58][59], multiple imputation (MI) [60,61], k-nearest neighbor imputation (KNN) [62], and improved KNN version local least squares (LLS) [63]) are commonly used to interpolate missing values. Another type of interpolation method is pattern neighboring-based methods [64,65], which try to find the closest fluctuation and missing pattern from the neighboring days or detectors. A self-organized map-based method for urban networks, which is associated with wavelet, has also been tested [66]. However, it is hard to build a database that includes all possible patterns because it cannot be guaranteed that all patterns have been collected and recorded previously. Pattern neighboring-based methods neglect the stochastic variation of traffic flow. As a result, once there is no record of a proper pattern, the chosen pattern is not similar enough to the original one and their shapes cannot match well. A large number of studies have shown that the various patterns of traffic volumes are usually similar to the patterns of weekdays among different weeks or the patterns of other detectors on the same day. 
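As a small illustration of a temporal neighboring interpolation method, the following sketch applies simple exponential smoothing to fill gaps. It assumes a single detector series with `None` for missing readings; the function name `es_impute` and the smoothing factor `alpha` are illustrative choices rather than the formulation of [50].

```python
def es_impute(series, alpha=0.5):
    """Impute None entries with the running exponentially smoothed level.

    Level update: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    A missing x_t is replaced by s_{t-1}; the level is then left unchanged.
    Leading gaps (before any observation) stay None.
    """
    level = None
    out = []
    for x in series:
        if x is None:
            out.append(level)  # may still be None if nothing observed yet
            continue
        level = x if level is None else alpha * x + (1 - alpha) * level
        out.append(x)
    return out


print(es_impute([100, 110, None, 130, None], alpha=0.5))
# -> [100, 110, 105.0, 130, 117.5]
```

Because the level is built only from earlier readings, this is a one-way scheme; long runs of consecutive gaps are all filled with the same value.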
In addition, it is necessary to apply a multi-vector or matrix-based data structure to complement the imputation process for making use of either full spatial or temporal flow variation information. Fuzzy c-means (FCM) [67] is a statistical clustering approach applied to resolve missing data with the input of matrix-based data. The widely ignored distinction between days in a week is introduced, and the input is represented as the value of a time step of a certain day in a week to distinguish patterns of different days. The fuzzy c-means algorithm is introduced to classify the known days at the same station, and the values of missing data are imputed by minimizing the errors between the imputation and the value of clusters, for which a genetic algorithm is applied. This is reasonable because vehicles generally move along a specific route and through a sequence of intersections, and the variations in the traffic flow of nearby intersections are related.
Statistical learning is a kind of machine learning that uses statistical methods. It can be regarded as a special case of data-based machine learning. Starting from some observation (training) samples, it attempts to obtain laws that cannot be obtained through principle analysis and uses these laws to analyze objective objects, so as to make a more accurate prediction of future data. Statistical learning methods try to learn the scheme by fitting and mapping with the observed data, then impute the missing data multiple times and make statistical inferences in an iterated procedure [68]. A statistical model is a model based on probability theory and mathematical statistics. Some processes cannot be modeled through theoretical analysis, but the functional relationship between their variables can be obtained from experimental data and mathematical statistics; such a relationship is called a statistical model. Classical statistical models used for imputation include expectation-maximization (EM) [69], maximum likelihood (ML) [70], the treatment method for C4.5 [44], Bayesian network (BN) and Markov chain Monte Carlo (MCMC) [44], and probabilistic principal component analysis (PPCA) [71]. Having a statistical foundation is one advantage of EM and ML. However, EM and ML have some limitations; for instance, the original data distribution needs to be assumed. BN learns the probability distributions and produces unbiased estimations and confidence intervals. Unfortunately, in the process of estimating parameters, BN and MCMC rely heavily on prior knowledge. The above methods operate on vector-based input with interval-to-interval variations.
Matrix-based PPCA integrates PCA and ML for adapting a high missing ratio, in which PCA builds up the latent sliding regression model to separate the significant and dominant Gaussian-type linear parts of low-dimension traffic flow from the parts that are normal and hard to describe by models, and ML uses the dominant parts to calculate the value of missing data. As a result, PPCA balances the periodicity, predictability, and other statistical features. Kernel PPCA (KPPCA) is also proposed to construct the relationship between collected data and latent variables, which is nonlinear [72]. An extension of the data cube in data mining is described for large traffic flow databases, which arranges data as two-dimensional spatial-temporal plots [73]. These matrix-based and cube-based methods usually focus on special locations, and only time series data are explored. Some problems need to be solved: (1) the matrix data results in the spatial data and temporal data not being utilized simultaneously [74]; (2) the mode number cannot exceed the matrix dimension, which is two, so the methods based on the matrix can only discover limited mode correlations; (3) when the missing ratio is large, matrix-based methods cannot impute data well, especially in some extreme cases. Recent statistical methods mainly focus on the traffic data in the spatial dimension. As for the temporal dimension, the patterns are more focused. The Bayesian estimation method was introduced to modify PPCA (BPCA) [75] and tested in two neighboring points (one is from the upstream detector, and the selected time is the interval before the current time interval. The other is from the downstream detector, and the selected time is the time interval after the current time interval). Continuous hyperparameters are introduced and used to determine the latent variable dimension. PPCA does not require prior distribution. PPCA is unlikely to face overfitting problems. 
The best latent variable dimension still remains an intractable problem. Because it cannot be determined in advance, the imputation error and the complexity of the models are not fixed either; they change with the latent variable dimension. Clearly, vehicles move in both the spatial dimension and temporal dimension. As a result, it is more reasonable to analyze traffic data in three or more dimensions, so that they can contain both spatial and temporal data. As an extension of matrix-based methods, tensor-based methods were developed. Tensor-based methods can combine multiple variables to estimate missing data by introducing multi-way spatial relations, such as link-mode and hour-mode. A first-order weighted optimization Tucker decomposition imputation method (TDI) determines principal components by the weighted optimization (WOPT) algorithm and can handle missing ratios of up to 90% [23]. Because Tucker decomposition is not unique, which leads to a nonconvex objective function, the truncated higher-order singular value decomposition (HOSVD) is used for initialization to supply a suboptimal initial approximation. Only a limited number of studies have described the latent traffic patterns explored by tensor models, since existing methods try to fill the gap between practical data and tensor models. Additionally, the size of the core tensor is usually determined manually in most numerical studies, instead of being related to the total features. The aforementioned prediction, interpolation, and statistical imputation methods apply typical regression, neural network, classifier, and other machine learning models to impute missing data with means, regression, correlations, clustering, patterns, and schemes, building a good foundation for deep learning imputation methods. Driven by data quantity, deep learning methods are promising candidates for achieving better imputation accuracy than the above traditional methods.
Considering the spatial and temporal dependency, the kernel regression model combined with k-nearest neighbors (KNN) has also been applied to forecast missing values with consideration of spatial data from neighboring stations [76]. Generic traffic flow features are firstly captured by using the stacked autoencoder model (SAE) [77]. Likewise, a stacked denoising autoencoder (SDAE) is further used to find relationships among neighbor sensor clusters using a k-means clustering algorithm based on the average daily traffic [78]. The function of SDAE is validated over a dataset, which contains traffic flows with 10-90% missing ratios during 6 days. Next, considering the data condition covering both weekdays and weekends and nearly 50% of missing data, SDAE is used to extract missing features from more data dimensions for missing data imputation [79]. Missing data with no gap length restrictions are interpolated by two brand new machine learning methods [80], in which the spatial context is modeled by surrounding sensors and the optimal pattern clusters are learned by an automated clustering analysis tool. A long short-term memory (LSTM) framework is employed to infer missing traffic data based on the combination of the mean value and the last observation data, neglecting the pseudo-periodic characteristics of the traffic data [81]. An exponential function and a partition function are utilized to resolve the attention weights for predicting the missing values [82]. Then two temporal smoothing methods are applied to infer missing data with consideration of long-period and short-period information with the revised LSTM, by using missing flags and missing interval weights [83]. The prediction residual is also learned by using a masking vector and an influential factor with an exponential distribution to model the decay weights in cells. A spatial and temporal multi-view learning algorithm that integrates LSTM units and support vector regression (SVR) is proposed [84]. 
Other traffic data, such as probe data from floating vehicles, are also used to improve the imputation accuracy by data fusion [85]. Multi-output Gaussian processes (GPs) are employed to combine the spatial and temporal missing patterns together with probe data. Observation uncertainty is resolved through the Bayesian nonparametric formalism of GPs. The complicated spatial dependencies between nearby road segments are captured by a multi-output extension mechanism through convolution [84]. Afterward, a convolutional neural network (CNN) is proposed to act as the context encoder for imputing the images with missing data, which transforms the raw data into spatial-temporal images in advance. A neural network for deep learning can offer a flexible framework for extracting and identifying the dynamic spatial and temporal local trends of observed data, and it can recognize the patterns of missing data and memorize the historical information in the long term.
Supervised classification algorithms have also been combined with imputation methods [86,87]: different types of classifiers, including KNN, approximate models, and decision trees, are trained on data previously imputed by various imputation methods. Consequently, the imputation performance is measured by the classification accuracy through cross-validation. The results imply that complicated imputation methods usually outperform simpler methods. No matter what method is adopted, the most important thing is to fully utilize the potential spatial- and temporal-related data.
Additionally, some scholars have used traffic simulation models to perform traffic data imputation, such as DynaSmart [88], DynaMIT [89], Vissim [90], Paramics [91], and TransWorld [92,93]. However, these simulation models are closely related to the adopted assumptions and the validation accuracy of basic simulation modules in the systems, which is not easy to handle well.

Missing Pattern
Before selecting the most suitable imputation methods, we have to identify the missing data types.
According to recent works [7,8,28,80,94], the statistical missing patterns can be divided into three types: missing at determinate/missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). These missing patterns are shown in Figure 1, and they have the following characteristics:
1. MNAR: The missing data are regular. It usually means that there are some faults in detectors.
2. MAR: The points of missing traffic data are related to nearby points. They usually occur as a group at special intervals, but the position of each group is random.
3. MCAR: The missing data are isolated, random, and independent.
Actually, MCAR is a kind of MAR. MCAR is special in all MAR because the missing possibility is almost certain, and it is hard to perform the data imputation when detectors suffer from long-time malfunction. In this case, MAR and MCAR are usually used to conduct the imputation. There is another classification measured by the length of missing time, in which the missing patterns are categorized as short-period missing values and long-period missing values [83].
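For experiments, the three missing patterns are often reproduced artificially on complete data. The sketch below generates boolean masks (`True` = missing) that follow the characteristics listed above: isolated independent points, randomly placed groups, and regular gaps. The function names, parameters, and the 288-interval day are assumptions made for illustration.

```python
import random


def mcar_mask(n, rate, rng):
    """Isolated, independent gaps: each point is dropped with probability `rate`."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(n, n_groups, group_len, rng):
    """Gaps occur as groups of consecutive points at random positions."""
    mask = [False] * n
    for _ in range(n_groups):
        start = rng.randrange(0, n - group_len + 1)
        for t in range(start, start + group_len):
            mask[t] = True
    return mask


def mnar_mask(n, period, offset, length):
    """Regular gaps: the same window is lost in every period."""
    return [offset <= (t % period) < offset + length for t in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 288  # one day of 5-min intervals
print(sum(mcar_mask(n, 0.1, rng)), "points missing (isolated)")
print(sum(mar_mask(n, 3, 12, rng)), "points missing (grouped)")
print(sum(mnar_mask(n, 96, 10, 6)), "points missing (regular)")
```

Groups produced by `mar_mask` may overlap, so the number of missing points can be smaller than `n_groups * group_len`.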

Assumption
Most existing imputing methods usually make clear assumptions in advance. Because traffic variations are influenced by many factors, which cannot be fully considered, these methods are restricted when applied in many cases. The assumptions are as follows.
1. Data are lost randomly.
2. Missing data are at low missing rates.
3. Missing data are assumed to have determined attributes or patterns.
4. Missing data are mostly influenced by historical data, neighboring data, or spatial data, without consideration of variations from interval to interval, from day to day, or from upstream to downstream.

Imputation Style
There are two kinds of imputation styles: the singular choice imputation and the multiple choice imputation [6].

Singular choice imputation (SCI)
In SCI, the imputation result is obtained directly from the proposed model or scheme. The algorithms do not consider multiple choices of different attributes or patterns, and no effort is made to improve the spatiotemporal correlation analysis or to incorporate the most valuable information for imputation. Because of its fast computational speed, SCI is often applied for real-time analysis. As most patterns and attributes are predetermined in the models and strategies of prediction methods and interpolation methods, these are mostly SCI methods, such as hot deck, average, and regression.

Multiple choice imputation (MCI)
MCI methods overcome the drawback of SCI methods, which derive standard errors of the estimated parameters that are too small. The basic idea of MCI is: (a) proposing a model that incorporates random variation in imputing missing data; (b) generating complete datasets M times; and (c) analyzing each complete dataset and utilizing statistical parameters, probabilistic results, and EM- or ML-iterated algorithms from the M cases to infer the best single one. Thus, it can deal with the inherent uncertainty of the imputations. For MI, the data must be missing at random. Most statistical methods are MCI.
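Steps (a) to (c) can be illustrated with a deliberately small sketch. It assumes, purely for illustration, that missing values are drawn from a normal distribution fitted to the observed readings, and that the per-dataset analysis is just the mean; the pooling step here averages the M estimates and reports their spread. The names `multiple_impute` and `pool_means` are hypothetical.

```python
import random
from statistics import mean, stdev, variance


def multiple_impute(series, m, rng):
    """Return `m` completed copies of `series`; None entries get random draws.

    Imputation model (an assumption for this sketch): missing values are drawn
    from a normal distribution fitted to the observed values.
    """
    observed = [x for x in series if x is not None]
    mu, sd = mean(observed), stdev(observed)
    return [
        [x if x is not None else rng.gauss(mu, sd) for x in series]
        for _ in range(m)
    ]


def pool_means(completed):
    """Analyze each completed dataset (here: its mean) and pool the results."""
    estimates = [mean(data) for data in completed]
    pooled = mean(estimates)
    between_var = variance(estimates)  # spread caused by imputation uncertainty
    return pooled, between_var


rng = random.Random(42)
flows = [120, 135, None, 160, 118, None, 150, 158]  # made-up readings
completed = multiple_impute(flows, m=20, rng=rng)
pooled, between_var = pool_means(completed)
print(f"pooled mean = {pooled:.1f}, between-imputation variance = {between_var:.2f}")
```

The between-imputation variance is what a single-choice method cannot report, which is why its standard errors come out too small.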

Applicable Conditions
Under various and complex traffic conditions with different data qualities, different imputation methods show distinct performances, which depend on the conditions under which each method is applicable or invalid.
First, the relationship between imputation performance and the missing rate has been discussed frequently in many studies. The effects of missing rates on hybrid neural network approaches were evaluated [95]. Each day has 12 consecutive hour gaps, which can affect both the influence of the time interval and the effectiveness of the selected imputation methods [64]. Gaps are generated for 10% to 90% of the dataset by using a clustering approach. It was found that the result is acceptable even with a large missing ratio [78]. Even for a challenging missing range of 1 month, the method still gives an acceptable result [61]. These methods, used to generate random gaps, are similar to the missing pattern of MNAR, in which the data loss is usually caused by some faults in the detector or in the process of communication. TDI handles missing ratios of up to 90% [23]. The mixed missing rate is also tested [8,93]. Consecutive interval missing data are tackled properly by the massive vector classification method, and the performance with data of different interval units (i.e., 5 min, 15 min, and 1 h) is tested [83].
Second, the numbers of partially missing detectors and invalid detectors in the spatial range are also critical factors to be discussed, which will heavily influence MCAR and MNAR and the performance of methods affected by spatial elements. However, these kinds of studies are rare.
In summary, different methods are effective under distinct conditions, such as the number of partially missing detectors, the number of invalid detectors, interval units, missing patterns, data size, and missing rates, which can determine the real-time application performance.
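The dependence of performance on the missing rate discussed above can be probed with a simple masking experiment: hide a known fraction of a complete series, impute it, and score only the hidden points. The sketch below uses a mean imputer as a stand-in baseline and a synthetic sinusoidal profile; both are assumptions for illustration, and no real detector data are involved.

```python
import math
import random


def mean_impute(series):
    """Baseline imputer: replace None with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    fill = sum(observed) / len(observed)
    return [fill if x is None else x for x in series]


def rmse_at_rate(truth, rate, imputer, rng):
    """Hide a fraction `rate` (< 1) of points at random, impute, score the gaps."""
    n_hide = max(1, round(rate * len(truth)))
    hidden = set(rng.sample(range(len(truth)), n_hide))
    masked = [None if t in hidden else x for t, x in enumerate(truth)]
    imputed = imputer(masked)
    sq_err = [(imputed[t] - truth[t]) ** 2 for t in hidden]
    return math.sqrt(sum(sq_err) / len(sq_err))


rng = random.Random(0)
# Synthetic daily-like profile (an assumption; not real traffic data).
truth = [100 + 50 * math.sin(2 * math.pi * t / 288) for t in range(288)]
for rate in (0.1, 0.5, 0.9):
    print(rate, round(rmse_at_rate(truth, rate, mean_impute, rng), 2))
```

Any other imputer with the same signature can be passed in place of `mean_impute` to compare methods under identical masking.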

1. Offline imputation with moderate-to-high datasets. Many imputation works are offline since the goal of imputing data is mining the latent regulation and correlation behind the data. Without the high requirement of calculating speed, offline imputation can also fully utilize moderate-to-high datasets to capture precise patterns of traffic flows, even with data before and after the points of missing data. Since offline imputation is concentrated on imputation accuracy, the complexity of models is much higher than online imputation.
2. Online imputation with light-to-moderate datasets. With the restriction of online calculating, the small-size input with low-dimension data is more appropriate for online imputation. With the increasing applications of real-time traffic control, traffic routing, and traffic management, the requirements of online imputation are adjusted with consideration of the balance between the calculation speed and the imputation accuracy. Some old and simple imputation methods may help a lot in this situation, with accelerations to the calculation speed.

Limitations
Based on the aforementioned characteristics of the existing imputation methods, a series of common limitations need to be resolved, which can be summarized as follows:

1. Compared with the tensor-based method [23], CM [27], and KM [28], vector and matrix methods share partial spatiotemporal information so that the results seem to be good when the main features are achieved by chance. However, without sufficient information incorporated, most cases will fail, especially in extreme conditions, since daily periodicity similarity, local interval fluctuation, variation from day to day, and spatial influence between upstream and downstream should be considered.
2. When data are integrated from multiple shorter-term observations, the main reason for missing data is the error at the stage of processing data. Additionally, the stage of data aggregation can lead to data loss. Moreover, the repeated and error data are considered as missing data to be imputed.
3. Most methods focus on spatiotemporal feature learning, but they cannot perform very well in urban traffic. This is because being divided by signal phases and the length of the phase time of traffic lights, the turning movements can influence the variation in traffic flows both in temporal and spatial aspects [8].
4. Since there is no mechanism to resolve the parameters of model structures, the parameters of model structures should be determined before the operation of imputation methods, and the chosen parameters will impact the result greatly [23,28,75].
5. Probe data can be a new source that can enrich the data collected by loops and point detectors [28,84]. If the data upstream can be obtained in advance, some proper methods can be used to make the prediction or reconstruction of downstream data more accurate.
6. It is worth studying the quantity of sparsity detector and location optimization with a consideration of missing data imputation under different missing rates and the relationship of their threshold conditions with the number of partially missing detectors, the number of invalid detectors, and missing rates [96].
7. Although the imputation methods are various, it is vital to investigate the performance of each method under different traffic conditions and their application and invalid conditions.

Public Datasets
Benchmark datasets still need to be improved, which is a critical problem in temporal data imputation. Without benchmark datasets, findings and methods cannot be compared well, and the proposed methods may only perform well on some special datasets. The commonly used datasets are summarized in Table 1. The PeMS dataset, built by the California Department of Transportation, is the most widely used one. PeMS data have been collected since 2001. The spatial range of the PeMS data covers the freeways across all major cities of California. The data are collected from nearly 40,000 individual detectors. The data are collected every 30 s and aggregated to a resolution of 5 min. The PeMS dataset contains detector data, traffic counts, vehicle classification, incidents, lane closures, etc. Although the PeMS dataset has sufficient data, its environment is freeways, which means that it cannot support data imputation studies of the urban traffic system. Microwave sensors are also an important source of data and have been distributed all over the world. The largest amount of Chinese data has been collected by the government and research institutions.

Missing Data Imputing Methods with Mathematical Formulation
The preceding literature review traced the development of distinct missing data imputing methods. In this section, we give a brief review of their formulations.

PPCA-Based Missing Data Imputing
The PPCA model assumes that every sample $y_i$ depends on a $q$-dimensional latent variable $x_i$ as follows [7]:
$$y_i = W x_i + \mu + \varepsilon_i,$$
where $q \ll d$ is chosen to retrieve the common hidden features of traffic flow data. $\mu$ is a $d$-dimensional column vector that characterizes the sample average of $y_i$. Here, the subscript $i$ denotes the index of the observation/latent variable. The PPCA model assumes that the latent variable $x_i$ follows a $q$-dimensional multivariate Gaussian distribution, $x_i \sim \mathcal{N}_q(0, I)$. The $d$-dimensional column vector $\varepsilon_i$ is introduced as isotropic noise satisfying $\varepsilon_i \sim \mathcal{N}_d(0, \sigma^2 I)$, where $\sigma^2$ is the scaling factor. This relaxes the strict assumption on daily flow similarity and makes the model more flexible. The projection matrix $W \in \mathbb{R}^{d \times q}$ represents a mapping between the latent variable space and the observed variable space followed by all the observed-latent variable pairs $(y_i, x_i)$. When some elements of $y_1, \ldots, y_n$ are missing, we search for the $\mu$, $W$, and $\sigma^2$ that produce the maximum likelihood in agreement with the known data,
$$\arg\max_{\mu, W, \sigma^2} \sum_{i=1}^{n} \ln p(y_i \mid \mu, W, \sigma^2),$$
where the conditional probability density function is
$$p(y_i \mid x_i) = (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{\lVert y_i - W x_i - \mu \rVert^2}{2\sigma^2}\right).$$
Meanwhile, we impute $y_i^{\mathrm{miss}}$ to best fit the above distribution assumptions and thus the estimated maximum likelihood. Moreover, if some data are missing, $\mu$ is calculated by taking the average of the available data, and $\lVert \cdot \rVert$ denotes the Euclidean norm.
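Under the Gaussian assumptions stated above, integrating out the latent variable gives a closed-form marginal for each sample, and missing entries can then be filled by Gaussian conditioning. The following is the standard result for this model class, shown as a worked step rather than as the exact procedure of [7]. The subscripts $o$ and $m$, which select the observed and missing entries of a sample, and the symbol $C$ are notation introduced here for illustration.

```latex
\begin{aligned}
y_i &\sim \mathcal{N}_d(\mu,\, C), \qquad C = W W^{\top} + \sigma^2 I, \\
\hat{y}_i^{\mathrm{miss}}
  &= \mathbb{E}\!\left[\, y_i^{\mathrm{miss}} \mid y_i^{\mathrm{obs}} \right]
   = \mu_m + C_{mo}\, C_{oo}^{-1} \left( y_i^{\mathrm{obs}} - \mu_o \right).
\end{aligned}
```

Here $C_{oo}$ is the block of $C$ for the observed entries and $C_{mo}$ is the cross block between missing and observed entries, so the imputed values borrow strength from the observed part of the same sample through the learned covariance.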

GMM-Based Missing Data Imputing
The Gaussian mixture model (GMM) is commonly used for clustering. Each GMM consists of several Gaussian distributions, each of which is called a component, representing a different cluster. All the components are linearly added together to form a probability density function (PDF) of GMM.
Let the observed data be $X = \{x_1, \ldots, x_n\}$, where each $x_i$ is a $p$-dimensional vector. We assume that $X$ is generated by a GMM with $K$ components. The function $f_k(x_i)$ is the probability density function of the $k$-th component and expresses the probability that $x_i$ is generated by the $k$-th component. Hence, the PDF of the GMM is
$$P(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the weight of the $k$-th component in the GMM, $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of the $k$-th component, respectively, and $P(x_i)$ is the probability that $x_i$ is generated by the GMM. The PDF of the $k$-th component is expressed as
$$f_k(x_i \mid \mu_k, \Sigma_k) = (2\pi)^{-p/2} \lvert \Sigma_k \rvert^{-1/2} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) \right\}$$
The complete data are $Y = (X, Z) = \{(x_1, z_1), \ldots, (x_n, z_n)\}$, where $Z$ is the implicit (latent) category of the data. The vector $z_i = (z_{i1}, \ldots, z_{iK})$ represents the cluster of $x_i$, with $z_{ik} = 1$ if $x_i$ belongs to the $k$-th component and $z_{ik} = 0$ otherwise. Assuming that there are $K$ clusters, that the weight of the $k$-th component is $\pi_k$, and that $\theta_k = (\mu_k, \Sigma_k)$ is the corresponding parameter, the joint density of $x_i$ and $z_i$ is
$$p(x_i, z_i \mid \theta) = \prod_{k=1}^{K} \left[ \pi_k f_k(x_i \mid \theta_k) \right]^{z_{ik}}$$
The logarithmic likelihood function of the complete data $Y$ is
$$\ln L(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[ \ln \pi_k + \ln f_k(x_i \mid \theta_k) \right]$$
Because $Z$ and part of $X$ are not observed, the expectation-maximization (EM) algorithm is usually used to solve the above problem.
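The EM fitting of a GMM can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not from the text: the data are one-dimensional (so each covariance matrix reduces to a scalar variance), all values are observed, and the toy speeds, initial parameters, and function names are hypothetical. It shows the mixture density and one EM iteration, not a full imputation procedure.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Mixture density: sum over components of weight * component density."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def em_step(data, weights, means, variances):
    """One EM iteration: E-step responsibilities, then M-step parameter updates."""
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])  # posterior component probabilities
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)  # effective number of points in component k
        mk = sum(r[k] * x for r, x in zip(resp, data)) / nk
        vk = sum(r[k] * (x - mk) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_m.append(mk)
        new_v.append(max(vk, 1e-6))  # floor keeps the variance strictly positive
    return new_w, new_m, new_v

# Hypothetical toy speeds (km/h) forming a congested and a free-flow cluster
speeds = [20.0, 22.0, 21.0, 60.0, 62.0, 61.0]
weights, means, variances = [0.5, 0.5], [25.0, 55.0], [25.0, 25.0]
for _ in range(20):
    weights, means, variances = em_step(speeds, weights, means, variances)
```

After fitting, a missing value could, for example, be replaced by the mean of the component it most likely belongs to; how that component is chosen from partially observed data depends on the specific method.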

KNN-Based Missing Data Imputing
With the KNN method, a categorical missing value is imputed with the majority class among its $k$ nearest neighbors, and a numerical missing value is imputed with the average (mean) of its $k$ nearest neighbors. It is formally defined as follows.
Given $(X, U, 0)$ and the set of its $k$ nearest neighbors $D_k = \{(X_j, Y_j, 1) \mid j = 1, 2, \ldots, k\}$, the KNN estimator is defined as
$$\hat{Y} = \begin{cases} \dfrac{1}{k} \sum_{j=1}^{k} Y_j, & \text{if } Y \text{ is numerical} \\ \arg\max_{v} \sum_{j=1}^{k} \mathbb{1}(Y_j = v), & \text{if } Y \text{ is categorical} \end{cases}$$
where $v$ is a value in the domain of the target feature $Y$ and $\mathbb{1}(Y_j = v)$ is an indicator function that returns 1 if its argument is true and 0 otherwise. Therefore, KNN imputation is a model-free method. The nearest neighbors are the instances most similar to the incomplete instance, where similarity is determined from the differences between instances. The most commonly used measure is the Minkowski distance (or its variants):
$$d(X_a, X_b) = \left( \sum_{l=1}^{m} \lvert x_{al} - x_{bl} \rvert^{q} \right)^{1/q}$$
where $x_{al}$ and $x_{bl}$ are the $l$-th of the $m$ features of instances $X_a$ and $X_b$, and $q$ is a positive integer called the Minkowski coefficient.
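The KNN estimator and the Minkowski distance described above can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in this paper: the function names, the choice of features (flows at two neighboring intervals), and the toy history are all hypothetical.

```python
from collections import Counter

def minkowski(a, b, q=2):
    """Minkowski distance of order q between two equal-length feature vectors."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

def knn_impute(query, complete, k=3, q=2, categorical=False):
    """Impute the missing target of `query` from its k nearest complete instances.

    complete: list of (features, target) pairs whose targets are observed.
    """
    neighbors = sorted(complete, key=lambda row: minkowski(query, row[0], q))[:k]
    targets = [target for _, target in neighbors]
    if categorical:
        return Counter(targets).most_common(1)[0][0]  # majority vote
    return sum(targets) / len(targets)  # mean of the neighbors' targets

# Hypothetical history: features are flows at two neighboring intervals,
# the target is the flow at the interval to be imputed.
history = [
    ((100.0, 110.0), 120.0),
    ((102.0, 108.0), 118.0),
    ((300.0, 310.0), 320.0),
    ((98.0, 112.0), 122.0),
]
estimate = knn_impute((101.0, 109.0), history, k=3)
```

Because no parameters are fitted, all of the cost lies in the neighbor search; the result depends mainly on the choice of $k$, the distance order $q$, and how the features are scaled.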

Copula-Based Missing Data Imputing
The copula method states that a joint distribution can be decomposed into several one-dimensional marginal distributions and a copula function. With marginal cumulative distribution functions $F_p(z_p)$, the joint distribution can be written as
$$F(z_1, \ldots, z_p) = C_\theta\left( F_1(z_1), \ldots, F_p(z_p) \right)$$
where $C_\theta$ is the cdf of a $p$-dimensional random vector and is also known as the copula function, and $\theta$ is a parameter vector of the copula, commonly referred to as the dependence parameter vector. The copula function can be obtained via an inversion method:
$$C_\theta(u_1, \ldots, u_p) = F_c\left( f_1^{-1}(u_1), \ldots, f_p^{-1}(u_p) \right)$$
where $u_i \in [0, 1]$ is the value of the $i$-th marginal cumulative distribution function, $F_c(\cdot)$ is the joint distribution, and $f_i^{-1}(\cdot)$ is the inverse of the $i$-th marginal distribution. For copula-based missing data imputing, we use copula theory to detect the spatial correlation based on distance, verify the existence of spatial autocorrelation, and find the optimal correlation function. In addition, the marginal distributions are fitted to obtain an optimal distribution, and the parameters of the above procedure are estimated. Then, the copula-based interpolation is carried out through numerical integration of the density function to complete the missing data imputation.

Tensor-Based Missing Data Imputing
Tensor representation is one of the most practical ways to describe a multidimensional object whose entries are indexed by several variables. Tensors are often used for extracting hidden structures and capturing underlying correlations between modes in data from a multimode system. In many applications, the missing data problem can be formulated as a tensor completion problem. While tensors are naturally high-dimensional, the tensor of interest is often low-rank, or approximately so. Hence, a low-rank approximation can be used for missing data estimation or tensor completion. More details can be found in the literature [75].

ARIMA-Based Missing Data Imputing
The ARIMA model is a time series analysis and prediction model and is represented by the following finite difference equation:
$$\left( 1 - \sum_{i=1}^{p} \phi_i B^{i} \right) (1 - B)^{d} X_t = \left( 1 + \sum_{j=1}^{q} \theta_j B^{j} \right) \varepsilon_t$$
where $B$ is the lag operator with $B X_t = X_{t-1}$, $d$ is the order of differencing, $\phi_i$ $(i = 1, 2, \ldots, p)$ are autoregressive parameters, $\theta_j$ $(j = 1, 2, \ldots, q)$ are moving average parameters, and $\varepsilon_t \sim N(0, \sigma^2)$ is the error term, which follows a normal distribution. The level and variance are parameters analogous to those of cross-sectional statistical methods, whereas the slope and autocorrelation parameters are unique to longitudinal designs. An appropriate order $(p, d, q)$ is chosen for constructing the model and forecasting the time series.
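As a small illustration of how an autoregressive model can fill a gap, the sketch below uses the simplest special case, ARIMA(1, 0, 0), written as $X_t = c + \phi X_{t-1} + \varepsilon_t$. Several things here are assumptions made for the example and not taken from the text: the intercept $c$, the least-squares fit (in place of maximum likelihood), the function names, and the toy series. A missing value is filled with the one-step forecast from its predecessor.

```python
def fit_ar1(series):
    """Least-squares fit of X_t = c + phi * X_{t-1} on consecutive observed pairs."""
    pairs = [(a, b) for a, b in zip(series, series[1:])
             if a is not None and b is not None]
    n = len(pairs)
    mean_prev = sum(a for a, _ in pairs) / n
    mean_next = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_prev) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_prev) * (b - mean_next) for a, b in pairs)
    phi = sxy / sxx
    c = mean_next - phi * mean_prev
    return c, phi

def impute_ar1(series):
    """Fill each None with the one-step forecast from the preceding value."""
    c, phi = fit_ar1(series)
    filled = list(series)
    for t in range(1, len(filled)):
        if filled[t] is None and filled[t - 1] is not None:
            filled[t] = c + phi * filled[t - 1]  # also chains through runs of gaps
    return filled  # a gap at t = 0 has no predecessor and stays None

# Hypothetical noise-free series generated by X_t = 10 + 0.5 * X_{t-1}
series = [100.0, 60.0, 40.0, None, 25.0, 22.5, 21.25]
filled = impute_ar1(series)
```

A full ARIMA imputation would also handle differencing and moving-average terms, and it could use observations on both sides of the gap; this forward-only sketch ignores the values that follow a missing point.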

Random Forest-Based Missing Data Imputing
Random forest (RF) is an ensemble of decision trees that can perform both regression and classification. RF uses bootstrap aggregation (bagging) to combine multiple random predictors, which allows for high model complexity without overgeneralizing or overfitting the training data. In this approach, $B$ separate training sets are bootstrapped from the original training set, and the predictive functions fitted on them are averaged:
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
where the asterisk marks a function fitted on a bootstrapped set. Random forests are an improvement over bagged trees.

LSTM-Based Missing Data Imputing
To facilitate an understanding of the method, we briefly introduce the mechanism of LSTM. Compared with a plain RNN, which simply records recurrent states, the advantage of LSTM is that it uses gates, in particular forget gates, to select valuable short-term and long-term memory from the data and to alleviate the problem of vanishing and exploding gradients. In the LSTM update equations, Equations (17)-(19) refer to the input gate ($i_t$), forget gate ($\psi_t$), and output gate ($o_t$), respectively, in which $W$, $U$, and $V$ are trainable weight matrices governing the connections from the corresponding inputs to the hidden layer, $b$ denotes the bias terms, $x_t$ is the input at interval $t$, $h_{t-1}$ is the output of the hidden layer at interval $t - 1$, and $C_{t-1}$ is the cell information at interval $t - 1$; $\tilde{C}_t$ and $C_t$ are the candidate and new cell information at interval $t$, respectively.
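For orientation, a commonly used LSTM formulation with peephole connections that is consistent with the symbols defined above can be written as follows. This is a hedged sketch of the standard form, not a verified copy of the paper's numbered equations: the gate subscripts on $W$, $U$, $V$, and $b$, the use of $C_t$ (in place of $C_{t-1}$) in the output gate, and the symbols $\sigma$ (logistic sigmoid) and $\odot$ (element-wise product) are assumptions.

```latex
% Gates (input, forget, output)
i_t    = \sigma\left( W_i x_t + U_i h_{t-1} + V_i C_{t-1} + b_i \right)
\psi_t = \sigma\left( W_\psi x_t + U_\psi h_{t-1} + V_\psi C_{t-1} + b_\psi \right)
o_t    = \sigma\left( W_o x_t + U_o h_{t-1} + V_o C_t + b_o \right)

% Candidate and new cell information
\tilde{C}_t = \tanh\left( W_c x_t + U_c h_{t-1} + b_c \right)
C_t         = \psi_t \odot C_{t-1} + i_t \odot \tilde{C}_t

% Hidden-layer output
h_t = o_t \odot \tanh\left( C_t \right)
```

In this form, the forget gate $\psi_t$ scales how much of $C_{t-1}$ is carried forward, which is the mechanism by which long-term memory is retained or discarded.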
LSTM-M is used for managing missing data from the two categories, in which a long-period and a short-period mechanism are designed for modeling missing data in the input variables, and hidden states are employed to capture the properties mentioned above. Weights $r_t$ are introduced to control the impact of a unique meaning and time stamp, and they vary between 0 and 1 according to the time interval relative to the previous variables. Thus, the weights should represent the patterns and be conducive to the inference tasks:
$$r_t = \exp\{ -\max(0, W_r l_t + b_r) \} \qquad (22)$$
where $W_r$ and $b_r$ are parameters to be learned jointly with those in the LSTM network. The LSTM-M time series model for missing data incorporates two temporal prediction scales to obtain the missing data directly from the input values.
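The behavior of the weight in Equation (22) can be checked numerically with a scalar sketch. Here $l_t$ is assumed to be the time interval since the previous observed value, and the parameter values are hypothetical; in LSTM-M, $W_r$ and $b_r$ are learned, not set by hand.

```python
import math

def decay_weight(l_t, w_r, b_r):
    """Scalar form of Equation (22): r_t = exp(-max(0, w_r * l_t + b_r))."""
    return math.exp(-max(0.0, w_r * l_t + b_r))

# Hypothetical parameters: the weight shrinks as the gap l_t grows
w_r, b_r = 0.5, 0.0
weights_by_gap = [decay_weight(gap, w_r, b_r) for gap in (0, 1, 2, 4)]
```

The `max(0, ...)` term keeps the exponent non-positive, so the weight stays in the interval (0, 1] and equals 1 whenever $W_r l_t + b_r \le 0$.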

Test Data
This paper uses PeMS to conduct the test. The PeMS data were collected from highways in California. We select 12 adjacent loop detectors located on southbound Highway 99. Figure 2 summarizes the 12 selected detectors of the PeMS dataset. Data from 1 January 2018 to 31 December 2019 are used for the test. Detector IDs, detector locations, and directions are shown in Table 2.

Test and Results
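Normalized mean absolute error (NMAE) and normalized root mean square error (NRMSE) are used to evaluate the performance of the imputation methods. They are calculated as follows:
$$\mathrm{NMAE} = \frac{1}{\varphi} \cdot \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert$$
$$\mathrm{NRMSE} = \frac{1}{\varphi} \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 }$$
where $y_i$ is the ground truth, $\hat{y}_i$ is the imputed value, $n$ is the number of imputed values, and $\varphi$ is a normalization parameter. All of the methods mentioned above are then tested with integrated missing patterns. The results are shown in Table 3; PPCA has the best performance.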
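The two error measures can be sketched as follows, assuming the common form in which the mean absolute error and the root mean square error are each divided by the normalization parameter. The choice of that parameter as the range of the ground truth, as well as the function names and toy values, are assumptions for this example; the mean of the ground truth is another frequent choice.

```python
import math

def nmae(truth, imputed, phi):
    """Mean absolute error divided by the normalization parameter phi."""
    return sum(abs(p - t) for t, p in zip(truth, imputed)) / (len(truth) * phi)

def nrmse(truth, imputed, phi):
    """Root mean square error divided by the normalization parameter phi."""
    mse = sum((p - t) ** 2 for t, p in zip(truth, imputed)) / len(truth)
    return math.sqrt(mse) / phi

# Hypothetical ground-truth flows and their imputed values at the masked positions
truth = [100.0, 200.0, 300.0]
imputed = [110.0, 190.0, 300.0]
phi = max(truth) - min(truth)  # assumed normalization: range of the ground truth
flow_nmae = nmae(truth, imputed, phi)
flow_nrmse = nrmse(truth, imputed, phi)
```

Both measures are evaluated only at positions that were artificially masked, since the ground truth is unknown for values that are actually missing.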
For a closer comparison, five representative methods are selected for further tests: PPCA, KNN, LSTM, LSTM-M, and ARIMA. They are tested under three missing patterns (MCAR, MAR, and MNAR) and different missing ratios. The results are shown in Figures 3-8.
For the traffic flow data depicted in Figures 3, 5, and 7, PPCA is clearly superior to the other methods. In the MAR scenario, KNN also shows a good imputation advantage. Across missing ratios, the performance of ARIMA is relatively flat, whereas the errors of the other four temporal imputation methods change continuously with the missing ratio. For the traffic speed data depicted in Figures 4, 6, and 8, PPCA significantly outperforms the other methods, and ARIMA also performs remarkably well. Different temporal imputation methods thus perform differently on different data. Compared with the other three methods, the fluctuations of PPCA and ARIMA are relatively small. For traffic speed in the MAR scenario, KNN performs significantly better than all other methods except PPCA. Although the improved LSTM-M has lower errors than LSTM on both the flow and speed datasets, both are worse than the other methods.
A similar conclusion can be drawn from Figures 9-12, which show the errors of the different methods under the three missing patterns. Figures 9 and 10 show boxplots of the average flow imputation results for different missing ratios. Figures 11 and 12 show boxplots of the average speed imputation results for different missing ratios. These figures clearly display the different missing patterns, and the imputation performance differs from method to method. In the temporal data imputation of traffic flow, PPCA is far superior to the other methods, followed by KNN. Traditional LSTM and the improved LSTM-M are not effective.
In the temporal data imputation of traffic speed, PPCA is again far superior to the other methods. Surprisingly, ARIMA performs far better than KNN, LSTM, and LSTM-M.

Discussion
In actual traffic conditions, the three missing patterns are often randomly mixed, so we also evaluate the methods under randomly mixed missing patterns. Figure 13 plots the ground truth (abscissa) against the imputed value (ordinate) for flow data with missing ratios of 20%, 50%, and 80%; the closer a point lies to the red line, the more accurate the imputed value. PPCA performs the best among the five temporal imputation methods. As the missing ratio increases, all models gradually degrade, with points moving away from the red line.
As shown in Figure 14 for speed data with missing ratios of 20%, 50%, and 80%, the points of PPCA again lie closest to the middle red line among the five temporal imputation models. As the missing ratio increases, the temporal imputation methods degrade to different degrees.
Mathematics 2022, 10, x FOR PEER REVIEW

Conclusions
Missing values in traffic time series data are a common problem in intelligent transportation systems. This paper reviews the development of temporal imputation. It summarizes the strategy of temporal imputation, covering all stages of the process, from creating datasets with artificial blanks to evaluating the obtained results. We have witnessed major developments in transportation imputation research. Five representative and widely used methods are compared: PPCA, KNN, LSTM, LSTM-M, and ARIMA. These models capture important temporal information in the datasets under different missing patterns to estimate missing values. The methods are affected to different degrees by the missing pattern, and they provide reliable results at different missing rates. The complex upstream-downstream correlations in urban road networks also need more explanation. We will discuss the summary of spatial imputation and spatial-temporal imputation models in future work.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: http://pems.dot.ca.gov/.