Traffic Estimation for Large Urban Road Network with High Missing Data Ratio

Intelligent transportation systems require the knowledge of current and forecasted traffic states for effective control of road networks. The actual traffic state has to be estimated as the existing sensors does not capture the needed state. Sensor measurements often contain missing or incomplete data as a result of communication issues, faulty sensors or cost leading to incomplete monitoring of the entire road network. This missing data poses challenges to traffic estimation approaches. In this work, a robust spatio-temporal traffic imputation approach capable of withstanding high missing data rate is presented. A particle based approach with Kriging interpolation is proposed. The performance of the particle based Kriging interpolation for different missing data ratios was investigated for a large road network comprising 1000 segments. Results indicate that the effect of missing data in a large road network can be mitigated by the Kriging interpolation within the particle filter framework.


Introduction
Intelligent Transportation Systems (ITS) require accurate knowledge of traffic states for effective traffic monitoring and control. Traffic states are usually estimated from noisy sensor measurements [1] using various approaches, which can typically be subdivided into model based approaches, data driven approaches or streaming-data-driven [2]. An overview of the different modelling methodologies is given in Refs. [3,4]. These modelling methodologies include microscopic, macroscopic and mesoscopic approaches. Microscopic traffic models [3][4][5][6] describe the motion of each individual vehicle with a high level of detail. Hao et al. [7,8] recently proposed a model predictive control (MPC) termed urban cell transmission model (UCTM) for optimal switching of traffic lights at intersections based on the predicted traffic.
In macroscopic models [9,10], traffic state is represented by aggregating behaviour of the traffic, usually in terms of the average speed and the average density over a given period of time. Mesoscopic models [3] employ some features of microscopic and macroscopic approaches by utilising varying levels/degrees of detail to model traffic behaviour. This is achieved by modelling some locations with aggregated measurements as in macroscopic and the remaining locations are modelled down to the details of individual vehicles as is done in the case of microscopic.
Macroscopic models are sufficient to produce acceptable estimation accuracy when compared to the computational overhead of the microscopic models. Hence, they are the preferred choice for most practical purposes such as traffic control/management, road pricing and changes in infrastructure. based matrix decomposition is used to select most influential segments in large road network and then imputing any missing measurements in the selected segments using Kriging.
The rest of the paper is organised as follows. Section 2 discusses some related work followed by the presentation of traffic flow and measurement model used in this work in Section 3. A background theory of random sets, covariance and variograms, which is the building block of Kriging is presented in Section 4. Recursive Bayesian estimation and particle filters (PF) are presented in Section 5. The proposed method was discussed in Section 5.3. Results and discussion are presented in Section 6.2 with conclusions being drawn in Section 7.

Related Work
A review of three different missing data imputation methods was presented by Ref. [27]. These include interpolation, prediction and statistical learning. The interpolation method imputes missing measurement at the particular location by averaging all historical measurements at that location at similar times of day. Prediction methods use a deterministic mathematical description to model the relationship between historical and future data. The statistical methods on the other hand treats the traffic as a random variable and tries to capture stochastic nature of the traffic pattern into the imputation algorithm.
In Refs. [28][29][30], a multi-resolution approximation using linear combinations of basis functions was proposed to address computational complexity of large datasets. The novelty in Ref. [28] lies in the use of multiple basis functions computed at lower resolutions closer to observation locations and then combining them to capture the different covariance functions with varying properties. The solution is achieved by dividing the spatial domain recursively into small regions and smaller sub-regions until the fine-scale dependencies are captured.
Nychka et al. [28] used radial basis functions (RBF) and special type of Gaussian Markov random field (GMRF) called spatial autoregressive (SAR) model to model the spatial correlation among the coefficients while Katzfuss et al. [30] automatically determines the appropriate basis/covariance function. Whereas in Ref. [29], it was assumed that the sub-domains are independent, Ref. [30] assumed depended sub-domains and performed full-scale approximation. As the computations are done locally in parallel, it is possible to fuse multisensor data sources, in which case, the different covariance functions are used for each data source. While both mentioned that the approach could be extended to non-stationary functions, it was not implemented nor was there a derivation for such. The different methods on missing traffic data imputation considered small road network in the range of a few kilometres.

Model Formulation
Consider a general discrete time state space system of a form, where f k is a nonlinear function of the target state vector x k , h k represents a nonlinear relationship between sensor output z k and target state vector x k affected by a measurement noise w k . In addition, p(x k+1 |x k ) is the probability density function of the new state x k+1 given the previous state x k , and p(z k |x k ) is the likelihood function of the measurement z k given the state x k .

Stochastic Compositional Traffic Flow Model
In this work, a stochastic compositional model (SCM) [31] is considered for modelling of a motorway/freeway vehicle traffic evolution expressed as p(x k+1 |x k ) in Equation (1). This extended cell-transition model incorporates traffic speed and uses forward and backward waves to describe the complex relationship between traffic behaviour especially for a large road network. SCM utilises sending and receiving functions which models the stochastic nature of traffic state evolution. The vehicles that are able to leave a cell are represented by receiving functions while those that are allowed to enter a cell are determined by the receiving functions. The receiving functions are usually less than or equal to the sending functions to obey the law of conservation of vehicles.
In the SCM, the road network is divided into a given number of cells, n, also called segments. Each segment has a length L i and number of lanes l i as shown in Figure 1, where i = (1, . . . , n). At any given time period k, a certain number of vehicles Q i,k crosses the boundary between two segments i and i + 1. The number of vehicles in a given cell i within the same time period k is represented by N i,k with their average speed given by v i,k .  The overall state vector at time k is given by T is the local state vector at segment i. Equations (3)-(5) models traffic state evolution within the cells.
x n,k+1 = f n (x n−1,k , x n,k , Q out k , v out k , η n,k ) (5) where Q in k , called inflow represents the number of vehicles entering the first cell with average speed v in k , Q out k , called outflow represents the number of vehicles leaving the last segment with average speed v out k . The inflow and outflow vehicles with their average speeds are called the boundary conditions and are supplied to the model at the beginning.

Measurement Model
Consider the road network shown in Figure 1 divided into different segments with n boundaries. The traffic state at a boundary j ∈ J = 1, 2, . . . , n is sampled at discrete time steps t s , s = 1, 2, . . ., to give z j,s = (Q j,s , v j,s ) T . The measurements at all the boundaries are collected into a matrix given by Z s = (z T 1,s , z T 2,s , . . . , z T n,s ) T . The relationship between the sampling interval ∆t s and the state update time step ∆t k Equation (1) is such that sampling interval is split into q state update time steps. That is, ∆t s = q∆t k . The measurement model and noise is represented by Equation (6) with error With the assumption of Gaussian noise, z j,s can be expressed as:

Random Set
Let z(r) be a non-stationary set of m measurements z(r) = [z(r 1 ), z(r 2 ), . . . , z(r n )] T , in our case average vehicle speeds v(r) or vehicle counts N(r), observed in segments i = 1, . . . , n at locations Note that location of each sensor is uniquely defined by the segment topology ( Figure 1). Then, according to Equation (7), the random set of variables z(r) can be approximated by a Gaussian process which is uniquely defined by the drift, i.e., the mean field µ(r) = E[z(r)], and the corresponding covariance . Individual samples z(r i ) of the set Equation (7) can be also expressed as with drift µ(r i ) and residual δ(r i ). Based on the degree to which the moments, in our case mean and variance of the random set z(r) are dependent (or independent) on a spatial relationship between points r i = (r 1 , r 2 , . . . , r n ), the 1st or the 2nd-order of stationarity can be recognised. While, the 1st-order stationary random set assumes a constant mean E[z(r)] = µ(r) = µ, the 2nd-order stationary random field assumes a linear drift between the increments E[z(r)] = µ(r) = ∑ N n=0 α n f(r). In the rest of this sequel, we assume that the random set z(r) is intrinsic and stationary on the 2nd-order.

Covariance
By assuming that random variables z(r) are 2nd-order stationary and isotropic, the covariance function K(r i , r j ) in Equation (7) reads as follows: The role of covariance function K(r i , r j ) is to model the correlation between measurements z(r i ) and z(r j ) observed at locations r i and r j based on their separation distance called a lag h. Since the correlation between two random variables solely depends on their spatial distance, and not at all on their location, the lag can be conveniently expressed as an Euclidean l 2 norm, defined as h = ||r j − r i || 2 . Above statements can be summarised into the following equalities, i.e., the isotropy assumptions: For a straight stretch of motorway ( Figure 1) this is equivalent to the path length through the road network.
The process of covariance function K(r i , r j ) modelling requires to find a covariance curve that has the best fit to the empirical data Equation (9), possibly being a subject to constraints. The covariance models to choose from include exponential, spherical, Gaussian, linear or power model [14]. In this work, the best fit for the dataset was achieved by the exponential model depicted in Figure 2a and given by where a is the nugget, b is the range and c is still. As can be observed in Figure 2a, observe that the covariance function decreases with distance, so it can be thought of as a similarity function.

Bayesian Estimation
Consider a general discrete time state space system represented as in Equations (1) and (2). The goal of Bayesian estimation is to infer the state variable x k as defined in Section 3.1 with the available sensor measurements z 1:k . By using the Bayesian framework, this estimation problem relates to the recursive evaluation of the probability density function (PDF) p(x k |z 1:k ) in two consecutive steps, the prediction and the measurement update of the state vectors.
p(x k |z 1:k−1 ) The prediction state density p(x k |z 1:k−1 ) of state x k is calculated from the prior PDF p(x k−1 |z 1:k−1 ) by using Chapman-Kolmogorov equation Equation (14) follows the 1st order Markov property which assumes that p(x k |z 1:k−1 ) only depends on state x k and x k−1 at time k and k − 1 respectively. The measurement update p(x k |z 1:k ) is computed from the prior distribution of Equation (14) and measurements z 1:k by a Bayesian formula which results in The 1st order Markov property for Equation (15) implies that p(x k |z 1:k ) only depends on measurement z k at time k.

Particle Filter
Arguably, the most popular algorithm for nonlinear recursive estimation is the particle filter (PF), extensively evaluated in Ref. [32]. PF represents any arbitrary probability density function p(x k |z 1:k ) by samples or particles x l k , where l = 1, . . . , N p is the number of particles, i.e., The particles are used to form an approximative distribution as wherep(x k |z 1:k ) is an approximated distribution, δ(x k − x l k ) is a the Dirac delta function and w l k|k the weights of the particles satisfying ∑ N p l=1 w k|k = 1. The time update of the Bayesian recursion Equation (1) is in case of PF evaluated as The particles x l k−1 in above Equation (18) are sampled from proposal distribution π(x k |x l k−1 ), i.e., x l k−1 ≈ π(x k |x l k−1 ). Proposal distribution is very often defined by the state transition PDF, that is π(x k |x l k−1 ) = p(x k |x i k−1 ). In this case, the weights updates result to The measurement update p(x k |z 1:k ) (2) is computed by a Equation (15), which can be in terms of the particles x l k represented as Similarly, the particle filter weights are updated as The denominator in Equations (15) and (21) is only a normalising factor independent of x k thus can be safety omitted if the distribution is numerically normed as shown by Equations (20) and (21). w l k|k ∝w l k|k−1 p(z k |x l k ), ≈w l k|k−1 p(z k |x l k ).
The MC recursion tempt to degrade over time as all relative weights would tend to zero except for one that tends to one. Therefore, when particle depletion ratio reaches 0.5 a Sampling Importance Resampling (SIR) or Sampling Importance Sampling (SIS) techniques are applied in the recursion.

Missing Measurement Interpolation and Improved Likelihood Computation
For a large road network with many segments, using all the measurements in the particle filter measurement update step becomes computationally intensive. Column based matrix decomposition approach similar (as earlier stated in Section 1) to the work of [26] is employed to select the most probable segments that would give acceptable accuracy.
The idea is to select m most influential segments from all available segments n using CBMD and then estimating missing measurements (if any) of the most influential segments using Kriging for improved particle likelihood computation. Let Z n ∈ R k×n represent a set of all segments or measurement locations in a given time period, where the rows, k is number of time instances at which the measurements were taken and columns n number of road segments.
The goal of CBMD is to approximate the measuremtns Z n with a subset of measurements Z m ∈ R k×m where m < n is a subset of the measurements using singular value decomposition (SVD) as Ref. [33]: where Φ ∈ R n×m is the transformation matrix that expresses every column of all measurement Z n in terms of the basis in Z m . Having computed the SVD, the right singular matrix is used to assign a probability P z i to each selected location according to: where v i,j is the ith element of the jth right singular vector and r is the rank of the matrix. From the probabilities computed, m locations with the highest probabilities are chosen as the reduced measurement to approximate the entire network which is used in computing the particle filter likelihood during the measurement update step. The likelihood function term p(z k |x k ) in Equation (21), is computed when a new measurement arrives. The performance of the PF degrades substantially when there is a missing measurement. For the multivariate Gaussian distribution, the PDF is given by: where R is the covariance matrix of the measurement data, |R| ≡ det(R) is the determinant of R and ν is the difference between the PF predicted value (z s ) and measurement (z s ), given by: The measurement matrix z s can be expressed as, whereẑ krig s represents the value estimated by Kriging (when measurement is not available). Note that the sampling time index t s is split into q update time indices t k as mentioned in Section 3.2. The measurement update state of the PF is performed only when t k ≡ t s . The modified particle method is presented in Algorithm 1. [24] 1.

Algorithm 1 Particle Filter for Traffic State Estimation with Kriging Estimated Measurements
Road network approximation Use compressed sensing to select m most significant locations out of the n segments to be used for the measurement update step as defined in Section 5.3.

2.
Initialisation At k = 0; define all boundary conditions: number of samples, weight of samples as below, For l = 1, . . . N p , N p number of particles; End for 3.
Update the predicted states (Output): Re-sample the weights (Selection) only when t k = t s

Missing Data Estimation via Kriging Models
The underlying idea of Kriging is to obtain the response z(r u ), interpreted as a random variable positioned at the location r u , by interpolating random variables z(r) = [z(r 1 ), z(r 2 ), . . . , z(r n )] T from Equation (7), i.e., observations z(r i ), i = 1, . . . , n at locations r i . The Kriging predictor incorporates the covariance structure among the observation points z(r i ) into the weights w(r u ) = [w 1 (r u ), w 2 (r u ), . . . , w n (r u )] T for predictingẑ(r u ) as a linear combination: In order to assess the accuracy of the Kriging predictionẑ(r u ) w.r.t the real (true) value z(r u ) an error (r u ) is declared (r u ) = z(r u ) −ẑ(r u ). (29) The following criteria, evaluated in terms of mean E (r u ) and variance Var (r u ) = σ 2 (r u ) of the prediction error (r u ) Equation (29), apply to any type of Kriging interpolation:

•
Lack of bias, implies that E (r u ) = 0 and therefore E ẑ(r u ) = E z(r u ) . Thus, the Kriging interpolation is said to be the globally unbiased Equation (30).
• Minimum variance, implies that the mean of the squared deviations Var[ (r u )] = σ 2 (r u ) must be the minimal Equation (31), Var[z(r u ) −ẑ(r u )] or min Kriging weights w(r u ) are chosen such that the mean squared prediction error σ 2 (r u ) Equation (31), also known as Kriging variance or Kriging error is minimised as min 1 T w=1 σ 2 (r u ) over all z(r u ) Equation (31) subject to the unbiased conditions of Equations (30) and (32) by a Lagrange multiplier 2λ to give, where matrix K(r i , r j ) is the covariance between the individual samples and column vector k(r i , r u ) is a covariance function between samples and the point to interpolate.

Performance Evaluation
A road network with 1000 segments was simulated using SUMO software [34] to validate the proposed method. The segments are spaced 0.5 km apart and measurements (number of vehicles crossing each segment boundary with their average speed) taken every second. Traffic signs were installed at some locations to model the effect of congestion. This is an extension of the previous work [25] where a smaller number segments was considered. The aggregate traffic flow and speed were sampled every 60 s and the results collected over a period of 10,800 s (3 h). Two types of vehicles, bus and passenger car was defined with the parameters as in Table 1. The vehicles were added randomly into the network through the inflow boundary every one second and they travel through the network until they get to the last boundary when they leave the network. As a vehicle crosses each induction loop, it is counted with its speed. The average speed of the vehicles arriving at an induction loop over a period is recorded as the average speed. The entire statistics, flow, occupancy, and speed are collected in an output file for further processing.

Simulation Design
To simulate different scenarios such as congestion and free flow, (i) the number of lanes were decreased from 3 to 2, and (ii) the rate of vehicle injection into the network is varied at different time periods. Figures 3 and 4 show the spatio-temporal evolution of the traffic and their corresponding average speed, respectively, for a 100 segment section. The average speed of the vehicles varies around 100 km/h when the flow was around 2000 veh/h. Between time interval [1.5 h, 1.7 h], the flow was increased slightly to cause congestion, this resulted to a decrease in the average speed as can be seen in the first spike from Figure 4. Observe that the effect was felt more close to the inflow boundary. The vehicles' speed increases marginally as they move into the network. Between time interval [1.6 h, 1.9 h] the flow was decreased leading to increase in the average speed. Finally, the number of lanes in segments 10 to 14 were reduced from 3 to 2 between time interval [2.4 h, 2.4 h] while maintaining vehicle injection rate. This results to substantive decrease in speed as can be seen (cone-shaped) in Figure 4. As the vehicles leave the segments with closed lanes, there is increase in their speed again.

Results and Discussion
In order to test the prediction accuracy for different levels of sparsity, a statistical measure namely the root mean squared error (RMSE), was computed. Note, z i is the ground truth or actual measurement,ẑ i is the estimated value and m r is number of independent Monte Carlo runs. The measurements at some boundaries were randomly removed each time and then estimated using the proposed method. Different missing data rates (from 10, 20, . . . , 70%) were investigated by randomly removing measurements at some locations using leave one out cross validation. This was repeated for 100 Monte Carlo runs and the average value used. Figures 5 and 6 show the average estimation error for the different missing data ratios and the number of segments. It would be observed that the prediction error increases with the missing data rate. This is expected as less data is available for the computation. This effect could be reduced further by incorporating a mechanism known as multi resolution approach [30].  The plots also show that the higher the number of segments used, the better the accuracy. This could be attributed to the fact that there is better information exchange within the network and hence, on average, more segments with available data are used for the higher dimensional segments scenario. This is in agreement with [35] where estimation accuracy in the presence of sparse sensor data was improved by exchanging particle weights between segments. For instance, a 10 segment network with 70% missing data ratio will result to using only 3 data points to estimate the remaining 7 missing locations. The chances of these 3 locations correlating with the other 7 is lower compared to when 30 out of 100 data points are available. Figures 7 and 8 show the spatio-temporal evolution of the traffic flow and the corresponding average speed for the 100 segment scenario.  The number of vehicles crossing each boundary in space and their associated average speed is represented by the colour bar.Compared to the ground truth shown in Figures 3 and 4, it is evident that estimated number of vehicles crossing segment boundaries and the associated speeds have been estimated with a good accuracy. Observe also that the estimated flow and speed captured the periods where there is drop in number of vehicles and decrease in speed.

Conclusions
This paper presented a traffic estimation for a large road network with different missing data ratios. The computational overhead of the large network was addressed by using a method called reduced measurement space proposed by [26] to select the most influential and information rich segments in the road network. These are subsequently used in the particle filter measurement update step. Missing data in the selected segments are imputed using Kriring. A 1000-segment road network was simulated using SUMO. Different missing data ratios ranging from 10% to 70% were tested for different sizes of road network ranging from 100 to 1000 segments.
The results indicate that considering a larger number of segments would reduce the overall estimation error even when the missing data ration is high. From the foregoing results and discussion, it is recommended that the best estimation accuracy would be obtained when the entire road network is considered at once. The effects of computational overhead could further be reduced by using a distributed approach with a central control unit.