Representation of Traffic Congestion Data for Urban Road Traffic Networks Based on Pooling Operations

Abstract: In order to improve the efficiency of transportation networks, it is critical to forecast traffic congestion. Large-scale traffic congestion data have become available and accessible, yet they need to be properly represented in order to avoid overfitting, reduce the requirements of computational resources, and be utilized effectively by various methodologies and models. Inspired by pooling operations in deep learning, we propose a representation framework for traffic congestion data in urban road traffic networks. This framework consists of grid-based partition of urban road traffic networks and a pooling operation to reduce multiple values into an aggregated one. We also propose using a pooling operation to calculate the maximum value in each grid (MAV). Raw snapshots of traffic congestion maps are transformed and represented as a series of matrices which are used as inputs to a spatiotemporal congestion prediction network (STCN) to evaluate the effectiveness of representation when predicting traffic congestion. STCN combines convolutional neural networks (CNNs) and long short-term memory neural networks (LSTMs) for their spatiotemporal capability. CNNs can extract spatial features and dependencies of traffic congestion between roads, and LSTMs can learn their temporal evolution patterns and correlations. An empirical experiment on an urban road traffic network shows that when incorporated into our proposed representation framework, MAV outperforms other pooling operations in the effectiveness of the representation of traffic congestion data for traffic congestion prediction, and that the framework is cost-efficient in terms of computational resources.


Introduction
Cars have become the preferred means of transportation for more and more people due to rapid urbanization and rising living standards. The huge number of cars challenges the efficient operation of urban road traffic networks and causes traffic congestion. Road traffic congestion is very serious in many cities around the world, especially in metropolitan cities [1]. There has been a lot of research on the prediction of urban road traffic congestion and traffic management [2-5]. Understanding the congestion patterns of an entire road network rather than a single road or several roads in an area is important. Prediction of traffic congestion helps people [...] and calculated each grid's average speed in the same way as Yu et al. for every 30 min [28]. Duan et al. segmented an urban area of Xi'an into 16 × 16 grids and summed the number of trips from a certain origin to a certain destination when predicting the number of such trips [29]. Zhang et al. divided a metropolitan freeway transportation network in Seattle into grids and calculated the average congestion level for each grid when predicting traffic congestion in that area [30]. However, these operations are often selected without further consideration, and specifically, their impact on the prediction of traffic flow variables has rarely been evaluated.
Although this existing scheme has facilitated the prediction of traffic flow variables such as traffic volume, speed, congestion, and demand, until recently there has been little in-depth research on how to properly represent traffic congestion data for an urban road network in order to retain its spatial structure as much as possible and to predict traffic congestion on its road segments. Compared with the prediction of other traffic flow variables, prediction of traffic congestion in an urban traffic network is much more intuitive for and practically significant to both travelers and traffic management departments. For travelers, traffic congestion prediction helps one to choose better travel routes and reduce pollution associated with emissions from vehicles. For traffic management, it can improve operational efficiency by controlling and coordinating urban road traffic networks. Therefore, we propose a representation framework for traffic congestion data in urban road networks. This framework aims to reduce the size of large-scale traffic congestion data in order to lower the requirements of computing resources for deep learning models. Moreover, it does not damage the performance of models used to predict traffic congestion.
The contributions of the paper can be summarized as follows:
• We develop an effective and cost-efficient representation framework for traffic congestion data of urban road networks. This framework combines grid-based partition of urban road traffic networks and a pooling function to reduce the size of traffic congestion data, while still retaining the spatial structure of road networks on a coarser scale;
• We construct a model based on convolutional neural networks and long short-term memory neural networks to learn both spatiotemporal correlations and dependencies of traffic congestion between road segments and predict traffic congestion in road networks;
• The effectiveness and efficiency of our proposed representation framework are demonstrated by extensive experiments on a typical urban road traffic network.
The remainder of this paper is organized as follows. Section 2 presents a detailed account of our proposed framework and pooling operation. Section 3 describes extensive experiments on a dataset of traffic congestion for an urban road traffic network, which verify the effectiveness, efficiency, and feasibility of the proposed approach. Finally, Section 4 provides some conclusions.

The Proposed Approach
In this section we first propose our framework for the representation of traffic congestion data of a road network. Our proposed representation framework consists of two steps. The first step segments original traffic congestion matrices into equally sized grids. The second step reduces all values in each grid using a pooling operation into a single value which will replace all values in that grid, and thus, the size of original traffic congestion matrices is reduced in a way similar to image down-sampling [31]. These two steps are described in Sections 2.1 and 2.2. Then, we construct a deep learning model based on CNNs and LSTMs to evaluate the effectiveness of the proposed representation framework.
We use raw snapshots of traffic congestion maps captured from online map service providers, as shown in Figure 1a, as a raw data source for traffic congestion data. As can be seen in Figure 1a, roads are marked with different colors for different congestion levels, which provide useful traffic congestion data, yet there are also background and other nonroad elements which are not needed and thus must be removed. In order to keep only roads marked with congestion information, an image mask is derived from a special kind of raw snapshot of traffic congestion maps for the same area. These snapshots are special in that all roads in them are marked with the color for smooth traffic, which is green, as used by almost all online map service providers. They are widely available late at night when there are few vehicles on the roads. As an example, Figure 2a shows such a special raw snapshot captured at 02:54 on 30 March 2019. With the help of image processing algorithms, green pixels for smooth roads are converted to 1 while all other pixels are converted to 0. Thus, an image mask is obtained, as shown in Figure 2b. After background removal using image masks for road networks, raw snapshots are transformed into images like Figure 1b, which keep only a road network whose road segments are marked with congestion levels using different colors, for example green, yellow, red, and dark red. Then, each of these network-only images is converted into a matrix, with each pixel turned into a normalized value in [0.0, 1.0] according to the color of that pixel's congestion level. Although derived properties of traffic flow in the existing literature, such as the congestion intensity [32] and congestion index [33], are defined on wider ranges of values, such value ranges are inappropriate as direct inputs to deep learning models for the prediction of traffic congestion, because without normalization they cause a problem known as internal covariate shift [34].
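The mask derivation and background removal described above can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the RGB value `SMOOTH_GREEN` and the function names are hypothetical, and exact color matching is assumed (real snapshots may need a tolerance because of anti-aliasing).

```python
import numpy as np

# Hypothetical RGB value for "smooth" roads; the real value depends on the map provider.
SMOOTH_GREEN = (0, 255, 0)

def derive_road_mask(night_snapshot, green=SMOOTH_GREEN):
    """Return an H x W mask: 1 where a pixel has the smooth-road color, 0 elsewhere."""
    is_green = np.all(night_snapshot == np.array(green, dtype=night_snapshot.dtype), axis=-1)
    return is_green.astype(np.uint8)

def remove_background(snapshot, mask):
    """Zero out every non-road pixel of an H x W x 3 snapshot."""
    return snapshot * mask[..., np.newaxis]

# Tiny 2 x 2 example: roads lie on the main diagonal.
night = np.array([[[0, 255, 0], [10, 10, 10]],
                  [[200, 200, 200], [0, 255, 0]]], dtype=np.uint8)
rush_hour = np.array([[[255, 0, 0], [9, 9, 9]],
                      [[7, 7, 7], [255, 255, 0]]], dtype=np.uint8)

mask = derive_road_mask(night)
network_only = remove_background(rush_hour, mask)
```

Only the two diagonal pixels survive in `network_only`; all background pixels become black.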
Such matrices form a set of original traffic congestion matrices. As discussed in Section 1, these original traffic congestion matrices need to be reduced because of the often limited availability of computing resources and to prevent overfitting.

Grid-Based Partition of Congestion Data
Let $P_t$ be an original traffic congestion matrix with $M$ rows and $N$ columns representing traffic congestion levels for a road network at time $t$, as shown below:
$$P_t = \begin{bmatrix} p^t_{1,1} & \cdots & p^t_{1,N} \\ \vdots & \ddots & \vdots \\ p^t_{M,1} & \cdots & p^t_{M,N} \end{bmatrix}$$
where each element of $P_t$ is a numerical value representing one traffic congestion level for a corresponding pixel of a raw traffic congestion map snapshot captured at time $t$. After grid-based partition of $P_t$ using a grid size of $g \times g$, $P_t$ is divided into $R \times C$ grids. As an intuitive visual illustration, Figure 1c shows this segmentation process applied to a network-only image corresponding to $P_t$. For a grid located at $(i, j)$, where $1 \le i \le R$ and $1 \le j \le C$, $\{p^k_{t,i,j} \mid 1 \le k \le g^2\}$ denotes the set of values for congestion levels as represented by all pixels in that grid at time $t$.

Reduction of Grid Values
Each of these grids of $P_t$ is reduced into a single value through a pooling operation, so that $P_t$ is converted into a compressed traffic congestion matrix $C_t$ with $R$ rows and $C$ columns, as shown below:
$$C_t = \begin{bmatrix} C^t_{1,1} & \cdots & C^t_{1,C} \\ \vdots & \ddots & \vdots \\ C^t_{R,1} & \cdots & C^t_{R,C} \end{bmatrix}$$
where an element $C^t_{i,j}$ of $C_t$ is calculated by a pooling function applied to the set $\{p^k_{t,i,j} \mid 1 \le k \le g^2\}$, which contains all values in the grid of $P_t$ indexed by $(i, j)$.
Through this process of grid-based partition and reduction, each element of $C_t$ is a derived value representing one corresponding grid of $P_t$. Thus, $P_t$ is down-sampled and compressed into $C_t$ by a ratio of $1/g^2$, yet the relative spatial relationships between roads are mostly kept, as shown in Figure 1d.
The pooling operations used by the reduction process above are essential to our proposed approach because the effectiveness of grid-based representation of road network traffic congestion data is determined by such operations, which act as a kind of feature extraction filter. We propose a pooling function which retrieves the maximum of all values (MAV) in a grid of $P_t$, which is rarely used when representing traffic congestion data. MAV is described by Equation (1):
$$C^t_{i,j} = \max_{1 \le k \le g^2} p^k_{t,i,j} \tag{1}$$
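As a minimal sketch of the two steps above (grid-based partition, then MAV reduction), the following NumPy code partitions a matrix into $g \times g$ grids and keeps each grid's maximum. The function names are hypothetical, and the sketch assumes that $M$ and $N$ are divisible by $g$, so that $R = M/g$ and $C = N/g$.

```python
import numpy as np

def partition_into_grids(P, g):
    """Split an M x N matrix into R x C grids; result has shape (R, C, g*g).

    Assumption: M and N are divisible by g. Within each grid the g*g values
    are ordered row by row, so index 0 is the grid's upper-left element.
    """
    M, N = P.shape
    if M % g or N % g:
        raise ValueError("matrix dimensions must be divisible by the grid size")
    R, C = M // g, N // g
    # (R, g, C, g) -> (R, C, g, g) -> (R, C, g*g)
    return P.reshape(R, g, C, g).transpose(0, 2, 1, 3).reshape(R, C, g * g)

def mav_pool(P, g):
    """MAV pooling: replace each g x g grid by its maximum value."""
    return partition_into_grids(P, g).max(axis=-1)

# 4 x 4 toy congestion matrix with levels in {0, 0.25, 0.5, 0.75, 1}.
P = np.array([[0.00, 0.25, 0.00, 0.00],
              [0.50, 0.25, 0.00, 0.75],
              [0.00, 0.00, 1.00, 0.25],
              [0.00, 0.00, 0.25, 0.25]])
C_t = mav_pool(P, 2)
```

With $g = 2$ the 4 × 4 matrix becomes a 2 × 2 matrix, that is, a quarter of the original size, which matches the $1/g^2$ ratio stated above.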

Prediction Model
Convolutional neural networks (CNNs) use convolution filters to extract local and global features through sliding windows, and can learn spatial correlations of traffic flow variables nearby or in entire cities [22,27,35-37]. Long short-term memory neural networks (LSTMs) were proposed by Hochreiter and Schmidhuber in 1997 [38]. They can learn temporal relationships and dependencies from time series data and have been applied to short-term traffic prediction [5,39-41]. Deep learning models combining CNNs and LSTMs are widely used in the literature regarding traffic flow prediction and can capture both the spatial correlations and temporal dependencies of traffic flow variables on road networks [5,36,42-44].
On the basis of the popularity and performance of models combining CNNs and LSTMs in the existing literature, we propose a spatiotemporal traffic congestion network (STCN) based on CNNs and LSTMs to evaluate the effectiveness of our proposed representation framework for traffic congestion data. An overview of STCN's architecture is shown in Figure 3. STCN contains three main components. The first component consists of four CNNs and is used to learn spatial features and correlations of traffic congestion between roads in a road network. The second consists of two LSTMs and is used to mine temporal dependencies across a series of historical traffic congestion data. Finally, the third has a fully connected layer and a reshape operation to construct the predicted traffic congestion in that road network.
The first component takes a chronologically ordered sequence of matrices in the form of C t as its input. The spatial features and correlations extracted by the first component are used as inputs to the second component. The output of the second component is processed by the third to predict traffic congestion levels and construct a traffic congestion map as output.
Additionally, a max-pooling layer is used after each convolution layer to select representative features, together with a batch-normalization layer to overcome internal covariate shift. A dropout layer is inserted before the fully connected layer to prevent overfitting.

Dataset
Originating from various traffic detectors [45] or powered by online map services such as Google Maps [46], real-time traffic condition maps of transportation networks are regularly archived in the form of snapshots by transportation administration departments 24 h a day, 7 days a week, and are provided online [47,48]. In order to build a dataset of traffic congestion data, we first create an initial data source in this work by following the procedures used by these transportation administration departments. Figure 1a is an example of a raw snapshot of a traffic congestion map for an urban area in Guiyang, Guizhou province, China, which was one of the top 10 most congested major cities in China in 2019 [49]. Each raw snapshot for this area is 256 pixels wide and 256 pixels high and mainly covers urban arterial roads. Such snapshots are retrieved at a scale of 1:50,000 every 10 min during morning rush hours between 07:00 and 10:00 from 1 January 2019 to 30 September 2019 through a free API provided by the online map service provider AutoNavi [50]. Requests to that API sometimes fail and cause missing data, which are left blank in this paper. These snapshots form an initial data source of traffic congestion data.
After their background and other nonroad elements have been removed as described in Section 2, raw snapshots from our initial data source are transformed into network-only images. Then, we use these network-only images containing road segments marked with congestion levels to build a dataset for traffic congestion research. The congestion level given by the color of each pixel inside images like Figure 1b is linearly converted to a normalized value based on its color (transparent, green, yellow, red, or dark red) to form an original traffic congestion matrix P t [30]. Specifically, transparent pixels are converted to 0.0, green ones to 0.25, yellow ones to 0.5, red ones to 0.75, and dark red ones to 1.0, because traffic congestion levels are categorized by the online service provider based on the calculated linear travel time index. Gray pixels in a network-only image such as Figure 1b indicate road segments that are inaccessible or have missing data, and are treated as transparent ones. After conversion, the size of original traffic congestion matrices is 256 × 256.
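The color-to-level conversion above can be illustrated with a short NumPy sketch. The level values (0.25, 0.5, 0.75, 1.0, and 0.0 for everything else) come from the text; the RGB triples in `LEVEL_BY_COLOR` and the function name are hypothetical placeholders, since the provider's actual palette is not given here.

```python
import numpy as np

# Hypothetical RGB palette; only the level values on the right come from the text.
LEVEL_BY_COLOR = {
    (0, 255, 0): 0.25,    # green
    (255, 255, 0): 0.5,   # yellow
    (255, 0, 0): 0.75,    # red
    (139, 0, 0): 1.0,     # dark red
}

def image_to_congestion_matrix(image):
    """Convert an H x W x 3 network-only image into an H x W matrix of levels.

    Pixels whose color is not in the palette (transparent background, gray
    segments with missing data) keep the default level 0.0.
    """
    matrix = np.zeros(image.shape[:2], dtype=np.float32)
    for color, level in LEVEL_BY_COLOR.items():
        matches = np.all(image == np.array(color, dtype=image.dtype), axis=-1)
        matrix[matches] = level
    return matrix

# One row of five pixels: green, yellow, red, dark red, gray.
row = np.array([[[0, 255, 0], [255, 255, 0], [255, 0, 0],
                 [139, 0, 0], [128, 128, 128]]], dtype=np.uint8)
levels = image_to_congestion_matrix(row)
```

Applied to a 256 × 256 network-only image, the same function would yield a 256 × 256 original traffic congestion matrix.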

Comparative Methods and Metric
Pooling operations applied to values in each grid of an original traffic congestion matrix determine the effectiveness of our proposed representation framework, as discussed in Section 2.2. We compare our proposed MAV pooling operation with three others used in the existing literature:
• The nearest neighbor value (NNV), as defined by Equation (2), is based on a common image resampling algorithm [51]. In our experiment, this operation returns the value in the upper left corner of a grid [52];
• The average of the maximum and minimum values (AMM), which is inspired by the weighted median filter [53,54]. In our experiment, this operation returns the mean of the maximum value and minimum value in a grid, as defined by Equation (3);
• The nonzero average (ANZ) of the values in each grid, used in the existing literature for the prediction of traffic flow variables such as speed or congestion [5,28], as defined by Equation (4).
$$C^t_{i,j} = p^k_{t,i,j} \tag{2}$$
$$C^t_{i,j} = \frac{1}{2}\left(\max_{1 \le k \le g^2} p^k_{t,i,j} + \min_{1 \le k \le g^2} p^k_{t,i,j}\right) \tag{3}$$
$$C^t_{i,j} = \frac{\sum_{k=1}^{g^2} p^k_{t,i,j}}{\left|\{k : p^k_{t,i,j} \neq 0\}\right|} \tag{4}$$
in which $p^k_{t,i,j}$ is a certain element in a grid indexed by $i$ and $j$ of $P_t$ before this grid is transformed.
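A minimal NumPy sketch of the three comparison operations follows. The function names are hypothetical. Two points are assumptions rather than statements from the text: matrix dimensions are divisible by the grid size, and ANZ returns 0 for a grid whose values are all zero.

```python
import numpy as np

def grids(P, g):
    """Reshape an M x N matrix into (R, C, g*g); index 0 is each grid's upper-left value."""
    R, C = P.shape[0] // g, P.shape[1] // g
    return P.reshape(R, g, C, g).transpose(0, 2, 1, 3).reshape(R, C, g * g)

def nnv_pool(P, g):
    """Nearest neighbor value: keep the upper-left value of each grid."""
    return grids(P, g)[..., 0]

def amm_pool(P, g):
    """Average of the maximum and minimum values of each grid."""
    G = grids(P, g)
    return (G.max(axis=-1) + G.min(axis=-1)) / 2

def anz_pool(P, g):
    """Average over the nonzero values of each grid (assumed 0 if the grid is all zero)."""
    G = grids(P, g)
    total = G.sum(axis=-1)
    count = np.count_nonzero(G, axis=-1)
    return np.divide(total, count, out=np.zeros_like(total, dtype=float), where=count > 0)

P = np.array([[0.00, 0.25, 0.00, 0.00],
              [0.50, 0.25, 0.00, 0.75],
              [0.00, 0.00, 1.00, 0.25],
              [0.00, 0.00, 0.25, 0.25]])
nnv, amm, anz = nnv_pool(P, 2), amm_pool(P, 2), anz_pool(P, 2)
```

In this toy matrix NNV returns 0 for the upper-right grid even though it contains a red (0.75) pixel, while AMM and ANZ report 0.375 and 0.75 for the same grid.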

Experiment Settings
Detailed descriptions of the architecture and parameter configuration of our STCN are shown in Table 1. It was implemented with Keras [55], an open-source deep learning framework. Experiments were run on a workstation with Ubuntu 18.04 installed. This experimental device had only one Nvidia GeForce RTX 2080 Ti graphics card with 11,019 megabytes of GPU memory. The model was trained with the RMSprop optimizer [56]. The learning rate was set to 0.001 and the decay parameter was set to 0.9. The batch size was dynamic because the model was trained and tested by day using back-testing, and missing data led to different numbers of samples each day. The loss function was a customized weighted mean squared error (wMSE) defined in Equation (5):
$$\mathrm{wMSE} = \frac{1}{R \times C} \sum_{i=1}^{R} \sum_{j=1}^{C} w^t_{ij} \left(c^t_{ij} - \hat{c}^t_{ij}\right)^2 \tag{5}$$
where $c^t_{ij}$ and $\hat{c}^t_{ij}$ are the ground-truth and predicted congestion levels of element $(i, j)$ at time $t$, and $w^t_{ij}$ stands for the penalty weight applied to different congestion levels, because different congestion levels have different priorities. In addition, early stopping was used to prevent overfitting.
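To make the idea of a weighted mean squared error concrete, here is a small NumPy sketch. It is only an illustration under stated assumptions: the weights in `LEVEL_WEIGHTS` are invented for the example (the actual penalty weights are not given in this section), the weight of each element is looked up from its ground-truth level, and the error is averaged over all elements of one matrix.

```python
import numpy as np

# Hypothetical penalty weights per ground-truth congestion level (not from the paper).
LEVEL_WEIGHTS = {0.0: 1.0, 0.25: 1.0, 0.5: 2.0, 0.75: 3.0, 1.0: 4.0}

def weight_matrix(truth):
    """Build the weight of each element from its ground-truth congestion level."""
    w = np.ones_like(truth, dtype=float)
    for level, weight in LEVEL_WEIGHTS.items():
        w[np.isclose(truth, level)] = weight
    return w

def wmse(truth, pred):
    """Weighted mean squared error over all elements of one congestion matrix."""
    return float(np.mean(weight_matrix(truth) * (truth - pred) ** 2))

truth = np.array([[0.0, 0.5], [1.0, 0.25]])
pred = np.array([[0.0, 0.25], [0.5, 0.25]])
loss = wmse(truth, pred)                    # weighted
plain = float(np.mean((truth - pred) ** 2))  # unweighted, for comparison
```

With these example weights, the error on the most congested element counts four times as much as it would in a plain MSE, so the weighted loss (0.28125) is larger than the unweighted one (0.078125).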
The STCN discussed in Section 2.3 was used to compare the accuracy of our proposed pooling method with that of the other three described above. These pooling operations are essential to our proposed framework, which represents road network traffic congestion data in a down-sampled and compressed way for traffic congestion prediction.
Previous work determined an optimized time lag of 120 min for traffic prediction [57]. Therefore, for all pooling operations, 12 compressed traffic congestion matrices with an interval of 10 min during the past 12 × 10 = 120 min, arranged in chronological order, were used as input to the STCN model. These input matrices were obtained after grid-based partition of the original traffic congestion matrices using a grid size of 2 × 2 and the separate application of each of these four pooling operations. The reason for using a grid size of 2 × 2 in our work is twofold. Firstly, it is inspired by its popular utilization in convolutional neural networks [25,58-62] and as a default pool size of pooling layers in deep learning frameworks such as Keras [55] and TensorFlow [63]. Secondly and more importantly, it can strike a balance between the demand for computational resources and the loss of traffic congestion information due to down-sampling [64,65]. The output of this model is a compressed congestion matrix at one of six short-term prediction horizons, including the typical horizons of 10, 30, and 60 min as used in [5,30], and also 20, 40, and 50 min used in this paper. For all four pooling operations, the ground-truth matrices for the predicted output matrices were obtained after grid-based partition of the corresponding original traffic congestion matrices using a grid size of 2 × 2 and application of the NNV operation defined by Equation (2).
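The construction of input and ground-truth pairs described above can be sketched as a sliding window. The function name and the toy data are hypothetical; the sketch assumes one uninterrupted morning of snapshots (19 snapshots from 07:00 to 10:00 at 10 min intervals) and ignores missing data.

```python
import numpy as np

def make_samples(inputs, targets, lags=12, horizon_steps=1):
    """Pair each window of `lags` input matrices with the target `horizon_steps` later.

    inputs:  (T, R, C) matrices pooled with the operation under test.
    targets: (T, R, C) ground-truth matrices (NNV-pooled in the paper's setup).
    One step is 10 min, so horizon_steps=1..6 covers horizons of 10..60 min.
    """
    if inputs.shape[0] != targets.shape[0]:
        raise ValueError("inputs and targets must cover the same time steps")
    n = inputs.shape[0] - lags - horizon_steps + 1
    if n <= 0:
        raise ValueError("series too short for this lag and horizon")
    X = np.stack([inputs[s:s + lags] for s in range(n)])
    y = np.stack([targets[s + lags + horizon_steps - 1] for s in range(n)])
    return X, y

# Toy morning: 19 time steps of 2 x 2 matrices (real matrices would be 128 x 128).
morning = np.arange(19 * 4, dtype=float).reshape(19, 2, 2)
X10, y10 = make_samples(morning, morning, lags=12, horizon_steps=1)
X60, y60 = make_samples(morning, morning, lags=12, horizon_steps=6)
```

Under these assumptions a complete morning yields seven samples for the 10 min horizon but only two for the 60 min horizon, which illustrates why the number of samples per day varies.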
Instead of dividing the dataset into a training set and a test set by a certain fixed time point or using cross-validation, we used back-testing by day to compare the accuracy of each of the four pooling operations when incorporated into our proposed traffic congestion representation framework for traffic congestion prediction [30]. Traffic congestion levels in the morning rush hours between 07:00 and 10:00 on each of the 20 working days from 3 September 2019 to 30 September 2019 were tested. Traffic congestion data from the past 133 consecutive working days before each tested working day were used as training data.
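The back-testing scheme above can be sketched as a rolling split over working days. The function name and the integer day labels are hypothetical; the only values taken from the text are the window of 133 training days and the 20 tested days.

```python
def backtest_splits(working_days, test_days, train_window=133):
    """Yield (training_days, test_day) pairs for back-testing by day.

    working_days: chronologically ordered list of all working days.
    test_days:    the days to be tested, each present in working_days.
    Each test day is paired with the `train_window` working days before it.
    """
    position = {day: i for i, day in enumerate(working_days)}
    for day in test_days:
        i = position[day]
        if i < train_window:
            raise ValueError(f"not enough history before {day!r}")
        yield working_days[i - train_window:i], day

# Toy calendar: 153 working days labeled 0..152; the last 20 are tested.
days = list(range(153))
splits = list(backtest_splits(days, days[-20:]))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]
```

Each tested day gets its own model trained on a window that slides forward by one working day, so no tested day ever appears in its own training data.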
We used mean absolute error (MAE), mean squared error (MSE), and roads-only mean absolute percentage error (roMAPE), defined respectively by Equations (6)-(8), as the accuracy metrics to evaluate the performance of traffic congestion prediction when using different pooling operations inside our proposed representation framework. Because nonroad areas including background and other elements were converted to 0 in the original traffic congestion matrices, roMAPE only considers errors for those $c^t_{ij}$ which correspond to grids of an original traffic congestion matrix containing road segments marked with congestion levels.
$$\mathrm{MAE} = \frac{1}{R \times C} \sum_{i=1}^{R} \sum_{j=1}^{C} \left|c^t_{ij} - \hat{c}^t_{ij}\right| \tag{6}$$
$$\mathrm{MSE} = \frac{1}{R \times C} \sum_{i=1}^{R} \sum_{j=1}^{C} \left(c^t_{ij} - \hat{c}^t_{ij}\right)^2 \tag{7}$$
$$\mathrm{roMAPE} = \frac{100\%}{\sum_{i=1}^{R} \sum_{j=1}^{C} [c^t_{ij} > 0]} \sum_{i=1}^{R} \sum_{j=1}^{C} [c^t_{ij} > 0] \, \frac{\left|c^t_{ij} - \hat{c}^t_{ij}\right|}{c^t_{ij}} \tag{8}$$
In these three equations, $c^t_{ij}$ and $\hat{c}^t_{ij}$ respectively denote a ground-truth traffic congestion level and a predicted traffic congestion level at time $t$ for an element indexed by $(i, j)$ in a compressed traffic congestion matrix $C_t$. In Equation (8), $[P]$ is the Iverson bracket, which converts a logical proposition $P$ to either 1 or 0 according to whether $P$ is true or false.

Table 2 and Figure 4 show the results of the accuracy metrics of our proposed representation framework incorporating each pooling operation, as evaluated by STCN. The metric values in Table 2 were rounded to four decimal places, and minimum values were marked in bold according to their original values before rounding.
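The three accuracy metrics behind these results can be sketched in NumPy as follows. This is an illustration for a single pair of matrices, not the authors' evaluation code: the function names are hypothetical, and the roads-only condition is assumed to be a strictly positive ground-truth level.

```python
import numpy as np

def mae(truth, pred):
    """Mean absolute error over all elements."""
    return float(np.mean(np.abs(truth - pred)))

def mse(truth, pred):
    """Mean squared error over all elements."""
    return float(np.mean((truth - pred) ** 2))

def ro_mape(truth, pred):
    """Roads-only MAPE in percent.

    Assumption: an element counts as road when its ground-truth level is > 0,
    which also avoids division by zero on background elements.
    """
    road = truth > 0
    if not road.any():
        raise ValueError("ground truth contains no road elements")
    return float(100.0 * np.mean(np.abs(truth[road] - pred[road]) / truth[road]))

truth = np.array([[0.0, 0.5], [1.0, 0.25]])
pred = np.array([[0.0, 0.25], [0.5, 0.25]])
scores = (mae(truth, pred), mse(truth, pred), ro_mape(truth, pred))
```

In this toy case the background element is excluded from roMAPE, so the percentage error is averaged over the three road elements only.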
It can be seen that in terms of MSE, MAV achieved the minimum average daily errors of 0.0050 for 30 min, 0.0052 for 40 min, and 0.0052 for 60 min. As for MAE, MAV performed better than the other three with 0.0189 for 10 min, 0.0211 for 30 min, 0.0219 for 40 min, and 0.0219 for 60 min. With regard to roMAPE, MAV produced three minimum average daily errors: 5.6986 for 20 min, 5.8901 for 40 min, and 5.8229 for 50 min. Considering MSE, MAE, and roMAPE, MAV produced minimum errors in more than half of these six prediction horizons when predicting traffic congestion levels. Additionally, Figure 4 illustrates the trends of MAE and MSE for the four pooling operations along the prediction horizon. It can be observed that the prediction errors generally increase, which might be caused by greater uncertainty as the prediction horizon moves further into the future. However, MAV produces the best overall prediction errors with a more stable trend than the others. Hence, it can be inferred that when used as a pooling operation, MAV together with our proposed framework can properly and effectively represent traffic congestion data for short-term traffic congestion prediction. The reason for MAV's best overall performance might be twofold. Firstly, MAV always chooses the most serious congestion level in each grid, which is the most representative value for the real-world region corresponding to that grid. Secondly, serious traffic congestion in one region is more likely to propagate to other ones. As for the other three pooling operations, NNV misses the most congested level in 75% of cases, while the other two reduce the significance and representativeness of the most congested level through averaging. As an example of the prediction errors by day, details about the daily prediction performance in terms of MAE and MSE with a horizon of 40 min are shown in Figure 5.
It can be seen that MAV has a smaller variation than ANZ, AMM, and NNV, which is confirmed by the standard deviation values shown in Table 3.
To evaluate the requirement of computational resources, Table 4 lists the usage of GPU time and memory both by the original matrices and by the compressed ones derived using our proposed framework with MAV as the pooling operation. When original traffic congestion matrices were used as input, STCN and these input data could not fit onto the single graphics card of our experimental device. Therefore, this case had to be evaluated on another workstation with the same configuration as our experimental device, except that it had two graphics cards of the type described in Section 3.3. In addition, it was run with the help of Horovod [66], a distributed deep learning framework. Metric values are reported as recorded on each experimental device. It can be seen that the compressed matrices derived using our proposed representation framework combined with MAV save more than 73% of GPU time and use only a little more than 55% of GPU memory when compared to the original matrices. Using a grid size of 2 × 2, original matrices are reduced by 75% in size, and thus our proposed representation framework is cost-efficient in terms of GPU time and memory.
To compare the effectiveness of the original traffic congestion matrices (ORIGINAL) and the compressed ones derived using our proposed approach, Table 5 lists the average metric values of MAE and MSE across six prediction horizons. The metric values in Table 5 were rounded to four decimal places, and minimum values were marked in bold according to their original values before rounding. In terms of MSE, MAV yields smaller errors of 0.0047, 0.0050, and 0.0052, respectively, for the prediction horizons of 20, 30, and 60 min. As for MAE, MAV performs better with error values of 0.0208, 0.0211, 0.0224, and 0.0219, respectively, for 20, 30, 50, and 60 min into the future.
It can be inferred that the compressed traffic congestion matrices derived using our proposed approach are at least as effective as the original ones, while at the same time being more efficient. This might be because the maximum value of a grid is characteristic of that grid. In particular, because only a grid size of 2 × 2 is used in our experiment, the maximum value may represent its grid well with little loss of information. As an example of the prediction of traffic congestion levels using MAV as the pooling operation of our proposed representation framework together with STCN, Figure 6 shows several examples of both the ground-truth congestion levels and the predicted ones on 25 September 2019, with a prediction horizon of 10 min. It can be seen that the predicted congestion maps are visually intuitive and recover congestion levels for most road segments in the network.

Conclusions
In this work, in order to reduce the usage of computational resources while at the same time achieving optimal performance, we first propose a framework to represent urban road traffic network congestion levels. This framework makes it possible to use large historical records of traffic congestion data to predict future short-term traffic congestion levels. We captured raw snapshots of congestion maps for an urban road traffic network in Guiyang, Guizhou province, China. These snapshots were preprocessed and transformed into a dataset consisting of matrices representing traffic congestion levels at different times. To evaluate the effectiveness and cost-efficiency of our proposed MAV pooling operation, we compared its prediction performance with that of three other existing methods within our proposed representation framework. We also propose a deep learning neural network, STCN, for traffic congestion prediction, and use it with the back-testing method. The results on the aforementioned dataset show that MAV achieves the best overall performance and can effectively and cost-efficiently represent congestion levels in an urban road traffic network for short-term traffic congestion forecasting.
On the other hand, this study only focuses on the evaluation of the representation performance of traffic congestion levels using raw snapshots of a single urban area traffic network. In addition, MAV has a limited compression ratio because it depends on the grid-based partition of urban road traffic networks, restricting its scope in terms of applicable road networks. In future work, we will experiment with snapshots of congestion maps for different scales of road networks and look for other representation schemes to improve the compression ratio of urban road traffic networks, so as to make it feasible to investigate congestion in larger-scale traffic networks.