Multi-Task Fusion Deep Learning Model for Short-Term Intersection Operation Performance Forecasting

Abstract: Urban road intersection bottlenecks have become an important cause of traffic delay and a constraint on traffic efficiency. It is therefore essential to predict the operating performance of intersections in real time and to formulate corresponding strategies to alleviate intersection delay. However, because intersection traffic conditions are sophisticated, the spatio-temporal features of intersection traffic are difficult to capture with traditional data and prediction methods. The development of big data technology and deep learning models offers a good chance to address this challenge. This paper therefore proposes a multi-task fusion deep learning (MFDL) model based on massive floating car data to predict the passing time and speed at intersections over different estimation time granularities. Moreover, a grid model and the fuzzy C-means (FCM) clustering method are developed to identify the intersection area and derive a set of key spatio-temporal traffic parameters from the floating car data. To validate the effectiveness of the proposed model, floating car data with a 3 s sampling rate from ten intersections in Beijing are adopted for training and testing. The experimental results show that the MFDL model captures the spatio-temporal and topological features of the traffic state efficiently. Compared with traditional prediction methods, the proposed model achieves the best prediction performance, and the interplay between the two targeted prediction variables significantly improves prediction accuracy and efficiency. The method thus predicts intersection operating performance in real time and can provide valuable insights for traffic managers seeking to improve intersection operation efficiency.


Introduction
Intersections are the sites where vehicles converge and turn, which makes them likely bottlenecks that restrict the operating efficiency of the entire road network. Improving the traffic efficiency of intersections has therefore always been a concern of transportation researchers and engineers. Short-term forecasting of the intersection operating state can provide real-time traffic information, helping traffic managers optimize signal control schemes to mitigate traffic delays.
The main traffic parameters of intersections, including the passing time, traffic speed and waiting time, are used to detect an intersection's operating performance. Among these parameters, the passing time and traffic speed most intuitively reflect the intersection's overall operating performance [1]. With the development of traffic sensors, and especially the emergence of the mobile Internet, it has become possible to extract traffic parameters from large-scale traffic data. Vehicles equipped with mobile sensors (i.e., floating cars) can monitor the traffic operating performance of large-scale intersection groups at low cost [2]. They transfer real-time traffic information to a database via the Global Positioning System (GPS) and modern communication technology, and have gradually become an important data source for monitoring intersections. Therefore, we combine a residual network with a CNN model to extract the spatial features of intersections.
This research focuses on devising a multi-task fusion deep learning (MFDL) model to predict intersection operating performance. Multi-task learning has been applied in traffic fields such as network traffic classification [26], traffic flow forecasting [27] and traffic demand prediction [28], but it is rarely used in intersection prediction. The two main reasons are that the intersection's traffic state is more complex and that high-precision data are not easy to obtain. This paper selects the passing time and speed as the prediction targets, which reflect the process state and the visual state of the intersection. Compared with existing multi-task models, we adopt an attention mechanism that assigns weights to the variables to achieve feature fusion automatically. The multi-task learning method considers the difference between the passing time and speed while sharing the features common to each task, which yields a more comprehensive prediction of the intersection operating state. The contributions of this work are as follows.
First, the traffic parameters are extracted based on the grid model. The grid-based method can identify intersections rapidly without a digital map, so it is easily transferred to other cities.
Second, the residual network (ResNet) is incorporated into the MFDL model to increase the depth of the model, which helps alleviate the problem of gradient vanishing. Furthermore, based on the GCN model, the topological propagation patterns of the intersections are also considered, which previous studies rarely involved.
Third, the MFDL framework integrates the deep learning methods, preprocesses the data, extracts the traffic features of the intersection, and finally predicts the passing time and speed of the intersections. Compared with the benchmark models, the MFDL model not only captures the spatio-temporal traffic features of intersections but also has better accuracy and robustness. Meanwhile, compared with the single-task model, the MFDL significantly improves prediction accuracy and efficiency. The MFDL can be easily transferred to other cities for predicting the traffic operating performance of intersections.
Using the intersection groups of Beijing as a case study, the proposed method's accuracy and stability are demonstrated. The rest of this paper is organized as follows. Section 2 describes the floating car data and details the proposed methodology. Section 3 conducts a case study of the Beijing core area and an in-depth analysis of the experimental results. Finally, Section 4 ends with the major conclusions and discussions for future research. Figure 1 shows the framework of the multi-task fusion deep learning method, which includes the data processing procedures, model construction and model evaluation. In the first and second sections, according to the reconstructed intersection, the coordinates of the floating car data (i.e., FCD) are transformed to grid-based coordinates to extract traffic parameters. In the third section, we construct the multi-task fusion deep learning model by fusing the CNN, GCN, LSTM, ResNet and an attention mechanism. In the fourth section, after feeding the dataset into the model, we evaluate and verify the model's effectiveness.

Data Preprocessing
The floating car data, with a sampling rate of 3 s, were obtained from the DiDi company, one of the companies with the largest number of car-hailing users globally. The FCD mainly contains three kinds of information: temporal, spatial and speed information. The variables and attributes of the FCD are listed in Table 1. The data are the same as those used in previous studies [9,29]. Figure 2a shows the trajectory points, from which we can obtain the topology of the intersections (see Figure 2b). We select ten intersections as the target objects in this study area, with IDs from No.0 to No.9 (see Figure 2b). The selected intersections cover two types: regular intersections with four legs (i.e., No.0 to No.7) and intersections with three legs (i.e., No.8 and No.9). Due to the blocking of GPS signals and hardware/software errors during data collection and transmission, there may be some errors in the original floating car data. It is necessary to preprocess the FCD to reduce the impact of erroneous data and improve prediction accuracy. A more detailed data preprocessing process has been described in previous studies [9,16,29]. This paper introduces the two main steps as follows: Step 1: Remove the FCD that is out of range of the intersection area; delete the FCD whose speed is over 90 km/h; eliminate redundant FCD, i.e., similar or duplicate records.
Step 2: Replenish the missing data by the interpolation method due to the weak satellite signal or operation error during the FCD collecting.
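The two preprocessing steps above can be sketched as follows. The record fields (`vid`, `ts`, `speed`, `in_area`) and the use of linear interpolation for gap filling are illustrative assumptions; the paper does not fix a record schema or interpolation scheme.

```python
import numpy as np

def preprocess_fcd(records, speed_limit_kmh=90.0):
    """Step 1: filter out-of-range, over-speed and duplicate FCD records."""
    seen = set()
    cleaned = []
    for r in records:
        # Remove points outside the intersection area (flag assumed precomputed).
        if not r["in_area"]:
            continue
        # Delete implausible speeds above 90 km/h.
        if r["speed"] > speed_limit_kmh:
            continue
        # Eliminate duplicates: same vehicle at the same timestamp.
        key = (r["vid"], r["ts"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

def fill_missing_speed(ts, speeds):
    """Step 2: linear interpolation over gaps (NaN) in a speed series."""
    speeds = np.asarray(speeds, dtype=float)
    ok = ~np.isnan(speeds)
    return np.interp(ts, np.asarray(ts, dtype=float)[ok], speeds[ok])
```

For instance, a record duplicated at the same timestamp is kept only once, and a missing 3 s sample is replenished from its neighbors.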

The Grid Model
In this study, the intersection area range (i.e., IA) is set to 300 m, covering the range of an intersection proposed by a previous study [30]. Within the intersection area, the intersection is divided into discrete grids with a side length of D. If the grid size is too large, a grid may extend beyond the intersection area; if it is too small, it cannot adequately cover a floating car. A previous study suggested that the grid size for identifying intersections should range from 1 to 10 m [31]. Given the above considerations, the side length of the grid is set to 5 m, so the intersection area is composed of Rows × Columns grids of 5 × 5 m², with Rows = Columns = 60 in this study (see Figure 3). After dividing the intersection area into square grids, the floating car data are matched to the grids by a basic arithmetic algorithm, which is efficient and simple. Thereby, the FCD is transformed into a grid-based dataset (i.e., GFCD). In the algorithm, IA_right, IA_left, IA_up and IA_down are the right, left, upper and lower boundaries of the intersection area; NR and NC are the numbers of rows and columns; and (R_lat, C_lng) is the grid coordinate to which a trajectory point with latitude and longitude (Tr_lat, Tr_lng) belongs.
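A minimal sketch of this grid-matching arithmetic is given below. The exact equation is not reproduced in the text above, so the row/column convention (rows counted from the upper boundary, proportional scaling across the area) is an assumption consistent with the stated definitions.

```python
import math

def to_grid(tr_lat, tr_lng, ia_up, ia_down, ia_left, ia_right, nr=60, nc=60):
    """Map a trajectory point (Tr_lat, Tr_lng) to its (row, col) grid cell
    inside the intersection area; rows are indexed from the upper boundary."""
    r = math.floor((ia_up - tr_lat) / (ia_up - ia_down) * nr)
    c = math.floor((tr_lng - ia_left) / (ia_right - ia_left) * nc)
    # Clamp points lying exactly on the lower/right boundary into the last cell.
    return min(r, nr - 1), min(c, nc - 1)
```

Applied to every FCD point, this yields the grid-based dataset (GFCD) in O(1) time per point, which is why the method scales to massive trajectory data.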

Identification of Traffic Intersection Area
Due to the unique Spatio-temporal characteristics of each signalized intersection, the corresponding range of the influence area of signalized intersection is also different. If we define the influence area of signal intersection as a fixed area, it cannot sufficiently reflect the unique traffic characteristics of a signalized intersection. It is well known that the intersection is the bottleneck in the road network, where the stop-and-go phenomenon is also the most common. Therefore, we extracted the stop feature to define the range of the intersection. The stop frequency in a special area at intersections varies with the GFCD spatial feature. In general, the entrance area is more distinguished than other areas since the stop frequency in the entrance area is higher than the upstream and inner areas. Based on this feature, we constructed the stop dataset of GFCD by making statistics of stops in each intersection area grid.
Based on the stop dataset, the intersection area can be clustered into three groups: the upstream area, the entrance area and the area near the stop line. To distinguish the three areas, this study adopts the fuzzy C-means (FCM) clustering method [32] to identify the clusters. FCM clustering builds on fuzzy set theory and provides more flexible clustering results [33]. In most cases, the traffic areas cannot be divided into clearly separated clusters; the membership degrees of the FCM method range from 0 to 1, which suits the traffic stop scenario. In the FCM objective function, U = {u_kl, k = 1, ..., c; l = 1, ..., n} is the set of membership degrees, restricted by the normalization rule ∑_{k=1}^{c} u_kl = 1, where n is the number of GFCD samples and c is the number of cluster centers; d²_kl is the squared Euclidean distance between sample l and center k; x_l is the spatial vector of the l-th GFCD sample; and v_k is the spatial vector of the k-th cluster center. Under the constraint ∑_{k=1}^{c} u_kl = 1, the objective function is minimized by setting the derivative of the Lagrangian to zero, and u_kl and v_k are then updated iteratively. According to prior knowledge about the stop dataset, the number of clusters is set to three, with the low, medium and high stop frequencies corresponding to the three areas (i.e., the upstream, entrance and near-stop-line areas). For the Beijing center area case, the intersections can be clustered into these groups (see Figure 4). Figure 4 presents the GFCD clustering result corresponding to the three areas, which shows the effectiveness of the proposed method. It can intuitively be seen that clusters 1, 2 and 3 represent the upstream, entrance lane and near-stop-line areas, respectively. Moreover, the central area of each intersection's stop lines can be determined by the highest stop frequency.
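The alternating FCM updates described above can be sketched as follows; the fuzzifier m = 2 and the random initialization are common defaults, assumed here since the paper does not state them.

```python
import numpy as np

def fcm(X, c=3, m=2.0, iters=100, seed=0):
    """Minimal fuzzy C-means: alternate the membership and center updates."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                  # normalization rule: sum_k u_kl = 1
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)      # cluster centers v_k
        # Euclidean distances d_kl between each center and each sample.
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))                # membership update
        U /= U.sum(axis=0)              # renormalize columns to sum to 1
    return U, V
```

Run on the per-grid stop counts with c = 3, the hard labels `U.argmax(axis=0)` separate the low-, medium- and high-stop-frequency grids.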
It is noted that the central areas of intersections No.0 and No.2 contain part of cluster 2, since no protected left-turn phase is set there. This produces an obvious conflict point between the straight and left-turn movements, leading to a high stop frequency in the central area. Furthermore, according to the clustering results, we constructed the datasets of the upstream, approach and central areas of the intersection.

Identification of the Floating Car Trajectories Direction
After defining the range of the intersection area, the turning direction is identified based on the grid model. First, the legs of the intersection are identified and labeled using the GFCD (see Figure 5). The intersection is divided into five areas, and the FCD points are mapped to these five areas. Then, based on the order of the entrance and exit Area_IDs, the direction of each trajectory is identified. Taking the southbound straight movement as an example, the entrance Area_ID is 2, the exit Area_ID is 1, and the trajectory also passes through the central area of the intersection (i.e., Area_ID = 0). Based on the grid model's algorithm, it is simple to identify the direction of the floating car trajectories passing through the intersection.
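The identification rule can be sketched as a simple lookup on the ordered Area_IDs a trajectory visits. Only the southbound-straight pair (entrance 2, exit 1, crossing Area_ID 0) is taken from the text; the fallback formatting for other movements is a placeholder.

```python
def trajectory_direction(area_sequence):
    """Identify the movement from the ordered Area_IDs a trajectory visits:
    first non-central area = entrance leg, last = exit leg, and the trajectory
    must pass through the central area (Area_ID = 0)."""
    legs = [a for a in area_sequence if a != 0]
    if not legs or 0 not in area_sequence:
        return None                 # never crossed the intersection center
    entrance, exit_ = legs[0], legs[-1]
    # Movement table; only (2 -> 1) is taken from the paper's example,
    # other pairs are reported generically.
    moves = {(2, 1): "southbound straight"}
    return moves.get((entrance, exit_), f"{entrance}->{exit_}")
```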

The Multi-Task Fusion Deep Learning Model
The multi-task fusion deep learning (MFDL) model architecture is composed of a residual network, GCN, CNN, LSTM and an attention mechanism. Whereas the LSTM or GRU extracts only temporal information, the proposed architecture can capture the spatio-temporal features and the topological structure of the intersection.

The Residual Network
Deeper models can extract more road network features [16]. However, deeper models easily encounter the problems of gradient explosion and vanishing. To solve this, the Residual Network (ResNet) was introduced [34]; its core idea is to deepen the model through skip-layer connections (see Figure 6a). "Conv" indicates a convolutional layer, "BN" denotes a batch-normalization layer, and "ReLU" represents an activation layer. In this study, we use the improved ResNet structure, which better mitigates the vanishing or exploding gradient problem (see Figure 6b). The residual block computes X_{T+1} = X_T + F(X_T), where X_T is the residual block input, F is the residual mapping, and X_{T+1} is the residual block output.
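A numerical sketch of the residual block follows; the toy linear map standing in for the Conv-BN-ReLU stack of Figure 6 is an illustrative assumption.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, f):
    """Residual unit: the input skips over the transform f and is added back,
    so gradients can flow through the identity path: X_{T+1} = X_T + f(X_T)."""
    return x + f(x)

# f stands in for the Conv-BN-ReLU stack; here a toy linear map for clarity.
W = np.array([[0.1, 0.0], [0.0, 0.1]])
f = lambda x: relu(x @ W)

x = np.array([1.0, -2.0])
y = residual_block(x, f)
```

Because the identity term contributes a gradient of 1 regardless of f, stacking many such blocks does not shrink the gradient the way a plain deep stack does.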

The GCN
The road network can be regarded as a topological structure composed of points and lines, in which the points are intersections and the lines are roads. When capturing spatial features, CNN models cannot process data with a non-Euclidean structure or extract the topological relationships of intersections. In contrast, the Graph Convolutional Network (GCN) makes up for this defect [22]: it can capture the topological dependencies of intersections. In this study, the intersections are defined on a graph and we focus on the structured time series of intersection passing time (see Figure 7). At time step t, the intersection graph is defined as G = (V_t, E, W_ij). V_t is the set of vertices, corresponding to the observations from the n approaches of the intersections; E is the set of edges, indicating the connectivity between approaches; and W_ij ∈ R^{n×n} is the weighted adjacency matrix of G. The GCN layer is defined in Equation (8) as H^l = σ(D̃^{−1/2} L̃ D̃^{−1/2} H^{l−1} W^{l−1}), where σ(·) is an activation function; L̃ = L + I, with L ∈ R^{n×n} the adjacency matrix and I the identity matrix; D̃ denotes the diagonal node-degree matrix of L̃; and W^{l−1} is the layer's parameter matrix.
It should be noted that stacking multiple GCN layers increases computational complexity and makes the gradient more likely to vanish [35]. Furthermore, as GCNs grow deeper, over-smoothing makes the features of different vertices indistinguishable and degrades forecast accuracy [36]. Therefore, the ResNet GCN is proposed to make up for these weaknesses. The graph signal P_t of the intersections is transformed to P̃_t as the ResNet input; P̃ has the same shape as P and contains the topological information between the intersections.
In Equation (9), D̃^{−1/2} L̃ D̃^{−1/2} is the normalized Laplacian matrix; P ∈ R^{t×a} is the input; a is the number of intersection approaches; and t is the number of time steps for the approaches.
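The propagation rule of Equation (8) can be sketched in a few lines; ReLU is assumed as the activation σ, and the small symmetric adjacency matrix is illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step, H' = relu(D~^{-1/2} L~ D~^{-1/2} H W),
    with L~ = A + I and D~ the diagonal node-degree matrix of L~."""
    L_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = L_tilde.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(d ** -0.5)
    return np.maximum(D_inv_sqrt @ L_tilde @ D_inv_sqrt @ H @ W, 0.0)
```

Wrapping this layer inside a residual block, as in the ResNet GCN described above, adds the identity path that counters over-smoothing when several such layers are stacked.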

Multi-Task Fusion Deep Learning Method
In this section, a novel multi-task fusion deep learning model framework is proposed to forecast intersection operating performance. The framework integrates the historical pattern, real-time pattern, spatial pattern, topological structure and weather conditions to predict the passing time and speed of the intersection. There are four variable groups in the multi-task fusion deep learning method (see Figure 8). The first variable group uses the passing time as the input variable to capture the temporal feature. The second variable group extracts the topological information of the intersections. The third variable group captures the spatio-temporal features of speed. The fourth variable group captures the effect of weather on prediction accuracy. The fusion section fuses the information from the four variable groups. For the passing time variable group, the passing time is the most intuitive parameter representing the intersections' operating performance [9]. Note that the historical passing time reveals the normal propagation rule and complements the real-time passing time pattern. Therefore, this section adopts the historical and real-time patterns of the intersection passing time as the input matrix. We extracted the passing time from the floating car data (see Figure 9). According to the definition of the intersection area, for intersection approach n ∈ N and trajectory i ∈ I, the passing time can be calculated as t_i^out − t_i^in. Since the trajectory points do not coincide with the boundary of the intersection approach, it is necessary to estimate the timestamps at the approach boundary (i.e., t_i^out, t_i^in) from the kinematic (acceleration) formula.
In Equations (10) and (11), the distance between an FCD point and the boundary of the intersection approach in geodetic coordinates is calculated as d = R · arccos(sin(lat_1) sin(lat_2) + cos(lat_1) cos(lat_2) cos(lng_1 − lng_2)), where R is the radius of the earth; combined with the sampling interval Tin, this yields the time difference between the trajectory points and the approach boundary. Therefore, according to Equations (10) and (11), the average passing time (i.e., p_t) over a time granularity of 10, 15 or 20 min can be calculated by Equation (12), where lat_i^in, lng_i^in, lat_i^out, lng_i^out are the coordinates of the entrance and exit boundaries of the intersection, respectively; lat_i^s, lng_i^s, lat_i^e, lng_i^e are the coordinates of the trajectory points at entry and exit, respectively; and Tin is the sampling interval of the floating car data, 3 s in this study.
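The great-circle distance above, and a boundary-timestamp estimate derived from it, can be sketched as follows. The constant-speed approximation over the 3 s interval is a simplification of the paper's acceleration-based estimate.

```python
import math

R_EARTH = 6371000.0  # earth radius in meters (mean value, assumed)

def great_circle(lat1, lng1, lat2, lng2):
    """d = R * arccos(sin(lat1) sin(lat2) + cos(lat1) cos(lat2) cos(lng1 - lng2)),
    with coordinates in degrees and the result in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lng1 - lng2)
    # Clip to [-1, 1] to guard against floating-point overshoot.
    c = max(-1.0, min(1.0, math.sin(p1) * math.sin(p2)
                      + math.cos(p1) * math.cos(p2) * math.cos(dl)))
    return R_EARTH * math.acos(c)

def boundary_timestamp(t_point, d_to_boundary, speed, entering=True):
    """Estimate when the trajectory crossed the approach boundary, assuming
    roughly constant speed over the short sampling interval."""
    dt = d_to_boundary / max(speed, 1e-6)
    return t_point - dt if entering else t_point + dt
```

The per-trajectory passing time is then the estimated exit timestamp minus the estimated entry timestamp, averaged over each time bucket to obtain p_t.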
The passing-time matrix P_{t,n} is given by Equation (13). The input of the passing time variable group (i.e., I_1) is assembled from the real-time pattern P^r_{t,n} and the historical pattern P^h_{t,n}, where n ∈ N indexes the entrances of the intersections and t the time steps for each entrance. In the passing time variable group, three time steps (i.e., t − 2, t − 1, t) are used to predict the passing time at t + 1. For example, when the time granularity is 10 min, there are 96 time slices in the daytime (i.e., from 6:00 to 22:00), and the dataset covers 31 days. Therefore, for the 92 entrance lanes of the intersections, the input matrix (P^r_{t,n}, P^h_{t,n}) has dimensions [92 × 2 × 96 × 31]. The passing time matrix is divided into a training dataset with a proportion of 70% (i.e., [92 × 2 × 96 × 31 × 70%]) and a test dataset with a proportion of 30% (i.e., [92 × 2 × 96 × 31 × 30%]).
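A minimal sketch of the (t − 2, t − 1, t) → t + 1 sample construction and the 70/30 split for one entrance lane follows; the split along the time axis is an assumption, since the paper only states the proportions.

```python
import numpy as np

def make_samples(series, lag=3):
    """Build (t-2, t-1, t) -> t+1 samples from one entrance's series."""
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    y = np.array(series[lag:])
    return X, y

def split_70_30(X, y):
    """70% training / 30% test split along the time axis."""
    k = int(len(X) * 0.7)
    return (X[:k], y[:k]), (X[k:], y[k:])
```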
For the speed variable group, because of the spatial correlation between the upstream area, the inner area and the entrance area, we selected three parameters as input variables: the upstream average speed V^u_{t,n}, the inner average speed V^in_{t,n} and the entrance speed V^e_{t,n}. The speed matrices V^u_{t,n}, V^e_{t,n}, V^in_{t,n} are defined in Equations (16)–(18), and the input of the speed variable group (i.e., I_2) is given by Equation (19).
In the graph variable group, the topology of the intersection group has a great influence on the passing time. The ResNet GCN model is adopted to capture this topology: the passing time and average speed are input into the ResNet GCN model as graph signals, respectively, and according to Equations (9), (13) and (17), the graph variable can be defined accordingly. In the weather variable group, we consider four categories of weather variables: temperature (TE, in degrees Celsius), atmospheric pressure (PR, in Pascal), wind speed (WS, in miles per hour) and precipitation (RA, in millimeters). The weather condition is recorded once per hour (see Table 2). The data are obtained from the free meteorological data website "Wheat A" [37]. To correspond to the time granularity of the average passing time, the time slices of the weather data are transformed to the corresponding time buckets (e.g., the weather condition from 6:00 to 6:10 equals the record for 6:00 to 7:00; see the first row in Table 2). Similarly, according to the data division rules, the weather data are split into training and test datasets. The input of the weather variable matrix is given by Equation (21).
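The hourly-to-bucket transformation of the weather data can be sketched as a simple repetition of each hourly record over its sub-hour time slices, assuming the granularity divides 60 min evenly (as the 10, 15 and 20 min granularities all do).

```python
def expand_hourly_weather(hourly, granularity_min=10):
    """Repeat each hourly weather record over its time slices so the weather
    series aligns with the passing-time buckets (e.g., 6 slices per hour at
    a 10 min granularity)."""
    per_hour = 60 // granularity_min
    return [rec for rec in hourly for _ in range(per_hour)]
```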
In the feature fusion layer, the attention mechanism distributes different weights over the features produced by the neural network layers. In this paper, the attention layer is proposed to capture the weight scores of the different time steps, usually assigning a heavier weight to adjacent time periods and a lower weight to distant ones [38].
where H is a matrix consisting of the output vectors [h_1, h_2, ..., h_T] and T is the length of the sequence; V_j, j ∈ [1, 4], represents the feature variable from the j-th subsection; ω_j, j ∈ [1, 4], are the weights of the different features; and ⊗ is the Hadamard product.
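The weighted fusion of the four variable-group features can be sketched as follows; softmax-normalized scores standing in for the learned attention weights ω_j are an illustrative assumption.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_fusion(features, scores):
    """Fuse the four variable-group features: omega_j = softmax(score_j),
    fused = sum_j omega_j * V_j (element-wise product then sum)."""
    w = softmax(np.asarray(scores, dtype=float))
    V = np.stack(features)               # shape (4, feature_dim)
    return (w[:, None] * V).sum(axis=0), w
```

In training, the scores would themselves be produced by a small learned layer, so the model assigns the relative importance of the temporal, topological, speed and weather features automatically.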

Model Configuration
The model experiments were implemented in Python 3.6 with TensorFlow [39] and Keras [40] on Windows 10. The experimental platform is equipped with 32 CPU cores, 64 GB RAM and an NVIDIA GeForce RTX 2080 GPU, which meets the requirements of this experiment.
The model framework consists of the four subsections, the feature fusion section and the output section. The Rectified Linear Unit (ReLU) is used to mitigate the exploding/vanishing gradient problem, the dropout rate is set to 0.5 to prevent over-fitting, and the Adam algorithm is adopted to update the parameters of the neural network. The partial hyper-parameter settings are shown in Table 3.

Models to Be Compared
This section will feed the training and test dataset to the proposed MFDL and benchmark models, respectively. Three typical benchmark models are selected to compare with the proposed MFDL model, including the mathematical statistics-based models (MS) (e.g., ARIMA), machine-learning-based model (ML) (e.g., SVR) and the deep learning model (DL) (e.g., LSTM, GRU, CNN and ConvLSTM). To ensure fairness, the following benchmark algorithms have the same input features (the same category and the time interval).
MS model: For the ARIMA model, we use the Akaike Information Criterion (AIC) as the standard to select the optimal model. Note that it is difficult for the ARIMA model to capture the spatio-temporal features of the intersections, so we constructed 92 models, one for each of the 92 intersection entrance lanes.
ML model: Two main parameters selection of SVR (i.e., the penalty coefficient "C" and the parameter "Gamma") are based on the cross-validation, and the kernel function is set to the radial basis function.
DL model: The LSTM and GRU both have two hidden layers with 128 neurons each. For the CNN and ConvLSTM, the kernel size is 2 × 2, and the convolutional layers are set to 32 and 64 filters, respectively.
The time lag is set to 10 min, and the hyperparameters are set the same as the proposed model. Then we use the RMSE and MAE to measure the total predictive accuracy of fitting in the whole test data and use WMAPE to measure the models' predictive performance.

Loss Function and Evaluation Metrics
To compare the proposed fusion deep learning model framework with the benchmark models, three indicators are used to evaluate model performance: the Mean Absolute Error (MAE), the Root-Mean-Square Error (RMSE) and the Weighted Mean Absolute Percentage Error (WMAPE). The mean squared error (MSE) is adopted as the loss function for both the speed and the passing time, and the loss weight of each task is set to 0.5. In the metric definitions, n is the number of test samples, Y_i is the real value, Ŷ_i is the predicted value and Ȳ is the average of the real values.

Figure 10 shows the average daily order volume of floating cars at the ten intersections. It can be intuitively seen that the order volume on weekdays is larger than on weekends, which means the traffic pattern on weekdays differs from that on weekends. Meanwhile, the order volumes of the ten intersections differ across periods, meaning that the spatio-temporal traffic patterns of the intersections differ as well.

Figure 11 shows the trends of the average speed and passing time on weekdays and weekends, respectively. The fluctuations of the two variables are flat overnight (0:00-5:00) on both weekdays and weekends; therefore, when exploring the variables' features, we choose the daytime (6:00-22:00). Meanwhile, during the peak hours (i.e., 6:00-9:00 and 17:00-20:00), speeds on weekends are generally higher than on weekdays. Correspondingly, the passing time on weekdays is longer than on weekends, meaning the variables' temporal features differ. Therefore, it is necessary to consider the temporal characteristics in the prediction experiment.
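The three evaluation metrics defined at the start of this subsection can be sketched as follows; WMAPE is computed here as the sum of absolute errors divided by the sum of absolute true values, a standard form consistent with the definition above.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return (MAE, RMSE, WMAPE) for a set of test samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_true - y_pred)
    mae = err.mean()
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    # WMAPE weights each percentage error by the magnitude of the true value.
    wmape = err.sum() / np.abs(y_true).sum()
    return mae, rmse, wmape
```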

The Correlation between Speed and the Passing Time
The passing time can directly represent the operating performance of a signalized intersection, and the speed can directly reflect the visual state of the intersection operation. It is necessary to carry out a correlation analysis between the passing time and the speed to enhance the interpretability of the model and improve its accuracy [41]. The Pearson correlation coefficient reflects the degree of correlation between two variables and ranges from −1 to 1 [42]. If the coefficient is greater than zero, the two variables are positively correlated: the greater the value of one variable, the greater the value of the other. If the coefficient is less than zero, the two variables are negatively correlated: the larger the value of one variable, the smaller the value of the other.
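The coefficient can be computed directly from the two aligned series, as sketched below.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two series, in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
```

Applied to the bucketed average speed and passing time series, this yields the −0.84 overall coefficient reported below.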
The overall correlation between the average speed and the passing time is −0.84, which means the two variables are strongly negatively correlated. Moreover, we select four typical intersections: No.0 with fewer orders, No.6 with a one-way lane, No.9 with three legs and the regular No.3 (see Figure 12). The absolute correlation of No.3 is the largest (i.e., −0.89); due to the influence of intersection topology and signal phasing, the correlations of the other intersections are weaker.

Figure 13 shows the training procedure. To train the optimal model and avoid overfitting, the early stopping technique is adopted. The training and validation losses decrease rapidly and remain stable within 50 epochs, which indicates that the proposed model is robust.

The prediction performances are shown in Table 4. The MFDL considerably outperforms the mathematical statistics-based and machine-learning-based models in most cases. The ARIMA model has the worst performance, since it cannot capture the spatial features of the traffic parameters or process complicated nonlinear problems. The SVR also performs poorly: it is difficult for SVR to deal with large-scale spatio-temporal data within limited computing resources. Compared with the MS and ML models, the deep learning models perform better: by controlling the input, activation and output of the traffic data flow through continuous iteration, they can better capture the characteristics of large-scale time series. Among the deep learning models, the Conv-LSTM performs best, since it captures both temporal and spatial information.

Model Performance Comparisons and Result Analysis
The experimental results illustrate that the multi-task fusion deep learning model works best among these methods and can efficiently learn the spatio-temporal features of the two predicted target variables. As Table 4 shows, taking the prediction of speed and passing time at the 10 min time granularity as an example, the MFDL model outperforms the MS model on RMSE, MAE and WMAPE with accuracy improvements of 1.94, 1.34 and 14.47%, and 15.01, 14.04 and 23.63%, respectively. This is because the MFDL model incorporates temporal, spatial, topological and weather characteristics. It is noted that the MFDL model is only mildly affected by the weather: the prediction accuracy does not change significantly when the weather subsection is deleted. In contrast, the result of the variant without the graph subsection changes more significantly, indicating that the topological information captured by the GCN contributes substantially to the prediction, even though its ability to handle the temporal feature is low.
In general, as the time granularity (Tg) increases, the prediction performance worsens, because a larger time granularity yields a smaller data sample. In addition, all of the speed metric values are smaller than those for the passing time, since the passing time's random fluctuations at different intersections cause a more significant disturbance.

Figure 14 shows the overall prediction errors produced by the different methods at different time granularities (i.e., Tg = 10, 15 and 20 min). It intuitively indicates that the proposed model performs best on both the passing time and speed among the benchmark models. In contrast, the ARIMA model has the largest error dispersion, indicating that it cannot regress the multi-task spatio-temporal features. Furthermore, the MFDL model has a smaller interquartile range (the distance between Quartiles 1 and 3), so its metric values are more concentrated than those of the other models. Moreover, as the time granularity increases, the error distribution becomes more dispersed, consistent with the results in Table 4.

Figure 15 compares the benchmark models and the proposed model over various time steps by WMAPE. The ARIMA and SVR models perform worst at all time steps. In contrast, the proposed MFDL model provides reliable prediction precision, with the lowest WMAPE among the models at every time step. Furthermore, as the time step increases, the MFDL's metrics increase the least, indicating its excellent stability. Table 5 compares the individual (single-task) prediction metrics and the multi-task prediction over 50 epochs.
It can be seen that the prediction precision of the passing time and speed under multi-task fusion deep learning (MFDL) is better than under single-task learning (STL) at every time granularity. This indicates that the passing time and speed promote each other and jointly improve prediction accuracy. In addition, the STL model consumes more training time than the MFDL model over 50 epochs. It is noted that when the time granularity is 10 min, the training time is reduced by 8.3 min and the efficiency increases by 46.42%, which means that the MFDL is more efficient. Figure 16 shows the heat maps of the absolute error (i.e., |ground-truth value − predicted value|) and the relative error (i.e., |ground-truth value − predicted value| / ground-truth value) of the average speed and the passing time at the intersection entrance lanes over a single time step produced by the MFDL. The x-axis represents the time of day (from 6:00 to 22:00), and the y-axis represents the 92 entrance lanes of the intersections. In Figure 16, the deeper the red color, the larger the error, and the deeper the blue color, the lower the error.
From Figure 16a, it can be seen that most of the passing-time absolute errors are below 10 s. A minority of entrance lanes have slightly higher absolute errors at specific times. Note that although the absolute error of some entrances is significant, the relative error is small (see Figure 16b). For example, the absolute error of entrance 26 (i.e., the northbound left-turn entrance lane of intersection No. 2) is significant during 7:50 to 8:40, 12:10 to 12:30 and 19:10 to 19:20, etc. However, the relative error of entrance 26 is inconspicuous. This is probably because the average passing time of entrance 26 (i.e., 141 s) is longer than the average passing time of intersection 2 (i.e., 69.4 s), so more interference factors contribute to the absolute error.
From Figure 16c,d, it can be intuitively seen that the pattern of the relative error of speed differs significantly from that of the absolute error. Even though the absolute speed errors of some entrance lanes are significant, their relative errors are minor. Since the average speed is small (i.e., 7.78 m/s), the relative error of speed is more sensitive to the ground-truth speed, which may lead to a larger relative error. Taking entrance lane 18 (i.e., the southbound right-turn of intersection 1) as an example, both the absolute error and the relative error are significant, indicating that the data fluctuation is volatile and affects prediction accuracy. Overall, the MFDL model can effectively capture the Spatio-temporal characteristics of the passing time and traffic speed at the intersections and make accurate predictions. The error visualization expresses the accuracy of the prediction results clearly.
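The two error surfaces in Figure 16 follow directly from these definitions. A minimal sketch, using hypothetical lane-by-time values (the real study uses 92 entrance lanes from 6:00 to 22:00), illustrates why a lane with a long average passing time can show a large absolute error but a small relative error:

```python
import numpy as np

# Hypothetical (lane x time-slice) ground truth and predictions, seconds.
# Row 0 mimics an ordinary lane, row 1 a lane like entrance 26 whose
# average passing time is long (~141 s).
true_vals = np.array([[69.4, 70.2], [141.0, 138.0]])
pred_vals = np.array([[63.4, 77.2], [131.0, 148.0]])

abs_err = np.abs(true_vals - pred_vals)   # Figure 16a-style heat map
rel_err = abs_err / true_vals             # Figure 16b-style heat map

print(abs_err)  # the long-passing-time row has the larger absolute errors
print(rel_err)  # yet its relative errors are the smaller ones
```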

Sensitivity Analysis
For the MFDL model, the temporal features of the inputs are likely to be associated with the accuracy and stability of the prediction results, and we select the RMSE as the evaluation index. Figure 17 shows the RMSEs of the passing time and speed. The red curve traces the fluctuation of the RMSE median, and the blue box pattern represents the error distribution of the 92 entrance lanes in different periods. It can be seen that the RMSEs of the MFDL model fluctuate only slightly throughout the day, indicating that the model is robust. Moreover, the RMSE of the passing time rises slightly during the peak hours (i.e., 7:00-8:00, 17:00-18:00), with the median increasing by only 2.4 relative to the whole day. Meanwhile, the fluctuation of the speed-prediction RMSE is also relatively low. In conclusion, the proposed model offers high accuracy and fine stability under various temporal features. Furthermore, the spatial features of the input may also be associated with prediction accuracy. Since it is not meaningful to build a GCN model for a single-direction prediction experiment, we carry out this experiment with the MFDL-No-Graph model. Figure 18 shows the evaluation metric (WMAPE) for different time slices in different directions. It can be seen that the WMAPE differs across directions: the floating car data fluctuate at various spatial positions, which influences the prediction results. Notably, the WMAPE of the right-turn direction is significant because the right-turn data contain more interference factors (i.e., non-motor vehicles and pedestrians), leading to more fluctuation. Moreover, the WMAPE increases with the time granularity, which implies that more samples yield higher prediction accuracy. Table 4 has already shown that the weather has an impact on prediction accuracy.
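The per-direction comparison behind Figure 18 amounts to grouping lane-level errors by approach direction and computing a WMAPE per group. The following sketch uses hypothetical (direction, ground truth, prediction) records, with right-turn predictions made deliberately noisier to mimic interference from non-motor vehicles and pedestrians; it is not the study's pipeline:

```python
# Hypothetical per-lane records: (direction, ground truth, prediction)
records = [
    ("through", 60.0, 62.0), ("through", 80.0, 77.0),
    ("left",    70.0, 74.0), ("left",    90.0, 86.0),
    ("right",   50.0, 58.0), ("right",   65.0, 55.0),
]

# Accumulate per-direction sums of absolute error and ground truth
totals = {}
for direction, y_true, y_pred in records:
    err_sum, true_sum = totals.get(direction, (0.0, 0.0))
    totals[direction] = (err_sum + abs(y_true - y_pred), true_sum + y_true)

wmape_by_dir = {d: e / t for d, (e, t) in totals.items()}
print(wmape_by_dir)  # the right-turn WMAPE comes out largest
```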
To further analyze the effect of weather factors on prediction accuracy, we tested four ablated models: without temperature (i.e., the No-temperature model), without pressure (i.e., the No-pressure model), without precipitation (i.e., the No-precipitation model) and without wind speed (i.e., the No-wind-speed model) (see Figure 19). The figure intuitively shows that precipitation has a more significant impact on the prediction than the other weather factors at every time granularity. This is easy to understand, because precipitation affects travel speed and thereby varies the traffic operation state, which verifies previous research [43].

Discussion
In this study, we constructed a multi-task fusion deep learning framework for intersection traffic operation performance prediction. The passing time and the speed are selected to be prediction targets, which can reflect the process state and the visual state of the intersection operation performance.
In the data-collecting stage, floating car data are used as the data source to verify the availability of the prediction model; they can reflect the traffic performance of large-scale intersection areas. Floating car data can accurately describe the traffic state upstream and downstream of the intersection, which compensates for the limited coverage of fixed-sensor data [16].
In the parameter-extraction stage, we adopted a novel grid model that can rapidly identify intersections and extract traffic parameters from the floating car data without a digital map. On the one hand, the grid model replaces the complex map-matching algorithm and improves efficiency. On the other hand, the intersection-affected area and the direction of the GFCD can be identified by the grid model combined with the fuzzy C-means (FCM) clustering method, which overcomes the limitation of a fixed intersection influence area [44]. This indicates that the grid model has significant universality and can be applied to other cities.
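Fuzzy C-means itself is standard; a compact implementation is sketched below on hypothetical two-dimensional GPS-like point clouds around two approaches (the `fcm` helper, cluster count and point distributions are illustrative, not the authors' code):

```python
import numpy as np

def fcm(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Minimal fuzzy C-means: returns (centers, membership matrix)."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(points), n_clusters))
    u /= u.sum(axis=1, keepdims=True)              # memberships sum to 1
    for _ in range(n_iter):
        um = u ** m                                # fuzzified memberships
        centers = (um.T @ points) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(points[:, None, :] - centers[None], axis=2)
        dist = np.fmax(dist, 1e-10)                # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # update memberships
    return centers, u

# Hypothetical GPS-like points around two approaches of an intersection
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0.0, 0.5, (30, 2)),
                 rng.normal(5.0, 0.5, (30, 2))])
centers, u = fcm(pts, n_clusters=2)
labels = u.argmax(axis=1)  # crisp direction assignment per point
```

Unlike hard k-means, each point carries a graded membership in every cluster, which suits GFCD points near the boundary between two approach directions.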
In the model-construction stage, we design the MFDL framework with four variable groups. Meanwhile, ResNet is incorporated into the MFDL model to enhance its depth and alleviate the problem of gradient disappearance [45]. The MFDL can capture the temporal, spatial and topological features of the passing time and speed, and the prediction results are promising. The two prediction targets are negatively correlated, and the interplay between these two targeted prediction variables can significantly improve prediction accuracy and efficiency, which is consistent with the studies of Kunpeng Zhang et al. [46]. The proposed method predicts the intersection operation performance in real time and can provide valuable insights for traffic managers to improve the intersection's operation efficiency.
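The core idea behind the multi-task head, hard parameter sharing with a joint loss, can be sketched in a few lines. This is a deliberately simplified stand-in (random weights, no training loop, hypothetical layer sizes), not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder plus two task-specific heads (hypothetical sizes):
# a fused Spatio-temporal feature vector feeds one hidden layer whose
# output is read by both the passing-time and the speed head.
W_shared = rng.normal(size=(8, 16))
W_time = rng.normal(size=(16, 1))
W_speed = rng.normal(size=(16, 1))

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)   # shared representation (ReLU)
    return h @ W_time, h @ W_speed      # two task outputs

x = rng.normal(size=(4, 8))             # batch of 4 fused feature vectors
t_hat, v_hat = forward(x)

# Joint loss: gradients from both tasks flow through W_shared, which is
# how one target can inform the other during training.
t_true = rng.normal(size=(4, 1))
v_true = rng.normal(size=(4, 1))
loss = np.mean((t_hat - t_true) ** 2) + np.mean((v_hat - v_true) ** 2)
```

Training both heads against one shared encoder is also why a single MFDL run is cheaper than two separate STL runs, as reported in Table 5.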
There is more work ahead in the future development of this study. First, although accurate speed and time can be extracted from a single data source (i.e., FCD), it is difficult to estimate the actual traffic flow because the permeability cannot be obtained [47]. To comprehensively detect the operation performance of signalized intersections, it is necessary to import multi-source data, such as induction loop data [48] and microwave data [49], to extract traffic flow information. Second, for the passing-time pattern, we consider the real-time and historical patterns; in future work, we will consider more passing-time patterns to improve the accuracy. Lastly, the existing amount of data is sufficient to support the construction and validation of the model; naturally, if floating car data covering a larger area can be obtained in the future, applying the proposed model to predict intersection traffic performance and validating it there will better demonstrate its universality.

Conclusions
In this paper, a multi-task fusion deep learning framework is proposed for intersection traffic operation performance prediction. The passing time and the speed are selected to be prediction targets, which can reflect the intersection operation performance. The main conclusions of this paper are summarized as follows.
The MFDL model enables us to capture the Spatio-temporal and topological features of the traffic state efficiently. Comparisons with the benchmark models show that the fusion deep learning model achieves the best prediction accuracy and robustness among the baselines at different time granularities. In the comparison of STL and MFDL, when the time granularity is 10 min and the epoch is 50, the training time is reduced by 8.3 min and the efficiency increases by 46.42%, which means that the MFDL is more efficient. In the analysis of weather factors, precipitation has a more significant impact on the prediction than the other weather factors at different time granularities.
Future work will concentrate on exploring novel deep learning structures based on the fusion method. For the influencing factors of the intersection operation state, we will consider multi-source input variables, including the timing scheme, traffic flow, waiting time, etc., to improve the prediction accuracy.