4.1. Causal Weight Estimation for Congestion Propagation Based on Granger Causality Theory
Since external interventions cannot be imposed on this system to validate causal relationships, causal inference methods based on observational data must be employed to discern propagation mechanisms. In this context, Nobel laureate Clive Granger proposed the Granger causality test in 1969 [39], which determines whether the historical information of one time series can significantly improve the forecasting accuracy of another.
. The Granger causality test is a causal inference method relying solely on observed time-series data; it does not require external interventions. Instead, it constructs predictive models from available historical observations and compares the reduction in forecast error variance between nested models to determine causality. The mathematical formulation of the Granger causality test is given by:
$$y_{t} = a_{0} + \sum_{p=1}^{P} A_{p}\, y_{t-p} + \varepsilon_{t} \quad (13)$$

$$y_{t} = a_{0} + \sum_{p=1}^{P} A_{p}\, y_{t-p} + \sum_{p=1}^{P} B_{p}\, x_{t-p} + \varepsilon'_{t} \quad (14)$$

where $y_{t}$ is the dependent variable at moment $t$ and $x_{t}$ is the independent variable at moment $t$, each of dimension equal to the total number of segments $J$; $a_{0}$ is a constant term; $P$ is the number of lags; $A_{p}$ and $B_{p}$ are matrices of dimension $J \times J$ indicating the effect of the $p$-th lag; $\varepsilon_{t}$ is the model prediction error; and $\sigma^{2}(\varepsilon_{t})$ is the model prediction variance. If incorporating the past values of $x$ significantly reduces the prediction variance, i.e., $\sigma^{2}(\varepsilon'_{t}) < \sigma^{2}(\varepsilon_{t})$, then $x$ is said to Granger-cause $y$, which indicates that the past values of $x$ are helpful in predicting the current and future states of the time series $y$.
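For illustration, the linear test can be run directly on two observed series with standard tooling. The following minimal Python sketch uses statsmodels; the synthetic series and the lag order are assumptions for demonstration, not the data used in this study:

```python
# Minimal sketch of the linear Granger causality test; the synthetic series
# and lag order are illustrative assumptions, not this study's data.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)            # congestion series of a candidate cause
y = np.zeros(T)                   # congestion series of the target segment
for t in range(2, T):
    # y depends on its own past and on x lagged by 2,
    # so x should Granger-cause y
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 2] + 0.1 * rng.normal()

# Column order is [effect, cause]: the test checks whether the second
# column's lags improve prediction of the first, comparing restricted and
# unrestricted models as in Equations (13) and (14).
results = grangercausalitytests(np.column_stack([y, x]), maxlag=3)
f_stat, p_value, _, _ = results[2][0]["ssr_ftest"]   # F-test at lag 2
```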
Granger causality traditionally relies on the assumption of a linear vector-autoregressive (VAR) model. However, in practical scenarios—where factors such as traffic flow and airspace state within the air route network often interact in a highly nonlinear fashion—the linearity assumption underlying conventional Granger-causality tests is unable to capture these complex coupling effects. Accordingly, in nonlinear Granger-causality methods, Equations (13) and (14) are modified as follows:
$$y_{t} = f\!\left(y_{t-1}, \ldots, y_{t-P}\right) + \varepsilon_{t}$$

$$y_{t} = g\!\left(y_{t-1}, \ldots, y_{t-P},\, x_{t-1}, \ldots, x_{t-P}\right) + \varepsilon'_{t}$$

where $f(\cdot)$ and $g(\cdot)$ are two nonlinear functions used to describe the complex nonlinear relationship between the dependent and independent variables, thereby revealing their intrinsic causal relationships. To fit these nonlinear functions, existing studies typically train neural network models. The network parameters are adjusted to minimize the loss function such that the predicted values approach the actual values as closely as possible, thereby embodying the core principle of "using the past to predict the future" in Granger causality theory.
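The restricted/unrestricted comparison carries over directly to the neural setting: one network is trained on the target's own lags only, another on the target's lags plus the candidate cause's lags, and their prediction errors are compared. The PyTorch sketch below illustrates this idea; the lag order, network size, and synthetic data are assumptions, not the authors' exact setup:

```python
# Sketch of nonlinear Granger causality: fit g(.) with and without the
# candidate cause's lags and compare prediction errors. All sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 5  # assumed lag order

def make_lags(s, P):
    """Stack lags 1..P of a 1-D series into a (T-P, P) design matrix."""
    return torch.stack([s[P - p: len(s) - p] for p in range(1, P + 1)], dim=1)

def fit_mse(inputs, target, epochs=300):
    """Fit a small MLP (a stand-in for f or g) and return its final MSE."""
    model = nn.Sequential(nn.Linear(inputs.shape[1], 32), nn.ReLU(),
                          nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), target)
        loss.backward()
        opt.step()
    return loss.item()

# Synthetic data in which x nonlinearly drives y with a lag of 2
T = 400
x = torch.randn(T)
y = torch.zeros(T)
for t in range(2, T):
    y[t] = 0.8 * torch.tanh(x[t - 2]) + 0.1 * torch.randn(()).item()

target = y[P:].unsqueeze(1)
mse_r = fit_mse(make_lags(y, P), target)                # f: own lags only
mse_u = fit_mse(torch.cat([make_lags(y, P),
                           make_lags(x, P)], 1), target)  # g: adds x's lags
# x is inferred to Granger-cause y if its lags reduce the prediction error
print(f"restricted MSE {mse_r:.4f}, unrestricted MSE {mse_u:.4f}")
```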
Based on nonlinear Granger causality, causal relationships are represented by the nonlinear functions $f(\cdot)$ and $g(\cdot)$. The key challenge is how to extract congestion-propagation causal weights among segments from these nonlinear relationships fitted to time series data by a neural network. In deep-learning models, each segment's input features, namely its congestion state assessment values, affect the predicted output for the target segment. During training, the model gradually adjusts the weights through backpropagation and loss-function optimization to quantify the relative contributions of these input features to the prediction. Thus, by analyzing the neural network's weights, the specific congestion-propagation causal weights can be estimated [40]. However, since different models vary in structure and underlying principles, the particular methods for estimating congestion-propagation causal weights also differ. Therefore, after introducing the employed models in the following section, this paper details the specific calculation steps.
4.2. Congestion-Propagation Causal Weight Estimation Based on the Multi-Channel Attention DSNG-BiLSTM Model
Among deep-learning models, LSTM addresses the vanishing gradient problem of traditional recurrent networks through gating mechanisms, enabling effective modeling of long-term dependencies in time series data. We propose the Multi-Channel Attention DSNG-BiLSTM model based on BiLSTM [41], which utilizes both past and future information to more comprehensively learn the causal relationships in the time series, thereby enhancing the accuracy of causal inference.
- (1)
BiLSTM:
BiLSTM is a neural network structure consisting of two independent LSTMs that process the sequence in opposite directions, where the update formulas for a single LSTM are:

$$f_{t} = \sigma\!\left(W_{f}\left[h_{t-1}, x_{t}\right] + b_{f}\right)$$
$$i_{t} = \sigma\!\left(W_{i}\left[h_{t-1}, x_{t}\right] + b_{i}\right)$$
$$o_{t} = \sigma\!\left(W_{o}\left[h_{t-1}, x_{t}\right] + b_{o}\right)$$
$$\tilde{c}_{t} = \tanh\!\left(W_{c}\left[h_{t-1}, x_{t}\right] + b_{c}\right)$$
$$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t}$$
$$h_{t} = o_{t} \odot \tanh\!\left(c_{t}\right)$$

where $f_{t}$, $i_{t}$, and $o_{t}$ are the forget gate, input gate, and output gate, respectively; $c_{t}$ is the cell state, and $h_{t}$ is the model output. The outputs of the two LSTMs are concatenated to form the final output of BiLSTM, as shown in Figure 4.
The model also contains several adjustable parameters, such as the learning rate $\eta$, the minimum learning rate $\eta_{\min}$, the maximum number of epochs $E_{\max}$, and the learning rate decay factor $\gamma$, which help to further optimize the model training process and performance.
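As a point of reference, a minimal PyTorch BiLSTM of the kind described above can be written as follows; the hidden size, sequence length, and feature dimension are placeholder assumptions:

```python
# Minimal BiLSTM sketch: two opposite-direction LSTMs whose hidden states
# are concatenated (all hyperparameter values are placeholders).
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        # bidirectional=True runs a forward and a backward LSTM in parallel
        self.bilstm = nn.LSTM(n_features, hidden, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)   # 2*hidden: concatenated outputs

    def forward(self, x):                    # x: (batch, time, n_features)
        out, _ = self.bilstm(x)              # out: (batch, time, 2*hidden)
        return self.fc(out[:, -1, :])        # predict from the last step

model = BiLSTMRegressor(n_features=8)
pred = model(torch.randn(4, 24, 8))          # e.g., 24 time steps, 8 segments
```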
- (2)
Decoupling Causal Architecture:
In an air route network, segments are interconnected in a highly complex manner. In traditional neural network architectures, the interconnections among hidden layers cause information from each input sequence to intermingle, making it difficult to accurately extract the specific causal effects and quantitative influences between every pair of segments. Therefore, this study adopts a Decoupling Causal Architecture, which decouples the relationships between each output and its corresponding input, thereby preserving the separation of the congestion time series for each segment. This approach enables a more effective capture of causal relationships governing congestion propagation throughout the entire air route network.
The Decoupling Causal Architecture primarily has two goals: decoupling sequences and adjusting output time lags. To address these objectives, two units are designed: the causal separation unit and the time-lag calibration unit. Specifically, the causal separation unit constructs an independent neural network model for each segment in the air route network, treating it as an independent channel $c_j$. Each channel is configured with $k$ input units, corresponding to the congestion time series of the segments adjacent to the target segment $j$, with the output being the congestion time series of the target segment $j$. By having each input unit accept only the congestion time series of a single segment, the separation of the congestion time series of the $k$ segments is achieved, enabling the investigation of the isolated impact of each adjacent segment's congestion state on the target segment's congestion state.
To capture the inherent time-lag effects in congestion propagation, we include a time-lag calibration unit. This mechanism automatically selects the appropriate lagged prediction value from three preset time-lag levels to cover the critical propagation periods of segment congestion. Specifically, for each channel, when the time window corresponding to the congestion time series in a given input unit is $[t-w+1, t]$, the output value is automatically set to the prediction corresponding to time $t+\tau$, where the lag $\tau$ is taken from the three preset levels, thereby facilitating the extraction of causal relationships under different time delays.
Through the above design, the Decoupling Causal Architecture not only effectively decouples the timing data of each segment and reduces interference between inputs, but also utilizes a time-lag adjustment mechanism, which enables the model to capture the intrinsic pattern of congestion propagation across multiple time lags.
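A schematic implementation of the causal separation unit might look as follows; the module names, the use of one small LSTM per input unit, and the dimensions are assumptions for illustration, and the time-lag calibration is handled by shifting the training target, as noted in the comments:

```python
# Sketch of the causal-separation idea: one channel per target segment j,
# with k input units that each see only a single adjacent segment's series,
# so each neighbor's influence stays separable. Names are assumptions.
import torch
import torch.nn as nn

class DecoupledChannel(nn.Module):
    """One channel: predicts target segment j from k adjacent segments."""
    def __init__(self, k_neighbors, hidden=32):
        super().__init__()
        # one independent input unit (here a small LSTM) per adjacent segment
        self.input_units = nn.ModuleList(
            nn.LSTM(1, hidden, batch_first=True) for _ in range(k_neighbors))
        self.head = nn.Linear(k_neighbors * hidden, 1)

    def forward(self, x):                    # x: (batch, time, k_neighbors)
        feats = []
        for i, unit in enumerate(self.input_units):
            out, _ = unit(x[:, :, i:i + 1])  # unit i sees only neighbor i
            feats.append(out[:, -1, :])      # keep its last hidden state
        return self.head(torch.cat(feats, dim=1))

# Time-lag calibration: for a preset lag tau, the training target is the
# target segment's congestion value at time t + tau (target shifting).
channel_j = DecoupledChannel(k_neighbors=4)
y_hat = channel_j(torch.randn(8, 24, 4))     # prediction at the chosen lag
```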
- (3)
Sparse Regularization:
Because the high-dimensional time series data suffer from feature redundancy and poor generalization, we apply Group Lasso and L2 regularization to filter out irrelevant features. Group Lasso and L2 regularization are regularization methods well-suited for high-dimensional data with group structure features. Group Lasso is applied to the input weights of the first LSTM layer, which encourages the selection or discarding of entire feature groups, thereby achieving sparsity; meanwhile, L2 regularization is applied to the weights between the linear layer and the hidden layers of the LSTM to suppress overly large weights and avoid overfitting. This approach automatically identifies and eliminates segment features that contribute little or are irrelevant to congestion propagation, thus enhancing model interpretability and feature selection capability. The specific formula is as follows:
$$\Omega_{\mathrm{GL}} = \lambda_{1} \sum_{k=1}^{K} \left\| W^{(1)}_{:,k} \right\|_{2} \quad (23)$$

$$\Omega_{\mathrm{L2}} = \lambda_{2} \left( \left\| W_{\mathrm{lin}} \right\|_{2}^{2} + \left\| W_{hh} \right\|_{2}^{2} \right) \quad (24)$$

Equations (23) and (24) represent the Group Lasso and L2 regularization terms, respectively. Here, $\lambda_{1}$ and $\lambda_{2}$ denote the regularization strength parameters for Group Lasso and L2 regularization, respectively; $W^{(1)}_{:,k}$ is the $k$-th column vector of the input weight matrix $W^{(1)}$, which is formed by vertically concatenating $W_{f}$, $W_{i}$, $W_{o}$, and $W_{c}$; $\left\| \cdot \right\|_{2}$ denotes the $\ell_{2}$ norm of the column vector; $W_{\mathrm{lin}}$ is the weight matrix of the linear layer, and $W_{hh}$ is the weight matrix between the hidden layers of the LSTM.
After training is completed, we examine the input weight matrix and focus on each column corresponding to an input channel (i.e., an adjacent segment). When a column’s weights are driven to zero by sparse regularization, it indicates that the corresponding adjacent segment has been automatically excluded and exerts virtually no influence on the target segment’s congestion propagation. By regularizing the input weights in this way, the model can automatically identify those adjacent segments that are most critical to the target segment’s congestion state while suppressing redundant or irrelevant neighbors, thereby improving both interpretability and robustness.
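The two penalties can be computed as in the following illustrative PyTorch sketch, where weight_ih_l0 plays the role of the input weight matrix $W^{(1)}$ (PyTorch likewise stacks the four gate weight blocks vertically), and the regularization strengths are placeholder values:

```python
# Illustrative computation of the penalties in Eqs. (23)-(24): column-wise
# Group Lasso on the first-layer input weights, L2 on the other weights.
import torch

def group_lasso_penalty(W_in, lam1):
    """Eq. (23): sum of l2 norms over columns, so each input segment's
    weights form one group that is kept or zeroed out together."""
    return lam1 * W_in.norm(dim=0).sum()

def l2_penalty(mats, lam2):
    """Eq. (24): squared-norm penalty on the remaining weight matrices."""
    return lam2 * sum(m.pow(2).sum() for m in mats)

# weight_ih_l0 stacks the four gate weight blocks vertically, matching the
# structure of W^(1) described above; lam1/lam2 are placeholder values
lstm = torch.nn.LSTM(input_size=6, hidden_size=32)
linear = torch.nn.Linear(32, 1)
penalty = (group_lasso_penalty(lstm.weight_ih_l0, lam1=1e-3)
           + l2_penalty([linear.weight, lstm.weight_hh_l0], lam2=1e-4))
```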
- (4)
Multi-Channel Attention Mechanism:
In practical operations, congestion in segments often exhibits temporal dynamics and multi-scale interactions. Variations in traffic flow during different periods, unexpected events, and mutual influences among segments collectively produce complex effects on the overall congestion state. To address this, this study introduces a self-attention mechanism [42] and proposes a multi-channel attention mechanism. Specifically, an independent attention mechanism is assigned to each channel $c_j$ to process the congestion time series data of different segments. This approach allows each channel to adaptively adjust the attention weights for each time step in the historical data, thereby enhancing the capture of temporal dynamic features. This strategy not only more precisely captures the temporal dependencies within individual segments but also comprehensively accounts for the interactions among different segments to determine which inputs are most important for predicting the target output. Consequently, the congestion-propagation causal matrix generated based on this multi-channel attention mechanism more accurately reflects the causal relationships of congestion propagation among segments. The specific procedure of the sub-attention mechanism in each channel is described as follows:
Step 1: For the output sequence $H$ of the channel's BiLSTM, three linear layers are applied to map it to the query matrix $Q$, the key matrix $K$, and the value matrix $V$, respectively. The calculation formulas are as follows:

$$Q = H W_{Q}, \qquad K = H W_{K}, \qquad V = H W_{V}$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable projection matrices.
Step 2: The dot product between the query matrix $Q$ and the key matrix $K$ is computed to measure similarity and then scaled by $\sqrt{d_k}$, resulting in a score matrix $S = QK^{\top}/\sqrt{d_k}$ of shape $T \times T$ that reflects the spatiotemporal dependency scores.
Step 3: Softmax normalization is applied to each row of the score matrix.
Step 4: A weighted aggregation obtains the contextual representation by computing the weighted sum of $V$ using the attention weights. Subsequently, a global contextual representation is obtained by integrating the temporal information through global weighted average pooling:

$$C = \left[ c_{1}, c_{2}, \ldots, c_{K} \right]$$

where $C$ is the matrix formed by concatenating the contextual vectors $c_{j}$ obtained for each channel after the self-attention aggregation.
Step 5: Finally, the output is mapped through a fully connected layer to the target prediction space:

$$\hat{y} = W_{\mathrm{fc}}\, C + b_{\mathrm{fc}}$$

where $\hat{y}$ denotes the predicted congestion state of the target segment at the next specific time step.
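Steps 1 to 5 for a single channel can be sketched as follows; the model dimension and the use of plain averaging as the global pooling step are assumptions for illustration:

```python
# Sketch of the per-channel self-attention (Steps 1-5); the dimension and
# the plain-average pooling are illustrative assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_model)   # Step 1: Q, K, V projections
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)

    def forward(self, H):                       # H: (batch, T, d_model)
        Q, K, V = self.Wq(H), self.Wk(H), self.Wv(H)
        d_k = Q.size(-1)
        S = Q @ K.transpose(-2, -1) / d_k ** 0.5  # Step 2: scaled TxT scores
        A = torch.softmax(S, dim=-1)              # Step 3: row-wise softmax
        Z = A @ V                                 # Step 4: weighted sum
        return Z.mean(dim=1)                      # Step 4: global pooling

# one independent attention per channel; the resulting context vectors are
# concatenated into C and passed to the fully connected layer of Step 5
attn = ChannelAttention(d_model=64)
c_j = attn(torch.randn(8, 24, 64))                # context vector per channel
```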
- (5)
GISTA Optimization Algorithm:
GISTA is designed to optimize objective functions that include nonsmooth regularization terms. It combines gradient descent with a soft-thresholding operation to enhance both the interpretability and generalization capability of the model. Compared with the traditional ISTA algorithm [43], GISTA introduces a line search to update the learning rate, ensuring that each update reduces the objective function by more than a predetermined tolerance. This mechanism improves both the convergence speed and stability of the algorithm. Based on Equations (23) and (24), the loss function of the algorithm can be derived as follows:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2} + \lambda_{1} \sum_{k=1}^{K} \left\| W^{(1)}_{:,k} \right\|_{2} + \lambda_{2} \left( \left\| W_{\mathrm{lin}} \right\|_{2}^{2} + \left\| W_{hh} \right\|_{2}^{2} \right) \quad (33)$$

where the first term is the MSE, which measures the gap between the predicted and actual values. The first term and the third term constitute the smooth part, which is differentiable and denoted as $g(\theta)$. The second term is the non-differentiable sparse regularization term, denoted as $h(\theta)$. The specific steps of the algorithm are described below:
Step 1: Initialize the parameters.
Step 2: Calculate the current total loss according to Equation (33).
Step 3: Backpropagate to obtain the loss gradient for all trainable parameters.
Step 4: Perform a gradient-descent update on the parameters at the current learning rate, and apply a column-by-column soft-thresholding operation to the input weight matrix of the first layer of the network in order to handle the non-differentiable term $h(\theta)$.
Step 5: Calculate the new loss using the updated temporary parameters, and compute the tolerance term between the old and new losses. If the loss decrease is smaller than the tolerance term, reduce the learning rate and repeat Step 5 until the condition is satisfied or the learning rate falls below $\eta_{\min}$.
Step 6: Update the parameters and record the new loss, which is used to set the initial learning rate for the next epoch.
Step 7: Judge whether the maximum number of epochs $E_{\max}$ has been reached or the learning rate has fallen below $\eta_{\min}$; if so, stop training and return the current parameters and loss; otherwise, return to Step 3.
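A minimal sketch of the core of Steps 4 and 5, namely the gradient step, the column-wise soft-thresholding (the proximal operator of the Group Lasso term $h(\theta)$), and the learning-rate line search, is given below; the specific tolerance rule and the toy data are assumptions:

```python
# Sketch of one GISTA update on the first-layer input weight matrix.
import torch

def soft_threshold_columns(W, thresh):
    """Proximal operator of the column-wise Group Lasso: shrink each
    column of the input weight matrix toward zero by its l2 norm."""
    norms = W.norm(dim=0, keepdim=True).clamp_min(1e-12)
    return W * (1.0 - thresh / norms).clamp_min(0.0)

def gista_update(W, grad, smooth_loss, lr, lam1, lr_min=1e-6, decay=0.5):
    """Step 4: gradient descent plus soft-thresholding; Step 5: shrink the
    learning rate until the objective drops by more than a tolerance."""
    old_loss = smooth_loss(W) + lam1 * W.norm(dim=0).sum()
    while lr >= lr_min:
        W_new = soft_threshold_columns(W - lr * grad, lr * lam1)
        new_loss = smooth_loss(W_new) + lam1 * W_new.norm(dim=0).sum()
        tol = 1e-4 * (W_new - W).pow(2).sum()   # assumed tolerance rule
        if old_loss - new_loss > tol:           # sufficient decrease: accept
            return W_new, lr
        lr *= decay                             # otherwise shrink lr, retry
    return W, lr_min

# toy usage: quadratic smooth loss pulling W toward 1
W = torch.randn(128, 6)
smooth = lambda M: (M - 1.0).pow(2).mean()
grad = 2 * (W - 1.0) / W.numel()
W_new, lr_used = gista_update(W, grad, smooth, lr=0.5, lam1=0.01)
```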
Based on the design proposed above, this paper introduces the Multi-Channel Attention DSNG-BiLSTM model. It employs a Decoupling Causal Architecture to keep the influences of different input channels decoupled, and it incorporates sparse regularization techniques that enable the model to automatically select the input features that contribute significantly to the target output while assigning lower weights to those with lesser contributions. Subsequently, the congestion-propagation causal weight is estimated from the input weight matrix $W^{(1)}$ using the following procedure:
Step 1: From the channel $c_j$, select all adjacent segments that have been identified as having a causal influence on the target segment $j$.
Step 2: For each selected segment $i$, extract the corresponding column vector $W^{(1)}_{:,i}$ from $W^{(1)}$.
Step 3: Compute the $\ell_{2}$ norm of each column vector $W^{(1)}_{:,i}$; the computed value represents the estimated congestion-propagation causal weight of segment $i$ on the target segment $j$.
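These three steps amount to reading off column norms from the trained input weight matrix, as in the following sketch; the pruning threshold and the untrained placeholder model are assumptions for illustration:

```python
# Sketch of Steps 1-3: causal weights as column norms of the trained
# input weight matrix (threshold eps is an illustrative assumption).
import torch

def causal_weights(W_in, eps=1e-3):
    """Estimate causal weights for channel j from its input weight matrix;
    column i holds the stacked gate weights attached to adjacent segment i."""
    norms = W_in.norm(dim=0)                      # Step 3: per-column l2 norm
    kept = torch.nonzero(norms > eps).flatten()   # Step 1: surviving segments
    return {int(i): float(norms[i]) for i in kept}

# placeholder standing in for a trained channel; in practice the weights
# come from the sparsely regularized first LSTM layer after training
lstm = torch.nn.LSTM(input_size=6, hidden_size=32)
print(causal_weights(lstm.weight_ih_l0.detach()))
```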
Based on this process, the corresponding congestion-propagation causal weights can be determined. The specific network architecture is shown in Figure 5.