1. Introduction
Structural health monitoring (SHM) systems are vital for preventing bridge operational risks, improving maintenance, and enhancing emergency response [1,2]. China aims for comprehensive SHM coverage on special bridges (e.g., those spanning rivers, seas, and gorges) and full implementation across highway bridges [3], and is progressively establishing a comprehensive SHM system for road and bridge networks [4,5]. Compared to single-bridge monitoring, network-level SHM provides a new perspective for structural assessment and maintenance. Because bridge groups within a region are subject to similar environmental conditions (temperature, humidity, rainfall, icing, etc.) and loads (vehicles, wind, earthquakes, etc.), this geographical correlation can alleviate signal interference and incomplete data, while also enabling cross-bridge comparative analysis. Analyzing bridge networks presents a major structural engineering challenge, as these networks are spatially correlated systems vastly larger and more complex than any single structure [6].
Traffic data are crucial for analyzing bridge and bridge network performance, assessing reliability, managing maintenance, and optimizing operation. Accurate traffic flow prediction plays a significant role in estimating vehicle loads on bridges, drawing on statistical information such as vehicle type, wheelbase, axle load, and wheel spacing. Incorporating traffic conditions into damage detection algorithms can significantly improve diagnostic accuracy [7]. Understanding bridge network correlations is important for road network vulnerability [8], as reduced bridge capacity or closures impact transport network operability and resilience [9]. Traffic flow distribution is crucial for bridge network fragility analysis [10], and traffic demand significantly impacts optimal maintenance scheduling for bridge networks [11]. Traffic load research contributes to bridge design and safety assessment [7]. Specifically, vehicle weight and traffic volume significantly affect the fatigue life of orthotropic steel bridges [12,13]. Bocchini and Frangopol used a simplified traffic flow model to evaluate bridge network life-cycle performance and reliability [14]. Fiorillo and Ghosn identified overweight traffic loads as major risk factors for highway bridge networks [15]. For large bridges, weigh-in-motion (WIM) systems monitor key traffic parameters (axle weight, speed, volume, etc.), providing vital data for traffic load simulation [16]. These data support critical applications, such as reliability-based vehicle weight limits [17] and improved bridge load rating analysis [18].
This study mainly focuses on traffic flow prediction (TFP) within bridge networks so as to understand the traffic load correlations across networks, supporting subsequent SHM research. TFP remains challenging due to numerous influencing factors, such as weather, accidents, and holidays [19]. Prediction techniques fall into parametric models (e.g., time series analysis [20] and Kalman filtering [21]) and nonparametric methods [22].
Parametric models capture temporal patterns and influencing factors in traffic flows but struggle with complex nonlinear relationships between variables. Consequently, nonparametric models, primarily machine learning (ML) and deep learning (DL) approaches, have gained significant attention. Their structures are not predefined but are instead learned from data. ML methods for TFP include the k-nearest neighbor method [23], support vector regression [24], and artificial neural networks [22]. DL approaches primarily use convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, and hybrid methods. CNNs specialize in capturing spatial features (e.g., city-wide traffic flow modeled over distinct periods [25] and regional crowd flow [26]), while RNNs specialize in modeling temporal dependencies. Wang et al. decomposed a time series into a trend term and a residual term and reconstructed them using an LSTM-RNN [27]. Transformers are also used for sequence modeling [28]. Hybrid/ensemble methods combine different structures to improve performance. Han et al. used adjacent point inputs with a CNN (spatial) and an LSTM (temporal) for highway TFP [29]. Du et al. fused multi-modality traffic data via a CNN-GRU-Attention module for TFP [30]. Jia and Yan converted taxicab trajectory data to images, processing them with an RCNN (temporal) and a CNN (spatial), supplemented with external data [31]. Ma et al. integrated a CNN, an LSTM, and an attention mechanism with meteorological data for short-term highway TFP [32]. Méndez et al. combined a CNN and a BiLSTM, along with auxiliary stations and meteorological data, for long-term TFP [33].
Building on these DL advancements, graph neural networks (GNNs) offer a powerful framework for TFP by effectively modeling unstructured data and complex relational patterns through graph structures, often integrated with other DL techniques. Yu et al. first applied GNNs to traffic prediction [34]. Li and Zhu integrated spatial and temporal elements, using a temporal graph to complement a spatial graph for TFP [35]. Variants, such as spatial and temporal transformers, leverage self-attention mechanisms for traffic speed prediction [36]. Guo et al. designed a graph convolutional network (GCN) to capture multi-scale dependencies (recent, daily, and weekly) in traffic patterns [37]. Jiang et al. used an extended GCN for traffic speed prediction [38]. Zhao et al. proposed a points-of-interest-based dynamic GCN for traffic forecasting [39]. Emerging hybrids include the integration of GNNs with large language models for TFP (Sun et al. [40]).
Other DL-based TFP research can be found in [41,42,43,44,45]. Existing traffic flow forecasting studies primarily serve intelligent transportation systems (ITSs) for traffic efficiency and resource allocation [19,40], using past traffic data of an area to predict its future traffic flow. In the field of bridge health monitoring, however, not all bridges in the road network are equipped with WIM systems; moreover, even bridges equipped with WIM suffer from incomplete or missing data.
The key innovations and scientific contributions of this study are summarized as follows.
Firstly, this study advances TFP specifically for the SHM of bridge networks. Unlike existing TFP research focusing on intelligent transportation systems for traffic efficiency, we develop a novel TFP pipeline that considers the spatio-temporal correlations of bridge networks, predicting traffic loads critical to structural assessment, reliability analysis, and maintenance planning.
Secondly, this study proposes a transformer-based traffic flow prediction model considering spatio-temporal correlations of bridge networks (ST-TransNet), integrating external factors (processed via fully connected networks) and multi-period traffic flows of input bridges (captured by self-attention encoders) to generate traffic flow predictions through a self-attention decoder.
Thirdly, this study introduces a fixed-value imputation approach to address the missing data issue in real-world WIM systems, preserving multi-period time synchronization to maintain sequence integrity. Imputed data are excluded from the calculation of evaluation metrics to ensure accuracy without distortion from interpolation artifacts. This data processing approach enhances the adaptability of the methodology to practical application scenarios.
Finally, the proposed method is validated through real-world WIM data from an operational eight-bridge network. ST-TransNet reduces the RMSE of TFP to 12.76 vehicles/10 min, achieving a relative error reduction of 22.8–40.5% over 7 different types of baseline models (SVR, CNN, BiLSTM, CNN&BiLSTM, ST-ResNet, transformer, and STGCN). Furthermore, ablation studies validate the necessity and effectiveness of each component (External Factor 1, External Factor 2, distant traffic flow, near traffic flow, recent traffic flow, output traffic flow, and the direct link between the encoder and output), ensuring the rationality of the framework design. This in-field validation confirms the practical applicability of the proposed ST-TransNet for reliable TFP of bridge networks.
In summary, this study proposes a spatio-temporal transformer-based traffic flow prediction model (ST-TransNet) for bridge networks, which can effectively predict the traffic flow of target bridges and support bridge health monitoring and network-level analysis. The rest of this paper is organized as follows.
Section 2 introduces the structure of the proposed model for TFP.
Section 3 describes the implementation details on a bridge network, including bridge locations, the dataset, and the training hyperparameter settings.
Section 4 contains the results and discussion, presenting the prediction results, validating the model effectiveness, and comparing it with the baseline models.
Section 5 contains the conclusions of the paper.
2. Methodology
2.1. Overall Architecture of ST-TransNet
This study proposes ST-TransNet, a model designed to capture the correlation among traffic flows within bridge networks. ST-TransNet considers the spatio-temporal correlations inherent in traffic flows and incorporates external factors into its analysis. The attention mechanism operates across various time steps, enabling it to learn temporal dependencies, while the interaction among different features within the same time step allows it to grasp spatial correlations. This dual functionality of the attention mechanism enables the depiction of spatio-temporal correlations within traffic flows across bridge networks.
The architecture of ST-TransNet, which is utilized for TFP in bridge networks, is depicted in Figure 1. As shown in Figure 1a, ST-TransNet comprises an input section and an output section. The input section consists of five segments: two model external components, and three depict the traffic flow on input bridges during various time periods. The outcomes of these five components are integrated through fusion, with distinct weights assigned to each component. The fused data, in conjunction with the masked output traffic flow, are then fed into the output module. The remainder of Section 2 offers a detailed introduction to the proposed ST-TransNet.
2.2. Timestamp Input and Feature Extraction by External Components
This section details the calculation of external components in ST-TransNet. External factors considered in this study include hour of day (External Factor 1) and day of week and holiday (External Factor 2), which form the first two input components in Figure 1a. They are processed via two distinct layers of fully connected neural networks.
Traffic flow is affected by a variety of external factors, including weather conditions (sunny, rainy, typhoon, etc.), temporal elements (rush hour, weekdays, holidays, etc.), and events (traffic accidents, natural disasters, ship collisions, etc.). In this study, we primarily focus on temporal factors, including hour of day, day of week, and holiday. Figure 2 illustrates the influence of temporal factors on traffic flow. Figure 2a shows the daily variation in traffic flow for three bridges over two consecutive days, exhibiting a distinct diurnal pattern: traffic flow fluctuates throughout the day, with peak hours between 12:00 and 18:00 and troughs between 00:00 and 06:00. Figure 2b reveals that traffic flow on weekends for these bridges is typically lower than on weekdays. Figure 2c indicates that traffic flow on holidays may differ from that observed on regular days.
A twenty-four-bit one-hot encoding scheme is employed to represent the hour of day information (External Factor 1), which encodes the specific hour. For instance, the sequence {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} indicates that the samples fall within the time range of 00:00 to 01:00.
An encoding scheme using nine bits of one-hot representation is utilized to encode the day of week and holiday information (External Factor 2). The first seven bits correspond to the day of the week, the eighth bit indicates whether it is a holiday, and the ninth bit signifies if the day is a compensatory day off. For instance, the sequence {0,1,0,0,0,0,0,1,0} indicates that it is Tuesday and a holiday, while {0,0,0,0,0,1,0,0,1} represents working overtime on a Saturday.
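The two encodings can be generated directly from a timestamp. The following Python sketch illustrates one possible implementation of the two encoding schemes described above; the holiday and compensatory-workday calendars (the sets `HOLIDAYS` and `COMPENSATORY_WORKDAYS` below) are hypothetical placeholders that would be supplied from the official holiday schedule of the region under study.

```python
from datetime import date, datetime

# Hypothetical calendars; in practice these come from the official holiday schedule.
HOLIDAYS = {date(2023, 10, 3)}                 # public holidays
COMPENSATORY_WORKDAYS = {date(2023, 10, 7)}    # weekend days worked as compensation

def encode_hour_of_day(ts: datetime) -> list[int]:
    """External Factor 1: 24-bit one-hot vector for the hour of day."""
    vec = [0] * 24
    vec[ts.hour] = 1
    return vec

def encode_day_and_holiday(ts: datetime) -> list[int]:
    """External Factor 2: 9-bit vector.
    Bits 0-6: one-hot day of week (Monday = bit 0).
    Bit 7: holiday flag. Bit 8: compensatory-workday flag.
    """
    vec = [0] * 9
    vec[ts.weekday()] = 1
    if ts.date() in HOLIDAYS:
        vec[7] = 1
    if ts.date() in COMPENSATORY_WORKDAYS:
        vec[8] = 1
    return vec

# Example: 00:30 on a Tuesday holiday, matching the paper's {0,1,0,0,0,0,0,1,0} example.
ts = datetime(2023, 10, 3, 0, 30)
print(encode_hour_of_day(ts)[:4])      # [1, 0, 0, 0] -> 00:00 to 01:00
print(encode_day_and_holiday(ts))      # [0, 1, 0, 0, 0, 0, 0, 1, 0]
```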
Two stacked fully connected layers, designated as FC1 and FC2, are employed to extract the features of the external factors, i.e., the hour of day and the day of week and holiday, respectively. Let $l_p$ denote the length of the external factor sequences, which is equal to the length of the output traffic flow, and let $d_1$ and $d_2$ represent the output dimensions of the two fully connected networks. The output matrices of FC1 and FC2 are $X_{\mathrm{E}1} \in \mathbb{R}^{l_p \times d_1}$ and $X_{\mathrm{E}2} \in \mathbb{R}^{l_p \times d_2}$, respectively.
2.3. Multi-Period Traffic Flow Input by Recent, Near, and Distant Components
2.3.1. Traffic Flow Input Matrices
Given that different time periods can influence traffic patterns in various ways, we segment the traffic flow into three distinct time periods for separate analysis, as depicted in the latter three networks of the input part in Figure 1a. Three temporal dependencies are modeled: recent (preceding-hour flow), near (adjacent-day homologous flow), and distant (multi-week historical flow).
We now introduce the selection of input traffic flows for these three networks. Let $l_p$ denote the length of the output, so the output traffic flow can be represented as $Y = \left[ y_{t}, y_{t+1}, \ldots, y_{t+l_p-1} \right]$, where $t$ is the first predicted time step and $y_\tau \in \mathbb{R}^{n_{\mathrm{out}}}$ collects the flows of the $n_{\mathrm{out}}$ target bridges at time $\tau$; the inputs are built analogously from $x_\tau \in \mathbb{R}^{n_{\mathrm{in}}}$, the flows of the $n_{\mathrm{in}}$ input bridges. Considering the cyclical nature of traffic patterns, we assign $n_w$ to represent the number of weeks and $n_d$ to represent the number of days that exert significant influence. Let $l_r$ represent the length of recent influential samples, while $T_d$ and $T_w$ denote the number of samples in one day and one week, respectively. Consequently, the input traffic flow matrices are structured as follows: $X_{\mathrm{dis}} = \left[ x_{t-n_w T_w}, \ldots, x_{t-n_w T_w+l_p-1}, \ldots, x_{t-T_w}, \ldots, x_{t-T_w+l_p-1} \right]$ for the distant network (Encoder 1), $X_{\mathrm{near}} = \left[ x_{t-n_d T_d}, \ldots, x_{t-n_d T_d+l_p-1}, \ldots, x_{t-T_d}, \ldots, x_{t-T_d+l_p-1} \right]$ for the near network (Encoder 2), and $X_{\mathrm{rec}} = \left[ x_{t-l_r}, \ldots, x_{t-1} \right]$ for the recent network (Encoder 3). Thus, the length of the input matrix is $n_w l_p$ for the distant network, $n_d l_p$ for the near network, and $l_r$ for the recent network.
Since the attention mechanism itself cannot capture temporal order, positional embedding is required to incorporate position information into the input sequences. In this study, we add an incremental one-dimensional vector to the traffic flow matrices. For example, given an input traffic flow matrix $X \in \mathbb{R}^{l \times n_{\mathrm{in}}}$, where $l$ is the length of the input traffic flow (which may vary among the three networks) and $n_{\mathrm{in}}$ represents the number of input bridges, the added vector can be defined as $p = \left( 1, 2, \ldots, l \right)^{\mathsf{T}}$, i.e., $p_i = i$ for $i = 1, \ldots, l$. Consequently, the input matrix is modified accordingly, with $p$ added to each bridge column of $X$.
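Under the notation above, the three input windows and the additive positional index can be assembled by simple slicing. The sketch below is a minimal illustration, assuming a 10 min sampling interval (so $T_d = 144$ and $T_w = 1008$) and hypothetical window settings for $n_w$, $n_d$, and $l_r$; it is not the authors' released code.

```python
import numpy as np

def build_inputs(flow: np.ndarray, t: int, l_p: int,
                 n_w: int = 2, n_d: int = 2, l_r: int = 6,
                 T_d: int = 144, T_w: int = 1008):
    """Slice distant/near/recent windows from `flow` (shape [time, n_in]).

    `t` is the first predicted time step; windows mirror Section 2.3.1:
    distant -> same period in the n_w preceding weeks (length n_w * l_p)
    near    -> same period in the n_d preceding days  (length n_d * l_p)
    recent  -> the l_r samples immediately before t   (length l_r)
    """
    distant = np.concatenate([flow[t - k*T_w : t - k*T_w + l_p]
                              for k in range(n_w, 0, -1)], axis=0)
    near = np.concatenate([flow[t - k*T_d : t - k*T_d + l_p]
                           for k in range(n_d, 0, -1)], axis=0)
    recent = flow[t - l_r : t]
    return distant, near, recent

def add_positional_index(x: np.ndarray) -> np.ndarray:
    """Add the incremental position vector (1, 2, ..., l) to each bridge column."""
    l = x.shape[0]
    return x + np.arange(1, l + 1)[:, None]

# Example: 3 weeks of synthetic 10 min counts for 5 input bridges.
flow = np.random.poisson(60, size=(3 * 1008, 5)).astype(float)
dis, near, rec = build_inputs(flow, t=2500, l_p=6)
print(dis.shape, near.shape, rec.shape)   # (12, 5) (12, 5) (6, 5)
dis = add_positional_index(dis)
```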
2.3.2. Computational Procedures of Encoder Blocks
As depicted in Figure 1a, the distant traffic flow matrix $X_{\mathrm{dis}}$, near traffic flow matrix $X_{\mathrm{near}}$, and recent traffic flow matrix $X_{\mathrm{rec}}$ are processed through Encoder 1, Encoder 2, and Encoder 3, respectively, to capture spatio-temporal correlations. The detailed encoder block is shown in Figure 1b, including the multi-head self-attention mechanism, residual connection, and layer normalization.
The multi-head self-attention mechanism is capable of extracting distinct spatio-temporal correlation patterns. The embedded input traffic flow $X$ is initially projected into high-dimensional latent subspaces. For the $i$-th self-attention head, the query matrix $Q_i$, key matrix $K_i$, and value matrix $V_i$ are derived from $X$, as follows:

$$Q_i = X W_i^{Q}, \qquad K_i = X W_i^{K}, \qquad V_i = X W_i^{V}$$

where $X$ is the input matrix, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight matrices for $Q_i$, $K_i$, and $V_i$, respectively. The elements within these weight matrices are learned by the network throughout the training process. Typically, we set $d_k = d_v = d_{\mathrm{model}}/h$, where $d_k$ and $d_v$ denote the column dimensions of the query/key and value projections, $d_{\mathrm{model}}$ is the embedding dimension, and $h$ corresponds to the number of heads in the encoder.

For the $i$-th self-attention head, the result $\mathrm{head}_i$ is computed as follows:

$$\mathrm{head}_i = \mathrm{softmax}\!\left( \frac{Q_i K_i^{\mathsf{T}}}{\sqrt{d_k}} \right) V_i$$

Multi-head self-attention $\mathrm{MultiHead}(X)$ is achieved by concatenating and integrating the learned features from each head, as follows:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}\left( \mathrm{head}_1, \ldots, \mathrm{head}_h \right) W^{O}$$

where $W^{O}$ is a learned output projection matrix.
Residual connections are applied following the multi-head self-attention layer and fully connected layer to alleviate the issue of gradient vanishing. Layer normalization is performed on the output of each residual connection to accelerate the back-propagation of errors and enhance the convergence speed during training.
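For concreteness, the following PyTorch-style sketch shows one encoder block built from these operations (multi-head self-attention, residual connections, and layer normalization, followed by the fully connected layer). It is a minimal illustration of the computation described above, not the authors' implementation; the dimensions `d_model`, `n_heads`, and `d_ff` are assumed hyperparameters.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention + residual + layer norm,
    followed by a position-wise fully connected layer + residual + layer norm."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence length l, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ff(x))       # residual connection + layer norm
        return x

# Example: a recent-flow window of length 6 embedded into d_model = 32.
x = torch.randn(8, 6, 32)                    # (batch, l_r, d_model)
encoder = nn.Sequential(*[EncoderBlock(32, n_heads=4) for _ in range(3)])
print(encoder(x).shape)                      # torch.Size([8, 6, 32])
```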
After passing through three stacked encoder blocks, the feature representations of the traffic flows are obtained as follows: the output of Encoder 1 (responsible for capturing distant traffic flow patterns) is denoted as $X_{\mathrm{dis}}^{\mathrm{enc}}$, the output of Encoder 2 (modeling near traffic flow characteristics) is denoted as $X_{\mathrm{near}}^{\mathrm{enc}}$, and the output of Encoder 3 (focusing on recent traffic flow features) is denoted as $X_{\mathrm{rec}}^{\mathrm{enc}}$.
2.4. Prediction Module of Traffic Flow
2.4.1. Fusion of Input Components
As shown in Figure 1a, the outcomes of the five input components are integrated using a parametric matrix-based fusion technique to serve as input for the output network. To facilitate the feature fusion, the extracted features are further processed by matrix multiplication with learnable parameters so that their temporal lengths are uniformly aligned with the output length $l_p$ before final concatenation, as follows:

$$X_{\mathrm{fus}} = \mathrm{Concat}\left( W_1 X_{\mathrm{E}1},\; W_2 X_{\mathrm{E}2},\; W_3 X_{\mathrm{dis}}^{\mathrm{enc}},\; W_4 X_{\mathrm{near}}^{\mathrm{enc}},\; W_5 X_{\mathrm{rec}}^{\mathrm{enc}} \right)$$

where, as previously described, $X_{\mathrm{E}1}$, $X_{\mathrm{E}2}$, $X_{\mathrm{dis}}^{\mathrm{enc}}$, $X_{\mathrm{near}}^{\mathrm{enc}}$, and $X_{\mathrm{rec}}^{\mathrm{enc}}$ are the extracted features. The weight matrices $W_1$, $W_2$, $W_3$, $W_4$, and $W_5$ are learnable, map the temporal length of each component to the common output length $l_p$, and assign different weights to External Factor 1, External Factor 2, and the distant, near, and recent components, respectively. The fused feature is $X_{\mathrm{fus}} \in \mathbb{R}^{l_p \times d_{\mathrm{fus}}}$, where $d_{\mathrm{fus}}$ is the total feature dimension after concatenation.
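A minimal sketch of this fusion step, assuming the notation above (each learnable matrix $W_k$ maps a component's sequence length $L_k$ to the common output length $l_p$ before concatenation along the feature axis; sizes are illustrative):

```python
import torch
import torch.nn as nn

class ParametricFusion(nn.Module):
    """Align each component to length l_p via a learnable matrix, then concatenate."""
    def __init__(self, lengths: list[int], l_p: int):
        super().__init__()
        # One weight matrix W_k of shape (l_p, L_k) per input component.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(l_p, L) / L**0.5) for L in lengths])

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # Each feature has shape (batch, L_k, d_k); W_k @ feature -> (batch, l_p, d_k).
        aligned = [W @ f for W, f in zip(self.weights, features)]
        return torch.cat(aligned, dim=-1)    # (batch, l_p, sum of d_k)

# Example: five components with lengths [6, 6, 12, 12, 6] fused to l_p = 6.
feats = [torch.randn(8, L, 16) for L in (6, 6, 12, 12, 6)]
fusion = ParametricFusion([6, 6, 12, 12, 6], l_p=6)
print(fusion(feats).shape)                   # torch.Size([8, 6, 80])
```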
2.4.2. Computational Procedures of Decoder Blocks
As depicted in Figure 1a, the fused input component $X_{\mathrm{fus}}$ and the masked predicted traffic flow serve as the input to the output part. The output module consists of three stacked decoder blocks arranged in series; the detailed decoder block is shown in Figure 1c. The multi-head self-attention mechanism, residual connection, and layer normalization were introduced with the encoder block. An additional multi-head cross-attention layer is designed to integrate the fused input component $X_{\mathrm{fus}}$ from the encoder with the outcome $Y_{\mathrm{mask}}$ of the last masked multi-head self-attention layer in the decoder.
The calculation process of the masked multi-head self-attention layer in the decoder is similar to that of the multi-head self-attention layer in the encoder, except that the input is the masked predicted traffic flow. Given that the output traffic flow is unknown during the prediction stage, mask embedding is employed. The prediction at time $\tau$ can only rely on the predicted outputs at times preceding $\tau$, denoted as $\hat{y}_{<\tau}$. The lower triangular mask matrix sets future positions to $-\infty$, which ensures that the model can only attend to historical time steps, preventing information leakage from the future. After masking, the softmax normalization operator converts these $-\infty$ values to zero attention weights, effectively blocking their influence on the prediction. The query, key, and value matrices are derived from the masked predicted traffic flow.
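The causal mask described here can be written in a few lines; the sketch below uses a hypothetical sequence length and shows how the $-\infty$ entries become zero attention weights after the softmax.

```python
import torch

def causal_mask(l_p: int) -> torch.Tensor:
    """Lower triangular mask: position tau may attend only to positions <= tau."""
    mask = torch.zeros(l_p, l_p)
    mask[torch.triu(torch.ones(l_p, l_p), diagonal=1).bool()] = float("-inf")
    return mask

scores = torch.randn(6, 6)                          # raw attention scores (l_p x l_p)
weights = torch.softmax(scores + causal_mask(6), dim=-1)
print(weights[0])                                   # first position attends only to itself
```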
The multi-head cross-attention layer bridges the decoder state with the fused input component from the encoder. For the $i$-th cross-attention head, the query matrix $Q_i$ is derived from the outcome $Y_{\mathrm{mask}}$ of the last masked multi-head self-attention layer in the decoder, while the key matrix $K_i$ and value matrix $V_i$ are generated from the fused input component $X_{\mathrm{fus}}$ of the encoder, as follows:

$$Q_i = Y_{\mathrm{mask}} W_i^{Q}, \qquad K_i = X_{\mathrm{fus}} W_i^{K}, \qquad V_i = X_{\mathrm{fus}} W_i^{V}$$

where the outcome of the masked multi-head self-attention layer, $Y_{\mathrm{mask}} \in \mathbb{R}^{l_p \times n_{\mathrm{out}}}$, has the same dimension as the decoder input, with $n_{\mathrm{out}}$ denoting the number of predicted bridges, while $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learned weight matrices for $Q_i$, $K_i$, and $V_i$, respectively. We set $d_k = d_v = d_{\mathrm{model}}/h_d$, where $h_d$ represents the number of attention heads in the decoder.
2.5. Loss Function
As shown in Figure 1a, the calculated results of the decoder blocks are then passed through the softmax operation to obtain the predicted traffic flow. ST-TransNet is trained to predict the traffic flow from the two external factors and the four traffic flow matrices by minimizing the root mean square error between the predicted traffic flow matrix $\hat{Y}$ and the actual traffic flow matrix $Y$, as follows:

$$L = \sqrt{ \frac{1}{l_p\, n_{\mathrm{out}}} \sum_{j=1}^{n_{\mathrm{out}}} \sum_{\tau=1}^{l_p} \left( \hat{y}_{j,\tau} - y_{j,\tau} \right)^2 }$$

where $\hat{y}_{j,\tau}$ and $y_{j,\tau}$ are the predicted traffic flow and the ground truth of the $j$-th target bridge at the $\tau$-th time period, respectively.
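In PyTorch, this training objective can be sketched as follows for predictions and targets of shape (batch, $l_p$, $n_{\mathrm{out}}$). The validity mask reflects our reading of the fixed-value imputation strategy described in Section 1 (excluding imputed entries from error computation); its use inside the loss is an assumption for illustration, not the authors' released code.

```python
import torch

def rmse_loss(pred: torch.Tensor, target: torch.Tensor,
              valid: torch.Tensor) -> torch.Tensor:
    """RMSE over valid (non-imputed) entries only."""
    sq_err = (pred - target) ** 2 * valid    # zero out imputed positions
    return torch.sqrt(sq_err.sum() / valid.sum())

pred = torch.randn(8, 6, 3)
target = torch.randn(8, 6, 3)
valid = torch.ones_like(target)              # 1 = observed, 0 = imputed
print(rmse_loss(pred, target, valid))
```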
2.6. Evaluation Metrics
The root mean square error (RMSE) and relative root mean square error (RRMSE) are utilized as evaluation metrics. RMSE quantifies the absolute magnitude of traffic flow prediction errors, while RRMSE, defined as the ratio of the prediction RMSE to the root mean square of the actual traffic flow, quantifies the proportion of prediction errors relative to the actual traffic flow.
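Written out explicitly under the notation of Section 2.5, these two metrics (following the verbal definitions above) are

$$\mathrm{RMSE} = \sqrt{ \frac{1}{l_p\, n_{\mathrm{out}}} \sum_{j=1}^{n_{\mathrm{out}}} \sum_{\tau=1}^{l_p} \left( \hat{y}_{j,\tau} - y_{j,\tau} \right)^2 }, \qquad \mathrm{RRMSE} = \frac{\mathrm{RMSE}}{ \sqrt{ \frac{1}{l_p\, n_{\mathrm{out}}} \sum_{j=1}^{n_{\mathrm{out}}} \sum_{\tau=1}^{l_p} y_{j,\tau}^2 } }$$

where, as in the loss function, imputed entries are excluded from both sums.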
5. Conclusions
The increasing deployment of SHM systems across bridges now provides the dense, synchronized data needed to investigate bridge network traffic flow correlations—a critical aspect previously inaccessible due to limited instrumentation. Capitalizing on this opportunity, we present the first comprehensive analysis of traffic flow correlations within an instrumented bridge network.
To model these complex spatio-temporal dependencies, we propose ST-TransNet, a novel self-attention architecture that integrates multi-scale traffic flows (recent, near, and distant) from source bridges with external factors (hour of day, day of week, and holiday). Traffic flows are processed through self-attention encoders and external factors via fully connected layers; a self-attention decoder then predicts target bridge traffic flows while explicitly modeling target-bridge correlations.
Validated using field WIM data from the investigated eight-bridge network, the proposed ST-TransNet achieves an RMSE of 12.78 vehicles/10 min, outperforming a series of baseline models (SVR, CNN, BiLSTM, CNN&BiLSTM, ST-ResNet, transformer, and STGCN) with significant relative reductions of 40.5%, 36.9%, 36.6%, 37.3%, 35.6%, 31.1%, and 22.8%, respectively. The framework successfully captures complex cross-bridge relationships by integrating multi-scale temporal dependencies and external factors, with recent flows exhibiting the strongest influence. Decoder refinement informed by learned target-bridge correlations further enhances forecasting precision.
This work reveals significant network-level spatio-temporal correlations in traffic flows across bridge networks, offering valuable insights for shifting the paradigm from single-bridge to network-level health monitoring. The findings may enable more accurate system-level performance assessment, identification of critical load paths, optimized sensor placement, and enhanced network-wide safety evaluation.
Future research will address the following two aspects: (1) embedding the unique topological structure of a bridge network into the proposed ST-TransNet using graph-based neural networks, allowing customization to a specific city-scale bridge network; and (2) demonstrating the model's generalization ability and adaptation efficiency under various scenarios.