A Transformer-Based Bridge Structural Response Prediction Framework

Accurate structural response prediction is essential for the health monitoring of bridges. However, complex on-site environments and noise disturbance make it difficult to extract structural response features accurately, resulting in poor prediction accuracy of the response values. To address this issue, a Transformer-based bridge structural response prediction framework is proposed in this paper. The framework contains multi-layer encoder modules and attention modules that capture the history-dependent features in time-series data. The effectiveness of the proposed method was validated using six months of strain response data from a concrete bridge, and the results were compared with those of the most commonly used Long Short-Term Memory (LSTM)-based structural response prediction framework. The analysis indicated that the proposed method was effective in predicting structural response, with a prediction error less than 50% of that of the LSTM-based framework. The proposed method can be applied in damage diagnosis and disaster warning of bridges.


Introduction
Response data of bridge structures can be used in the assessment of structural health: when the response values exceed a certain normal range, the monitored bridge structure is at risk of abnormalities or damage [1][2][3][4]. In this context, the response of bridge structures needs to be predicted accurately. Structural response data are a kind of time-series data, in which the data of a past period are usually used to predict the data of a future period. Such data have been widely used in economic and social science fields, such as weather prediction and stock prediction. Traditional structural response prediction methods mainly rely on linear and nonlinear models [5][6][7][8], which give good results for some simple systems; for high-order nonlinear systems with long-term time dependence and spatial correlation, however, these methods suffer from heavy computational effort and insufficient accuracy [9].
Early neural networks mainly refer to Back Propagation (BP) neural networks [10], which were used in areas such as predicting financial markets [11,12], electrical loads [13][14][15], and traffic accidents [16,17]. These networks share a common problem: the output data are related only to the input data, not to the order of the input data. The neurons themselves cannot store information, and the whole network has no "memory" capability. With the rise of deep learning, deep learning networks have shown good performance in speech recognition and image processing [18][19][20], and LSTM is considered to be the best network model for processing time-series data [21,22]. Therefore, LSTM has gradually become a research hotspot in the field of engineering structures, especially in the operation and maintenance phase of bridges, where it can predict the response of bridges in a short period of time, and many researchers have proposed LSTM-based response prediction models. Zhang et al. [23] established a convolutional long short-term memory (ConvLSTM) network to learn spatiotemporal latent features from data, and thus establish a surrogate model for structural response forecasting. Li et al. [24] applied LSTM to model the bridge aerodynamic system with the potential fluid memory effect. Bilal et al. [25] developed an LSTM network with overlapping data to evaluate important response data after earthquakes. However, LSTM still has its limitations, and its semantic capture capability for time-series data is still insufficient, which may lead to poor prediction accuracy.
To remedy this deficiency, researchers have proposed the Transformer structure [26], which is better than LSTM in the semantic capture of time-series data due to the use of an attention mechanism as the underlying network. Therefore, in this paper, a Transformer-based bridge structural response prediction framework is proposed for improving the accuracy of bridge structural response prediction, and the performance of the proposed framework is tested on a concrete bridge. To the best of our knowledge, this is the first Transformer framework that has been used for bridge structural response prediction. The paper is organized as follows. Section 2 provides the details of the proposed framework. Section 3 provides basic information about the bridge response prediction experiments, including the dataset, training parameters, etc. Section 4 provides the experimental results. Section 5 provides the discussion. Section 6 provides the limitations of the proposed method. Section 7 provides the conclusion.

Framework for Structural Response Prediction
The traditional CNN and RNN are discarded in the Transformer, and the whole network structure is composed entirely of attention mechanisms. The original Transformer consists only of Self-Attention and Feed Forward Neural Network layers. After years of development, the Transformer has produced many variants.
The Transformer structure used in this paper is shown in Figure 1, with an encoder-decoder structure. The encoder consists of 6 encoding blocks, and similarly the decoder is composed of 6 decoding blocks. As with all generative models, the output of the encoder is used as the input to the decoder.

Attention Mechanism
Attention mechanisms [27] have been widely used in various areas of deep learning in recent years, and are encountered in many different types of tasks, be it image processing, speech recognition, or natural language processing. Their inspiration comes from the human attention mechanism. The visual attention mechanism is a signal processing mechanism in the brain that is unique to human vision. By quickly scanning the global image, human vision obtains the target area to focus on, generally known as the focus of attention, and then devotes more attention resources to this area to obtain more detailed information about the target while suppressing other useless information [28,29]. The purpose is to select, from the large amount of available information, the information that is most critical to the current task goal.
The input vector is denoted by $X = [X_1, X_2, \ldots, X_n]$. In order to assign a weight to each element of the input vector, the attention weight $\mathrm{Attention} = [a_1, a_2, \ldots, a_n]$ is calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Q, K, and V denote "query", "key", and "value", respectively, and $d_k$ denotes the dimensionality of K and determines the scaling factor. For larger values of $d_k$, the dot products grow large in magnitude, pushing the Softmax function into regions with very small gradients. To counteract this effect, the dot product is scaled by $1/\sqrt{d_k}$.
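The scaled dot-product attention described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (the paper states that the framework was built with PyTorch); the helper names `softmax` and `scaled_dot_product_attention` and the toy input values are assumptions chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Q is n x d_k, K is m x d_k, V is m x d_v, all given as lists of rows.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # Scaled dot product of one query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Toy input: 3 time steps with d_k = 2 (made-up values, not bridge data).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(X, X, X)
```

Each row of `attn` sums to one, so every output row is a convex combination of the value rows; dividing by the square root of `d_k` keeps the scores in a range where the softmax is not saturated.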
After the attention output is obtained, it is sent to the next module of the encoder, i.e., the feed forward neural network. This network is fully connected and has two layers: the first layer uses a ReLU activation function, and the second layer uses a linear activation function. It can be expressed as:
$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
where $W_1$, $b_1$ and $W_2$, $b_2$ are the weights and biases of the first and second layers, respectively.
As shown in Figure 2, there are two kinds of attention mechanisms used in the Transformer: Self-Attention and Encoder-Decoder attention. Both are computed in a multi-head way, but Encoder-Decoder attention uses the traditional attention mechanism, where the Query is the decoder value at the previous time step, already computed by Self-Attention, and both the Key and the Value are the output of the encoder. Self-Attention only calculates the attention (or weight matrix) inside the encoder or decoder, without reference to the current state on the decoder's side.

Positional Encoding
Since there is neither a recurrent nor a convolutional structure in the Transformer, information that represents the relative or absolute position in the sequence must be inserted so that the model can make use of the order of the sequence [30]. Positional encoding is not part of the model architecture; it is only part of the preprocessing. It provides a unique encoding for each position. The encoding used in this paper has 250 dimensions.
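As an illustration of how a unique encoding per position can be generated, the following sketch uses the sinusoidal scheme of the original Transformer. This is an assumption: the paper does not state which encoding formula was used, only that it has 250 dimensions. The function name `sinusoidal_positional_encoding` is hypothetical.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position codes.

    Even columns hold sin(pos / 10000^(i / d_model)) and the following odd
    column holds the cosine of the same angle, where i is the even index.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Four input positions (two hours at half-hour sampling), 250 dimensions
# as stated in the text; the sinusoidal form itself is assumed.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=250)
```

The table is computed once and added to the input embeddings before the first encoder block, which is why it can be regarded as preprocessing rather than a trainable part of the model.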

Multi-Head Attention
Multi-Head attention [31] provides multiple "representation subspaces" for attention. A different set of Query/Key/Value weight matrices is used in each attention head, and each matrix is generated by random initialization. Then, through training, the response values are projected into different "representation subspaces". Multi-Head attention in this paper consists of stacked self-attention layers that form a deep structure. The calculation is shown as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
In this paper, the Q,

Experiment
In this paper, our objective is to predict the bridge response for a specific period in the future based on the bridge response for a given period in the past. The main load on the bridge is the vehicle load, and the bridge response used in this paper is the strain. We established a dataset of bridge strains, which is described in the following subsection. To show the performance of the proposed method, we compare it with previous LSTM-based methods.

Songhua River Bridge Structural Response Dataset
This bridge is located in Tonghe County, Heilongjiang Province. The total length of the bridge is 2578.28 m. The main bridge structure is a prestressed concrete continuous box girder divided into two links. Each link span arrangement is 1132 m, and 14 sensors were placed on each link span (from left to right, S01-S14). The strain responses of four sensors (S01, S02, S03, and S09) were selected as the dataset in this paper. The 14 sensors were arranged to meet the needs of other projects, and this number of sensors was not needed for the study in this paper. Because the method in this paper predicts well for each sensor, only four sensors were randomly selected. The strain response was monitored for 6 months and was measured every half hour. The sensor layout is shown in Figure 3.

Training Platform
The training process was performed on a single workstation with a high-performance GPU (NVIDIA RTX 2080Ti) and CPU (AMD Ryzen 2700X 3.7 GHz). The code was written in Python 3.6, the framework was built using PyTorch 1.8.0, and the training process was performed on Windows 10. The optimizer was Adam with a learning rate of 0.001 and a decay rate of 0.0001. The number of epochs was set to 100.

Loss Function and Evaluation Metrics
RMSE (Root Mean Square Error): RMSE is a commonly used regression loss function and evaluation metric; it is calculated as the square root of the mean of the squared differences between the predicted and true values, with the following formula:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and n is the number of samples.
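The RMSE definition can be written as a short function. This is a minimal sketch in plain Python; the function name `rmse` and the sample values are illustrative assumptions, not data from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up strain values for illustration only.
measured = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 13.0]
error = rmse(measured, predicted)
```

Because the differences are squared before averaging, a few large deviations raise the RMSE more than many small ones, which makes it sensitive to the isolated mismatches discussed in the comparison with LSTM.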

Experimental Design
In this paper, two deep learning methods were used for strain response prediction: the proposed Transformer-based method and the LSTM-based method. The dataset was the bridge strain response dataset described in Section 3.1. Both methods used the same training parameters, with the strain response of the previous two hours as the input and the strain response of the next half hour as the prediction target, and the number of epochs was set to 100. The ratio of training set, validation set, and test set was 0.7:0.1:0.2.
Figure 4 shows the strain response prediction results of the two methods. The Transformer predicted the strain response successfully, and the predicted values basically matched the field test values, indicating that the proposed method is able to predict the strain response in the short term. In comparison, the agreement of the proposed method is better than that of the LSTM. The LSTM generally predicts the trend of the structural strain response, but at some time points there is still a significant difference between the predicted strain response and the test value, indicating that the LSTM prediction is less stable than that of the proposed method. To show the errors in more detail, the variation of the errors over time and their probability density function are calculated, so that both the variation and the distribution of the errors are reflected more accurately.
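The input/target construction and the 0.7:0.1:0.2 split can be sketched as follows. With half-hourly sampling, two hours of input correspond to four points and the next half hour to one point. The helper names `make_windows` and `chronological_split` are hypothetical, and splitting in time order without shuffling is an assumption; the paper does not state how the split was drawn.

```python
def make_windows(series, n_in=4, n_out=1):
    """Slide a window over the series, yielding (input, target) pairs."""
    samples = []
    for start in range(len(series) - n_in - n_out + 1):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        samples.append((x, y))
    return samples


def chronological_split(samples, ratios=(0.7, 0.1, 0.2)):
    """Split samples into train/validation/test parts in time order."""
    n = len(samples)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


# Placeholder series standing in for half-hourly strain readings.
strain = [float(i) for i in range(104)]
samples = make_windows(strain)
train, val, test = chronological_split(samples)
```

Keeping the test portion at the end of the record avoids evaluating on windows that precede the training data, which is one common choice for time-series forecasting.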

Experimental Results
Figure 5 shows the errors of the Transformer-based method and the LSTM, as well as the normal distribution curves fitted to these errors. From the error curves, it can be seen that the error of the Transformer is smaller than that of the LSTM, and that both follow the same trend, with jitter occurring at the 200th time point. Table 1 shows the mean error and the 95% confidence interval (CI). The mean error of the Transformer is about 19.2-55.5% of that of the LSTM. The 95% CI of the Transformer is also much narrower, approximately 59.0-87.7% of that of the LSTM. The strain responses used in this study are all raw data without pre-processing such as filtering, and the error remains within an acceptable range in the presence of noise, which shows the engineering feasibility of the proposed method.
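One way to obtain a mean error and a 95% interval from a fitted normal distribution is sketched below using the Python standard library. How the CI in Table 1 was actually computed is not stated in the text, so taking the central 95% range of the fitted normal is an assumption, as are the function name `error_summary` and the sample values.

```python
from statistics import NormalDist


def error_summary(y_true, y_pred, coverage=0.95):
    """Fit a normal distribution to the prediction errors.

    Returns the mean error and the central interval of the fitted normal
    that contains the requested share of the probability mass.
    """
    errors = [p - t for t, p in zip(y_true, y_pred)]
    fitted = NormalDist.from_samples(errors)
    tail = (1.0 - coverage) / 2.0
    interval = (fitted.inv_cdf(tail), fitted.inv_cdf(1.0 - tail))
    return fitted.mean, interval


# Made-up values for illustration only.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [10.5, 11.5, 14.5, 15.5]
mean_err, (low, high) = error_summary(measured, predicted)
```

Comparing the interval widths `high - low` of two models gives a ratio of the kind quoted for the Transformer and the LSTM.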

Impact of Different Number of Prediction Points
To test the performance of the proposed method in predicting different numbers of strain responses, we changed the parameters of the sliding window and set the number of prediction points to two, four, six, eight, ten, and twelve; the results are shown in Figure 6. When the number of prediction points is greater than one, multiple predicted values are generated for each time point, and their mean is used as the final prediction value in this paper. The RMSE is calculated for each of the six cases; it indicates the magnitude of the prediction error, and the larger the RMSE, the larger the prediction error.
The prediction errors are summarized in Figure 7. The results show that the proposed method can predict the strain response when the number of prediction points is less than twelve, but the accuracy varies widely. As the number of prediction points increases, the prediction error becomes larger. When the number of prediction points is four, the RMSE increases significantly; when it increases to six, the growth in RMSE flattens; when it reaches ten, the RMSE grows faster again. Combining the results of Figures 6 and 7: when the number of prediction points is within four, the prediction error is small and the prediction results are more credible; when it is between four and ten, the prediction error is moderate and the results are less credible; when it is greater than ten, the prediction error is relatively large and the results are not credible.
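The averaging of overlapping multi-point predictions can be sketched as follows. Each window predicts several consecutive time points, so neighbouring windows predict the same time point more than once; the final value is the mean of all predictions for that point. The function name `average_overlapping` and the indexing convention are assumptions for illustration.

```python
from collections import defaultdict


def average_overlapping(predictions, first_targets):
    """Average all predictions that refer to the same time point.

    predictions[k] holds the values predicted by window k for consecutive
    time points, the first of which has index first_targets[k].
    """
    buckets = defaultdict(list)
    for start, window in zip(first_targets, predictions):
        for offset, value in enumerate(window):
            buckets[start + offset].append(value)
    return {t: sum(v) / len(v) for t, v in sorted(buckets.items())}


# Three windows, each predicting two consecutive points (made-up values).
preds = [[1.0, 2.0], [2.2, 3.0], [3.4, 4.0]]
final = average_overlapping(preds, first_targets=[4, 5, 6])
```

Time points at the two ends of the record are covered by fewer windows than interior points, so their averages rest on fewer predictions.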

Impact of Different Time Intervals
In this section, the effect of different time intervals on the prediction results is discussed, and the dataset is set to a uniform length of 2000 samples to ensure a fair comparison. Our experiments showed that the batch size had a large impact on the training results. In previous tests, the batch size was set to 32, and the previous conclusions are based on this setting. In this section, when the batch size was set to 32, the prediction accuracy was found to drop sharply. As a result, the batch size was reduced to six for multiple tests, after which the prediction accuracy dropped more slowly as the time interval was increased. Figure 8 shows the prediction results and errors of the Transformer at different time intervals, which were set to 0.5 h, 1 h, 1.5 h, and 2 h. The results show that, in general, the prediction errors gradually increase as the time interval increases. Figure 9 shows the prediction errors for the different time intervals. The prediction results are more reliable for time intervals of 0.5 h and 1 h, where the prediction errors are mainly distributed in (−3, 3) and (−6, 3), respectively, while the prediction results are less reliable for time intervals of 1.5 h and 2 h, where the errors are mainly distributed in (−15, 15) and (−12, 16), respectively. The time interval therefore has a large influence on the prediction accuracy, and the prediction accuracy decreases significantly when the time interval is increased from 1 h to 1.5 h. This indicates that the response data of the bridge are more regular when the time interval is small, which is related to the form of external loads on the bridge and the local traffic conditions. Therefore, in practical applications, it is best to keep the time interval small, within 1 h as far as possible.
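A simple way to build equally long datasets at different time intervals is to subsample the half-hourly record with a fixed stride and truncate each result to 2000 samples. The sketch below assumes this procedure; the paper does not describe how the interval datasets were generated, and the function name `resample` and the placeholder record length are hypothetical.

```python
def resample(series, interval_hours, base_hours=0.5, length=2000):
    """Subsample a series recorded every base_hours to a coarser interval."""
    step = round(interval_hours / base_hours)
    if step < 1 or abs(step * base_hours - interval_hours) > 1e-9:
        raise ValueError("interval must be a multiple of the base interval")
    subsampled = series[::step]
    if len(subsampled) < length:
        raise ValueError("series too short for the requested length")
    return subsampled[:length]


# Placeholder record: 180 days of half-hourly readings (an assumed length).
record = list(range(180 * 48))
datasets = {h: resample(record, h) for h in (0.5, 1.0, 1.5, 2.0)}
```

Under this assumed record length, the 2 h series has 2160 points before truncation, so 2000 samples are available at every interval considered.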

Limitations
This section analyzes the limitations of the proposed method. There are various options for the number of input points and output points in structural response prediction, and the use case presented in the previous sections, with four input points and one prediction point, is the best application case. However, as the number of input points changes, the Transformer is not more accurate than the LSTM in all cases, so this section changes the number of input points to three to compare the two prediction methods. The results are shown in Figure 10. With three input points, the prediction accuracy of the Transformer for S01, S02, and S09 is slightly lower than that of the LSTM, and for S03 it is almost the same as that of the LSTM. Compared with the case of four input points, the prediction accuracy of the Transformer with three input points is lower for all sensors, while that of the LSTM decreases for S01, S02, and S09 and increases for S03. After several tests, the prediction results tend to be stable once the number of input points is increased, similar to the results in Figure 4, so it is recommended to use at least four input points in order to ensure stable prediction results.

Conclusions
In this paper, a Transformer-based time-series prediction framework is proposed for predicting the time-dependent structural response of bridges. The proposed framework contains multiple encoder modules and attention modules; this structure enhances the semantic recognition of time-series data and is more conducive to extracting the features of the structural response. The accuracy of the proposed framework is verified using six months of strain response data from a concrete bridge. The proposed framework is compared with the most commonly used LSTM-based structural response prediction framework, and the results are as follows:

• The mean error of the Transformer is about 19.2%-55.5% of that of the LSTM.
• The 95% CI of the Transformer is much narrower, approximately 59.0%-87.7% of that of the LSTM.
Deep learning-based structural response prediction has the drawback of poor interpretability. Compared with traditional methods, deep learning-based methods rely more on the temporal regularity features of the data itself, which mainly reflects the approximation ability of deep learning and cannot correspond to the real physical features. In the future, we will pay more attention to the interpretability of time-series prediction, and calculate which time points of data are more valuable for predicting future responses.

Conflicts of Interest:
The authors declare no conflict of interest.