Atmospheric Temperature Prediction Based on a BiLSTM-Attention Model

To address the problem that traditional models are not effective in predicting atmospheric temperature, this paper proposes an atmospheric temperature prediction model based on a symmetric BiLSTM (bidirectional long short-term memory)-Attention architecture. First, meteorological data from five major stations in Beijing were integrated, cleaned, and normalized to build an atmospheric temperature prediction dataset containing multiple feature dimensions. A BiLSTM network was then constructed to exploit both forward and backward information in the time dimension, and the limitations of the traditional LSTM method in long-term time series analysis were addressed by introducing an attention mechanism, enabling the prediction and analysis of atmospheric temperature. Finally, comparing the prediction results with those of BiLSTM, LSTM-Attention, and LSTM shows that the proposed model has the best prediction performance, with an MAE of 0.013, which is 0.72%, 0.41%, and 1.24% lower than those of BiLSTM, LSTM-Attention, and LSTM, respectively; the R² value reaches 0.9618, which is 2.73%, 1.23%, and 4.98% higher than that of BiLSTM, LSTM-Attention, and LSTM, respectively. The results show that the symmetric BiLSTM-Attention atmospheric temperature prediction model can effectively improve the prediction accuracy of temperature data, and the model can also be used to predict other time series data.


Introduction
The study of temperature prediction affects many fields, and the ability to accurately predict atmospheric temperatures is important for urban flood and drought prevention, resource use, and agricultural development; as such, it has become a topic that needs to be addressed and further researched [1,2]. In response, early researchers developed models to predict atmospheric temperatures from historical meteorological data combined with statistical knowledge [3], which are more interpretable than subjective human experience. However, the accuracy of such prediction methods is not high: atmospheric temperature data are subject to a variety of influencing factors and exhibit strong randomness and uncertainty [4]. Neural network models trained on large amounts of historical data can learn data fluctuations more accurately and extract data features better, thus largely improving prediction accuracy [5].
For historical data with time series characteristics, time series prediction models such as LSTM (long short-term memory) networks [6-8] have been used for temperature prediction; these are based on the memory function of neural networks and can achieve good predictions from large amounts of historical time series data. However, symmetry-based BiLSTM networks can better avoid the shortcomings of unidirectional LSTM networks. The attention mechanism is widely used to mitigate unfocused and time-consuming feature extraction by reallocating the computational resources of neural networks; it is therefore also of very good use in the field of time series prediction [9].

Literature Review
Several scholars have already made many contributions to the field of temperature prediction [10,11], and more traditional machine learning methods have been used in the past [12,13]. Zhou et al. proposed a grey-Markov temperature prediction model with seasonal indices, built on the interannual cyclical and seasonal variation patterns of historical temperature data, and obtained accurate results when predicting the average temperature of Guangzhou [14]. Raviprased et al. formulated a compound-specific prediction model to assess how effectively the decision tree approach predicts the critical temperature of superconductors, which was finally demonstrated after comparison with multiple other models [15]. Hou et al. developed a BP (back propagation) neural network model for climate change prediction by integrating factors such as atmospheric CO2 emissions, the heat dissipation of the Earth, and changes in ocean surface temperature over the years, and produced accurate predictions of temperature changes in future years [16]. Cai et al. established an SVM (support vector machine) model to predict the indoor temperature of buildings, compared it with a BP neural network, and found that the SVM prediction accuracy was better, proving the applicability of the SVM method in the field of prediction [17]. These studies show that temperature prediction has mostly been based on traditional machine learning models [18-20], which can achieve reasonable predictions, but their accuracy is affected by many factors, such as data quality, feature extraction, and model parameter configuration, and needs to be further improved.
Among existing temperature prediction studies, time series prediction models represented by LSTM networks have been widely used and can reliably predict long-term time series data [21]. Qiu et al. used LSTM models to predict daily river temperatures and, through experimental analysis of data from the Three Gorges reservoir system, captured the daily average variation of the thermal system accurately, demonstrating that the LSTM outperformed other methods in predicting the daily average water temperature of rivers [22]. Masooma et al. used an LSTM model based on a spatial attention mechanism to capture the spatial and temporal behavior of multiple meteorological features for temperature prediction, and found that spatial feature attention captured the interaction of input features on target features while maintaining good prediction accuracy [23]. Song et al. proposed a temporal prediction model based on LSTM and Kalman filtering for predicting observations in atmospheric quality datasets, and found that the LSTM-Kalman model produced better predictions than the LSTM model [24]. Liu et al. analyzed the time dependence of ocean temperature variability at multiple depths and proposed a new method for ocean temperature time series prediction, the time-dependent long short-term memory network (TD-LSTM), confirming that TD-LSTM outperformed other methods and performed well across regions and depths [25].
The above studies on temperature prediction are summarized in Table 1. All of the above methods provide good solutions for temperature prediction.

Table 1. Summary of the temperature prediction studies reviewed above (literature, method, and overall evaluation of the method).

[17] Grey-Markov: Combines longitudinal and cross-sectional analysis for non-stationary data prediction, but the method is traditional and the accuracy is not high.
[18] Decision Trees: Multiple conventional and non-conventional models are used for superconductor critical temperature prediction, demonstrating the benefits of decision trees; however, the prediction accuracy needs to be improved.
[19] BP Neural Network: A wide range of atmospheric factors affecting temperature change are considered to predict climate conditions over the next 25 years, but the method is single and the accuracy needs to be improved.
[20] SVM: A support vector machine (SVM) model for indoor temperature prediction is shown to outperform a back propagation neural network (BPNN) model, but the model is traditional and the comparison is single.
[22] LSTM: Neural networks predict daily water temperatures and quantify trends, giving a significant improvement over traditional models; however, the model is single and the feature extraction is not sufficient.
[23] LSTM-Attention: Accurately captures the spatial and temporal relationships of multiple meteorological features, but the feature extraction is not sufficient.
[24] LSTM-Kalman: Kalman filtering is added to data series processing; however, the data features are not sufficiently extracted and the prediction accuracy needs to be verified.
[25] TD-LSTM: A time-varying parameter matrix based on the fusion of historical observations is proposed, but more models need to be compared and the accuracy can be further improved.

Data Sources and Their Visualization
This paper focuses on the prediction of Beijing temperature data, taken from the multi-site Beijing meteorological dataset in the UCI (University of California, Irvine) machine learning repository. Five major stations were selected for the actual study: Changping, Dingling, Tiantan, Huairou, and Shunyi. Each station contains atmospheric monitoring data from March 2013 to February 2017, with about 35,064 records per station; the data therefore contain a total of 35,064 × 5 samples. Each feature value is recorded for each of the 24 h of every day. The sample data are shown in Table 2. Several of the features in Table 2 are meteorological terms, and the meanings of these terms are explained in Table 3.

Removing Invalid Attributes
The dataset of the Tiantan station is used as the main sample for the experiment; it has a total of 35,064 records and 17 attributes. Attributes that are not relevant to this prediction task, such as "No" and "station", can be deleted directly during data pre-processing.

Fill Missing Values
The missing values in the original dataset are indicated by "NA", and the missing values of the valid features are counted. We found a certain number of missing values for each feature; for example, 597 missing values for PM10, 1118 for SO2, and 744 for NO2. The missing value statistics are shown in Figure 3. Based on these statistics, the missing values are filled by mean interpolation.
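As a minimal sketch of this pre-processing step, assuming pandas is available; the CSV file name follows the UCI naming convention for this dataset and is illustrative:

```python
import pandas as pd

# Load one station's records; "NA" strings become proper missing values.
df = pd.read_csv("PRSA_Data_Tiantan_20130301-20170228.csv", na_values="NA")

# Remove attributes irrelevant to prediction.
df = df.drop(columns=["No", "station"])

# Count missing values per feature (cf. Figure 3), then fill numeric
# columns by mean interpolation.
print(df.isna().sum())
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
```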

Feature Importance Analysis
In general, feature importance measures the weight and value of a feature in the construction of a model; the higher the score of a feature used as an input, the more important it is, and vice versa. The importance scores for the above features (PM2.5, PM10, SO2, etc.) are ranked, as shown in Figure 4. It can be seen that DEWP (dew point temperature) has the highest importance score, PRES (atmospheric pressure) the next highest, while CO (carbon monoxide) and RAIN (rainfall) have relatively low importance scores.
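The paper does not name the scoring method used to rank features, so the random forest estimator in this sketch is an assumption; it produces an importance ranking of the same kind as Figure 4:

```python
from sklearn.ensemble import RandomForestRegressor

# Score each numeric feature's importance for predicting TEMP.
features = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3", "PRES", "DEWP", "RAIN"]
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(df[features], df["TEMP"])

# Print features from most to least important.
for name, score in sorted(zip(features, rf.feature_importances_),
                          key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```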

Data Normalization Process
The data prediction accuracy is affected by the data dimensionality, and to eliminate the effect of dimensionality on the experimental results, the data need to be normalized. The normalization operation transforms all the data with magnitudes into dimensionless data, i.e., all values lie within [0, 1] [26], and can also improve the training accuracy and speed of the model. In this paper, minimum-maximum normalization was performed for the PM2.5, PM10, CO, O3, TEMP, DEWP, and RAIN attributes; the conversion is shown in Equation (1).

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (1)$$
In the above equation, $x_{max}$ is the maximum value in the column, $x_{min}$ is the minimum value in the column, and $x^*$ is the normalized value, scaled into [0, 1].
The feature box plots before and after normalization are shown in Figures 5 and 6, respectively. It can be seen from the figures that the original feature values fluctuate greatly, whereas the normalized feature distributions all lie within [0, 1]; after prediction, the data are restored to their original scale using the inverse normalization operation.
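A minimal sketch of Equation (1) using scikit-learn, continuing from the pre-processing above; MinMaxScaler keeps the column statistics needed for the inverse operation:

```python
from sklearn.preprocessing import MinMaxScaler

# Map each selected attribute into [0, 1], per Equation (1).
cols = ["PM2.5", "PM10", "CO", "O3", "TEMP", "DEWP", "RAIN"]
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])

# After prediction, restore the original scale (inverse normalization):
# restored = scaler.inverse_transform(df[cols])
```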

BiLSTM Network
The LSTM network is a deep learning network model evolved from the recurrent neural network (RNN) [27] that alleviates some shortcomings of the RNN model, such as the vanishing gradient [28]. The LSTM network contains four structures in total: the memory cell, the forgetting gate $f_t$, the input gate $i_t$, and the output gate $o_t$. Its structure is shown in Figure 7. As seen in Figure 7, the LSTM network adds a cell state with three gate components compared to the RNN [29]. The forgetting gate $f_t$ in the LSTM structure is responsible for determining what percentage of information is retained in the network, and is calculated as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

In the above equation, $x_t$ is the input sequence; $h_{t-1}$ is the state memory of the previous moment; $\sigma(\cdot)$ is the sigmoid activation function; $W_f$ is the weight matrix of the forgetting gate; $b_f$ is the bias of the forgetting gate; and $f_t$ is the state of the forgetting gate.
The input gate $i_t$ is responsible for selectively memorizing the new information in the cell state, and is calculated as follows:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

In the above equations, $\tilde{C}_t$ is the cell state candidate; $C_t$ is the new cell state; $\tanh(\cdot)$ is the hyperbolic tangent function; $W_i$ is the weight matrix of the input gate; $W_c$ is the weight matrix of the cell state; $b_i$ is the bias of the input gate; $b_c$ is the bias of the cell state; and $i_t$ is the state of the input gate.
The output gate $o_t$ is responsible for determining the current state of the output information, and is calculated as follows:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

In the above equations, $W_o$ is the weight matrix of the output gate; $b_o$ is the bias of the output gate; and $o_t$ is the state of the output gate.
In this paper, we focus on the BiLSTM model, a bidirectional long short-term memory model combining forward and backward information, i.e., it can process information in both directions. Both directions have hidden layers, and these hidden layers jointly extract the forward and backward key information at a given time [30]; thus, we can obtain more adequate temperature data features, which helps to improve the prediction accuracy of the model. The BiLSTM network structure is shown in Figure 8.
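As a minimal sketch of such a bidirectional layer, assuming the Keras API of TensorFlow; the window length of 5 and the 11 features match the shapes used in the experiments below, while the unit count of 64 is an assumption:

```python
import tensorflow as tf

# One bidirectional LSTM layer over windows of 5 time steps, 11 features.
inputs = tf.keras.Input(shape=(5, 11))
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(inputs)
print(h.shape)  # (None, 5, 128): forward and backward outputs concatenated
```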

Attentional Mechanisms
The attention mechanism is mainly designed to quickly extract more valid information from the large volume of information, reduce the influence of invalid information on the training effect of the model, and achieve the purpose of improving prediction accuracy [31].
There are generally hard and soft attention mechanisms in machine learning. The hard attention mechanism randomly selects information from the input sequence; since its selection probability is difficult to quantify, which increases the difficulty of model training, this paper uses a soft attention mechanism. Combined with the sequence data information, the input information is calculated as a weighted average and then fed into the network for training, which can effectively improve the model's attention to the input information and achieve a reasonable allocation of resources; this is suitable for predicting the temperature sequence data in this paper [32]. The flow structure of the soft attention mechanism is shown in Figure 9. In Figure 9, $x_i$ is the input of the BiLSTM layer embedded with the attention mechanism; $h_i$ is the output of the BiLSTM layer; $\alpha_i$ is the set of weights of the different channels of the BiLSTM obtained from the attention calculation; and $y$ is the final output of the neural network model.
The main formulas of the attention mechanism are as follows:

$$e_t = u \cdot \tanh(w h_t + b)$$
$$\alpha_t = \frac{\exp(e_t)}{\sum_{j} \exp(e_j)} \quad (9)$$
$$s_t = \sum_{t} \alpha_t h_t \quad (10)$$

where $e_t$ is the attention distribution value at moment $t$; $u$ and $w$ are weight coefficients; $b$ is the bias; $\alpha_t$ is the weight of the different channel information in the BiLSTM; and $s_t$ is the weighted sum of the BiLSTM layer outputs $h_t$. The attention mechanism is mainly manifested in the operation of the weight coefficients of the different channels, which can be updated and optimized to adjust the model's allocation of attention to the channel information [33], producing the best training effect of the model in the current computing environment.
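A minimal Keras implementation of these formulas, assuming TensorFlow; the layer learns $w$, $b$, and $u$ and returns the weighted state of Equation (10):

```python
import tensorflow as tf

class SoftAttention(tf.keras.layers.Layer):
    """Soft attention over BiLSTM outputs:
    e_t = u . tanh(w h_t + b); alpha_t = softmax(e_t); s = sum_t alpha_t h_t."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, d))
        self.b = self.add_weight(name="b", shape=(d,))
        self.u = self.add_weight(name="u", shape=(d, 1))

    def call(self, h):                               # h: (batch, steps, d)
        e = tf.matmul(tf.tanh(tf.matmul(h, self.w) + self.b), self.u)
        alpha = tf.nn.softmax(e, axis=1)             # weights over time steps
        return tf.reduce_sum(alpha * h, axis=1)      # weighted state s
```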

BiLSTM-Attention Model
Combined with the temperature dataset used in this paper, the BiLSTM-Attention model is proposed, which fully utilizes the advantages of the bidirectional memory network and the attention mechanism. The BiLSTM network structure processes the input in both the forward and backward directions simultaneously, obtaining the information of the previous moment and the next moment at any particular time. Moreover, because of its unique bidirectional network structure, BiLSTM greatly enhances the model's memory of the beginning and end phases of the input information during training [34]. The attention mechanism is embedded on top of this bidirectional memory structure, so that the model channels assign weights reasonably and attention to the key information is strengthened; therefore, the improved model can improve the prediction of temperature data. The network structure of the BiLSTM-Attention model is shown in Figure 10. The BiLSTM-Attention model is divided into four parts: the feature vector input layer, the BiLSTM layer, the attention layer, and the output layer. As can be seen from Figure 10, the first layer is the input layer, where $x_i$ (i = 1, 2, ..., n) is its input; the second layer is the bidirectional LSTM layer, which is further divided into forward and backward LSTM layers; the third layer is the attention mechanism layer, where the $\alpha_i$ (i = 1, 2, ..., n) values are the weights of the different channels of information; and the fourth layer is the output layer, in which $y$ is the final output of the network.
(1) Input layer: This refers to the input feature vectors. The input layer of this paper pre-processes the atmospheric temperature datasets into feature vectors that can be directly accepted and processed by the BiLSTM layer.
(2) BiLSTM layer: The BiLSTM layer consists of forward and backward LSTM layers. The forward LSTM computation vector is denoted as $\overrightarrow{h}_i$ and the backward LSTM computation vector as $\overleftarrow{h}_i$ (i = 1, 2, ..., n), which yields the output $h_t$ of the BiLSTM layer at moment $t$, as follows:

$$h_t = \alpha \overrightarrow{h}_t + \beta \overleftarrow{h}_t$$
In the above equation, $\alpha$ and $\beta$ are constants whose sum is 1.
(3) Attention layer: In temperature data prediction, the neural network is trained to focus on certain key features through the attention mechanism, the core of which is the weight coefficients. The first step is to learn the importance of each feature, and then assign the corresponding weight to each feature according to that importance. Equation (9) enables the transition from the initial input state to the new attention state, after which the final output state vector $s_t$ is obtained through Equation (10). Finally, $s_t$ is integrated with the dense layer and passed as an output value into the final output layer.
(4) Output layer: The input of the output layer is the output of the attention mechanism layer in the hidden layer, which in this paper is mainly the set of predicted $y$ vectors of atmospheric temperature.
By continuously optimizing and updating the weights and biases, the cost function in the model structure gradually becomes smaller, and the network model becomes better during the training period.
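A minimal end-to-end sketch of this four-part structure, assuming TensorFlow/Keras and the SoftAttention layer sketched earlier; the unit count is an assumption, while the window of 5, the 11 features, and dropout = 0.01 follow the experimental settings described later:

```python
import tensorflow as tf

# Input -> BiLSTM -> attention -> output, as in Figure 10.
inputs = tf.keras.Input(shape=(5, 11))             # feature vector input layer
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(inputs)
h = tf.keras.layers.Dropout(0.01)(h)               # regularization (see below)
s = SoftAttention()(h)                             # weighted state s_t
outputs = tf.keras.layers.Dense(1)(s)              # predicted temperature y
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```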
Based on the measured and predicted temperature values, the prediction results of the model are evaluated using commonly used prediction metrics: MAE (mean absolute error), MSE (mean squared error), MAPE (mean absolute percentage error), and R² (coefficient of determination) [35]. The formulas, meanings, and evaluation criteria of each metric are shown in Table 4, where $n$ is the total number of measured values, $y_i$ is the measured temperature value, $\hat{y}_i$ is the predicted temperature value, and $\bar{y}$ is the average of the $y_i$. Table 4. Model evaluation metrics.

MAE: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. Mean absolute error; the closer the value is to 0, the better the model.
MSE: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Mean squared error; the closer the value is to 0, the better the model.
MAPE: $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$. Mean absolute percentage error; the closer the value is to 0, the better the model.
R²: $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$. Coefficient of determination; the closer the value is to 1, the better the model.
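A small helper that computes the four metrics of Table 4, assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return the Table 4 metrics for measured vs. predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return {"MAE": mean_absolute_error(y_true, y_pred),
            "MSE": mean_squared_error(y_true, y_pred),
            "MAPE": mape,
            "R2": r2_score(y_true, y_pred)}
```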

Experimental Datasets
The experiments used the dataset after the processing described in Section 3.1, comprising monitoring data from the five major stations, with 35,064 sample records per station and several features, such as PM2.5, PM10, NO2, and PRES. The dataset is randomly divided into an 80% training set and a 20% test set for model training.

Experimental Environment
The configurations used for the experiments, such as software and hardware, are shown in Table 5.

Input and Output Variables
The model mainly solves the learning-based problem of mapping features between input and output variables. For the atmospheric temperature prediction experiment, the input and output variables are determined so as to conform to the requirements of the model.
(1) Model input: In this experiment, the model input is a time series variable consisting of a time step (window) and the independent feature variables.
(2) Model output: The model output is the prediction result $Y$, a one-dimensional array of predictions. In this experiment, the output variable of the model is temperature: $Y = [y_1, y_2, \ldots, y_i, \ldots, y_n]$ (i = 1, 2, ..., n).
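A sketch of how such input windows and targets can be built from the normalized array, assuming NumPy; the window length of 5 matches the step size used in the experiments, and target_col is the index of the TEMP column:

```python
import numpy as np

def make_windows(data, window=5, target_col=0):
    """Slice a (samples, features) array into inputs X of shape
    (n, window, features) and next-step temperature targets Y."""
    X, Y = [], []
    for i in range(len(data) - window):
        X.append(data[i:i + window])
        Y.append(data[i + window, target_col])
    return np.array(X), np.array(Y)
```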

Optimizer Selection
For the training of large-scale data, optimizers are generally needed to speed up the model learning rate and make the model converge faster. In machine learning, optimizers are mainly used to solve the gradient descent problem; the principle of gradient descent is shown in Equation (17).

$$\theta_{n+1} = \theta_n - \eta \nabla_\theta J(\theta) \quad (17)$$
In the above equation, $\eta$ is the learning rate; $\theta_n$ is the parameter before the update; $\theta_{n+1}$ is the parameter after the update; and $\nabla_\theta J(\theta)$ is the gradient of the cost function with respect to the current parameter.
For the SGD (stochastic gradient descent) optimizer, the parameters can be updated from a single piece of data, but the amount of data used per update is extremely small and the amplitude of the gradient update is extremely large. The ADAM (adaptive moment estimation) optimizer stores an exponentially decaying average $v_t$ of the squared historical gradients as well as an exponentially decaying average $m_t$ of the past gradients, which enables adaptive learning for each parameter:

$$g_t = \nabla_\theta J(\theta_{t-1})$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Here, $\beta_1$ is the exponential decay rate that controls the weight assignment between the momentum and the current gradient, usually takes a value close to 1, and defaults to 0.9; $\beta_2$ is the exponential decay rate that controls the influence of the previous squared gradients.

In this paper, the two optimizers SGD and ADAM are compared, with MSE as the model evaluation index. As shown in Figure 11a, when the SGD optimizer is used, the errors of both the training and test sets gradually decrease toward 0; however, the error of the test set is higher than that of the training set within the first 400 iterations, so the model overfits. In Figure 11b, when the ADAM optimizer is used, the errors of the training and test sets also gradually decrease toward zero and the error curves fit better; consequently, the ADAM optimizer is chosen for the optimization of the prediction model.
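The comparison of Figure 11 can be reproduced with a sketch like the following, assuming the model from the earlier sketch and windowed arrays X_train, y_train, X_test, y_test; the learning rate of 0.01 is one of the values tried in the parameter experiments:

```python
import tensorflow as tf

# Train identical copies of the model under SGD and ADAM and compare MSE.
for opt in (tf.keras.optimizers.SGD(learning_rate=0.01),
            tf.keras.optimizers.Adam(learning_rate=0.01)):
    m = tf.keras.models.clone_model(model)
    m.compile(optimizer=opt, loss="mse")
    hist = m.fit(X_train, y_train, validation_data=(X_test, y_test),
                 epochs=100, batch_size=128, verbose=0)
    print(type(opt).__name__, min(hist.history["val_loss"]))
```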

Model Parameters Configuration
In this paper, the training is based on the Python platform with the Sklearn framework. To avoid overfitting, a Dropout layer is added. Setting the dropout parameter to 0.01 means that each layer randomly discards the neuron weights of the network with a probability of 0.01, which improves the generalization ability of the built model.
In addition, a recurrent neural network generally takes a long time to train, and the training duration is mainly governed by the size of the input sample set and parameters such as epoch and batch size; hence, the model hyperparameters should be set reasonably to achieve efficient training.
Here, the setting of the dropout parameter is used as an example to illustrate the parameter-setting process; the other parameters are set in a similar way. The dropout is set to 0.01, 0.1, 0.5, and 0.9 in turn, the other parameters are kept constant, and the model is trained each time. It can be seen that the model achieves the highest R² value and the smallest accuracy error on both the training set and the test set when dropout = 0.01, indicating that the model fits best at that value. Therefore, the dropout parameter is set to 0.01. The change in prediction accuracy with dropout is shown in Figure 12.
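A sketch of this sweep, where build_model is a hypothetical factory that assembles the BiLSTM-Attention model with a given dropout rate, and the arrays come from the earlier windowing step:

```python
from sklearn.metrics import r2_score

# Vary dropout while holding the other parameters constant (cf. Figure 12).
for rate in (0.01, 0.1, 0.5, 0.9):
    m = build_model(dropout=rate)       # build_model: hypothetical helper
    m.fit(X_train, y_train, epochs=100, batch_size=128, verbose=0)
    r2 = r2_score(y_test, m.predict(X_test).ravel())
    print(f"dropout={rate}: R2={r2:.4f}")
```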
During the experiments, the model was trained using the control variable method, and the input parameters of the model, i.e., dropout, learning rate, epoch, batch size, and window, were configured separately, where dropout was chosen from 0-1 with four common values of 0, 0.01, 0.5, and 0.9 for the experiments. The learning rate was experimented with in decreasing order of 0.1, 0.01, and 0.001; epoch was experimented with in increasing steps of 100; batch size was experimented with in increasing steps of 128; and window was experimented with in increasing steps of 3. The meanings of the main parameters of the BiLSTM-Attention model and their values are provided in Table 6.

Model Training Process
For the experiments, the training and test sets were divided into a ratio of 8:2, with a sample size of 27,245 × 5 × 11 for the training set and 7713 × 5 × 11 for the test set. The experimental procedure is divided into the following steps.
(1) Input a training set sample of size 27,245 × 5 × 11, with a step size of 5 and a dimension of 11.
(2) Randomly initialize the model parameters, including dropout, learning rate, epoch, etc.
(3) The training data are learned by the BiLSTM, which outputs a feature vector of shape (none, 5, 11); the attention mechanism layer connects the weights of the temporal states, and finally the prediction of the atmospheric temperature value is output by the softmax function.
(4) The output predictions are compared with the true labels to calculate the cross-entropy loss, and the weight parameters in the network are updated by the ADAM optimizer, which back-propagates the error loss gradient according to the set learning rate.
(5) Repeat the model training process (1)-(4) according to the number of training steps (a minimal training sketch is given after Figure 13 below).
The flow chart for model training is shown in Figure 13.
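As a compact sketch of steps (1)-(5), assuming the model and helpers from the earlier sketches; 500 epochs follows the loss-curve discussion below, and the batch size follows the listed parameter values:

```python
# Train on the windowed training set and track test-set loss each epoch.
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=500, batch_size=128)

# Evaluate with the Table 4 metrics helper defined earlier.
y_pred = model.predict(X_test).ravel()
print(evaluate(y_test, y_pred))
```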

Loss Curve
The loss curves of the BiLSTM-Attention model training are shown in Figure 14a. At 500 iterations, the model fits the loss values of the training and test sets best and is most stable, i.e., the losses of both sets have converged, the difference between them is small, and the fitting effect is best. The loss curves of BiLSTM, LSTM-Attention, and LSTM are shown in Figure 14b-d, respectively; the fits of the comparison models are noisy and fluctuating, the test-set loss values vary from iteration to iteration, and the fitting effect is poor.

Visualization of Prediction Results
After the actual prediction using the BiLSTM-Attention model, the improved model shows very good prediction results on both the test set and the training set. A line graph of the model's prediction results on the test set is shown in Figure 15, where it can be seen that the errors between the true and predicted values are very small and the curves are very close, i.e., the predicted values reflect the magnitude of the true values well. To better express the quality of the prediction results, the errors between the real and predicted temperature values are displayed using box plots, as shown in Figure 16. It can be seen that the errors between the predicted and real values basically all lie between 0.1 and 0.3; very few errors are slightly larger, but these also lie between 0 and 1. The overall prediction errors are therefore extremely small and the model has a very good prediction effect, making it suitable for the field of atmospheric temperature prediction.

Comparative Testing of Models
To quantify the effectiveness of the BiLSTM-Attention model for temperature prediction, its prediction evaluation indexes were compared with those of BiLSTM, LSTM-Attention, and LSTM, using the Tiantan station data as an example; the results are shown in Table 7. It can be seen that the BiLSTM-Attention model outperforms the other models on both the test set and the training set: its prediction accuracy (R²) reaches 0.9618, while the mean squared error is only 0.0004, the mean absolute error only 0.0130, and the mean absolute percentage error only 4.2370. In terms of time, the average training time of the models is 207 s, while the execution time of the BiLSTM-Attention model is relatively short at 204 s, which shows that the training time of this model does not increase due to the addition of the attention mechanism.
As shown in Figures 17 and 18, the BiLSTM-Attention model shows the highest R² value and the smallest error for both the training and test sets. Comparing the prediction results of BiLSTM-Attention with the BiLSTM, LSTM-Attention, and LSTM models, its MAE values are reduced by 0.72%, 0.41%, and 1.24%, respectively; its MSE values are reduced by 0.03%, 0.02%, and 0.06%, respectively; and its R² values are improved by 2.73%, 1.23%, and 4.98%, respectively. This shows that the stability of the proposed improved model is better and the model relevance is stronger. The experimental results also effectively verify that the BiLSTM-Attention model has some superiority in the field of temperature prediction. Experiments were also conducted separately on the multi-site temperature data, and the prediction results of the model for each site are shown in Table 8. The R² value on the test set was maintained at 0.9638, MAE at 0.0133, MSE at 0.0004, and MAPE at 4.4130, indicating that the BiLSTM-Attention model has good generalization ability and portability across monitoring datasets from different regions.

Discussion
From Table 7, it can be seen that the BiLSTM [34] tends to have better prediction accuracy than the LSTM [22], which indicates that the bidirectional network structure can more fully consider the complete information of the sequence data in the forward and backward directions, thus improving the prediction accuracy of the model. At the same time, it can be seen that the improved network based on the attention mechanism [23] has better accuracy, which indicates that the attention mechanism can assign different attention weights to different stages of temperature change during training, so that the model focuses on the key sequence information as much as possible, thus improving temperature prediction accuracy.
The execution time of the model can also be used as a criterion to evaluate its quality. Controlling for the same sample set and the same hardware environment, the different comparison models were trained and their training times recorded, as shown in Table 7. The execution times of the four models are relatively close, with an average of 207 s; however, the execution time of the BiLSTM-Attention model is relatively shorter, indicating a faster execution speed.

Conclusions
In this paper, an improved symmetric BiLSTM network is proposed for the prediction of atmospheric temperature data, and Beijing temperature data are used as an example for validation and analysis, with the following main findings.
(1) The proposed BiLSTM-Attention model enables the model to efficiently extract feature data in a specific time step through a bidirectional LSTM network structure while retaining complete information between the past and the future. It then continuously and dynamically adjusts the weight values of different channels based on the attention mechanism, which in turn enables efficient allocation of computational resources, and can effectively improve the model's temperature prediction accuracy.
(2) The model was used for temperature data prediction and compared with the BiLSTM, LSTM-Attention, and LSTM models; the proposed BiLSTM-Attention model was found to have the highest prediction accuracy and the lowest error, with MAE reduced by 0.72%, 0.41%, and 1.24%, respectively, and an R² value of 0.9618, improved by 2.73%, 1.23%, and 4.98% compared with BiLSTM, LSTM-Attention, and LSTM, respectively. Thus, the BiLSTM-Attention model has good theoretical value and practical application significance, and can provide better solutions in the field of temperature prediction.
The following directions exist for future research that deserve further study.
(1) Consider using more efficient and faster hyperparameter optimization methods to optimize the model parameters, so as to obtain a parameter configuration better suited to the temperature prediction domain.
(2) In future research, more regional temperature datasets should be collected for a more complete prediction analysis, to make the model more reliable and adaptable.
(3) More detailed comparison experiments can be attempted in future studies to prove the superiority of the model.