Ionospheric TEC Prediction in China Based on the Multiple-Attention LSTM Model

Abstract: The prediction of the total electron content (TEC) of the ionosphere is of great significance for satellite communication, navigation and positioning. This paper presents a multiple-attention-mechanism-based LSTM (multiple-attention Long Short-Term Memory, MA-LSTM) TEC prediction model. The main achievements of this paper are as follows: (1) an L1 constraint is added to the LSTM-based TEC prediction model, which prevents excessive attention to any part of the input sequence during modelling and thus prevents overfitting; (2) multiple-attention modules are added to the TEC prediction model. Three parallel attention modules each compute attention scores over the output vectors of the LSTM layer and normalize them into an attention distribution through the softmax function. The vectors output by the LSTM layer are then weighted and summed with the corresponding attention distribution, so that important features are highlighted and focused on. To verify our model's performance, eight regions located in China were selected from the Center for Orbit Determination in Europe (CODE) TEC grid dataset. In these selected areas, comparative experiments were carried out with LSTM, GRU and Att-BiGRU. The results show that our proposed MA-LSTM model is clearly superior to the comparison models. This paper also discusses the prediction performance of the model in different months. The results show that the model performs best in July, August and September, with the R-square reaching above 0.99. In March, April and May, the R-square is slightly lower, but even at the worst time, the fitting degree between the predicted values and the real values still reaches 0.965. We also discuss the influence of a magnetic quiet period and a magnetic storm period on the prediction performance. The results show that in the magnetic quiet period, our model fits very well. In the magnetic storm period, the R-square is lower than in the magnetic quiet period, but it still reaches 0.989. The research in this paper provides a reliable method for the short-term prediction of ionospheric TEC.


Introduction
The ionosphere is an important ionized region of the Earth's upper atmosphere, at an altitude of approximately 60-1000 km. It is affected by solar storms, geomagnetic activity and other factors, which generate ionospheric disturbances and have an important impact on radio and other communication signals transmitted through the ionosphere [1,2]. The total electron content (TEC) is among the important parameters for studying ionospheric changes [3,4]. Accurate prediction of TEC is very important for communication systems such as satellite positioning and remote sensing systems [5,6]. With TEC prediction, corresponding protection measures can be taken for satellite navigation, satellite positioning and radio communication in the case of poor space weather conditions. Therefore, it is very important to study TEC prediction models and to predict TEC as early as possible.
There are two main types of ionospheric short-term prediction methods: the first combines observation data with ionospheric theoretical models [7], and the other is based on artificial neural networks (ANNs) [8,9]. Among them, the artificial neural network has become a popular tool in ionospheric TEC modeling and prediction due to its strong non-linear representation ability [10]. Yakubu et al. [11] used an ANN model to predict the daily and seasonal effects of TEC at the Indian equatorial station Changanacherry. Watthanasangmechai et al. [12] proposed a neural network model to predict TEC in Thailand. However, TEC data are typical time series data, which contain a strong temporal correlation. TEC prediction methods based on artificial neural networks only consider the spatial position of the data and cannot characterize the time series, which leads to large prediction errors. Inyurt et al. [13] showed that the ANN model could not reflect the time series characteristics of the data, resulting in large prediction errors and low prediction accuracy in different seasons. Huang et al. [14] showed that an RBF neural network is not sensitive to the daily changes of TEC, resulting in large prediction errors at night. Habarulema et al. [15] showed that the ANN model is susceptible to interference from solar activity: the TEC prediction error varies greatly between years of high and low solar activity, and the model is insensitive to the seasonal changes of TEC, resulting in low prediction accuracy. The recurrent neural network (RNN) is a chain-connected neural network that takes sequence data as input and performs recursion along the evolution direction of the sequence. It is a deep learning model that can characterize both the spatial and the temporal characteristics of the data, and it is the mainstream algorithm for time series modeling [16]. Yuan et al. [17] showed that the RNN can predict TEC; however, a unit in the RNN is mainly affected by the units near it, which makes it difficult for the gradient to be transferred back from later layers during long-sequence prediction. The resulting vanishing gradient makes the RNN unable to represent non-linear relationships over long time spans, that is, it cannot solve the long-term dependence of the data. Long Short-Term Memory (LSTM) solves the vanishing gradient problem through its unique gate structure [18]. Chimsuwan et al. [19] used the LSTM model to predict TEC. However, since the LSTM model treats all historical time steps equally, it cannot adaptively focus on important features, which limits its prediction accuracy. To solve this problem, this paper adds multiple attention modules and an L1 constraint to the traditional LSTM model. The attention modules can redistribute weights according to the importance of the multiple historical data points input to the model, so as to improve the attention of the model to important features [20-23]. We compared the proposed model with LSTM, GRU [24] and Att-BiGRU [25], and also tested the performance of the MA-LSTM model for predicting TEC in a magnetic quiet period and a medium magnetic storm period, and in different months.
The contributions of this paper are as follows: (1) The LSTM model with multiple attention modules is applied to TEC prediction, making TEC modelling more adaptive and obtaining higher prediction accuracy. (2) An L1 constraint is added to TEC prediction, which avoids the overfitting caused by excessive attention to any single historical observation in TEC modeling.

Data Description
The ionospheric data used in this paper are the TEC grid data of the Center for Orbit Determination in Europe (CODE), which have a temporal resolution of 2 h [26,27]. The geographical locations of the eight selected areas are shown in Table 1 and Figure 1, and the TEC values are shown in Figure 2.

Data Preprocessing
TEC data are typical time series data, and only stationary, non-random time series can be predicted. Therefore, a series of preprocessing steps is required before TEC prediction. The preprocessing of the selected ionospheric TEC data in this paper includes a stationarity test, difference processing, a pure randomness test, and normalization of the TEC data.

The TEC Data Stationarity Test and Difference Processing
The stationarity of a time series is the basic assumption of time series analysis; therefore, the stationarity of the TEC series needs to be tested before prediction. In this paper, we use the ADF (Augmented Dickey-Fuller) test to determine the stationarity of the TEC sequences. The test results show that the series of all eight regions are non-stationary. In order to transform the data into stationary time series for prediction, first-order difference processing is required. The first-order difference is calculated as follows:

∇x_t = x_t − x_{t−1} (1)

where ∇ is the first-order difference operator and x_t is the observation at time t. Figure 3 shows the first-order differences of the TEC data of the eight regions in Table 1. After the first-order difference processing, the ADF test is performed again; this time, all eight regions pass the test, that is, the first-order differenced data of the eight regions are stationary time series.

The Pure Randomness Test
Stationary time series are not necessarily predictable: a purely random (white noise) stationary series is unpredictable. Therefore, it is also necessary to test the pure randomness of the differenced TEC sequence. In this paper, the LB (Ljung-Box) method is used to test whether the series of observations in a given time period is purely random. The LB test results show that the TEC data after first-order difference processing are not purely random and can therefore be predicted.

Normalization of TEC Data
After the first-order difference processing, the original TEC data become a stationary non-random time series and can be predicted. However, the values still vary greatly across the whole data space, which would affect the prediction results, so the data need to be normalized. In this paper, min-max normalization is used to map the first-order differenced TEC data into [0, 1]. Min-max normalization is calculated as follows:

x'_i = (x_i − min(x)) / (max(x) − min(x)) (2)

where x_i is the TEC observation at time i and x is the vector composed of all TEC observations in a selected area.
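Min-max normalization and its inverse (needed to map model outputs back to TEC units) can be sketched as follows; the function names are illustrative:

```python
import numpy as np

def min_max_normalize(x):
    """Map a 1-D series into [0, 1]; return min/max so that
    predictions can later be mapped back to TEC units."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def min_max_denormalize(x_norm, x_min, x_max):
    """Inverse transform: recover the original scale."""
    return x_norm * (x_max - x_min) + x_min

# Illustrative first-order differenced TEC values.
diff_tec = np.array([-1.5, 0.2, 3.0, -0.7, 1.1])
norm, lo_, hi_ = min_max_normalize(diff_tec)
```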

Sample Making
After the stationarity test, difference processing, the pure randomness test and normalization, the experimental samples are prepared. This paper selects the TEC data from 0:00 on 1 January 2002 to 24:00 on 31 December 2011 in eight Chinese regions; the total number of observation points in each region is 47476, which becomes 47475 after the first-order difference processing. A segmentation method with a sliding window of 13 + p (p indicates the number of samples to be predicted in the future) is used to turn the normalized data into samples. The sample making process is shown in Figure 4. x_i is the input of sample i, containing 13 observations, as shown by the orange data in Figure 4. y_i is the output of sample i. When predicting the data within t hours in the future, y_i contains t/2 points, as shown by the blue points in Figure 4 (this paper predicts the TEC data for the next two hours, so t = 2 and p = 1). The whole experiment is shown in Figure 5.
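The sliding-window segmentation can be sketched as follows: a window of length 13 + p slides over the series, the first 13 points form the input x_i and the last p points form the target y_i (the function name is illustrative):

```python
import numpy as np

def make_samples(series, window=13, p=1):
    """Slide a window of length window + p over the series; the first
    `window` points of each position form the input x_i and the last
    p points form the target y_i."""
    X, y = [], []
    for start in range(len(series) - window - p + 1):
        X.append(series[start:start + window])
        y.append(series[start + window:start + window + p])
    return np.array(X), np.array(y)

# Length 47475 matches one region's series after first-order differencing;
# the values here are placeholders, not real TEC data.
series = np.arange(47475, dtype=float)
X, y = make_samples(series, window=13, p=1)
```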

Experimental Environment
This paper builds the MA-LSTM model based on Python 3.6 and the Keras deep learning library. The experimental equipment is configured with an Intel i5-7200U CPU, 8 GB of memory and a 500 GB solid-state drive, and the GPU is an NVIDIA GeForce 940MX.

Evaluation Indexes
In order to test the performance of the models in TEC prediction, this paper uses three evaluation indexes: the root mean square error (RMSE), the R-square and the mean absolute percentage error (MAPE). The calculation formulas are shown in Equations (3)-(5):

RMSE = sqrt( (1/n) ∑_{i=1}^{n} (ytrue_i − ypre_i)^2 ) (3)

R-square = 1 − ∑_{i=1}^{n} (ytrue_i − ypre_i)^2 / ∑_{i=1}^{n} (ytrue_i − ymean)^2 (4)

MAPE = (1/n) ∑_{i=1}^{n} |(ytrue_i − ypre_i) / ytrue_i| (5)

where n is the number of test samples, ytrue_i is the true value of test sample i, ypre_i is the predicted value of test sample i, and ymean is the average of the true values of all the test samples. RMSE describes the prediction error: the smaller the RMSE, the better the prediction performance of the model. The R-square describes the fitting degree between the predicted and real values: the closer it is to 1, the better the fitting ability of the model to the TEC observations. MAPE represents the relative error between the predicted and true values: the closer MAPE is to zero, the higher the prediction accuracy of the model.

The LSTM Model

LSTM [28] is a recurrent neural network composed of several LSTM units. One LSTM unit includes three gate structures, as shown in Figure 6, namely the input gate, the forget gate and the output gate. These three gates are connected through the memory cell state to realize the purposeful selection of features in the network. The calculation formulas of the modules of an LSTM unit are as follows:

f_t = σ(W_f × [h_{t−1}, x_t] + b_f) (6)
i_t = σ(W_i × [h_{t−1}, x_t] + b_i) (7)
c̃_t = tanh(W_c × [h_{t−1}, x_t] + b_c) (8)
c_t = f_t * c_{t−1} + i_t * c̃_t (9)
o_t = σ(W_o × [h_{t−1}, x_t] + b_o) (10)
h_t = o_t * tanh(c_t) (11)

where [ ] represents vector concatenation, × represents the matrix dot product, * represents the Hadamard product, σ represents the sigmoid activation function, and tanh represents the hyperbolic tangent activation function.
The weight matrices W_f, W_i, W_c, W_o and the bias vectors b_f, b_i, b_c, b_o are learned during training. x_t represents the input of the network at time t, and c̃_t represents the candidate update of the memory cell state at time t. h_t represents the state of the hidden layer at time t. i_t determines how much of the candidate state c̃_t is used to update c_t, and f_t determines how much information in the memory cell state at time t−1 is retained. The state of the hidden layer at time t is determined by o_t and c_t.
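As a concrete illustration of the gate computations above, here is a minimal single-step LSTM cell in NumPy. The weights are random stand-ins, and as a common implementation shortcut the four gate projections are stacked into one matrix W rather than kept as separate W_f, W_i, W_c, W_o:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: the gates read the concatenation [h_{t-1}, x_t]
    and update the cell state. W stacks the forget, input, candidate
    and output projections into one (4*n_h, n_h + n_in) matrix."""
    z = np.concatenate([h_prev, x_t])
    n_h = h_prev.shape[0]
    a = W @ z + b
    f_t = sigmoid(a[0:n_h])                 # forget gate
    i_t = sigmoid(a[n_h:2 * n_h])           # input gate
    c_tilde = np.tanh(a[2 * n_h:3 * n_h])   # candidate cell state
    o_t = sigmoid(a[3 * n_h:4 * n_h])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde      # cell state update
    h_t = o_t * np.tanh(c_t)                # hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
n_in, n_h = 1, 4
W = rng.normal(0, 0.1, (4 * n_h, n_h + n_in))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(np.array([0.5]), h, c, W, b)
```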

LSTM Based on Multiple-Attention Modules (MA-LSTM) Proposed in This Paper
When the traditional LSTM model is used to model TEC data, the data at every position in the historical sequence are weighted equally when predicting future data, so TEC cannot be modelled adaptively and accurately. To solve this problem, this paper adds three attention modules [29], which adaptively assign weights to each input sequence so that the model can selectively focus on the historical sequence and reduce the prediction error. The proposed MA-LSTM TEC prediction model is shown in Figure 7. It includes four parts: the input layer, the LSTM layer, the multiple-attention layer and the output layer.
Input layer: Each sample includes two parts, the feature vector x_n and the regression value y_n (i.e., the value to be predicted), which together form an input sample as the pair [x_n, y_n].
LSTM Layer: This layer includes two independent LSTM neuron layers, one of which has an L1 constraint and the other of which does not. The two layers each receive the feature-regression pair [x_n, y_n] from the input layer and process it separately. The processed results are concatenated and transmitted to the next layer. The calculation formulas of this layer are as follows:

k_n = LSTM([x_n, y_n]) (12)
m_n = LSTM([x_n, y_n]) (13)
r_n = concat([k_n, m_n]) (14)

where k_n represents the output of the LSTM layer with the L1 constraint, m_n represents the output of the unconstrained LSTM layer, and r_n represents the vector obtained by concatenating k_n and m_n. All r_n are connected to form the output of this layer, h_n.
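A minimal Keras sketch of this two-branch layer follows. The layer sizes are illustrative (the paper's tuned hyperparameters are in Table 2), and the L1 constraint is expressed here as an L1 kernel regularizer on one branch:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

window = 13  # 13 historical observations per sample
inputs = keras.Input(shape=(window, 1))

# Constrained branch: L1 regularization on the input-to-gate weights.
k = layers.LSTM(32, return_sequences=True,
                kernel_regularizer=regularizers.l1(1e-4))(inputs)

# Unconstrained branch over the same input.
m = layers.LSTM(32, return_sequences=True)(inputs)

# r_n = concat([k_n, m_n]): splice the two branch outputs per time step.
r = layers.Concatenate()([k, m])
model = keras.Model(inputs, r)
```

In the full model, the concatenated output h_n would feed the three attention branches of the next layer.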

Multiple-attention layer:
This layer contains three parallel attention branches, which respectively receive the data from the LSTM layer. After receiving the data, each branch first calculates the similarity between each feature in the received data and the regression value through the attention function, and each feature obtains an attention score. The attention function selected in this paper is calculated as follows:

score(h_n, y_n) = V^T tanh(W h_n + U y_n) (15)

where W, U, V are parameters learned in the training process. After the attention scores are obtained, they are normalized with the softmax function to obtain the attention probability distribution:

a_1n = softmax(score(h_n, y_n)) (16)
a_2n = softmax(score(h_n, y_n)) (17)
a_3n = softmax(score(h_n, y_n)) (18)

where a_1n, a_2n and a_3n represent the respective attention distribution values of the three branches. The softmax function is as follows:

softmax(score(h_i, y_i)) = exp(score(h_i, y_i)) / ∑_{n=1}^{N} exp(score(h_n, y_n)) (19)

Then, a_1n, a_2n and a_3n are respectively multiplied with h_n, and the products are summed to obtain the output value of each attention branch:

c_j = ∑_{n=1}^{N} a_jn h_n, j = 1, 2, 3 (20)

Finally, the attention values of the three branches are concatenated as the output of the attention layer.
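The score, softmax and weighted-sum steps of one attention branch can be sketched in NumPy as follows; the dimensions and random parameters are illustrative stand-ins for the learned ones:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a vector of scores."""
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_branch(H, y, W, U, V):
    """One additive-attention branch: score each time step of the LSTM
    output H against the regression value y, normalize the scores with
    softmax, and return the attention-weighted sum of H.
    H: (T, d) LSTM outputs; y: (d_y,) target; W, U, V: parameters."""
    scores = np.array([V @ np.tanh(W @ h + U @ y) for h in H])
    a = softmax(scores)   # attention distribution over the T time steps
    context = a @ H       # weighted sum of the LSTM outputs
    return context, a

rng = np.random.default_rng(3)
T, d, d_y, d_a = 13, 8, 1, 6
H = rng.normal(size=(T, d))
y = rng.normal(size=(d_y,))
W = rng.normal(size=(d_a, d))
U = rng.normal(size=(d_a, d_y))
V = rng.normal(size=(d_a,))
context, a = attention_branch(H, y, W, U, V)
```

The full multiple-attention layer would run three such branches (with separately learned parameters) and concatenate their outputs.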
Output layer: This layer includes a flatten layer and a fully connected layer. It maps the previous results into predicted values and outputs them.

Model Parameter Selection
When MA-LSTM is used for TEC modeling, the optimal parameters of the model need to be determined first. In this paper, the grid search method is adopted to find the optimal hyperparameters. The results of the grid search are shown in Table 2.
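A grid search can be sketched as an exhaustive loop over all hyperparameter combinations. The grid values and the placeholder objective below are hypothetical, standing in for training MA-LSTM on each combination and scoring it on a validation split (the actual search space and results are in Table 2):

```python
import itertools

# Hypothetical search space, for illustration only.
param_grid = {
    "units": [16, 32, 64],
    "batch_size": [32, 64],
    "learning_rate": [1e-2, 1e-3],
}

def validation_rmse(params):
    """Placeholder objective standing in for 'train MA-LSTM with these
    hyperparameters and return validation RMSE'."""
    return abs(params["units"] - 32) + params["batch_size"] / 64 + params["learning_rate"]

best_params, best_score = None, float("inf")
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = validation_rmse(params)
    if score < best_score:
        best_params, best_score = params, score
```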

Prediction Comparison of Different Stations
This paper first discusses TEC prediction for the second hour ahead. The MA-LSTM model proposed in this paper is compared with the classical time series models LSTM, GRU and Att-BiGRU (in terms of structure, the LSTM is a three-layer gate structure, the GRU is a two-layer gate structure, and the Att-BiGRU is a two-layer gate structure in both directions). In the selected eight regions, samples were made according to the method described in Section 2.2. A total of 90% of the samples were used for training the models, and the remaining samples were used for testing; the results are shown in Figure 8 and Table 3. As can be seen from Figure 8 and Table 3, the models with an attention mechanism (MA-LSTM and Att-BiGRU) perform significantly better than those without (LSTM and GRU), and the multiple-attention model proposed in this paper (MA-LSTM) is superior to the single-attention model (Att-BiGRU). In the eight selected regions, the average RMSE of the MA-LSTM model is 1.171 TECU, reduced by 253.29%, 228.44% and 35.01% compared with LSTM, GRU and Att-BiGRU, respectively. The average R-square is 0.988, increased by 15.29%, 13.04% and 0.01%, respectively. The average MAPE of our model is 0.054, improved by 248.15%, 231.48% and 85.19%, respectively.

Prediction Comparison of Different Months
This section discusses the prediction performance of the LSTM, GRU, Att-BiGRU and MA-LSTM models in different months for the eight regions in Table 1. The training data are the 42715 TEC sequences from 0:00 on 1 January 2002 to 18:00 on 30 December 2010, and the test data are the TEC sequences from the 12 months of 2011. The mean prediction performance of the different models in different months is shown in Table 4 and Figure 9. It can be seen that the fitting index R-square of our model exceeds 0.99 in July, August and September, and the maximum, 0.993, appears in August. The lowest and highest monthly RMSE values of our model are 1.086 TECU and 1.267 TECU, which are lower than those of the comparison models. The minimum monthly MAPE of our model is 0.042 and the maximum is 0.071, which are significantly lower than those of the comparison models. That is to say, in all selected regions, in all months and in all comparison indicators, our MA-LSTM model outperforms the comparison models. From Figure 9 we can also see that the R-square of our model is high in July, August and September, which indicates that the predicted values fit the real values well in these three months. In March, April and May, the R-square is slightly lower and the prediction performance of the model is slightly worse, but even at the worst time, the fitting degree between the predicted values and the real values still reaches 0.965.
Figure 9. Comparison of (a) RMSE, (b) R-square, and (c) MAPE values for various models in different months.

Prediction Comparison of Different Geomagnetic Conditions
In order to further study the prediction ability of the MA-LSTM model under different geomagnetic activities, we divided the test data into a magnetic quiet period and a magnetic storm period for comparative analysis (in this paper, samples with Kp < 3 are taken as magnetic quiet days, and samples with Kp > 3 and −100 < Dst ≤ −50 are taken as magnetic storm periods). Histograms of the absolute error distribution during the magnetic quiet and magnetic storm periods are shown in Figure 10. During the magnetic quiet period, the absolute errors of the predictions are concentrated, basically within [−2, 2], while during the magnetic storm the error distribution is wider, reaching [−3, 3]; that is, the prediction of the MA-LSTM model during the magnetic quiet period is better than that during the magnetic storm. However, even during the magnetic storm period, the R-square of our model still reaches 0.989 in the selected areas. Figure 11 shows the comparison between the predicted and true values on a magnetic quiet day and a magnetic storm day. On the magnetic quiet day, the predicted values match the real values almost perfectly, while on the magnetic storm day, the error between the predicted and true values is relatively large and the fit is not as good.
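The period classification criteria above can be expressed directly in code; the function name is illustrative, and samples matching neither criterion are left out of the comparison:

```python
def classify_geomagnetic(kp, dst):
    """Label a sample by geomagnetic condition following the paper's
    criteria: Kp < 3 -> magnetic quiet day; Kp > 3 with
    -100 < Dst <= -50 -> (moderate) magnetic storm period."""
    if kp < 3:
        return "quiet"
    if kp > 3 and -100 < dst <= -50:
        return "storm"
    return "other"
```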

Conclusions
In this paper, we proposed a multiple-attention LSTM (MA-LSTM)-based TEC prediction model and applied it to the TEC prediction of eight locations in China: Beijing, Guangzhou, Harbin, Tibet, Yushu, Ziyang, Hangzhou and Heze. We compared our model with LSTM, GRU and Att-BiGRU. The experimental results show that the MA-LSTM model is clearly superior to the comparison models. We also discussed the prediction performance of our model in different months. The results show that the MA-LSTM model performs best in July, August and September, with the R-square reaching above 0.99. In March, April and May, the prediction performance is slightly worse, but even at the worst time, the R-square still reaches 0.965. We further discussed the prediction performance during a magnetic quiet period and a magnetic storm period. The results show that the prediction during the magnetic quiet period is slightly better than that during the magnetic storm period.
In future work, we will study the grid prediction of TEC at multiple future time points.