Forecasting the 10.7-cm Solar Radio Flux Using Deep CNN-LSTM Neural Networks

Abstract: Predicting the time series of the 10.7-cm solar radio flux is a challenging task because of its daily variability. This paper proposes a non-linear method, a combined convolutional and recurrent neural network model, to achieve end-to-end F10.7 forecasts. The network consists of a one-dimensional convolutional neural network and a long short-term memory network. The CNN extracts features from the original F10.7 data; the long short-term memory network then trains on the feature signals and outputs the predicted values. The F10.7 daily data during 2003–2014 are used as the testing set. The mean absolute percentage error values are approximately 2.04%, 2.78%, and 4.66% for 1-day, 3-day, and 7-day forecasts, respectively. The statistical results for the root mean square error and the Spearman correlation coefficient show a superior effect as a whole for the 1–27-day forecasts, compared with ordinary single neural networks and combination models.


Introduction
The 10.7-cm solar radio flux (F10.7) is the solar radio emission intensity in a 100-MHz-wide band centered at 2800 MHz (i.e., a 10.7-cm wavelength). It is measured in solar flux units (sfu), where 1 sfu = 10^{-22} W m^{-2} Hz^{-1}. The systematic F10.7 record started in 1947, measured by the National Research Council of Canada [1]. F10.7 originates primarily from the upper chromosphere and the base of the corona and is closely related to the active regions of the Sun. The formation of F10.7 involves thermal free-free emission and thermal gyroresonance processes [2,3]. Thermal free-free emission takes place in the chromosphere and corona, as well as in plasma concentrations in the magnetic fields that support the active regions of the chromosphere and corona. On the other hand, gyroresonance emission occurs near sunspots with sufficiently strong magnetic fields. F10.7 is more readily measurable and more independent of the weather on Earth than other metrics, such as the sunspot number, extreme ultraviolet (EUV), ultraviolet (UV) and X-rays [4]. It is often used as an input parameter to predict and reconstruct atmospheric density in low-Earth orbit (Ahluwalia 2016) and the density distribution of ionospheric electrons [5]. The prediction of F10.7 is of great significance in the aerospace field [6,7]. However, due to its daily variability, high-precision forecasting in the short and mid term is often challenging to accomplish.
The traditional prediction method for F10.7 is to construct an autoregressive model, which usually uses linear mathematical expressions for the functional relationship between the forecast and observed values. In Ref. [8], a Fourier series method based on the short-period oscillation of solar radiation was proposed to evaluate the effect of F10.7 prediction. In Ref. [9], a 54-order autoregression model was used to forecast F10.7 for the subsequent 27 days; the results showed that the average relative error of the forecasts from 2003 to 2007 was 9.52%, similar to the accuracy of the prediction model used by the US Air Force. Ref. [10] adopted a recursive linear autoregressive model, which used the first n days to predict one day ahead; the predicted value was then incorporated into the model's input and used for the next prediction. Ref. [4] proposed a linear model that used the observations of the preceding 81 days to predict F10.7 for the next 1–45 days. The results showed that the correlation coefficient between prediction and observation with this method was 0.99 and 0.85 for 1-day and 40-day forecasting, respectively.
Furthermore, specific external parameters from solar activity have been used to predict F10.7 by physical approaches, such as EUV images of the solar disk, the sunspot number, the ionospheric parameter Fo (the critical frequency of the ionospheric layer) and the magnetic flux. Ref. [11] used the index P_SR, defined from the intensity values of solar EUV images, to establish an empirical model for F10.7 prediction. Refs. [3,12] have also successfully achieved short- and mid-term prediction of F10.7 by constructing a magnetic flux transport model, whose Spearman correlation index is generally higher than 0.95 for 3-day forecasts.
Most studies of F10.7 forecasting have focused on constructing linear models, which usually have a relatively stable effect in mid- and long-term prediction. A significant drawback of the above methods is their lack of high-quality short- and mid-term forecasting ability. Substantial research on machine learning has been undertaken to alleviate this dilemma. In Ref. [13], support vector regression (SVR) was used to make short-term predictions of F10.7. This method reduced the computational complexity of the training process through a learning-based algorithm and achieved an accuracy close to that of a traditional feed-forward neural network with less data. However, limited by the sample size, that work did not test the case of moderate solar activity years; in addition, the lack of reference models makes it impossible to gauge its level of prediction. Ref. [14] adopted the backpropagation (BP) neural network algorithm to build an F10.7 prediction model, using the preceding 54 days as the input of the neural network to predict the next 1–3 days. The results show that this method is superior to the SVR method in short-term prediction. To further improve the prediction effect, some scholars extracted the signal characteristics of the original data by signal processing methods and then predicted them by machine learning. Common techniques include empirical mode decomposition (EMD) [15,16], the Fourier transform [17] and wavelet analysis [18,19]. The literature above has exemplified the effectiveness of machine learning methods, signal processing techniques and combinations of the two, but these approaches still face the following limitations: (1) Traditional machine learning methods are usually shallow neural networks. As the number of network layers increases, vanishing or exploding gradients occur, which decreases prediction accuracy. (2) Limited by the network depth, shallow neural network algorithms extract mostly surface-level signal features, which is not sufficient to explain the model; in many cases, sample features still need to be extracted manually as input. (3) Signal processing methods may lose or distort some signal features. In addition, the extracted features depend on manual experience, which easily causes instability in feature extraction. These factors limit the prediction effect of the combined approach.
We propose a time series prediction model for short- and mid-term F10.7 forecasting to address the above deficiencies. The model comprises two independent networks, a feature extraction network and a prediction network, respectively devised as a one-dimensional convolutional neural network (CNN) and a long short-term memory network (LSTM). In the F10.7 prediction task, the CNN first adaptively extracts signal features from the raw time series; these feature samples are then trained and predicted by the LSTM. The main contribution is summarized as follows: (1) a one-dimensional CNN is designed for use on sequences, realizing an end-to-end forecasting model for the F10.7 data. The rest of this paper is divided into five parts. Section 2 introduces the theoretical background of the CNN and LSTM. Section 3 presents the framework of the proposed model. Section 4 deals with the results of the case studies. Section 5 gives the conclusion.

Convolutional Neural Network
CNN was proposed by Yann LeCun in 1998 [20] and has primarily been used in computer image processing since. The CNN model contains a feature extraction network and a classification network. The feature extraction network consists of convolutional and pooling layers, which extract and compress the input signals through convolution and down-sampling operations. The classification network consists of a fully connected layer. The parameters and weights of the two network parts are updated by forward propagation and error back-propagation. The convolutional layer is the representative layer of the CNN. In essence, it is a series of digital filters from which feature extraction is possible. A convolution kernel extracts the corresponding features from the input signal by a local convolution operation. In addition, the local connectivity and weight sharing of the convolutional layer significantly reduce the number of network parameters, accelerating training and helping to avoid over-fitting. The convolution operation is shown in Formula (1):

$$x_j^l = f\Bigl(\sum_{i \in M_j} x_i^{l-1} * \omega_{ij}^l + b_j^l\Bigr), \quad (1)$$

where l represents the layer number, i and j denote the serial numbers of the feature maps, and M_j is the set of feature maps of the (l − 1)th layer connected with the jth feature map of the lth layer. ω_{ij}^l represents the convolution kernel weight connecting the ith feature map of the (l − 1)th layer with the jth feature map of the lth layer; b_j^l represents the bias of the jth feature map at the lth layer; the symbol * is the convolution operation.
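As an illustration of Formula (1) (a minimal sketch, not the authors' implementation), the following NumPy code applies several one-dimensional kernels to a single-channel input sequence; the function name `conv1d` and the toy data are our own:

```python
import numpy as np

def conv1d(x, kernels, biases, activation=np.tanh):
    """Valid 1-D convolution: one input channel, several output feature maps.

    x        : (T,) input sequence
    kernels  : (F, K) F kernels of width K (the w_ij in Formula (1))
    biases   : (F,)   one bias per feature map
    returns  : (F, T-K+1) activated feature maps
    """
    F, K = kernels.shape
    T = len(x)
    out = np.empty((F, T - K + 1))
    for j in range(F):
        for t in range(T - K + 1):
            # local dot product between the kernel and a signal window
            out[j, t] = np.dot(x[t:t + K], kernels[j]) + biases[j]
    return activation(out)

x = np.sin(np.linspace(0.0, 6.0, 30))            # toy 30-step series
maps = conv1d(x, np.random.randn(4, 5), np.zeros(4))
print(maps.shape)                                # (4, 26)
```

As in most deep-learning libraries, this is strictly cross-correlation (the kernel is not flipped), which is what "convolution" conventionally means in a CNN.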
The pooling layer, also known as the down-sampling layer, is generally located between two convolutional layers and aims to reduce the dimension of the feature map while retaining important feature information. Pooling operations include maximum pooling, average pooling, and summation pooling. Maximum pooling is used most often and is expressed mathematically as:

$$x_j^l = f\bigl(\beta_j^l \,\mathrm{down}(x_j^{l-1}) + b_j^l\bigr), \quad (2)$$

where down(·) is the down-sampling function; x_j^{l−1} represents the jth feature map of the (l − 1)th layer; β_j^l and b_j^l represent the weight and bias of the jth feature map in the lth layer, respectively.
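The down-sampling step of Formula (2) can be sketched as non-overlapping maximum pooling over the time axis (a minimal NumPy sketch under our own naming, omitting the weight β and bias for clarity):

```python
import numpy as np

def max_pool1d(feature_maps, size=2):
    """Non-overlapping 1-D max pooling over the time axis.

    feature_maps : (F, T) array; trailing steps that do not
                   fill a complete window are dropped.
    returns      : (F, T // size) pooled maps
    """
    F, T = feature_maps.shape
    T_out = T // size
    trimmed = feature_maps[:, :T_out * size].reshape(F, T_out, size)
    return trimmed.max(axis=2)   # keep the strongest response per window

fm = np.array([[1., 3., 2., 5., 4., 0.]])
print(max_pool1d(fm, 2))         # [[3. 5. 4.]]
```

Each output step keeps only the strongest activation in its window, which halves the sequence length while preserving the dominant features.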
The fully connected layer is located at the end of the CNN, usually as a traditional multilayer perceptron. The high-dimensional feature signals are reduced to one-dimensional feature vectors before being fed into the fully connected layer, in which the input and output layers are fully connected. An activation function is applied at the output layer to produce the multi-class output. The forward propagation formula of the full connection is:

$$z_j^{l+1} = \sum_i \omega_{ij}^{l+1} a_i^l + b_j^{l+1}, \quad (3)$$

where ω_{ij}^{l+1} represents the weight between the ith neuron of the lth layer and the jth neuron of the (l + 1)th layer; b_j^{l+1} is the bias of the jth neuron of the (l + 1)th layer; a_i^l is the activation value of the ith neuron of the lth layer; z_j^{l+1} represents the output value of the jth neuron at the (l + 1)th layer.
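The fully connected forward pass of Formula (3) amounts to a single matrix-vector product; a short sketch (our own toy weights, with ReLU chosen as the activation for illustration):

```python
import numpy as np

def dense(a, W, b, activation=lambda z: np.maximum(z, 0.0)):
    """Fully connected forward pass: z_j = sum_i w_ij * a_i + b_j."""
    return activation(W @ a + b)

a = np.array([1.0, -2.0, 0.5])                   # activations from layer l
W = np.array([[0.2, 0.1, -0.3],                  # 2 x 3 weight matrix
              [0.0, 0.5,  0.4]])
b = np.array([0.3, -0.2])
out = dense(a, W, b)
print(out)                                       # ReLU of [0.15, -1.0]
```

With these numbers, W @ a + b = [0.15, -1.0], and the ReLU zeroes the negative component.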

Implementation Scheme of CNN for F10.7 Prediction
CNN was initially used for 2D image processing. Due to its excellent deep feature extraction capability, more and more scholars have explored its application to time series prediction in recent years [21–23]. The most significant difference between CNN in time series prediction and in image recognition tasks is the type of input signal. In image recognition, the original signal is two-dimensional data, which can be used directly for training. In contrast, in time series prediction the input signal is usually one-dimensional, such as vibration, voltage, or sound signals. To be adopted by a CNN, corresponding changes are necessary: either the convolution kernels and pooling layers are reset to one dimension, or the time-series signal is reconstructed as a two-dimensional vector. Compared with the latter scheme, using the original one-dimensional signal as the network's input is beneficial for preserving the original information and reducing the computational effort. It takes full advantage of the CNN's integration of automatic signal feature extraction, signal dimensionality reduction, feature selection and pattern classification to achieve "end-to-end" prediction. Prior studies have adopted this scheme and verified the algorithm's validity [24,25].

Long Short-Term Memory Network
The LSTM model is an improved RNN architecture proposed by Hochreiter and Schmidhuber [26]. Its special hidden units endow it with the capacity to store long-term information. Recently, a substantial body of literature has demonstrated that LSTM models yield impressive results in a variety of time series tasks, including language modelling [27], traffic speed prediction [28] and travel time prediction [29]. The internal structure of the LSTM unit is shown in Figure 1. The training process of the LSTM includes forward calculation and error backpropagation, in which the forward calculation is realized by the forget gate, input gate and output gate:

$$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f), \quad (4)$$
$$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i), \quad (5)$$
$$\tilde{c}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c), \quad (6)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad (7)$$
$$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o), \quad (8)$$
$$h_t = o_t \odot \tanh(c_t), \quad (9)$$

Figure 1. Structure of a cell in LSTM. Adapted from Le [30].

where w_i and w_c are the weight matrices of the input gate and candidate state, w_f and w_o are the weight matrices of the forget gate and output gate, respectively, and b denotes the bias of the corresponding gate. [h_{t−1}, x_t] stands for the concatenation of h_{t−1} and x_t. σ and tanh are the activation functions, shown in Formulas (10) and (11):

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \quad (10)$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \quad (11)$$

The LSTM has three inputs at time t: the network input value x_t at the current time, the LSTM output value h_{t−1} at the previous time, and the unit state c_{t−1} at the previous time.
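Formulas (4)–(9) can be sketched as a single forward step in NumPy (our own minimal implementation, with toy dimensions and random weights; not the paper's trained network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step, following Formulas (4)-(9).

    W : dict of (H, H+D) gate matrices acting on [h_{t-1}, x_t]
    b : dict of (H,) gate biases, keyed 'f', 'i', 'c', 'o'
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W['f'] @ z + b['f'])         # forget gate   (4)
    i = sigmoid(W['i'] @ z + b['i'])         # input gate    (5)
    c_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate     (6)
    c = f * c_prev + i * c_tilde             # cell state    (7)
    o = sigmoid(W['o'] @ z + b['o'])         # output gate   (8)
    h = o * np.tanh(c)                       # hidden state  (9)
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                                  # input and hidden sizes
W = {g: 0.1 * rng.normal(size=(H, H + D)) for g in 'fico'}
b = {g: np.zeros(H) for g in 'fico'}
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):          # run a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)                               # (4,)
```

Because h_t is an output gate times tanh(c_t), every component of the hidden state stays strictly within (−1, 1).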
The forget gate determines the proportion of the unit state inherited from the preceding moment. The input gate determines how much of the current input is saved to the cell state. The output gate controls which part of the cell state becomes the current output value of the LSTM. The backpropagation of the LSTM comprises the following three steps: Step 1. The output value of each neuron is calculated forward by Formulas (4)–(9); Step 2. The error of each neuron is calculated in reverse. As in an RNN, the backpropagation of the LSTM error proceeds in two directions: one is backpropagation through time, in which the error term at each preceding moment is calculated starting from the current moment t; the other is propagating the error to the layer below; Step 3. The gradient of each weight is calculated according to the corresponding error.

The Overall Framework
Some literature on time series problems [31,32] refined the signal characteristics before the machine-learning process, aiming to further improve prediction accuracy. Nonetheless, such practice usually needs to be done manually, which relies on expert experience and is time-consuming. Furthermore, some signal feature extraction methods [33,34] may lose or distort part of the raw information, which adversely affects forecasting.
The CNN-LSTM model in this paper implemented unattended deep feature extraction from the raw signal through end-to-end signal processing. It consisted of two functional networks in series: the feature extraction network (i.e., the CNN model) and the sequence prediction network (i.e., the LSTM model); their structure is shown schematically in Figure 2, with the following core procedure: Step 1. Normalize the F10.7 data set and divide it into a training set and a test set. Take each consecutive 27-day sequence as a sample. The training and test samples were constructed using the sliding window method (sliding step set to 1). The training samples were randomly fed into the network for training.
Step 2. CNN performs feature extraction on F10.7 and outputs feature signals.
Step 3. LSTM trains on the feature signals and gives the prediction results. The prediction results are denormalized to obtain the final predicted values.
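The sample construction of Step 1 and the denormalization of Step 3 can be sketched as follows (a minimal NumPy sketch; min-max normalization, the `horizon` parameter and the toy series are our own assumptions, since the paper does not spell out these details):

```python
import numpy as np

def make_samples(series, window=27, horizon=1):
    """Min-max normalize, then slide a `window`-day input over the
    series with step 1, paired with the value `horizon` days ahead."""
    lo, hi = series.min(), series.max()
    s = (series - lo) / (hi - lo)            # normalize to [0, 1]
    X, y = [], []
    for t in range(len(s) - window - horizon + 1):
        X.append(s[t:t + window])            # 27 consecutive days as input
        y.append(s[t + window + horizon - 1])  # target `horizon` days ahead
    return np.array(X), np.array(y), (lo, hi)

def denormalize(y_norm, bounds):
    """Map predictions back to physical units (sfu)."""
    lo, hi = bounds
    return y_norm * (hi - lo) + lo

f107 = 70 + 30 * np.random.rand(400)         # toy stand-in for daily F10.7
X, y, bounds = make_samples(f107, window=27, horizon=1)
print(X.shape, y.shape)                      # (373, 27) (373,)
```

With a window of 27 and a 1-day horizon, a series of length T yields T − 27 input/target pairs, each input being one solar rotation of history.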

The Networks Architectures and Parameters Settings
The CNN-LSTM is a cascade connection of a CNN and an LSTM, which play the roles of feature extraction and sequence prediction, respectively. Their detailed structure and parameter settings are shown in Table 1. The CNN is composed of convolutional layers, pooling layers and a flatten layer. The CNN's input is the raw F10.7 time series. Its two convolutional layers use ReLU as the activation function and perform batch normalization. The pooling layer retains the main features while reducing the number of parameters and the computation, which prevents overfitting and improves generalization. The flatten layer outputs one-dimensional signals to the LSTM network. The sequence prediction network consists of a three-layer LSTM network. It uses Adam as the optimizer; the learning rate is set to 0.001 and the number of training iterations to 500.

Dataset Introduction
The dataset used in this paper was acquired from the Center for Space Standards and Innovation (http://www.celestrak.com/SpaceData/SW-All.txt, accessed on 3 February 2021). The training and testing data use the F10.7_ADJ parameter, i.e., the observed 10.7-cm solar radio flux adjusted to one astronomical unit. We selected the F10.7 data from 1980–2002, comprising 8401 samples, as the training set. The data from 2003–2014, containing 4383 samples, served as the testing set. The ratio of training set to testing set was therefore around 2:1.
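The stated sample counts follow directly from the date ranges, since the data are daily; a quick check with the standard library:

```python
from datetime import date

# One sample per day, both endpoints inclusive.
train_days = (date(2002, 12, 31) - date(1980, 1, 1)).days + 1
test_days = (date(2014, 12, 31) - date(2003, 1, 1)).days + 1
print(train_days, test_days, round(train_days / test_days, 2))
# 8401 4383 1.92
```

This reproduces the 8401 training and 4383 testing samples and the roughly 2:1 split.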

Evaluation Metric
In the experiment, multiple evaluation indexes are used to evaluate the prediction performance of the comparative algorithms, as shown in Formulas (12)–(15). The mean absolute percentage error (MAPE) divides the absolute difference between the observed and estimated values by the observed value, sums this ratio over every forecasted point in time, and divides by the sample number n. The root-mean-square error (RMSE) reflects the overall stability of the forecast. The correlation coefficient (R) and the Spearman correlation coefficient (ρ) measure the correlation between estimation and observation. A high correlation coefficient indicates that the estimated and observed values increase or decrease synchronously and that the result is satisfactory.

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|Y_i - \hat{y}_i|}{Y_i}, \quad (12)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{y}_i)^2}, \quad (13)$$
$$R = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}, \quad (14)$$
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \quad (15)$$

where n is the sample number, Y_i and ŷ_i denote the ith observed and estimated values, respectively, and d_i is the difference between the ranks of Y_i and ŷ_i.
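The four metrics can be implemented in a few lines of NumPy (a minimal sketch with our own toy values; the Spearman coefficient is computed here as the Pearson correlation of the ranks, which is equivalent to Formula (15) when there are no ties):

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error, Formula (12), in percent."""
    return 100.0 * np.mean(np.abs(y - y_hat) / y)

def rmse(y, y_hat):
    """Root-mean-square error, Formula (13)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def pearson_r(y, y_hat):
    """Correlation coefficient R, Formula (14)."""
    return np.corrcoef(y, y_hat)[0, 1]

def spearman_rho(y, y_hat):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson_r(rank(y), rank(y_hat))

y = np.array([100., 110., 120., 130.])       # toy observed values (sfu)
y_hat = np.array([102., 108., 123., 128.])   # toy predicted values
print(round(mape(y, y_hat), 2), round(rmse(y, y_hat), 2))  # 1.96 2.29
```

Here both rankings are identical, so ρ = 1 even though the point errors are nonzero; ρ measures monotone agreement, not magnitude.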

The Annual Accuracy Analysis of F10.7 Prediction
A positive correlation between MAPE and solar activity level has been reported in the literature [4,10]. This view is also supported by the present paper. Figure 3 shows the annual average MAPE of the comparative algorithms from 2003–2012 for 3-day forecasts. The comparison models include the BP model [14], a three-layer feed-forward network; the EMD-BP model [15], a stacked model of EMD and BP; and the traditional three-layer LSTM model. The MAPE is positively correlated with the level of annual average F10.7 for all algorithms. For each algorithm, the MAPE values in periods of moderate and high solar activity (i.e., F10.7 > 100 sfu) are higher than those in periods of low solar activity (F10.7 < 100 sfu). Furthermore, the MAPE values of the BP algorithm are inferior to those of the other three algorithms in all solar activity years. The LSTM algorithm has a relatively stable MAPE on the whole. The EMD-BP has a small MAPE under low solar activity but deteriorates sharply under medium and high solar activity. The CNN-LSTM algorithm has the best performance in all testing years. Its average MAPE in low-solar-activity years is 1.52%, lower than the 3.39%, 1.91% and 2.39% of BP, EMD-BP and LSTM, respectively. Its average MAPE is 2.85% in moderate- and high-solar-activity years, lower than the 4.95%, 4.18% and 3.38% of BP, EMD-BP and LSTM, respectively.

Table 2 shows the prediction results of the CNN-LSTM, BP, EMD-BP and LSTM models for prediction lengths of 1–27 days. In general, the MAPE and RMSE values rise as the number of forecast days increases, while the Spearman correlation coefficient shows a dropping trend. Our method has an average MAPE of 3.41% in the short-term forecast (i.e., 1–7 forecasting days), a lead of 32.7%, 21.2% and 12.3% over BP, LSTM and EMD-BP, respectively. In RMSE, our method leads the other three by 33.9%, 25.6% and 9.0%.
In the mid-term forecast (i.e., 8–27 forecasting days), the MAPE and RMSE rise dramatically; our algorithm is ahead by 30.7%, 16.3% and 8.1% in MAPE and by 28.5%, 20.5% and 7.4% in RMSE. As for the Spearman correlation coefficients, all the algorithms stay above 0.9 over the 1–27-day forecasts, and our method again performs best. Figure 4 shows the fitting curves of predicted and observed values among the comparative algorithms for 10-day forecasts in 2003 (i.e., a year of high solar activity). All the predictive curves fit the trend of the observed curve well. We observe some relatively significant deviations between the predicted and observed curves in the local extremum cycles of the BP, EMD-BP and LSTM fitting curves. The fitting curve of CNN-LSTM fits well most of the time, but there is still a relatively large deviation in some periods (i.e., around the 90th, 170th and 200th days). Figure 5 shows the fitting effect among the comparative algorithms for 10-day forecasts in 2007 (i.e., a year of low solar activity). From the curves, we observe that the CNN-LSTM model still shows the best fit. The fitting curve of 10-day forecasts represents a mid-term forecast, which reveals the ability of trend prediction. Compared with the BP, EMD-BP and LSTM models, the CNN-LSTM model is smoother and deviates less, especially on high-solar-activity days.

Comparison with Physical Methods in Short-Term Prediction
This part comprehensively evaluates short-term forecast performance. The CH-MF model [12], which mixes a heteroscedastic linear model with Bayesian optimization, and a solar magnetic flux transport model, namely the Y model [12], were compared with the CNN-LSTM model. The results are shown in Table 3. All three methods perform well and comparably on the Spearman and Pearson correlation coefficients. Compared with the Y model and the CH-MF model, our method leads by 39.9% and 42% in MAPE for 1-day forecasts and by 35.1% and 2.4% in RMSE. In 3-day forecasts, it leads by 28.6% and 51.8% in MAPE and by 16.5% and 29.9% in RMSE.

Conclusions
We employed the combination model CNN-LSTM, which stacks the specified CNN and LSTM networks in series, for short- and mid-term F10.7 forecasting. The CNN network was used to extract the features of the F10.7 daily sequence adaptively; the LSTM network then performed the prediction task. We compared the CNN-LSTM model with other F10.7 forecasting models across different solar activity years and forecast horizons. The results showed that the CNN-LSTM model had MAPE values of 2.04% (1-day ahead), 2.78% (3-day ahead) and 4.66% (7-day ahead) in short-term forecasts. Additionally, mid-term forecasts had MAPE values of 5.85% (14-day ahead) and 7.40% (27-day ahead). For the Spearman correlation coefficients and RMSE, the CNN-LSTM model also outperformed the BP, EMD-BP and LSTM models. Further, the CNN-LSTM model was found to be effective in improving the prediction accuracy in high-solar-activity years. The CNN-LSTM was also compared with the Y model and the CH-MF model, and showed distinct advantages across a variety of evaluation metrics. We summarize the reasons for our method's impressive short- and mid-term F10.7 prediction accuracy as follows: (1) Compared with common shallow neural networks, the LSTM model is more suitable for F10.7 prediction because of its memory property. (2) The convolutional feature extractor achieves adaptive extraction, which avoids the instability of manually extracted features and facilitates the subsequent time series prediction by the LSTM. (3) Compared with other machine learning algorithms, the CNN-LSTM model provides a deeper neural network architecture, extracting deep features from the raw signal.
Nevertheless, we observed that the prediction error increased near the peak points, which may be due to the low proportion of samples in the vicinity of the extreme points relative to the whole sample, leaving the model less effective when faced with this unbalanced-sample problem. Future work will explore down-sampling techniques and generative adversarial networks to address this problem.