Evaluation of F10.7, Sunspot Number and Photon Flux Data for Ionosphere TEC Modeling and Prediction Using Machine Learning Techniques

Abstract: Considering the growing volumes and varieties of ionosphere data, it is expected that automation of analytical model building using modern technologies could lead to more accurate results. In this work, machine learning techniques are applied to ionospheric modeling and prediction using sun activity data. We propose Total Electron Content (TEC) spectral analysis, using the discrete cosine transform (DCT) to evaluate the relation to the solar features F10.7, sunspot number and photon flux data. The ionosphere modeling procedure presented is based on the assessment of a six-year period (2014–2019) of data. Different multi-dimension regression models were considered in the experiments, where each geographic location was independently evaluated using its DCT frequency components. The feature correlation analysis has shown that 5-year data seem more adequate for training, while learning curves revealed overfitting for polynomial regression from the 4th to 7th degrees. A qualitative evaluation using reconstructed TEC maps indicated that the 3rd degree polynomial regression also seems inadequate. For the remaining models, it can be noted that there is a seasonal variation in root-mean-square error (RMSE) clearly related to the equinox (lower error) and solstice (higher error) periods, which points to a possible seasonal adjustment in modeling. Elastic Net regularization was also used to reduce global RMSE values down to 2.80 TECU for linear regression.


Introduction
The study of ionosphere dynamics has attracted the attention of several research groups related to the Space Weather area. This is mainly due to its known effects on electronic equipment, communication and power transmission [1], which are now part of our society's foundations. Among different measurements, ground-based GNSS stations and ionosondes can provide important ionosphere data at fixed locations, while low-earth orbiting satellites can provide atmospheric soundings over broad regions, including the oceans. However, ionosphere Total Electron Content (TEC) maps have gained prominence as one of the most important tools to evaluate current and future ionosphere behavior, since they can provide full diurnal information over large areas around the globe. To provide reliable TEC maps to the scientific community, since 1998 the International GNSS Service (IGS) has combined TEC maps from different sources and made the final products freely available through its website [2]. Considering the lack of measurements around the globe, the development of ionosphere mathematical models is critical to this task. Besides, the importance of data interpolation for ionosphere models increases with the possibility of forecasting electron density variability over days. Since the Klobuchar model [3], which is still used in single frequency GPS systems, several other interesting empirical models have been proposed [4][5][6][7].
Machine learning is a computer science research field in which data analysis can provide the ability to learn without being explicitly programmed. When a deterministic solution to a problem is unknown, we can use machine learning techniques to derive a close-to-optimal solution, using a variety of correlated data. It also stands out in complex problems that traditional approaches are unable to solve [8]. Nowadays, machine learning models are present in a variety of fields, such as self-driving cars, biological systems and remote sensing.
Ionospheric modeling can benefit from machine learning techniques to achieve accurate models, using available features related to sun activity and, therefore, the ionosphere state. The solar radio flux at 10.7 cm (F10.7 index), corresponding to the frequency of 2800 MHz, is known to be an excellent indicator of solar activity [9]. Likewise, photon flux data at different frequency ranges correlate to ion formation and neutralization in the ionosphere. Also, the sunspot number follows the cyclical nature of the sun [9]. In this work, we evaluate the influence of these features for ionosphere TEC modeling and prediction.

Features Evaluation
The direct influence of sun activity on ionosphere dynamics is well known. The main source of energy ionizing the molecules in the ionosphere comes from the sun. This energy is unequally distributed at different heights, depending on the solar radiation frequencies that hit the upper atmosphere, which causes variability in electron density. Besides that, the Earth's rotation plays an important role in the daily change of electron content. Geographic regions with peak electronic concentration follow this movement to some extent, which explains the periodic components observed in TEC. Similar behavior related to seasonal variation can also be observed. Thus, it is expected that sun activity measurements strongly correlate to the observed TEC change, and these features could be explored to model the quiet ionosphere dynamics. Obviously, other factors are also important for defining the full variation of the ionosphere, in particular the Earth's magnetic field, the equatorial ionization anomaly (EIA) [10,11], and special phenomena like Coronal Mass Ejections (CMEs) [12], which may cause significant ionospheric disturbances and plasma bubbles [13,14].

F10.7
The 10.7 cm wavelength solar radio emissions (F10.7) originate in the sun's chromosphere and have been measured since 1947. As mentioned in Section 1, these are widely used as an indicator of solar activity.

Sunspot Number
Sunspots are known as temporary phenomena of the Sun's atmosphere. They appear as spots darker than the surrounding areas, and their occurrence varies cyclically [15]. Sunspots are the sites of strong magnetic fields, which are generated by the dynamo effect inside the Sun. This feature has an important correlation with F10.7, as can be seen in Figure 1.

Photon Flux
The photon flux data estimate the number of photons striking the atmosphere per unit area in a given time, which impacts the number of electrons produced in the ionosphere. They can be used to calculate the power density at different bandwidths. In this work, data for different bandwidths are used, and Figure 2 shows the variation for three of them.

Ionosphere Modeling
The ionosphere modeling procedure presented is based on the assessment of a six-year period (2014–2019) of data. Global TEC maps were provided by IGS [2] from the National Aeronautics and Space Administration (NASA) website with a spatial-temporal resolution of 2.5° × 5.0° × 2 h for latitude, longitude and time, respectively. The corresponding daily F10.7 index, sunspot number and photon flux data were obtained using the Solar Irradiance Platform tool [16]. Each geographic location in the TEC maps was evaluated independently, and the sequence of TEC values available for a given day is first passed to the frequency domain by applying the well-known Discrete Cosine Transform (DCT) [17], which shows better spectral compaction and energy concentration properties than the Discrete Fourier Transform (DFT), defined as:

X_k = Σ_{n=0}^{N−1} x_n cos[(π/N)(n + 1/2)k],  k = 0, . . . , N − 1, (1)

where X_k is the kth frequency component, x_n is the nth data point to transform and N is the number of data points. Also, the inverse discrete cosine transform (IDCT) can be defined as:

x_n = (2/N) Σ_{k=0}^{N−1} c_k X_k cos[(π/N)(n + 1/2)k],  n = 0, . . . , N − 1, (2)

where c_0 = 1/2 and c_k = 1 for k ≥ 1. Then, the DCT frequency components are associated with the corresponding features. This process was repeated daily for the period of analysis, relating each one of the 12 DCT frequency components (obtained from the 12 TEC data available in a day) to F10.7, sunspot number and photon flux data at the 1.86 to 2.95 nm bandwidth. After passing TEC to the frequency domain, a multi-dimension regression technique can be used to estimate the best fit to the data. The resulting curves (one for every DCT coefficient, in every geographic location) constitute the ionosphere model used in the experiments. There are different types of regression methods that can be used, and our experiments focused on linear, polynomial and Support Vector Machine (SVM) [8] regression, noting that the SVM is not a proper regression method but can be adapted to be used as such.
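The per-location transform step above can be sketched as follows. This is a minimal illustration using SciPy's DCT-II with orthonormal scaling; the 12 TEC values are hypothetical placeholders, not real IGS data.

```python
# Sketch of the per-location preprocessing step: a day of TEC values
# (12 samples at 2 h resolution) is taken to the frequency domain with
# the DCT, and the IDCT recovers the original series.
# The TEC values below are illustrative, not real IGS data.
import numpy as np
from scipy.fft import dct, idct

tec_day = np.array([8.0, 6.5, 5.2, 7.1, 12.4, 18.9,
                    24.3, 27.0, 25.1, 19.8, 14.2, 10.3])  # TECU, hypothetical

# DCT-II with orthonormal scaling; 12 samples give 12 frequency components
X = dct(tec_day, type=2, norm='ortho')

# Low-order components concentrate most of the signal energy
# (the spectral compaction property mentioned above)
energy_low = np.sum(X[:4] ** 2) / np.sum(X ** 2)

# The inverse transform reconstructs the daily TEC curve exactly
tec_rec = idct(X, type=2, norm='ortho')
```

With orthonormal scaling the transform preserves energy (Parseval), so the fraction carried by the first few components can be read off directly.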

Linear Regression
The multi-dimensional linear regression model [18] can be defined as

Y(X) = θ_0 + θ_1 X_1 + · · · + θ_k X_k + ε, (3)

where θ_0 is the linear coefficient, X = X_1, . . . , X_k are the feature values and k is the total number of features used. Y(X) is the predicted value, in our context the TEC frequency component value estimated for a set of features, and ε is the model error of estimation. Considering n observations, the model can be represented in a compact matrix notation [18], with Y and ε as n-dimensional vectors, θ = [θ_0, θ_1, . . . , θ_k]ᵀ and X as the n × (k + 1) matrix whose ith row is [1, X_{i1}, . . . , X_{ik}], and Equation (3) can be written in a simpler form as

Y = Xθ + ε. (4)

The main goal of this regression is to estimate the optimal θ coefficients which minimize a cost function. The most popular approach is least squares [19], which estimates θ to minimize the residual sum of squared errors, according to Equation (5):

θ̂ = argmin_θ ‖Y − Xθ‖² = (XᵀX)⁻¹ XᵀY. (5)
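A minimal sketch of the least-squares estimate in Equation (5), solved through the normal equations. The feature values below are synthetic stand-ins for F10.7, sunspot number and photon flux, not real data.

```python
# Least-squares sketch for Equation (5): estimate theta from n
# observations of k features. Features here are synthetic stand-ins
# for F10.7, sunspot number and photon flux.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # bias column first
theta_true = np.array([2.0, 1.5, -0.7, 0.3])
Y = X @ theta_true + rng.normal(scale=0.1, size=n)          # noisy observations

# Normal equations: theta_hat = (X^T X)^{-1} X^T Y
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

Solving the linear system with `np.linalg.solve` avoids explicitly inverting XᵀX, which is numerically preferable.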

Polynomial Regression
Polynomial regression [18] makes use of the linear model approach, but considers an mth degree polynomial fit to the data by adding a power of each feature as a new feature, and then applying multiple linear regression. An example for a single feature x is shown in matrix notation in Equation (6):

X = [ 1  x_1  x_1²  · · ·  x_1^m ;
      1  x_2  x_2²  · · ·  x_2^m ;
      ⋮
      1  x_n  x_n²  · · ·  x_n^m ], (6)

where n is the number of data available and θ_1, . . . , θ_m are the regression coefficients to be estimated to minimize a cost function, which can be calculated with the least squares method presented in Section 3.1.
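The feature-expansion idea in Equation (6) can be sketched with scikit-learn (an implementation assumption, not named in this work), expanding a single feature to degree 3, the highest degree retained in the later experiments:

```python
# Polynomial regression via feature expansion: each feature is raised
# to successive powers (Equation (6)) and fed to an ordinary linear
# regression. Data here are synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(100, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.05, size=100)

# PolynomialFeatures builds the [1, x, x^2, x^3] columns of Equation (6)
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
```

The pipeline keeps the expansion and the least-squares fit as a single estimator, so prediction applies both steps transparently.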

Support Vector Machine
Support vector machines (SVMs) [20] have been used successfully in supervised data set classification. Linear SVMs are effective when data sets show an approximately linear distribution yet contain some outliers. In several cases a non-linear class separation, instead of a hyperplane, is more adequate. However, it is the reversed SVM task [21], Support Vector Regression (SVR), that can be used to implement linear and non-linear regression involving a higher number of features while limiting margin violations. The width of the margin is controlled by a hyperparameter ε, wherein a larger value of ε would introduce more error in the solution [8], while a lower value would penalize every error and require more computational resources. To implement a polynomial regression with SVM, a kernelized SVM model was used, and a new hyperparameter C was added to control the regularization of the model with L2 regularization (Section 3.4). Lower values of C add more regularization to the model, decreasing the overfitting but requiring more computational power. In this work we used ε = 0.4 and C = 50, following [21], to balance the error introduced in the model, overfitting and regularization.
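A sketch of the kernelized SVM regression just described, using scikit-learn's SVR with a polynomial kernel and the hyperparameters ε = 0.4 and C = 50 used in this work. The training data and the kernel degree are illustrative assumptions:

```python
# Kernelized SVM regression sketch with epsilon = 0.4 and C = 50,
# the values used in this work. Training data are synthetic.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(150, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=150)

# Polynomial kernel implements the non-linear margin; the epsilon tube
# tolerates errors up to 0.4 without penalty, C controls regularization
svr = SVR(kernel='poly', degree=2, epsilon=0.4, C=50)
svr.fit(X, y)
```

With such a wide ε-tube, few training points become support vectors, which is the computational trade-off discussed above.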

Regularization Methods
In multiple dimension regression, when several features are used, the probability of overfitting the model increases. In addition, when the main goal is to explain a phenomenon, a model with a low number of features, like three or four, can be more useful than a model with hundreds of features [22]. Therefore, regularization can be used to decrease the overfitting, by adding a regularization term to the cost function of the model over the training set. This forces the algorithm to keep the weights θ as small as possible, rather than adjusting freely to the training data [8].
L1 norm regularization, also known as Least Absolute Shrinkage and Selection Operator (LASSO) regression, adds to the cost function the L1 norm of the coefficient vector [23]. Hence, if the performance metric of the model is the mean squared error (MSE), the LASSO cost function can be described by Equation (7) [24]:

J(θ) = MSE(θ) + α Σ_{i=1}^{k} |θ_i|, (7)

where k is the number of coefficients (features) and α is the hyperparameter which controls how much the model is regularized. For instance, if α = 0 we have a regular linear regression. Also, it is important to notice that the intercept (or bias term) θ_0 is not regularized [8], so the summation index can start from a different number depending on the indexing method. The L2 regularization (also known as Ridge regression) is based on the same principle, but adds to the cost function half of the squared L2 vector norm:

J(θ) = MSE(θ) + (α/2) Σ_{i=1}^{k} θ_i². (8)

It is important to mention that for SVM, α is inversely proportional to C. Lastly, another regularization method which combines the L1 and L2 penalties through a mixing hyperparameter r, called Elastic Net, can be used. The cost function for Elastic Net regression can be seen in Equation (9):

J(θ) = MSE(θ) + r α Σ_{i=1}^{k} |θ_i| + ((1 − r)/2) α Σ_{i=1}^{k} θ_i². (9)
In this work the Elastic Net regression seems more appropriate for the linear and polynomial regressions, since the features used show important correlation, as shown in Section 4.1. In such cases Elastic Net is preferred over L1 or L2 [8]. The values of r = 0.02 and α = 0.03 were obtained by walk-forward validation and grid search.
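The Elastic Net setup above can be sketched with scikit-learn, whose `ElasticNet` maps r to `l1_ratio` and the overall strength to `alpha` (note its exact penalty parameterization differs slightly in scaling from Equation (9)); the correlated synthetic features mimic the situation that motivates Elastic Net here:

```python
# Elastic Net sketch with r = 0.02 (l1_ratio) and alpha = 0.03, the
# values found by walk-forward validation in this work. Data are
# synthetic, with one deliberately correlated feature.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=300)   # strongly correlated feature
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

enet = ElasticNet(alpha=0.03, l1_ratio=0.02)
enet.fit(X, y)
```

With correlated features, the L2 part of the penalty spreads weight across the correlated columns instead of arbitrarily picking one, which is why Elastic Net is preferred here over pure LASSO.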

Model Analysis and Experiments
In machine learning, it is important to analyze the available data to better recognize a possible solution to the problem [21]. The features used in this work are time series showing complex behavior. The choice of the most promising models could benefit from the evaluation of the correlation between features and TEC frequency components, from the learning curves used during training, and also from a qualitative comparison of reconstructed TEC maps.

Features Correlation Analysis
The TEC frequency component values were compared to the features F10.7 and sunspot number at two different geomagnetic latitudes: 22.5° N and 22.5° S. The Pearson correlation coefficient, computed for periods of 3 and 5 years of data, is shown in Figures 6 and 7 for each longitude. Both figures presented close results, which were similar to other latitudes. The amount of data needed in modeling is related to the complexity of the problem, and we can verify that the 5-year period data (2014–2018) seem more adequate for our analysis than the 3-year period (2016–2018). Also, the dependence of the low-frequency components was substantial, which agrees with the prior analysis. The evaluation of photon flux data is also shown in Figure 8 for four different bandwidths at latitude 22.5° S using 5 years of data. Although similar heat maps were obtained for the different bandwidths of photon flux data, F10.7 and sunspot number, which indicates a very strong correlation between them, we can see important differences when comparing three and five years of data. Therefore, it is expected that having five years of information available may result in better TEC modeling.
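The correlation analysis above reduces, per grid point and per component, to a Pearson coefficient between two daily series. A minimal sketch, with synthetic stand-ins for the F10.7 series and a low-frequency DCT component:

```python
# Pearson correlation between a solar feature and one DCT frequency
# component over a 5-year period. Both series are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(6)
n_days = 5 * 365
f107 = 100 + 50 * np.sin(np.arange(n_days) / 300) + rng.normal(scale=5, size=n_days)
dct_comp0 = 0.3 * f107 + rng.normal(scale=10, size=n_days)   # low-frequency component

# Pearson correlation coefficient from the 2x2 correlation matrix
r = np.corrcoef(f107, dct_comp0)[0, 1]
```

Repeating this over all longitudes and components yields the heat maps shown in Figures 6-8.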

Learning Curves
Learning curves are a graphical representation of the model performance as a function of the size of the training set [8]. They can be evaluated to select the most promising models to be used and to reject models which present significant overfitting. In this work, the metric to evaluate the model performance was the root mean squared error (RMSE), largely adopted in regression models [18]. To generate the curves, normalized training data from 2014 to 2018 were used, and the candidate models were trained and evaluated multiple times, using different sizes of training subsets by the walk-forward method [21]. One year of data (2019) was used as a testing set, and the scale of RMSE in the learning curves was normalized.
Polynomial regressions from the 4th to 7th degrees presented low RMSE values for the training set. However, the corresponding learning curves for the test set revealed a clearly unstable model behavior caused by overfitting, as shown in Figure 9 for the 1st DCT frequency component of the model at latitude 22.5° S and longitude 180° E. Considering the overfitting observed in the high order polynomial regression models, our analysis limited polynomial regression to the 3rd degree. The resulting learning curves for the 1st to 4th DCT frequency components at latitude 22.5° S and longitude 180° E can be seen in Figures 10-13, using the features F10.7, sunspot number and two photon flux bandwidths. The strong correlation in photon flux at different bandwidths, previously observed in Figure 8, indicates a reduction of new information when more bandwidths are added. As expected, the use of more than two photon flux bandwidths apparently did not lead to a significant reduction in RMSE values, but substantially increased the required processing time. In general, these four models presented reasonably well-behaved learning curves and were considered in the following experiments, although the comparison of prediction model learning curves suggests polynomial regression of the 3rd degree is closer to overfitting than the others.
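The walk-forward learning-curve procedure can be sketched as follows: the model is repeatedly refit on a growing prefix of the time-ordered training data and evaluated on the fixed test year, recording the RMSE at each step. Data and subset sizes below are synthetic placeholders:

```python
# Walk-forward learning curve sketch: refit on growing training
# prefixes, evaluate on a fixed held-out test set, track RMSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.3, size=500)
X_train, y_train = X[:400], y[:400]               # time-ordered training period
X_test, y_test = X[400:], y[400:]                 # fixed test period

rmse_curve = []
for n in range(50, 401, 50):                      # growing training prefix
    model = LinearRegression().fit(X_train[:n], y_train[:n])
    pred = model.predict(X_test)
    rmse_curve.append(mean_squared_error(y_test, pred) ** 0.5)
```

A large, persistent gap between training and test RMSE in such a curve is the overfitting signature that ruled out the 4th to 7th degree polynomials.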

Reconstruction of TEC Maps
An interesting qualitative analysis can be conducted by using the candidate models to reconstruct the TEC maps, and comparing them to the corresponding TEC map provided by IGS. Figure 14 shows reconstructed global TEC maps for 20 March, with the corresponding IGS TEC map. Clearly the 3rd degree polynomial regression model TEC reconstruction seems more divergent from the IGS data than the others, which agrees with the learning curve evaluation presented in the previous section. Figure 15 shows the absolute difference between each reconstructed TEC map shown in Figure 14 and the IGS TEC map. It can be noted that the results obtained with the linear regression and SVM models are similar, with prediction errors geographically distributed at close locations. The qualitative evaluation, combined with the learning curve results, indicates the 3rd degree polynomial regression model seems inadequate for our purposes. Significant errors in prediction were observed close to the ionization crests, symmetrical to the magnetic equator, a critical location for ionosphere modeling.
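The reconstruction step behind Figures 14 and 15 can be sketched as follows: predicted DCT components at each grid point are inverted back to the time domain, and the absolute difference against a reference map is formed. The arrays here are random stand-ins for real TEC grids:

```python
# Sketch of the qualitative comparison: invert predicted DCT components
# per grid point, then form the absolute difference against a reference
# (IGS) map at one epoch. Arrays are random stand-ins.
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(5)
n_lat, n_lon, n_comp = 71, 73, 12                 # IGS grid: 2.5 x 5.0 deg
pred_dct = rng.normal(size=(n_lat, n_lon, n_comp))

# Reconstruct the 12 daily TEC samples at every grid point (IDCT along
# the component axis)
tec_pred = idct(pred_dct, type=2, norm='ortho', axis=-1)

# Absolute difference against a reference map at one epoch
tec_igs = rng.normal(size=(n_lat, n_lon))
abs_diff = np.abs(tec_pred[:, :, 6] - tec_igs)
```

Plotting `abs_diff` on the latitude-longitude grid gives the kind of error map shown in Figure 15.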

RMSE Results
The temporal variation in global RMSE, considering the linear, 2nd degree polynomial and SVM models, for the entire year of 2019 (test set) is shown in Figure 16. Although for the whole period the variation in RMSE was similar for the three models, it can be noted that there is a seasonal variation clearly related to the equinox (lower error) and solstice (higher error) periods. Considering that the models were defined using 5-year training data (not seasonally tuned), it would be expected that their lower errors would occur near the periods when the Earth is equally energized by the sun in both hemispheres, the equinoxes, around 20 March and 22 September. On the other hand, for the periods when the sun's energy concentrates in one hemisphere, the solstices, around 21 June and 21 December, it seems the models could benefit from some sort of seasonal adjustment. These observations agree with Figure 16, where the errors are similar for the three models.

Regularization
The Elastic Net regularization procedure, described in Section 3.4, was applied to the linear and polynomial models. Error results are shown in Figures 17 and 18, respectively, and summarized in Table 2. Both figures show error reduction with regularization, mainly for the linear regression model, during most of the test set duration. For the polynomial model, regularization performed better during periods close to the northern hemisphere summer solstice. However, for the opposite period of the year, regularization was ineffective or slightly worsened the results. It seems the effects of regularization are more noticeable in linear regression, possibly due to the simplicity of the model. Since the SVM model already uses a regularization parameter C, as discussed in Section 3.3, Elastic Net was not used in this model.

Conclusions and Future Work
In this work, we have examined TEC data in the frequency domain using the DCT and its correlation with solar features, which was modeled using different approaches. Despite its simplicity, the linear regression model was able to achieve lower RMSE values when Elastic Net regularization was used. Temporal analysis of the RMSE further revealed a seasonal variation in errors that could be explored in future works. Different models could be obtained using seasonally divided training data, and their outputs could be combined to obtain better results. Also, the introduction of knowledge about the physical phenomena into the model could lead to more accurate outputs. Further, other approaches such as basis functions, autoregressive models and deep learning with recurrent neural networks could be explored to build different TEC time series forecasting models.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.