A Comprehensive Application of Machine Learning Techniques for Short-Term Solar Radiation Prediction

Abstract: Forecasting the output power of solar PV systems is required for the reliable operation of the power grid and the optimal management of energy fluxes occurring in the solar system. Before forecasting a solar system's output, it is essential to focus on the prediction of solar irradiance. In this paper, solar radiation data collected over two years at a site in Jiangsu, China are investigated. The objective of this paper is to improve the ability of short-term solar radiation prediction. Firstly, missing data are recovered by means of matrix completion. Then the completed data are denoised via robust principal component analysis. To reduce the influence of weather types on solar radiation, spectral clustering fusing sparse subspace representation and k-nearest-neighbor is adopted to partition the data into three clusters. Next, for each cluster, four neural networks are established to predict the short-term solar radiation. The experimental results show that the proposed method can enhance the accuracy of solar radiation prediction.


Introduction
In recent years, the scale of renewable energy power generation has expanded rapidly, and many countries are considering incorporating renewable energy into the grid [1]. Solar energy has become one of the main sources of renewable energy [2]. The narrow definition of solar energy is solar radiation [3]. Broadly speaking, solar energy also includes other forms of energy converted from solar radiation, such as coal, oil, natural gas, hydropower, wind energy, biological energy, etc. Solar radiation is affected by season and geography, and exhibits obvious discontinuities and uncertainties [4,5]. These characteristics are the reason that prediction must focus on solar radiation before predicting the output of a solar system. Photovoltaic (PV) power generation is typically divided into two forms: off-grid and grid-connected. With the maturity and development of grid-connected PV technology, grid-connected PV power generation has become a mainstream trend [6]. The capacity of large-scale centralized grid-connected PV power generation systems is rapidly increasing. However, the output power of grid-connected PV systems is inherently intermittent and uncontrollable. These intrinsic characteristics have an adverse impact on the grid and seriously restrict grid-connected PV power generation [7].
At present, research on solar radiation prediction has become increasingly extensive and in-depth. Among the various prediction methods, the simplest is the persistence method, which assumes that future solar radiation equals the current solar radiation. Other solar radiation prediction methods can be classified into four categories: physical methods, statistical methods, machine learning methods and hybrid methods [8][9][10][11]. Figure 1 briefly summarizes the four types of solar radiation prediction methods. Among the four categories in Figure 1, the physical methods establish the solar power generation forecast model according to the geographical environment and weather data (such as temperature, humidity, pressure, etc.) [8]. These methods can be further grouped into two subcategories: numerical weather prediction (NWP) methods [12] and spatial correlation methods [13]. NWP methods predict through numerical simulation; that is, mathematical and physical models are applied to analyze atmospheric conditions, and high-speed computers are utilized to forecast solar radiation [14]. Under normal conditions, NWP methods can take a long time to produce a prediction [15]. Moreover, the meteorological and environmental factors in NWP methods are the most complicated and the most difficult to assess accurately [8,16], and improving their forecast accuracy remains difficult in current research. The spatial correlation methods harness the spatial correlation of solar radiation to predict the solar energy of several places; it should be noted that they require rich historical data to simulate complex temporal and spatial changes. In summary, NWP methods and other physical models are not suitable for short-term cases and small areas, owing to their long runtimes and high demands on computing resources.
Forecasting solar radiation intensity and solar energy from historical experimental data is more suitable for short-term prediction [17]. Statistical methods mainly include the moving average (MA), autoregressive (AR), autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA) and autoregressive conditional heteroscedasticity (ARCH) models, as well as Kalman filtering [18,19]. These models have fast calculation speeds, strong interpretability and simple structures. However, statistical methods establish rigid mathematical relationships between the inputs and outputs, which means that they cannot learn or change prediction strategies; in addition, a large amount of historical data is required. As a result, it is almost impossible for them to capture nonlinear behavior in a time series, so prediction accuracy may degrade as the forecast horizon grows.
With the booming development of artificial intelligence, the application of machine learning technology to predicting PV generation is becoming more popular. These advanced techniques include artificial neural networks (ANN), fuzzy logic (FL), support vector machines (SVM), random forest (RF) and the naive Bayesian algorithm [20][21][22][23][24][25]. Among them, artificial neural networks are a frequently used method, mainly including back-propagation (BP) neural networks [26], radial basis function (RBF) neural networks [27], extreme learning machine (ELM) networks [28], and long short-term memory (LSTM) neural networks [29]. The main principle of machine learning methods is as follows: several types of elements affecting solar radiation are first determined as the input features; then a nonlinear and highly complex mapping relationship is constructed; finally, the model parameters are learned from historical data. Traditional statistical methods cannot attain such a complex representation in most situations, whereas machine learning methods can overcome this deficiency.
Hybrid methods of solar radiation prediction mainly consist of weight-based methods and prediction-assisted methods. The former type is a combined model composed of multiple single models with the same structure: each model gives its own prediction, and the weighted average of the prediction results of all models is regarded as the final prediction result [30,31]. Unlike weight-based methods, prediction-assisted methods usually include two models, one for power prediction and the other for auxiliary processes, such as data filtering, data decomposition, optimal parameter selection and residual evaluation. According to the auxiliary technology, these methods can be further divided into three groups: data preprocessing techniques, parameter optimization techniques and residual calibration techniques. Among them, data preprocessing techniques are the most commonly used, and they mainly include principal component analysis (PCA) and cluster analysis [31,32], the wavelet transform (WT) [33], empirical mode decomposition (EMD) [34] and variational mode decomposition (VMD) [35], etc. A reasonable selection of preprocessing methods can reduce, to a certain extent, the negative impact of systematic error on prediction accuracy. In summary, each single model has its own advantages and disadvantages, and a hybrid model combines the advantages of different methods to obtain better prediction performance.
This paper aims to predict short-term solar radiation through a comprehensive application of machine learning techniques. Firstly, the missing values are recovered by means of matrix completion exploiting low-rank structure. Robust principal component analysis, a method with strong robustness to large sparse noise, is then employed to denoise the recovered data. Next, the denoised solar radiation data are clustered by fusing sparse subspace representation and k-nearest-neighbor. Subsequently, four artificial neural network models are used to forecast, and thus a hybrid model for short-term solar radiation prediction is proposed.
The main structure of the paper is organized as follows. Section 2 describes the experimental dataset and methods. Machine learning techniques for data preprocessing are introduced in Section 3. Section 4 presents several machine learning techniques for forecasting solar radiation. In Section 5, the experiments are carried out, and a comparison of experimental results is provided. Section 6 draws conclusions.

Dataset
The global horizontal irradiation data were collected at a PV power plant in Jiangsu, China. The installed capacity of the power plant was nearly 1.1 megawatts-peak (MWp). The data were recorded every 5 min over the period from 2018 to 2019, except for 25 and 26 September 2019, which were excluded due to a collection error. There were in total 288 recordings each day. A small amount of abnormal data was generated by equipment or operation failures, and these unreasonable recordings were treated as missing entries in this paper. Figure 2 illustrates the solar radiation in 2018 and 2019, respectively, where the blue trend represents lower solar radiation intensity and the yellow trend represents higher intensity. It can be seen from the two subfigures that solar radiation is mainly concentrated between 7 a.m. and 6 p.m. each day, and the intensity is generally strongest from noon to 3 p.m. We separated all solar radiation data according to the four seasons: spring, summer, autumn and winter. In total, spring comprises 180 days (January, February and March); summer has 182 days (April, May and June); autumn has 182 days (July, August and September); and winter has 184 days (October, November and December). Figure 3 further illustrates the solar radiation intensity of the four seasons in 2018. In that year, the average daily maximum solar radiation intensity in spring was 592.64 Wh/m².
In summer and autumn, the average daily maximum solar radiation intensity was 879.88 Wh/m² and 949.67 Wh/m², respectively. Compared with the other seasons, the average daily maximum solar radiation intensity in winter was only 549.33 Wh/m². Observing Figures 2 and 3, we can see that these data show no obvious trend or seasonality. Moreover, the data are relatively complicated and there are many missing elements. If the original data were directly used for forecasting, the prediction results would probably contain large errors, which would affect the normal operation of the PV power grid. Given these data characteristics, it is essential to choose an appropriate data processing method. To obtain higher data quality, we need to recover the missing entries and denoise the completed data. Subsequently, cluster analysis is adopted to reduce the complexity of the data. The data of each cluster are then used to make predictions separately, which can improve the prediction efficiency and accuracy to some extent.

Construction of the Hybrid Model
The hybrid model presented in this paper can be divided into two parts. The first part adopts machine learning methods, including matrix completion, robust principal component analysis (RPCA) and cluster analysis, to preprocess the original data. In the second part, the prediction is carried out by four neural network models: the BP neural network, the radial basis function network, the extreme learning machine and the long short-term memory network. Figure 4 shows the flow chart of the proposed hybrid model. The main data preprocessing steps are the recovery of missing data through matrix completion, denoising of the completed dataset, and spectral clustering based on the combination of sparse subspace representation and k-nearest-neighbor. By integrating the neural network models, a short-term solar radiation prediction model is accomplished.


(Figure 4. Flow chart of the proposed hybrid model: the input solar radiation data are completed by matrix completion and denoised by robust principal component analysis; the denoised data of each season are partitioned into three clusters by spectral clustering fusing k-nearest-neighbor representation and sparse subspace representation; for each cluster, training and test sets feed four neural networks (BP neural network, RBF network, extreme learning machine and long short-term memory) to forecast the solar radiation of the forecasting day.)

Machine Learning Techniques for Data Preprocessing
This section will introduce three unsupervised machine learning methods to preprocess solar radiation data. Firstly, a matrix completion method is utilized to recover the missing data. Then, robust principal component analysis (RPCA) is employed to denoise the completed data. Ultimately, a spectral clustering method based on the fusion of k-nearest-neighbor and sparse subspace representation is proposed to perform cluster analysis on the denoised data.

Matrix Completion
Let $z_i \in \mathbb{R}^{m \times 1}$ be the irradiation measurement vector of the $i$-th day, $i = 1, 2, \ldots, n$. These $n$ samples can be expressed as the matrix $Z = (z_1, z_2, \ldots, z_n)$. For any real matrix $Z = (z_{ij})_{m \times n}$, its nuclear norm is defined as $\|Z\|_* = \sum_i \sigma_i(Z)$, i.e., the sum of the singular values of $Z$. To indicate all missing entries of $Z$, an index set $\Omega \subset \{1, 2, \ldots, m\} \times \{1, 2, \ldots, n\}$ is first introduced. Subsequently, a projection operator $P_\Omega(\cdot): \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ is defined as follows:

$$[P_\Omega(Z)]_{ij} = \begin{cases} z_{ij}, & (i, j) \in \Omega, \\ 0, & (i, j) \notin \Omega. \end{cases}$$
With regard to solar radiation, the $n$ samples can be roughly divided into several groups. Therefore, $Z$ is approximately low-rank when $n$ is relatively large. In the presence of missing elements, the recovery technique exploiting this low-rank structure is called matrix completion [36,37]. Matrix completion is initially described as an affine rank minimization problem; however, due to the non-convexity and discontinuity of the rank function, this problem is intractable. To this end, the optimization model can be convexly relaxed into matrix nuclear norm minimization [38,39]. Thus, the mathematical model of matrix completion is formulated as

$$\min_{\widetilde{Z}} \ \|\widetilde{Z}\|_* \quad \text{s.t.} \quad P_{\Omega^c}(\widetilde{Z}) = P_{\Omega^c}(Z), \qquad (1)$$

where $\Omega^c$ denotes the complement of $\Omega$ (i.e., the observed entries) and $\widetilde{Z}$ is the completed matrix. To avoid an abuse of symbols, the optimal solution of the above minimization is also denoted by $Z$.
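As a concrete illustration of the recovery step, the sketch below uses a simplified fixed-rank variant of matrix completion (sometimes called hard impute) rather than the nuclear-norm solver of problem (1): it alternates between imputing the missing entries and projecting onto rank-$r$ matrices. The function name `complete_matrix` and the known-rank assumption are illustrative.

```python
import numpy as np

def complete_matrix(Z, mask, rank, n_iter=500):
    """Fill the missing entries of Z (mask is True on observed entries) by
    iterating: impute missing entries, then take the best rank-`rank`
    approximation via the SVD. A fixed-rank stand-in for problem (1)."""
    X = np.where(mask, Z, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-r approximation
        X = np.where(mask, Z, low)                   # keep observed entries fixed
    return X

# Toy demo: a rank-1 matrix with roughly 30% of its entries missing.
rng = np.random.default_rng(0)
true = np.outer(rng.random(20) + 0.5, rng.random(15) + 0.5)
mask = rng.random(true.shape) > 0.3
recovered = complete_matrix(np.where(mask, true, 0.0), mask, rank=1)
```

Because the toy matrix is exactly rank-1 and most entries are observed, the iteration recovers the missing entries while leaving the observed ones untouched.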

Robust Principal Component Analysis
Observing Figures 2 and 3, we find that the solar radiation data are contaminated by large sparse noise, which is detrimental to forecasting the future trend. For a low-rank data matrix corrupted by small dense noise, principal component analysis (PCA) can effectively perform dimension reduction, noise elimination and feature extraction [39]. However, generally speaking, PCA does not work well when the studied dataset is superposed with large sparse noise or outliers. Hence, the robustness of PCA has always been a main focus of research.
The emerging robust principal component analysis (RPCA) decomposes a matrix into the sum of a low-rank matrix and a sparse noise matrix, and principal component pursuit is proposed to obtain the optimal decomposition [40]. This robust version of PCA can accurately recover the low-rank component and the sparse noise under some conditions [41,42]. Formally, RPCA is modeled as

$$\min_{A, E} \ \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad Z = A + E, \qquad (2)$$

where $A$ is the low-rank component, $E$ is the sparse noise matrix, $\lambda > 0$ balances low-rankness against sparsity, and $\|\cdot\|_1$ is the $\ell_1$-norm of a matrix (i.e., the sum of the absolute values of all elements). The alternating direction method of multipliers is frequently used to solve the nuclear norm minimizations (1) and (2). The optimal solution of minimization (2) is denoted by $A$.
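A minimal sketch of problem (2), solved by an inexact augmented-Lagrangian (ADMM-style) iteration, might look as follows. The heuristic initial penalty `mu`, its growth factor and the fixed iteration count are assumptions, not the exact solver of [40].

```python
import numpy as np

def rpca(Z, lam=None, n_iter=300):
    """Principal component pursuit: min ||A||_* + lam*||E||_1  s.t.  Z = A + E,
    via inexact augmented Lagrangian iterations."""
    m, n = Z.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / (np.abs(Z).sum() + 1e-12)   # heuristic starting penalty
    mu_max, rho = mu * 1e7, 1.05
    A = np.zeros_like(Z); E = np.zeros_like(Z); Y = np.zeros_like(Z)
    soft = lambda M, t: np.sign(M) * np.maximum(np.abs(M) - t, 0.0)
    for _ in range(n_iter):
        # Singular value thresholding updates the low-rank part A.
        U, s, Vt = np.linalg.svd(Z - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Elementwise soft-thresholding updates the sparse part E.
        E = soft(Z - A + Y / mu, lam / mu)
        Y += mu * (Z - A - E)                        # dual ascent on Z = A + E
        mu = min(mu * rho, mu_max)
    return A, E

# Toy demo: rank-2 background plus roughly 5% large sparse corruptions.
rng = np.random.default_rng(1)
L0 = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))
S0 = np.where(rng.random((30, 30)) < 0.05, 5.0, 0.0)
A_hat, E_hat = rpca(L0 + S0)
```

In the exact-recovery regime (incoherent low-rank part, sparse corruptions), `A_hat` closely matches the clean low-rank matrix.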

k-Nearest-Neighbor
Cluster analysis refers to the process of dividing a given dataset into several groups based on the similarity or distance between samples, without prior information; it is beneficial for further exploring and mining the essence of the data [43]. Spectral clustering is a popular clustering method based on graph theory. The clustering task is achieved by clustering the eigenvectors of the Laplacian matrix of the sample set [44]. In essence, spectral clustering maps the points from a high-dimensional space to a low-dimensional space, where a conventional clustering algorithm is then applied. The similarity matrix is an important ingredient of spectral clustering, and it is constructed according to the distance metric among all points [45]. In the spectral clustering algorithm, the similarity between two points with a short distance is relatively high; otherwise, the similarity is relatively low. The similarity matrix can be built in three ways: the ε-neighborhood graph, the k-nearest-neighbor graph and the fully connected graph [46]. Among these, the ε-neighborhood graph and the fully connected graph tend to lose more information, which leads to less accurate results. In contrast, the k-nearest-neighbor graph generally yields precise results, and it is simple and easy to implement [47].
The first step of spectral clustering is to establish a weighted graph $G = (\nu, \varepsilon)$, where $\nu$ is the set of $n$ nodes and $\varepsilon$ is the collection of edges among the nodes. Let $A = (a_1, a_2, \ldots, a_n)$; the $i$-th node corresponds to the processed observation vector $a_i$. Constructing the similarity graph is the most crucial task in spectral clustering, and among the existing approaches, the k-nearest-neighbor graph is generally recommended as the first choice. Next, we introduce the construction process of the similarity matrix based on the k-nearest-neighbor graph.
In a manifold space, there exists an approximately linear relationship among adjacent points. Under this circumstance, the distance between $a_i$ and $a_j$ is calculated by $d_{ij} = \|a_i - a_j\|_2$, where $\|\cdot\|_2$ is the $\ell_2$-norm of a vector. Given $a_i$, we first compute the $n-1$ distances $\{d_{i1}, \ldots, d_{i,i-1}, d_{i,i+1}, \ldots, d_{in}\}$ and sort them in increasing order, so that the $k$ nearest neighbors of $a_i$ are found from the $k$ smallest distances. On this basis, we build up the similarity graph $G$: if $a_j$ is one of the $k$ nearest neighbors of $a_i$, or $a_i$ is one of the $k$ nearest neighbors of $a_j$, then an edge is added between the $i$-th and $j$-th nodes and the weight $s_{ij}$ is set to 1; otherwise, no edge exists and $s_{ij} = 0$. Eventually, the similarity matrix $S = (s_{ij})_{n \times n}$ is obtained, and it is symmetrized by $s_{ij} \leftarrow \max(s_{ij}, s_{ji})$.
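The construction above can be sketched directly; the function name `knn_similarity` is illustrative, and, as in the text, samples are stored as the columns of `A`.

```python
import numpy as np

def knn_similarity(A, k):
    """Symmetrized 0/1 k-nearest-neighbor similarity matrix; columns of A
    are the samples a_1, ..., a_n."""
    n = A.shape[1]
    # Pairwise Euclidean distances d_ij between columns.
    D = np.linalg.norm(A[:, :, None] - A[:, None, :], axis=0)
    S = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])       # first index is i itself (distance 0)
        S[i, order[1:k + 1]] = 1.0     # connect i to its k nearest neighbors
    return np.maximum(S, S.T)          # symmetrize: s_ij <- max(s_ij, s_ji)

# Four 1-D samples: 0 and 1 are mutually nearest, as are 2 and 3.
S = knn_similarity(np.array([[0.0, 0.1, 5.0, 5.1]]), k=1)
```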

Sparse Subspace Representation
As a sparse learning method, sparse subspace representation is another way to construct the similarity matrix in spectral clustering. Generally speaking, a high-dimensional dataset is distributed in the union of several low-dimensional subspaces; hence, the representation of high-dimensional data is characterized by sparseness [48][49][50].
It is supposed that the dataset lies in the union of several disjoint linear subspaces. The purpose of cluster analysis is to recognize and separate these subspaces. If $A$ is noise-free and there exist sufficiently many samples in each subspace, then the $i$-th sample can be expressed as a linear combination of the remaining samples, $i = 1, 2, \ldots, n$. Thus, it holds that

$$a_i = \sum_{j \neq i} v_{ji} a_j, \qquad (3)$$

where $v_i = (v_{1i}, \ldots, v_{ni})^T$ with $v_{ii} = 0$ is the representation coefficient vector. If $v_{ji}$ is large in absolute value, the $i$-th and $j$-th samples probably have a strong similarity. Equation (3) can be written as $a_i = A v_i$. In detailed implementation, the optimal coefficient vector $v_i$ can be obtained by solving the following $\ell_p$-norm optimization problem:

$$\min_{v_i} \ \|v_i\|_p \quad \text{s.t.} \quad a_i = A v_i, \ v_{ii} = 0. \qquad (4)$$

The value of $p$ is commonly set to 1 or 2. The sample matrix $A$ is frequently corrupted by the superposition of small dense noise and large sparse noise. Under this circumstance, the vector $a_i$ is rewritten as

$$a_i = A v_i + e_i + o_i, \qquad (5)$$

where $e_i$ is a large sparse noise vector and $o_i$ is a dense noise vector. We assume that $o_i$ follows a Gaussian distribution, while $e_i$ and $v_i$ obey multivariate Laplacian distributions. By applying maximum likelihood estimation, we formulate the optimization problem

$$\min_{v_i, e_i} \ \frac{1}{2} \|a_i - A v_i - e_i\|_2^2 + \lambda_1 \|v_i\|_1 + \lambda_2 \|e_i\|_1 \quad \text{s.t.} \quad v_{ii} = 0, \qquad (6)$$

where $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ are two regularization parameters. Hence, the minimization problem (6), also called sparse subspace representation, is more robust than problem (4) when dealing with dense and large sparse noise.
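Problem (6) is convex and can be solved, for instance, by proximal gradient descent (ISTA), one sample at a time. The solver choice, step size and parameter values below are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding, the proximal operator of the l1-norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_repr(A, i, lam1=0.01, lam2=0.01, n_iter=500):
    """Solve min_{v,e} 0.5*||a_i - A v - e||^2 + lam1*||v||_1 + lam2*||e||_1
    with v_ii = 0, by proximal gradient descent."""
    m, n = A.shape
    a = A[:, i]
    v = np.zeros(n); e = np.zeros(m)
    L = np.linalg.norm(A, 2) ** 2 + 1.0     # Lipschitz bound for the gradient
    for _ in range(n_iter):
        r = A @ v + e - a                   # residual of the data-fit term
        v = soft(v - (A.T @ r) / L, lam1 / L)
        v[i] = 0.0                          # enforce the constraint v_ii = 0
        e = soft(e - r / L, lam2 / L)
    return v, e

# Demo: columns 0-2 span one 1-D subspace, columns 3-5 another.
rng = np.random.default_rng(2)
u, w = rng.normal(size=5), rng.normal(size=5)
A = np.column_stack([u, 2 * u, -u, w, 0.5 * w, 3 * w])
v, e = sparse_repr(A, 0)
```

For sample 0, the recovered coefficients concentrate on the columns from the same subspace, which is exactly the property the similarity matrix exploits.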

Spectral Clustering via Fusing k-Nearest-Neighbor and Sparse Subspace Representation
The main disadvantage of k-nearest-neighbor is that it does not use global information, which can make it less robust to data noise. As a highly robust method, sparse subspace representation not only transmits valuable information for the classification task, but also makes use of the overall context to provide a data-adaptive neighborhood [46]. However, a drawback of sparse subspace representation is that two samples with a large Euclidean distance may be classified into the same cluster. For this purpose, this paper proposes a spectral clustering method fusing the advantages of k-nearest-neighbor and sparse subspace representation, and applies it to the cluster analysis of solar radiation.
Let $W_k$ be the similarity matrix constructed by k-nearest-neighbor and $W_s$ the similarity matrix obtained by sparse subspace representation. By their construction principles, both $W_k$ and $W_s$ are sparse and symmetric. We propose a weighted similarity matrix defined by the convex combination of $W_k$ and $W_s$:

$$W = \gamma W_k + (1 - \gamma) W_s,$$

where $\gamma \in [0, 1]$ is a trade-off parameter. When $\gamma = 0$, the similarity matrix is constructed solely by sparse subspace representation; $\gamma = 1$ means that it is formed solely by k-nearest-neighbor. Based on the resulting similarity matrix $W$, we divide the dataset $A$ into $s$ clusters via spectral clustering, implemented as follows. First, a diagonal matrix $D$ is calculated, whose diagonal elements are the row sums of $W$. Then the normalized matrix $L = D^{-1/2} W D^{-1/2}$ is computed. Next, the eigendecomposition of $L$ is performed, and the $s$ mutually orthogonal unit eigenvectors $\{v_i^a \in \mathbb{R}^{n \times 1}\}_{i=1}^{s}$ corresponding to the $s$ largest eigenvalues are acquired. Denote $A_v = (v_1^a, \ldots, v_s^a) \in \mathbb{R}^{n \times s}$. Each row of $A_v$ is further normalized to a unit vector in the $\ell_2$-norm sense. Finally, each row of the normalized matrix is regarded as a sample, and the $n$ samples are partitioned into $s$ clusters by k-means clustering. Compared with using only k-nearest-neighbor or only sparse subspace representation, the proposed spectral clustering maintains stronger stability and robustness.
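The whole procedure can be sketched as follows, given any symmetric similarity matrix (for instance the fused `gamma * W_k + (1 - gamma) * W_s`). The small deterministic k-means with farthest-point initialization is an implementation convenience, not part of the method description.

```python
import numpy as np

def spectral_clustering(W, s):
    """Partition n samples into s clusters from a symmetric similarity
    matrix W, following the normalized-matrix procedure described above."""
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = D_is @ W @ D_is
    # Eigenvectors of the s largest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    V = vecs[:, -s:]
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)  # row-normalize
    # Plain k-means on the rows of V, farthest-point initialization.
    centers = [V[0]]
    for _ in range(1, s):
        dist = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(50):
        labels = np.argmin(((V[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(s):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels

# Demo: a similarity matrix with two obvious blocks of three samples each.
W = np.zeros((6, 6)); W[:3, :3] = 1.0; W[3:, 3:] = 1.0
labels = spectral_clustering(W, 2)
```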

Machine Learning Techniques for Forecasting
Given input-output paired training samples $\{(x_i, y_i)\}_{i=1}^{N}$, we consider the supervised learning task of seeking an approximate function $y = f(x)$, where $x_i \in \mathbb{R}^{D_1 \times 1}$ is the input vector and $y_i \in \mathbb{R}^{D_2 \times 1}$ is the output vector. To learn the functional relationship between the input and the output, this section introduces four supervised machine learning methods, namely BP neural networks, radial basis function networks, extreme learning machines and long short-term memory models.

BP Neural Networks
Neural networks construct the functional form of $y = f(x)$ from the viewpoint of a network model [26,51]. For an input vector $x = (x^{(1)}, x^{(2)}, \ldots, x^{(D_1)})^T$, a feedforward network with $K-1$ hidden layers can be expressed by

$$h^{(0)} = x, \qquad h^{(k)} = g^{(k)}\big(W^{(k)} h^{(k-1)} + b^{(k)}\big), \ k = 1, \ldots, K, \qquad f(x) = h^{(K)},$$

where $W^{(k)} \in \mathbb{R}^{d_k \times d_{k-1}}$ is the weight matrix of the $k$-th layer, $b^{(k)} \in \mathbb{R}^{d_k \times 1}$ is the corresponding bias vector, $g^{(k)}(\cdot)$ is the nonlinear activation function adopted in the $k$-th layer, $d_0 = D_1$ and $d_K = D_2$.
Denote the model parameter set by $\theta = \{W^{(k)}, b^{(k)}\}_{k=1}^{K}$. By training the network on all training samples, the optimal network parameters $\theta$ can be obtained. For this purpose, we minimize the following error function:

$$E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \|f(x_i; \theta) - y_i\|_2^2.$$

The simplest and most effective approach is gradient descent, with the update formula

$$\theta \leftarrow \theta - \eta \, \nabla E(\theta),$$

where $\nabla E(\theta)$ is the gradient of $E(\theta)$ with respect to $\theta$, and the step size $\eta$ is called the learning rate. Each parameter update consists of two stages. The first stage evaluates the derivatives of the error function with respect to the weight matrices and the bias vectors; the backpropagation technique propagates errors backwards through the network and has become a computationally efficient way of evaluating these derivatives. The derivatives are then employed to adjust all parameters in the second stage. Hence, the multilayer perceptron is also called a back-propagation (BP) neural network. Figure 5 depicts the topological structure of a BP neural network with one hidden layer. In detailed implementation, mini-batch gradient descent is usually utilized to reduce the computational burden.
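A minimal hand-coded example of the forward pass, backpropagation and gradient-descent update might look like this; the toy regression target (a sine curve), the layer width and the learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: learn y = sin(x) with one tanh hidden layer of 16 units.
X = np.linspace(-np.pi, np.pi, 64)[None, :]   # inputs, shape (D1=1, N)
Y = np.sin(X)                                 # targets, shape (D2=1, N)

W1 = rng.normal(0.0, 0.5, (16, 1)); b1 = np.zeros((16, 1))
W2 = rng.normal(0.0, 0.5, (1, 16)); b2 = np.zeros((1, 1))
eta = 0.3                                     # learning rate

for _ in range(5000):
    # Forward pass: h = g(W1 x + b1), prediction = W2 h + b2.
    H = np.tanh(W1 @ X + b1)
    P = W2 @ H + b2
    # Backward pass: propagate the squared-error gradient layer by layer.
    dP = (P - Y) / X.shape[1]
    dW2 = dP @ H.T; db2 = dP.sum(axis=1, keepdims=True)
    dH = W2.T @ dP
    dZ = dH * (1.0 - H ** 2)                  # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ @ X.T; db1 = dZ.sum(axis=1, keepdims=True)
    # Gradient-descent update: theta <- theta - eta * grad.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

mse = float(np.mean((W2 @ np.tanh(W1 @ X + b1) + b2 - Y) ** 2))
```

After training, the mean squared error drops well below the variance of the target, showing the two-stage update loop at work.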


RBF Neural Networks
As a two-layer feedforward network, a radial basis function (RBF) neural network is composed of an input layer, a hidden layer and an output layer [27]. An RBF network is a special case of a BP network; the major difference lies in that the former uses a radial basis function as the activation function instead of other functions, such as a sigmoid activation function. The sigmoid activation function forces the neurons to have a large visible input area [52]. In contrast, the activation function in an RBF network responds to a small region of the input space. Consequently, an RBF network needs more radial basis neurons. Moreover, an RBF network is mainly applied to the one-dimensional output case. Figure 6 plots the diagram of an RBF neural network for the case $D_2 = 1$. The most commonly used radial basis function in an RBF neural network is the Gaussian function. In this case, the activation for a given input feature $x$ can be expressed as

$$\varphi(x; u, \sigma) = \exp\!\left(-\frac{\|x - u\|_2^2}{2\sigma^2}\right),$$

where $u \in \mathbb{R}^{D_1 \times 1}$ and $\sigma$ are the center and the standard deviation of the Gaussian function, respectively. The mathematical model of the RBF network with $L$ hidden units can be written as

$$f(x) = \sum_{j=1}^{L} w_j \, \varphi(x; u_j, \sigma_j),$$

where $w = (w_1, w_2, \ldots, w_L)^T$ is the weight vector connecting the hidden layer to the output layer, and $\theta = \{(u_j, \sigma_j)\}_{j=1}^{L}$ is the set composed of the $L$ center vectors and $L$ standard deviations. Formally, the parameters of the RBF neural network can be obtained by minimizing the error

$$E(w, \theta) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i) - y_i\big)^2.$$

If $\theta$ is fixed, the optimal weight vector is calculated as

$$w = \Phi^{\dagger} Y,$$

where $Y = (y_1, y_2, \ldots, y_N)^T$, the notation $\dagger$ denotes the generalized (Moore-Penrose) inverse of a matrix, and $\Phi = (\varphi_{ij})_{N \times L}$ is the design matrix with $\varphi_{ij} = \varphi(x_i; u_j, \sigma_j)$. The parameter set $\theta$ can be determined by gradient descent or cross-validation. In practice, $\theta$ and $w$ can be updated alternately.
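For scalar inputs and a fixed parameter set θ, the pseudo-inverse step $w = \Phi^\dagger Y$ can be sketched as follows; the grid of centers, the shared width and the target curve are illustrative assumptions.

```python
import numpy as np

def fit_rbf(X, y, centers, sigma):
    """Fit the output weights w = pinv(Phi) @ y with fixed Gaussian centers
    and a shared width sigma (1-D inputs)."""
    Phi = np.exp(-((X[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2))
    w = np.linalg.pinv(Phi) @ y        # w = Phi^+ Y
    return w, Phi

# Demo: fit a bell-shaped target with 10 Gaussian units on [-3, 3].
X = np.linspace(-3, 3, 50)
y = np.exp(-X ** 2)
centers = np.linspace(-3, 3, 10)
w, Phi = fit_rbf(X, y, centers, sigma=0.8)
pred = Phi @ w
```

With the centers and width fixed, the fit is a plain linear least-squares problem, which is why the design-matrix pseudo-inverse solves it in one step.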

ELM Neural Networks
ELM generalizes single hidden layer feedforward networks [53-55]. For an input sample $x \in \mathbb{R}^{D_1 \times 1}$, ELM constructs a hidden layer with $L$ nodes, and the output of the $i$-th node is denoted by $h_i(x)$, where $h_i(x)$ is a nonlinear feature mapping. The output of all hidden layer nodes can be written as
$$h(x) = g(Wx + b),$$
where $W \in \mathbb{R}^{L \times D_1}$ and $b \in \mathbb{R}^{L \times 1}$ are the weight matrix and the bias vector in the hidden layer, respectively, and $g(\cdot)$ is the mapping function. Subsequently, the linear combination of $\{h_i(x)\}_{i=1}^{L}$ is used as the resulting output of the prediction:
$$f(x) = \beta^T h(x),$$
where $\beta \in \mathbb{R}^{L \times D_2}$ is the output weight matrix. Figure 7 illustrates the diagram of an ELM neural network with one single hidden layer.

When all parameters $\{W, b, \beta\}$ are unknown, the above prediction function can be regarded as a combination of RBF networks and BP neural networks with only one hidden layer. To simplify the network model, extreme learning machines randomly generate the hidden node parameters $\{W, b\}$ according to some probability distributions. In other words, $W$ and $b$ do not need to be trained explicitly, resulting in a remarkable efficiency.
Let $H = (h(x_1; W, b), \ldots, h(x_N; W, b))^T$ and $Y = (y_1, \ldots, y_N)^T$. The weight matrix $\beta$ connecting the hidden layer and the output layer can be solved by minimizing the squared error loss
$$\min_{\beta} \|H\beta - Y\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix; the minimum-norm solution is $\beta = H^{\dagger} Y$, with $\dagger$ denoting the generalized inverse.
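The whole ELM training procedure reduces to one pseudoinverse, as the following sketch shows. The target function, the number of hidden nodes $L = 50$, and the choice $g = \tanh$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))            # N=300, D1=2
Y = X[:, :1] ** 2 + np.sin(X[:, 1:])             # N x 1 target (D2=1)

L = 50
W = rng.normal(size=(L, 2))                      # random hidden weights, never trained
b = rng.normal(size=(L,))                        # random hidden biases, never trained
g = np.tanh                                      # mapping function g(.)

H = g(X @ W.T + b)                               # N x L hidden-layer output matrix
beta = np.linalg.pinv(H) @ Y                     # minimise ||H beta - Y||_F^2

pred = g(X @ W.T + b) @ beta
err = np.sqrt(np.mean((pred - Y) ** 2))
```

Because only $\beta$ is solved for, training amounts to a single linear least-squares problem, which is the source of ELM's efficiency.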

LSTM Neural Networks
As a special recurrent neural network (RNN), long short-term memory (LSTM) is suitable for processing and predicting important events with relatively long intervals and delays in time series [56,57]. LSTM can alleviate the gradient vanishing phenomenon that arises in the structure of an RNN [58]. Owing to its powerful representation ability, LSTM utilizes complex nonlinear units to construct larger deep neural networks.
LSTM controls long- and short-term memory through gates and cell states [10]. As shown in Figure 8, the neurons in LSTM include the input gate $i$, the forget gate $f$, the cell state $c$ and the output gate $y$. Among them, the three gates are calculated as follows:
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad (18)$$
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad (19)$$
$$y_t = \sigma(W_y [h_{t-1}, x_t] + b_y), \quad (20)$$
where $b_i$, $b_f$, $b_y$ are bias terms, $W_i$, $W_f$, $W_y$ are respectively the weight matrices of the three gates, and $\sigma$ is the sigmoid activation function. The update formula of the cell state is:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad (21)$$
where $\odot$ is the Hadamard product and $\tilde{c}_t$ is the candidate cell state. Let $b_c$ and $W_c$ be respectively the bias vector and the weight matrix of the candidate cell gate. Then $\tilde{c}_t$ is computed as:
$$\tilde{c}_t = \varphi(W_c [h_{t-1}, x_t] + b_c), \quad (22)$$
where the activation function $\varphi$ is usually chosen as the hyperbolic tangent. At last, the hidden vector is updated:
$$h_t = y_t \odot \varphi(c_t). \quad (23)$$

For the input information $x_t$, Equation (22) calculates the candidate cell state $\tilde{c}_t$ at time $t$ from $h_{t-1}$ and $x_t$. Equation (21) combines the input gate and the forget gate to update the cell state $c_t$ at time $t$. Equation (23) calculates the hidden layer information at time $t$. Through the combination of gate control units, the LSTM network memorizes long- and short-term information of time series data by continuously updating the cell state at each moment.
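A single LSTM update of this kind can be sketched in a few lines. The weights below are random placeholders, and the 24-step input window and hidden size of 8 are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    # One LSTM update following Equations (18)-(23): the gates i, f, y act on
    # the concatenation [h_{t-1}, x_t]; * below is the Hadamard product.
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(P["Wi"] @ z + P["bi"])           # input gate,  Eq. (18)
    f = sigmoid(P["Wf"] @ z + P["bf"])           # forget gate, Eq. (19)
    y = sigmoid(P["Wy"] @ z + P["by"])           # output gate, Eq. (20)
    c_tilde = np.tanh(P["Wc"] @ z + P["bc"])     # candidate cell state, Eq. (22)
    c = f * c_prev + i * c_tilde                 # cell-state update, Eq. (21)
    h = y * np.tanh(c)                           # hidden-vector update, Eq. (23)
    return h, c

rng = np.random.default_rng(2)
D, Hdim = 1, 8                                   # input dim, hidden size (assumed)
P = {k: rng.normal(scale=0.1, size=(Hdim, Hdim + D)) for k in ("Wi", "Wf", "Wy", "Wc")}
P.update({k: np.zeros(Hdim) for k in ("bi", "bf", "by", "bc")})

h, c = np.zeros(Hdim), np.zeros(Hdim)
for x_t in rng.normal(size=(24, D)):             # run over a 24-step window
    h, c = lstm_step(x_t, h, c, P)
```

In a trained network the parameter dictionary `P` would of course be learned by backpropagation through time rather than drawn at random.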

Model Implementation
The solar radiation observation dataset was collected from a certain part of Jiangsu Province in 2018 and 2019. It can be formed into the matrix $Z = (z_{ij})_{m \times n}$, where $m = 288$ is the total number of recordings per day, and $n = 728$ is the number of days considered over the selected two years.
First of all, in view of the incompleteness of these solar radiation data, the matrix completion method is utilized to infer the missing elements. This procedure can further refine and calibrate the data. In 2018, 363 days are considered and there are 92 pieces of missing data in total. There are 365 days in 2019, and 28 pieces of data are missing. Taking 11 January 2018 as an example, there are 279 recordings and 9 missing values, and Figure 9 illustrates the completed data on that day. In this figure, the blue stars represent the observed values and the red filled circles are the values recovered by matrix completion. As we can see, matrix completion achieves a good recovery performance on that day. In summary, 104,544 and 105,120 recorded values, respectively, are obtained after completion.

The recovered solar radiation data are further divided into four parts according to the four seasons, and RPCA is employed to denoise the completed data of each season. Figure 10 shows the solar radiation waveforms of 20 non-repeated days before and after denoising for each season, where different colors indicate different days. It can be seen that the solar radiation is zero before 6 a.m. and after 7 p.m. in most cases. Importantly, the denoised data make it easier to grasp the real trend of the variation in solar radiation, which is conducive to a better prediction.
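The paper does not detail its completion solver, so as a hedged illustration the following SoftImpute-style iteration (soft-thresholded SVD with observed entries clamped) recovers missing entries of a synthetic low-rank matrix standing in for the radiation matrix $Z$; the dimensions, rank and threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
U, V = rng.normal(size=(60, 3)), rng.normal(size=(3, 40))
Z_true = U @ V                                   # synthetic rank-3 "radiation" matrix
mask = rng.random(Z_true.shape) > 0.1            # ~10% of entries missing
Z_obs = np.where(mask, Z_true, 0.0)

X = Z_obs.copy()
for _ in range(200):
    Uk, s, Vk = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - 0.5, 0.0)                 # soft-threshold singular values
    X = Uk @ np.diag(s) @ Vk                     # low-rank-biased reconstruction
    X = np.where(mask, Z_obs, X)                 # keep observed entries fixed

rel_err = np.linalg.norm((X - Z_true)[~mask]) / np.linalg.norm(Z_true[~mask])
```

Because daily radiation curves are strongly correlated across days, the radiation matrix is approximately low-rank, which is what makes this nuclear-norm-flavored recovery plausible.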
As can be seen from Figures 2 and 3, the differences in solar radiation intensity among the four seasons are particularly striking, and the sub-dataset of each season is disorganized without any seasonal characteristics or periodicity. For the solar radiation of each season, we utilize spectral clustering based on the fusion of k-nearest-neighbor and sparse subspace representation to divide all the days of each season into three clusters according to the solar radiation intensity. In Figure 11, the radiation data of all days in each season are clustered into Cluster 1, Cluster 2 and Cluster 3 according to the solar radiation intensity from low to high. At the upper right of each subplot in Figure 11, the red asterisks, the blue hollow triangles and the green circles stand for Cluster 1, Cluster 2 and Cluster 3, respectively. In spring, Clusters 1, 2 and 3 contain 60, 71 and 49 days, respectively, and in summer they contain 55, 65 and 62 days. In autumn, there are 63, 68 and 51 days in the three clusters, respectively; in winter, the three clusters contain 60, 71 and 49 days, respectively.
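The k-nearest-neighbor half of the clustering step can be sketched as plain spectral clustering: a symmetric kNN affinity, the normalized graph Laplacian, and a small k-means on the leading eigenvectors. The sparse-subspace affinity that the paper fuses in is omitted here, and the three synthetic "intensity" groups are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# 90 synthetic "days" (12 features each) in three intensity groups
days = np.vstack([rng.normal(m, 0.3, size=(30, 12)) for m in (1.0, 3.0, 5.0)])

k, n_clusters = 10, 3
D2 = ((days[:, None] - days[None, :]) ** 2).sum(-1)
A = np.zeros_like(D2)
idx = np.argsort(D2, axis=1)[:, 1:k + 1]         # k nearest neighbours (skip self)
for i, nbrs in enumerate(idx):
    A[i, nbrs] = 1.0
A = np.maximum(A, A.T)                           # symmetrise the kNN graph

d = A.sum(1)
Lap = np.eye(len(A)) - A / np.sqrt(d[:, None] * d[None, :])  # normalised Laplacian
vals, vecs = np.linalg.eigh(Lap)
E = vecs[:, :n_clusters]                         # eigenvectors of smallest eigenvalues
E = E / np.linalg.norm(E, axis=1, keepdims=True)

# tiny k-means on the spectral embedding, farthest-point initialisation
C = [E[0]]
for _ in range(n_clusters - 1):
    dist2 = np.min(((E[:, None] - np.array(C)[None]) ** 2).sum(-1), axis=1)
    C.append(E[np.argmax(dist2)])
C = np.array(C)
for _ in range(50):
    lab = np.argmin(((E[:, None] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([E[lab == j].mean(0) if np.any(lab == j) else C[j]
                  for j in range(n_clusters)])
```

Fusing in a sparse-subspace affinity would simply replace or blend the 0/1 kNN matrix `A` before the Laplacian is formed.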
When neural networks are used for predictions, we choose the solar radiation between 7 a.m. and 6 p.m. every day as the effective input data in order to improve the calculation speed and ensure the validity of the data. To enhance the short-term prediction ability of the proposed model, the solar radiation of every two consecutive hours is selected as a training sample to predict that of the next moment. For each season, the last five days of each cluster are employed to construct the test set for final evaluation, and the remaining days are harnessed to train the neural networks. Attention should be paid to the fact that over-fitting may occur in the training process of a neural network, that is, the training error is small but the generalization error is large. Therefore, in the experiments, we use the regularization technique for BP, RBF and ELM, and the dropout strategy for LSTM to prevent over-fitting.
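The sliding-window sample construction described above can be sketched as follows: with 5-minute recordings, two consecutive hours give a 24-value input window whose target is the next reading. The synthetic one-day series is an assumption for illustration.

```python
import numpy as np

def make_windows(series, window=24):
    # Each sample: `window` consecutive readings; target: the next reading.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array([series[i + window] for i in range(len(series) - window)])
    return X, y

# Synthetic 7 a.m.-6 p.m. day: 11 hours of 5-minute readings (132 values)
radiation = np.sin(np.linspace(0, np.pi, 132)) * 800
X, y = make_windows(radiation)
```

Each row of `X` is one training sample; holding out the last five days of each cluster then amounts to slicing off the final rows before fitting.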

Performance Analysis
In order to verify the effectiveness of the proposed models, this subsection compares four commonly used neural networks, i.e., BP neural networks, RBF neural networks, ELM neural networks, and LSTM neural networks. Two commonly used statistical indices, the root mean square error (RMSE) and the mean absolute error (MAE), are adopted for model validation to quantitatively evaluate the prediction performance. Their formulations are as follows:
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|,$$
where $y_i$ is the actual value of the output data, and $\hat{y}_i$ is the corresponding predicted result. Tables 1-8 list the prediction results of the four neural networks in terms of RMSE and MAE. Tables 9-12 show the determination coefficient $R^2$ of solar radiation prediction by the neural networks with and without clustering for the four seasons. $R^2$ is a measure of how well the regression line represents the data, and a prediction model is more effective as $R^2$ approaches one. Compared with the case without clustering, the $R^2$ of BP in the four seasons is respectively increased by 0.0946, 0.0206, -0.0305 and -0.0365 in the sense of average values. When applying RBF, $R^2$ is raised by 0.0642, -0.0210, -0.0240 and 0.0053. As for ELM, $R^2$ is improved by 0.0441, 0.0065, -0.0165 and 0.0031. With regard to LSTM, $R^2$ respectively went up by 0.0139, -0.0094, -0.0011 and 0.0098. The experimental results in these tables indicate that the proposed forecasting methods can improve the prediction performance of short-term solar radiation in most cases. Due to the added cluster analysis, the four datasets of spring, summer, autumn and winter are divided into three clusters with different irradiation intensities, and the similarity of samples within each cluster is relatively high in general. It can be seen from the aforementioned experimental results that the clustering strategy does improve the prediction accuracy.
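The two evaluation indices translate directly into code; the four actual/predicted values below are made-up numbers purely to demonstrate the formulas.

```python
import numpy as np

y = np.array([820.0, 760.0, 540.0, 310.0])       # actual values (illustrative)
y_hat = np.array([800.0, 780.0, 500.0, 330.0])   # predicted values (illustrative)

rmse = np.sqrt(np.mean((y - y_hat) ** 2))        # -> sqrt(700) ~ 26.46
mae = np.mean(np.abs(y - y_hat))                 # -> 25.0
```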
This observation can be explained by the reasoning that data preprocessing and sample partitions have a favorable impact on short-term solar radiation prediction. Ultimately, through analyzing the prediction results of various artificial neural networks, the proposed methods have indeed improved the prediction accuracy on the whole. These experimental results mean that the hybrid models of machine learning have advantages to some extent.

Conclusions and Outlook
This paper proposes a comprehensive application of machine learning techniques for short-term solar radiation prediction. Firstly, aiming at the missing entries in the solar radiation data, a matrix completion method is used to recover them. Then we denoise the completed data by robust principal component analysis. The denoised data are clustered into low, medium and high intensity types via fusing sparse subspace representation and k-nearest-neighbor. Subsequently, four commonly used neural networks (BP, RBF, ELM and LSTM) are adopted to predict the solar radiation. In order to quantitatively verify the performance of the prediction models, the RMSE and MAE indicators are applied for model evaluation. The experimental results show that the hybrid model can improve the solar radiation prediction accuracy.
In future research work, we will try to improve the model in the following respects to enhance its prediction ability. Multi-step-ahead prediction is necessary in practice, and we plan to develop the corresponding forecasting models through an ensemble of machine learning techniques and signal decomposition methods. In establishing the prediction model, the only input meteorological element used in this paper is the global horizontal irradiance. In fact, there are many other elements that affect solar radiation, such as the variation of daily temperature and precipitation. The influence of multiple elements on solar radiation will be considered and analyzed so as to improve the prediction ability. Furthermore, this paper only merges a few machine learning techniques into the forecasting of solar radiation. In particular, deep learning models have a powerful representative ability, and their further application to forecasting solar radiation is very promising.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to their confidentiality.