A Review on Deep Learning Models for Forecasting Time Series Data of Solar Irradiance and Photovoltaic Power

Abstract: Presently, deep learning models are an alternative solution for predicting solar energy because of their accuracy. The present study reviews deep learning models for handling time-series data to predict solar irradiance and photovoltaic (PV) power. We selected three standalone models and one hybrid model for the discussion, namely, recurrent neural network (RNN), long short-term memory (LSTM), gated recurrent unit (GRU), and convolutional neural network-LSTM (CNN–LSTM). The selected models were compared based on accuracy, input data, forecasting horizon, type of season and weather, and training time. The performance analysis shows that these models have their strengths and limitations in different conditions. Generally, among the standalone models, LSTM shows the best performance in terms of the root-mean-square error (RMSE) evaluation metric. On the other hand, the hybrid model (CNN–LSTM) outperforms the three standalone models, although it requires a longer training time. The most significant finding is that the deep learning models of interest are more suitable for predicting solar irradiance and PV power than other conventional machine learning models. Additionally, we recommend using the relative RMSE as the representative evaluation metric to facilitate accuracy comparison between studies.


Introduction
Solar energy is a popular renewable energy source because it is abundant and environment-friendly. The amount of solar energy incident on the earth's surface is approximately 1.5 × 10^18 kWh/year, which is approximately 10,000 times the current annual energy consumption of the entire world [1]. Therefore, in recent years, solar photovoltaic (PV) has played a significant role in electricity generation. A challenging issue associated with solar PV is that its power output strongly depends on uncertain and uncontrollable meteorological factors, such as atmospheric temperature, wind, pressure, and humidity [2]. As the solar PV capacity increases, the risks caused by the uncontrollable nature of PV power increase. Energy storage could mitigate such risks, but its drawbacks are the installation and management costs. In contrast, solar irradiance forecasting is an inexpensive and immediate solution that is effective in microgrid operation optimization, such as peak shaving, uncertainty impact reduction, and economic dispatch in the power system [3].
Generally, solar irradiance can be forecast from very short terms (several minutes ahead) to long terms (several days ahead); the required time horizon changes with the application. For the very short-term and short-term forecast horizons, sudden variations of solar irradiance, namely the ramp events, are of interest. Abrupt and severe variations of solar irradiance have the potential to degrade the reliability and quality of PV power. However, short-term local weather and atmospheric conditions have stochastic features, and consequently such variations are very difficult to predict.
The variability of solar irradiance depends on the time scale. A single datum represents the mean of more frequent measurement records over a given time period. For instance, an hourly value of global horizontal irradiance (GHI) may be acquired from 60 pyranometer records at one-minute intervals. Therefore, a long time interval results in a smoothing effect, and the variability lessens. Figure 1 shows the smoothing effect on the solar irradiance variability. The data were recorded every minute at Kookmin University, South Korea, and are presented at different time scales. The finest temporal resolution of one minute reveals a high degree of variability, but the variability smooths out as the time scale increases. It should be noted that the smoothing effect also appears when spatial averaging over a large area is applied.
The variability of solar irradiance directly affects solar PV power systems' performance and more seriously disturbs grid stability. The grid typically absorbs power fluctuations at short time scales as fluctuations in frequency and voltage. Grid operators can therefore control the power ramp rate by imposing a limit on PV plants, for instance, 10% of the nameplate capacity per minute. In addition, grid operators need to balance supply and demand while complying with power system regulations. If the power supply exceeds demand, curtailment can be applied to disconnect power delivery from PV plants to the grid. Solar irradiance or PV power forecasting can optimize the ramp rate control and schedule the load-following operation more effectively with an energy storage system.
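The smoothing effect and the ramp-rate limit described above can be illustrated with a short sketch. The one-minute series below is synthetic (it is not the Kookmin University record), the plant model that scales power linearly with GHI is an assumption, and the helper name `block_stats` is hypothetical.

```python
import math
import random

def block_stats(values, window):
    """Return (mean, min, max) for consecutive non-overlapping blocks."""
    stats = []
    for start in range(0, len(values) - window + 1, window):
        block = values[start:start + window]
        stats.append((sum(block) / window, min(block), max(block)))
    return stats

random.seed(0)
minutes = 360
# Synthetic one-minute GHI (W/m^2): a smooth arc with random cloud attenuation.
ghi_1min = [800.0 * math.sin(math.pi * t / minutes) * random.uniform(0.4, 1.0)
            for t in range(minutes)]

# Hourly averaging: one mean value replaces 60 one-minute records,
# hiding the spread between the minimum and maximum within each hour.
hourly = block_stats(ghi_1min, 60)
for hour, (mean, low, high) in enumerate(hourly):
    print(f"hour {hour}: mean={mean:6.1f}  min={low:6.1f}  max={high:6.1f}")

# Ramp-rate check against a limit of 10% of nameplate capacity per minute.
# Assumption: a hypothetical 1000 kW plant whose output scales with GHI.
nameplate_kw = 1000.0
power_kw = [nameplate_kw * g / 1000.0 for g in ghi_1min]
limit_kw = 0.10 * nameplate_kw
violations = sum(1 for a, b in zip(power_kw, power_kw[1:])
                 if abs(b - a) > limit_kw)
print("one-minute ramps above the limit:", violations)
```

Each hourly mean lies between the minimum and maximum of its one-minute records, so the hourly series cannot show the ramps that the one-minute series contains.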
Various solar forecasting techniques have their own applicable regimes in terms of time scale. Thus, depending on the solar irradiance variability of interest, an appropriate technique must be selected. Figure 2 illustrates a guideline for selecting forecasting techniques. For time horizons less than one hour, sky-image-based techniques offer very good forecasting capability [10]. Satellite-image-based techniques are recommended for forecasting several hours ahead, with a spatial resolution of around 1-5 km [11]. Numerical weather prediction (NWP) allows long-term forecasting from 1 day up to 15 days ahead, but its time step and update interval are large. On the other hand, time series forecasting using statistical models covers many applications from short to long terms with fine time steps [12][13][14]. As an advanced version of statistical models, deep learning models for sequential data are anticipated to expand the regime.

Deep Learning Models
This section provides an overview of the theoretical background of the selected deep learning models. We describe a neural network and activation function because they are fundamental for all types of neural network algorithms, although ANN is not in the scope of this review.

Neural Network
The neural network is inspired by the structure of the human brain and significantly contributed to machine learning technology development. It is a simplified mathematical model to solve various nonlinear problems. In the past, researchers reviewed ANN models in predicting solar energy, such as Yadav et al. [15] for solar radiation prediction, Mellit et al. [16] for PV applications, and Cheon et al. [17] for solar energy forecasting. One conclusion from these studies is that the ANN model predicts solar radiation more accurately than other conventional models, such as the Angstrom, conventional, linear, nonlinear, and fuzzy logic models. The neural network model comprises the input, hidden, and output layers with auxiliary components, such as neurons, weight, bias, and activation functions. Figure 3 shows the basic neural network architecture with a multilayer perceptron. The input layer receives input values, and the hidden layer analyzes the input values. The output layer collects the data from the hidden layer and decides the output. In the learning process, the neural network modifies its structure to get the same reference or set point as the supervisor. The training process will be repeated until the difference between the neural network output and the supervisor lies within an acceptable range [18].

The basic ANN mathematical formula can be expressed as

$$A_n = \sum_{j=1}^{n} w_j I_j + b \qquad (1)$$

where $A_n$ is the output, $n$ is the number of inputs, $w_j$ is the weight, $I_j$ is the input, and $b$ is the bias. The output value varies with the activation function. The activation function, also known as the transfer function, is a mathematical equation that determines a neuron's output. Activation functions can be divided into two types, namely, linear and nonlinear functions. A linear activation function produces an output proportional to its input, so the relationship between the input and output layers remains linear.
However, such a linear relationship is not enough for practical applications because the problems involve complex information and various parameters, such as images, video, text, and sound. A neural network with a nonlinear activation function can overcome the limitations of the linear activation function. The commonly used activation functions are shown in Table 1. Note that the rectified linear unit (ReLU) and leaky ReLU are examples of nonlinear activation functions because the slope is not constant for all values. For ReLU in particular, the slope is 0 for negative values and 1 for positive values.

Table 1. Activation functions of the neural network (columns: activation function, equation, and plot).
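As a minimal sketch of the neuron output and the activation functions discussed above, the following computes the weighted sum of inputs plus bias and passes it through several common activations. The input values, weights, and the leaky ReLU slope of 0.01 are illustrative assumptions, and the function names are hypothetical.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias: sum_j(w_j * I_j) + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def linear(a):
    return a

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def relu(a):
    # Slope 0 for negative values, slope 1 for positive values.
    return max(0.0, a)

def leaky_relu(a, slope=0.01):
    # The small negative-side slope (0.01) is an assumed default.
    return a if a > 0 else slope * a

# Illustrative inputs, weights, and bias (not taken from any dataset).
a = neuron_output([0.5, -1.0, 2.0], [0.2, 0.4, -0.1], bias=0.1)

for name, fn in [("linear", linear), ("sigmoid", sigmoid),
                 ("tanh", math.tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu)]:
    print(f"{name:>10}: {fn(a):.4f}")
```

The same pre-activation value yields different neuron outputs depending on the activation function, which is why the choice of activation matters.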

Energies 2020, 13, 6623

RNN
RNNs are specially designed for analyzing sequential data and have been successfully used in fields such as speech recognition, machine translation, and image captioning [19]. RNN processes sequence data by elements and preserves a state to represent the information at time steps [20]. A traditional neural network assumes that all units of the input vectors are independent. Consequently, the traditional neural network is ineffective for predicting using sequential data.
The architecture of RNN with three main components (input, hidden neuron, and activation function) is shown in Figure 4. The hidden state ($h_t$) can be formulated as

$$h_t = \tanh(U x_t + W h_{t-1}) \qquad (2)$$

where $x_t$ is the input at time $t$, $h_t$ is the hidden neuron at time $t$, $U$ is the weight of the hidden layer, and $W$ is the transition weights of the hidden layer. The current input and the previous hidden state are combined and passed through the tanh function. The output is the new hidden state, which acts as the memory of the network because it holds information from the previous steps.
Training regular RNNs can be challenging because of vanishing and exploding gradient problems. The exploding gradient can be handled by truncating backpropagation at a certain point; however, the result is not optimal because not all the weights are updated. The vanishing gradient can be mitigated by initializing the weights carefully. An alternative treatment for both problems is to use LSTM, which we discuss next.
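The recurrence described above can be sketched in a few lines. This is a minimal illustration assuming a hidden state computed as tanh(U x_t + W h_{t-1}) with no bias term; the weight values and the input sequence are made up, and `matvec` and `rnn_step` are hypothetical helper names.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, U, W):
    """One recurrent step: h_t = tanh(U x_t + W h_{t-1})."""
    ux = matvec(U, x_t)
    wh = matvec(W, h_prev)
    return [math.tanh(a + b) for a, b in zip(ux, wh)]

# Illustrative weights: 2 hidden units, 2 input features per time step.
U = [[0.5, -0.3], [0.1, 0.8]]   # input-to-hidden weights
W = [[0.2, 0.0], [-0.4, 0.3]]   # hidden-to-hidden (transition) weights

sequence = [[0.1, 0.7], [0.2, 0.6], [0.4, 0.5]]
h = [0.0, 0.0]                  # initial hidden state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h, U, W)  # the new state depends on all earlier inputs
    states.append(h)

print(states[-1])
```

Because the same `U` and `W` are reused at every step, gradients are multiplied through `W` repeatedly during backpropagation through time, which is the origin of the vanishing and exploding gradient problems.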

LSTM
LSTM is a type of RNN proposed by Hochreiter et al. [21] to learn the long-term dependence of information. An LSTM has a similar flow to an RNN; the difference is the operation inside the cells. An LSTM unit comprises forget, input, and output gates (Figure 5). The forget gate decides which information should be thrown away or kept. The input gate updates the cell state, and the output gate decides the next hidden state. Furthermore, LSTM has an internal memory unit and a gate mechanism to overcome both the vanishing gradient and exploding gradient problems in the training process of RNN [22].

The calculation formulas related to the LSTM structure in Figure 5 are as follows:

$$f_t = \sigma(W_f X_t + U_f h_{t-1} + b_f) \qquad (3)$$

$$i_t = \sigma(W_i X_t + U_i h_{t-1} + b_i) \qquad (4)$$

$$o_t = \sigma(W_o X_t + U_o h_{t-1} + b_o) \qquad (5)$$

$$S_t = \tanh(W_c X_t + U_c h_{t-1} + b_c) \qquad (6)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot S_t \qquad (7)$$

$$h_t = o_t \odot \tanh(C_t) \qquad (8)$$

The mathematical symbols in the above equations are as follows:
1. $X_t$ is the input vector to the memory cell at time $t$.
2. $W_f$, $W_i$, $W_o$, $W_c$, $U_f$, $U_i$, $U_o$, and $U_c$ are weight matrices.
3. $b_f$, $b_i$, $b_o$, and $b_c$ are bias vectors.
4. $h_t$ is the value of the memory cell at time $t$.
5. $S_t$ and $C_t$ are the values of the candidate state of the memory cell and the state of the memory cell at time $t$, respectively.
6. $\sigma$ and tanh are the activation functions.
7. $i_t$, $f_t$, and $o_t$ are the values of the input gate, the forget gate, and the output gate at time $t$.

The forget gate ($f_t$), input gate ($i_t$), and output gate ($o_t$) in Equations (3), (4), and (5) have values from 0 to 1 through the sigmoid function ($\sigma$). A value of 1 means that all input information passes through the gate, whereas a value of 0 means that no input information passes [23]. The candidate state of the memory cell in Equation (6) represents the new information at time $t$, and its output through the tanh function has a value between −1 and 1. The state of the memory cell, controlled by the forget and input gates, is calculated as $C_t$ at time $t$ (Equation (7)). The selected values are converted into the output by multiplying them by $o_t$, and the output becomes $h_t$ (Equation (8)).
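The gate mechanism described above can be sketched for a single LSTM unit with one input feature. This assumes the standard LSTM formulation (sigmoid gates, a tanh candidate state, and a cell state formed from the forget and input gates); the parameter names in the dictionary and all numeric values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell (scalar input and state)."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])    # forget gate
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])    # input gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])    # output gate
    s_t = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])  # candidate state
    c_t = f_t * c_prev + i_t * s_t   # keep part of the old state, add new information
    h_t = o_t * math.tanh(c_t)       # expose part of the cell state as output
    return h_t, c_t

# Illustrative parameters (assumed values, not trained weights).
params = {"W_f": 0.5, "U_f": 0.1, "b_f": 1.0,
          "W_i": 0.6, "U_i": -0.2, "b_i": 0.0,
          "W_o": 0.4, "U_o": 0.3, "b_o": 0.0,
          "W_c": 0.9, "U_c": 0.5, "b_c": 0.0}

h, c = 0.0, 0.0
for x_t in [0.2, 0.5, 0.9, 0.4]:     # e.g. normalized irradiance values
    h, c = lstm_step(x_t, h, c, params)
    print(f"x={x_t:.1f}  h={h:.4f}  c={c:.4f}")
```

The cell state `c` is updated additively, which is what allows information to persist over many time steps.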

GRU
Cho et al. [24] first proposed GRU as a simpler RNN architecture than LSTM, which makes it easier to compute and implement. GRU is similar to LSTM in terms of remembering valuable information and capturing long-term dependencies. The strength of GRU is that it is computationally more efficient and less complex because it has fewer parameters than LSTM [25]. GRU has only two gates, namely, a reset gate and an update gate. The update gate plays a role similar to that of the forget and input gates in LSTM because it selects what information should be stored or erased. Meanwhile, the reset gate decides the amount of past information that must be forgotten. With fewer gates and parameters, GRU trains faster than LSTM.
The structure of GRU is shown in Figure 6. In the relationship between the input and output for GRU, $r_t$ is the reset gate, $Z_t$ is the update gate, $A_t$ is the memory content, $\sigma$ and tanh are the activation functions, and $h_t$ is the final memory at the current time step. The reset gate ($r_t$) and update gate ($Z_t$) have values from 0 to 1 through the sigmoid function ($\sigma$) in Equations (9) and (10). Meanwhile, the memory content ($A_t$), which uses the reset gate to store the relevant information from the past, has a value between −1 and 1 through tanh.

Figure 6. The structure of the gated recurrent unit network.
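A single-unit GRU step can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulation of this review: it follows the convention of Cho et al., in which the update gate weights the previous state and its complement weights the memory content (some implementations swap these roles), and it omits bias terms. The parameter names and values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One step of a single-unit GRU (scalar input and state, no biases)."""
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev)            # reset gate
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev)            # update gate
    # Memory content: the reset gate scales how much of the past is used.
    a_t = math.tanh(p["W_a"] * x_t + p["U_a"] * (r_t * h_prev))
    # Final memory: blend of the previous state and the memory content.
    return z_t * h_prev + (1.0 - z_t) * a_t

# Illustrative parameters (assumed values, not trained weights).
params = {"W_r": 0.4, "U_r": 0.2,
          "W_z": 0.7, "U_z": -0.1,
          "W_a": 0.9, "U_a": 0.5}

h = 0.0
for x_t in [0.2, 0.5, 0.9, 0.4]:
    h = gru_step(x_t, h, params)
    print(f"x={x_t:.1f}  h={h:.4f}")
```

Compared with an LSTM unit, this step has three weight pairs instead of four and no separate cell state, which is consistent with the shorter training time noted above.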

The Hybrid Model (CNN-LSTM)
CNN is a deep learning algorithm designed for spatial inputs. Like other neural networks, CNN neurons have learnable weights and biases. However, CNN is mainly used for processing data with a grid topology, which gives its architecture a specific characteristic [26].
CNN is a feedforward network because information flows in one direction only, that is, from the inputs to the outputs [27]. The CNN model uses three main layers, namely, the convolutional, pooling, and fully connected layers (Figure 7). The convolutional and pooling layers are used to reduce the computational complexity. Meanwhile, the fully connected layer is the flattened layer connected to the output. Various pooling techniques are available in the architecture of CNN. However, max pooling is the most widely used in CNN layers; each pooling window outputs the maximum value among its elements [28].
CNN-LSTM was developed for visual time series prediction problems and for generating textual descriptions from sequences of images. The CNN-LSTM architecture uses CNN layers for feature extraction on the input data and combines them with LSTM to support sequence prediction. Specifically, CNN extracts the features from spatial inputs, and the LSTM architecture uses them to output the caption. The architecture of the CNN-LSTM model is illustrated in Figure 8.
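The convolution and max pooling operations described above can be sketched for a one-dimensional series. This covers only the feature-extraction stage; in a CNN-LSTM, the pooled features would then be passed to LSTM layers. The series, the kernel, and the function names are illustrative assumptions.

```python
def conv1d(series, kernel):
    """Valid 1-D convolution as used in CNN layers (no kernel flipping)."""
    k = len(kernel)
    return [sum(kernel[j] * series[i + j] for j in range(k))
            for i in range(len(series) - k + 1)]

def max_pool1d(features, window):
    """Keep the maximum of each non-overlapping window."""
    return [max(features[i:i + window])
            for i in range(0, len(features) - window + 1, window)]

# Illustrative normalized irradiance-like series.
series = [0.0, 0.2, 0.6, 0.9, 0.4, 0.5, 0.8, 0.3, 0.1]
# Hypothetical kernel that responds to rising values across three samples.
kernel = [-1.0, 0.0, 1.0]

features = conv1d(series, kernel)   # 9 samples -> 7 feature values
pooled = max_pool1d(features, 2)    # 7 feature values -> 3 pooled values

print("features:", [round(v, 2) for v in features])
print("pooled:  ", [round(v, 2) for v in pooled])
```

Pooling shortens the feature sequence, which reduces the number of values the subsequent layers have to process.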

This hybrid model has been applied to many problems, such as rod pumping [29], particulate matter [30], waterworks [31], and heart rate signals [32]. Studies have demonstrated promising results; for example, Xingjian et al. [33] predicted the future rainfall intensity in a local region over a relatively short period. The experiments show that the CNN-LSTM network captures spatiotemporal correlations better and consistently outperforms the fully connected LSTM (FC-LSTM) model for precipitation forecasting.

Evaluation Metrics
Evaluation metrics are critical for explaining the forecast performance of deep learning models [34]. The metrics provide feedback regarding the accuracy of the forecasting to improve the models until a desirable accuracy is achieved. Various evaluation metrics are available for calculating the accuracy of prediction. The typical evaluation metrics for solar irradiance and PV power forecasting are summarized in Table 2. Here, Ppred, Pmeas, and n represent the forecasted values at each time, the measured values at each time, and the number of sample data for the period, respectively.

Mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions using the absolute value. If the absolute value is removed, the metric becomes the mean bias error (MBE), which captures the average bias in the prediction, such that positive and negative values represent overprediction and underprediction, respectively. The root-mean-square error (RMSE) measures the deviation from the measurement, and thus, the smaller, the better. When mean values vary with location or system, a direct comparison of evaluation metrics could be misleading. In such cases, percentage or relative metrics such as the mean absolute percentage error (MAPE) and relative root-mean-square error (rRMSE) are more useful. Forecasting skill measures the improvement of the forecasting model over the persistence model in terms of RMSE, where the persistence model assumes that the atmospheric conditions are stationary. A positive forecasting skill value means that the model outperforms the persistence model. Note that measurement data include uncertainty by nature, and thus, the true values are unknown. Hence, some researchers prefer to use differences, such as the relative root-mean-square difference (rRMSD) and relative mean bias difference (rMBD) [35].
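The metrics described above can be sketched in a few lines of plain Python. The equations of Table 2 are not reproduced in this passage, so commonly used definitions are assumed. In particular, normalizing rRMSE by the mean of the measured values and defining forecasting skill as 1 − RMSE(model)/RMSE(persistence) are assumptions, since some studies normalize by installed capacity or by the data range instead. The sample values are made up.

```python
import math


def mae(pred, meas):
    """Mean absolute error."""
    return sum(abs(p - m) for p, m in zip(pred, meas)) / len(meas)


def mbe(pred, meas):
    """Mean bias error; positive values indicate overprediction."""
    return sum(p - m for p, m in zip(pred, meas)) / len(meas)


def rmse(pred, meas):
    """Root-mean-square error."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, meas)) / len(meas))


def mape(pred, meas):
    """Mean absolute percentage error (%); undefined when a measurement is zero."""
    return 100.0 * sum(abs((p - m) / m) for p, m in zip(pred, meas)) / len(meas)


def rrmse(pred, meas):
    """Relative RMSE (%), normalized by the mean measured value (assumed convention)."""
    return 100.0 * rmse(pred, meas) / (sum(meas) / len(meas))


def forecast_skill(pred, persistence, meas):
    """Improvement over persistence in terms of RMSE; positive means better."""
    return 1.0 - rmse(pred, meas) / rmse(persistence, meas)


# Made-up values (W/m^2): measurements, model forecasts, and persistence forecasts.
meas = [200.0, 400.0, 600.0, 500.0]
pred = [220.0, 380.0, 630.0, 490.0]
persist = [150.0, 200.0, 400.0, 600.0]  # previous measurement carried forward

print(mae(pred, meas), mbe(pred, meas), rmse(pred, meas))
print(mape(pred, meas), rrmse(pred, meas), forecast_skill(pred, persist, meas))
```

Note that MAPE divides by each measured value, so night-time samples with zero irradiance or power have to be excluded or handled separately before it is computed.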

Analysis of Past Studies
In this section, we present notable findings in forecasting solar irradiance and PV power after analyzing the published studies based on RNN, LSTM, GRU, and CNN-LSTM hybrid models. In total, 35 papers from 2005 to 2020 were collected and plotted by the publication year (Figure 9).

The proportion of publications by deep learning model for predicting solar irradiance and PV power is shown in Figure 10. In both cases, LSTM accounts for most publications, followed by RNN and GRU. Meanwhile, the CNN-LSTM hybrid model has the smallest share. LSTM is more popular than the other standalone models because it provides promising accuracy in solar energy forecasting. CNN-LSTM also performs better than the other models. However, the percentage of publications using this model is smaller because CNN-LSTM is a relatively new model in solar energy forecasting.
It should be noted that we reviewed solar irradiance and PV power separately because their units and value ranges are different. Solar irradiance refers to the amount of solar radiation per unit area, whereas PV power refers to the electrical power generated from solar radiation by the PV cells in a solar panel. The total solar radiation at the top of the earth's atmosphere is approximately 1360 W/m², called the solar constant. This value is attenuated on the way to the earth's surface through a complex series of reflections, absorptions, and re-emissions. Solar irradiance fluctuates because it is affected by several factors such as atmospheric conditions, geographic location, season, and time of day. Although the amount of power generated by PV at a particular location depends on how much solar irradiance reaches it, the PV power output also depends on the solar panel's size and efficiency. Therefore, it is essential to describe the specification of the solar panel to obtain accurate PV output information.

Accuracy
The prediction accuracy is the most critical factor in selecting a forecast model. We chose RMSE as the basic evaluation metric because it is the most popular in solar energy forecasting. Because the mean value of solar irradiance or PV power differs with factors such as location and system size, rRMSE is a better measure for comparing accuracy between studies. Unfortunately, only a few studies reported rRMSE. We selected the best accuracy out of all the studied cases. Tables 3 and 4 list the accuracy of the papers, including forecast horizon, time interval, input parameters, and the size of PV systems in the case of PV forecasting.
Regarding solar irradiance forecasting in Table 3, the forecast horizon ranges from 5 min to 1 day ahead. Meanwhile, the time interval of the data varies from 5 min to 1 h. Generally, the CNN-LSTM hybrid model performs better in predicting solar irradiance for one day-ahead forecasting. Ghimire et al. [36] reported the best performance of all the studies, with an RMSE of 8.189 W/m². For LSTM and GRU, the best performance was obtained when predicting solar irradiance 10 min ahead at 5 min intervals. However, RNN tends to result in lower accuracy than the other models for day-ahead hourly forecasting.
In this case, the main factor causing changes in solar irradiation is the presence of clouds. Unfortunately, only some studies described the sky conditions. For example, Wang et al. [25] demonstrated that the CNN-LSTM hybrid outperformed the single LSTM in predicting one day-ahead solar irradiation at 15 min intervals on sunny days, with a small error where the RMSE is less than 34 W/m². Meanwhile, Niu et al. [37] reported the largest error in forecasting solar irradiance using the RNN model, with an RMSE of 195 W/m²; however, the sky conditions are not mentioned in that study. Hence, further study of deep learning models for predicting solar irradiance is required to draw firm conclusions.
The performances for PV power forecasting are listed in Table 4. Most publications focused on predicting PV power in intra-hour forecast horizons. The RMSE varies widely, from 0.044 to 15,290 kW, because PV power generation is proportional to system size. Accuracy comparison is therefore more difficult in PV power forecasting than in solar irradiance forecasting, which suggests an increasing need for evaluation based on rRMSE. For the LSTM models of Zhang et al. [46], Wang et al. [47], and Li et al. [48], the RMSE increases as the PV size increases. The values are 0.139, 0.398, and 0.885 kW for PV sizes of 60.00, 153.48, and 199.16 m², respectively.
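As a small worked example of why size-dependent errors are hard to compare, the sketch below divides the three RMSE values quoted above by the corresponding PV area. This area-normalized value is only a rough proxy chosen for illustration. It is not the rRMSE, which would require the mean measured power of each system, and that is not given here. The variable names are hypothetical.

```python
# (PV area in m^2, reported RMSE in kW), as quoted in the text above.
cases = [(60.00, 0.139), (153.48, 0.398), (199.16, 0.885)]

# RMSE per unit panel area, converted from kW/m^2 to W/m^2.
per_area = [1000.0 * rmse_kw / area_m2 for area_m2, rmse_kw in cases]

for (area_m2, rmse_kw), value in zip(cases, per_area):
    print(f"{area_m2:7.2f} m^2  RMSE {rmse_kw:.3f} kW  ->  {value:.2f} W/m^2")
```

The normalized values come out at roughly 2.32, 2.59, and 4.44 W/m², a much narrower spread than the raw RMSE values, which differ by a factor of more than six.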

Types of Input Data
Forecasting models can be classified by the type of input data as endogenous or exogenous. In the endogenous model, the type of input and output data is identical. That is, historical PV power data are used for forecasting future PV power. However, the exogenous model uses other data types, such as ambient temperature, humidity, wind speed, wind direction, and sun position, in addition to the type of output data. It should be noted that the cited references used historical data as input rather than numerical weather prediction data. The time period of the input data is also summarized in Tables 3 and 4. The number of publications by exogenous and endogenous inputs and by forecast horizon is illustrated in Figure 11. For the intra-hour and day-ahead forecast horizons, endogenous models outnumber exogenous models.
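The difference between the two input types can be sketched as the way training samples are built from a time series. In the minimal example below, the function name `make_windows`, its parameters, and all data values are hypothetical and chosen only for illustration. Passing no extra variables gives an endogenous model, whereas passing weather variables alongside the target history gives an exogenous one.

```python
def make_windows(target, lookback, horizon, exog=None):
    """Build (input, output) pairs from a time series.

    target   : past values of the forecast variable (e.g., PV power)
    lookback : number of past steps used as input
    horizon  : number of steps ahead to predict
    exog     : optional per-step tuples of other variables (e.g., temperature,
               humidity); None gives an endogenous model
    """
    samples = []
    for t in range(lookback, len(target) - horizon + 1):
        history = target[t - lookback:t]
        if exog is None:
            # Endogenous: each input step holds only the target variable.
            features = [(y,) for y in history]
        else:
            # Exogenous: each input step holds the target plus other variables.
            features = [(y,) + tuple(e)
                        for y, e in zip(history, exog[t - lookback:t])]
        samples.append((features, target[t + horizon - 1]))
    return samples


# Made-up PV power (kW) and (temperature in deg C, relative humidity) per step.
pv_power = [0.0, 1.2, 2.5, 3.1, 2.8, 1.9]
weather = [(18.0, 0.60), (20.0, 0.55), (23.0, 0.50),
           (25.0, 0.45), (24.0, 0.50), (21.0, 0.58)]

endogenous = make_windows(pv_power, lookback=3, horizon=1)
exogenous = make_windows(pv_power, lookback=3, horizon=1, exog=weather)
```

Both calls produce the same number of samples and the same outputs; only the width of each input step differs.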
Li et al. [55] presented RNN and LSTM models using endogenous inputs for predicting PV power output in the very short term. They used both PV power data from the previous day and previous forecasting data as the input. The results demonstrated that, in terms of the average RMSE, MAPE, and MAE, these models outperform other models such as SVM, radial basis function (RBF), back propagation neural network (BPNN), and persistence in the 15 and 30 min forecasting horizons.
Husein et al. [3] proposed an LSTM model using exogenous inputs to forecast day-ahead solar irradiance. The model used dry bulb temperature, dewpoint temperature, and relative humidity from six locations as input. The results showed that LSTM outperforms the feedforward neural network (FFNN) for data from all locations. They also simulated a one-year operation of a commercial building microgrid using the actual and forecasted solar irradiance, and the results showed that using the forecasting approach increases the annual energy savings by 2% compared with FFNN.
We investigated the exogenous input parameters from 25 publications (Figure 12). The less frequently used parameters include visibility, sky images, column ice water, column liquid water, and pressure. In contrast, temperature, humidity, and wind speed are used more frequently because they are easier to collect than other parameters.
The period of historical data also affects the performance of the prediction. If the time series is too short, there is too little information for learning, and if the time series is too long, the complexity of the algorithm increases. However, an increase in time series data does not guarantee better performance. Wang et al. [47] compared the errors of the LSTM and CNN-LSTM according to input sequences (Table 5). The results show that errors increase from 0.5 to 2 years of input time series data. The best accuracy was observed when three years of input sequences were used. Historical data of more than three years degraded performance, implying that the period of input data must be optimized.

Forecast Horizon
The forecast horizon is the length of time into the future for which a model can predict, and it strongly influences the performances and characteristics of the forecast. Forecasting horizons can be divided into four types [59]:

2. Short-term forecasting (1 h or several hours ahead to 1 day or 1 week ahead).
Yan et al. [45] studied the performance of LSTM and GRU in predicting solar irradiance for very short-term forecasting. The experiments were conducted for forecast horizons of 5, 10, 20, and 30 min in each of the four seasons. The RMSE in Table 6 shows that the smallest error of LSTM and GRU occurred in winter for the 10 min forecast horizon. Generally, the two models showed a gradual increase of errors for the 10 to 30 min solar irradiance forecasting.
Ghimire et al. [36] used the CNN-LSTM hybrid model and three standalone models for solar irradiance forecasting. The forecasting errors for those models are listed in Table 7. CNN-LSTM was applied to 30 min interval data to predict solar irradiance for forecast horizons from 1 day up to 1 month, with errors measured by RMSE and MAE. Compared with RNN, LSTM, and GRU, the hybrid model performs better in predicting solar irradiance in all forecasting horizons.

Type of Season and Weather
The accuracy of deep learning models has been evaluated for different seasons and types of weather. For example, Li et al. [48] used RNN, LSTM, and GRU models for short-term PV power forecasting. The performance evaluations of the different models for each season and type of weather are presented in Table 8. Except for winter, the LSTM model generally outperforms the two other models. In winter, the GRU model is better than LSTM and RNN for all types of weather.

Training Time
Deep learning with many parameters requires distributed training, where training time is critical [60]. Training time is the time a deep learning model needs to reach the desired level of accuracy. Each deep learning model requires a different training time to reach its best performance. Hence, in this section, we compare the training time of each model to determine which model is more efficient in forecasting solar irradiation and PV power.
Wang et al. [25] conducted forecasting experiments for LSTM and GRU to measure the training time in the best, worst, and average cases. The training times of LSTM and GRU in the three cases are presented in Table 9, indicating that GRU trains faster than LSTM because the longest training time of GRU is shorter than the best training time of LSTM.

Comparison with Other Models
In this section, RNN, LSTM, GRU, and CNN-LSTM are compared in Tables 11-15 with other machine learning models and deep learning models. Pang et al. [61] proposed a hierarchical approach to predict solar irradiance using ANN and RNN in Tuscaloosa, Alabama, USA. The data from 22 to 28 May 2016 were used to predict solar irradiance at three time intervals (10, 30, and 60 min) (Table 12). RNN outperformed ANN across the sampling intervals, and the best result was obtained when the shortest horizon was considered. In the case of LSTM, Alzahrani et al. [62] studied a machine learning approach in Chicago, USA, where various prediction models, including FFNN and support vector regression (SVR), were compared on a test dataset. The data were collected over four days (24 March, 8 February, 8 October, and 12 August) and split into training, testing, and validation sets in portions of 70%, 15%, and 15%, respectively. The results in Table 13 show that LSTM performs better, achieving an RMSE of 0.086 W/m², whereas FFNN and SVR obtained 0.160 and 0.110 W/m², respectively.
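A 70%/15%/15% split of the kind described above can be sketched as follows. The cited study does not state here how the samples were ordered, so keeping the series in chronological order (no shuffling) is an assumption, made because it avoids training on data that come after the test period. The function name and the placeholder data are hypothetical.

```python
def split_series(data, train_frac=0.70, test_frac=0.15):
    """Split a series into training, testing, and validation parts, in order."""
    n = len(data)
    n_train = round(n * train_frac)
    n_test = round(n * test_frac)
    train = data[:n_train]
    test = data[n_train:n_train + n_test]
    validation = data[n_train + n_test:]  # the remainder, about 15%
    return train, test, validation


series = list(range(100))  # placeholder for 100 time-ordered samples
train, test, validation = split_series(series)
print(len(train), len(test), len(validation))
```

The same function covers a two-way split such as 75%/25% by setting `train_frac=0.75` and `test_frac=0.25`, in which case the validation part is empty.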
Lee et al. [63] compared CNN-LSTM, random forest regression (RFR), and SVR. They considered two sets of inputs: PV measurements combined with weather values from a nearby meteorological station, and past PV measurements only. They collected a time series of 18,620 hourly data points and split it into training and testing sets of 75% and 25%, respectively. The results for the first case show that CNN-LSTM has the lowest error metrics, with an RMSE of 0.098 kW. However, when the weather data are excluded, SVR is better than RFR and CNN-LSTM, with an RMSE of 0.126 kW (Table 14).
Li et al. [48] compared GRU and multilayer perceptron (MLP) in different seasons (Table 15). They collected the data at DKASC, Alice Springs, Australia. The generation of the PV system is 26.500 kW. In this case, they used the data from 1 June 2014 to 31 May 2015 as the training dataset, and the data for 12 June 2016 were used as the testing dataset. The results show that MLP performs better than GRU only in the autumn, with a value of 1061 kW. However, the average value of GRU is more representative of all the seasons than that of MLP.

Conclusions
This paper introduced deep learning models as techniques to predict solar irradiance and PV power generation. To provide a complete review of the models, PV power forecasting and solar irradiance forecasting were evaluated separately, mainly because their outputs differ in function, unit, and value range. Although solar irradiance fluctuates because it is affected by several factors such as atmospheric conditions, geographic location, season, and time of day, the evaluation of solar irradiance forecasting is straightforward. Radiation from the sun that reaches the earth's surface is measured in power per unit area; hence, it is possible to compare the solar irradiance at different locations. However, apart from these factors, the PV power output also depends on the solar panel's size and efficiency. Therefore, in the review of PV power forecasting, the solar panel size is described to obtain accurate PV output information.
RNN, LSTM, GRU, and CNN-LSTM have become the topics of research interest because of their popularity in predicting solar energy. They also offer many advantages over other machine learning models, especially regarding time series data forecasting. Each model has its strengths and limitations to predict solar irradiance and PV power; therefore, it is challenging to decide which is the best among all the models. However, from the studies reviewed in this paper, we propose the following conclusions:

• In the case of the single model, most studies report that LSTM and GRU perform better than RNN in all conditions because LSTM and GRU have internal memory to overcome the vanishing gradient problem occurring in the RNN.
• The hybrid model (CNN-LSTM) outperforms the three standalone models in predicting solar irradiance. More specifically, the error metrics for this hybrid model are substantially smaller than those of the standalone models. However, the CNN-LSTM model requires complex input data, such as images, because it has a CNN layer inside.
• The training time should be considered when assessing the performance of the models. This review indicates that GRU is more efficient than LSTM in terms of computational time because the average training time of LSTM is longer than that of GRU. Therefore, considering training time and forecasting accuracy, the GRU model can generate a satisfactory result for forecasting PV power and solar irradiance.
• Comparisons between the deep learning models and other machine learning models indicate that the deep learning models are better suited to predicting solar irradiance and PV power (Section 5.6). Most studies show that the accuracy of the proposed models is better than that of other models, such as ANN, FFNN, SVR, RFR, and MLP.