3D-CNN-Based Sky Image Feature Extraction for Short-Term Global Horizontal Irradiance Forecasting

The instability and variability of solar irradiance induce great challenges for the management of photovoltaic water pumping systems. Accurate global horizontal irradiance (GHI) forecasting is a promising technique for addressing this problem. To improve short-term GHI forecasting accuracy, ground-based sky images are valuable inputs due to their correlation with solar generation. In previous studies, great efforts have been made to extract numerical features from sky images for data-driven solar irradiance forecasting methods, e.g., based on pixel-value color information or on cloud motion detection. In this work, we propose a novel feature extraction method for GHI forecasting in which a three-dimensional (3D) convolutional neural network (CNN) is developed to extract features from sky images with efficient training strategies. Popular machine learning algorithms are introduced as GHI forecasting models, and the corresponding forecasting accuracy is fully explored with different input features on a large dataset. The numerical experiments illustrate that the proposed method achieves a minimum average root mean square error (RMSE) of 62 W/m2, a 15.2% improvement in Skill score over the baseline forecasting method.


Introduction
The variability of photovoltaic (PV) power generation poses great challenges for the management of energy systems that include PV plants [1], e.g., a PV water pumping system. The PV generation is essential for the optimal scheduling of a PV water pumping system, in which the performance of the PV pumps is affected by fluctuations in solar irradiance [2]. PV power generation highly depends on the incident solar irradiance [3]. Hence, timely and accurate solar irradiance forecasting is a promising technique for handling the uncertainty of PV generation. In recent years, data-driven solar forecasting methods have become mainstream due to the rapid development of computational techniques and the access to comprehensive quality-controlled solar data [4]. Based on forecasting horizons, solar irradiance forecasts can be classified into intra-hour (5 min-30 min), intra-day (30 min-180 min), and day-ahead (26-39 h) forecasts [4]. Historical records of solar irradiance are valuable inputs for data-driven solar irradiance forecasting methods [5,6].
For different solar irradiance forecasting horizons, exogenous inputs can improve forecasting accuracy and robustness. For example, satellite imagery is helpful for intra-day forecasting [7]. Day-ahead forecasting is also improved by incorporating information from numerical weather prediction (NWP) models [8]. Short-term solar irradiance is highly variable. The main contributions of this work are summarized as follows:

1. A 3D CNN model was proposed to extract features from ground-based sky images for short-term GHI forecasting with machine learning algorithms.

2. To illustrate the effectiveness of the proposed 3D CNN in feature extraction, a comprehensive comparison study was conducted against an existing feature extraction method.

3. The proposed method for short-term GHI forecasting with ground-based sky images was verified on a large dataset.

The remainder of the paper is organized as follows. The methodology for short-term GHI forecasting is introduced in Section 2, including the framework of the forecasting method, the machine learning algorithms, and the 3D-CNN-based feature extraction model. The utilized dataset is presented in Section 3, which is followed by the experiment results in Section 4. The conclusion is given in Section 5.

Methodology
To improve GHI forecasting accuracy, sky images, which provide high temporal and spatial information on clouds, are introduced. Feature engineering is a key element of sky-image-based short-term solar irradiance forecasting. Different from previous studies, a 3D CNN model is developed as a universal feature engineering tool. Machine learning algorithms including artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbor (KNN) are used as forecasting models that build the relationship between the input features and future solar irradiance. To illustrate the effectiveness of the proposed 3D CNN model in extracting sky image features for solar irradiance forecasting, the forecasting accuracy with different input features is fully explored: only endogenous features from historical irradiance records, endogenous features combined with RGB color information of sky images, and endogenous features together with sky image features derived by the proposed 3D CNN model. In this section, the machine learning algorithm-based forecasting models are introduced first, followed by the development of the 3D CNN model.

GHI Forecasting Method
To remove deterministic daily and seasonal variations in irradiance data, the clear-sky index is introduced following [12]. The relationship between the clear-sky index and solar irradiance is defined as follows:

k_t = I / I_cs (1)

where I denotes the actual GHI, I_cs is the clear-sky irradiance given by the clear-sky model in [27], and k_t is the clear-sky index at a specific time t. The GHI forecasting methods forecast the clear-sky index instead of solar irradiance. Once k_t is forecasted, the corresponding solar irradiance can be obtained according to Equation (1). For our short-term GHI forecasting with ground-based sky images, the general framework is illustrated in Figure 1 and can be mathematically represented by:

Î_(t+δ) = I_cs,(t+δ) · P(g_exo(i_(t:t−p)), g_end(k_(t:t−q))) (2)

where Î represents the forecasted solar irradiance, t indicates the forecast issuing time, δ denotes the forecasting horizon, and p and q denote the numbers of sky images and clear-sky indexes used for feature extraction at issuing time t. The functions g_exo and g_end represent the methods used to extract features from the lagged images i_(t:t−p) and the lagged clear-sky indexes k_(t:t−q), respectively. The function P denotes the machine learning algorithm that maps the input features into the future clear-sky index. As illustrated in Figure 1, the inputs of the GHI forecasting models are composed of endogenous features and exogenous features. Endogenous features extracted from the clear-sky index are discussed in Section 3. Exogenous features extracted from sky images vary according to the applied feature engineering method g_exo, including the method based on pixel RGB color information in [4] and the proposed 3D CNN method. These exogenous features are hereafter referred to as color features and CNN features, respectively. Popular machine learning algorithms including SVM, ANN, and KNN are introduced as the GHI forecasting model P. These algorithms are implemented with the scikit-learn package in Python.
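The conversion in Equation (1), and its inversion once the clear-sky index has been forecasted, can be sketched as follows; the clear-sky irradiance I_cs is assumed to come from the clear-sky model in [27] and is taken here as a given array:

```python
import numpy as np

def to_clear_sky_index(ghi, ghi_clear_sky, eps=1e-6):
    """Equation (1): k_t = I / I_cs, with a small floor on I_cs to avoid
    division by zero near sunrise/sunset."""
    return np.asarray(ghi, float) / np.maximum(np.asarray(ghi_clear_sky, float), eps)

def to_ghi(k_forecast, ghi_clear_sky):
    """Invert Equation (1): once k_t is forecasted, I = k_t * I_cs."""
    return np.asarray(k_forecast, float) * np.asarray(ghi_clear_sky, float)

k = to_clear_sky_index([450.0], [600.0])   # 0.75
ghi = to_ghi(k, [600.0])                   # back to 450.0
```

This round trip is exactly what the forecasting framework relies on: the model P predicts k, and Equation (1) recovers the irradiance.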
Brief introductions of these algorithms are presented as follows:

1. SVM was originally proposed for classification tasks and has been successfully applied to regression analysis. The key idea of SVM is to map the input data into a high-dimensional feature space in which the data can be linearly separated [28]. In this work, epsilon-support vector regression is introduced as a case study for SVM.

2. ANN is a strong and robust nonlinear method that can model complex relationships between inputs and outputs. The architecture of an ANN consists of several layers, and the whole network is optimized by backpropagation. The combination of ANN and backpropagation is used to forecast day-ahead solar energy in [29].

3. KNN forecasts the target value based on the similarity of predictors, defined by the Euclidean distance between the training data and the input data. The performance of KNN is sensitive to hyperparameters, e.g., the number of nearest neighbors, which are fully explored by an optimization algorithm in [13].
In addition, the smart persistence forecasting model is considered as a baseline due to its excellent performance in short-term forecasting.
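The smart persistence baseline can be sketched compactly: the clear-sky index observed at issuing time is assumed to persist over the horizon, and Equation (1) converts it back to irradiance. This is the standard formulation of smart persistence; the paper does not spell out its exact implementation, so this is an assumed sketch:

```python
import numpy as np

def smart_persistence_ghi(k_t, ghi_clear_sky_future):
    """Smart persistence: assume the current clear-sky index k_t persists,
    so the GHI forecast for t + delta is k_t * I_cs(t + delta)."""
    return float(k_t) * np.asarray(ghi_clear_sky_future, float)

forecast = smart_persistence_ghi(0.8, np.array([500.0]))  # 400.0 W/m2
```

Because the clear-sky component carries the deterministic daily cycle, this baseline is hard to beat at short horizons, which is why Skill is reported relative to it.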


3D CNN Model for Feature Extraction
In this work, a 3D CNN model is trained to extract numerical features from sky images with the proposed weak supervision model (WSM). The training process is facilitated by transfer learning and weak supervision strategies.
The specific 3D CNN in this work is the most promising 3D ResNet in [30]. The architecture of the introduced 3D ResNets is illustrated in Table 1, where F denotes the number of channels of the output features, N denotes the number of blocks in every layer, and FC indicates the last fully connected layer with C dimensions. Two structures with network depths of 34 and 50 are explored. Both models consist of a 7 × 7 × 7 convolution layer, four concatenated layers, and a fully connected layer. The main difference between the two models results from their constituent blocks, i.e., the basic block and the bottleneck block in [30]. For more details on the ResNet architecture, refer to [31]. To ensure that the extracted features are comparable with the endogenous features in dimension, two intuitive strategies are compared in the experiments. The first strategy is to replace the last fully connected layer (indicated by FCR, fully connected layer replacement). The second strategy is to maintain the complete architecture and add an extra fully connected layer at the end of the ResNet. The extraction process can be defined mathematically as follows:

X_feat = ξ(g(X_vid)) (3)

where g indicates the 3D ResNet, ξ denotes the ReLU activation, and X_vid denotes the 3D input consisting of sky images. Transfer learning helps to achieve promising results in CNN-based research, e.g., ImageNet pre-training is a common strategy in CNN-based tasks such as image classification and object detection [32]. In GHI forecasting, the 3D CNN is initialized with a model pre-trained on mega-scale video datasets [33] to extract motion information from sky images. In [30], the authors compared video classification accuracy among models pre-trained on 4 video datasets, and in this work, the most suitable pre-trained weights are adopted according to the reported video classification accuracy.
For 3D ResNet training, it is time-consuming to annotate the large dataset necessary for deep model training. To solve this problem, the 3D ResNet learns to extract irradiance-related features under the weak supervision of irradiance. The weak supervision strategy is intuitive: it guides the 3D ResNet, via backpropagation, to extract features that benefit irradiance forecasting. To make full use of the irradiance supervision, the 3D ResNet is integrated with a 1D CNN that fuses the endogenous features and predicts the target clear-sky index. The integrated architecture is named the weak supervision model (WSM). Figure 2 illustrates the WSM architecture, and the whole model is optimized in an end-to-end way.
The 1D CNN consists of a series of basic convolutional layers and a final convolutional operation, as illustrated in Figure 2. Each basic convolutional layer consists of a convolutional operation, batch normalization, and ReLU activation. The last convolutional operation outputs the forecasted value directly. In addition, to learn the mutual dependence between the two input features, a self-attention mechanism is introduced. The mechanism can be defined as follows:

X̃_l = σ(f(X_l)) ⊙ X_l (4)

where X_l is the concatenated features, f is a fully connected layer, σ is the softmax activation function, and ⊙ denotes element-wise multiplication.
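The attention gating of Equation (4) can be sketched in a few lines. The element-wise re-weighting of the concatenated features is an assumption here (the paper defines only σ(f(X_l)) explicitly); the weight matrix W and bias b stand in for the fully connected layer f:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fuse(x_l, W, b):
    """sigma(f(X_l)) applied to the concatenated features X_l, where the
    fully connected layer f is (W, b) and sigma is softmax; the resulting
    weights re-scale X_l element-wise (assumed combination)."""
    weights = softmax(x_l @ W + b)
    return weights * x_l
```

With zero weights and bias the softmax is uniform, so every feature is scaled by 1/dim, which makes the gating easy to sanity-check.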
The whole WSM is optimized by minimizing the mean square error loss function:

L = (1/n) Σ_(i=1)^(n) (y_i − ŷ_i)^2 (5)

where y and ŷ are the target clear-sky index and the forecasted value, respectively, and n indicates the number of data samples.
WSM is introduced to train the 3D ResNet to extract irradiance-related features, while it is also capable of forecasting GHI. The forecasting accuracy of WSM is reported in Section 4 and compared with the GHI forecasting models in Section 4.1. After WSM training, the 3D ResNet is retained as a feature engineering tool and no further optimization is needed.

Forecasting Performance Metric
Commonly used metrics, including mean absolute error (MAE), mean bias error (MBE), root mean square error (RMSE), mean absolute percentage error (MAPE), and Skill score, an improvement metric with respect to the smart persistence method, are considered in this work. These metrics are defined as follows:

MAE = (1/N) Σ_(i=1)^(N) |y_i − ŷ_i|
MBE = (1/N) Σ_(i=1)^(N) (ŷ_i − y_i)
RMSE = sqrt((1/N) Σ_(i=1)^(N) (y_i − ŷ_i)^2)
MAPE = (100%/N) Σ_(i=1)^(N) |(y_i − ŷ_i) / y_i|
Skill = 1 − RMSE_m / RMSE_p

where y and ŷ are the target GHI value and the forecasted GHI value, respectively, N is the number of data samples, and RMSE_p and RMSE_m are the RMSEs of the smart persistence model and the specific forecasting model, respectively.
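These metrics can be computed directly from paired target/forecast arrays. The sign convention for MBE (forecast minus target) is an assumption, since the paper's formula is not shown explicitly:

```python
import numpy as np

def forecast_metrics(y, y_hat, rmse_persistence=None):
    """MAE, MBE, RMSE, MAPE, and (if RMSE_p of smart persistence is given)
    Skill = 1 - RMSE_m / RMSE_p, all over N samples."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y_hat - y
    m = {
        "MAE": float(np.mean(np.abs(err))),
        "MBE": float(np.mean(err)),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAPE": float(100.0 * np.mean(np.abs(err / y))),  # y > 0 assumed
    }
    if rmse_persistence is not None:
        m["Skill"] = 1.0 - m["RMSE"] / rmse_persistence
    return m
```

For example, targets [100, 200] and forecasts [110, 190] give MAE = 10, MBE = 0, RMSE = 10, MAPE = 7.5%, and Skill = 0.5 against a persistence RMSE of 20.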

Datasets
Solar irradiance data and sky images with a temporal resolution of 1 min for three years (2014-2016) in [4] are used, with the first two years' data for training and the last year's data for testing.
This work considers the forecasting of average GHI over a temporal window of 5 min. To predict GHI for different horizons (5 min to 30 min with a 5 min step), the sky images and endogenous features are organized as follows at the forecast issuing time:

1. The endogenous features are composed of the backward average value, lagged average values, and the variability of the clear-sky index time series, as defined in [4].
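A minimal sketch of these endogenous features is shown below. The exact definitions follow [4] and are not reproduced in this paper, so the choices here (mean of the window, the last few lagged values, and the standard deviation of the increments as variability) are illustrative assumptions:

```python
import numpy as np

def endogenous_features(k_series, lags=3):
    """Illustrative endogenous features from a backward clear-sky index
    window: backward average, the last `lags` values, and a variability
    measure (std of first differences). Definitions assumed per [4]."""
    k = np.asarray(k_series, float)
    backward_avg = k.mean()
    lagged = k[-lags:]
    variability = float(np.std(np.diff(k)))
    return np.concatenate([[backward_avg], lagged, [variability]])
```

For a perfectly clear window (k constant at 1), the variability term is 0 and the averages are 1, which is a quick sanity check on the implementation.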


Experiment Results
In this section, numerical experiment results are presented. First, the forecasting performance based only on endogenous features is introduced, then the results of using the endogenous features and the color features combined are described. Then, the training of 3D ResNet and the forecasting performance of WSM are reported. Lastly, the forecasting performance based on endogenous features together with CNN features is presented.

Short-Term GHI Forecasting with Color Features
In previous work [13,17], color features from pixel values are extracted as exogenous inputs. In this part, machine learning algorithms, including SVM, ANN, and KNN as described in Section 2.1, are trained for GHI forecasting with color features. Implementation details for these algorithms are listed as follows:
1. For epsilon-SVM, the regularization parameter is 1.0 and epsilon is 0.1. A radial basis function (RBF) kernel is used and the tolerance for the stopping criterion is 10^-3.

2. A four-hidden-layer ANN with neurons {64, 64, 32, 16} is implemented. The activation function is ReLU and the optimizer is Adam. The learning rate is 10^-4 with adaptive learning decay. For the training phase, an early stopping strategy is used and 20% of the training dataset is split off for validation.

3. The number of nearest neighbors in KNN is 40. Euclidean distance is used as the weight function in prediction, which means closer neighbors have greater influence than neighbors further away.
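Using the hyperparameters listed above, the three scikit-learn models can be instantiated roughly as follows. Note that this is a sketch: in scikit-learn, the adaptive learning-rate schedule of `MLPRegressor` applies only to the SGD solver, so with Adam the "adaptive learning decay" mentioned above does not map onto a single constructor argument and is omitted here:

```python
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

# Epsilon-SVR: C = 1.0, epsilon = 0.1, RBF kernel, stopping tolerance 1e-3
svm = SVR(kernel="rbf", C=1.0, epsilon=0.1, tol=1e-3)

# Four hidden layers {64, 64, 32, 16}, ReLU, Adam, learning rate 1e-4,
# early stopping with a 20% validation split
ann = MLPRegressor(hidden_layer_sizes=(64, 64, 32, 16), activation="relu",
                   solver="adam", learning_rate_init=1e-4,
                   early_stopping=True, validation_fraction=0.2)

# 40 nearest neighbors, distance-weighted prediction
knn = KNeighborsRegressor(n_neighbors=40, weights="distance")
```

Each model is then fitted on the stacked endogenous and exogenous feature matrix with the clear-sky index as the regression target.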
These popular machine learning algorithms have been explored in previous studies [13,14,17]. In this work, these methods were tested on a larger dataset. Figure 4 illustrates the color features of the testing data using t-distributed stochastic neighbor embedding (t-SNE) [34]. The average forecasting performance over all horizons is presented in Table 2 in mean ± standard deviation format. Persistence indicates the smart persistence method, and its RMSE performance is used as the baseline. Endo means only the clear-sky index is used as input, and Exo means that sky-image features are introduced as additional inputs alongside the clear-sky index. According to Table 2, the machine learning algorithms are better than persistence in terms of RMSE and Skill. However, the introduction of color features results in only a small improvement compared with the methods without images, e.g., the improvement in RMSE and Skill of ANN and KNN, which was also observed in [13,16]. The smart persistence achieves the lowest MAPE of all the methods, and the extra input does not improve the performance of SVM. The experiment results reveal that the easily available color features are not reliable, which motivates the search for a stronger feature engineering tool.

3D ResNet Training
The CNN is widely applied in image-related fields due to its superiority and robustness in feature extraction. To extract effective features from successive sky images, the 3D ResNet is deployed with WSM. The implementation details of WSM are as follows: the 3D ResNet and 1D CNN are built on PyTorch under the Python 3.6 environment; the Adam optimizer with a learning rate of 10^-4 is used, and the learning rate decays by a factor of 0.9 every epoch; due to memory limitations, the batch size is set to 100 for the model with a depth of 34 and to 80 for the model with a depth of 50; the models are trained for 30 epochs on one GTX 1080 Ti GPU; 20% of the training data is split off for validation. In this work, the dimension of the features extracted from the sky images is 10.
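The per-epoch decay described above corresponds to an exponential schedule (in PyTorch this would typically be `torch.optim.lr_scheduler.ExponentialLR` with `gamma=0.9`, though the paper does not name the exact scheduler). The resulting rates over the 30 training epochs can be sketched without any framework:

```python
def lr_schedule(base_lr=1e-4, decay=0.9, epochs=30):
    """Learning rates for WSM training: Adam starting at 1e-4,
    multiplied by 0.9 at every epoch, for 30 epochs."""
    return [base_lr * decay ** e for e in range(epochs)]

lrs = lr_schedule()
# lrs[0] = 1e-4, lrs[1] = 9e-5, ...; by epoch 30 the rate has dropped
# to roughly 4% of its initial value
```

This steady decay is a common choice when fine-tuning a pre-trained backbone, since large late-stage updates would disturb the transferred weights.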
The 3D ResNet is trained with the supervision of irradiance, and the accuracy of GHI forecasting is used as the metric to measure the performance of feature extraction. An ablation study is first conducted to explore the training strategies and the WSM architecture. For convenience, some notations are introduced: Res34 and Res50 denote the models with depths of 34 and 50; Att means the attention mechanism; Tra means initialization with pre-trained weights; FCR means replacing the last fully connected layer. The ablation study results on the testing dataset are listed in Table 3.

Table 3. Ablation study of the proposed WSM. Res34 and Res50 denote the models with depths of 34 and 50; Att means the attention mechanism; Tra means initialization with pre-trained weights; FCR means replacement of the last fully connected layer.

From Table 3, the Res34 initialized with pre-trained weights by transfer learning outperforms its counterpart with randomly initialized weights. The comparison between Res34 and Res50 reveals that the deeper model achieves a small improvement. The attention mechanism is simple but improves forecasting accuracy, especially the Skill score. Unlike the commonly used strategy for transfer learning, in which the last fully connected layer of the 3D ResNet is always replaced, it has been observed that retaining the complete ResNet architecture yields better results.
It is necessary to mention that the above models are trained with a fixed number of backward sky images, i.e., 5 sky images organized in a row in the experiment. This parameter is arbitrary, and therefore further exploration has been conducted. To illustrate the influence of the number of backward sky images on forecasting accuracy, further experiments have been carried out by increasing the number of backward sky images to 10 for 10 min forecasting. The image horizon comparison for 10 min GHI forecasting is illustrated in Table 4. It has been observed that increasing the number of backward sky images does not improve forecasting accuracy but greatly increases the computational burden. According to the ablation study, the optimal architecture of WSM is as follows: the last fully connected layer of the 3D ResNet is retained, the attention mechanism is added, and ResNet50 is selected. The training of WSM facilitates the 3D ResNet in learning to extract features. The embedding of features extracted by the 3D ResNet50 on the test dataset is shown in Figure 5. The extracted high-dimensional features are processed by t-SNE as well. Compared with the color features illustrated in Figure 4, the 3D ResNet features are more semantically separable, indicating that the features extracted by the 3D ResNet are better than the color features for sky image pattern recognition. The features from the 3D ResNet also show superior stability with increasing forecasting horizon compared with the color features. As mentioned above, WSM is able to forecast GHI as well. The performance of WSM on GHI forecasting is reported in Table 5. The performances of ANN and KNN with color features in Section 4.1 are listed for comparison. Table 5 illustrates that WSM slightly outperforms ANN and KNN on GHI forecasting, e.g., with improvements in RMSE, MAPE, Skill score, and MAE. Although WSM does not bring a great improvement, the features from the 3D ResNet show superiority over the color features. This motivates the exploration of integrating the 3D ResNet with machine learning algorithms for GHI forecasting.

Short-Term GHI Forecasting with CNN Features
In this subsection, two machine learning algorithms, KNN and ANN, are trained to predict GHI. All settings are the same as in Section 4.1, except that the exogenous input is replaced by features from the fixed 3D ResNet50. At the forecast issuing time, the 5 backward sky images are processed by the trained 3D ResNet50 to obtain 10-dimensional features, referred to as CNN features. The processing of the clear-sky index is the same as in Section 4.1.
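The assembly of the forecasting input at issuing time can be sketched as below. The `extract_cnn_features` callable stands in for the frozen, trained 3D ResNet50 (hypothetical here); it maps the 5 backward sky images to the 10 CNN features, which are concatenated with the endogenous clear-sky-index features:

```python
import numpy as np

def build_forecast_input(extract_cnn_features, sky_images, endo_features):
    """Concatenate endogenous features with the 10 CNN features produced by
    the frozen 3D ResNet50 from the 5 backward sky images.
    `extract_cnn_features` is a placeholder for the trained network."""
    cnn_feats = np.asarray(extract_cnn_features(sky_images), float)  # (10,)
    return np.concatenate([np.asarray(endo_features, float), cnn_feats])

# mock extractor standing in for the frozen 3D ResNet50
mock_extractor = lambda imgs: np.zeros(10)
x = build_forecast_input(mock_extractor,
                         np.zeros((5, 64, 64, 3)),  # 5 backward sky images
                         [0.9, 0.8, 0.85])          # endogenous features
# x has 3 endogenous + 10 CNN features
```

The same vector then feeds the KNN or ANN regressor exactly as the color features did in Section 4.1, which is what makes the comparison in Table 6 like-for-like.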
The forecasting performances of KNN and ANN with CNN features are reported in Table 6. For better comparison, the KNN and ANN models with color features in Section 4.1 are also listed. For convenience, the abbreviations Color-KNN and Color-ANN denote the forecasting models with color features, while CNN-KNN and CNN-ANN denote the models with CNN features. From Table 6, the features from ResNet50 improve GHI forecasting accuracy against the color features for most of the forecasting horizons. For better visualization, the comparison in RMSE, Skill score, and MAPE is presented in Figure 6. The improvement of ANN and KNN in forecasting accuracy demonstrates the universality of the trained 3D ResNet, and we assume it could be flexibly integrated with other forecasting methods. Figure 7 illustrates the forecasted values of CNN-KNN and the persistence method against the target GHI for different horizons. It is observed that both methods perform well during steady periods, such as the GHI before 19:00, while it is difficult for both methods to capture abrupt changes in GHI. In addition, the smart persistence curve presents a temporal delay against the target values, and the difference is enlarged with increasing forecasting horizon. The numerical experiments prove that extracting features with a 3D CNN is feasible and is a promising way to improve GHI forecasting accuracy.


Conclusions
In this work, popular machine learning algorithms are introduced for short-term GHI forecasting based on sky images. In addition to the generally used color features of sky images, a 3D CNN model is developed for feature extraction. Numerical experiments suggest that the machine learning algorithms benefit from the CNN features, with improvements in RMSE, MAPE, and Skill score. Moreover, feature visualization illustrates that the features extracted by the 3D ResNet are better than the color features for sky image pattern recognition. In the future, more work will be conducted on stronger sky image feature extraction, integration methods to improve short-term GHI forecasting, and PV water pumping system optimization with accurate irradiance forecasting.