Solar Thermal Collector Output Temperature Prediction by Hybrid Intelligent Model for Smartgrid and Smartbuildings Applications and Optimization

Currently, there is great interest in reducing the consumption of fossil fuels (and other non-renewable energy sources) in order to preserve the environment; smart buildings are commonly proposed for this purpose, as they are capable of producing their own energy and using it optimally. However, solar energy is at times unable to fully supply the energy demand, so the quantity of energy needed must be known in order to optimize the system. This research focuses on the prediction of the output temperature of a solar thermal collector. The aim is to measure solar thermal energy and optimize the energy system of a house (or building). The dataset used in this research has been taken from a real installation in a bioclimatic house located on the Sotavento Experimental Wind Farm, in north-west Spain. A hybrid intelligent model has been developed by combining clustering and regression methods such as neural networks, polynomial regression, and support vector machines. The main findings show that, by dividing the dataset into small clusters on the basis of similarity in behavior, it is possible to create more accurate models. Moreover, combining different regression methods for each cluster provides better results than a global model of the whole dataset. In temperature prediction, the mean absolute error was lower than 4 °C.


Introduction
In recent years, preserving the environment has become a great concern. One of the reasons for this trend is environmental deterioration caused by human action. The governments of most countries regression techniques [43][44][45][46] have been applied over each of the obtained groups. The performance of the developed model is very good in terms of all the operational aspects.
The next section describes the case study and details the facility under study. The model approach is then described, together with the different stages of model development: data processing, clustering, and regression. After that, all the results are presented. Finally, the conclusions are drawn and future lines of research are outlined.

Case Study
This research is based on the thermal installations from a bioclimatic house. The main aim of the Sotavento Galicia Foundation is to study new ways of using renewable energies. For this reason, Sotavento Experimental Wind Farm built this bioclimatic house and implemented different energy systems in it.

Sotavento Bioclimatic House
This bioclimatic house (Figure 1) is located in Xermade council (Lugo), in Galicia (Spain). Different renewable energy systems, such as solar, wind, and geothermal ones, are installed in and around the house, and the main aim is to demonstrate that a house could be more environmentally friendly. Only the thermal installation is taken into account in this research. Figure 2 shows all the thermal systems of the house. They are divided into three parts: generation, accumulation, and consumption. The generation part includes solar (1), biomass (2), and geothermal (3) energies; the geothermal system is divided into the horizontal ground collector and the heat pump. The accumulation part has two isolated tanks, one for the storage of solar energy (4), and another for domestic hot water (DHW) (5) and the heater system. This part of the installation also includes the preheating of domestic hot water (8), in which the solar tank heats the water before it enters the DHW tank. Finally, the consumption part is divided into the heater system (6) and DHW (7).
The bioclimatic house also distributes the electrical energy generated from renewable energy sources (wind and photovoltaic), but this feature is not considered in this paper as it is not part of the research. Figure 3 shows the layout of the described thermal solar installation. The energy is generated in the solar panel collectors (2 strings of 4 panels each), shown on the left side of the figure; each panel has an area of 2.5 m², giving a total of 20 m², and the model is SchücoSol S.2. Solar energy accumulation is represented by the tank on the right, with a capacity of 1000 L. The schematic includes all the valves and pumps that the system needs to work.
This research focuses on thermal solar generation; thus, only the temperature sensors S1, S2, S3, and S4 are used; they are RTD (PT1000) temperature sensors. However, the input and output temperatures are not the only variables taken into account; others include the flow-meter (marked by the arrow in Figure 3), a Multical® 403, and the solar radiation sensor (a PYR-P sensor), deployed outside of the house.

Model Approach
The basic model of the proposal is shown in Figure 4, where the output is the output temperature of the lower panel (S4). The output of the upper panel is not used in this research. The inputs of the model are the input temperatures (lower and upper panels, S1 and S2), the flow of the ethylene glycol used in the collectors, and the solar radiation. A hybrid intelligent model has been chosen to increase the accuracy of the output prediction. This type of model divides the dataset into different subsets (or clusters) and then uses a regression technique on each one to predict the output. Figure 5 represents the internal layout of the hybrid model; each cluster model uses an intelligent technique selected from artificial neural networks, polynomial regression, and support vector regression. The diagram in Figure 6 represents the process followed to train the hybrid intelligent model. Firstly, the clusters are created, and then a regression phase is carried out for each cluster. Each regression was trained using k-fold cross validation, which divides the cluster data into k groups, trains k models with k-1 groups each, and uses the remaining group for testing. Once all k models have been tested, all the cluster data have been used for testing, and the error of the specific regression technique is calculated with all the data. The k-fold cross validation is shown in Figure 7. In the third step of Figure 6, the best regression algorithm is selected for each cluster. As different regression algorithms were used, it is necessary to choose the best one based on the error achieved in the training phase (described below). Moreover, some of the regression techniques tested have several tunable parameters, and each parameter configuration was treated as a different algorithm. The last step in creating the hybrid model was the selection of the best hybrid configuration.
For this purpose, a different dataset was used, isolated from the beginning of the regression phase in order to test all the hybrid configurations. The best one was chosen on the basis of the error achieved with this validation dataset.
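The k-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `k_fold_indices`, `cross_validated_mse`, and `fit_mean` are hypothetical, the folds are built by a simple shuffled split, and `fit_mean` is only a placeholder regressor standing in for the ANN, polynomial, or LS-SVR models.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mse(xs, ys, fit, k=10, seed=0):
    """Train k models, test each on its held-out fold, pool all squared errors."""
    squared_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = fit(train_x, train_y)  # model trained on the other k-1 folds
        squared_errors += [(predict(xs[i]) - ys[i]) ** 2 for i in held_out]
    # Every sample has been tested exactly once, so the error uses all the data.
    return sum(squared_errors) / len(squared_errors)


def fit_mean(train_x, train_y):
    """Hypothetical placeholder regressor: always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean
```

Any regressor exposed as a `fit(train_x, train_y)` function that returns a `predict(x)` callable could be passed in place of `fit_mean`, which is how several algorithms and configurations could be compared on the same cluster.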

Data Processing
The dataset was collected from a monitoring system that takes samples every 2 s; however, the mean value of all collected samples is stored every 10 min.
The entire dataset was preprocessed before starting the training phase. Firstly, the incorrect samples, which correspond to errors in the communication between the data acquisition system and the sensors, were removed. Then, only the samples recorded while the panels were working were kept (the radiation sensor value and the flow meter value were used in this step to discard the samples taken when the system was not operating). The original dataset had 52,689 samples, and this number was reduced to 26,665 samples after preprocessing.
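The data were normalized so that the new sample values lie in the range 0 to 1. The Max-Min Scaler [47,48], Equation (1), is used to map each sample (Data_j) to its new value (Data_j^new):
$$\mathrm{Data}_j^{\mathrm{new}} = \frac{\mathrm{Data}_j - \min(\mathrm{Data})}{\max(\mathrm{Data}) - \min(\mathrm{Data})} \qquad (1)$$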
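The scaling to the range 0 to 1 described above can be sketched in a few lines. The function names and the temperature values below are made up for illustration; the inverse function is included because the predictions have to be converted back to °C before reporting non-normalized errors.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def inverse_min_max(scaled, lo, hi):
    """Map values from [0, 1] back to the original range [lo, hi]."""
    return [s * (hi - lo) + lo for s in scaled]


temperatures = [20.0, 35.0, 50.0, 80.0]  # made-up values in °C
scaled = min_max_scale(temperatures)
restored = inverse_min_max(scaled, min(temperatures), max(temperatures))
```

In practice the minimum and maximum would be computed once on the training data and reused for new samples, so that training and validation data share the same scale.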
To select the best hybrid topology, 5% of the samples in each cluster have been isolated from the dataset. The isolated data have been used at a later stage to verify the performance of all the hybrid configurations and to select the best hybrid topology. Moreover, to validate the obtained model, 367 samples that represent 5 operation days were isolated. These samples were not used in the hybrid model creation process. Instead, they were used only in the final step to have a realistic prediction error measurement for the hybrid intelligent model.

K-Means Algorithm
The k-means algorithm was chosen to perform clustering and create different groups in the dataset. Centroids are created to define the clusters in the hyperspace, and each sample is assigned to a cluster depending on its distance to these centroids [40][41][42]; the most common distance used is the Euclidean distance. The algorithm divides the data into the number of clusters (K) defined by the user.
Once the final centroids have been defined, the computational cost needed by the k-means algorithm to assign new samples to each cluster is very small. However, the training time depends on the desired number of clusters and the number of training samples. Training is complete when the final centroids [49] no longer change. The training procedure can be explained as follows:
• Randomly choose the first set of centroids from the whole dataset. Centroids are defined as the centers of the clusters, but at the beginning there are no clusters yet.
• Assign each sample to the cluster of its nearest centroid.
• Once the clusters are defined, move each centroid to the center of its cluster.
The last two steps are repeated until the centroids are in the same position at least two times during the training procedure. The final centroids must be stored in order to use the k-means algorithm with new samples.
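The three training steps can be sketched as below. This is a minimal pure-Python illustration, not the implementation used in the paper; the names `squared_distance`, `nearest`, and `k_means`, the `max_iter` cap, and the handling of empty clusters (the centroid is left where it is) are assumptions of this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(sample, centroids):
    """Index of the centroid closest to the sample."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(sample, centroids[j]))


def k_means(samples, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = [list(s) for s in rng.sample(samples, k)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for s in samples:
            clusters[nearest(s, centroids)].append(s)
        # Step 3: move each centroid to the mean of its cluster.
        updated = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:  # centroids did not move: training is done
            break
        centroids = updated
    return centroids
```

The returned centroids are all that needs to be stored: a new sample is assigned to a cluster with a single call to `nearest`, which is why assignment is cheap once training has finished.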

Artificial Neural Networks
The artificial neural network (ANN) algorithm is an artificial intelligence technique used for regression or classification. It was developed by using the biological neuron model to create its basic unit, the artificial neuron. The algorithm is called ANN because it contains a number of artificial neurons connected in a way similar to biological ones.
Each neuron's input has a weight factor that allows for a different reaction to each input. Moreover, the neuron has an activation function that calculates its output using the inputs. An ANN model is able to generalize from the learning cases during the training phase [43,44]. The ANN can be used to perform complex functions thanks to its different activation functions.
The output of the activation function is called the excitation level [45], and it is normally in the range 0 to 1, or −1 to 1. The configuration of the ANN includes the number of neurons, its activation functions, and its organization. The neurons are organized in layers; all the neurons that have the same inputs and outputs are in the same layer.
The multilayer perceptron is a basic feed-forward topology; the signal always travels in the same direction, from the inputs to the outputs. The input and output layers are directly connected to the inputs and outputs of the model; the hidden layers are the remaining layers, which are only connected internally. In regression, the linear activation function is commonly used for the output neuron, while the tan-sigmoid function can be used in the other neurons.
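A forward pass through such a network, with one hidden layer of tan-sigmoid neurons and a single linear output neuron, can be sketched as follows. The function name, the argument layout, and all weight values are hypothetical; the tan-sigmoid function is taken here to be the hyperbolic tangent, to which it is mathematically equivalent.

```python
import math


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden tanh layer followed by a single linear output neuron."""
    hidden = [
        # Each hidden neuron: weighted sum of the inputs plus bias, then tanh.
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Linear output neuron: weighted sum of the hidden activations plus bias.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Made-up example: four normalized inputs, two hidden neurons.
sample = [0.4, 0.5, 0.7, 0.6]
prediction = mlp_forward(
    sample,
    hidden_weights=[[0.1, -0.2, 0.3, 0.0], [0.5, 0.5, -0.1, 0.2]],
    hidden_biases=[0.0, -0.1],
    output_weights=[0.8, 0.3],
    output_bias=0.05,
)
```

Training consists of adjusting the weights and biases so that the output matches the measured values; that step is omitted from this sketch.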

Polynomial Regression
The polynomial regression algorithm models the output as a linear combination of basis functions. The basis functions are defined by raising the inputs to different degrees, and the maximum degree is called the degree of the polynomial.
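Equations (2) and (3) show first- and second-degree polynomials for a two-input model. Each basis function has its own coefficient (c_i) that is adjusted in the training phase.
$$F(x) = c_0 + c_1 x_1 + c_2 x_2 \qquad (2)$$
$$F(x) = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_1 x_2 + c_4 x_1^2 + c_5 x_2^2 \qquad (3)$$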
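To make the idea of basis functions concrete, the sketch below enumerates the monomials of a polynomial of a given degree and evaluates the model for a set of coefficients. The function names and the coefficient values are hypothetical, and the ordering of the terms is a choice of this sketch; fitting the coefficients (typically by least squares) is not shown.

```python
from itertools import combinations_with_replacement


def polynomial_features(inputs, degree):
    """All monomials of the inputs up to `degree`, starting with the constant 1."""
    features = [1.0]
    for d in range(1, degree + 1):
        # Each combination of input indices (with repetition) is one monomial.
        for combo in combinations_with_replacement(range(len(inputs)), d):
            term = 1.0
            for index in combo:
                term *= inputs[index]
            features.append(term)
    return features


def evaluate_polynomial(inputs, coefficients, degree):
    """Weighted sum of the basis functions, one coefficient per basis function."""
    features = polynomial_features(inputs, degree)
    return sum(c * f for c, f in zip(coefficients, features))


# Two inputs, degree 2: basis functions 1, x1, x2, x1^2, x1*x2, x2^2.
basis = polynomial_features([2.0, 3.0], 2)
value = evaluate_polynomial([2.0, 3.0], [1.0, 0.5, 0.5, 0.0, 0.0, 0.0], 2)
```

With the four inputs used in this work, a second-degree polynomial built this way has 15 basis functions, so the number of coefficients grows quickly with the degree.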

Support Vector Machines for Regression
The support vector machine is a machine learning algorithm used for classification problems. When this algorithm is used for regression purposes, it is called support vector regression (SVR). This technique uses a nonlinear transformation to create a high-dimensional representation of the data; then, in the case of SVR, the algorithm performs a linear regression on the newly mapped data.
This paper uses a modification of the SVR algorithm called least squares SVR (LS-SVR) [46]. The performance of LS-SVR is similar to that of the original SVR algorithm [50]; it is only necessary to adjust two internal parameters: the regularization parameter (γ) and the kernel width (σ). Moreover, the LS-SVR implementation includes an optimization function that automatically tunes these parameters.
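As an illustration of how LS-SVR differs from standard SVR, the sketch below trains a model by solving a single linear system instead of a quadratic program. It assumes a Gaussian (RBF) kernel of width σ and fixed, hand-picked values of γ and σ; the automatic tuning mentioned above is not reproduced, and all function names are hypothetical.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian kernel with width sigma."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_ls_svr(xs, ys, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(xs)
    system = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(xs[i], xs[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # regularization on the diagonal
        system.append(row)
    solution = solve_linear_system(system, [0.0] + list(ys))
    bias, alphas = solution[0], solution[1:]

    def predict(x):
        return bias + sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alphas, xs))

    return predict
```

A larger γ makes the model follow the training data more closely, while σ controls how far the influence of each training sample extends; in the sketch both must be supplied by the caller.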

Results
This section is divided into three parts in order to present all the results of this research. Firstly, the clustering results show the hybrid topologies considered, with the clusters and the number of samples in each one. Then, the regression results are presented. Since three differently configured algorithms were used, only some of the results are shown. This part includes the best regression technique for each cluster, with its error measurement. Finally, the validation of the model is described, along with the best hybrid topology and the final model error values.

Clustering Results
As the best number of clusters was not known beforehand, the k-means clustering technique was used to divide the dataset several times. Table 1 shows that nine different hybrid topologies were created, dividing the dataset into 2, 3, 4, 5, 6, 7, 8, 9, and 10 clusters. The first column of the table corresponds to the global model (no clusters). The k-means algorithm was trained with random initial centroids, and for every configuration the training was repeated 20 times to avoid local minima. Moreover, the training phase includes a final condition to avoid clusters with fewer than 15 samples; however, in this research, the smallest cluster has 611 samples.

Modeling Results
As there are three different regression techniques, the modeling results are divided into three parts, each showing the results of a different algorithm.

Artificial Neural Networks
All the tested ANNs have the same structure: the input layer has four neurons (as many as the model's inputs), the hidden layer has a configurable number of neurons, and the output layer has one neuron (as the model has only one output). The output layer neuron has a linear activation function, and the rest of the neurons in the ANN use a tan-sigmoid activation function. As noted above, the number of hidden neurons was varied to find the optimal one; 15 different models were tested, each with a different number of neurons in the hidden layer. Table 2 shows the error distribution across the clusters. In this case, it presents the mean absolute error (MAE) calculated for the ANN with seven neurons in the hidden layer. There are a total of 15 MAE tables for artificial neural networks, as 15 different configurations were tested. The error values were calculated using 10-fold cross validation; this implies that 10 different models must be trained before the error is calculated. Moreover, four different error measurements were calculated: MAE, MSE (mean squared error), MAPE (mean absolute percentage error), and NMSE (normalized mean squared error).
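The four error measurements can be sketched as below. The MAE, MSE, and MAPE follow their standard definitions; the text does not state which normalization is used for the NMSE, so dividing the MSE by the variance of the measured values is an assumption of this sketch. The sample values are made up.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def nmse(y_true, y_pred):
    """MSE divided by the variance of y_true (assumed normalization)."""
    mean = sum(y_true) / len(y_true)
    variance = sum((t - mean) ** 2 for t in y_true) / len(y_true)
    return mse(y_true, y_pred) / variance


measured = [10.0, 20.0, 30.0, 40.0]   # made-up temperatures in °C
predicted = [12.0, 18.0, 33.0, 36.0]  # made-up model outputs in °C
```

Note that the MAPE is undefined when a measured value is zero, which matters if it is computed on data normalized to the range 0 to 1.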

Polynomial Regression
Two different configurations were trained with the polynomial regression algorithm: first-degree and second-degree polynomials. As an example of this training, Table 3 shows the MAE obtained for each cluster using the second-degree polynomial. As explained, 10-fold cross validation was used to obtain the error measurement.

Support Vector Machines for Regression
An error measurement for least squares support vector regression is shown in Table 4. In this case, the algorithm has only one configuration because the least squares modification uses an auto-tune function to adjust the internal parameters. Following the same training process as for the other algorithms, 10-fold cross validation was used.

Selection of Best Local Regression Models
The best regression model for each cluster was selected according to the MSE obtained by all the created models. There are 18 error values for each cluster (15 ANN, two polynomial regression, and one LS-SVR). Tables 5 and 6 show the lowest MAE and MSE obtained for each cluster; it should be noted that the MSE is the error measurement usually used to compare regression algorithms. Table 7 shows the algorithm selected for each cluster. Once the regression technique had been chosen, new models were created for each cluster with the selected algorithm using all the available training data; because k-fold cross validation was used, no previous model had been trained on all the data. The validation data were then applied to the new models to select the best hybrid topology.
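The per-cluster selection amounts to taking, for each cluster, the candidate with the lowest cross-validated MSE. The sketch below illustrates this with a hypothetical function name and made-up error values; the labels such as "ANN-7" (an ANN with seven hidden neurons) are only illustrative.

```python
def select_best_models(mse_by_cluster):
    """For each cluster, return the name of the model with the lowest MSE."""
    return {
        cluster: min(scores, key=scores.get)
        for cluster, scores in mse_by_cluster.items()
    }


# Made-up MSE values for two clusters and three candidate models.
example_scores = {
    "cluster-1": {"ANN-7": 0.0012, "Poly-2": 0.0019, "LS-SVR": 0.0010},
    "cluster-2": {"ANN-7": 0.0008, "Poly-2": 0.0015, "LS-SVR": 0.0011},
}
best = select_best_models(example_scores)
```

After this step, each selected model would be retrained on all the training data of its cluster, as described above.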

Validation Results
With the aim of selecting the best hybrid configuration (the optimal number of clusters), a test was performed using the testing dataset. This dataset was created with 5% of the data of each cluster. These data were used as new input data for the nine different hybrid models, and also for the global model. Inside the hybrid model, each new sample was assigned to its local model using the Euclidean distance to each cluster centroid, and the output was predicted to calculate the model error. Table 8 shows different error values to compare the hybrid configurations. The best hybrid configuration is the one that divides the model internally into nine local models. Table 7 shows the different algorithms and configurations used in the final hybrid model, including artificial neural networks with 6, 8, 9, 11, 12, and 14 neurons, and least squares support vector regression.
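The way the hybrid model handles a new sample, assigning it to the nearest centroid and then calling that cluster's local model, can be sketched as follows. The function names, the centroids, and the two local models are hypothetical placeholders.

```python
import math


def nearest_centroid(sample, centroids):
    """Index of the centroid with the smallest Euclidean distance to the sample."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(sample, centroids[j]))


def hybrid_predict(sample, centroids, local_models):
    """Dispatch the sample to the local model of its cluster."""
    return local_models[nearest_centroid(sample, centroids)](sample)


# Made-up example with two clusters and two placeholder local models.
centroids = [[0.2, 0.2], [0.8, 0.8]]
local_models = [
    lambda s: 0.1 + 0.5 * s[0],  # stands in for the model of cluster 0
    lambda s: 0.6 + 0.2 * s[1],  # stands in for the model of cluster 1
]
output = hybrid_predict([0.9, 0.7], centroids, local_models)
```

Because the dispatch only needs the stored centroids, the hybrid model adds very little cost at prediction time compared with a single global model.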
Moreover, to test the final hybrid configuration, five different subsets were tested. Each subset represents the data collected over one whole day, chosen randomly from the initial dataset and isolated from the whole process described before. Figure 8 shows the variation of the real lower solar panel output temperature (blue continuous line) and the variation predicted by the model (green dashed line). The following error values have been calculated for these validation days; these values are not normalized, in order to test the real operation of the model, but the normalized ones are included in italics. In order to validate the innovative feature of the hybrid model, several ANN, polynomial, and SVR models have also been tested using the whole dataset (global model). The results of these global models are presented in Table 9.

Conclusions and Future Works
The hybrid intelligent model described in this research predicts the output temperature of a solar panel, taking into account the input temperature, the flow through the panel, and the solar radiation. This type of model could be used to measure, for example, the thermal energy absorbed by a solar collector without using thermal energy measurement equipment.
The model has been created with a real dataset recorded over two years to ensure that all climatic conditions are included in the dataset. Moreover, different subsets were separated at the beginning of the modeling process to validate and test the final model. The testing was performed with the 5% of the samples that had been isolated from each cluster; this test made it possible to select the best hybrid configuration, which has nine local models based on artificial neural networks and support vector machines for regression.
The validation dataset was isolated from the rest of the dataset at the very beginning of data processing, and it therefore does not take the clusters into account. The validation test was performed with new data and the model was used as it would be in real time; each sample is used as input and internally assigned to a local model to calculate the output. The performance values obtained in this test represent a prediction with an MAE below 4 °C and a MAPE below 14.5% (0.0255 and 11.3191 with normalized values). The obtained MSE was 30.5010, and the NMSE was 0.1144. These results demonstrate that the approach predicts more accurate values than global models.
Regarding future lines of research, it would be interesting to extend the prediction horizon in order to predict the signal values at a future time. Moreover, it may be possible to create new models for the rest of the systems; this research only focused on solar thermal energy, but the bioclimatic house has many other systems that could be studied.