Predicting High or Low Transfer Efficiency of Photovoltaic Systems Using a Novel Hybrid Methodology Combining Rough Set Theory, Data Envelopment Analysis and Genetic Programming

Solar energy has become an important energy source in recent years because it generates less pollution than other energy sources. A photovoltaic (PV) system, which typically has many components, converts solar energy into electrical energy. With the development of advanced engineering technologies, the transfer efficiency of PV systems has increased. The combination of components in a PV system influences its transfer efficiency. Therefore, when predicting the transfer efficiency of a PV system, one must consider the relationship among system components. This work predicts whether the transfer efficiency of a PV system is high or low using a novel hybrid model that combines rough set theory (RST), data envelopment analysis (DEA), and genetic programming (GP). Finally, a real data-set is utilized to demonstrate the accuracy of the proposed method.


Introduction
Although traditional energy resources, such as oil and coal, account for the largest proportion of energy worldwide, they also produce more pollution than solar energy. As environmental awareness and the need to reduce pollution have increased, solar energy has become an important energy source in industrialized countries. Photovoltaic systems convert solar energy into electrical energy. However, PV systems are not yet popular and their transfer efficiency must be improved. Hence, engineers have used various combinations of system components to increase the transfer efficiency of PV systems.
Generally, the transfer efficiency of a PV system is only 6-20% [1]. According to the opinions of experts in PV energy in Taiwan, a transfer efficiency exceeding 9% is considered high and one ≤9% is considered low [2]. Engineers or energy managers must judge whether a PV system belongs to one category or the other; thus, a reliable prediction model is needed to determine whether the transfer efficiency of a PV system is high or low. Managers or decision-makers in the PV field will then be able to identify the critical components using the prediction model and improve transfer efficiency. Thus, this work develops a novel and efficient prediction model to determine whether the transfer efficiency of a PV system is high or low.
In applications of discriminating models, most studies have utilized different approaches to construct an effective prediction model [3][4][5][6][7]. These models were constructed using conventional statistical methods, such as discriminant analysis and logistic regression, or artificial intelligence (AI) methods, such as artificial neural networks (ANNs) and support vector machines (SVMs). Ong et al. [8] demonstrated that a discriminating model constructed using an ANN-based method is more accurate than a model constructed using traditional statistical methods, especially when data-sets are non-linear. However, ANN-based discriminating models have poor prediction accuracy when applied to small samples and when input variables are irrelevant [9]. Additionally, hidden layers in an ANN are difficult to explain, and the relationship between input variables and output variables in an ANN or SVM cannot be expressed by a mathematical equation. Genetic programming (GP) has recently been applied in many fields to construct classification or prediction models. Since GP does not require any assumptions about the relationships between dependent and independent variables to construct a prediction model [10], GP can be applied to both small and large samples [8]. In some applications, GP has better prediction accuracy than ANN-based methods. For example, Ong et al. [8] utilized GP to construct a more satisfactory credit scoring model than an ANN model; Muttil and Lee [11] utilized GP to predict coastal algal blooms and claimed that GP can obtain a more effective prediction model than an ANN in their analytical case. In prediction or classification applications, GP can be used to construct a mathematical equation [10][11][12]. Moreover, a comparison of the performance of classification models indicated that GP outperforms conventional statistical methods and ANNs [13].
Measuring and monitoring energy efficiency have become important issues in many fields [14]. Some studies have utilized data envelopment analysis (DEA) to assess energy efficiency. For instance, Boyd and Pang [15] examined the relationship between productivity and energy intensity, utilizing DEA to assess productivity. Hu and Kao [16] developed an energy efficiency index utilizing DEA. This index is used to determine the energy-saving target ratio (ESTR) for seventeen APEC countries.
Based on the importance of energy efficiency and the ability of DEA to determine the ratio between input and output variables, this work adopts DEA to evaluate the input/output efficiency of PV systems using multiple inputs, such as texture type, selection of a PV module, and PV module capacity, and one output (transfer efficiency of PV systems).
Moreover, identifying significant input variables is important when constructing an effective prediction model. Many conventional methods, such as correlation analysis, have been utilized to identify the significant input variables for predicting the output variable. However, such methods are restricted by assumptions such as a linear relationship among variables and normality, and they require large data-sets. Thus, a technique that provides a knowledge system contained in a data-set and clear attribute selection under different classes is desirable. Rough set theory (RST) can be utilized as a soft computing tool to deal with data-sets with poor information and to remove irrelevant attributes from a data-set [17]. Notably, RST has been applied in many real-world classification problems [18][19][20].
To construct an efficient prediction model that determines whether the transfer efficiency of a PV system is high or low, this work uses the input/output efficiency of a PV system as the predictive variable and enhances prediction accuracy using a novel hybrid model combining RST with GP; this model is called the RST-GP model. Because of its robust reliability in knowledge systems, RST is utilized during the first stage to identify significant input variables. During the second stage, the significant independent variables obtained from RST are utilized as input variables for GP to construct a prediction model that can determine whether the transfer efficiency of a PV system is high or low. The remainder of this paper is organized as follows. Section 2 reviews the PV system literature. Section 3 briefly reviews the DEA model used to evaluate the input/output efficiency of a PV system. Section 4 describes RST and GP. Section 5 elucidates the proposed hybrid model. Section 6 analyzes and compares the outcomes of the proposed and existing hybrid models. Section 7 gives conclusions.

An Overview of PV Systems
This section introduces the structure of a PV system and the factors influencing PV system transfer efficiency. Energies 2012, 5 548

PV System
A PV system primarily consists of a solar cell, an electrical conditioner, an inverter, and a system controller. A PV system uses an inverter to transform light energy into electrical energy. Figure 1 shows the PV system process.

Factors Influencing PV Systems
Gregg et al. [22] noted that numerous complex factors influence the efficiency of PV systems. These factors can be classified as internal or external factors. Internal factors include PV system texture, the azimuthal angle, transformation of the PV inverter, and selection of direct current (DC) voltage and an inverter. Among the internal factors, PV system texture, the most important factor, influences PV system transfer efficiency. Single crystal and polycrystals are common in PV systems.
The azimuthal angle of a PV system is that at which most light is received given the absence of obstacles; thus, the azimuthal angle varies with PV system location. The PV inverter transforms light energy into electrical energy. Selection of DC voltage and the inverter both influence PV system transfer efficiency. However, in the real world, the transformation of light energy into electrical energy is affected by dynamic changes in sunshine. Accordingly, the optimal transfer efficiency of a PV inverter cannot be attained in practice.
The two major external factors are described as follows. First, the amount of solar radiation strongly influences PV system transfer efficiency. Thus, the degree of solar radiation must also be considered when determining PV system transfer efficiency. Second, the temperature of a PV system affects the amount of electrical energy converted from light energy. Thus, determining the optimal temperature in a real environment is a major goal for energy experts.

Evaluating PV System Transfer Efficiency
Transfer efficiency of a PV system is the percentage of energy converted from light energy. The transfer efficiency formula is:
$$\text{Transfer efficiency} = \frac{P_{mou}}{P_{in}} \times 100\% \qquad (1)$$
where $P_{mou}$ is the maximum output electrical energy, and $P_{in}$ is the input light energy. As transfer efficiency increases, the amount of energy a PV system generates increases.
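The definition above, together with the 9% threshold quoted in the Introduction [2] and the 0/1 coding used later in the empirical analysis, can be sketched in a few lines of Python. This is only an illustration: the function names and the sample energy values are assumptions made for the example and do not come from the data-set.

```python
HIGH_THRESHOLD_PERCENT = 9.0  # threshold quoted from the expert opinion in [2]


def transfer_efficiency(p_mou: float, p_in: float) -> float:
    """Return transfer efficiency in percent: 100 * P_mou / P_in."""
    if p_in <= 0:
        raise ValueError("input light energy must be positive")
    return 100.0 * p_mou / p_in


def efficiency_class(efficiency_percent: float) -> int:
    """Code the efficiency as 1 (high, above 9%) or 0 (low, 9% or below)."""
    return 1 if efficiency_percent > HIGH_THRESHOLD_PERCENT else 0


# Hypothetical values, chosen only to exercise the two functions.
eff = transfer_efficiency(p_mou=120.0, p_in=1000.0)
print(eff, efficiency_class(eff))
```

A system sitting exactly at 9% falls in the low class, matching the "≤9%" wording of the threshold.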

Using DEA to Determine Efficiencies
Notably, DEA is a linear programming (LP)-based technique for evaluating decision-making units (DMUs) and deals with many decision-making problems by converting multiple output and input variables into a single comprehensive performance measure [23]. DEA is an extensively utilized non-parametric data analysis technique. For instance, Hu and Kao [16] utilized DEA to construct an energy efficiency index. This index is used to determine the energy-saving target ratio (ESTR) for seventeen Asia-Pacific Economic Cooperation (APEC) countries. Tsai et al. [23] applied DEA with other measures to assess the magnitude of performance differences between leading telecom carriers. Guo and Tanaka [24] utilized a fuzzy DEA model to solve an efficiency evaluation problem with given fuzzy input and output data. Wu et al. [25] used the DEA-neural network approach to evaluate branch efficiency for a large Canadian bank. Additional detailed descriptions of DEA can be found elsewhere [26][27][28].
DEA, developed by Charnes, Cooper, and Rhodes (CCR) [28], was based on Farrell's (1957) pioneering study of efficiency measures (relative efficiency or productivity of a specific DMU) [29]. Suppose the data for each DMU $j$, $j = 1, 2, \ldots, n$, comprise $q$ positive outputs, $y_{rj}$, $r = 1, 2, \ldots, q$, and $m$ positive inputs, $x_{ij}$, $i = 1, 2, \ldots, m$, and let DMU$_o$ ($o \in \{1, 2, \ldots, n\}$) be the DMU whose relative efficiency is to be maximized. The DEA model is displayed as an LP as follows:
$$\max\; h_o = \sum_{r=1}^{q} u_{ro}\, y_{ro} \quad \text{s.t.} \quad \sum_{i=1}^{m} v_{io}\, x_{io} = 1, \quad \sum_{r=1}^{q} u_{ro}\, y_{rj} - \sum_{i=1}^{m} v_{io}\, x_{ij} \le 0 \;\; (j = 1, 2, \ldots, n), \quad u_{ro},\, v_{io} \ge 0 \qquad (2)$$
where $u_{ro}$ and $v_{io}$ are the variable weights given to the rth output and ith input of the oth DMU, respectively. Furthermore, $u_{ro}$ and $v_{io}$ are the decision variables of the LP model used to determine the relative efficiency of DMU$_o$. The maximum value (efficiency score), $h_o$, cannot exceed 1. If $h_o = 1$, DMU$_o$ lies on the constant returns to scale (CRS) frontier [30]. There are two CCR models in practice. One minimizes input variables, and the other maximizes output variables. In this work, in order to obtain maximum energy efficiency, the output-maximizing CCR model is utilized to obtain the optimal value of the objective function, $h_o$.
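To make the meaning of the efficiency score concrete, the sketch below covers only the simplest special case of the CCR model: one input and one output per DMU. In that case the LP optimum reduces to each DMU's output/input ratio divided by the best ratio among all DMUs, so no LP solver is needed. The general multi-input case used in this work does require an LP solver. The function name and the data are hypothetical.

```python
def ccr_efficiency_single(inputs, outputs):
    """CCR (constant returns to scale) scores for one input and one output.

    With a single input and a single output, the CCR optimum for each DMU
    equals its output/input ratio divided by the best ratio observed.
    """
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("inputs and outputs must be non-empty and equal in length")
    if any(x <= 0 for x in inputs) or any(y <= 0 for y in outputs):
        raise ValueError("DEA data must be strictly positive")
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [ratio / best for ratio in ratios]


# Hypothetical data: one input (e.g., module capacity) and one output
# (transfer efficiency in percent) for three DMUs.
capacity = [2.0, 4.0, 5.0]
efficiency = [10.0, 12.0, 20.0]
scores = ccr_efficiency_single(capacity, efficiency)
print(scores)
```

The DMU with the best ratio receives a score of 1 and defines the frontier; every other DMU is scored relative to it, which is why no score can exceed 1.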

Rough Set Theory and Genetic Programming
This section reviews the basic concepts of RST and GP.

Basic Concepts of Rough Set Theory
Pawlak [31] developed RST as a data-mining approach in 1982. RST has proved effective for data-sets with poor information or ambiguity, and it can be applied in many fields [32][33][34]. Walczak and Massart [35] provided a detailed description of RST.
An information system can be represented as $S = (U, R, V, f)$, where $U$ is the universe (a finite set of objects, $U = \{x_1, x_2, \ldots, x_n\}$), $R$ is a finite set of attributes (features and variables), $V = \bigcup_{r \in R} V_r$, where $V_r$ is the domain of attribute $r$, and $f: U \times R \to V$ is an information function such that $f(x, r) \in V_r$ for every $x \in U$ and $r \in R$. In RST, highly accurate, good-quality approximations are very important when extracting decision rules. Let $P \subseteq R$ and $X \subseteq U$; the lower approximation of $X$ in $S$ by $P$ is denoted as $\underline{P}X$, and the upper approximation of $X$ in $S$ by $P$ is denoted as $\overline{P}X$. They are derived as follows:
$$\underline{P}X = \{x \in U : [x]_P \subseteq X\} \qquad (3)$$
$$\overline{P}X = \{x \in U : [x]_P \cap X \neq \emptyset\} \qquad (4)$$
where $[x]_P = \{y \in U : f(y, r) = f(x, r) \text{ for all } r \in P\}$ is the set of objects that are indiscernible from $x$ with respect to $P$. From Equations (3) and (4), the boundary can be represented as follows:
$$BN_P(X) = \overline{P}X - \underline{P}X \qquad (5)$$
Hence, reducts can be obtained utilizing approximation spaces. Given an information system $S = (U, R, V, f)$, the reduct $RED(P)$ is the minimal set of attributes $P \subseteq R$ such that
$$r_P(U) = r_R(U) \qquad (6)$$
where $r_P(U)$ is the ratio of all P-correctly classified objects to all objects (U) in the system.
Furthermore, the core is common to all reducts. For instance, $COR(P)$ is the core of $P$ when $COR(P) = \bigcap RED(P)$. Reduction is a feature subset selection process. The selected feature subset retains its explanation ability and has minimal redundancy [36]. Core analysis results can be used as a reference for the important attributes in a knowledge system. Several RST-based reduction and feature-selection algorithms have been developed. For instance, Wen et al. [37] applied RST and a grey model to analyze the factors influencing gas breakdown. Li et al. [38] developed a grey-based rough set approach to solve a supplier-selection problem. Thangavel and Pethalakshmi [39] reviewed studies using RST-based feature selection.
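The lower approximation, upper approximation, and boundary described above can be illustrated on a toy information system. The sketch below is a minimal Python example under stated assumptions: the attribute names, the four objects, and the target set are hypothetical and are not taken from the PV data-set.

```python
from collections import defaultdict


def partition(table, attributes):
    """Group object ids into indiscernibility classes over `attributes`."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return list(classes.values())


def lower_upper(table, attributes, target):
    """Return the (lower, upper) approximations of the object set `target`."""
    lower, upper = set(), set()
    for block in partition(table, attributes):
        if block <= target:   # block lies entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return lower, upper


# Hypothetical information system with two condition attributes.
table = {
    "s1": {"texture": "single", "inverters": 1},
    "s2": {"texture": "single", "inverters": 1},
    "s3": {"texture": "poly", "inverters": 1},
    "s4": {"texture": "poly", "inverters": 2},
}
high = {"s1", "s3"}  # objects assumed to have high transfer efficiency

lower, upper = lower_upper(table, ["texture", "inverters"], high)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Here s1 and s2 share identical attribute values but differ in class membership, so they cannot be told apart and end up in the boundary region rather than in the lower approximation.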

Genetic Programming
Koza [40] developed GP as a novel algorithm for evolving computer programs; it exploits evolution to solve model structure identification problems and performs symbolic regression [41]. The basic concepts of GP resemble those of genetic algorithms (GAs), and include mutation, crossover, and reproduction [10]. Unlike GAs, which encode the genetic state as binary digits (0 and 1), GP uses a generic parse-tree representation. Additionally, GP can construct an optimal forecasting equation through symbolic regression. The main advantage of symbolic regression is that it is not limited to any functional form or normality assumption for data-sets. For instance, GP is more flexible in its symbolic setting than conventional regression methods or data-mining approaches (e.g., ANNs). Notably, GP is also widely utilized in practical applications such as forecasting [10][11][12][42] and classification [8,43].
Functions or statements in GP comprise operators ({+, −, ×, ÷, log, exp}) and trigonometric functions ({sin, cos, tan}). Figure 2 shows an example of a GP parse tree. When selecting input variables, GP automatically finds the variables that contribute most to the model [11,42] and, unlike an ANN, which requires a large data-set, places no restriction on data size [8,43].
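A parse tree of the kind described above can be represented and evaluated with a short recursive function. The Python sketch below is illustrative only: the nested-tuple encoding, the example tree, and the protected division (a common GP convention that returns 1.0 when the divisor is zero) are assumptions for this example, and only a subset of the function set is included.

```python
import math
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the divisor is zero."""
    return a / b if b != 0 else 1.0


# Function set: binary operators and unary functions (subset of those in the text).
BINARY = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": protected_div}
UNARY = {"exp": math.exp, "sin": math.sin, "cos": math.cos, "tan": math.tan}


def evaluate(node, variables):
    """Recursively evaluate a parse tree given a dict of variable values."""
    if isinstance(node, tuple):
        op, *children = node
        args = [evaluate(child, variables) for child in children]
        if op in BINARY and len(args) == 2:
            return BINARY[op](*args)
        if op in UNARY and len(args) == 1:
            return UNARY[op](*args)
        raise ValueError(f"unknown operator or wrong arity: {op!r}")
    if isinstance(node, str):  # terminal: a named input variable
        return variables[node]
    return float(node)         # terminal: a numeric constant


# Hypothetical tree encoding the expression x1 * x2 + sin(x3).
tree = ("+", ("*", "x1", "x2"), ("sin", "x3"))
value = evaluate(tree, {"x1": 2.0, "x2": 3.0, "x3": 0.0})
print(value)
```

Because every individual is just a tree of this form, crossover and mutation can be implemented by swapping or replacing subtrees, and the best tree can be read back as an explicit equation.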

The Proposed Hybrid Prediction Model
This work develops a four-step procedure for predicting whether the transfer efficiency of a PV system is high or low. The proposed prediction model is as follows:
Step 1: Collect the transfer efficiencies of PV systems with various component combinations. These components are the independent variables, and transfer efficiency is a binary output variable (i.e., high or low) in the proposed prediction model.
Step 2: RST selects the significant independent variables of a PV system, based on its robust reliability in knowledge systems [36][37][38][39]. The importance of each feature in the RST-based selection (i.e., core analysis) is measured as described in [44].
Step 3: DEA evaluates the energy efficiency (i.e., the input/output ratio) of a PV system. The input variables in DEA are obtained in Step 2, and the output variable in DEA is the transfer efficiency of a PV system. The DMU values obtained from DEA represent the energy efficiencies of PV systems.
Step 4: GP constructs a classification model for predicting whether the transfer efficiency of a PV system is high or low. For the GP model, this work utilizes the significant independent variables obtained in Step 2 and the input/output ratio obtained in Step 3 as input variables, and the binary transfer efficiency (i.e., high or low) of a PV system is the output variable. Table 1 presents the parameter settings of the GP model. The parameters of GP are obtained by a trial-and-error approach.
In Step 2, RST is utilized to select the significant independent variables of PV systems because adopting significant independent variables can yield good accuracy when constructing a prediction model [36]. Moreover, RST can deal with small data-sets and requires no statistical assumptions (such as a linear relationship between the input variables and the output variable). In Step 3, DEA is utilized to evaluate the energy efficiency of PV systems because this index efficiently provides sufficient information for evaluating the economic value of PV systems. In Step 4, GP is utilized to construct a prediction model because of its high performance in forecasting and classification. Furthermore, GP yields good forecasts using only small data-sets [42]. Hence, RST, DEA, and GP are integrated herein to predict the high or low transfer efficiency of PV systems, and the model thus developed is called the RST-DEA-GP model.
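The feature-importance measure used in Step 2 follows [44], and its exact equation is not reproduced in this section. A form commonly used in the RST literature, assumed here purely for illustration, is $\sigma_{(C,D)}(a) = (\gamma(C, D) - \gamma(C - \{a\}, D)) / \gamma(C, D)$, where $\gamma(C, D)$ is the degree of dependence of the decision feature $D$ on the condition features $C$. The Python sketch below computes this quantity on a hypothetical four-row decision table; the attribute names and values are invented.

```python
from collections import defaultdict


def dependency(rows, condition, decision):
    """Degree of dependence: share of rows whose condition values
    determine the decision value unambiguously."""
    decisions = defaultdict(set)
    sizes = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in condition)
        decisions[key].add(row[decision])
        sizes[key] += 1
    consistent = sum(sizes[k] for k, d in decisions.items() if len(d) == 1)
    return consistent / len(rows)


def significance(rows, condition, decision, attribute):
    """Relative drop in dependence when `attribute` is removed (assumed form)."""
    full = dependency(rows, condition, decision)
    if full == 0:
        return 0.0
    reduced = dependency(rows, [a for a in condition if a != attribute], decision)
    return (full - reduced) / full


# Hypothetical decision table: two condition features and a 0/1 decision.
rows = [
    {"texture": "single", "inverters": 1, "high": 1},
    {"texture": "single", "inverters": 2, "high": 1},
    {"texture": "poly", "inverters": 1, "high": 0},
    {"texture": "poly", "inverters": 2, "high": 0},
]
features = ["texture", "inverters"]
sig_texture = significance(rows, features, "high", "texture")
sig_inverters = significance(rows, features, "high", "inverters")
print(sig_texture, sig_inverters)
```

In this toy table the decision is fully determined by texture alone, so removing texture destroys all dependence while removing the number of inverters changes nothing.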

Empirical Analysis
A real data-set of transfer efficiency of PV systems collected from a Taiwanese research organization is utilized to demonstrate the effectiveness of the proposed model. The data used in Step 1 concern 38 PV systems. Each PV system is described by 18 variables (e.g., texture type, capacity for PV-transfer, and number of inverters) and a binary transfer efficiency (i.e., low or high). The low and high transfer efficiencies of the PV systems are coded as 0 and 1, respectively. The data-set comprises 38 PV systems: 15 with low and 23 with high transfer efficiencies.
Table 2 (variable, description, importance value):
X1     …                                …
X2     The output power of inverter     0.5715
X3     The selection of PV module       0.4817
X4     The number of inverters          0.3914
X5     The weights of PV module         0.3367
X6     The selection of inverter        0.2893
X7     PV module capacity               0.2567
X8     The selection of DC voltage      0.2638
X9     The location of PV setting       0.2476
X10    DMU (obtained from DEA)          -
In Step 2 of the proposed hybrid model, RST is utilized to identify the significant independent variables of PV systems. The RST algorithm can be implemented in MATLAB. The RST results indicate that nine independent variables (X1-X9) are significant (Table 2) because their importance values are greater than 0.2. There is no clear criterion for determining this threshold on the importance value. Moreover, the nine independent variables (X1-X9) are highly correlated with the output variable (the low or high transfer efficiency of PV systems); the correlation coefficients are greater than 0.6. Also, based on the opinion of experts in PV energy in Taiwan, these nine variables strongly influence the transfer efficiency of PV systems.
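The threshold-based selection described above can be sketched as a simple filter. The importance values below are those listed in Table 2; the dictionary layout and variable names are assumptions for the example, and the entry for X1 is omitted because its value is not available in this text.

```python
# Importance values as listed in Table 2 (the value for X1 is not available here).
IMPORTANCE = {
    "output power of inverter": 0.5715,
    "selection of PV module": 0.4817,
    "number of inverters": 0.3914,
    "weights of PV module": 0.3367,
    "selection of inverter": 0.2893,
    "PV module capacity": 0.2567,
    "selection of DC voltage": 0.2638,
    "location of PV setting": 0.2476,
}
THRESHOLD = 0.2  # cut-off used in the text; no formal criterion is given for it

# Keep the variables above the threshold, ranked from most to least important.
selected = sorted(
    (name for name, value in IMPORTANCE.items() if value > THRESHOLD),
    key=lambda name: IMPORTANCE[name],
    reverse=True,
)
print(selected)
```

Since the threshold is a judgment call, rerunning the filter with a few alternative cut-offs is a cheap way to see how sensitive the selected set is to that choice.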
In Step 3, DEA is utilized to evaluate the DMU value of each PV system. Table 2 lists the DMU variable (X10). In applying DEA, the input variables are the nine significant variables obtained in Step 2 and the output variable is PV system transfer efficiency. The DEA algorithm can be executed in LINGO software. Table 3 lists the DMU values of the PV systems. In Step 4, the significant independent variables obtained in Step 2 and the DMU obtained in Step 3 are utilized as input variables for GP to predict the high or low level of PV system transfer efficiency. To demonstrate the effectiveness of the proposed hybrid model, some basic classification models, such as K Nearest Neighbor (KNN), Naive Bayes (NB), SVM, ANN, and GP, are utilized as benchmark models. These basic classification models are data-mining techniques and can obtain better prediction performance than traditional linear statistical methods (e.g., linear regression) [8,10]. Although a previous study [36] also adopted a hybrid classification model that combines RST, DEA, and SVM to predict business failures, its RST stage did not specify, by means of a clear equation, how the important variables are obtained. That study [36] only adopted the RSES software tool [45] to select important variables. Furthermore, the SVM model performs well only with large data-sets, and collecting large data-sets for PV systems is difficult. Hence, the use of a suitable classification model for small data-sets is important for constructing a high-precision prediction model.
In order to compare the accuracy of the hybrid prediction model with and without DEA, this work designs several experiments for the prediction models. The proposed model, named the RST-DEA-GP model, adopts the significant variables obtained by RST and the DMU variable obtained from DEA as input variables for GP (model I). The RST-GP model adopts only the significant variables, X1-X9, as the input variables for GP (model II). In both models I and II, this work adopts leave-one-out cross validation to test the accuracy of the prediction model.
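Leave-one-out cross validation, as used above, holds out each system once, fits the model on the remaining systems, and records whether the held-out system is classified correctly. The Python sketch below shows the procedure and the resulting confusion matrix and correct classification rate. To keep it short, a 1-nearest-neighbor rule stands in for the GP model, and the five data points are hypothetical; neither is the setup used in this work.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training point."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]


def leave_one_out(xs, ys, predict):
    """Return (confusion, rate); confusion[(actual, predicted)] holds counts."""
    confusion = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # all samples except the i-th
        train_y = ys[:i] + ys[i + 1:]
        predicted = predict(train_x, train_y, xs[i])
        confusion[(ys[i], predicted)] += 1
    correct = confusion[(0, 0)] + confusion[(1, 1)]
    return confusion, correct / len(xs)


# Hypothetical two-feature data; labels follow the 0 = low, 1 = high coding.
xs = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.45]]
ys = [0, 0, 1, 1, 1]
confusion, rate = leave_one_out(xs, ys, nearest_neighbor_predict)
print(confusion, rate)
```

With only 38 systems available, leave-one-out uses nearly all of the data for training in every round, which is why it suits small data-sets better than a single train/test split.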
Tables 4 and 5 show the analytical results for hybrid models I and II, respectively. Model I has an average correct classification rate of 92.10%, and that of model II is 84.21%. Hence, adding DEA provides more information than adopting significant input variables only and enhances prediction model accuracy.
The RST-SVM-based models are also utilized to predict whether PV systems have high or low transfer efficiency. The RST-DEA-SVM model uses both the significant variables obtained from RST and the DMU as input variables of the SVM (model III). The RST-SVM model, which utilizes only significant attributes, is model IV. In constructing the SVM model, this work utilizes STATISTICA software to generate a classification model. Some studies [46,47] utilized the Gaussian kernel function to enhance prediction performance. For the SVM model, the parameter settings are the Gaussian kernel function, C = 3, and r = 0.129, which can generate an appropriate prediction model. Tables 6 and 7 summarize the prediction results in confusion-matrix form for models III and IV, respectively. Based on the RST-SVM-based model results, adding DEA improves the correct classification rate from 78.94% to 81.57%.
Furthermore, two RST-ANN-based prediction models are applied. One uses the significant variables obtained from RST and the DMU variable obtained from DEA as input variables for an ANN (model V, named the RST-DEA-ANN model). The RST-ANN model utilizes only significant variables as input variables for the ANN (model VI). This work uses Qnet2000 software to construct the ANN classification model. Cybenko [48] demonstrated that utilizing one hidden layer is sufficient when modeling any complex system. Hence, the appropriate network models are 10-5-1 and 9-7-1 for the nodes of the input layer, hidden layer, and output layer for models V and VI, respectively. Tables 8 and 9 summarize the prediction results in confusion-matrix form for models V and VI, respectively. As with the RST-SVM-based models, the RST-ANN-based results show that adding DEA improves the correct classification rate, here from 76.31% to 81.57%.
Following the same analysis as for models I to VI, RST-KNN-based and RST-NB-based prediction models are applied to predict whether PV systems have high or low transfer efficiency. This work also adopts STATISTICA to construct the KNN and NB classification models. For KNN classification, one model uses the significant variables obtained from RST and the DMU variable obtained from DEA as input variables for KNN (model VII, named the RST-DEA-KNN model). The RST-KNN model utilizes only significant variables as input variables for KNN (model VIII). Tables 10 and 11 summarize the prediction results in confusion-matrix form for models VII and VIII, respectively. Based on the RST-KNN-based model results, adding DEA improves the correct classification rate from 73.68% to 76.31%. For NB classification, one model uses the significant variables obtained from RST and the DMU variable obtained from DEA as input variables for NB (model IX, named the RST-DEA-NB model). The RST-NB model utilizes only significant variables as input variables for NB (model X). Tables 12 and 13 summarize the prediction results in confusion-matrix form for models IX and X, respectively. Based on the RST-NB-based model results, adding DEA improves the correct classification rate from 73.68% to 76.31%.

Conclusions
This work makes four important contributions to the existing literature. First, adding DEA provides additional information for constructing a model that can predict high or low transfer efficiency of PV systems. Second, the results of RST allow managers or decision-makers in the PV field to identify the critical components. Third, the proposed hybrid model has better classification results than existing hybrid models, regardless of whether only significant variables are adopted or significant variables and the DMU variable are adopted. Fourth, the proposed model also has the lowest misclassification rate among all models tested. Therefore, the proposed RST-DEA-GP model can accurately predict whether a PV system has high or low transfer efficiency. Future work can apply grey theory to determine whether a PV system has high or low transfer efficiency based on uncertain information. In addition, to further demonstrate the effectiveness of the proposed hybrid prediction model, it will be applied to PV systems in other countries.

Figure 2. Example of GP parse tree representation.
of dependence between the conditional features C (the variables of PV systems) and the decision feature D (i.e., the high or low PV transfer efficiency), and between C with a conditional feature removed and the decision feature D. $\sigma_{(C,D)}(a)$ denotes the change in the degree of dependence when feature a is removed from the full set of condition features C. When $\sigma_{(C,D)}(a)$ is large, feature a strongly affects the decision attribute D.
utilized the Gaussian kernel function to enhance prediction performance. For the SVM model, the parameter settings are the Gaussian kernel function, C = 3, and r = 0.129.

Table 1. The settings of GP model.

Table 2. Selected significant variables from RST and DMU variable from DEA.

Table 3. The results of DMU value of each PV system by utilizing DEA.

Table 4. RST-DEA-GP model (model I) results with both significant variables and DMU.

Table 5. RST-GP model (model II) results with only significant variables.

Table 6. RST-DEA-SVM model (model III) results with significant variables and DMU.

Table 7. RST-SVM model (model IV) results with only significant variables.

Table 8. RST-DEA-ANN model (model V) results with both significant variables and DMU.

Table 9. RST-ANN model (model VI) results with only significant variables.

Table 10. RST-DEA-KNN model (model VII) results with both significant variables and DMU.

Table 11. RST-KNN model (model VIII) results with only significant variables.

Table 12. RST-DEA-NB model (model IX) results with both significant variables and DMU.

Table 13. RST-NB model (model X) results with only significant variables.