GMDH-Based Semi-Supervised Feature Selection for Electricity Load Classification Forecasting

Abstract: With the development of smart power grids, communication network technology, and sensor technology, there has been an exponential growth in complex electricity load data. Irregular electricity load fluctuations caused by weather and holiday factors disrupt the daily operation of power companies. To deal with these challenges, this paper investigates a day-ahead electricity peak load interval forecasting problem. It transforms the conventional continuous forecasting problem into a novel interval forecasting problem, and then further converts the interval forecasting problem into a classification forecasting problem. In addition, an indicator system of factors influencing the electricity load is established from three dimensions, namely the load series, calendar data, and weather data. A semi-supervised feature selection algorithm based on the group method of data handling (GMDH) technology is proposed to address the electricity load classification forecasting problem. The proposed algorithm consists of three main stages: (1) training the basic classifier; (2) selectively marking the most suitable samples from the unlabeled data and adding them to an initial training set; and (3) training the classification models on the final training set and classifying the test samples. An empirical analysis of electricity load datasets from four Chinese cities is conducted. Results show that the proposed model can address the electricity load classification forecasting problem more efficiently and effectively than the FW-SemiFS (forward semi-supervised feature selection) and GMDH-U (GMDH-based semi-supervised feature selection for customer classification) models.


Introduction
Electricity load forecasting is a major issue in the planning and operation of modern electricity networks and electricity markets [1,2]. It can be classified into long-term [3], medium-term [4], short-term [5][6][7], and ultra-short-term [8] forecasting, and the cut-off points between these four categories are three years, two weeks, and one day, respectively [9]. Short-term load forecasting (STLF), which is applied to horizons of no more than one day ahead, can bring significant environmental and economic benefits for energy systems. For reliable and efficient operations, STLF is used wherever decisions have a significant impact on operations, such as scheduling generating capacity dispatches, demand side management, security assessments, and generator maintenance scheduling [5,10][11][12][13][14][15][16]. Unsatisfactory STLF can increase operational costs and cause equipment failures or system blackouts, thus wasting resources [17][18][19]. As accurate and timely forecasting methods are important for environmentally friendly and economically sound operations, STLF research is essential to ensure efficient and reliable power system operations.
STLF involves forecasting the total demand and the peak demand within one day. For example, Bessec and Fouquau [19] developed a one-day-ahead forecast for half-hourly electricity loads in France using a combination of stationary wavelet transformations, with 502 daily observations for each half-hour. Based on the corresponding weather forecasts, Feng and Ryan [20] provided accurate day-ahead hourly load forecasting for multiple zones within a region using temporal and weather conditional epi-splines-based load models. Tong et al. [21] developed a deep learning based model and established a support vector regression model to forecast the total day-ahead electricity load, and then refined the features by stacking denoising auto-encoders with historical electricity load data and related temperature parameters.
In addition to total electricity load forecasting, peak load forecasting is also relevant to power network dispatch centers. For instance, dispatching center operators require daily peak loads for scheduled maintenance or adequate assessments. Therefore, the forecasting of daily peak loads should be considered in STLF. However, only a few researchers have considered electricity peak load forecasting. Amjady [22] presented a new time series model that could precisely forecast the daily peak loads of a power system, and obtained results from extensive tests to confirm the validity of the developed approach. In reality, because of the hysteresis in generator units, even a large number of spare generator units fail to meet immediate electricity needs when loads reach a peak, which causes power restrictions. Therefore, it is essential to accurately forecast peak loads in power grids.
There is little research focusing on electricity peak load forecasting because most studies only seek to predict specific electricity load values [23]. Electricity peak load interval forecasting has not been investigated so far. However, peak load interval forecasting has greater practical value than forecasting specific electricity loads, since the power generated by generator sets takes interval values, which means that operators need to start spare units in advance. When the peak load lies in different intervals, the power dispatcher needs to configure the corresponding generators in advance. Therefore, this paper converts the peak power load into an interval load and then translates each interval into a class label, so that the peak load can be forecast as a classification problem.
Previous research has paid close attention to the accurate forecasting of electricity loads, and multiple methods have been proposed, including classical statistical methods [24] and machine learning methods [25][26][27]. The classical statistical methods often assume that the load is a function of several explanatory variables and estimate the specified functional parameters [28,29]. One of the well-known methods is the seasonal autoregressive integrated moving average (SARIMA) model proposed by Box and Jenkins [30]. To improve forecasting accuracy, there have been numerous attempts to enhance such models. For example, Soares and Medeiros [31] proposed a SARIMA model for hourly electricity loads in southeast Brazil. Although SARIMA models are easy to use and are capable of forecasting accurately, they have some limitations. The machine learning methods such as artificial neural networks (ANNs) and support vector regression (SVR) are restricted to specified functions [20]. SVR-based electricity load forecasting methods have been proposed and show good performance, mainly due to the strong non-linear learning capability of SVR. A comparison between the machine learning methods (ANNs and SVR) and discrete-time univariate econometric models can be found in [32]. Both theoretical and empirical findings have indicated that a combination of different models can overcome the limitations of single models and improve forecasting accuracy by harnessing each model's merits. Consequently, several hybrid models have been developed that incorporate different energy field models for electricity load forecasting. Some researchers proposed hybrid methods that combine a neural network with evolutionary algorithms [33]. For instance, Mori and Takahashi [34] proposed a hybrid intelligent method for probabilistic STLF, and Xiong et al. [35] converted hourly load series into 24 monthly interval time series and proposed a hybrid approach for forecasting the electricity demand intervals. Fan et al. [2] proposed an SVR model combining auto regression with the differential empirical mode decomposition method for electricity load forecasting. Although the above methods may be used to resolve the continuous electricity load forecasting problem, they are not suitable for the electricity peak load classification forecasting problem. Hence, novel methods are needed to solve the classification forecasting problem of electricity peak loads.
The group method of data handling, which is a family of inductive algorithms for the computer-based mathematical modeling of multi-parametric datasets, has been found to be an effective tool for solving classification problems in the machine learning field. It can also be used for short-term load forecasting [36,37] and traffic flow prediction [38]. The GMDH-type neural network, which combines GMDH and neural networks, can improve forecasting accuracy [39] and solve classification problems efficiently [40]. Nevertheless, to the best of our knowledge, no research has utilized GMDH and neural networks for electricity peak load classification forecasting.
This paper investigates the one-day-ahead electricity peak load classification forecasting problem. One major contribution is that it transforms the conventional continuous forecasting problem into a novel interval forecasting problem, and then further converts the interval forecasting problem into a classification forecasting problem. In addition, an indicator system of factors influencing the electricity load is established from three dimensions, namely the load series, calendar data, and weather data. Another contribution is that a novel semi-supervised feature selection algorithm based on the group method of data handling technology is proposed to address the electricity load classification forecasting problem.
The rest of the paper is organized as follows. The related theory and the GMDH-based semi-supervised feature selection model for electricity load classification are introduced in Section 2. Section 3 presents the experimental design and analyzes the results in detail. Section 4 draws conclusions and provides suggestions for future research.

GMDH Network
The group method of data handling (GMDH) is a basic technique for self-organized learning. It enables researchers to control the modeling process of a complex model from the input set to the output data and to determine the model parameters [41][42][43].
The GMDH network establishes a relationship between input and output, which is referred to as the Volterra function series or the Kolmogorov-Gabor polynomial function:

$$y = a_0 + \sum_{i=1}^{m} a_i x_i + \sum_{i=1}^{m} \sum_{j=1}^{m} a_{ij} x_i x_j + \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} a_{ijk} x_i x_j x_k + \cdots$$

Suppose that the linear function is set. All items are then taken as the m + 1 initial input variables. The specific modeling process is as follows. From the transfer function, a new neuron is obtained to construct the first layer (see Figure 1). The specific expression is as follows:

First, the parameters are calculated by least squares estimation, and the external criterion value of every intermediate candidate model is computed on the model selection set. In general, the accuracy of the intermediate candidate model increases when the external criterion value decreases. When the confidence level is selected, the external criterion values are compared against the threshold value. Finally, every two models are paired, and each pair then becomes an input for the second layer:

Similarly, the intermediate candidate model $t_2 = C_{F_1}^{2}$ is obtained in the second layer. Repeating the above steps, the model continues working until an optimal complexity model is determined. Therefore, the termination principle obeys the optimal complexity principle [44]. To identify the initial model contained in the optimal complexity model $y^*$, the GMDH network structure can be traced from the last layer back to the initial input layer. As shown in Figure 1, $v_1, v_2, v_3, v_4, v_5$ are chosen as the initial input models. Then, each variable is paired with another in a group to compete with each other. Nonetheless, $y_1, y_2, y_3, y_4$ are preserved by the algorithm. Note that $v_1, v_3, v_4, v_5$ remain in the model to participate in the subsequent competition, whereas $v_2$ is eliminated. In other words, $x_2, x_3, x_4$ are selected and $x_1$ is deleted. During the modeling process, the construction of the external criteria is also crucial. Detailed descriptions of the SSFS-GMDH model and the external criteria are given in Sections 2.2 and 2.3. The interpretation of the symbols can be found in Table 1.
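The layer-building procedure described above can be illustrated with a minimal sketch. It assumes a linear transfer function of the form y ≈ a0 + a1·x_i + a2·x_j for each pair of inputs, least squares fitting on a training part A, and the mean squared error on a separate selection part B as the external criterion. The function names (`fit_neuron`, `predict_neuron`, `gmdh_layer`) and the parameter `keep` are hypothetical and are not taken from the paper; the paper's own transfer function and threshold rule may differ.

```python
from itertools import combinations

import numpy as np


def fit_neuron(x_pair, y):
    """Least squares fit of y ~ a0 + a1*x_i + a2*x_j (assumed linear transfer function)."""
    design = np.column_stack([np.ones(len(x_pair)), x_pair])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_neuron(coef, x_pair):
    """Evaluate a fitted pairwise neuron on new inputs."""
    return np.column_stack([np.ones(len(x_pair)), x_pair]) @ coef


def gmdh_layer(XA, yA, XB, yB, keep):
    """Fit one neuron per input pair on A, score it on B, and keep the best `keep` neurons.

    The score is the mean squared error on B, used here as an illustrative
    external criterion (smaller is better).
    """
    candidates = []
    for i, j in combinations(range(XA.shape[1]), 2):
        coef = fit_neuron(XA[:, [i, j]], yA)
        err = float(np.mean((yB - predict_neuron(coef, XB[:, [i, j]])) ** 2))
        candidates.append((err, (i, j), coef))
    candidates.sort(key=lambda c: c[0])
    return candidates[:keep]
```

The outputs of the retained neurons would then serve as the inputs of the next layer, and inputs that appear in no retained neuron are dropped, which is how the network performs feature selection.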

Detailed Modeling Steps
The basic flowchart of the SSFS-GMDH model is illustrated in Figure 2, and the detailed modeling steps are as follows:
Input: L, U, T, K, θ, p.
Output: Classification results for the test set T from the final trained model.
Step 1: Divide the original dataset into a training set L with class labels, a dataset U without class labels, and a test set T with class labels. Further divide L into the simulated training set $L_{train}$ and the simulated validation set $L_{verify}$.
Step 2: Generate N training subsets by mapping $L_{train}$ onto random subspaces, and then train N basic classification models, one on each subset.
Step 3: Use the trained classification models to classify $L_{verify}$, and then choose the classifier with the highest classification accuracy.
Step 4: Use the selected classification model to assign class labels to the unlabeled dataset U, obtaining the set $U_l$ of marked samples.
Step 5: Calculate and sort the confidence level of each sample. δ is defined as the confidence level of each marked sample in set $U_l$ and is calculated as $\delta = k / K$, where K is the number of neighboring samples chosen from the initial labeled training set L, and k is the number of those K neighbors that have the same class label as the sample. In this paper, the Euclidean distance is used to calculate the distance between samples. Since δ ∈ [0, 1], a higher value of δ indicates a higher confidence level. Then, sort the marked samples by confidence level.
Step 6: Choose a certain proportion of the marked samples with the highest confidence levels from $U_l$ and put them into $L_{train}$.
Step 7: Repeat Steps 2 to 6. The iteration stops when the proportion p of samples in U that have been added to $L_{train}$ exceeds θ.
Step 8: Train the final classification model, select the final feature subset $F_s$, and classify the samples in the testing set T.
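Steps 5 and 6 can be sketched as follows. This is only an illustration under stated assumptions: the confidence is taken to be the fraction of the K nearest labeled neighbors (Euclidean distance) that share the predicted label, and the "certain proportion" of Step 6 is passed in as a hypothetical argument `proportion`. The function names `confidence` and `select_confident` are likewise hypothetical.

```python
import numpy as np


def confidence(L_X, L_y, U_X, U_pred, K):
    """Confidence of each marked sample: share of its K nearest labeled
    neighbors in L whose class label equals the label predicted for it."""
    # Pairwise Euclidean distances, shape (len(U_X), len(L_X)).
    dists = np.linalg.norm(U_X[:, None, :] - L_X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :K]
    agree = L_y[nearest] == U_pred[:, None]
    return agree.sum(axis=1) / K


def select_confident(delta, proportion):
    """Indices of the most confident samples (assumed top-`proportion` rule)."""
    n_keep = max(1, int(round(proportion * len(delta))))
    order = np.argsort(-delta, kind="stable")  # highest confidence first
    return order[:n_keep]
```

In the full procedure, the selected samples and their predicted labels would be appended to the simulated training set before the next iteration of Steps 2 to 6.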

Establishing the GMDH External Criteria
There are two fundamental types of GMDH external criteria: the accuracy criteria and the compatibility criteria. Accuracy criteria focus on the random errors in different parts of the established model, and are also referred to as fitting precision, while compatibility criteria highlight the consistency of the models built for the same system on datasets from different samples [45]. Ivakhnenko et al. [40] established regularization criteria and a theoretical basis for symmetric regularization criteria, and proved that both could be used as the external criteria in GMDH theory. Because of their different application scopes, different external criteria have significant impacts on the classification performance of GMDH models [46]. Details of the 13 types of external criteria are as follows:
SSFS-GMDH1: Symmetric mean square error.
SSFS-GMDH2: Symmetric regularization criteria.
SSFS-GMDH3: Average regularization criteria.
SSFS-GMDH10: Combination criteria (symmetric minimum deviation criteria + minimum square error criteria).
SSFS-GMDH11: Asymmetric regularization criteria, training the model on A and calculating the external criteria on B.
SSFS-GMDH12: Asymmetric stability criteria, training on A and calculating the external criteria on W.
SSFS-GMDH13: Asymmetric minimum error criteria.
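To make the difference between an asymmetric and a symmetric criterion concrete, the sketch below uses the form in which regularity criteria are commonly written in the GMDH literature: the asymmetric version fits on subset A and sums squared errors on subset B, and the symmetric version adds the same quantity with the roles of A and B swapped. This is an assumption for illustration, since the exact formulas used for SSFS-GMDH1 to SSFS-GMDH13 are not reproduced here; the linear model and all function names are hypothetical.

```python
import numpy as np


def _fit(X, y):
    """Least squares fit of a linear model with intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def _sse(coef, X, y):
    """Sum of squared errors of a fitted linear model on (X, y)."""
    design = np.column_stack([np.ones(len(X)), X])
    return float(np.sum((y - design @ coef) ** 2))


def asymmetric_regularity(XA, yA, XB, yB):
    """Train on A, measure the error on B."""
    return _sse(_fit(XA, yA), XB, yB)


def symmetric_regularity(XA, yA, XB, yB):
    """Train on A and test on B, plus train on B and test on A."""
    return (asymmetric_regularity(XA, yA, XB, yB)
            + asymmetric_regularity(XB, yB, XA, yA))
```

By construction the symmetric value does not depend on which subset is called A, which is the property that distinguishes it from the asymmetric variant.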

Data Description
The electricity load series were provided by the Electric Power Company in Sichuan Province, China, and the sample spans January 2013 to June 2017, yielding 1270 daily observations. Four representative cities, namely Mianyang, Nanchong, Yibin, and Panzhihua, were selected from this province. The indicator system, consisting of the weather variables, calendar variables, and load series, is used to forecast the day-ahead electricity load. There are 18 related variables: one calendar variable, six weather variables, and eleven kinds of load series (Table 2).
- Calendar variables: There is one calendar variable that varies across weekdays, weekends, and holidays. Calendar variables are crucial, as electricity loads show daily and weekly periodic variations [47] as well as weekday, weekend, and holiday variations [48].
- Weather variables: There are six weather variables: the maximum temperature, minimum temperature, maximum temperature variable rate, minimum temperature variable rate, wind speed, and weather type. As the electricity load is susceptible to changes in weather variables, it is necessary to understand electricity load volatility under various weather conditions within different timescales [49]. Weather variables have been seen as the main parameters controlling energy demand [50,51].
- Load series: There are eleven kinds of load series, namely the peak load, off-peak load, daily consumption, cumulative consumption, off-peak consumption, load rate, actual peak load, previous day's electricity consumption, daily consumption in the same period of the previous week, daily consumption in the same period of the previous month, and daily consumption in the same period of the previous year.
- Y: y is defined as the peak load and n as the number of categories. In the specification for n, $y_{max}$ and $y_{min}$ denote the maximal peak load and the minimal peak load, respectively, and S is defined as the step length, which is set based on the power of the generators.
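The conversion of a continuous peak load into a class label can be sketched as below. The binning rule is an assumption made for illustration only: the number of categories is taken as n = ceil((y_max − y_min) / S), and a load y is assigned to the interval of width S in which it falls, counted from y_min. The paper's own specification of n may differ, and the function names are hypothetical.

```python
import math


def n_classes(y_min, y_max, step):
    """Assumed number of categories: intervals of width `step` covering [y_min, y_max]."""
    return max(1, math.ceil((y_max - y_min) / step))


def peak_load_class(y, y_min, y_max, step):
    """Map a peak load y to a class label in 1..n (assumed equal-width binning)."""
    n = n_classes(y_min, y_max, step)
    index = int((y - y_min) // step) + 1
    # Clamp so that y == y_max falls in the last interval.
    return min(max(index, 1), n)
```

For example, with y_min = 100, y_max = 400, and S = 100 this rule gives three categories, and a peak load of 250 falls in the second one.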

Empirical Analysis
To analyze the performance of the proposed SSFS-GMDH model under different criteria, the practical datasets from Mianyang, Nanchong, Yibin, and Panzhihua in China are empirically analyzed. The SSFS-GMDH model is compared with the FW-SemiFS (forward semi-supervised feature selection) [52] and GMDH-U (GMDH-based semi-supervised feature selection for customer classification) [46] models.

Experimental Setting
Table 3 shows the parameters used in the experiment. Each dataset is divided into three subsets: 30% of the samples are used as a training set L with class labels, another 30% are used as a dataset U without class labels, and the remaining 40% are used as a testing set T with class labels. The ranges of K and θ are K ∈ [2, 15] and θ ∈ [0.1, 1], respectively. In the SSFS-GMDH model, L is utilized to mark the labels of the samples in U. As this procedure has a significant impact on the performance of the SSFS-GMDH model, it is crucial to choose a proper basic classification model. Therefore, three basic and effective classification models, namely Support Vector Machines [7], Bayesian Networks [53], and Decision Trees [54,55], are employed in this paper. Each experiment is conducted 30 times in MATLAB 2016b.
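The 30%/30%/40% partition can be sketched as follows. The experiments in this paper were run in MATLAB; the Python version below is only an illustration, and it assumes a random shuffle with a fixed seed, which the text does not specify. The function name `split_dataset` and the `seed` argument are hypothetical.

```python
import numpy as np


def split_dataset(n_samples, seed=0):
    """Return index arrays for L (30%), U (30%), and T (remaining 40%)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_L = n_samples * 3 // 10  # integer arithmetic avoids float rounding
    n_U = n_samples * 3 // 10
    return order[:n_L], order[n_L:n_L + n_U], order[n_L + n_U:]
```

Repeating the split with 30 different seeds would correspond to conducting each experiment 30 times, under the same assumption that the partition is redrawn on each run.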

Table 3 (columns: Symbols; Parameters Setting).

Model Evaluation Criteria
The most common criterion for evaluating classification forecasting models is the accuracy on the testing set. Since accuracy is not an appropriate way to evaluate the performance of models dealing with unbalanced class distributions, the ROC (Receiver Operating Characteristic) curve can be used to evaluate the model's classification performance. However, since it is inconvenient to directly compare each model's ROC curve, the AUC (the area under the ROC curve) is usually taken as the model evaluation criterion. The ROC curve and AUC value are both capable of handling the minority and majority classes evenly. Nevertheless, the AUC value can better weigh the minority recognition rate against the majority recognition rate, and the larger the AUC value, the better the model performance [56].
The classification evaluation matrix is then introduced. As shown in Table 4, TP denotes the number of correctly predicted positive samples, FN denotes the number of wrongly predicted negative samples, FP denotes the number of wrongly predicted positive samples, and TN denotes the number of correctly predicted negative samples. For binary classification, the ROC curve plots the true positive rate against the false positive rate: the horizontal axis shows the false positive rate (= FP/(FP + TN) × 100%) and the vertical axis shows the true positive rate (= TP/(TP + FN) × 100%).
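The two rates and the AUC can be computed as in the sketch below. It returns the rates as fractions rather than percentages, and it computes the AUC through the standard rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half) rather than by integrating the curve. The function names are hypothetical, and the binary labels 1 and 0 are an assumed encoding.

```python
def rates(tp, fn, fp, tn):
    """True positive rate and false positive rate from the confusion matrix."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


def auc(scores, labels):
    """AUC for binary labels (1 = positive, 0 = negative) via pairwise ranking."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A mean AUC over repeated runs could then be obtained by averaging the values returned by `auc` across the runs.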

Analysis of the Impacts of the GMDH External Criteria on Classification Performance of the SSFS-GMDH Model
This paper constructs 13 external criteria and then conducts tests on four datasets to determine the best external criteria by exploring the relationship between the external criteria and the classification performance of the model. Figure 3 shows the impacts of the GMDH external criteria on the classification performance of the SSFS-GMDH model on the four datasets, namely the Mianyang, Nanchong, Yibin, and Panzhihua datasets. As shown in Figure 3, the SSFS-GMDH3 model on the Mianyang dataset has the highest classification accuracy with an MAUC (mean AUC) value of 0.91748, followed by the SSFS-GMDH4 model with an MAUC value of 0.91339, and the SSFS-GMDH13 model with an MAUC value of 0.91265. The SSFS-GMDH6 model has the lowest classification accuracy, with an MAUC value of 0.87516. The SSFS-GMDH3 and SSFS-GMDH4 models are accuracy criteria models, whereas the SSFS-GMDH13 and SSFS-GMDH6 models are compatibility criteria models.
The SSFS-GMDH3 model on the Nanchong dataset has the highest classification accuracy with an MAUC value of 0.88320, followed by the SSFS-GMDH4 model with an MAUC value of 0.86295, and the SSFS-GMDH11 model with an MAUC value of 0.85785. The SSFS-GMDH13 model has the lowest classification accuracy, with an MAUC value of 0.61927. The SSFS-GMDH3, SSFS-GMDH4, and SSFS-GMDH11 models are accuracy criteria models, while the SSFS-GMDH13 model is a compatibility criteria model.
On the Yibin dataset, the SSFS-GMDH9 model has the highest classification accuracy with an MAUC value of 0.90309, followed by the SSFS-GMDH12 model with an MAUC value of 0.90071, and the SSFS-GMDH4 model with an MAUC value of 0.89814. The SSFS-GMDH3 model has the lowest classification accuracy, with an MAUC value of 0.82647. The SSFS-GMDH11, SSFS-GMDH12, and SSFS-GMDH4 models are accuracy criteria models.
On the Panzhihua dataset, the SSFS-GMDH9 model has the highest classification accuracy with an MAUC value of 0.70092, followed by the SSFS-GMDH8 model with an MAUC value of 0.69865, and the SSFS-GMDH2 model with an MAUC value of 0.64622. The SSFS-GMDH5 model has the lowest classification accuracy, with an MAUC value of 0.64622. The SSFS-GMDH9 and SSFS-GMDH8 models are compatibility criteria models, and the SSFS-GMDH2 and SSFS-GMDH5 models are accuracy criteria models.
To further examine the impacts of the external criteria on the classification performance of the SSFS-GMDH model, an analysis of variance is conducted, and the results are shown in Table 5.
The SSFS-GMDH3 and SSFS-GMDH4 models perform better than the other models on the Mianyang dataset. In particular, the MAUC value of the SSFS-GMDH3 model is the largest on the Mianyang dataset. The p-value of the significance test between SSFS-GMDH3 and SSFS-GMDH4 is 0.073, which exceeds the significance level of 0.05, so the difference between these two models is not statistically significant. The p-values of the significance tests between SSFS-GMDH3 and the other eleven external criteria are less than 0.05, so these differences are statistically significant. Similarly, the SSFS-GMDH3 model is also superior to the other models on the Nanchong dataset.

Analysis of the Parameter Sensitivity
θ and K are two essential parameters in the SSFS-GMDH model proposed in this paper. Both need to be determined to achieve better performance. In the following, the impact of θ and K on model performance is analyzed.
(1) Impacts of θ on model performance
Suppose that θ = 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. We arbitrarily set K = 5, and the experimental results for the SSFS-GMDH3 model on the four datasets are shown in Figure 4. As can be seen in Figure 4, the SSFS-GMDH3 model's performance on the four datasets gradually reaches a peak and then declines. On the Mianyang dataset, the model performs best when θ reaches 0.9, with an MAUC value of 0.9309. The corresponding MAUC value is 0.9308 when θ equals 0.8, so the small discrepancy can be overlooked. On the Panzhihua and Yibin datasets, model performance is optimal when θ = 0.8. On the Nanchong dataset, parameter θ has little impact on the model performance: when θ reaches 0.6 and 0.8, the MAUC values are 0.9219 and 0.9215, respectively. Therefore, the paper suggests setting θ = 0.8.
Regarding the impact of K, Figure 5 indicates that, with an increase in K, the MAUC value first shows a fluctuating increasing tendency, after which it slowly declines. When K = 13, the best model performances are achieved on the Mianyang, Panzhihua, and Yibin datasets. When K = 10, the model has an optimal performance on the Nanchong dataset, with an MAUC value of 0.9209. When K = 13, the MAUC value is 0.9114. Therefore, the SSFS-GMDH3 model has the best performance only when θ = 0.8 and K = 10.

Comparisons with Other Models
Table 6 shows the MAUC values of the SSFS-GMDH, FW-SemiFS, and GMDH-U models on the four datasets. Symbols ↓, ↑, indicate that the result is significantly worse, better, and similar to that obtained by the SSFS-GMDH3 model, respectively. Symbols ∼, +, ≈ denote that the result is significantly worse, better, and similar to that obtained by the SSFS-GMDH11 model, respectively. On the Mianyang dataset, the MAUC values for SSFS-GMDH3, SSFS-GMDH11, FW-SemiFS, and GMDH-U are 0.9452, 0.9381, 0.9308, and 0.9218, respectively. Therefore, both SSFS-GMDH variants achieve higher MAUC values than the FW-SemiFS and GMDH-U models on this dataset. On the Nanchong and Panzhihua datasets, the performance of the SSFS-GMDH model is superior to that of both the FW-SemiFS and GMDH-U models. Overall, the SSFS-GMDH model performs best among the three models.

Conclusions
This paper investigates a day-ahead electricity peak load classification forecasting problem. It transforms the conventional continuous forecasting problem into a novel interval forecasting problem, and then further converts the interval forecasting problem into a classification forecasting problem. In addition, an indicator system of factors influencing the electricity load is established from three dimensions, namely the load series, calendar data, and weather data. A novel semi-supervised feature selection algorithm based on the group method of data handling technology is proposed to address the electricity load classification forecasting problem. Furthermore, the parameters of the proposed model and the external criteria are analyzed systematically in order to improve the robustness of the proposed model. An empirical test on real-world peak load forecasting cases shows that the proposed method has better classification forecasting performance than two other state-of-the-art methods on four typical datasets, and that the peak classification forecasting problem is solved effectively. The time interval considered in this paper is one day; investigating different time intervals according to practical scheduling tasks, ranging from one hour to one week, and comparing the differences among them are important directions for future work. More methods for solving short-term load classification forecasting problems also need to be developed.

Figure 1. An illustration of the modeling process for the GMDH.

2.2. Basic Modeling Idea
This paper proposes a GMDH-based semi-supervised feature selection (SSFS-GMDH) model to deal with the electricity load classification forecasting problem. In this model, both the labeled and unlabeled samples are used for feature selection. Suppose L is the original labeled training set for the electricity peak load forecasting problem, T is the labeled testing set, and U is the unlabeled dataset. L is first divided into a simulated training set $L_{train}$ and a simulated validation set $L_{verify}$. The flowchart of the proposed method is shown in Figure 2. The proposed model contains three major stages. (1) The labeled dataset L is used to train N basic classification models. (2) The unlabeled samples in the dataset U are labeled using the basic classification models, and a certain proportion of the marked samples, $U_\alpha$, are chosen from $U_l$ and merged into L. These two stages are repeated until the proportion of selected samples exceeds θ. (3) The final classification model is trained on the final training set L, and the feature subset $F_s$ is obtained.

Figure 3. Impacts of GMDH external criteria on the classification performance of the SSFS-GMDH model on the four datasets.

(2) Impacts of K on model performance
The experimental results for the SSFS-GMDH3 model on the four datasets are shown in Figure 5, with θ = 0.8 and K ∈ [2, 15].

Figure 5. The performance of the SSFS-GMDH3 model with different K values.

Table 1. Interpretation for the symbols.

Table 5. Analysis of variance.