Forecasting Monthly Electricity Demands by Wavelet Neuro-Fuzzy System Optimized by Heuristic Algorithms

.


Introduction
Electric energy plays a fundamental role in business operations all over the world. Industries, homes, and services worldwide depend on efficient, reliable, and accessible electricity. Therefore, as electricity has a deep impact on economic activities, the management of electricity and electricity sources must be carefully implemented to guarantee the efficient use of electricity. The key to this is high-quality capacity planning, scheduling, and operations of the electric power systems. Unlike other commodities, electricity cannot be stored. The production, transmission, distribution, and consumption of the electricity are performed simultaneously. As a result, the available capacity generated within the system should meet the system requirements. An accurate knowledge of future electricity demands is necessary for solid capacity planning and scheduling. This leads to a need for reliable electricity demand forecasting to guarantee that electricity generation is sufficient for demand. However, it is difficult to accurately forecast electricity demands because the demand series often contains unpredictable trends, high noise levels, and exogenous variables. Although demand forecasting is Information 2018, 9, 51 5 of 26 to make an accurate forecast is by utilizing a combination of different techniques [33]. However, the published literature only provides forecasting models based on the neuro-fuzzy system or the wavelet neuro-fuzzy system without the combination of heuristic algorithms. Considering recent literature, there is still room for improving the neuro-fuzzy systems for the problem of electricity demand forecasting. In this research we will propose an approach to improve the neuro-fuzzy system by the incorporation of the recent heuristic algorithm and wavelet transform for forecasting electricity demand. This approach will make use of the ability of all these techniques and this development has not been previously reported. We believe this proposed novel approach not only can be used for short-term, mid-term, and long-term electricity demand forecasting, regionally or nationally, but also has the potential to be applied in other applications.

Fuzzy Inference System
The fuzzy inference system (FIS) is a process of mapping from a given input to an output using the theory of fuzzy sets. Figure 1 shows an example of the fuzzy inference system. In general, the main steps performed in the FIS are as follows: • The fuzzification interface transforms each crisp input variable into a membership grade based on the membership functions defined. • The inference engine then conducts the fuzzy reasoning process by applying the appropriate fuzzy operators in order to obtain the fuzzy set to be accumulated in the output variable. • The defuzzification interface transforms the fuzzy output into a crisp output by applying a specific defuzzification method Information 2018, 9, x 5 of 26 room for improving the neuro-fuzzy systems for the problem of electricity demand forecasting. In this research we will propose an approach to improve the neuro-fuzzy system by the incorporation of the recent heuristic algorithm and wavelet transform for forecasting electricity demand. This approach will make use of the ability of all these techniques and this development has not been previously reported. We believe this proposed novel approach not only can be used for short-term, mid-term, and long-term electricity demand forecasting, regionally or nationally, but also has the potential to be applied in other applications.

Fuzzy Inference System
The fuzzy inference system (FIS) is a process of mapping from a given input to an output using the theory of fuzzy sets. Figure 1 shows an example of the fuzzy inference system. In general, the main steps performed in the FIS are as follows: • The fuzzification interface transforms each crisp input variable into a membership grade based on the membership functions defined.

•
The inference engine then conducts the fuzzy reasoning process by applying the appropriate fuzzy operators in order to obtain the fuzzy set to be accumulated in the output variable.

•
The defuzzification interface transforms the fuzzy output into a crisp output by applying a specific defuzzification method The two types of fuzzy inferences most commonly used are the Mamdani method and the Sugeno method [35]. The difference between these two methods is the specification of the consequent part. In the Mamdani method [36], consequents are fuzzy sets, and the final crisp output of the Mamdani method is based on defuzzification of the overall fuzzy output using various types of defuzzification methods. Whereas, in the Sugeno method [37], consequents are real numbers, which can be either linear or constant. The final output (known as a singleton output membership function) is the weighted average of each rule's output.
The key success of and interest in FIS are that it offers the capability to deliver the process of turning data into knowledge that can be understood by people. The representation of rules to express the behavior of the system is constructed from human expert knowledge [38]. FISs have been widely used to solve different classification problems. However, the reliance on experts to form the fuzzy The two types of fuzzy inferences most commonly used are the Mamdani method and the Sugeno method [35]. The difference between these two methods is the specification of the consequent part. In the Mamdani method [36], consequents are fuzzy sets, and the final crisp output of the Mamdani method is based on defuzzification of the overall fuzzy output using various types of defuzzification methods. Whereas, in the Sugeno method [37], consequents are real numbers, which can be either linear or constant. The final output (known as a singleton output membership function) is the weighted average of each rule's output.
The key success of and interest in FIS are that it offers the capability to deliver the process of turning data into knowledge that can be understood by people. The representation of rules to express the behavior of the system is constructed from human expert knowledge [38]. FISs have been widely used to solve different classification problems. However, the reliance on experts to form the fuzzy rules (to define the membership functions, fuzzy operators, and the knowledge base) and lack of learning capabilities, are some of the limitations of FISs [39].
There are three types of Fuzzy inference systems. They are: The Mamdani fuzzy inference system was first proposed by Mamdani [40] as an attempt to control a steam engine and boiler combination from a set of linguistic control rules obtained from experienced human operators. In this model both input and output membership functions are in the form of linguistic variables. To implement this type of fuzzy model one must go through the following six steps.
Step 1: Setting the fuzzy rules: Fuzzy rules are the conditional statements that define the relationship between the input membership functions and output membership functions. For example, if input 1 is low and input 2 is high then output is high. Here the values of low, medium and high to the inputs are called linguistic variables or the membership functions. Expert Knowledge is used for this purpose Step 2: Fuzzification: It is the process of converting crisp data into fuzzy data. The input data is classified into input membership functions which can be linguistic variables such as low, medium, etc. This is usually done on the basis of expert human knowledge.
Step 3: Combining the fuzzified inputs according to the fuzzy rules to establish rule strength.
Step 4: Finding the consequence of the rules by combining the rule strength and the output membership function.
Step 5: The outputs of all the fuzzy rules are calculated and combined to get an output distribution.
Step 6: Defuzzification: Usually a crisp output is required in most of the applications. Defuzzification is the process of converting fuzzy data (Output distribution) to crisp data (single value). There are many methods which can be used for this purpose. Some of the commonly used methods are Center of Mass, Mean of the Maximum etc.
Sugeno fuzzy model also known as TSK (Takagi Sugeno Kang) fuzzy model was proposed by Takagi, Sugeno and Kang in an effort to develop a systematic approach to generate fuzzy rules from a given input and output data set. In this model the input membership functions can be linguistic variables, but the output must be linear or constant. In this model the fuzzy rules are of the form.
If x is A and y is B then z = f (x,y), where A and B are the fuzzy input membership functions and z = f (x,y) is the crisp output. Usually the output is a polynomial expression of x and y. The order of the polynomial defines the order of the model. Since the output of each rule is a crisp value, the overall output is calculated as the weighted average of all the rules. Furthermore, because the output is a crisp value there is no need of defuzzification in this model, and thus reducing the complexity when compared to Mamdani model.
If z = k (constant), then it is known as zero order Sugeno model. If z = px + qy + r, then it is a first order polynomial of inputs and it is called as first order Sugeno model.
Tsukamoto Fuzzy Model is proposed by Tsukamoto [41]. The consequent of each fuzzy if-then rule is represented by a fuzzy set with a monotonical membership function. As a result, the inferred output of each rule is defined as a crisp value induced by the rule's firing strength. The overall output is taken as the weighted average of each rule's output. Since each rule infers a crisp output, the Tsukomoto fuzzy model aggregates each rule's output by the method of weighted average and Information 2018, 9, 51 7 of 26 thus avoids the time-consuming process of defuzzification. The Tsukamoto fuzzy model is not often used since it is not as transparent as either the Mamdani or Sugeno fuzzy model [42].

Adaptive Neuro-Fuzzy Inference System (ANFIS)
Neural networks and fuzzy set theory, which are both computational intelligence techniques, are tools for establishing intelligent systems. A FIS employs fuzzy if-then rules when acquiring knowledge from human experts to deal with imprecise and vague problems. As mentioned, fuzzy systems cannot learn from or adjust themselves. On the other hand, a neural network has the capacity to learn from its environment, self-organize, and adapt in an interactive way. For these reasons, a neuro-fuzzy system, which is the combination of a fuzzy inference system and a neuron network, has been introduced to produce a complete fuzzy rule-based system [43,44]. The advantages of neural networks and fuzzy systems can be integrated into a neuro-fuzzy approach. Fundamentally, a neuro-fuzzy system is a fuzzy network that has its function as a fuzzy inference system. The system can overcome some limitations of neural networks, as well as the limits of fuzzy systems [45,46], when it has the capacity to represent knowledge in an interpretable manner and the ability to learn. Most neuro-fuzzy system research follows a structure similar to that proposed by Takagi and Hayashi [47]. There are several ways to combine neural networks and fuzzy systems. In general, the neuro-fuzzy system can be classified into three types, dependent on how the combinations between the neural network system and fuzzy system are performed. These three types are as follows [48]: • Cooperative Neuro-Fuzzy System: The neural network is used at the initial phase to determine the fuzzy set and/or fuzzy rules, and then the fuzzy system is fully utilized for execution. • Concurrent Neuro-Fuzzy System: Neural networks are used to provide input for a fuzzy system, or to change the output of the fuzzy system. In this case, the parameters of the fuzzy system are not changed by the learning process. • Hybrid Neuro-Fuzzy System: A fuzzy system uses a learning algorithm inspired by the neural networks to determine its parameters through pattern processing.
In this study, the third type, the hybrid neuro-fuzzy system will be applied to develop prediction models. Among the third type of neuro-fuzzy systems, the ANFIS, introduced by Jang [43], has been one of the most common tools. In the FIS, the fuzzy if-then rules are determined by experts, whereas in the ANFIS, the model itself automatically produces adequate rules with respect to input and output data and facilitates the learning capabilities of neural networks.
ANFIS is a multilayer feed-forward neural network, which employs neural network learning algorithms and fuzzy reasoning to map from input space to output space. The architecture of ANFIS includes five layers, namely the fuzzification layer, the rule layer, the normalization layer, the defuzzification layer, and a single summation node. To present the ANFIS architecture and simplify the explanations, we assume that the FIS has two inputs (x 1 and x 2 ), two rules, and one output (y) as shown in Figure 2. Each node within the same layer performs the same function. The circles are used to indicate fixed nodes, while the squares are used to denote adaptive nodes. A FIS that has two inputs and two fuzzy if-then rules [49] can be expressed as: Rule 1: If x 1 is A 1 and x 2 is B 1 then y 1 = p 1 x 1 + q 1 x 2 + r 1 , Rule 2: If x 1 is A 2 and x 2 is B 2 then y 2 = p 2 x 1 + q 2 x 2 + r 2 , where x 1 and x 2 are the inputs; A 1 , A 2 , B 1 , B 2 are the linguistic labels; p i , q i , and r i , (i = 1 or 2) are the consequent parameters [43] that are identified in the training process; and y 1 and y 2 are the outputs within the fuzzy region. Equation (1) represents the first type of fuzzy if-then rules, in which the output part is linear. The output part can also be a constant [37], and is represented as Rule 1: If x 1 is A 1 and x 2 is B 1 then y 1 = C 1 , Rule 2: If x 1 is A 2 and x 2 is B 2 then y 2 = C 2 , where C i (i = 1 or 2) are constant values. Equation (2) represents the second type of fuzzy if-then rules. For complicated problems, the first type of if-then rules is widely utilized to model the relationships of inputs and outputs [50]. In this research, a linear function will be used for the output. A brief description of each layer's function is as follows: Layer 1-fuzzification layer: Every node in this layer is a square node. The nodes produce the membership values. Outputs obtained from these nodes are calculated as follows: where O1,i denotes the output of node i in layer 1, and µAi(x1) and µBi−2(x2) are the fuzzy membership functions of Ai and Bi−2. The fuzzy membership functions can be in any form, such as triangular, trapezoidal, or Gaussian functions. Layer 2-rule layer: Every node in this layer is a circular node. The output is the product of all incoming inputs.
where O2,i denotes the output of node i in layer 2, and wi represents the firing strength of a rule. Layer 3-normalization: Every node in this layer is a circular node. Outputs of this layer are called normalized firing strengths. The i-th node is calculated by the i-th node firing strength to the sum of all rules' firing strengths.
where O3,i denotes the output of node i in layer 3, and i w is the normalized firing strength.
Layer 4-defuzzification layer: Every node in this layer is an adaptive node with a node function.
where O4,i denotes the output of node i in layer 4, i w is the output of layer 3, and {pi, qi, ri} (in Equation (6)) is the parameter set. Parameters in this layer are consequent parameters of the Sugeno fuzzy model. Layer 5-a single summation node: The node is a fixed node. This node computes the overall output by incorporating all the incoming signals from the previous layer: A brief description of each layer's function is as follows: Layer 1-fuzzification layer: Every node in this layer is a square node. The nodes produce the membership values. Outputs obtained from these nodes are calculated as follows: where O 1,i denotes the output of node i in layer 1, and µ Ai (x 1 ) and µ Bi−2 (x 2 ) are the fuzzy membership functions of A i and B i−2 . The fuzzy membership functions can be in any form, such as triangular, trapezoidal, or Gaussian functions. Layer 2-rule layer: Every node in this layer is a circular node. The output is the product of all incoming inputs.
where O 2,i denotes the output of node i in layer 2, and w i represents the firing strength of a rule. Layer 3-normalization: Every node in this layer is a circular node. Outputs of this layer are called normalized firing strengths. The i-th node is calculated by the i-th node firing strength to the sum of all rules' firing strengths.
where O 3,i denotes the output of node i in layer 3, and w i is the normalized firing strength. Layer 4-defuzzification layer: Every node in this layer is an adaptive node with a node function.
where O 4,i denotes the output of node i in layer 4, w i is the output of layer 3, and {p i , q i , r i } (in Equation (6)) is the parameter set. Parameters in this layer are consequent parameters of the Sugeno fuzzy model.
Layer 5-a single summation node: The node is a fixed node. This node computes the overall output by incorporating all the incoming signals from the previous layer: where O 5,i denotes the output of node i in layer 5. The results are then defuzzified using a weighted average procedure.
The ANFIS architecture has two adaptive layers: layer 1 and layer 4. Layer 1 has parameters related to the fuzzy membership functions and layer 4 has parameters {p i , q i , r i } related to the polynomial. The aim of the hybrid learning algorithm in the ANFIS architecture is to adjust all of these parameters in order to make the output match the training data. Adjusting the parameters includes two steps. In the forward pass of the learning algorithm, the premise parameters are fixed, functional signals go forward till layer 4, and the consequent parameters are identified by the least squares method to minimize the measured error. In the backward pass, the consequent parameters are fixed, the error signals go backward, and the premise parameters are updated by the gradient descent method [42]. This hybrid learning algorithm is able to decrease the complexity of the algorithm and increase learning efficiency [51]. The hybrid learning algorithm will be utilized in this study due to this advantage.
There are different types of membership functions that can be used in layer 1. Some of the commonly used membership functions are triangular membership function, trapezoidal membership function, bell shaped membership function, and Gaussian membership function.
According to Güneri [52], an excessive number of inputs in the ANFIS structure makes the system complicated and limits its applicability (also known as the curse-of-dimensionality problem). In addition, many studies have pointed out that ANFIS gives better solutions with a simple structure. To deal with this issue, several low-dimensional rule bases should be arranged in a hierarchical structure [53]. For modeling a hierarchical ANFIS (HANFIS), it is necessary to identify a suitable hierarchical structure, the number of inputs for each sub-ANFIS model, and a rule base for each sub-ANFIS model. Suppose that there are five inputs. The two-layer HANFIS model will be employed is shown as in Figure 3.
where O5,i denotes the output of node i in layer 5. The results are then defuzzified using a weighted average procedure.
The ANFIS architecture has two adaptive layers: layer 1 and layer 4. Layer 1 has parameters related to the fuzzy membership functions and layer 4 has parameters {pi, qi, ri} related to the polynomial. The aim of the hybrid learning algorithm in the ANFIS architecture is to adjust all of these parameters in order to make the output match the training data. Adjusting the parameters includes two steps. In the forward pass of the learning algorithm, the premise parameters are fixed, functional signals go forward till layer 4, and the consequent parameters are identified by the least squares method to minimize the measured error. In the backward pass, the consequent parameters are fixed, the error signals go backward, and the premise parameters are updated by the gradient descent method [42]. This hybrid learning algorithm is able to decrease the complexity of the algorithm and increase learning efficiency [51]. The hybrid learning algorithm will be utilized in this study due to this advantage.
There are different types of membership functions that can be used in layer 1. Some of the commonly used membership functions are triangular membership function, trapezoidal membership function, bell shaped membership function, and Gaussian membership function.
According to Güneri [52], an excessive number of inputs in the ANFIS structure makes the system complicated and limits its applicability (also known as the curse-of-dimensionality problem). In addition, many studies have pointed out that ANFIS gives better solutions with a simple structure. To deal with this issue, several low-dimensional rule bases should be arranged in a hierarchical structure [53]. For modeling a hierarchical ANFIS (HANFIS), it is necessary to identify a suitable hierarchical structure, the number of inputs for each sub-ANFIS model, and a rule base for each sub-ANFIS model. Suppose that there are five inputs. The two-layer HANFIS model will be employed is shown as in Figure 3.

Rule Selection for Adaptive Neuro-Fuzzy Inference System (ANFIS)
In a conventional FIS, the number of rules is commonly decided by experts who are familiar with the target system to be modeled. In ANFIS simulation, however, no expert is available, and the number of membership functions (MFs) assigned to each input variable is chosen empirically by plotting the data sets and examining them visually, or simply by trial and error. For datasets with more than two inputs, visualization techniques are not very effective, and therefore we often have to rely on trial and error. This situation is similar to that of neural networks; there is just no simple way to determine the minimal number of hidden units needed in advance to achieve a desired performance level.
When identifying the rule base for ANFIS, we are under the circumstance of: (1) there are no standard methods for transforming human knowledge or experience into the rule base; and (2) it is

Rule Selection for Adaptive Neuro-Fuzzy Inference System (ANFIS)
In a conventional FIS, the number of rules is commonly decided by experts who are familiar with the target system to be modeled. In ANFIS simulation, however, no expert is available, and the number of membership functions (MFs) assigned to each input variable is chosen empirically by plotting the data sets and examining them visually, or simply by trial and error. For datasets with more than two inputs, visualization techniques are not very effective, and therefore we often have to rely on trial and error. This situation is similar to that of neural networks; there is just no simple way to determine the minimal number of hidden units needed in advance to achieve a desired performance level.
When identifying the rule base for ANFIS, we are under the circumstance of: (1) there are no standard methods for transforming human knowledge or experience into the rule base; and (2) it is necessary to tune the membership functions to maximize the performance and minimize the errors [43]. There are several techniques for determining the numbers of MFs and rules. The most common methods are grid partitioning and scatter partitioning.
The grid partition method divides the data space into rectangular sub-spaces using an axis-paralleled partition based on a pre-defined number of membership functions and the type of each dimension. Premise fuzzy sets and parameters are calculated using the least square estimate method based on the partition and membership function. When constructing the fuzzy rules, consequent parameters in the linear output membership function are set at zero. Therefore, it is necessary to identify and refine the parameters by the use of ANFIS. The combination of grid partition and ANFIS can be found in the literature [54]. The application of grid partition in FIS is limited by the curse-of-dimensionality, since the number of fuzzy rules increases exponentially when the number of input variables increases. If there are m membership functions for every input variable and a total of n input variables, the number of fuzzy rules is mn. Hence, this method is only suitable for simple problems with a small number of input variables.
In order to eliminate the problems associated with grid-partitioning, other ways of dividing the input space into rule patches have been proposed. This approach allows the IF-parts of the fuzzy rules to be positioned at arbitrary locations in input space. If the rules are represented by n-dimensional Gaussians or normalized Gaussians, the centers of the Gaussians are not anymore confined to corners of a rectangular grid. Rather, they can be chosen freely.
The scatter partitioning method includes the fuzzy-C means (FCM) clustering method and the subtractive clustering method. FCM, also known as fuzzy ISODATA (Iterative Self-Organizing Data Analysis Technique Algorithm) [55], partitions a collection of n vectors into C fuzzy groups and finds a cluster center in each group so that a cost function of dissimilarity measure is minimized. FCM also employs fuzzy partitioning so that a given data point can belong to several groups, with a degree of belongingness specified by membership grades between 0 and 1. Here, the number of cluster centers represents the number of rules and the researchers can determine the number of rules.
When there is no clear idea about how many clusters there should be for a given set of data, subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for a set of data [50,56]. Subtractive clustering method was proposed by Chiu [57] by extending the mountain clustering method [58]. It clusters data points by measuring their potential in the feature space. Subtractive clustering method assumes that each data point is a potential cluster center and calculates the potential for each data point based on the density of surrounding data points. The data point with the highest potential is chosen as the first cluster center, and the potential of data points near the first cluster center (within the neighboring radius) are removed.
In a set of n data points, (x 1 , x 2 , ..., x n ), the potential of x i to be a cluster center is calculated as follows: where D i is the potential value of the i-th data point and r a denotes the neighboring radius. If a point has more points within its neighboring radius, it will have a higher potential value.
Assume that x c1 has been selected as the first cluster center and D c1 is its potential value. The potential value of each data point x i is then obtained by the following equation: where D i is the reduced potential value, and γ is the squash factor. This step is repeated till a sufficient number of cluster centers are produced. The influential radius is important for determining the number of clusters. A smaller radius results in many smaller clusters in the data space, which leads to more rules, and vice versa. After clustering the data space, the number of fuzzy rules and fuzzy membership functions can be determined. The linear squares estimation method is then utilized to calculate the consequence parts in the output membership functions, which leads to an initial FIS. In this research, all three types of the FIS methods (mentioned in Section 3.2) will be used and compared for data prediction, and FIS will be generated by using the subtractive clustering method.

Wavelet Transform
The electricity demand can be assumed to be a linear combination of some components. Therefore, wavelet transform will be used to analyze, decompose, and synthesize the electricity demand. Wavelets are mathematical functions that break data into different frequency components, and then each frequency component is inspected with a specific resolution suitable for its scale. Wavelet transform is a recently developed mathematical tool for signal analysis. It has been applied successfully in astronomy, data compression, signal and image processing, earthquake prediction, and so on [59,60]. The fundamental idea in wavelet analysis is to select a suitable wavelet (mother wavelet), and then perform an analysis using its translated and dilated versions. There are several kinds of wavelets that can be used as a mother wavelet, such as the Harr wavelet, Meyer wavelet, Coiflet wavelet, Daubechies wavelet, and Morlet wavelet. Each wavelet has specific characteristics. Similar to the Fourier transform, there are different wavelet transforms, called continuous wavelet transform (CWT), also known as an integral wavelet transform (IWT), discrete wavelet transform (DWT), and fast discrete wavelet transform (FWT). The fast wavelet transform is known as multi-resolution analysis (MRA) or a Mallat pyramidal algorithm. For a given square integrable (or finite electricity) function (or signal) f (t), its continuous wavelet transform is as follows: where ψ(t) is the mother wavelet. The value of wavelet transform (Wf )(a,b) is called the wavelet coefficient, which stands for the similar degree between the signal and the wavelet at the translation b and the dilation a. Translation means the time shift, and dilation means the time scale. In other words, it means how many components of the wavelet at dilation are included in the original signal at translation b. Through a simple mathematical map, Equation (5)  , can also be reconstructed by these approximations and details. Figure 4 presents the decomposing and reconstructing processes. Both the Fourier transforms and wavelet transforms are domain transform function. One of the disadvantages of Fourier transforms is that frequency information can only be extracted for the complete duration of a signal f (t). If, at some point in the lifetime of f (t), there is a local oscillation representing a particular feature (this will also contribute to the calculated Fourier transform f (w)), its location on the time axis will be lost. There is no way of judging whether the value of f (w) at a particular w is derived from a frequency throughout the life of f (t), or during just one or a few selected periods. Another disadvantage is the phase shift. These disadvantages can be overcome by wavelet transform. For the windowed Fourier transform, the window width is always fixed, and the window is a square shape. An important advantage of wavelet transform is that the window widths of wavelet transform can be adjusted automatically. At low frequency, the window widths are longer, while at high frequency, the window widths are shorter. The shorter the window width the better is the resolution. This means that wavelet transform can provide better time resolution for high frequency components of a signal and better frequency resolution for low frequency components of the signal. These are the reasons why wavelet transform will be applied in this research.
Because of low computation cost and simple algorithm, the Haar wavelet is utilized in this study. A time series can be considered as multiple resolutions when using wavelets. Each resolution denotes a specific frequency. In the Haar wavelet algorithm, the time series have a size of a power of two values (e.g., 32, 64, ...). Each step of wavelet transform produces two sets of values: a set of averages and a set of differences (the wavelet coefficients). This process is completed when there are only one average and one coefficient.

Heuristic Algorithms
In this section, the recent heuristic algorithms including GSA, COA, and CS used in the training phase are described.

Gravitational Search Algorithm
The GSA, proposed by Rashedi [16], is based on the physical law of gravity and the law of motion. In the universe, every particle attracts every other particle with a gravitational force that is directly proportional to the product of their masses, and is inversely proportional to the square of the distance between them. The GSA can be considered as a system of agents, called masses, that obey the Newtonian laws of gravitation and masses. All masses attract each other by the gravity forces between them. A heavier mass has a bigger force.
Consider a system with N masses in which the position of the ith mass is defined as follows: where x d i is the position of the i-th agent in the d-th dimension and n presents the dimension of search space. At a specific time, t, the force acting on mass i from mass j is defined as follows: These disadvantages can be overcome by wavelet transform. For the windowed Fourier transform, the window width is always fixed, and the window is a square shape. An important advantage of wavelet transform is that the window widths of wavelet transform can be adjusted automatically. At low frequency, the window widths are longer, while at high frequency, the window widths are shorter. The shorter the window width the better is the resolution. This means that wavelet transform can provide better time resolution for high frequency components of a signal and better frequency resolution for low frequency components of the signal. These are the reasons why wavelet transform will be applied in this research.
Because of low computation cost and simple algorithm, the Haar wavelet is utilized in this study. A time series can be considered as multiple resolutions when using wavelets. Each resolution denotes a specific frequency. In the Haar wavelet algorithm, the time series have a size of a power of two values (e.g., 32, 64, ...). Each step of wavelet transform produces two sets of values: a set of averages and a set of differences (the wavelet coefficients). This process is completed when there are only one average and one coefficient.

Heuristic Algorithms
In this section, the recent heuristic algorithms including GSA, COA, and CS used in the training phase are described.

Gravitational Search Algorithm
The GSA, proposed by Rashedi [16], is based on the physical law of gravity and the law of motion. In the universe, every particle attracts every other particle with a gravitational force that is directly proportional to the product of their masses, and is inversely proportional to the square of the distance between them. The GSA can be considered as a system of agents, called masses, that obey the Newtonian laws of gravitation and masses. All masses attract each other by the gravity forces between them. A heavier mass has a bigger force.
Consider a system with N masses in which the position of the ith mass is defined as follows: where x d i is the position of the i-th agent in the d-th dimension and n presents the dimension of search space. At a specific time, t, the force acting on mass i from mass j is defined as follows: where M aj denotes the active gravitational mass of agent j; M pi is the passive gravitational mass of agent i; G(t) represents the gravitational constant at time t; ε is a small constant; and R ij (t) is the Euclidian distance between agents i and j.
The total force acting on agent i in dimension d is as follows: where rand j is a random number in [0,1]. According to the law of motion, the acceleration of agent i at time t in the d-th dimension, a t i (t), is calculated as follows: where M ii (t) is the mass of object i. The next velocity of an agent is a fraction of its current velocity added to its acceleration. Therefore, the next position and the next velocity can be calculated as: The gravitational constant, G, is generated at the beginning, and is reduced with time to control the search accuracy. It is a function of the initial value (G o ) and time (t): Gravitational and inertia masses are calculated by the fitness evaluation. A heavier mass means a more efficient agent. This means that better agents have higher attractions and move more slowly. The gravitational and inertial masses are updated by the following equations: where fit i (t) denotes the fitness value of agent i at time t, and worst(t) and best(t) represents the weakest and strongest agents in the population, respectively. For a minimization problem, worst(t) and best(t) are as follows: worst(t) = max j∈{1,...,N} f it j (t).
For a maximization problem, The pseudo code of the GSA is given in Figure 5.

Cuckoo Optimization Algorithm
Rajabioun [22] developed another algorithm based on the cuckoo's lifestyle, named the Cuckoo Optimization Algorithm. The lifestyle of the cuckoo species and their characteristics were also the basic motivations for the development of this evolutionary optimization algorithm. The cuckoo groups are formed in different areas that are called societies. Each society has its habitat region to live in. The cuckoo population in each society consists of two types: mature cuckoos and eggs. The effort to survive among cuckoos constitutes the basis of COA. During the survival competition, some of the cuckoos or their eggs are detected and killed. Then, the survived cuckoo societies try to immigrate to a better environment and start reproducing and laying eggs. Cuckoos' survival effort hopefully may converge to a place in which there is only one cuckoo society, all having the same survival rates. Therefore, the place in which more eggs survive is the term that COA is going to optimize. The fast convergence and global optima achievement of this algorithm have been proven through some benchmark problems. The pseudo code of the COA is presented in Figure 6. In COA, cuckoos lay eggs within a maximum distance from their habitats. This range is called the Egg Laying Radius (ELR). In the algorithm, ELR is defined as:

Cuckoo Optimization Algorithm
Rajabioun [22] developed another algorithm based on the cuckoo's lifestyle, named the Cuckoo Optimization Algorithm. The lifestyle of the cuckoo species and their characteristics were also the basic motivations for the development of this evolutionary optimization algorithm. The cuckoo groups are formed in different areas that are called societies. Each society has its habitat region to live in. The cuckoo population in each society consists of two types: mature cuckoos and eggs. The effort to survive among cuckoos constitutes the basis of COA. During the survival competition, some of the cuckoos or their eggs are detected and killed. Then, the survived cuckoo societies try to immigrate to a better environment and start reproducing and laying eggs. Cuckoos' survival effort hopefully may converge to a place in which there is only one cuckoo society, all having the same survival rates. Therefore, the place in which more eggs survive is the term that COA is going to optimize. The fast convergence and global optima achievement of this algorithm have been proven through some benchmark problems. The pseudo code of the COA is presented in Figure 6.

Cuckoo Optimization Algorithm
Rajabioun [22] developed another algorithm based on the cuckoo's lifestyle, named the Cuckoo Optimization Algorithm. The lifestyle of the cuckoo species and their characteristics were also the basic motivations for the development of this evolutionary optimization algorithm. The cuckoo groups are formed in different areas that are called societies. Each society has its habitat region to live in. The cuckoo population in each society consists of two types: mature cuckoos and eggs. The effort to survive among cuckoos constitutes the basis of COA. During the survival competition, some of the cuckoos or their eggs are detected and killed. Then, the survived cuckoo societies try to immigrate to a better environment and start reproducing and laying eggs. Cuckoos' survival effort hopefully may converge to a place in which there is only one cuckoo society, all having the same survival rates. Therefore, the place in which more eggs survive is the term that COA is going to optimize. The fast convergence and global optima achievement of this algorithm have been proven through some benchmark problems. The pseudo code of the COA is presented in Figure 6. In COA, cuckoos lay eggs within a maximum distance from their habitats. This range is called the Egg Laying Radius (ELR). In the algorithm, ELR is defined as: In COA, cuckoos lay eggs within a maximum distance from their habitats. This range is called the Egg Laying Radius (ELR). In the algorithm, ELR is defined as: Number o f current cuckoo s eggs Total number o f eggs × (var hi − var low ) (24) where α is an integer used to handle the maximum value of ELR, and var hi and var low are the upper limit and lower limit. The society with the best profit value (the highest number of survival eggs) is then selected as the goal point (best habitat) for other cuckoos to immigrate to. In order to recognize which cuckoo belongs to which group, cuckoos are grouped by the K-means clustering method. When moving toward the goal point, each cuckoo only flies λ% of the maximum distance and has a deviation of φ radians. The parameters for each cuckoo are defined as follows: where ω is a parameter to constrain the deviation from goal habitat. A ω of π/6 is supposed to be enough for good convergence [22].

Cuckoo Search Algorithm
Cuckoo search (CS) is an optimization algorithm introduced by Yang and Deb [18]. This algorithm was inspired by the special lifestyle of the cuckoo species. The cuckoo birds lay their eggs in the nests of host birds, and they may remove the eggs of the host birds. In the process, some of these eggs, which look similar to the host bird's eggs, have the opportunity to grow up and become adult cuckoos. In other cases, the eggs are discovered by host birds, and the host birds will throw them away or leave their nests and find other places to build new ones. Each egg in a nest stands for a solution, and a cuckoo egg stands for a new solution. The CS uses new and potentially better solutions to replace not-so-good solutions in the nests. The CS is based on the following rules: each cuckoo lays one egg at a time, and dumps this egg in a randomly chosen nest; the best nests with high-quality eggs (solutions) will carry over to the next generation; the number of available host nests is fixed, and a host bird can detect an alien egg with a probability of p a ∈ [0,1]. In this case, the host bird may either throw the egg away, or abandon the nest to build a new one in a new place. The last assumption can be estimated by the fraction p a of the n nests being replaced by new nests (with new random solutions). For a maximization problem, the quality or fitness of a solution can be proportional to the objective function. Other forms of fitness can be defined in a similar way to the fitness function in genetic algorithms [18]. Based on the above-mentioned rules, the steps of the CS can be described as the pseudo code in Figure 7. The algorithm can be extended when each nest has multiple eggs representing a set of solutions.
When generating new solutions, x (t+1) , for the ith cuckoo at iteration (t + 1), the following Lévy flight is performed: j and x (t) k are two solutions selected by random permutation. ε is a random number derived from a uniform distribution. H(u) is a Heaviside function and s is the step size. Additionally, the global random walk is achieved by the use of Lévy flight.
where L(s, λ) = λΓ(λ) sin (πλ/2) π 1 s 1+λ (s >> s 0 > 0) (27) where α > 0 is the step size, which depends on the scale of the problem. Lévy flights provide a random walk. x where α > 0 is the step size, which depends on the scale of the problem. The product ⊕ denotes entry-wise multiplications. Lévy flights provide a random walk. The Lévy flight is a probability distribution which has an infinite variance with an infinite mean. It is represented by: There are several ways to achieve random numbers in Lévy flights; however, a Mantegna algorithm is the most efficient [18]. In this work, the Mantegna algorithm was utilized to calculate this step length. There are several ways to achieve random numbers in Lévy flights; however, a Mantegna algorithm is the most efficient [18]. In this work, the Mantegna algorithm was utilized to calculate this step length.

Research Design
The following subjects have been considered in developing forecasting models.

Methodology
This study presents a novel approach that combines the Haar wavelet transform and the heuristic algorithms into the neuro-fuzzy system. Figure 8 presents the methodology proposed in the study.

Research Design
The following subjects have been considered in developing forecasting models.

Methodology
This study presents a novel approach that combines the Haar wavelet transform and the heuristic algorithms into the neuro-fuzzy system. Figure 8 presents the methodology proposed in the study.

Data Collection
Due to divergent climate characteristics in the northern Vietnam, demand for electricity in Hanoi varies between the summer period and the winter period. The demand increases to its full extent during summer and decreases significantly during the rest of the year. The significant increase in electricity demand and therefore in energy consumed during the summer period is influenced by the need for operating air conditioners to overcome the high temperatures.
The historical data was collected from 2003 to 2013. This data was used to determine a forecasting model for future electricity load. The data used in this study were obtained from the Bureau of Statistics, the National Hydro-meteorological Service, and the Hanoi Power Company.

Methodology
This study presents a novel approach that combines the Haar wavelet transform and the heuristic algorithms into the neuro-fuzzy system. Figure 8 presents the methodology proposed in the study.

Noise Filtering Using Wavelet Transform
The Haar wavelet transform is then used to decompose time series and remove noise. The strength of two coefficient spectra produced by wavelet transform presents the change in the time series at different resolutions. The first coefficient band presents the highest frequency changes. This is the noisiest part of time series. This noise can be eliminated by threshold methods.

Collecting Data
Electricity consumption (MWh) is influenced by several related factors (as shown in Table 1), including month index, average air pressure, average temperature, average wind velocity, rainfall, rainy time, average relative humidity, and daylight time. The historical data regarding these factors was collected from 2003/01 to 2013/12; in other words, there are 132 monthly data samples. This data was used to determine a forecasting model for future electricity demand. The data used in this study were obtained from the Bureau of Statistics, the National Hydro-meteorological Service, and the Hanoi Power Company. The dataset can be requested by contacting the corresponding author by email. Month index X 2 Average air pressure X 3 Average temperature X 4 Average wind velocity X 5 Rainfall X 6 Rainy time X 7 Average relative humidity X 8 Daylight time There are 8 factors used for electricity forecasting. Therefore, an 8-to-1 model is proposed. That means 8 factors are used to forecast the electricity demand. To train the neural networks, the authors use two kinds of vectors. The input vector consists of 8 factors and the output vector includes the electricity demand of the month.

Data Normalization
In order to facilitate the training process, the data needs to be normalized because the electricity load data are in the different ranges. Normalization is a process converting the data points into a small range generally from 1 to −1 or 0 [61]. In this study, the vector normalization is utilized.
where N i is the normalized data and T i is the time-series data, k denotes the number of value in series, and i = 1, ..., k.

Splitting Data
One problem needs to be considered in ANN-based model, namely overfitting. Overfitting problem is that the error in the training set is small but when a new data is presented to the network, the error is high. This means the model performs well on the training data but poorly on the new data. In order to solve the overfitting problem, the available data were divided into two groups. The available data were divided into two groups. The first group is called the training dataset and includes the data over years 2003/01-2009/05. The second group is called the testing dataset and includes the data over years 2009/06-2013/12. The training dataset served in model building, while the testing dataset was used for the validation of the developed models. The training dataset is used to adjust the weights and biases of the network, the testing dataset is utilized to minimize overfitting. If the accuracy on the training dataset increases, but the accuracy on the testing dataset decreases, the neural network is overfitting and the training should be stopped.

Training Adaptive Neuro Fuzzy Inference System (ANFIS)
A common learning algorithm in training ANFIS-based models is a hybrid algorithm. Based on that method, the gradient descent (GD) algorithm with back propagation is utilized to obtain premise parameters and the least squares method is used to identify the consequent parameters. However, due to the gradient calculation, the convergence may take a long time. The other problem of the gradient-based algorithm is the need to find an appropriate learning rate to ensure that, during the training process, parameters do not diverge. Another problem of the gradient-based algorithms is that they are easily being trapped in local minimums. By contrast, efficient and effective heuristic algorithms can tackle these problems. These heuristic algorithms usually mimic physical or biological processes and have the ability to produce optimal or near optimal parameters [62].
In this research, we will use the GSA, COA, CS, and some other effective algorithm for training ANFIS. Several models for electricity demand forecasting will be developed and tested to provide predictions.

ANFIS-based Forecasting Models
In our proposed approach, the subtractive clustering parameters will be optimized by using efficient and effective heuristic algorithms, and the ANFIS model will be used to evaluate the objective function value of any candidate solution generated by these heuristic algorithms. Therefore, the parameters of a FIS designed for mapping input values to actual output values can be optimized in order to minimize the total prediction errors of the resulted final FIS. The objective function of the optimizer is to minimize the root mean square error (RMSE) of predictions made by the ANFIS model, whose number of rules is controlled by the parameters of the subtractive clustering method. The stopping criterion will be the number of consecutive iterations in which the RMSE value cannot be improved. In other words, after some successive iterations, if the heuristic algorithms cannot find any better clustering parameters to lower the RMSE value, the search will stop. The flowchart of using the heuristic algorithms to optimize the clustering parameters is presented in Figure 9. The fuzzy membership function will to be a Gaussian function with a maximum equal to 1, and a minimum equal to 0. After the subtractive clustering parameters are obtained, the ANFIS model will run again to adjust the parameters in the antecedent and consequent parts. It should be noted that each input variable (xi) has a certain level of influence to the output variable (Y), and therefore, the output variable (Y) can be used as the output of all of the sub-ANFIS models. The selection of the input variables for each sub-ANFIS model is based on the significance of these variables affecting the output variable. Specifically, the input variables with the highest degree of correlation will be chosen as inputs for the sub-ANFIS model in the first layer, while the remaining input variables can be considered as inputs for the sub-ANFIS models in the successive layers.
In brief, the training procedure of the model includes two stages. In Stage I, the subtractive clustering parameters are adjusted by heuristic algorithms. In Stage II, with the optimal rule parameters identified in Stage I, each sub-ANFIS model is run again to adjust the parameters in the antecedent and consequent parts. In the implementation of the model, each sub-ANFIS model is trained individually and successively. In other words, the sub-ANFIS model at the first layer is trained first, and the results are used in order to train the sub-ANFIS at the second layer, and so on, till the last layer. The output of the sub-ANFIS model at the last layer will be the output of the model. In the model, the hierarchical structure and clustering method help to decrease the rule base dimension resulted from the increasing number of inputs, and the heuristic algorithms are used to optimize the clustering parameters. Therefore, the model can enhance the quality of the forecasting outcomes.
As discussed earlier, the data set included eight input variables and one output variable; therefore, a four-layer ANFIS structure was introduced to decrease the dimension of the rule base. Layers 1-3 have three input variables and layer 4 has two input variables; each layer has one output. Figure 10 represents the hierarchical ANFIS model, where X1-X8 are input variables and Y is the one output variable. It should be noted that each input variable (x i ) has a certain level of influence to the output variable (Y), and therefore, the output variable (Y) can be used as the output of all of the sub-ANFIS models. The selection of the input variables for each sub-ANFIS model is based on the significance of these variables affecting the output variable. Specifically, the input variables with the highest degree of correlation will be chosen as inputs for the sub-ANFIS model in the first layer, while the remaining input variables can be considered as inputs for the sub-ANFIS models in the successive layers.
In brief, the training procedure of the model includes two stages. In Stage I, the subtractive clustering parameters are adjusted by heuristic algorithms. In Stage II, with the optimal rule parameters identified in Stage I, each sub-ANFIS model is run again to adjust the parameters in the antecedent and consequent parts. In the implementation of the model, each sub-ANFIS model is trained individually and successively. In other words, the sub-ANFIS model at the first layer is trained first, and the results are used in order to train the sub-ANFIS at the second layer, and so on, till the last layer. The output of the sub-ANFIS model at the last layer will be the output of the model. In the model, the hierarchical structure and clustering method help to decrease the rule base dimension resulted from the increasing number of inputs, and the heuristic algorithms are used to optimize the clustering parameters. Therefore, the model can enhance the quality of the forecasting outcomes.
As discussed earlier, the data set included eight input variables and one output variable; therefore, a four-layer ANFIS structure was introduced to decrease the dimension of the rule base. Layers 1-3 have three input variables and layer 4 has two input variables; each layer has one output. Figure 10 represents the hierarchical ANFIS model, where X 1 -X 8 are input variables and Y is the one output variable. In the study, six ANFIS-based forecasting models are developed. There are three hierarchical ANFIS models in which the noise is eliminated by wavelet transform and parameters are optimized by GSA, COA, and CS; we refer to them hereafter as wavelet GSA-HANFIS, wavelet COA-HANFIS, and wavelet CS-HANFIS. There are three hierarchical ANFIS models in which parameter are optimized by GSA, COA, and CS; and they are named as GSA-HANFIS, COA-HANFIS, and CS-HANFIS, respectively. Moreover, a model based on artificial neural network (ANN) is also developed.

Evaluation Criteria
To compare the performances of different forecasting models, several criteria are used. These criteria are applied to the trained neural network to determine how well the network works. These criteria are used to compare forecasting values and actual values. They are as follows: Mean absolute percentage error (MAPE): this index indicates an average of the absolute percentage errors; the lower the MAPE the better.
where tk is the actual (desired) value, yk is the forecasting value produced by the model, and m is the total number of observations. Root mean squared error (RMSE): this index estimates the residual between the actual value and desired value. A model has better performance if it has smaller a RMSE. An RMSE equal to zero represents a perfect fit.
Correlation coefficient (R): this criterion reveals the strength of relationships between actual values and forecasting values. The correlation coefficient has a range from −1 to 1, and a model with a higher R means it has better performance. In the study, six ANFIS-based forecasting models are developed. There are three hierarchical ANFIS models in which the noise is eliminated by wavelet transform and parameters are optimized by GSA, COA, and CS; we refer to them hereafter as wavelet GSA-HANFIS, wavelet COA-HANFIS, and wavelet CS-HANFIS. There are three hierarchical ANFIS models in which parameter are optimized by GSA, COA, and CS; and they are named as GSA-HANFIS, COA-HANFIS, and CS-HANFIS, respectively. Moreover, a model based on artificial neural network (ANN) is also developed.

Evaluation Criteria
To compare the performances of different forecasting models, several criteria are used. These criteria are applied to the trained neural network to determine how well the network works. These criteria are used to compare forecasting values and actual values. They are as follows: Mean absolute percentage error (MAPE): this index indicates an average of the absolute percentage errors; the lower the MAPE the better.
where t k is the actual (desired) value, y k is the forecasting value produced by the model, and m is the total number of observations. Root mean squared error (RMSE): this index estimates the residual between the actual value and desired value. A model has better performance if it has smaller a RMSE. An RMSE equal to zero represents a perfect fit.
Mean absolute error (MAE): this index indicates how close predicted values are to the actual values.
Correlation coefficient (R): this criterion reveals the strength of relationships between actual values and forecasting values. The correlation coefficient has a range from −1 to 1, and a model with a higher R means it has better performance.

Experimental Results and Discussion
The models were coded and implemented in the Matlab environment (Matlab R2014a) and simulation results were then obtained. As discussed earlier, Haar wavelet transform is able to eliminate noise. Therefore, it is suitable to deal with the irregular data series. A five-fold cross validation method was used to avoid an over-fitting problem. For wavelet GSA-HANFIS and GSA HANFIS, the parameters for the GSA algorithm were as follows: the number of initial population was 20 and the gravitational constant in Equation (12) was determined by the function G(t) = G 0 exp(−αt/T), where G 0 = 100, α = 20, and T was the total number of iterations. For wavelet COA-HANFIS and COA-HANFIS, the parameters for the COA algorithm were set as follows: the number of initial population was 20 and p% was 10%. For wavelet CS-HANFIS and CS-HANFIS, the parameters for the CS algorithm were set as follows: the number of nests was 20; the step size was 0.01; and the net discovery rate (p a ) was 0.1.
For ANN model, the optimum number of neurons in the hidden layer was determined by varying their number, starting with a minimum of one, and then increasing in steps by adding one neuron each time. Hence, various network architectures were tested to achieve the optimum number of hidden neurons. The best performing architectures for ANN were found to be 8-5-1. The activation function from input to hidden is sigmoid. With no loss of generality, a commonly used form, f (n) = 2/(1 + e −2n ) − 1, is utilized; while a linear function is used from the hidden layer to the output layer. The parameters for back-propagation were set as follows: the learning and momentum rates were 0.4 and 0.3, respectively.
Plot the actual data and sets of forecasts on a single graph. A plot of data can reveal the existence and nature of a trend. Figures 11-13 plots actual data and forecasting values by different forecasting models. It can be seen that wavelet CS-HANFIS performed best when all the forecasting values have the tendency to close to actual data. Table 2 presents performance statistics of these models. Obviously, the wavelet CS-HANFIS has the smallest MAPE, RMSE, and MAE values as well as the biggest R value. This means that the wavelet CS-HANFIS has a better overall performance in all criteria.      In order to evaluate the performance of the developed forecasting models, the ARIMA and MLR methods were also applied to the problem. The details of these methods can be found in the relevant literature, which is beyond the scope of this work. After a few testing attempts, the ARIMA model was selected as ARIMA (2,1,1). These models were also implemented in Matlab R2014a. The results obtained by these models were recorded and are shown in Table 3. As can be seen from Table 3, the ARIMA had a better performance than the MLR. However, when compared with the results from Table 2, the wavelet CS-HANFIS model surpassed the ARIMA and MLR.  In general, from the results presented in this section, it can be inferred that the wavelet HANFISbased models perform better than traditional forecasting methods (ARIMA and MLR) and the wavelet CS-HANFIS model is clearly superior to its counterparts. Another implication of the results is that if the noise is eliminated (by wavelet transform) the forecasting models will perform better.

Conclusions
Understanding electricity demand is a potentially critical factor that is required for ensuring future stability and security. Executives and government authorities need this information for decision making in energy markets. In this study, a new approach, named wavelet HANFIS-based model, which is based on wavelet, neuro-fuzzy system and heuristic algorithms for monthly electricity demand forecasting, is proposed. Firstly, in order to eliminate noise, wavelet transform is utilized to decompose the original data series. Secondly, the three powerful heuristic algorithms including GSA, COA, and CS, are employed to optimize the clustering parameters of HANFIS. The proposed approach and other well-known forecasting methods, ARIMA and MLR, were used to forecast the monthly electricity load in Hanoi, Vietnam based on historical data from June 2009 to December 2013. The results indicate that the wavelet CS-HANFIS is the best model to fit the historical data. This study of the use of ANFIS as a modelling tool for forecasting electricity demand has shown the benefits of the application of neuro-fuzzy systems. Therefore, this work has contributed to the development of forecasting methods. The results of the present study show the fact that a comparative analysis of different approaches has always been supportive in developing a forecasting model. The findings also show the application of wavelet transform in time series forecasting is very encouraging. Further studies may include different segments of electricity consumption, including residential, industrial, agricultural, governmental and commerce, and city services. Province-based forecasting is also essential for distribution companies. Technical loss should be considered when analyzing electricity demand because this parameter may have a tremendous impact.  In general, from the results presented in this section, it can be inferred that the wavelet HANFIS-based models perform better than traditional forecasting methods (ARIMA and MLR) and the wavelet CS-HANFIS model is clearly superior to its counterparts. Another implication of the results is that if the noise is eliminated (by wavelet transform) the forecasting models will perform better.

Conclusions
Understanding electricity demand is a potentially critical factor that is required for ensuring future stability and security. Executives and government authorities need this information for decision making in energy markets. In this study, a new approach, named wavelet HANFIS-based model, which is based on wavelet, neuro-fuzzy system and heuristic algorithms for monthly electricity demand forecasting, is proposed. Firstly, in order to eliminate noise, wavelet transform is utilized to decompose the original data series. Secondly, the three powerful heuristic algorithms including GSA, COA, and CS, are employed to optimize the clustering parameters of HANFIS. The proposed approach and other well-known forecasting methods, ARIMA and MLR, were used to forecast the monthly electricity load in Hanoi, Vietnam based on historical data from June 2009 to December 2013. The results indicate that the wavelet CS-HANFIS is the best model to fit the historical data. This study of the use of ANFIS as a modelling tool for forecasting electricity demand has shown the benefits of the application of neuro-fuzzy systems. Therefore, this work has contributed to the development of forecasting methods. The results of the present study show the fact that a comparative analysis of different approaches has always been supportive in developing a forecasting model. The findings also show the application of wavelet transform in time series forecasting is very encouraging. Further studies may include different segments of electricity consumption, including residential, industrial, agricultural, governmental and commerce, and city services. Province-based forecasting is also essential for distribution companies.
Technical loss should be considered when analyzing electricity demand because this parameter may have a tremendous impact.