Representative Days for Expansion Decisions in Power Systems

Short-term uncertainty should be properly modeled when the expansion planning problem of a power system is analyzed. Since the use of all available historical data may lead to intractability, clustering algorithms should be applied in order to reduce the computational burden without sacrificing an accurate representation of the historical data. In this paper, we propose a modified version of the traditional K-means method that seeks to represent the maximum and minimum values of the input data, namely, the electric load and the renewable production in several locations of an electric energy system. Depicting the extreme values of these parameters is crucial because they can have a great impact on the expansion and operation decisions taken. The proposed method is based on the traditional K-means algorithm, which represents the correlation between electric load and wind-power production. The chronology of historical data, which influences the performance of some technologies, is characterized through representative days, each one composed of 24 operating conditions. A realistic case study based on the generation and transmission expansion planning of the IEEE 24-bus Reliability Test System is analyzed applying representative days and comparing the results obtained using the traditional K-means technique and the proposed method.

$x^L_\ell$: Binary variable that is equal to 1 if candidate transmission line $\ell$ is built, and 0 otherwise.
$\theta_{nrh}$: Voltage angle at node $n$ [rad].

Introduction
The Generation and Transmission Expansion Planning (G&TEP) problem is solved to determine the new facilities that should be built in a power system in order to ensure the supply of the electric load in the future; the time frame of this problem can comprise several decades. It is motivated by the growth in peak loads, the penetration of renewable generating units, and the aging of transmission facilities.
In most electricity markets, a central entity is in charge of taking expansion decisions for the transmission network, i.e., deciding which transmission lines should be built. The aim of this system operator is to minimize investment and operation costs while preventing load shedding. In contrast, expansion decisions for generating units are taken by private investors, whose purpose is to maximize their profits while minimizing their financial risk. Nevertheless, an optimal solution is not guaranteed if the G&TEP problem is treated as two independent problems. This is the reason why the perspective of the system operator is generally considered in the technical literature when dealing with a G&TEP problem.
That is, the central entity attains the optimal solution by minimizing the operation and investment costs of both transmission facilities and generating units.
Once this is done, the system operator must provide indications about the optimal expansion of generating units, with which the government should design policy plans to promote investment in certain technologies or locations.
Historical data are generally used to model the performance of power systems, since the more realistic the input data of the G&TEP problem are, the more accurate the solution of the problem will be in comparison with the future situation.
Regarding short-term uncertainty, electric load and renewable production are the historical data whose variability is most important. On the one hand, electric load is characterized by a daily evolution pattern. Since its variability depends on human habits, its progression can be accurately predicted using historical data. On the other hand, the generation of electric energy through renewable sources depends on meteorological conditions. For instance, the performance of wind turbines depends on wind speed, while the electric energy produced by solar panels and hydroelectric power stations relies on sunlight and rainfall, respectively.
Note that the short-term uncertainty associated with renewable generating units increases the complexity of G&TEP problems because weather conditions are harder to predict in advance than the daily evolution of electric load. Techniques used in the technical literature to represent these data include net load duration curves [1,2] and load- and wind-duration curves [3,4].
The K-means technique arranges data into groups, whose centroids are used to represent the input data while reducing the computational burden. The weight of each centroid is associated with the number of input data inside its group. This method has the advantage that, in contrast to the load- and wind-duration curves technique, it can consider different correlations of electric load and renewable production in several locations of the electric energy system under study. The K-means method is used, for example, in [6,7].
Duration curves and traditional K-means methods are compared in [8]. The main issue of these two methods is that units with intertemporal constraints, such as storage units, cannot be included in the expansion problem. To deal with this issue, [9] proposes using a representative day of each season, while [10] and [11] consider a modified K-means method. The main drawback of these methods is that they may not accurately represent the extreme values of the input data.
When electric load and renewable production are used as input data, maximum and minimum values can have a great effect on the solution of the optimization problem.
Within this context, the contributions of this paper are threefold:
1. To propose a modified version of the traditional K-means method so that the system operating conditions obtained as output of this technique properly represent the maximum and minimum values of the input data.
The remainder of this paper is organized as follows. Section 2 explains the methodology of the traditional K-means method and the proposed modified version of this technique. Section 3 provides the formulation of the G&TEP problem. Section 4 presents the results of a case study, in which the outcomes obtained applying the clustering methods mentioned above are compared. Finally, Section 5 concludes the paper with some relevant remarks.

Methodology
The K-means method is a clustering algorithm whose aim is to arrange data into groups, called clusters, according to their similarities. On the one hand, the inputs of this algorithm are historical data of two physical processes, namely, the electric load and the wind-power production in several locations of an electric energy system. On the other hand, the outputs of this technique are the cluster centroids along with the number of observations located at each cluster.
Note that cluster centroids, defined by the values of the two physical processes involved, represent the system operating conditions, which can be used as input data in the resolution of optimization problems (e.g., a long-term planning problem).
The K-means technique is useful when dealing with a significant amount of data in optimization problems because it reduces the computational burden. To this end, the users of this method choose the number K of operating conditions to be obtained. However, it must be taken into account that a low number of operating conditions may not represent the input data accurately. In contrast, a high number of clusters can lead to intractability.

Input data
It is important to normalize the input data before applying the algorithm when working with electric load and wind-power production data, because the order of magnitude of the former is commonly greater than that of the latter. If the input data are not normalized and the orders of magnitude differ, the process with the larger values dominates the distances computed by the algorithm. In contrast to the electric load, Fig. 2 shows that the daily evolution of wind-power production does not follow any pattern.
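A minimal sketch of the normalization step in Python with NumPy is given next. Dividing each series by its own maximum is an assumption made here for illustration, since the normalization actually used is not stated; the function name and the sample values are hypothetical.

```python
import numpy as np

def normalize_by_peak(series):
    """Scale a non-negative series to [0, 1] by dividing by its maximum.

    Assumption: peak-based scaling, chosen only for illustration.
    """
    series = np.asarray(series, dtype=float)
    peak = series.max()
    if peak <= 0:
        raise ValueError("series must contain at least one positive value")
    return series / peak

# Hypothetical hourly values in MW (illustrative only).
load_mw = np.array([1800.0, 2100.0, 2850.0, 2400.0])
wind_mw = np.array([120.0, 40.0, 300.0, 210.0])

# After scaling, both processes lie in the same range, so neither
# dominates the distances computed by a clustering algorithm.
load_pu = normalize_by_peak(load_mw)
wind_pu = normalize_by_peak(wind_mw)
```

With both series expressed as a fraction of their peak, a difference of 0.1 in load and a difference of 0.1 in wind contribute equally to the quadratic distance.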

Traditional K-means algorithm
The algorithm of the K-means method that has been used in the technical literature, known from now on as the traditional K-means method (TKM), is based on the following steps [8]:
• Step 1: Select the number of required clusters according to the needs of the problem.
• Step 2: Define the initial centroid of each cluster, e.g., randomly assigning a historical observation to each cluster.
• Step 3: Compute the quadratic distances between each original observation and each cluster centroid.
• Step 4: Allocate each historical observation to the closest cluster according to the distances computed in Step 3.
• Step 5: Recalculate the cluster centroids using the historical observations allocated to each cluster.
Steps 3-5 are repeated iteratively until there are no changes in the cluster compositions between two consecutive iterations. Fig. 3 illustrates the TKM algorithm.
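The iteration described above can be sketched in Python with NumPy as follows. This is an illustrative implementation under stated assumptions, not the code used in the case study: the function name, the seeded random initialization, the iteration cap, and the handling of empty clusters are choices made here.

```python
import numpy as np

def traditional_kmeans(data, k, seed=0, max_iter=1000):
    """Cluster the rows of `data` into k groups.

    Returns the centroids, the cluster label of each observation, and
    the number of observations per cluster (the centroid weights).
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct historical observations chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Quadratic (squared Euclidean) distances, shape (n_obs, k).
        dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Allocate each observation to its closest centroid.
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # cluster compositions unchanged: stop iterating
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[c] = members.mean(axis=0)
    weights = np.bincount(labels, minlength=k)
    return centroids, labels, weights

# Two well-separated synthetic groups of normalized observations
# (illustrative data only, not the historical series of the case study).
rng = np.random.default_rng(1)
obs = np.vstack([rng.normal(0.2, 0.01, (50, 2)),
                 rng.normal(0.8, 0.01, (50, 2))])
centroids, labels, weights = traditional_kmeans(obs, k=2)
```

The returned `weights` are the numbers of observations per cluster, which is the information later used to weight each operating condition.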

Modified K-means algorithm
To overcome these issues, we propose a new clustering method called the modified K-means method (MKM), which tries to properly characterize the extreme values of the parameters considered. Its steps are presented below:
• Step 1: Arrange the historical data into $K_1$ clusters following the TKM.
Note that the MKM can only be applied if the number of observations located at each cluster after Step 1 is greater than or equal to the parameter $K_2$. In addition, the parameter $K_1$ must be less than or equal to the number of input data considered in Step 1. This last condition also applies to K in the TKM.
Equation (1) defines the relation that must exist between the parameter K, associated with the traditional K-means method, and the parameters $K_1$ and $K_2$, linked to the modified K-means method, to make the results of both methods comparable.
Figure 4: Flowchart of the modified K-means method algorithm.

Output data
Since we use representative days in the case study described in Section 4, we consider that the parameters K, $K_1$ and $K_2$ are associated with the number of representative days in their respective K-means methods, instead of the definitions given earlier in this paper.
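Clustering whole days rather than single hours requires each day to be one observation. A minimal sketch of this arrangement in Python with NumPy follows; the function names, the array layout, and the synthetic data are assumptions introduced here for illustration.

```python
import numpy as np

HOURS_PER_DAY = 24

def to_daily_observations(hourly):
    """Reshape hourly data of shape (n_hours, n_series) into one row per day.

    Each row concatenates the 24 hourly values of every series, so a
    clustering algorithm groups whole days and chronology is preserved.
    """
    hourly = np.asarray(hourly, dtype=float)
    n_hours, n_series = hourly.shape
    if n_hours % HOURS_PER_DAY != 0:
        raise ValueError("the number of hours must be a multiple of 24")
    n_days = n_hours // HOURS_PER_DAY
    return hourly.reshape(n_days, HOURS_PER_DAY * n_series)

def to_hourly_profile(day_vector, n_series):
    """Recover the (24, n_series) hourly profile of one day vector."""
    return np.asarray(day_vector).reshape(HOURS_PER_DAY, n_series)

# Synthetic normalized data: 366 days and two series (e.g., load and wind).
rng = np.random.default_rng(0)
hourly = rng.random((366 * HOURS_PER_DAY, 2))
days = to_daily_observations(hourly)   # shape (366, 48)
second_day = to_hourly_profile(days[1], n_series=2)
```

A centroid computed from such rows can be passed through `to_hourly_profile` to obtain the 24 operating conditions of a representative day.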
The representative days of electric load and wind-power production obtained applying the traditional K-means method using K = 10 are illustrated in the corresponding figures.
Figure 8: Representative days of wind-power production: modified K-means method (axis label: wind-power production, % of installed capacity).
The objective function (2a) represents the aim of the G&TEP problem, which is minimizing the expansion (generation, storage, and transmission facilities) and operation (power produced by conventional generating units and load shedding) costs. The terms associated with operation costs are multiplied by the weight of the corresponding representative day, $\sigma_r$, to make them comparable with expansion costs. Note that the sum of $\sigma_r$ over all the representative days is equal to 365, i.e., the total number of days in a year.
The constraints of the problem include: constraints (2w)-(2x), which impose bounds on the power produced by existing and candidate conventional generating units, respectively; constraints (2y), which limit the load shed of demands; constraints (2z)-(2aa), which impose bounds on the charging power of existing and candidate storage units, respectively; constraints (2ab)-(2ac), which impose bounds on the discharging power of existing and candidate storage units, respectively; constraints (2ad)-(2ae), which impose bounds on the power produced by existing and candidate wind-power units, respectively, where the wind-power capacity factors $\alpha_{wrh}$, $\forall w, \forall r, \forall h$, are associated with the output of the K-means method described in Section 2; and constraints (2af), which define the voltage angle at the reference node.
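The role of the weights $\sigma_r$ in the objective function can be illustrated with a short Python sketch. The function name and all numerical values are hypothetical; only the structure (annualized expansion cost plus weighted daily operation costs, with weights adding up to 365) follows the description above.

```python
def annual_cost(investment_cost, daily_operation_costs, weights):
    """Annualized investment cost plus weighted daily operation costs.

    `weights[r]` is the number of days of the year represented by
    representative day r; the weights must add up to 365.
    """
    if len(daily_operation_costs) != len(weights):
        raise ValueError("one weight is needed per representative day")
    if abs(sum(weights) - 365) > 1e-9:
        raise ValueError("weights must add up to 365 days")
    operation = sum(w * c for w, c in zip(weights, daily_operation_costs))
    return investment_cost + operation

# Hypothetical figures in million $: three representative days.
total = annual_cost(
    investment_cost=40.0,
    daily_operation_costs=[1.0, 2.0, 4.0],
    weights=[200, 100, 65],
)
```

Here the operation cost of each representative day is counted as many times as the number of days it stands for, which puts the daily operation costs on the same annual basis as the annualized investment cost.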
It is important to mention that the network constraints are modeled in the G&TEP problem using a DC model without losses for the sake of simplicity. In addition, fixed costs are not considered, and the capacities to be installed of the generating units, i.e., variables $p^G_g$, $\forall g \in \Omega^{G+}$, are considered continuous. The G&TEP problem (2) is a mixed-integer nonlinear programming (MINLP) model. The nonlinear terms are $x^L_\ell \theta_{nrh}$ in constraints (2m), i.e., products of binary and continuous variables. These nonlinear terms can be replaced by exact equivalent mixed-integer linear expressions as explained, e.g., in [13]. Thus, the G&TEP problem (2) can finally be formulated as a mixed-integer linear programming (MILP) model that can be solved using available branch-and-cut solvers, e.g., CPLEX [14].
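One common exact reformulation of the product of a binary variable and a bounded continuous variable is sketched below. The auxiliary variable $z$ and the bound $\theta^{\max}$ are notation introduced here for illustration; the expressions actually used in [13] may differ in form.

```latex
% Assumption: x \in \{0,1\} and -\theta^{\max} \le \theta \le \theta^{\max}.
% The product z = x\,\theta is replaced by the linear constraints:
\begin{align}
  -\theta^{\max}\, x \le z &\le \theta^{\max}\, x, \\
  -\theta^{\max}\, (1 - x) \le \theta - z &\le \theta^{\max}\, (1 - x).
\end{align}
% If x = 0, the first pair forces z = 0.
% If x = 1, the second pair forces z = \theta.
```

Both cases reproduce $z = x\,\theta$ exactly, so no approximation error is introduced provided that the bound $\theta^{\max}$ is valid.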

Data
We apply the expansion model described in Section 3 to the modified version of the IEEE Reliability Test System (RTS) [17] that is depicted in Fig. 9. This electric energy system comprises 11 conventional generating units, 17 demands, 24 nodes, two storage units, 38 transmission lines and two wind-power units. Table 1 provides the conventional generating unit data; Table 2 supplies the demand data; storage unit data is presented in Table 3; the transmission line data can be consulted in Table 4; and Table 5 provides the wind-power unit data.
Note that the annualized investment costs of candidate storage units, which are shown in Table 3, are based on the data collected in [18]. We consider a set of values taking the average of the costs provided in the two scenarios considered in [18], as displayed in equation (3).
We consider that wind-power production and electric load can take different values depending on the zone of the electric energy system where wind-power units and demands are located. On the one hand, demands are allocated to the west and east zones of the system, as illustrated in Figs. 10 and 11. On the other hand, wind-power units are distributed between the north and south zones of the system, as depicted in Figs. 12 and 13. As before, the historical data of electric load and wind-power production have been acquired from [16]. It is worth mentioning that the peak values of electric load in the west zone are greater than in the east zone. In addition, the maximum values of wind-power production are associated with the north zone. It is expected that the need to supply the high demands in the west zone will condition the investment decisions of the expansion problem.
We work with hourly data; thus, the duration of time steps, $\Delta\tau$, is equal to one hour. We consider that the charging and discharging efficiency of storage units is equal to 90 %. The energy initially stored in storage units is assumed to be zero for all the representative days. Node 1 is the reference node of the optimization problem. The parameter F is set to 500,000.
Due to the presence of transformers in the electric energy system considered, as can be seen in Fig. 9, we select a base power of 100 MW. The values of the parameters $\alpha_{wrh}$ and $\beta_{drh}$ for each representative day and hour are assumed to be the same for all the wind-power units and demands, respectively. Both parameters are obtained from the K-means methods.
Instead of considering a different investment budget for the building of each candidate generating/storage unit or transmission line, we consider a total investment budget, $I^T$, which is distributed among the different types of facilities.
Thus, constraints (2g)-(2j) of problem (2) are replaced by constraint (4) from now on. We consider a total investment budget of $2,000 million. The annualized investment costs are 10 % of the total costs.

Results
First of all, we solve the G&TEP problem using all the historical data to find the exact solution in order to compare it with the results obtained using the representative days provided by both K-means methods. However, it is necessary to make some changes in the formulation of problem (2) to properly characterize the continuity in time of the historical data. Thus, constraints (2q)-(2r) are replaced by constraints (5), which relate the energy stored in each storage unit during the first hour of every day except the first one to the energy stored in the same storage unit during the last hour of the previous day.
In addition, constraints (2s)-(2t) are replaced by constraints (6)-(7), which link the energy stored in each existing and candidate storage unit, respectively, during the first hour of the first day to the energy initially stored in the same storage unit on the first day, $E^S_{sr0}$.
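The chronological coupling between consecutive days can be illustrated with a simple simulation in Python. The energy balance used here (charging scaled by the charging efficiency, discharging divided by the discharging efficiency) is a commonly used storage model and is an assumption of this sketch; the exact form of constraints (5)-(7) may differ. The function name and the sample powers are hypothetical, while the 90 % efficiencies, the one-hour time step, and the zero initial energy follow the data stated above.

```python
def stored_energy(charge, discharge, e_initial=0.0,
                  eta_c=0.9, eta_d=0.9, dt=1.0):
    """Chronological energy trajectory of one storage unit.

    `charge` and `discharge` are hourly powers over consecutive days.
    The energy at the first hour of a day builds on the last hour of
    the previous day; only the very first hour uses `e_initial`.
    """
    energy = []
    previous = e_initial
    for p_c, p_d in zip(charge, discharge):
        previous = previous + eta_c * p_c * dt - p_d * dt / eta_d
        energy.append(previous)
    return energy

# Two consecutive days (48 hours), hypothetical powers in MW:
# charge during the last hour of day 1, discharge in the first hour of day 2.
charge = [0.0] * 48
discharge = [0.0] * 48
charge[23] = 10.0
discharge[24] = 8.1
energy = stored_energy(charge, discharge)
```

In this example the energy charged at the end of the first day is available at the start of the second day, which is the behavior that independent representative days with zero initial energy cannot capture.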
Having made these changes, the G&TEP problem is solved using the 366 days of historical data, due to the fact that the year considered is a leap year.
The total annual cost obtained, CT, amounts to $3,124 million. The results show that 0.14 % of the total demand is not supplied. The computation time required to obtain the exact solution is 55 h 28 min.
The steps that should be followed in order to make the results obtained using representative days comparable with the exact solution are presented below:
• Step 1: Solve the G&TEP problem using representative days obtained applying the clustering methods.
• Step 2: Fix the values of the decision variables ($m^S_s$, $\forall s \in \Omega^{S+}$; $p^G_g$, $\forall g \in \Omega^{G+}$; $p^W_w$, $\forall w \in \Omega^{W+}$; $x^L_\ell$, $\forall \ell \in \Omega^{L+}$) obtained in Step 1 and solve the G&TEP problem using all the historical data.
• Step 3: Calculate the percent error, $\varepsilon_{CT}$, associated with the total annual cost obtained in Step 2, $CT_K$, with regard to the total annual cost provided by the exact solution, $CT_E$, applying equation (8).
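A minimal sketch of a percent-error calculation of this kind is shown below in Python. It assumes the absolute relative deviation multiplied by 100, which may differ in form from equation (8); the function name is hypothetical, and the value used for the cost obtained with representative days is invented for illustration, whereas the exact cost of $3,124 million is the one reported above.

```python
def percent_error(cost_k, cost_exact):
    """Percent error of `cost_k` with respect to `cost_exact`.

    Assumption: absolute relative deviation times 100.
    """
    if cost_exact == 0:
        raise ValueError("the exact cost must be nonzero")
    return abs(cost_k - cost_exact) / abs(cost_exact) * 100.0

# Exact cost from the text (million $); cost_k is a hypothetical value.
error = percent_error(cost_k=3200.0, cost_exact=3124.0)
```

Because the expansion decisions are fixed before re-solving with all the historical data, the error measures the quality of the investment plan rather than that of the reduced operating model.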
These steps are followed in the case study using a set of values of the parameter K ranging from 10 to 80, with 366 being the maximum value that could be selected. This means that we work with an amount of data equivalent to between 3 and 22 % of all the historical data considered (10/366 ≈ 2.7 % and 80/366 ≈ 21.9 %). The results obtained using the MKM present less error than those obtained using the TKM for all the cases analyzed, especially in those where the parameter K presents a low value.
Although it is fundamental to determine which clustering method provides the closest results to the exact solution, we should also analyze the computation times obtained in Step 1 of the process described above. Fig. 16 shows that the TKM generally provides shorter computation times, especially in those cases where the parameter K presents a high value. However, it should be taken into account that the possible saturation of the server used to solve the G&TEP problem, caused by its concurrent use, may have influenced the computation times obtained. In addition, note that the computation times tend to rise as the value of K increases. The results of Figs. 14, 15 and 16 are collected in Table 6.
Taking into account the results discussed above, we consider that the MKM provides better results than the TKM, especially regarding the error of the total annual cost. Although the computation times obtained using the MKM are generally greater than those obtained using the TKM, in several of the cases evaluated the error provided by the MKM in a given time is less than the error obtained using the TKM in the same amount of time. For instance, the MKM presents an error of 2.04 % using 40 representative days in 22 min, while the TKM takes 30 min to obtain an error of 2.70 % using 60 representative days.
Due to this and the possible saturation problems in the server mentioned before, we consider that the results associated with the error are more relevant than those linked to the computation times.

Computation Times
The results of this case study are obtained using CPLEX [14] under GAMS [15] on an Intel Xeon E7-4820 computer with 4 processors at 2 GHz and 128 GB of RAM.
The computation time required to obtain the exact solution is 55 h 28 min.
Regarding the resolution of the G&TEP problem using representative days, the corresponding computation times are collected in Table 6.

Conclusions
This paper proposes a new clustering method to adequately characterize the maximum and minimum values of the input data. In addition, we arrange the operating conditions obtained using the K-means method into representative days in order to depict the chronology of the historical data. This allows us to include storage units in the expansion model considered to solve the G&TEP problem.
The conclusion of this paper is that the results obtained in the case study using the modified K-means method with different numbers of representative days provide a total annual cost closer to the exact solution than those obtained using the traditional K-means method. In fact, although the computation times may have been influenced by the saturation of the server used, the results show that in some cases the MKM is able to solve the G&TEP problem in less time than the TKM, using fewer representative days and achieving a smaller error.