Building a Better Baseline for Residential Demand Response Programs: Mitigating the Effects of Customer Heterogeneity and Random Variations

Peak-time rebates offer an opportunity to introduce demand response in electricity markets. To implement peak-time rebates, utilities must accurately estimate what consumption would have been had the program not been in effect. Reliable calculations of customer baseline load elude utilities and independent system operators, due to factors that include heterogeneous demands and random variations. Prevailing research is limited for residential markets, which are growing rapidly with the presence of load aggregators and the availability of smart grid systems. We propose a novel method that clusters customers according to the size and predictability of their demands, substantially improving on existing customer baseline calculations and other clustering methods.


Introduction
Smart grid systems provide real-time capability for managing and monitoring electric usage. With smart grid technology, demand response (DR) offers a cost-effective alternative to increasing electricity supply when dealing with a small number of capacity-constrained hours. For purposes of this article, DR refers to programs where customers are paid "to reduce their consumption relative to an administratively set baseline level of consumption" [1]. While economists have long endorsed dynamic prices such as real-time pricing (RTP) that vary continuously to improve electric system efficiency, regulators have hesitated to implement these programs, particularly for residential customers. Borenstein [2] has noted that customers have resisted these rates, due to their complexity and the possibility of much higher bills from price surges. To gain greater customer participation, electricity load-serving entities (LSEs) have introduced DR offerings, such as peak-time rebate (PTR), that credit customers for reducing their use during event hours rather than charging a premium during high-demand hours.
Some observers, such as in [3], suggest that subsidies and premiums are equivalent in their effects on load. At the margin, the cost to reducing load is the same whether there is a penalty or a subsidy. However, economists find evidence that customers respond differently to rewards than to penalties (for example, see Gneezy et al. [4]). There is also a difference in the mechanisms. Unlike RTP, programs such as PTR require providers to determine load reduction. The predominant method is to pay customers the difference between their actual usage and a calculated customer baseline load (CBL), a measure of energy customers would have consumed at their former rate.
The use of a CBL introduces considerable complications. First, the CBL is counterfactual, as the LSE cannot observe what the customer on DR would have used at the previous rate. However, when clustering is formulated on the basis of similar size and demand predictability rather than random clustering, the formulation increases user understanding and the appearance of fairness. The justification is not unlike insurance, where customer rates depend on group characteristics rather than each individual's characteristics, with transparency and fairness dependent upon the customer's perception that the group has characteristics similar to the individual's. We focus on residential customers, who have received less attention than their commercial and industrial counterparts. With the increase in advanced metering infrastructure (AMI) and load aggregators who offer electricity providers load reductions from residential customer groups, residential DR potential is growing faster than in the commercial and industrial sectors. Navigant [8] suggests that it will be increasingly difficult to find more commercial and industrial customers who are willing to offer demand response, while the residential market is relatively untapped.
As with commercial and industrial customers, residential baselines are subject to "payments for nothing" due to random variations. Residential demands also follow less predictable patterns than commercial and industrial loads, as shown in Section 3.3 of Mohajeryami [9]. Baselines developed for commercial and industrial customers are not readily transferable to residential customers, as residential activities are more random and less predictable, calling for innovation in residential CBL determinations.
We employ a dataset collected by the Australian Energy Market Operator (AEMO) for 200 residential customers for 2012 (366 days). In the dataset, each electricity distributor in an AEMO market supplies raw data from a sample of 200 customers for each of its supply areas to the market operator to construct load profiles. The data used in this article are a sample of the raw data for one of the distributors. Data were incomplete for 11 customers, who were excluded from this study. The analysis includes the remaining 189 customers.
The customers in the sample do not participate in a DR program. Our objective is to compare the accuracy of CBL calculations in predicting demand on a high-use day. Such a calculation serves as the needed counterfactual for an actual DR program: if a customer's demand in response to an incentive such as PTR falls below the baseline, the customer should be compensated on the basis of an accurate baseline, rather than one that merely reflects random fluctuations in individual customer usage during the baseline measurement period.
Using this sample of one year of hourly data from the Australian Energy Market Operator (AEMO) [10], we compare our method to widely used existing methods that do not use clustering and to methods that use clustering but do not consider the predictability of customer demands. We develop customer baselines using hourly data for all 366 days of 2012. Because these customers were not part of a demand response program, we simulate 12 event days, typically the highest-demand day of each month, based on the historically highest-use hours of 3-9 pm. CBL inaccuracy stems from heterogeneous usage patterns and random variations, as well as from the CBL mechanism itself; among mechanisms, we consider the ones most commonly used by utilities, as well as other proposed techniques. We find that our clustering method performs considerably better than non-clustering methods currently used by LSEs, and improves upon previous clustering methods, based on standard metrics of accuracy and bias commonly used to measure CBL performance.
This article examines the methods of practitioners and the proposals of theorists, then offers a new approach to customer baseline (CBL) methodology. Section 2 covers baseline methods and metrics in the field, examining utility company and ISO practices for calculating CBLs. Section 3 reviews the literature and innovations in CBL calculations. The cross-disciplinary nature of this topic has left some of the rigorous engineering studies outside the purview of economists before this article. Section 4 proposes a new method to improve the error performance of CBL calculations based on k-means clustering, using average loads and a measure of load predictability. Section 5 presents a case study to evaluate the error performance of the CBL developed with the proposed methodology, using data from AEMO. Section 6 provides conclusions, policy implications, and suggestions for future work.

Conventional Utility and ISO Methods
Electricity providers developed baseline methods for their DR offerings for commercial and industrial customers, and then, in a relatively small number of DR programs, transferred them to residential customers. We provide a brief overview of the most common approaches, including High XofY, exponential smoothing, and regression.
The predominant approach used to determine baselines is the High XofY method. This method selects Y days of non-event non-holiday weekdays before a DR-event day. Of these Y days, the baseline uses X days with maximum average consumption. The baseline for each hour of the event day is the average hourly load for these X days. The New York ISO uses this method with X = 5 and Y = 10, and we include it as one of the methods that we compare to our clustering method. PJM uses High4of5 and the California ISO (CAISO) uses Last10Days.
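As an illustration, the High XofY selection can be sketched in a few lines of Python (a simplified sketch; the `high_x_of_y` helper and the flat toy profiles are hypothetical, and real implementations add eligibility rules for holidays and prior event days):

```python
import numpy as np

def high_x_of_y(history, x=5, y=10):
    """High XofY baseline: from the Y most recent eligible (non-event,
    non-holiday weekday) days, keep the X days with the highest average
    load and average them hour by hour.

    history: list of 24-element hourly load profiles, most recent last,
             already filtered to eligible days.
    Returns a 24-element baseline array.
    """
    recent = np.asarray(history[-y:], dtype=float)      # last Y eligible days
    top = recent[np.argsort(recent.mean(axis=1))[-x:]]  # X highest-usage days
    return top.mean(axis=0)                             # hourly average

# Toy example: 10 flat days with average loads 1.0 .. 1.9 kWh
days = [[1.0 + 0.1 * i] * 24 for i in range(10)]
cbl = high_x_of_y(days, x=5, y=10)   # averages the 5 highest days (1.5..1.9)
```

With X = 5 and Y = 10 this mirrors the NYISO variant described above; PJM's High4of5 and CAISO's Last10Days differ only in the selection parameters.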
Exponential smoothing uses a weighted average of customers' historical data from the beginning of their subscription, which we include as a second method to compare to our clustering approach. In this method, the weight of each day decreases exponentially with time. New England ISO (ISO-NE) employed this methodology until June 1, 2017. The ISO-NE algorithm used a weighting factor of 90% for the previous day's CBL and 10% for the current day's consumption.
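A minimal sketch of the ISO-NE-style smoothing recursion, assuming hourly profiles and the 90/10 weighting described above (the function name and the choice to seed with the first observed day are illustrative, not the ISO's actual code):

```python
import numpy as np

def smoothed_cbl(daily_loads, weight=0.9):
    """ISO-NE-style exponential smoothing: each day's CBL is 90% of the
    previous day's CBL plus 10% of that day's actual hourly consumption.

    daily_loads: iterable of 24-element hourly profiles, oldest first.
    Returns the CBL after processing all days.
    """
    daily_loads = [np.asarray(d, dtype=float) for d in daily_loads]
    cbl = daily_loads[0].copy()          # seed with the first observed day
    for day in daily_loads[1:]:
        cbl = weight * cbl + (1 - weight) * day
    return cbl

# Two days at 2.0 kWh, then one at 3.0 kWh, in every hour:
cbl = smoothed_cbl([[2.0] * 24, [2.0] * 24, [3.0] * 24])
# final CBL is 0.9 * 2.0 + 0.1 * 3.0 = 2.1 in every hour
```

Because the weight on each past day decays geometrically, the baseline adapts slowly to changes in a customer's habits, which is the intended behavior of this method.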
The third conventional method we compare to our clustering method is regression. Using regression to calculate the baseline, the primary explanatory variable is historical consumption data for different days of the week, incorporating additional parameters, such as temperature, humidity, and time of sunrise and sunset. Electric Reliability Council of Texas (ERCOT) uses this method.
The event day differs from the selected prior days. Therefore, some mechanisms adjust the CBL for the event day. A morning or "same day" adjustment normally uses the 2-4 h before the start of the event. The difference between actual load and the estimated baseline in the adjustment time frame can be applied in two ways, multiplicative or additive. A multiplicative adjustment computes a percentage change and applies it to the estimated baseline; an additive adjustment applies the absolute change. Goldberg and Agnew [11] reported that the choice of multiplicative or additive adjustment did not change the outcome appreciably. In this article, we show the additive adjustment, although we find that it is not a clear improvement over the unadjusted results.
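Both adjustment variants can be sketched as follows (a hypothetical `adjust_baseline` helper; in practice utilities cap or floor these adjustments, which is omitted here):

```python
import numpy as np

def adjust_baseline(baseline, actual, adj_hours, method="additive"):
    """Same-day adjustment over the hours preceding the event.

    baseline, actual: 24-element hourly arrays for the event day.
    adj_hours: indices of the 2-4 pre-event hours used for the adjustment.
    """
    b = np.asarray(baseline, dtype=float)
    a = np.asarray(actual, dtype=float)
    if method == "additive":
        # shift the whole baseline by the average pre-event error (kWh)
        return b + (a[adj_hours] - b[adj_hours]).mean()
    # multiplicative: scale the baseline by the pre-event usage ratio
    return b * (a[adj_hours].mean() / b[adj_hours].mean())

baseline = np.full(24, 2.0)
actual = np.full(24, 2.4)            # customer runs 20% above baseline
pre_event = [11, 12, 13]             # e.g., an 11 am - 2 pm window
add = adjust_baseline(baseline, actual, pre_event, "additive")
mult = adjust_baseline(baseline, actual, pre_event, "multiplicative")
```

For this flat example both variants move the baseline to 2.4 kWh; with non-flat profiles the two can diverge, though Goldberg and Agnew [11] found the difference immaterial.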

Metrics
Most studies of CBL use metrics of hourly accuracy (α) and bias (β). We added a third measure, overall performance index (OPI), that is a weighted average of the other two.

Accuracy
Mean absolute error (MAE), as shown in Equation (1), is a measure of hourly accuracy. A lower MAE indicates higher accuracy, and thus a better prediction of capacity savings, and in turn, a larger reduction in the event-day wholesale price to customers not on the DR program:

\alpha = \frac{1}{|C|\,|D|\,|T|} \sum_{i \in C} \sum_{d \in D} \sum_{t \in T} \left| l_i(d,t) - b_i(d,t) \right|    (1)

where C is the set of customers, D is the set of days, T = {t_1, ..., t_T} is the set of timeslots within a day, l_i(d,t) is the actual load of customer i ∈ C on day d at timeslot t ∈ T, and b_i(d,t) is the predicted baseline of customer i ∈ C on day d at timeslot t ∈ T.

Electronics 2020, 9, 570

Bias

Equation (2) defines bias. A positive bias indicates overestimation of the customers' actual consumption, resulting in the electricity provider overpaying for capacity reductions. A negative bias leads to underestimation of capacity savings, which reduces electricity provider payments, to the advantage of the provider but to the disadvantage of customers on the program:

\beta = \frac{1}{|C|\,|D|\,|T|} \sum_{i \in C} \sum_{d \in D} \sum_{t \in T} \left( b_i(d,t) - l_i(d,t) \right)    (2)

Overall Performance Index (OPI)

We create an overall error performance measure that is the weighted sum of the absolute value of accuracy and bias, as shown in Equation (3). A lower OPI means that the CBL is more capable of measuring the customer's response to the price incentives in the pertinent DR program:

OPI = \lambda\,|\alpha| + (1 - \lambda)\,|\beta|    (3)

As we have no a priori reason to weight one measure more than the other, we select λ = 0.5, to give equal weight to the absolute value of accuracy and bias.
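Under the definitions above, the three metrics can be computed for one event day as follows (a sketch assuming loads arranged as a customers-by-hours array; variable names are ours, not the paper's):

```python
import numpy as np

def cbl_metrics(actual, baseline, lam=0.5):
    """Hourly accuracy (MAE), bias, and OPI for a set of event hours.

    actual, baseline: arrays shaped (customers, hours).
    Positive bias means the baseline overestimates actual consumption,
    i.e., the provider overpays for capacity reductions.
    """
    actual = np.asarray(actual, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    err = baseline - actual
    mae = np.abs(err).mean()        # accuracy (alpha): lower is better
    bias = err.mean()               # beta: sign shows over/under-payment
    opi = lam * abs(mae) + (1 - lam) * abs(bias)
    return mae, bias, opi

# Two customers, two hours: errors of +0.5 and -0.5 cancel in bias
actual = np.array([[2.0, 3.0], [1.0, 2.0]])
baseline = np.array([[2.5, 2.5], [1.5, 1.5]])
mae, bias, opi = cbl_metrics(actual, baseline)
```

The toy example shows why both metrics are needed: offsetting errors yield zero bias even though the hourly MAE is 0.5 kWh, so OPI (here 0.25) penalizes the remaining inaccuracy.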

CBLs: Residential Studies and Clustering Approaches
Two Maryland-vicinity utilities offer residential programs using CBLs. Pepco [12] investigated a critical peak rebate program for residential customers. Following the methodology of PJM, the company used High XofY, with X = 3 and Y = 30. The company examined the accuracy of the average CBL for a variety of scenarios of between 4 and 15 event days for the hours of noon to 8 pm. They found the CBL to have only a small error, using a measure similar to MAE. Remarkably, there was no bias, meaning that the CBL accurately predicted the average hourly demand. However, the study did not report CBLs for individual customers. With a payout of $1.25/kWh for each kWh of reduction below the baseline, substantial revenue is at stake if individual customer baselines are not accurate measures of demand.
Baltimore Gas and Electric (BG&E) conducted a pilot program from 2008 through 2011, recruiting participants as well as a control group. BG&E opened the program to one million customers in 2015, offering $1.25 per kWh for reductions below the baseline. Because customers are automatically enrolled (the default is opt-out rather than opt-in), some customers who make no changes in their behavior will collect payments. The company estimates load savings equivalent to one coal plant, but critics note that it does not appear to have accounted for free riders who randomly decreased load rather than decreasing load in response to the PTR program (AEE Institute [13]).
Wijaya et al. [14] examined DR baselines for residential customers from the Irish Commission for Energy Regulation (CER) smart metering trial dataset. They found low XofY had the highest accuracy, but the largest negative bias, meaning that the baseline was below actual use. Ordinarily, a negative bias reduces customer revenue, but increases utility revenue. However, in the long run, utility revenues may eventually decrease, as customers are more likely to discontinue their participation if they receive lower payments.
Using actual smart meter data for 66 residential customers of the Australian energy company AGL, Jazaeri et al. [15] analyzed five CBL estimation methods: High XofY, Last Y days, regression, neural networks, and polynomial extrapolation. They found that the neural network produces the best results in terms of bias, while polynomial extrapolation provides the smallest estimation error.
Hatton et al. [16] analyzed an experimental DR program of about 700 customers from Electricité de France, using a randomized controlled trial (RCT) to create a suitable control group to measure the DR effect. Unlike CBL, the method is not subject to gaming. RCT aims to minimize the distance between the selected control group's load curve and the load curve of DR participants. The results showed a significant improvement compared to conventional methods. They also found that an increase in the size of the control group improved accuracy. However, while Mohajeryami and Cecchi [9] find that RCT shows acceptable performance for individual loads, traditional methods perform better for aggregate loads. RCT also requires many customer characteristics that utilities do not typically collect to create the best control group. Furthermore, in its current form, RCT employs a static view, unable to portray the dynamics of evolving behavior.
Zhang et al. [7] clustered urban customers in the southern U.S. to examine event days from June to September. The research proposed a cohort-based CBL approach using k-means clustering, according to the power consumption profiles of DR and control-group customers. DR baselines were set by cluster, using electricity consumption and demographic data. The calculation compared usage for the two groups on three similar days with the same temperature. Among the methods considered in their article, including conventional CBL methods, the proposed cohort-average method based on k-means clusters consistently showed the best results.
The primary hurdle surrounding CBL calculation with clustering involves defining the key attributes to use for clustering, which is particularly difficult for residential customers, due to high stochasticity and irregular patterns of consumption. In the sections to follow, we present a novel clustering method used to model residential customers and CBL.

A New Method of Clustering Residential Consumers and Determining CBL
Mohajeryami et al. [17] incorporated the effect of the inherent inaccuracy of the CBL, due to its counterfactual nature when applied to heterogeneous residential customers. We build upon those results to examine the impact of the proposed clustering method.

Stochasticity
We develop a measure of the predictability of residential demands, with the purpose of clustering customers of similar size and habit. To carry out the task, we deploy a discrete Fourier transform (DFT) to decompose customer loads into their underlying components. The DFT is a tool that transforms a time-domain signal (demand over time) into a frequency-domain signal. Using the DFT, it is possible to sort loads based on the frequency of repetition of their underlying components. We then compare the accuracy of our method to other methods by calculating MAE, bias, and OPI, with and without same-day adjustments.

Frequency Response Analysis
Frequency-domain analysis requires decomposing the response into underlying predictable and random components (white noise), accomplished using the discrete Fourier transform (DFT). The DFT transforms a finite sequence of data (N samples separated by the sampling time) into coefficients of sinusoids, ordered as a complex-valued function of frequency. After decomposing a time-domain signal into its underlying components, it is possible to search for any recurring pattern or periodicity in the initial time-domain data. The DFT determines the magnitude of each periodic component, showing the components' relative strengths. In this section, we treat the hourly electricity consumption of each customer for one year as a signal. Combining the DFT and two filters, the method sorts components into high- and low-frequency signals. The discrete Fourier transform and its inverse are shown in Equations (4) and (5):

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad k = 0, \ldots, N-1    (4)

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi k n / N}, \quad n = 0, \ldots, N-1    (5)

where N is the number of samples in the signal. The DFT outcome has N components, each with a different frequency, listed in monotonically increasing order. Because the input signal is real-valued, the DFT outcome mirrors half of the data points, so only half of the information is necessary, as expressed in Equation (6):

X(k) = X^{*}(N - k)    (6)

where the operator (*) refers to conjugation (the complex conjugate of a complex number has the same real part and an imaginary part equal in magnitude but opposite in sign; for example, the complex conjugate of a + bi is a − bi, where a and b are real numbers and i is the imaginary unit, the square root of −1). One can calculate the frequency resolution of the DFT outcome signal with Equation (7):

f_r = \frac{1}{N\, T_s}    (7)

where f_r refers to the frequency resolution and T_s refers to the time resolution in seconds. We use a time resolution of 3600 s (1 h); given the 8784 h in a leap year, the frequency resolution is 31.6 nHz (Hertz (Hz) is a measure of frequency equal to one cycle per second; a nanohertz (nHz) is 10^−9 Hz, i.e., one cycle per 10^9 s).
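The frequency-resolution arithmetic can be checked numerically with NumPy's real-input DFT helpers (a small sketch; the variable names are ours):

```python
import numpy as np

N, Ts = 8784, 3600.0          # hourly samples in a leap year; 1-h sampling
freqs = np.fft.rfftfreq(N, d=Ts)   # one-sided DFT frequency axis, in Hz

f_res = freqs[1] - freqs[0]        # bin spacing = 1 / (N * Ts)
# ~3.16e-8 Hz = 31.6 nHz, matching the resolution quoted in the text
f_cut_12h = 1.0 / (12 * 3600)      # 12-h cut-off, ~23.1 uHz (used below)
```

Using the one-sided axis (`rfftfreq`) already reflects the mirror symmetry of Equation (6) for real-valued signals.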

Filters
To decompose the consumption signals into their underlying components, we created two filters, high-pass and low-pass, around a cut-off frequency (f_c). These two filters separate a frequency-domain signal into high- and low-frequency components. The cut-off frequency separates the predictable (low-frequency) and unpredictable (high-frequency) components of the consumption signal. Figure 1 illustrates the filters; f_c is the cut-off frequency, and f_s is the sampling frequency, which is 277.7 µHz (1 µHz = 10^−6 Hz). Because the frequency-domain signal mirrors itself symmetrically, the low- and high-frequency filters must also imitate this characteristic.

The cut-off frequency is 23.1 µHz (equivalent to 12 h in the time domain). The frequency of almost all spontaneous daytime activities in the residential sector is under 12 h. In contrast, some industrial activities have 24-h cycles, in which case 11.5 µHz would capture a 24-h time domain. By applying the filters to the frequency-domain consumption signal, we obtain high- and low-frequency signals. An inverse DFT can then be applied to these frequency-domain signals to reconstruct their high- and low-frequency counterparts in the time domain. Figure 2 shows the time-domain signals of the aggregated residential customers. For aggregated residential customers, the high-frequency components almost double for a cut-off hour of 24 compared to a cut-off hour of 12. This pattern means that the amount of electricity consumed by activities with a period of 1-12 h is almost equal to the amount consumed by activities with a period of 13-24 h.
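The filtering step can be sketched with NumPy's one-sided DFT, which already exploits the mirror symmetry noted in Equation (6) (a simplified sketch; the helper name, the cut-off handling, and the synthetic test signal are illustrative):

```python
import numpy as np

def split_by_cutoff(load, ts=3600.0, cutoff_hours=12):
    """Split an hourly load signal into low-frequency (predictable) and
    high-frequency (unpredictable) components around a cut-off period.

    load: 1-D array of hourly consumption values.
    Returns (low, high) time-domain signals with low + high == load.
    """
    load = np.asarray(load, dtype=float)
    spectrum = np.fft.rfft(load)              # one-sided DFT (real input)
    freqs = np.fft.rfftfreq(load.size, d=ts)  # Hz
    f_cut = 1.0 / (cutoff_hours * 3600.0)     # 12 h -> ~23.1 uHz

    low_spec = np.where(freqs <= f_cut, spectrum, 0)  # low-pass filter
    high_spec = spectrum - low_spec                   # complementary high-pass
    return (np.fft.irfft(low_spec, n=load.size),
            np.fft.irfft(high_spec, n=load.size))

# One month of hours: a 24-h daily cycle plus noise; the daily cycle
# survives the low-pass, most of the noise lands in the high-pass part.
t = np.arange(31 * 24)
rng = np.random.default_rng(0)
load = 1.5 + 0.5 * np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
low, high = split_by_cutoff(load)
```

Because the two filters are exact complements, the reconstructed components sum back to the original signal, mirroring the inverse-DFT step described in the text.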
As shown in Figure 3, the share of high-frequency components is much higher than that of low-frequency components for individual residential customers. Residential customers, due to the diversity and spontaneity of their household activities, have highly fluctuating consumption signals. Unlike for industrial and commercial customers, weekends do not have a major impact on residential customers' routines. This observation is very important because, typically, many load reduction calculation methods exclude weekends from their process. The exclusion of weekends is not necessary for residential customers.
The high-frequency components for a cut-off hour of 24 show only a small increase compared to a cut-off hour of 12, which indicates that most of the high-frequency activities of residential customers lie in the range of 1-12 h. Comparing the customer's low- and high-frequency components for a cut-off hour of 24, the low-frequency component is minimal compared to the high-frequency component. Therefore, we conclude that most residential activities have periods of less than 24 h. Spontaneity governs residential customers, who typically do not follow schedules for their daily activities.

Predictability Analysis
We propose an index to find and quantify similarity among customers' consumption signals. This index, called the predictability index, reflects the ratio of the low-frequency components to the whole of the original consumption signal. The index is calculated by summing the share of the high-frequency components and subtracting the value from one, as shown in Equation (8):

P_{index} = 1 - \frac{\sum_k |X_{high}(k)|}{\sum_k |X(k)|}    (8)

where X_high denotes the high-frequency components of the consumption signal's DFT, X. Table 1 compares individual and aggregated residential demands. The randomness of individual residential customer demands makes it very difficult to estimate the CBL accurately from customers' historical data. Mohajeryami [18], p. 54, shows that industrial and commercial predictability indexes are much higher than those of individual residential customers but comparable to those of aggregated residential customers; aggregating residential customers thus yields much higher predictability than treating them individually.
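A sketch of the index computation, under the assumption that "share" means the high-frequency fraction of total spectral magnitude (one plausible reading of the verbal definition; the paper's exact normalization may differ):

```python
import numpy as np

def predictability_index(load, ts=3600.0, cutoff_hours=12):
    """P index: one minus the share of high-frequency (unpredictable)
    content in the consumption signal's spectrum.

    The 'share' here is high-frequency spectral magnitude relative to
    total spectral magnitude, an assumed reading of Equation (8).
    """
    load = np.asarray(load, dtype=float)
    spectrum = np.fft.rfft(load - load.mean())   # drop the DC component
    freqs = np.fft.rfftfreq(load.size, d=ts)
    f_cut = 1.0 / (cutoff_hours * 3600.0)

    total = np.abs(spectrum).sum()
    high_share = np.abs(spectrum[freqs > f_cut]).sum() / total
    return 1.0 - high_share

# A pure 24-hour cycle is fully predictable at a 12-hour cut-off,
# while adding white noise pushes the index down.
t = np.arange(31 * 24)
base = 1.5 + 0.5 * np.sin(2 * np.pi * t / 24)
p_smooth = predictability_index(base)
p_noisy = predictability_index(
    base + 0.3 * np.random.default_rng(1).standard_normal(t.size))
```

By construction the index lies in [0, 1]: a customer whose consumption is dominated by daily and slower cycles scores near 1, while a customer with highly spontaneous usage scores near 0.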

Data
We divide the data into four seasons to see the impact of seasonality. Seasons in Australia are as follows:
• Spring: September, October, and November.
• Summer: December, January, and February (note that since we have data for one year, January and February of that year precede December of that year, so the "summer" months are not consecutive).
• Fall: March, April, and May.
• Winter: June, July, and August.
The total consumption for all four seasons is depicted in Figure 4. Residential customer household activities are diverse compared to those of large industrial and commercial customers, because many random parameters influence their electricity consumption. Because randomness in electricity consumption is a main source of error in CBL calculation, irregular residential activities can adversely affect DR performance. To improve DR performance, we and others have proposed load aggregation, i.e., grouping customers into different clusters, to mitigate the randomness.
One proposed approach to improve performance is to use k-means clustering for assigning customers into different groups. Our novel k-means method uses the predictability index to aid the clustering. We will demonstrate that the customers' P index highly correlates with the CBL's error performance. This correlation is significant for two reasons: first, we can use the P index to demonstrate the upper limits of the error performance of the CBL calculation for each customer. Second, we can employ the P index to cluster the customers into different groups, based on their error performance.
To run a k-means algorithm, a few initial points must be randomly assigned. These points are called cluster centroids, and their number equals the preferred number of clusters. K-means is an iterative algorithm. Its two primary steps are:
• Assigning each observation to its nearest centroid.
• Moving each centroid to the mean of its assigned observations, to minimize an objective function.
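The two steps above can be sketched as a minimal k-means implementation (illustrative only; production work would use a library implementation with multiple restarts, and the sample points are made-up (P index, average hourly load) pairs):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: alternate (1) assigning each point to the nearest
    centroid and (2) moving each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centroids = points[rng.choice(points.shape[0], k, replace=False)]
    for _ in range(iters):
        # step 1: assign each point to the nearest centroid (squared distance)
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 2: move each centroid to the mean of its assigned points
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break                      # converged: centroids stopped moving
        centroids = new
    return labels, centroids

# Two well-separated groups of (P index, average hourly load) pairs:
pts = np.array([[0.20, 1.0], [0.25, 1.1], [0.80, 3.0], [0.85, 3.2]])
labels, cents = kmeans(pts, k=2)
```

Each iteration can only decrease the within-cluster sum of squares, so the loop terminates either at convergence or at the iteration cap.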
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering partitions the n observations into k (≤ n) sets S = {S1, S2, ..., Sk}, so as to minimize the within-cluster sum of squares (WSS), the sum of squared distances from each point in a cluster to that cluster's centroid, shown in Equation (9):

arg min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²,    (9)

where μi is the mean of the points in Si. K-means clustering is an unsupervised learning algorithm that partitions a given data set into a number of clusters (k) fixed a priori. The "elbow criterion" is one way to select the number of clusters, but it is not always unambiguous (the reference at the link shows two patterns, the first with a clear elbow and the second without: https://bl.ocks.org/rpgove/0060ff3b656618e9136b. Our pattern resembles the second). Figure 5 shows no obvious elbow, so we chose five cluster bins (k = 5). There are many other ways to cluster the customers (Asadinejad et al. [19]); however, k-means clustering has proven to be very efficient.

We cluster customers based on their P index values and simulated event-day average hourly consumption. We select 12 simulated event days (one for each month) for the error analysis, shown in Figure 4 and Table 2. The simulated event hours start at 3:00 p.m. and end at 9:00 p.m. Most of these days are high-use days that would be likely candidates for events in an actual DR program. Figure 6 shows the results of the clustering.

We assigned customers to their respective bins with k-means clustering. For each bin, we compute the CBL for all customers and the average accuracy MAE. In columns 1 through 5 of Tables 3 and 4, we provide each bin's number of customers, average P index, average accuracy MAE of the CBL calculation, and average hourly consumption. To show the relationship between the predictability index and the accuracy, we normalize each bin's accuracy MAE by the event-day average hourly consumption for the 12 simulated event days. As shown in column 5, the average hourly consumption differs across the customers in each bin, so comparing the average accuracy MAE values of each bin (column 4) is misleading without accounting for that difference. Hence, we divide the accuracy MAE in column 4 by the event-day average hourly consumption in column 5 for each bin and list the normalized value in column 6. As shown in Table 3, the normalized accuracy MAE (column 6) decreases (greater accuracy) as the average predictability index increases: a higher predictability index indicates that the CBL predicts more accurately. The correlation between the P index and the normalized MAE (MAE divided by the event-day average hourly consumption, using the 12 event days) is −0.98. The result reveals a robust relationship between the P index value and the accuracy MAE of the CAISO CBL (Last10Days) calculation method.

Note: RCT aims to minimize the distance between the selected control group's load curve and the load curve of DR participants.
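The clustering step itself is standard Lloyd's-algorithm k-means. The following minimal sketch (not the implementation used in this study) clusters customers on the two features we describe, the P index and event-day average hourly consumption; all numeric values are illustrative, not drawn from our dataset.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means: partition the rows of X into k clusters
    by (locally) minimizing the within-cluster sum of squares (WSS)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    wss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, wss

# Hypothetical features: column 0 = P index, column 1 = event-day
# average hourly consumption (kWh) for each customer.
features = np.array([[0.9, 1.2], [0.85, 1.1], [0.3, 2.5],
                     [0.35, 2.4], [0.6, 0.5], [0.55, 0.6]])
labels, centers, wss = kmeans(features, k=3)
```

In practice the two features should be scaled comparably (e.g., standardized) before clustering, since WSS is distance-based.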
We perform the same analysis with the RCT method. The correlation between the P index and the normalized accuracy MAE is −0.88: still a strong correlation, but not as high as that obtained by clustering with the predictability index.
Almost all research on the evaluation, measurement, and verification (EM&V) of CBL calculation methods relies upon two metrics, accuracy and bias, to compare different CBL calculation methods. The strong correlation between the P index and the normalized accuracy MAE suggests the possibility of using the P index as an alternative or complement. Furthermore, we can use the P index as a feature to demonstrate the limitations of CBL calculation methods. If a customer's P index is low, no CBL calculation method is likely to work well. If the P index is high without a satisfactory performance of the CBL method, then the method itself is likely to be inaccurate, and a change in the CBL calculation method may be an effective way to improve error performance.
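The reported correlations (−0.98 for CAISO Last10Days, −0.88 for RCT) are ordinary Pearson coefficients computed across bins. As a sketch, with hypothetical per-bin values chosen only to mimic the observed decreasing pattern:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-bin values: average P index and normalized MAE
# (MAE divided by event-day average hourly consumption).
p_index = [0.25, 0.40, 0.55, 0.70, 0.85]
norm_mae = [0.42, 0.35, 0.27, 0.20, 0.15]
r = pearson(p_index, norm_mae)
```

A strongly negative r reproduces the qualitative finding: bins with higher predictability have lower normalized error.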

Error Analysis before Using the Predictability Index
To examine how load aggregation improves the error performance of CBL estimation, we initially applied a simple random load aggregation to the dataset. This part of the analysis uses the consumption data for 189 customers over 91 consecutive days, from October 1st to December 31st, 2012. We selected December 24th (day 359) as a proxy for an event day, because that day had the maximum consumption in this partial dataset. The event hours start at 3:00 p.m. and end at 9:00 p.m. Table 5 applies the three metrics to the CBL calculations and their (additively) adjusted forms before load aggregation; the calculated values are for the event hours of 3 p.m. to 9 p.m. According to the results, the morning adjustment reduces the accuracy of each CBL method. However, for the OPI values, the adjustment improved the results substantially for the ISO-NE and regression methods. The adjustment did not improve the OPI of NYISO.
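The averaging family of CBL methods with an additive same-day adjustment, and the two standard error metrics, can be sketched generically as follows. This is a simplified illustration, not any ISO's exact tariff rule, and the flat synthetic loads are purely illustrative.

```python
def averaging_cbl(history_days, event_day, adj_hours):
    """Hour-by-hour mean of prior comparable days, shifted by an
    additive same-day adjustment computed over pre-event hours.
    history_days: list of 24-value daily loads; event_day: 24 values."""
    n = len(history_days)
    cbl = [sum(day[h] for day in history_days) / n for h in range(24)]
    adj = sum(event_day[h] - cbl[h] for h in adj_hours) / len(adj_hours)
    return [v + adj for v in cbl]

def mae(actual, estimate, hours):
    """Accuracy: mean absolute error over the event hours."""
    return sum(abs(actual[h] - estimate[h]) for h in hours) / len(hours)

def bias(actual, estimate, hours):
    """Bias: signed mean error; positive values over-compensate."""
    return sum(estimate[h] - actual[h] for h in hours) / len(hours)

# Flat synthetic loads: ten history days at 1.0 kWh per hour, and an
# event day running uniformly 20% hotter.
history = [[1.0] * 24 for _ in range(10)]
event = [1.2] * 24
event_hours = range(15, 21)  # 3 p.m. to 9 p.m.
est = averaging_cbl(history, event, adj_hours=range(12, 15))
```

With these loads the adjustment absorbs the uniform shift, so the event-hour MAE and bias are zero; without it, the MAE would be 0.2. Real loads shift non-uniformly, which is why the adjustment can either help or hurt, as Table 5 shows.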

Error Analysis after Random Load Aggregation
We next apply a random load aggregation to the sample data, to examine whether load aggregation and random clustering improve the accuracy, bias, and OPI values of the CBL calculation methods provided in Table 6. Figure 7 shows that simple load aggregation and clustering clearly improve accuracy, while leaving bias unaffected. The additive same-day adjustment generally has a positive effect on accuracy; NYISO is the exception.

Table 7 shows the sensitivity of OPI to the number of clusters. As the number of customers in each cluster increases, the OPI generally improves for all three methods. According to the results, even grouping two customers in a cluster can improve OPI by 13.99%, 12.73%, and 11.80% for the NYISO, ISO-NE, and regression methods, respectively. These values generally increase as the number of customers in each cluster increases, except for the case of two clusters with 100 customers per cluster.
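The intuition behind this pattern is error cancellation: independent per-customer deviations partially cancel when loads are aggregated, shrinking the per-customer error roughly as 1/√k. The toy simulation below illustrates this with synthetic loads (not our dataset): every customer's true hourly load is 1.0 and the CBL predicts it exactly, so the only error is independent noise.

```python
import random

def per_customer_mae(n_customers, group_size, n_hours=24, noise=0.5, seed=1):
    """Toy model: group customers into clusters of `group_size`, score
    each cluster on its aggregated load, and normalize back to a
    per-customer scale.  Noise is i.i.d. Gaussian per customer-hour."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_customers // group_size):      # one pass per group
        for _ in range(n_hours):
            # aggregated deviation of the group from its aggregated CBL
            dev = sum(rng.gauss(0.0, noise) for _ in range(group_size))
            errors.append(abs(dev) / group_size)    # per-customer scale
    return sum(errors) / len(errors)

mae_single = per_customer_mae(200, 1)    # no aggregation
mae_group10 = per_customer_mae(200, 10)  # clusters of ten
```

Under this model the expected per-customer error is noise·√(2/π)/√k, so clusters of ten should cut it by roughly a factor of √10. Real loads are not independent across customers, which is why empirical gains such as those in Table 7 are smaller and can plateau.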

Error Analysis for the Proposed CBL Estimation Method
We now apply the proposed CBL estimation method to the dataset. After obtaining all predictability indices and applying k-means clustering, the data are organized into five bins (k = 5), as shown in Figure 6. Visual inspection suggests that customers in bins 4 and 5 have the highest variability, while bin 1 has the lowest. Actual variability is based upon the box plots in Appendix A, Figure A1, obtained for each season from the aggregated data for each bin. Although different seasons show slight variations, the overall patterns are similar. And while customers might vary their behavior over time, individual residential customers follow similar patterns most of the time. The loads show the highest variability around noon.
Next, we estimate the CBL for each group and calculate the error performance. We utilize two CBL methods, CAISO and PJM. Table 8 lists the error performance results for the CAISO method, and Table 9 lists the error results for the PJM method. These results show the average MAE over all 12 events. We ran five random combinations of customers in each bin and then averaged the MAE, so as to remove the possibility that the results were specific to a single random draw. For example, bin 1 has 35 customers. For the entry k = 2, 17 (⌊35/2⌋) groups of two customers are created randomly, five separate times. Results of the five random combinations are averaged and shown in Tables 8 and 9. In this case, because 17 × 2 = 34, information from one customer (35 − 34 = 1) is not used in each of the five combinations (not necessarily the same customer).
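The grouping-and-averaging procedure can be sketched as follows. The per-customer MAE values are hypothetical, and the group score is stubbed as a simple mean of its members' MAEs; the actual analysis recomputes MAE on the groups' aggregated loads.

```python
import random

def averaged_group_mae(customer_maes, k, n_draws=5, seed=0):
    """Shuffle customers, form floor(n/k) groups of k (leftover
    customers unused in that draw), score each group, and average the
    scores across n_draws independent random draws."""
    rng = random.Random(seed)
    draw_scores = []
    for _ in range(n_draws):
        ids = list(range(len(customer_maes)))
        rng.shuffle(ids)
        n_groups = len(ids) // k
        group_scores = []
        for g in range(n_groups):
            members = ids[g * k:(g + 1) * k]
            # stub: real analysis computes MAE on the aggregated load
            group_scores.append(sum(customer_maes[i] for i in members) / k)
        draw_scores.append(sum(group_scores) / n_groups)
    return sum(draw_scores) / n_draws

# e.g., bin 1 with 35 customers and k = 2: 17 groups per draw, with
# one randomly chosen customer left unused in each draw
maes = [0.5 + 0.01 * i for i in range(35)]
result = averaged_group_mae(maes, k=2)
```

Averaging over several draws keeps a single unlucky grouping (or the identity of the leftover customer) from driving the reported number.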
For both CAISO and PJM, MAE tends to decrease as the number of customers in a cluster increases. For CAISO, we observed the largest decrease in MAE in bin 1 (54.122%) and the smallest in bin 4 (48.409%). We found similar effects for PJM, with the largest MAE decrease in bin 1 (54.103%) and the smallest in bin 5 (45.751%). The decrease is larger for the less variable bins and smaller for the more variable bins. The bias is not sensitive to the number of customers in a cluster, and there seems to be no relationship between a bin's variability and its bias value. Appendix B, Tables A1 and A2, contain the bias values for CAISO and PJM.

Comparing the Predictability Method with Random Load Aggregation
The predictability approach improves upon random load aggregation. Although random load aggregation considerably improved the error performance of CBL estimation methods, our method further improves demand response calculations. Table 10 lists the results of both random load aggregation and the predictability method for CAISO and PJM. According to Table 10, for CAISO the average accuracy MAE over all k's is 0.601 with random aggregation and 0.580 with the predictability method, a 3.43 percent improvement in MAE accuracy with our approach. For PJM, the average MAE accuracy over all k's is 0.616 with random aggregation and 0.527 with the predictability method, equivalent to a 14.42 percent improvement. The improvement is larger for PJM; the combined effect of the load aggregation and the clustering based on the predictability index causes this improvement. If the proposed clustering method based on the predictability index incurs a large administrative cost, random aggregation still achieves a significant improvement. In large DR programs, where even a small improvement in accuracy can translate into a significant financial gain, clustering will be beneficial.

Conclusions
Demand response programs need reliable evaluation, measurement, and verification (EM&V) methods. One of the major EM&V challenges in DR programs is an accurate estimation of the load reduction, particularly for residential customers, whose demand exhibits large random variation. To ascertain the load reduction for residential DR programs, it is essential to estimate the customer baseline load (CBL) accurately. The necessity of CBL calculation introduces several challenges to DR programs, including random variations in demand and heterogeneous patterns. To address these challenges, we propose k-means clustering based on demand size and predictability. We first show the theoretical merits of this proposed method. We then provide an empirical analysis of residential customers from an Australian electricity provider, showing that clustering customers according to size and predictability improves the error performance. The improvement over random clustering ranges from approximately 3% to 14%, with larger gains relative to no clustering at all.
The new predictability clustering method provides a meaningful improvement over random clustering. The method produces a lower mean absolute error (MAE), indicating higher accuracy and better capacity predictions. Our method found a strong correlation between MAE and predictability.
We also consider demand variability. Variability increases MAE. Larger group sizes produce lower MAE, but the decrease is largest for the cluster with the least variability and smallest for the one with greatest variability.
Overestimation and underestimation (positive and negative bias, respectively) over- or under-compensate customers. Neither the number of customers in a bin nor their variability had a noticeable impact on bias.
Future work can explore alternative methods. The wavelet transform (WT) offers more robust handling of time-domain data with variable frequency. WT decomposes a signal into a set of functions consisting of expansions, contractions, and translations of a single primary function, referred to as a wavelet. The finer division of frequencies can provide higher resolution for analysis, potentially increasing accuracy. Combining predictability methods with randomized controlled trials (RCT) may increase the accuracy of RCT, with a less administratively burdensome data set required for the predictability methods.
Larger data sets should achieve higher levels of predictability and accuracy. If more detailed customer data are available, characteristics such as income, education, and household size may affect predictability. To increase accuracy further, future research can align customer characteristics with predictability assumptions, to decrease error and detect correlations with the size-predictability index.
Actual DR programs likely maintain the information necessary to test the theory. These data will allow one to test CBL accuracy, as well as the revenue of the utility. The successful implementation and development of predictability methods warrants testing the theory in a variety of conditions, such as differing climates and differing electricity market structures. In particular, actual DR programs can provide the opportunity to minimize the effects of customer stochasticity, so as to avoid "paying for nothing."

Appendix A. Box Plot for Bins 1 through 5 of All Customers for Each of 24 h
Oregon, and participants at the 2017 Hellenic Association of Energy Economics in Athens, Greece. We are especially grateful for the insightful comments of Ken Gillingham at the 2017 USAEE/IAEE conference in Houston, Texas. We also thank the workshop participants at Davidson College, Davidson, North Carolina. Michael-Paul "Jack" James, Janine Rodrigues Rangel de Assis, and Roozbeh Karandeh provided excellent research assistance. Peter Schwarz acknowledges funding from The Belk College of Business Summer Grant program.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B. Bias Values for CAISO and PJM
These results are the average bias over the 12 events. To remove the possibility that the results depend on a particular random grouping of customers, five random combinations of customers in each group are selected, and the reported values are averaged for each event.
