A Distributed PV System Capacity Estimation Approach Based on Support Vector Machine with Customer Net Load Curve Features

Most distributed photovoltaic systems (DPVSs) are normally located behind the meter and are thus invisible to utilities and retailers. The accurate information of the DPVS capacity is very helpful in many aspects. Unfortunately, the capacity information obtained by the existing methods is usually inaccurate due to various reasons, e.g., the existence of unauthorized installations. A two-stage DPVS capacity estimation approach based on support vector machine with customer net load curve features is proposed in this paper. First, several features describing the discrepancy of net load curves between customers with DPVSs and those without are extracted based on the weather status driven characteristic of DPVS output power. A one-class support vector classification (SVC) based DPVS detection (DPVSD) model with the input features extracted above is then established to determine whether a customer has a DPVS or not. Second, a bootstrap-support vector regression (SVR) based DPVS capacity estimation (DPVSCE) model with the input features describing the difference of daily total PV power generation between DPVSs with different capacities is proposed to further estimate the specific capacity of the detected DPVS. A case study using a realistic dataset consisting of 183 residential customers in Austin (TX, U.S.A.) verifies the effectiveness of the proposed approach.


Background and Motivation
Grid-connected distributed photovoltaic systems (DPVSs) are playing an increasingly significant role as an electric supply resource. Worldwide rooftop photovoltaic (PV) installations reached nearly 40 GW by the end of 2017, twice as much as in 2016 [1]. The International Energy Agency (IEA) estimates that the total distributed PV capacity installed in 2016 in the U.S., Germany, Sweden, Spain and China was 4169 MW, 1225 MW, 70.7 MW, 36.1 MW and 4230 MW, respectively [2]. In particular, DPVS installations in China are predicted to grow by 20 GW per year from 2017 to 2020.
A high penetration of DPVSs in the grid could have significant impacts on many aspects, including demand response (DR) capacity estimation [3][4][5][6], customer baseline load estimation [7,8], load forecasting [9,10] and distribution network planning [11,12]. Specifically, in terms of DR capacity estimation, DR aggregators need to estimate how much DR capacity they have in order to formulate a reasonable bidding strategy in the electricity market. The DR capacity is highly related to the load profiles. DPVS installations significantly affect the load profiles and thereby the available DR capacity. Large errors will occur in DR capacity estimation if the latest DPVS capacity information is unknown. Similarly, a lack of knowledge about the installed DPVSs will also lead to biased baseline load estimates, high load forecasting errors and improper network planning.
Most DPVSs are located behind the meter, and their real online operating capacity cannot be easily obtained from existing measurements. Hence, there is a growing need among utilities, retailers and third-party companies (e.g., DR aggregators) to obtain high-quality capacity information. However, the existing methods for collecting capacity information, such as original installation records or surveys, are usually incomplete, costly and time consuming [13]. Moreover, the information obtained by the existing methods is usually inaccurate for the following reasons: (1) the existence of unauthorized DPVSs, which have no installation records (for example, a large number of unauthorized DPVSs were recognized by Hawaii Electric Company in 2014 [14]); (2) DPVS capacity expansion without permission applications; (3) DPVSs that have totally or partly stopped running due to faults. These unauthorized, expanded and faulty DPVSs are invisible from the measurements, and thus it is difficult to obtain accurate capacity information by the existing methods. Therefore, it is essential to explore new methods to realize reliable and accurate DPVS capacity estimation (DPVSCE).

Literature Review
Research on DPVSCE has started to receive attention only in the past few years. To date, very few papers have attempted to estimate the capacity of each individual DPVS. Malof et al. presented an object detection algorithm for automatically identifying the location and size of small-scale solar PV arrays using high-resolution color satellite imagery [15]. This work demonstrates the feasibility of collecting distributed PV information over large areas using aerial or satellite imagery; however, some limitations can be observed. First, manual annotations of true rooftop PV locations are needed for training and validation, which is time-consuming. Second, it is not easy to obtain up-to-date high-resolution (≤0.3 m) color satellite imagery in practice. More importantly, this method can only roughly estimate the physical size of PV arrays and is unable to estimate the specific capacity, because PV arrays of the same size could have different capacities.
Zhang et al. proposed a data-driven approach for the detection, verification, and estimation of residential PV systems based on smart meter data [16]. The authors first adopted a change-point detection algorithm to identify potential unauthorized PV installations and then used statistical inference to further verify them. Finally, a local cloud cover index was integrated with smart meter data to estimate the capacity of the PV system. This method is effective only under the impractical assumption that the characteristics of all other loads remain almost unchanged before and after an unauthorized PV installation. It also needs the load data before and after the PV installation, which makes it inapplicable if the load data before installation are unavailable.

Contributions
In order to address the above problems, a two-stage DPVS capacity estimation approach based on support vector machine (SVM) with customer net load curve features is proposed in this paper. The main contributions of this paper are summarized as follows:

1. A one-class support vector classification (SVC)-based DPVS detection (DPVSD) model with input features extracted according to the unique weather status driven characteristic of DPVS output power is proposed to distinguish customers with DPVSs from those without. This model can not only accurately detect the existence of a DPVS, but also reliably distinguish load components showing output characteristics similar to those of a DPVS, such as electric vehicles and energy storage devices. It is therefore robust to interference from the load components that are most likely to be misrecognized by other methods.

2. A bootstrap-support vector regression (SVR)-based DPVSCE model with input features describing the difference in daily total PV power generation between DPVSs with different capacities is proposed to further estimate the specific capacity of the detected DPVS. This model maintains stable performance with limited training samples and imbalanced datasets.

3. The effectiveness of the proposed approach is verified on a realistic dataset. Furthermore, the robustness of the model under several scenarios (e.g., the existence of storage devices, different lengths of available historical data) is analyzed and discussed.

Structure of This Paper
The rest of the paper is organized as follows: Section 2 illustrates the problem formulation and presents the framework of the proposed method. In Section 3, a one-class SVC-based DPVSD model is presented. In Section 4, a bootstrap-SVR-based DPVSCE model is proposed to further estimate the capacities of the detected DPVSs. In Section 5, a case study is presented to verify the effectiveness of the proposed approach. Section 6 tests the robustness of the proposed approach under different scenarios. Section 7 presents the concluding remarks and future work.

Problem Statement
Assume the smart meter collects historical electricity load data for several days, denoted by the set D = {d | d = 1, 2, ..., D}. Each day is divided into several timeslots, denoted by T = {t | t = 1, 2, ..., T}. The net load power on day d at timeslot t can be expressed using Equation (1):
$$P_{net}(d,t) = P_{use}(d,t) - P_{pv}(d,t) \qquad (1)$$
where $P_{use}(d,t)$ and $P_{pv}(d,t)$ are the actual load power and the PV output power on day d at timeslot t.
For customers without DPVSs, the PV output power is equal to 0, i.e., $P_{pv}(d,t) = 0, \ \forall d \in D, t \in T$.
For authorized customers, utilities usually know the detailed information about their DPVSs, including the capacity, location and installation date. The aim of this paper is to detect whether other customers have installed DPVSs and to further estimate their capacities, which can be formulated as a supervised learning problem.

Framework of the Proposed Approach
Due to possible interference from other load components that exhibit output characteristics similar to those of a DPVS, such as electric vehicles and energy storage systems, it is difficult to estimate the capacity of a DPVS directly. Hence, the proposed DPVSCE approach is split into two stages, named detection and estimation. The framework of the proposed approach is shown in Figure 1 and the details are illustrated in the following sections.

One-Class SVC Based DPVSD Model
The significant distinction between a DPVS and other load components is that the output power of a DPVS is weather status driven, while that of other load components is not. In this section, several features utilizing this distinction are extracted to distinguish the DPVS from other load components. A one-class SVC model with the extracted features is then used to determine whether a customer has a DPVS or not.

Generalized Weather Classes
To depict the weather status driven characteristic of DPVS output power, the first step is to perform weather status classification. Four generalized weather classes (GWCs), denoted by GWC-A, B, C and D and covering all weather types, were described in our previous work [17]. Different GWCs correspond to different PV output power levels. Historical weather type data are usually needed to generate the GWC label of each day. In order to make the proposed approach independent of any external weather type data resource, a voting-based GWC label generating method is designed using the actual PV output power data of some customers whose output power is known. The flow chart of the GWC label generating method is shown in Figure 2. K-means clustering combined with a voting method generates a GWC label for each day. It should be noted that the PV output data of multiple customers, rather than a single customer, are used to determine the GWC label of each day so as to make the final results more reliable. As such, the GWC label of each day can be obtained. The set D can then be divided into four subsets named $D_A$, $D_B$, $D_C$ and $D_D$ according to the GWC label. Each subset contains all days belonging to the corresponding GWC.
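The voting idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that each known customer's days are clustered on a single value (daily total PV energy) with a plain one-dimensional K-means, that clusters are ranked from highest to lowest centroid to give GWC-A to GWC-D, and that the final label is the majority vote across customers. The function names, the clustering variable and the centroid initialization are hypothetical and are not taken from the paper.

```python
from collections import Counter

def kmeans_1d(values, k=4, iters=50):
    """Plain 1-D k-means; returns a rank per value (0 = highest-centroid cluster)."""
    lo, hi = min(values), max(values)
    # Assumption: centroids start evenly spaced between the min and max value.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assign = []
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    # Rank clusters by centroid, highest PV output first (GWC-A ... GWC-D).
    order = sorted(range(k), key=lambda c: -centroids[c])
    rank = {c: r for r, c in enumerate(order)}
    return [rank[a] for a in assign]

def gwc_labels(daily_energy_by_customer):
    """daily_energy_by_customer: one list of daily PV energy per known customer."""
    per_customer = [kmeans_1d(series) for series in daily_energy_by_customer]
    labels = []
    for day_votes in zip(*per_customer):
        # Majority vote across customers decides the label of the day.
        winner = Counter(day_votes).most_common(1)[0][0]
        labels.append("ABCD"[winner])
    return labels
```

Using several customers means that one customer's atypical day (for example, a partially shaded array) is outvoted by the others.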

Typical Net Load Pattern
The installation of a DPVS reshapes a customer's net load profile. As shown in Figure 3, the net load profiles in GWC-A and GWC-D show different shapes. This difference becomes more obvious as the PV capacity increases, which means that the net load profile contains abundant information about whether the customer has an operating DPVS and about the specific PV capacity. Hence, it can be utilized to perform DPVS detection and estimation.
The main difficulty of using net loads to detect DPVSs is that the other load components also affect a customer's net load shape. Compared with industrial and commercial customers, residential customers show more variable load patterns due to their random electricity consumption behaviors. To mitigate the negative impact of the variability of daily load profiles on the detection and to reveal the typical consumption behaviors of customers, four typical net load patterns (TNLPs) of each customer in the four GWCs are extracted by averaging the daily load profiles in the same GWC, given by Equation (2):
$$TNLP_w(t) = \frac{1}{|D_w|} \sum_{d \in D_w} P_{net}(d,t), \quad w \in \{A, B, C, D\} \qquad (2)$$
where $P_{net}(d,t)$ is the net load power on day d at timeslot t and $|D_w|$ represents the number of elements in set $D_w$. An example of TNLPs in different GWCs is shown in Figure 4. Several features are extracted from these TNLPs in the next section.
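The averaging step behind the TNLPs can be sketched as follows. The data layout (one list of timeslot values per day, one GWC label per day) and the function name are assumptions made for illustration.

```python
def typical_net_load_patterns(net_load, gwc_labels):
    """net_load: list of daily profiles, each a list of T net load values.
    gwc_labels: one label in 'A'..'D' per day.
    Returns {label: profile averaged over all days with that label}."""
    patterns = {}
    for w in "ABCD":
        days = [p for p, g in zip(net_load, gwc_labels) if g == w]
        if not days:
            continue  # no day of this class in the available history
        # Average timeslot by timeslot over the days in this GWC.
        patterns[w] = [sum(col) / len(days) for col in zip(*days)]
    return patterns
```

A class with no days in the history is simply omitted from the result, so a caller should check that both GWC-A and GWC-D are present before computing features that compare them.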

Feature Extraction
As one of the important steps, feature extraction has a large impact on the performance of machine learning methods [18,19]. Suitable features should be able to distinguish customers with DPVSs from those without. It is noted that the TNLPs of customers with DPVSs are considerably different in different GWCs, while the TNLPs of customers without DPVSs are similar across GWCs. This is because the output power of DPVSs is weather status driven, while that of other load components is not. This unique characteristic makes it possible to detect DPVSs using net load data. Therefore, analyzing the difference between TNLPs in different GWCs is the key to extracting suitable features.
To further mitigate the negative influence of load pattern variation on the detection, attention is paid to the time period in which the solar power generation is obvious. Hence, we select a time window $\delta = [t_s, t_e]$ and use the TNLP segment $TNLP(\delta)$ to extract features instead of using the whole TNLP (the time window is set to 9:00 to 16:00 in this paper). Four weather status driven features describing the discrepancy between customers with DPVSs and those without are extracted as follows.

Ratio of Total Electricity Consumption in GWC-A to GWC-D
The amplitudes of TNLPs differ across GWCs, and this difference is particularly obvious between GWC-A and GWC-D. The first feature, denoted by $F_1$, is the ratio of the absolute value of the total consumption during the time window in GWC-A to that in GWC-D, as defined in Equation (3):
$$F_1 = \frac{\left| \sum_{t \in \delta} TNLP_A(t) \right|}{\left| \sum_{t \in \delta} TNLP_D(t) \right|} \qquad (3)$$
Energies 2018, 11, 1750 For customers with DPVSs, the value of this feature should be greater or much greater than 1, while for those customers without DPVSs, this index should be close to 1.
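A minimal sketch of this ratio is given below. It assumes hourly TNLPs stored as lists indexed by hour, so that the 9:00 to 16:00 window corresponds to the inclusive index pair (9, 16); the function name and this indexing are illustrative choices.

```python
def ratio_feature(tnlp_a, tnlp_d, window):
    """F1 sketch: |total net load in GWC-A| / |total net load in GWC-D|
    over the inclusive timeslot window (start, end)."""
    start, end = window
    total_a = abs(sum(tnlp_a[start:end + 1]))
    total_d = abs(sum(tnlp_d[start:end + 1]))
    # Note: total_d == 0 would raise ZeroDivisionError; callers may want to guard it.
    return total_a / total_d
```

For example, a customer whose midday net load is -3 kW on GWC-A days and 1 kW on GWC-D days over eight timeslots yields a ratio of 3.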

Concave Shape Index
The integration of DPVSs will affect the net load shape.Specifically, the net load curve will be concave during time window δ.This phenomenon is more evident in GWC-A, because the solar PV generation is much larger in this GWC compared with other GWCs.Thus, the second feature named concave shape index denoted by F 2 is extracted based on the TNLP in GWC-A, namely TNLP A .
Once the time window δ is determined, a line l connects the start point and the end point of the TNLP segment. This line can be described by Equation (4):
$$y(t) = TNLP_A(t_s) + \frac{TNLP_A(t_e) - TNLP_A(t_s)}{t_e - t_s}\,(t - t_s), \quad t \in [t_s, t_e] \qquad (4)$$
where $y(t)$ represents the sampling point on line l at timeslot t. All of these sampling points form a set $S = \{y(t) \mid t \in [t_s, t_e]\}$. The TNLP sampling points below the line l can be expressed as Equation (5):
$$S' = \{TNLP_A(t) \mid TNLP_A(t) < y(t),\ t \in [t_s, t_e]\} \qquad (5)$$
Then the concave shape index can be calculated by Equation (6):
$$F_2 = \frac{card(S')}{card(S)} \qquad (6)$$
where card(·) represents the number of elements in a set. This feature describes the proportion of the sampling points below the line l, which can be used to characterize the concavity of the TNLP curve during time window δ.
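The concave shape index can be sketched as a count of TNLP samples lying strictly below the chord between the window's start and end points. The function name and the list-based, integer-indexed representation of the TNLP are assumptions for illustration.

```python
def concave_shape_index(tnlp_a, window):
    """F2 sketch: share of TNLP_A samples in the inclusive window (start, end)
    that lie strictly below the straight line joining the two window endpoints."""
    start, end = window
    slope = (tnlp_a[end] - tnlp_a[start]) / (end - start)
    below = 0
    for t in range(start, end + 1):
        chord = tnlp_a[start] + slope * (t - start)  # point on line l at timeslot t
        if tnlp_a[t] < chord:
            below += 1
    return below / (end - start + 1)
```

The two endpoints lie on the line by construction and are never counted as below it, so the index cannot reach 1 under this sketch.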

Concavity Degree
The second feature $F_2$ can only identify whether $TNLP_A$ is concave or not. This feature will be ineffective in some cases, for example, for households that have no DPVS but consume less electricity around midday. Hence, to further confirm whether the customer has a DPVS or not, other features should be extracted. The PV output power differs across GWCs, and thus the concavity degree of the TNLP also differs across GWCs. A new feature named concavity degree, denoted by $F_3$, is extracted based on the above analysis. The specific process is illustrated as follows.
As shown in Figure 5, find the minimum point of the TNLP segment, denoted by $(t_m, TNLP(t_m))$. Two lines $L_1$ and $L_2$ connect this minimum point to the start point and the end point, respectively. Calculate the slopes of these two lines. The sum of these two slopes can be used to describe the concavity degree, as defined in Equations (7) and (8). For customers with DPVSs, the concavity degrees in GWC-A and GWC-D, denoted by $k_A$ and $k_D$, should be different; that is, the value of $k_A$ is much larger than $k_D$. By contrast, the value of this feature will be close to 1 for customers without DPVSs, since none of their other load components is weather status driven.
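One possible reading of the concavity degree is sketched below. Two points are assumptions rather than statements from the paper: the two slopes are summed as absolute values (a signed sum would partly cancel, since one line falls and the other rises), and the feature is taken as the ratio $k_A / k_D$, which is consistent with the remark that it is close to 1 for customers without DPVSs. Function names are hypothetical.

```python
def concavity_degree(tnlp, window):
    """Sum of absolute slopes of the lines joining the window's minimum point
    to its start and end points (assumed form of k for one GWC)."""
    start, end = window
    t_min = min(range(start, end + 1), key=lambda t: tnlp[t])
    k = 0.0
    if t_min != start:
        k += abs((tnlp[t_min] - tnlp[start]) / (t_min - start))
    if t_min != end:
        k += abs((tnlp[end] - tnlp[t_min]) / (end - t_min))
    return k

def concavity_feature(tnlp_a, tnlp_d, window):
    """Assumed form of F3: ratio of the concavity degree in GWC-A to GWC-D."""
    # A perfectly flat TNLP_D gives k_D == 0 and raises ZeroDivisionError.
    return concavity_degree(tnlp_a, window) / concavity_degree(tnlp_d, window)
```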

Load Ramping Rate
The output power of DPVSs decreases rapidly around sunset, which causes a rapid increase in the net load curve. This phenomenon is defined as load ramping in this paper. To describe the ramping speed, an index named load ramping rate is defined, which is illustrated in Figure 6. Find a point $(t_f, TNLP(t_f))$ at which the solar PV generation is close to 0 ($t_f$ is set to 19:00 in this paper). The slope of the line connecting this point and the end point of time window δ is defined as the load ramping rate. The load ramping rate differs across GWCs. Another feature, denoted by $F_4$, is extracted according to the above analysis, expressed by Equations (9) and (10).
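The load ramping rate can be sketched as the slope between the end of the time window and the evening reference point. The sketch assumes hourly TNLPs indexed by hour (window end at index 16, reference point at index 19) and, as a further assumption not confirmed by the text, that $F_4$ compares the rates in GWC-A and GWC-D as a ratio. Function names are hypothetical.

```python
def ramping_rate(tnlp, t_end, t_f):
    """Slope of the line from the window end point to the point at t_f."""
    return (tnlp[t_f] - tnlp[t_end]) / (t_f - t_end)

def ramping_feature(tnlp_a, tnlp_d, t_end=16, t_f=19):
    """Assumed form of F4: ramping rate in GWC-A divided by that in GWC-D."""
    # A zero ramping rate in GWC-D raises ZeroDivisionError.
    return ramping_rate(tnlp_a, t_end, t_f) / ramping_rate(tnlp_d, t_end, t_f)
```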

One-Class SVC Based DPVSD Model
A variety of supervised machine learning methods have been proposed to address classification and regression problems in the past years, such as SVM, artificial neural network (ANN) [19], K-Nearest Neighbor (KNN) and Random Forests (RF). Among these techniques, we choose SVM in this research due to its excellent performance in many applications (e.g., solar PV power forecasting [20,21]). SVM can be divided into two categories: SVC and SVR. The SVC model is used here because DPVSD is a classification problem. The input of the SVC model is the four features extracted in Section 3.3. The output is the label showing whether a customer has installed a DPVS or not. Specifically, customers with DPVSs are marked as 1, while customers without DPVSs are marked as 0. The radial basis function (RBF) is chosen as the kernel function in this paper.

Bootstrap-SVR Based DPVSCE Model
Once a DPVS is detected, the next step is to estimate its capacity. Similar to DPVSD, DPVSCE can also be formulated as a supervised machine learning problem. The key to accurate estimation is still feature extraction. The features extracted for DPVSD focus on the difference between customers with DPVSs and those without, and are therefore not effective for distinguishing DPVSs with different capacities. New features should thus be extracted to describe the distinction between DPVSs with different capacities.

Extraction of Envelope Lines
One of the main differences between DPVSs with different capacities is the total amount of daily PV power generation. To estimate the daily total PV power generation using only the net load curves, two "special lines" need to be extracted from the net load curves of each customer. The first line, defined as the lower envelope line (LEL), should reflect the net load level when the DPVS operates near its maximum power generation capability. In contrast, the other line, defined as the upper envelope line (UEL), should reflect the net load level when the DPVS output power is near zero. These two envelope lines can be used to create features that describe the distinction between DPVSs with different capacities. The specific extraction process is illustrated as follows.

Extraction of LEL
Since GWC-A corresponds to the highest PV output power level among the four GWCs, all daily net load curves in GWC-A are used to extract the LEL. For each t ∈ T, find the minimum value $L_i^-(t)$ of all daily net load curves, expressed by Equation (11):
$$L_i^-(t) = \min_{d \in D_A} P_{net}(d,t) \qquad (11)$$
where $P_{net}(d,t)$ is the net load power on day d at timeslot t and $D_A$ is the set of days belonging to GWC-A. As such, the LEL $L^-$ can be obtained by Equation (12):
$$L^- = \{L_i^-(t) \mid t \in T\} \qquad (12)$$
This LEL reflects the net load level when the actual load level is minimum and the output power level of the DPVS is maximum.

Extraction of UEL
A suitable UEL should be able to represent the net load level when the PV output power is minimum and the actual load level is minimum. Since GWC-D corresponds to the lowest PV output power level, all daily net load curves in GWC-D are used to extract the UEL. Similarly, for each t ∈ T, we calculate the minimum value of all net load curves to obtain the UEL $L^+$, given by Equations (13) and (14).
An example of envelope lines extraction is shown in Figure 7.
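The extraction of both envelope lines can be sketched as a pointwise minimum over the days of one GWC: GWC-A days for the LEL and GWC-D days for the UEL, as described above. The data layout (one list per day, one GWC label per day) and the function name are illustrative assumptions.

```python
def envelope_lines(net_load, gwc_labels):
    """net_load: list of daily net load profiles (lists of T values).
    gwc_labels: one label in 'A'..'D' per day.
    Returns (LEL, UEL) as lists of T values."""
    def pointwise_min(label):
        days = [p for p, g in zip(net_load, gwc_labels) if g == label]
        # Minimum over days, taken separately at every timeslot.
        return [min(col) for col in zip(*days)]
    lel = pointwise_min("A")  # highest PV output, minimum load
    uel = pointwise_min("D")  # lowest PV output, minimum load
    return lel, uel
```

Taking the minimum in both classes keeps the load level comparable between the two lines, so that their gap mainly reflects PV generation.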

Extremum Difference Based Feature Extraction
Three features reflecting the size of the DPVS capacity can be extracted based on the obtained LEL and UEL. To obtain the most distinctive features for estimation, the influence of the part of the load curve in which the DPVS is unable to produce electricity should be avoided. We therefore again use the segment of the envelope lines within the time period δ to extract features.

Minimum Net Load Power
The first feature, denoted by $F_1^{DPVE}$, is equal to the minimum value of the LEL, which is given by Equation (15):
$$F_1^{DPVE} = \min_{t \in \delta} L^-(t) \qquad (15)$$
This feature can be used to describe the maximum DPVS generation capacity. When the electricity consumption levels of two customers are the same, a smaller $F_1^{DPVE}$ indicates a larger DPVS capacity.

Maximum Difference of PV Output Power
The second feature, denoted by $F_2^{DPVE}$, is defined as the maximum value of the difference between the UEL and the LEL, given by Equation (16):
$$F_2^{DPVE} = \max_{t \in \delta} \left( L^+(t) - L^-(t) \right) \qquad (16)$$

Total PV Power Generation during Time Window
The area between the UEL and the LEL can be used to characterize the solar PV generation and to distinguish DPVSs with different capacities. Hence, this index is selected as the third feature, denoted by $F_3^{DPVE}$, which can be calculated by Equation (17):
$$F_3^{DPVE} = \sum_{t \in \delta} \left( L^+(t) - L^-(t) \right) \qquad (17)$$
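The three capacity-related features can be sketched together from the two envelope lines. The sketch assumes list-based envelope lines, an inclusive index window, and a discrete area computed as a plain sum over unit-length timeslots; the function name is hypothetical.

```python
def capacity_features(lel, uel, window):
    """Returns (F1, F2, F3) computed on the inclusive window (start, end):
    F1 = minimum of the LEL, F2 = largest UEL-LEL gap, F3 = summed gap."""
    start, end = window
    lel_seg = lel[start:end + 1]
    uel_seg = uel[start:end + 1]
    gap = [u - l for u, l in zip(uel_seg, lel_seg)]
    f1 = min(lel_seg)
    f2 = max(gap)
    f3 = sum(gap)  # discrete area, assuming unit-length timeslots
    return f1, f2, f3
```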

Bootstrap-SVR Based DPVSCE Model
Since the DPVS capacity is numerically continuous, a regression model can be used to estimate it. To achieve higher estimation accuracy, sufficient data are usually needed for training. However, it is common in practice that the dataset is imbalanced, i.e., not uniformly distributed. An imbalanced dataset not only significantly affects the accuracy of a classification model, but also has a negative impact on the performance of a regression model. To address this issue, a bootstrap-SVR-based DPVSCE model is proposed in this paper.
First, the typical output power (TOP) curve of a local DPVS with a capacity of 1 kWp is defined as $P_{TPV}(d,t)$, d ∈ D, t ∈ T. It can be approximated using the authorized customers whose DPVS capacity and output power are known. The output power of DPVSs with various capacities can be simulated from the TOP as follows:
$$P_{pv}^{s}(d,t) = C_s \cdot P_{TPV}(d,t) \qquad (18)$$
where $C_s$ is the simulated capacity and $P_{pv}^{s}(d,t)$ is the simulated power of the DPVS with a capacity of $C_s$.
Second, the bootstrap method is used to simulate more electricity consumption data from the known samples. The core of the bootstrap is resampling with replacement, which is particularly useful when only a small amount of empirical data is available. In this way, various samples can be obtained.
Finally, a large number of net load profiles covering the whole distribution of DPVS capacities can be obtained through random combination of the simulated output power and the electricity consumption data.
Similar to the SVC-based DPVSD model, the input of the SVR is the three extracted features. The output is the corresponding actual DPVS capacity.
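The data augmentation described above can be sketched as follows. For brevity the sketch works on single-day profiles, draws consumption profiles with replacement, draws a capacity at random from a candidate list, and scales a 1 kWp typical output curve by that capacity. The function name, the single-day simplification and the uniform capacity draw are assumptions for illustration, not details given in the paper.

```python
import random

def simulate_net_load_samples(load_profiles, top_curve, capacities, n_samples, seed=0):
    """load_profiles: known consumption profiles (lists of T values).
    top_curve: typical output power of a 1 kWp system (list of T values).
    capacities: candidate simulated capacities C_s in kWp.
    Returns a list of (net_load_profile, C_s) training pairs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        load = rng.choice(load_profiles)   # bootstrap: resampling with replacement
        c_s = rng.choice(capacities)       # assumed uniform draw over candidates
        pv = [c_s * p for p in top_curve]  # simulated PV output: scaled TOP curve
        net = [u - g for u, g in zip(load, pv)]
        samples.append((net, c_s))
    return samples
```

Drawing the capacity independently of the consumption profile is what lets the simulated set cover capacity values that are rare or absent among the authorized customers.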

Dataset
The dataset used in this research is obtained from the Pecanstreet Database, which collects energy consumption data recorded from residential houses in Austin (Texas, U.S.A.) [22]. In this paper, smart meter readings containing electricity consumption data, PV output power data and net load data with a sampling interval of 1 h from 183 households with rooftop solar systems in 2015 are used to test and verify the proposed approach.

Performance Metric
To evaluate the proposed DPVSD and DPVSCE models, several performance metrics are adopted.

Performance Metric for DPVSD Model
A confusion matrix, which contains all the information about the actual and predicted classes, is usually used to evaluate a classification model:

$$M = \begin{bmatrix} m_{11} & \cdots & m_{1N} \\ \vdots & \ddots & \vdots \\ m_{N1} & \cdots & m_{NN} \end{bmatrix}$$

where $m_{ij}$ is the number of objects that actually belong to class $i$ but are classified into class $j$, and $N$ is the total number of classes. The performance of a classification model is commonly evaluated by the following three indexes computed from the confusion matrix: producer's accuracy (PA), user's accuracy (UA) and overall accuracy (OA):

$$PA_i = \frac{m_{ii}}{\sum_{j=1}^{N} m_{ij}}$$

$$UA_j = \frac{m_{jj}}{\sum_{i=1}^{N} m_{ij}}$$

$$OA = \frac{\sum_{i=1}^{N} m_{ii}}{\sum_{i=1}^{N}\sum_{j=1}^{N} m_{ij}}$$

PA describes the classification accuracy of one specific actual class. UA describes the ratio of correctly identified objects in one specific output class. OA describes the classification accuracy over all actual classes or, equivalently, the ratio of correctly identified objects over all output classes.
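As a small worked illustration, the three indexes can be computed from a confusion matrix as sketched below. The sketch assumes rows index the actual class and columns the predicted class, in line with the definition of m_ij above; the matrix values are made up and are not results from this study.

```python
def producer_accuracy(m, i):
    """PA of actual class i: correct objects divided by the row total."""
    row_total = sum(m[i])
    return m[i][i] / row_total if row_total else float("nan")


def user_accuracy(m, j):
    """UA of output class j: correct objects divided by the column total."""
    col_total = sum(row[j] for row in m)
    return m[j][j] / col_total if col_total else float("nan")


def overall_accuracy(m):
    """OA: sum of the diagonal divided by the total number of objects."""
    total = sum(sum(row) for row in m)
    return sum(m[k][k] for k in range(len(m))) / total


# Made-up two-class example: class 0 = "with DPVS", class 1 = "without DPVS".
confusion = [[40, 10],
             [5, 45]]

pa_with_pv = producer_accuracy(confusion, 0)
ua_with_pv = user_accuracy(confusion, 0)
oa = overall_accuracy(confusion)
```

In this toy matrix, PA for class 0 penalizes the 10 DPVS owners that were missed, whereas UA for class 0 penalizes the 5 customers wrongly flagged as DPVS owners.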

Performance Metric for DPVSCE Model
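Two indicators, the mean absolute percentage error (MAPE) and the coefficient of determination ($R^2$), defined in Equations (23) and (24), are adopted to evaluate the performance of the proposed DPVSCE model:

$$MAPE = \frac{1}{N}\sum_{n=1}^{N}\left|\frac{C^{a}_{n} - C^{p}_{n}}{C^{a}_{n}}\right| \times 100\% \tag{23}$$

$$R^2 = 1 - \frac{\sum_{n=1}^{N}\left(C^{a}_{n} - C^{p}_{n}\right)^2}{\sum_{n=1}^{N}\left(C^{a}_{n} - \bar{C}^{a}\right)^2} \tag{24}$$

where $C^{a}_{n}$ and $C^{p}_{n}$ represent the actual and predicted capacity of the $n$th test sample, respectively, $N$ here denotes the number of test samples, and $\bar{C}^{a}$ is the mean of the actual capacities. A smaller MAPE and a larger $R^2$ indicate better performance of the DPVSCE model.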
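The two indicators can be computed as in the following sketch, which assumes the standard definitions of MAPE and of the coefficient of determination. The capacity values are made up for illustration and are not results from this study.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up capacities in kWp for three test customers.
actual_kwp = [4.0, 5.0, 8.0]
predicted_kwp = [4.4, 5.0, 7.2]

mape_value = mape(actual_kwp, predicted_kwp)
r2_value = r_squared(actual_kwp, predicted_kwp)
```

Because MAPE divides by the actual capacity, the same absolute error weighs more heavily for small systems than for large ones, which is worth keeping in mind when comparing results across capacity groups.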

Experiment Design
In this section, two experiments are designed to test and verify the effectiveness of the proposed approach. Experiment 1 tests the performance of the proposed DPVSD model and Experiment 2 tests the performance of the DPVSCE model.

Experiment 1: Performance Test of the DPVSD Model
To test the performance of the proposed DPVSD model, electricity consumption data of customers without DPVSs is needed. Since the dataset contains the total electricity use data of each customer, it is feasible to simulate virtual customers without DPVSs by using the total electricity use data. In this way, another 183 customers without DPVSs are obtained. To make the results reliable, we use the K-fold cross-validation technique. The 366 customers are randomly divided into 4 folds (about 92 samples in each fold). The SVM classifier is then trained (using three folds) and evaluated (using the remaining fold) four times. The four results from the 4 folds are averaged to produce a single result. The above procedure is run for 100 rounds, and the performance metrics are calculated and recorded for each round. The simulation of the SVC-based DPVSD model is implemented using MATLAB R2014a (MathWorks, Natick, MA, USA) and LIBSVM tools (Version 3.17). The grid search method is adopted to optimize the penalty parameter c and the kernel parameter of the RBF kernel.
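The cross-validation procedure described above can be sketched as follows. This is a minimal sketch rather than the implementation used in the study (which relies on MATLAB and LIBSVM): `train_and_score` is a hypothetical callback standing in for "train the classifier on the training folds and return a metric on the test fold", and `dummy_score` is only a placeholder so that the example runs.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, train_and_score, rng):
    """Average the score over the k train/test splits of one round."""
    folds = k_fold_indices(n_samples, k, rng)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


def dummy_score(train_idx, test_idx):
    """Placeholder metric: the share of samples held out for testing."""
    return len(test_idx) / (len(train_idx) + len(test_idx))


rng = random.Random(42)
fold_sizes = [len(fold) for fold in k_fold_indices(366, 4, rng)]

# 100 rounds, each with a fresh random partition into 4 folds.
round_scores = [cross_validate(366, 4, dummy_score, rng) for _ in range(100)]
```

Since 366 is not divisible by 4, two folds hold 92 customers and two hold 91, which matches the "about 92 samples" per fold stated above. Repeating the random partition over many rounds is what allows best, worst and average cases to be reported.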
To illustrate the advantage of the proposed voting method for GWC label generation, a comparison is conducted between the voting method and the simple clustering method (i.e., using the output power data from only a single DPVS to generate the GWC labels). The comparison results of these two methods in different cases are shown in Table 1. Several findings can be summarized as follows. First, the proposed voting method achieves more accurate classification results than the simple clustering method in the best and average cases. This finding verifies our hypothesis that labels generated from multiple sources are more reliable. Second, the DPVSD model presents very good overall performance in all cases. The OA of the proposed method reaches 100% in the best case. Even in the worst case, the proposed DPVSD model achieves a high accuracy of 90.22%, which verifies the effectiveness of the proposed features.

Experiment 2: Performance Test of the DPVSCE Model
The distribution of DPVS capacities for the 183 customers is shown in Figure 8. These customers can be divided into three groups according to their DPVS capacities: "small" (<4 kWp), "middle" (4-6 kWp) and "large" (>6 kWp). The customers in the "middle" group account for the largest proportion (about 70%). To test the performance of the proposed DPVSCE model on such an imbalanced dataset, two cases are considered: 10 customers are randomly selected for training from the "small" group in one case and from the "large" group in the other, and the other 173 customers are used for testing. 1000 virtual customers are generated by the bootstrap method. The proposed bootstrap-SVR model is compared with the direct SVR model, i.e., the model trained directly on the training samples. The comparison results are shown in Figure 9. The proposed bootstrap-SVR model produces accurate estimation results in both cases: the MAPE is between 5% and 7% and the $R^2$ is between 0.86 and 0.92. The direct model, however, shows biased estimation results in both cases. Specifically, large errors occur when only the samples in the "small" or "large" group are used for training, which reveals the low adaptability of the direct model to imbalanced datasets.
Furthermore, we compare our DPVSCE method with the method proposed in [16] on this dataset. The estimation results of that method are shown in Figure 10. It produces large estimation errors, with a MAPE of up to 34%. It is pointed out in [16] that it is almost impossible to obtain an accurate DPVS capacity estimate without the local cloud cover index information. In contrast, our method achieves high estimation accuracy without any weather information and requires very few training samples.

Discussion
In this section, we analyze the relative importance of the proposed features for both classification and capacity estimation and further test the robustness of the proposed method under several scenarios.

Correlation Analysis
An input feature showing a stronger correlation with the output is usually considered more important [23]. The maximal information coefficient (MIC), which can quantify both linear and non-linear relations between two variables [24], is used to analyze the relative importance of the proposed input features for both classification and capacity estimation. The analysis results are shown in Tables 2 and 3, respectively. Table 2 shows that the feature describing the ratio of total electricity consumption in GWC-A to that in GWC-D has the strongest correlation with the classification label, which indicates that it is the most important feature for the classification model. The load ramping rate feature shows the weakest correlation, which indicates that it is relatively unimportant compared with the other features. Table 3 gives the order of relative importance of the three proposed features for capacity estimation.

The Impact of Energy Storage on DPVSD Model
Some loads, such as energy storage devices (ESDs), can act as generators in some cases, which also reshapes the load profiles and may affect the detection results of the DPVSD model. To test the robustness of the proposed DPVSD model, a scenario is designed in which some customers have ESDs but no DPVSs.
The Powerwall, a household ESD produced by Tesla, is selected for simulation [25]. The usable storage capacity of the Powerwall is 10 kWh. In our simulation, the ESD is set to charge in the low electricity price period (0:00-6:00) and discharge in the high electricity price period (12:00-20:00) each day. The charging/discharging power is set to 5 kW and the charging/discharging duration is set to 2 h. All customers without DPVSs are assumed to have installed an ESD. The simulated normalized TNLPs of a customer after introducing a household ESD are shown in Figure 11. The classification accuracy of the proposed DPVSD model after introducing ESDs is shown in Table 4.
It can be observed from Table 4 that the proposed DPVSD model still achieves high detection accuracy after the ESDs are introduced. This is because the charging/discharging behaviors are not driven by the weather status. Hence, as shown in Figure 11, the TNLPs of a customer without a DPVS are similar across the different GWCs. The extracted features accurately reflect the unique WSDDFs of DPVSs, so that DPVSs can be distinguished from ESDs.
Figure 12 presents the simulated net load profiles for 10 consecutive days. If the ESD is introduced on the 6th day, a change-point detection algorithm is highly likely to identify this customer as a DPVS owner. The daily net load profiles after the ESD is introduced are totally different from those before, so a change-point will be detected by the algorithm. Moreover, the concave-down net load profiles caused by the discharging of the ESD make such customers difficult to distinguish from DPVS owners.
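The ESD scenario above can be sketched as a fixed daily schedule added to a customer's hourly load. This is a minimal sketch under stated assumptions: the text only gives the charging and discharging windows, so the start hours chosen here (0:00 and 12:00) are hypothetical, as are the function names and the flat 1 kW toy load.

```python
def esd_power_profile(charge_start=0, discharge_start=12,
                      power_kw=5.0, duration_h=2):
    """Hourly ESD power seen by the meter for one day.

    Positive values mean charging (extra load), negative values mean
    discharging (the ESD acts as a generator).
    """
    profile = [0.0] * 24
    for hour in range(charge_start, charge_start + duration_h):
        profile[hour] = power_kw
    for hour in range(discharge_start, discharge_start + duration_h):
        profile[hour] = -power_kw
    return profile


def net_load_with_esd(load_kw, esd_kw):
    """Add the ESD schedule to the customer's hourly consumption."""
    return [load + esd for load, esd in zip(load_kw, esd_kw)]


esd = esd_power_profile()
charged_kwh = sum(p for p in esd if p > 0)  # energy stored per day, 1 h steps

# Toy flat 1 kW load (made up) for a customer without a DPVS.
flat_load = [1.0] * 24
net_load = net_load_with_esd(flat_load, esd)
```

With 5 kW for 2 h, the energy charged each day equals the 10 kWh usable capacity quoted above. The schedule is identical on every day regardless of weather, which is the property that lets the weather-driven features separate ESD owners from DPVS owners.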

The Impact of Length of Historical Data on the DPVSCE Model
Various lengths of historical data, from 1 month up to 12 months, are used to explore the impact of data length on the performance of the proposed DPVSCE method. The model is run for 100 rounds for each length, and MAPE and $R^2$ are calculated in each round. The distribution of the estimation results over the 100 rounds is presented in Figure 13.
We find that both MAPE and $R^2$ become worse as the available historical data decreases. In the worst case, i.e., when only one month of data is available, the MAPE and $R^2$ are 15% and 0.6, respectively. These findings indicate that the proposed approach relies on historical data. Hence, it is preferable to obtain sufficient data in practice to achieve higher accuracy. Nevertheless, the proposed approach still outperforms the model proposed in [16] even with limited historical data.

Figure 1. Framework of the proposed two-stage DPVSCE approach.

Figure 2. The flow chart of the GWC label generating method.

Figure 4. An example of TNLPs in four GWCs.

Figure 5. Illustration of the third feature.

Figure 6. Illustration of the load ramping.

Figure 7. An example of envelope line extraction.

Figure 8. The distribution of DPVS capacities for the 183 customers.

Figure 9. The comparison results for the proposed bootstrap-SVR model and the direct model: (a,b) using the samples from the "small" group for training; (c,d) using the samples from the "large" group for training.

Figure 10. The estimation results of the method proposed in [16].

Figure 11. The simulated TNLPs of a customer with a household ESD.

Figure 12. The simulated net load profiles of a customer without DPVSs before and after introducing a household ESD.

Table 1. The classification evaluation results of the proposed DPVSD model in different cases: a comparison between two GWC label generation methods.

Table 2. The MIC values between the four features used for classification and the classification label.

Table 3. The MIC values between the three features used for estimation and the capacity values.
MIC value: 0.937, 0.505, 0.602

Table 4. The classification accuracy after introducing ESDs.