A Regional Day-Ahead Rooftop Photovoltaic Generation Forecasting Model Considering Unauthorized Photovoltaic Installation

Rooftop photovoltaic (PV) systems are usually behind the meter and invisible to utilities and retailers; thus, their power generation is not monitored. When many rooftop PV systems are installed, they transform the net load pattern in power systems. Moreover, not only generation but also PV capacity information is invisible because of unauthorized PV installations, causing inaccuracies in regional PV generation forecasting. This study proposes a regional rooftop PV generation forecasting methodology that adds unauthorized PV capacity estimation. PV capacity estimation consists of two steps: detection of unauthorized PV generation and estimation of the capacity of the detected PV. Finally, regional rooftop PV generation is predicted by considering unauthorized PV capacity through support vector regression (SVR) and the upscaling method. The results from a case study show that, compared with estimation without unauthorized PV capacity, the proposed methodology reduces the normalized root mean square error (nRMSE) by 5.41% and the normalized mean absolute error (nMAE) by 2.95%. It can be concluded that regional rooftop PV generation forecasting accuracy is improved.


Background and Motivation
In the past, fossil fuels were a key driving force for growth in the fields of technology, society and economy, and were used as the main energy source through the industrial revolution [1]. However, fossil fuels generate 65% of the annual carbon dioxide, which is known to cause global warming and causes air pollution [2,3]. In order to solve the problem caused by the use of fossil fuels, electricity is produced by renewable energy sources. According to the IRENA survey, the capacity of renewable energy utilities increased from 1329 GW to 2799 GW over the past 10 years, of which the supply of photovoltaic (PV) utilities increased from 73 GW to 713 GW [4].
Other reasons for the increased supply of solar power facilities are the decline in the levelized cost of electricity (LCOE) [5] and renewable energy policies such as the feed-in tariff (FiT) and renewable portfolio standard (RPS) [6]. In particular, rooftop PV increased rapidly owing to factors such as a decrease in rooftop PV generation costs [7], incentives for rooftop PV installations, and reduction of household electricity bills [8,9]. However, solar power output is determined by the amount of irradiance and the PV module temperature, both of which are intermittent. In the case of self-consumption, the power demand changes, and when the solar power facility is connected to the power grid, the uncertainty in power supply increases.
When the uncertainty in the output of renewable energy increases, several problems arise. First, a supply-demand imbalance occurs, and when the difference between supply and demand becomes extremely large, frequency fluctuation occurs. Second, reverse currents flow within the distribution system. Lastly, system operation costs increase due to frequent two-shifting for ancillary services [10]. To address the uncertainty of PV output, it is necessary to predict the amount of solar power generation. Research on predicting PV output has been conducted for about 10 years [11].
One of the important features of PV is that PV is distributed and installed in several areas because it can generate power wherever solar irradiance is provided. From the power system operator perspective, although the total sum of distributed PV power generation is an important value to balance supply and demand, there are practical difficulties in collecting accurate meteorological and PV output data for all regions.
In the upscaling method, the entire region is divided into sub-regions, a sample within the sub-region is determined from there to predict the amount of power generation, and then upscaling is performed for each sub-region. Here, the upscaling is to multiply by the upscaling factor after adding the predicted value of the sample power generation in the sub-region. Through the upscaling process, the amount of power generation in the sub-region is predicted. The solar power generation amount of the entire region is predicted by adding the forecast value of the generation amount of the sub-region.
Since 2014, unauthorized PV installations have occurred [27-29]. The first reason for unauthorized PV installation is that residents avoid the rooftop solar power installation fee, the second is that they do not want to carry out the obligations attached to solar power installation, and the last is the lack of awareness of the impact of unauthorized solar installations on the power system [27]. As unauthorized PV installations occur, the actual PV capacity deviates from the capacity information known to the system operator.
This difference in PV capacity information leads to a prediction error of the solar power generation amount by region (reduces the prediction accuracy), and it becomes difficult to calculate the appropriate hosting capacity in the power system. In addition, overvoltage occurs in the power system, which not only threatens the safety of employees of the electric power utility, but also damages the facilities in the power system [30].
Unauthorized PV installation can cause various problems in terms of safety. It causes overvoltage and back-feeding which, if sustained, can damage transformers, voltage regulators, and customers' appliances [30,31]. In addition to Cape Town, unregistered solar installations occur in California and Hawaii [32]. Arizona is charging new solar customers to prevent unregistered solar installations [33].
In order to compensate for the problems caused by the unauthorized PV installation, the process of detecting unauthorized PV installation and estimating the PV capacity should precede predicting the amount of solar power generation.

Literature Review
In this section, literature reviews are classified into three groups as shown in Table 1. Many single PV forecasting studies have been conducted in the past. However, single PV generation forecasting is less robust than regional PV generation forecasting. Single PV output has large variability due to meteorological factors. If the location where PV is installed is different, solar irradiance is also different; therefore, the PV generation pattern varies significantly depending on the PV location. However, regional PV generation is combined with several PV power generation sources; thus, the volatility is smaller than in single PV power generation and easier to forecast. In addition, missing and abnormal data occur because of malfunctions. This reduces the accuracy of single PV generation forecasting. Like PV capacity in a power system, regional PV generation has similar characteristics and trends; hence, PV generation forecasting in the region is valuable.
Refs. [27,38-40] belong to group 3 in Table 1. Ref. [27] proposes three processes: PV detection, PV identification, and PV capacity estimation. Its limitation is that data from both before and after the rooftop PV installation must be available. Ref. [38] uses random matrix theory to detect and estimate unauthorized PV. Ref. [39] proposes a machine learning based unauthorized PV detection and estimation model trained on net load data. It detects and estimates accurately by utilizing the difference between sunny days and rainy days. In contrast, Ref. [40] estimates PV capacity without a detection process; it proposes an ensemble model for PV capacity estimation with an optimal net load pair.

Contributions
To handle the uncertainty of unauthorized PV installation, a PV detection and capacity estimation model is applied to the regional rooftop PV forecasting model in this study. The main contributions of this paper are summarized as follows.
Detection performance was improved by adding two detection features. The correlation between the features and the presence or absence of unregistered solar installation was confirmed through the MIC, and the new features were found to have a higher correlation than the existing features. Refs. [35,36] did not investigate the effect of unregistered solar installation on the prediction accuracy of solar power generation, whereas this paper verifies it through a case study.

Structure of This Study
The rest of the study is organized as follows: Section 2 describes the problem formulation and the overall framework of the proposed approach. In Section 3, details of unauthorized PV detection, unauthorized PV capacity estimation, and the upscaling method for regional PV forecasting are presented. In Section 4, a case study is presented to verify the effectiveness of the proposed approach. Section 5 analyzes the model features used in the proposed approach. Section 6 contains the conclusion and highlights future work.

Problem Statement
Assume that a home smart meter collects net load data hourly for several days. Here, $\mathcal{D} = \{d \mid d = 1, 2, \ldots, D\}$ is the set of days and $\mathcal{T} = \{t \mid t = 1, 2, \ldots, T\}$ is the set of time slots. The net load on day $d$ and time slot $t$ is shown in Equation (1):
\[ NL(d,t) = GL(d,t) - P_{PV}(d,t) \qquad (1) \]
where $NL(d,t)$, $GL(d,t)$ and $P_{PV}(d,t)$ are the net load (NL) power, gross load (GL) power and PV generation power on day $d$ at time slot $t$. If a home has not installed rooftop PV, the PV generation power is 0, i.e., $P_{PV}(d,t) = 0 \;\forall d \in \mathcal{D}, \forall t \in \mathcal{T}$. Rooftop PV output is behind the meter (BTM) except at representative solar sites, implying that most rooftop PV power is not measured and collected. The classification of homes according to rooftop PV installation, rooftop PV authorization, and installation of a sub meter for PV power is shown in Figure 1 and Table 2. In Figure 1, the four markers denote home group 1, home group 2, home group 3, and home group 4; H1, H2, H3, and H4 in Table 2 have the same meanings. Representative solar sites are installed at homes in H1, and the utility has information on the rooftop PV systems installed at authorized homes (H1 and H2 in Table 2), including their location, capacity, and installation date. However, the utility does not know which of the remaining homes have no rooftop PV and which have unauthorized rooftop PV. Finally, the PV output forecasting discussed in this study is hourly day-ahead forecasting; therefore, the measurement interval is one hour.
The parameters required in the problem are defined as follows. $N_{home}$ denotes the number of homes in the entire region. It equals the sum of the numbers of homes in H1, H2, H3 and H4 in Table 2. $N_{PV}$ denotes the number of homes with installed PV, i.e., the sum of the numbers of homes in H1, H2 and H3 in Table 2. $r_{Au}$ represents the ratio of authorized PV homes among the homes with installed PV. $r_{Sam}$ represents the ratio of homes whose PV generation data are measured among the homes with authorized PV. When the numbers of homes in groups H1, H2, H3 and H4 in Figure 1 are $N_{H1}$, $N_{H2}$, $N_{H3}$ and $N_{H4}$, respectively, $r_{Au}$ and $r_{Sam}$ can be expressed as the following equations.
\[ r_{Au} = \frac{N_{H1} + N_{H2}}{N_{H1} + N_{H2} + N_{H3}} \qquad (2) \]
\[ r_{Sam} = \frac{N_{H1}}{N_{H1} + N_{H2}} \qquad (3) \]
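As a quick check of these definitions, the sketch below recovers the four group sizes from the total number of homes, the number of PV homes, and the two ratios. The function name is hypothetical; the input values are the ones adopted later in the case study.

```python
def home_group_counts(n_home, n_pv, r_au, r_sam):
    """Split homes into groups H1-H4 from the ratios defined in the text."""
    n_auth = round(n_pv * r_au)    # H1 + H2: homes with authorized PV
    n_h1 = round(n_auth * r_sam)   # H1: sub-metered representative solar sites
    n_h2 = n_auth - n_h1           # H2: authorized PV, generation not measured
    n_h3 = n_pv - n_auth           # H3: unauthorized PV
    n_h4 = n_home - n_pv           # H4: no rooftop PV
    return n_h1, n_h2, n_h3, n_h4


counts = home_group_counts(n_home=1500, n_pv=300, r_au=0.5, r_sam=0.08)
print(counts)  # (12, 138, 150, 1200)
```

With these values, the homes whose PV status is unknown to the utility are H3 and H4, i.e., 150 + 1200 = 1350 homes.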

Framework of the Proposed Approach
The framework of the proposed approach is shown in Figure 2. The proposed approach consists of three steps: unauthorized PV detection, unauthorized PV capacity estimation, and regional rooftop PV output forecasting. Unauthorized PV detection is a process judging whether an unauthorized home belongs to group H3 or H4 in Figure 1. Unauthorized PV capacity estimation is a process to determine rooftop PV capacity of group H3 using data from groups H1 and H2 in Figure 1. Regional rooftop PV forecasting is a process to predict regional aggregated PV generation by scaling up PV generation of representative solar sites.

Unauthorized PV Detection Model
In this section, the processes that make up the unauthorized PV detection model are described in detail. The model uses net load data to determine whether PV is installed at homes that have no authorized PV.

Four Weather Groups Clustering
In this process, days are classified into four weather groups (WG), denoted as A, B, C and D [35], representing sunny, cloudy, shower, and rainy days. In [35], the WG are determined from rooftop PV generation data. However, rooftop PV data cannot be obtained except for representative solar sites according to the assumptions in Section 2.1. Instead, daily solar irradiance data are used to classify days into WG A-D by K-means clustering. The average solar irradiance of each WG is shown in Figure 3, where A-D have the same meaning as in [35].

Generation Real and Virtual Typical Net Load Pattern and Minimum Net Load Pattern
To detect PV or energy storage, the net load variation caused by meteorological factors must be confirmed. The typical net load pattern (TNLP) is created by averaging the net load of one home over the days in the same WG. The formulation of TNLP is given in [35]. Next, a virtual TNLP must be generated. In [35], all actual home load data are estimated through rooftop PV generation and NL data, and they are used to train load data of homes without rooftop PV. In reality, however, PV generation data are unavailable for most rooftop PV systems because they are BTM. Additionally, the number of authorized homes is much lower than the number of unauthorized homes. To solve these problems, the method creates a virtual net load by bootstrapping the actual load of representative solar sites, where PV generation data can be obtained, and applies it in the unauthorized PV detection process. The difference between A and D is used to detect unauthorized PV in this study. Thus, virtual TNLPs of A and D are created from the real TNLPs of A and D. Typical PV power (TPP) is defined as the average PV power in a certain WG. Sam is the index of representative solar sites whose real-time PV generation is measured. N and ωPV indicate the number of virtual TNLPs and whether PV is included. The first step in Algorithm 1 is to extract the typical load pattern (TLP) from the TNLP and TPP of representative solar sites. Next, the TLP is normalized to its maximum value. In the third step, a normalized TLP is chosen randomly for each hour from 0 to 23 h.
Subsequently, the selected normalized TLP is scaled back to the original size by the maximum TLP. If a TNLP with rooftop PV is required, additional steps are needed. Similar to creating a virtual TLP, the TPP is first normalized by the representative solar site capacity. Then, a random PV capacity from 1 kW to 10 kW, which covers the rooftop PV capacity range, is applied. The virtual TPP is created by multiplying the random PV capacity by the normalized TPP. The virtual TNLP with rooftop PV is generated by subtracting the virtual TPP from the virtual TLP. The generation process of the virtual net load is summarized in Algorithm 1. The minimum net load pattern (MNLP) is the smallest value for each time period among the clustered net load values.
Algorithm 1. Generation of virtual TNLP.
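The steps described for Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm listing: the function name, the dictionary layout, and the synthetic site data are hypothetical, and the choice of which maximum TLP is used for rescaling (here, that of a randomly chosen site) is an assumption because the text does not specify it.

```python
import random


def virtual_tnlp(sites, with_pv, rng):
    """Build one virtual 24-hour TNLP from representative solar sites.

    sites: list of dicts with 24-value 'tnlp' and 'tpp' (kW) and 'capacity' (kW).
    """
    # Step 1: typical load pattern (TLP) = typical net load + typical PV power.
    tlps = [[n + p for n, p in zip(s["tnlp"], s["tpp"])] for s in sites]
    # Step 2: normalize every TLP by its own maximum.
    norm = [[v / max(tlp) for v in tlp] for tlp in tlps]
    # Step 3: for each hour 0-23, take the value of a randomly chosen site.
    picked = [rng.choice(norm)[h] for h in range(24)]
    # Step 4: rescale with the maximum TLP of a random site (assumption).
    scale = max(rng.choice(tlps))
    v_tlp = [scale * v for v in picked]
    if not with_pv:
        return v_tlp
    # PV branch: capacity-normalized TPP times a random 1-10 kW capacity.
    site = rng.choice(sites)
    capacity = rng.uniform(1.0, 10.0)
    v_tpp = [capacity * p / site["capacity"] for p in site["tpp"]]
    # Virtual TNLP with PV = virtual TLP - virtual TPP.
    return [load - pv for load, pv in zip(v_tlp, v_tpp)]


# Two synthetic representative sites (illustrative numbers only).
pv_shape = [0.0] * 8 + [0.2, 0.5, 0.7, 0.8, 0.8, 0.7, 0.5, 0.2] + [0.0] * 8
sites = [
    {"tnlp": [1.0 - 2.0 * p for p in pv_shape],
     "tpp": [2.0 * p for p in pv_shape], "capacity": 2.0},
    {"tnlp": [0.6 - 3.0 * p for p in pv_shape],
     "tpp": [3.0 * p for p in pv_shape], "capacity": 3.0},
]
rng = random.Random(0)
plain_home = virtual_tnlp(sites, with_pv=False, rng=rng)
pv_home = virtual_tnlp(sites, with_pv=True, rng=rng)
```

Calling the function repeatedly with and without the PV branch yields a labeled set of virtual patterns for training the detection model.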

Feature Extraction Based on TNLP and MNLP
In this section, features are extracted from TNLP and MNLP to train the unauthorized PV detection model. Six features are used, and they are expressed in Equations (5)-(10).
In Equation (4), $t_s$ and $t_e$ are the start time and end time of the PV output, and $t_m$ is the time when the TNLP is at its minimum. The specific values of these times are provided in Section 4.1. $c_A$ and $c_D$ indicate the concavity of the TNLP of A and D, respectively. In Equation (5), $F^D_1$ is the ratio of the summation of the TNLP of A and D. If unauthorized PV is installed at a home, $F^D_1$ is greater than one; otherwise, it is close to one. $F^D_2$ describes the concaveness of the TNLP of A. Using the mathematical definition in (6), $F^D_2$ is calculated as the number of hours of solar power generation that satisfy the representative concave TNLP of A. The range of $F^D_2$ is from 0 to 1, and a value close to 1 means that the TNLP of A is concave. $F^D_3$ in (7) is the concavity of the TNLP of A relative to that of D [35]. It is fundamental to detecting unauthorized PV because it utilizes the fact that the TNLP of A is more concave than that of D. However, its ability to detect unauthorized PV is limited, because $F^D_3$ is large not only when the concavity of the TNLP of A is large but also when the value of the TNLP of D is small. Thus, $F^D_5$ in Equation (9), the concavity of the TNLP of A itself, is required as an additional feature. $F^D_4$ indicates the ratio of the increase of the TNLP of A and D between $t_e$ and $t_f$, where the final time $t_f$ is the time when PV generation becomes zero after the sun has completely set. If unauthorized PV is installed, $F^D_4$ becomes greater than 1, because the net load that was lowered by solar output rises as the output falls to zero; otherwise, $F^D_4$ has a value close to 1. Finally, $F^D_6$ in Equation (10) is the minimum value of the TNLP when the WG is A. Without rooftop PV, $F^D_6$ is positive; with rooftop PV, it can drop to zero or a negative value. This is because the peak of PV generation is during the daytime, while the peak of residential electricity load occurs later in the evening. Thus, $F^D_6$ can be used as a feature for unauthorized PV detection.
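To make the behavior of the first and sixth detection features concrete, the sketch below evaluates them on toy patterns. The exact formulas are assumptions inferred from the prose: the first feature is taken as the rainy-day (D) sum divided by the sunny-day (A) sum over the PV output window, which gives a value above one for a PV home, and the sixth as the minimum of the sunny-day TNLP. The function names and the window 9 to 16 h (the values used in the case study) are illustrative.

```python
def f_d1(tnlp_a, tnlp_d, t_s=9, t_e=16):
    """Assumed form of feature 1: D-day sum over A-day sum within t_s..t_e."""
    sum_a = sum(tnlp_a[t_s:t_e + 1])
    sum_d = sum(tnlp_d[t_s:t_e + 1])
    return sum_d / sum_a  # not meaningful if the sunny-day sum is zero or negative


def f_d6(tnlp_a):
    """Assumed form of feature 6: minimum of the sunny-day TNLP."""
    return min(tnlp_a)


# Home without PV: same flat 1 kW pattern on sunny and rainy days.
flat = [1.0] * 24
# Home with PV: net load drops more on sunny days than on rainy days.
sunny_with_pv = [1.0] * 9 + [0.5] * 8 + [1.0] * 7
rainy_with_pv = [1.0] * 9 + [0.9] * 8 + [1.0] * 7

print(f_d1(flat, flat))                    # ratio for the home without PV
print(f_d1(sunny_with_pv, rainy_with_pv))  # ratio for the home with PV
print(f_d6(sunny_with_pv))                 # sunny-day minimum for the PV home
```

The home without PV gives a ratio of one, while the PV home gives a ratio of about 1.8, matching the qualitative behavior described in the text.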

Training and Test of Unauthorized PV Detection Model
The training and test processes for unauthorized PV detection are shown in Figure 4. Figure 4a shows the process that extracts features from the TNLP of each home and trains the detection model with them. Both real and virtual TNLPs are used to train the detection model. A multi-layer perceptron (MLP) is used as the detection model. Figure 4b shows the process of testing the detection model with test features. Additionally, the PV columns in the tables in Figure 4 indicate the PV installation status.

Unauthorized PV Capacity Estimation Model
In this section, the process of estimating the capacity of the rooftop PV detected by the model described in Section 3.1 is presented. This model estimates the unauthorized PV capacity from the net load data of homes.

Generation Virtual Net Load
The algorithm for generating the virtual NL is similar to the method described in Section 3.1.2. The difference is that the distribution of PV capacity, which is not uniform, is taken into account. The distribution of PV capacity in [41] is shown in Figure 5. There are many PVs with capacities of 1 kW to 2 kW, while very few PVs have a capacity of 3 kW or more. This indicates that training for cases with large PV capacity is difficult. Therefore, virtual data must be generated for cases with large PV capacity. The details of the virtual MNLP generation are given in Algorithm 2. In Algorithm 2, the GL of homes with representative solar sites is obtained by summing their NL and the PV power of the representative solar sites. Next, PV capacity sections are defined from the authorized PV capacity, and the number of PVs in each section is counted. To make the PV capacity distribution uniform, the number of additional NL data required for each section is obtained by subtracting the number of PVs in that section from the maximum count over all sections. The average generation of the representative solar sites is determined by WG and then normalized by each representative solar site capacity. Subsequently, starting from the first PV capacity section, a PV capacity and the GL of one home with a representative solar site are randomly selected. The virtual NL is generated by subtracting the PV capacity multiplied by the normalized PV power generation from the GL of the randomly selected home. This process is repeated for each weather group.
Figure 5. PV capacity distribution histogram in [41].
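The balancing step of Algorithm 2 can be illustrated with a short sketch. It is a simplified reading of the description, not the authors' listing: the function names and the 1 kW section width are assumptions, and sections that contain no authorized PV at all are not handled.

```python
from collections import Counter


def virtual_samples_needed(capacities_kw, width_kw=1.0):
    """Virtual NL samples needed per capacity section to flatten the histogram."""
    bins = Counter(int(c // width_kw) for c in capacities_kw)
    target = max(bins.values())  # size of the most populated section
    return {(b * width_kw, (b + 1) * width_kw): target - n
            for b, n in sorted(bins.items())}


def virtual_net_load(gross_load, norm_pv, capacity_kw):
    """Virtual NL = GL - capacity * capacity-normalized PV power."""
    return [g - capacity_kw * p for g, p in zip(gross_load, norm_pv)]


authorized = [1.0, 1.5, 1.2, 1.8, 2.5, 2.2, 3.4]  # illustrative capacities in kW
print(virtual_samples_needed(authorized))
# {(1.0, 2.0): 0, (2.0, 3.0): 2, (3.0, 4.0): 3}
```

In this toy case the 3-4 kW section receives the most virtual samples, which is the intended effect for sparsely populated large capacities.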

Extracting Minimum Net Load Pattern (MNLP) for Four Weather Classes
After the virtual NL is generated, the MNLPs in WG A and D are extracted to create the features for capacity estimation. $MNLP_A(t)$ and $MNLP_D(t)$ denote the MNLP in WG A and D, respectively, as shown in Equations (11) and (12):
\[ MNLP_A(t) = \min_{d \in \mathcal{D}_A} NL(d,t) \qquad (11) \]
\[ MNLP_D(t) = \min_{d \in \mathcal{D}_D} NL(d,t) \qquad (12) \]
where $\mathcal{D}_A$ and $\mathcal{D}_D$ denote the set of days when the WG is A and the set of days when the WG is D, and $d$ is the day index.
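A minimal sketch of the MNLP extraction follows, assuming the net load is stored per day as a list of hourly values and each day already carries a weather group label. The function name and data layout are hypothetical.

```python
def mnlp(net_load, weather_group, group):
    """Hour-by-hour minimum net load over all days in one weather group.

    net_load: {day: [hourly NL values]}, weather_group: {day: 'A'|'B'|'C'|'D'}.
    """
    days = [d for d in net_load if weather_group[d] == group]
    if not days:
        raise ValueError(f"no days in weather group {group}")
    hours = len(net_load[days[0]])
    return [min(net_load[d][t] for d in days) for t in range(hours)]


# Three days with three time slots each (illustrative numbers only).
nl = {1: [0.8, -0.5, 1.0], 2: [0.9, -1.2, 1.1], 3: [0.7, 0.6, 0.9]}
wg = {1: "A", 2: "A", 3: "D"}
print(mnlp(nl, wg, "A"))  # [0.8, -1.2, 1.0]
print(mnlp(nl, wg, "D"))  # [0.7, 0.6, 0.9]
```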

Extracting Features from MNLP
In unauthorized PV capacity estimation models, three features, F E 1 , F E 2 , and F E 3 are used [35]. These features are expressed in Equations (13)- (15).
In [35], $F^E_1$ denotes the minimum of $MNLP_A(t)$. It varies with the PV output and GL values. If the PV output is at its maximum or the GL is at its minimum, it takes negative values of significantly larger magnitude. $F^E_1$ can therefore be used to estimate PV capacity. The second feature, $F^E_2$, denotes the difference between $MNLP_A(t)$ and $MNLP_D(t)$ during a day. $F^E_3$ is originally the sum of the difference between $MNLP_A(t)$ and $MNLP_D(t)$ for the intermediate start time and end time of PV generation. However, it is difficult to recognize the PV capacity when the PV capacity is small. Therefore, the sum of the difference between $MNLP_A(t)$ and $MNLP_D(t)$ for the intermediate start time and end time of PV generation is chosen as the third feature. In the testing process, the MNLP of a test home is used to extract the three features, from which the unauthorized PV capacity is estimated.

Training and Test PV Capacity Estimation Model
The overall process of capacity estimation is shown in Figure 6. MNLP is extracted from the real and virtual NL. Support vector regression (SVR) is used as the machine learning method of the unauthorized PV capacity estimation model. The model is trained with the three MNLP features and the authorized PV capacity, and hyperparameter optimization based on grid search is performed.

Regional PV Forecasting Model
After the unauthorized PV capacity is estimated, regional rooftop PV generation is predicted through the upscaling method. In Section 3.3.1, representative solar sites (samples of rooftop PV) are determined for each cluster (sub region). Then, the PV generation forecasting model of the representative solar sites is trained, and their PV generation for the next day is predicted. In Section 3.3.3, the predicted PV power is scaled up by the upscaling factor and aggregated for each cluster. Finally, PV generation for the entire region is predicted by aggregating the PV generation of the clusters.

Clustering and Sampling of Rooftop PV
In this section, homes with rooftop PV are grouped by their location. To use the upscaling method, PV generation within a region must be similar. PV generation is affected by meteorological factors, and the closer two points are, the more likely their weather conditions are similar. Thus, upscaling must be carried out between geographically close PVs. In this study, K-means clustering [42] is used as the clustering method for rooftop PV, as expressed in Algorithm 3. The first step is to initialize the k cluster centers randomly. Next, each rooftop PV location is assigned to the cluster whose center is closest. Subsequently, the average of the data in each cluster is assigned as the new cluster center. If the cluster centers do not change, rooftop PV clustering is finished; otherwise, the assignment and center calculation are iterated until the centers converge. Then, representative solar sites are chosen for each cluster. For utility scale PV, there is no limit on the choice of representative solar sites because PV generation is measured. Rooftop PV generation, however, is not measured owing to the BTM feature, so rooftop PV with sub meters installed to measure PV generation must be chosen. If there is no rooftop PV with a sub meter, a sub meter must be installed at some PV systems. Once rooftop PV systems with sub meters are selected as representative solar sites, the process of Section 3.3.1 is completed.
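The clustering loop described for Algorithm 3 can be sketched in plain Python as below. This is an illustrative implementation of standard K-means, not the authors' code: the function name and coordinates are hypothetical, and plain Euclidean distance on the coordinates is an assumption that is only reasonable over a small area.

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """Cluster 2-D locations; returns (labels, centers)."""
    centers = rng.sample(points, k)  # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every rooftop PV to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments (and thus centers) are stable
            break
        labels = new_labels
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Two well-separated groups of rooftop locations (illustrative coordinates).
locations = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(locations, k=2, rng=random.Random(1))
print(labels)
```

Representative solar sites would then be drawn from the sub-metered homes within each label.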

Individual Rooftop PV Generation Forecasting
In this section, individual rooftop PV generation of representative solar sites is predicted. First, the features are normalized as shown in Equation (16):
\[ F_{Norm} = \frac{F_{actual} - F_{Min}}{F_{Max} - F_{Min}} \qquad (16) \]
In Equation (16), $F_{actual}$, $F_{Max}$, $F_{Min}$ and $F_{Norm}$ denote the original feature, the maximum of the feature, the minimum of the feature, and the normalized feature. Next, the individual rooftop PV generation model is constructed in Equation (17), whose variables denote the normalized one-day-ahead PV generation, normalized solar irradiance, normalized cloud cover, normalized precipitation, and normalized temperature. Finally, the individual PV generation forecast is tested with the trained PV generation forecasting model.
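The normalization in Equation (16) is ordinary min-max scaling. A small sketch follows; the function name is hypothetical, and returning zeros for a constant feature is an assumed convention for the degenerate case that the text does not discuss.

```python
def min_max_normalize(values):
    """Scale a feature to the range [0, 1] using its minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero (assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


irradiance = [0.0, 250.0, 1000.0]  # illustrative values in W/m^2
print(min_max_normalize(irradiance))  # [0.0, 0.25, 1.0]
```

In a forecasting setting, the minimum and maximum would normally be taken from the training period and reused for the test period so that both are scaled identically.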

Upscaling Sample Rooftop PV Generation by Cluster
In this section, rooftop PV generation in sub regions is predicted by the upscaling method. Sub region PV generation is calculated by scaling up the predicted individual PV generation according to Equations (18) and (19):
\[ P_{c,t} = \sum_{ind=1}^{N_{rep}} uf_c \, P_{c,ind}(t) \qquad (18) \]
\[ uf_c = \frac{C_c^{total}}{\sum_{ind=1}^{N_{rep}} C_{c,ind}} \qquad (19) \]
In Equation (18), $P_{c,ind}(t)$ denotes the predicted generation of the $ind$-th representative solar site at time $t$ in cluster $c$, $N_{rep}$ denotes the number of representative solar sites, and $uf_c$ denotes the upscaling factor of cluster $c$, i.e., the coefficient by which individual power generation is scaled up. The PV generation of a cluster (or sub region) is obtained by multiplying each individual PV generation by the upscaling factor and aggregating the results. Equation (19) shows how the upscaling factor is calculated. It is defined as the ratio of the total rooftop PV capacity in the cluster, written here as $C_c^{total}$, to the sum of the capacities of the representative solar sites, written here as $C_{c,ind}$.

Aggregating PV Generation of a Cluster
The PV generation for the entire region is finally predicted by aggregating the predicted PV generation of the clusters (or sub regions), as shown in Equation (20):
\[ P_{reg,t} = \sum_{c=1}^{N_c} P_{c,t} \qquad (20) \]
In Equation (20), $c$ denotes the cluster index and $N_c$ denotes the number of clusters. $P_{c,t}$ and $P_{reg,t}$ denote the predicted PV generation of cluster $c$ and of the entire region, respectively.
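The upscaling and aggregation in Equations (18)-(20) reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names and numbers are hypothetical, and it assumes that the cluster capacity passed in already includes the estimated unauthorized capacity, which is how the proposed approach would presumably account for it.

```python
def cluster_forecast(site_forecasts, site_capacities_kw, cluster_capacity_kw):
    """Scale representative-site forecasts up to one cluster."""
    uf = cluster_capacity_kw / sum(site_capacities_kw)  # upscaling factor, Eq. (19)
    hours = len(site_forecasts[0])
    return [uf * sum(f[t] for f in site_forecasts) for t in range(hours)]  # Eq. (18)


def regional_forecast(cluster_forecasts):
    """Sum the cluster forecasts hour by hour, Eq. (20)."""
    return [sum(vals) for vals in zip(*cluster_forecasts)]


# Cluster 1: two representative sites (3 kW in total) stand in for 30 kW.
c1 = cluster_forecast([[1.0, 2.0], [0.5, 1.0]], [2.0, 1.0], 30.0)
# Cluster 2: one representative site (4 kW) stands in for 20 kW.
c2 = cluster_forecast([[2.0, 4.0]], [4.0], 20.0)
print(c1, c2, regional_forecast([c1, c2]))
# [15.0, 30.0] [10.0, 20.0] [25.0, 50.0]
```

If unauthorized capacity is ignored, the cluster capacity and therefore the upscaling factor are too small, and the regional forecast is biased low by the same proportion.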

Experimental Data Description
To verify the performance of the proposed approach, generation data are collected from 300 rooftop PV systems of Ausgrid, a power utility in Sydney, Australia. Historical data from 1 July 2010 to 30 June 2013 with a sampling interval of 1 h are used in this paper. PV generation and GL data for one year from 1 July 2010 were used for unauthorized PV detection and capacity estimation. PV generation data for two years from 1 July 2011 were used to predict regional rooftop PV generation. The installed capacity of rooftop PV is 504 kW. The geographic distribution of the homes with rooftop PV is shown in Figure 7.
In Figure 7, color indicates the range of capacity. Green indicates a capacity of 1 kW to less than 2 kW, blue indicates a capacity of 2 kW to less than 3 kW, and red indicates a capacity of 3 kW or more. Additionally, within the same color, the larger the radius of the circle, the larger the rooftop PV capacity. These data can be downloaded from the website in [41]. In addition to rooftop PV generation data, weather forecast data are required. The unauthorized rooftop PV detection and capacity estimation models need hourly solar irradiance data, for which the hourly solar irradiance forecast data provided by [43] are used. To predict PV generation, weather forecast data such as temperature, cloud cover and precipitation are required. Ref. [44] provides various hourly weather forecast data such as temperature, humidity, wind speed and cloud cover. Therefore, the weather forecast data of [43,44] are used in this paper. In addition to the data, the parameters mentioned in Sections 2.1, 3.1 and 3.2 are shown in Table 3.
According to [45], the rooftop PV installation penetration rate was approximately 20% in August 2019. Because $N_{PV}$, the number of homes with rooftop PV, is 300, the total number of homes in the region is set to 1500. $r_{Au}$ is 0.5 (i.e., 50%), so half of the homes with rooftop PV are authorized and half are unauthorized. In other words, 150 homes are authorized and the other 150 homes are unauthorized in the case study. Because it is difficult to find the unauthorized installation rate by complete enumeration, there are few papers on this. According to [29], the identified unauthorized rooftop PV installation rate is about 50% in Cape Town, South Africa. Therefore, the unauthorized rooftop PV installation rate is assumed to be 0.5 (i.e., 50%) based on [29]. Finally, an important assumption in our work is that the number of systems is constant over the 2 years.
$r_{Sam}$, the ratio of homes whose PV generation data are measured among the homes with authorized PV, is assumed to be 0.08 (8%). $t_s$ and $t_e$ are assumed to be 9 and 16, because PV generation mainly occurs from 9 h to 16 h.

Performance Metric
To evaluate the proposed detection, capacity estimation, and regional PV generation forecasting models, several performance metrics are used in this paper.

Unauthorized PV Detection Performance Metric
In this section, three accuracy metrics, PV accuracy (PA), non-PV accuracy (NPA), and overall accuracy (OA), are defined to evaluate the unauthorized PV detection model. PA denotes the ratio of correctly classified homes among the homes that actually have rooftop PV. NPA denotes the ratio of correctly classified homes among the homes that actually have no rooftop PV. OA denotes the ratio of correctly classified homes among all homes in the region.
The three metrics can be calculated from a confusion matrix, CM, defined in Equation (21):
\[ CM = \begin{bmatrix} cm_{0,0} & \cdots & cm_{0,N_{st}-1} \\ \vdots & \ddots & \vdots \\ cm_{N_{st}-1,0} & \cdots & cm_{N_{st}-1,N_{st}-1} \end{bmatrix} \qquad (21) \]
In Equation (21), $N_{st}$ denotes the number of states that can be classified. In unauthorized PV detection there are two states, rooftop PV installed or not, so $N_{st}$ is 2 and $N_{st} - 1$ is 1. In the confusion matrix indices, zero indicates that rooftop PV is not installed and one indicates that rooftop PV is installed. The element $cm_{ij}$ denotes the number of objects that are actually in state $i$ and are classified as state $j$. The confusion matrix can be used to formulate the three accuracy metrics defined above, as expressed in Equations (22)-(24):
\[ PA = \frac{cm_{11}}{cm_{10} + cm_{11}} \times 100 \qquad (22) \]
\[ NPA = \frac{cm_{00}}{cm_{00} + cm_{01}} \times 100 \qquad (23) \]
\[ OA = \frac{cm_{00} + cm_{11}}{cm_{00} + cm_{01} + cm_{10} + cm_{11}} \times 100 \qquad (24) \]
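The three detection metrics follow directly from the verbal definitions of PA, NPA, and OA. A small sketch with hypothetical labels is given below, where 1 means rooftop PV installed and 0 means not installed; the function name is illustrative.

```python
def detection_accuracy(actual, predicted):
    """Return the 2x2 confusion matrix and PA, NPA, OA in percent."""
    cm = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1  # row: actual state, column: classified state
    pa = 100.0 * cm[1][1] / (cm[1][0] + cm[1][1])   # homes with PV
    npa = 100.0 * cm[0][0] / (cm[0][0] + cm[0][1])  # homes without PV
    oa = 100.0 * (cm[0][0] + cm[1][1]) / len(actual)
    return cm, pa, npa, oa


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
cm, pa, npa, oa = detection_accuracy(actual, predicted)
print(cm, pa, round(npa, 2), oa)  # [[5, 1], [1, 3]] 75.0 83.33 80.0
```

Because homes without PV far outnumber homes with PV in the case study, OA alone can look high even when PA is poor, which is why PA and NPA are reported separately.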

Unauthorized PV Capacity Estimation Performance Metric
In this section, the mean absolute percentage error (MAPE) is defined in Equation (25) to evaluate the performance of the unauthorized PV capacity estimation:
\[ MAPE = \frac{100}{N} \sum_{n=1}^{N} \left| \frac{C_{act,n} - C_{pred,n}}{C_{act,n}} \right| \qquad (25) \]
In Equation (25), $C_{act,n}$ and $C_{pred,n}$ represent the $n$th actual and predicted capacity of unauthorized rooftop PV, and $N$ is the number of capacities compared.
In this paper, nRMSE and nMAE are used to evaluate regional rooftop PV generation forecasting performance.
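For reference, the error metrics used in the evaluation can be sketched as follows. MAPE uses the standard definition; for nRMSE and nMAE the text does not state the normalization constant, so dividing by the installed capacity is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error in percent."""
    return 100.0 / len(actual) * sum(abs(a - p) / a
                                     for a, p in zip(actual, predicted))


def nrmse(actual, predicted, capacity):
    """RMSE in percent of capacity (normalization is an assumption)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * math.sqrt(mse) / capacity


def nmae(actual, predicted, capacity):
    """MAE in percent of capacity (normalization is an assumption)."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * mae / capacity


print(mape([2.0, 4.0], [1.0, 5.0]))                  # 37.5
print(nmae([0.0, 10.0], [2.0, 6.0], capacity=20.0))  # 15.0
print(round(nrmse([0.0, 10.0], [2.0, 6.0], capacity=20.0), 2))  # 15.81
```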

Unauthorized Rooftop PV Detection Results
As mentioned in Section 4.1, the unauthorized PV detection model is tested on 1350 homes, that is, all homes except the 150 homes with authorized PV. They consist of 150 homes with rooftop PV and 1200 homes without rooftop PV. The more closely the predicted result matches the actual configuration, the better the performance of the PV detection model. The detection simulation is run for 100 rounds, with different authorized homes in each round. The accuracy metrics are expressed as percentages; the larger the value, the better the performance of the detection model. As mentioned in Section 3.1.4, an MLP is used for unauthorized PV detection. MLP is a representative machine learning method for classification or prediction and was chosen because it performs well in classification problems. The MLP was used with its default parameters. The detection accuracy is shown in Tables 4 and 5. Table 4 shows the detection result of the method in [39], and Table 5 shows the detection result of the method proposed in this paper. In both tables, the Best, Worst, and Average columns give the best, worst, and average accuracy over the 100 rounds. Table 5 shows that adding features to train the detection model improves performance.

Unauthorized Rooftop PV Capacity Estimation Results
In this section, the capacity of the detected PV is estimated. Two cases of capacity estimation, that in [35] and that in this paper, are presented. The capacity estimation simulation is run for 100 rounds. Table 6 shows the capacity estimation results of [39] and of the proposed method. As in Section 4.3.1, the best, worst, and average results are reported in Table 6. Modifying the unauthorized PV detection model improves the capacity estimation performance.

Regional Rooftop PV Generation Forecasting Results
The result of clustering the rooftop PV systems is shown in Figure 8. The optimal number of clusters in K-means clustering is two, as determined by the silhouette coefficient [38]. Next, six rooftop PV systems are chosen as representative solar sites for each cluster. Then, the hourly generation of each individual rooftop PV system for the next day is predicted by the SVR model [16]. As for detection and capacity estimation, PV generation forecasting is run for 100 rounds, with the authorized PV and the representative solar sites selected differently in each round by random sampling. Two prediction accuracy metrics, nRMSE and nMAE, are used to evaluate the individual PV generation forecasting model. The distribution of the individual PV generation forecasting results is shown in Figure 9. PV power and weather forecast data covering 2 years are used. Data from 1 July 2011 to 3 February 2013 (i.e., 584 days) are used to train the PV generation forecasting model, and data from 4 February 2013 onward (i.e., 147 days) are used to test it. The average nRMSE and nMAE are 9.18% and 4.60%, respectively. Once individual PV generation forecasting is completed, regional PV generation is predicted through the upscaling method, as mentioned in Sections 3.3.3 and 3.3.4. The regional PV generation forecasting error is shown in Table 7. These errors are obtained as the average of the prediction errors of PV generation in each region over the 100 simulation rounds. The difference in the estimation results in Tables 6 and 7 is due to the distribution of solar capacity. In most areas, such as California, the typical solar capacity is 1-3 kW [33]. However, in [39], detection was conducted on houses with 4-6 kW capacity. For houses with a large installation capacity, the effect of PV on the net load pattern is clear; for houses with a small installation capacity, it is not, which makes detection difficult.
The proposed detection takes these characteristics into account, and this difference improves both the capacity estimation of the proposed method and the performance of regional PV generation forecasting. Regional PV generation forecasting that considers unauthorized PV detection and capacity estimation performs much better than forecasting that does not. This is because the detection and capacity estimation models reduce the uncertainty of the unauthorized PV capacity.
In Table 7, Case 1 does not consider unauthorized PV detection, Case 2 considers the unauthorized PV detection in [39], and Case 3 considers the unauthorized PV detection proposed in this study.
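The two forecasting error metrics reported in Table 7 can be computed along the following lines. This is a hedged sketch: it assumes that both errors are normalized by installed PV capacity, which is a common convention but may differ from the normalization defined earlier in the paper, and all numeric values are hypothetical.

```python
import math


def nrmse(actual, predicted, capacity):
    """Root mean square error normalized by installed capacity, in percent."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return 100.0 * math.sqrt(mse) / capacity


def nmae(actual, predicted, capacity):
    """Mean absolute error normalized by installed capacity, in percent."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return 100.0 * mae / capacity


# Hypothetical hourly generation in kW for a region with 5 kW of rooftop PV.
actual_kw = [0.0, 1.0, 2.0, 1.0]
forecast_kw = [0.0, 1.5, 1.5, 1.0]
capacity_kw = 5.0

print(f"nRMSE = {nrmse(actual_kw, forecast_kw, capacity_kw):.2f}%")  # about 7.07%
print(f"nMAE  = {nmae(actual_kw, forecast_kw, capacity_kw):.2f}%")   # 5.00%
```

Because nRMSE squares each error before averaging, it penalizes large hourly misses more heavily than nMAE, so the two metrics together indicate both typical and worst-case behavior.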
The effect can also be seen in Figure 11. Case 1 is the regional PV output prediction without unauthorized PV detection and capacity estimation. Case 2 is the regional PV output prediction with the unauthorized PV detection and capacity estimation in [35]. Case 3 is similar to Case 2 but uses the proposed unauthorized PV detection method instead. Here, $r_{Au}$ is assumed to be 0.5 based on the Cape Town case [29]. In Case 1, where unauthorized PV detection is not considered, the predicted values differ from the real generation values because the upscaling factor is obtained incorrectly by considering only authorized PV installations. In Case 2, the upscaling factor error, that is, the gap between the real and estimated PV capacity, is reduced through unauthorized PV detection. Furthermore, in Case 3, adding features that better discriminate unauthorized PV to the detection model further reduces the error in predicting solar power generation.

Upscaling Factor Analysis
The upscaling factor is key to predicting regional PV generation accurately. In this section, upscaling factors under several situations are compared to highlight the importance of unauthorized PV detection and capacity estimation.
The upscaling factor distributions are shown in Figure 12. In Case 1, the range of the upscaling factor differs significantly from the real values, which results in a large error in predicting regional PV output. In Case 2, in contrast, unauthorized PV detection brings the estimated upscaling factor close to the real value. In addition, by improving the detection accuracy of unauthorized PV, Case 3 has a smaller upscaling factor error than Case 2. As Figure 12 shows, the improved unauthorized PV detection model makes it possible to improve the prediction of regional PV output.
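The effect of missing capacity on the upscaling factor can be illustrated with a small numerical sketch. It assumes the common definition of the upscaling factor as the ratio of total regional PV capacity to the capacity of the representative sites, which may differ in detail from Sections 3.3.3 and 3.3.4. All capacities are hypothetical; the equal split between authorized and unauthorized capacity is chosen here only to mirror an authorized ratio of 0.5.

```python
def upscaling_factor(regional_capacity_kw, representative_capacity_kw):
    """Ratio used to scale representative-site generation up to the region."""
    return regional_capacity_kw / representative_capacity_kw


# Hypothetical capacities in kW, for illustration only.
authorized_kw = 450.0             # known to the utility
unauthorized_true_kw = 450.0      # invisible to the utility
unauthorized_estimated_kw = 405.0 # output of detection + capacity estimation
representative_kw = 18.0          # capacity of the representative solar sites

real_factor = upscaling_factor(authorized_kw + unauthorized_true_kw,
                               representative_kw)
# Case 1: unauthorized PV ignored, so only authorized capacity is counted.
case1_factor = upscaling_factor(authorized_kw, representative_kw)
# Cases 2 and 3: estimated unauthorized capacity is added.
estimated_factor = upscaling_factor(authorized_kw + unauthorized_estimated_kw,
                                    representative_kw)

# Scale a hypothetical representative-site forecast (kW) up to the region.
representative_forecast_kw = 9.0
regional_forecast_kw = estimated_factor * representative_forecast_kw

print(real_factor, case1_factor, estimated_factor, regional_forecast_kw)
```

In this example, ignoring unauthorized PV halves the upscaling factor, so the regional forecast is roughly half of the real output regardless of how accurate the individual site forecasts are.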

Feature Correlation Analysis
In this section, the relationship between the features and the results is analyzed for detection. The maximal information coefficient (MIC) is used to analyze the correlation between the features and the results for detection and capacity estimation [39]. The MIC values of the features are shown in Table 8. According to the MIC values, F D 5 and F D 6 used in this study show a stronger correlation with the result. Overall detection accuracy is improved by the use of these features.

Conclusions
This study presents a new forecasting method for regional rooftop PV power generation. It aims to accurately forecast the aggregated power of rooftop PV in an entire region in the presence of unauthorized PV installations. In the first step, an unauthorized PV detection model based on an MLP trained with virtual TNLP and MNLP is proposed to detect whether rooftop PV is installed or not. In the second step, an unauthorized PV capacity estimation model is proposed that deals with the imbalance of the PV capacity distribution through virtual NL generation based on a bootstrap approach. In the final step, regional rooftop PV generation forecasting based on an upscaling method that considers unauthorized PV installation is proposed. A realistic dataset from Sydney (NSW, Australia) consisting of 300 residential customers with rooftop PV systems was used to evaluate the performance of the proposed methodology. The results show that the proposed methodology has good overall performance compared with previous regional rooftop PV generation forecasting approaches. Furthermore, the impact of unauthorized PV detection and capacity estimation on the upscaling factor is investigated. The results indicate that PV detection and capacity estimation reduce the upscaling factor error under unauthorized PV installation. Additionally, the analysis of the proposed features for the detection model shows the effectiveness of the proposed methodology. In conclusion, the proposed methodology can contribute to accurate regional PV output forecasting. Future work possibilities are as follows:
1. Investigating the impact of NL home-owned energy storage and electric vehicles on unauthorized PV detection performance.
2. Exploring rooftop PV capacity uncertainty in addition to unauthorized PV installation, for example, rooftop PV faults and real-time rooftop PV penetration.
Author Contributions: T.K. conceived of the idea for the research and performed the simulations as the first author. J.K. led and supervised the research and is the corresponding author. Both authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.