Generation Data of Synthetic High Frequency Solar Irradiance for Data-Driven Decision-Making in Electrical Distribution Grids

: In this paper, we introduce a model representing the key characteristics of high frequency variations of solar irradiance and photovoltaic (PV) power production based on Clear Sky Index (CSI) data. The model is suitable for data-driven decision-making in electrical distribution grids, e.g., descriptive/predictive analyses, optimization, and numerical simulation. We concentrate on solar irradiance data since the power production of a PV system strongly correlates with solar irradiance at the site location. The solar irradiance is not constant due to the Earth’s orbit and irradiance absorption/scattering from the clouds. To simulate the operation of a PV system with one-minute resolution for a speciﬁc coordinate, we have to use a model based on the CSI of the solar irradiance data, capturing the uncertainties caused by cloud movements. The proposed model is based on clustering the days of each year into groups of days, e.g., (i) cloudy, (ii) intermittent cloudy, and (iii) clear sky. The CSI data of each group are divided into bins of magnitudes and the transition probabilities among the bins are identiﬁed to deliver a Markov Chain (MC) model to track the intraday weather condition variations. The proposed model is tested on the measurements of two PV systems located at two different climatic regions: (a) Yverdon-les-Bains, Switzerland; and (b) Oahu, Hawaii, USA. The model is compared with a previously published N -state MC model and the performance of the proposed model is elaborated.


Introduction
The energy transition pathway has been proposed in global and regional scales for decarbonization of the energy sector and its transformation to a sustainable system, keeping global warming below the permissible limit. The 21st Conference of the Parties (COP21) Paris Agreement of 2015 has individuated the global scale pathway [1]; as an example of a regional pathway, the Swiss Federal Office of Energy (SFOE) has developed the "Energy Strategy Act 2050" to prepare Switzerland for the energy transition [2]. It is supposed to enable Switzerland's grid to withstand a high capacity of renewable energy through the promotion of small photovoltaic (PV) systems in electrical distribution grids by exploiting the benefits of power industry digitalization [3]. However, the diffusion of PV systems and the high frequency variations of power flow in electrical distribution grids bring operational challenges such as voltage deviations due to the excess/deficit of generation, power quality decrement, and reliability issues [4]. To address these challenges, the power industry digitalization enables us to use data-driven decision-making [5].
For analyzing and solving the adverse impacts of the high penetration of PV systems in electrical distribution grids, data-driven decision-making based on numerical simulation is practical. For this purpose, we have to simulate the operation of an electrical distribution grid over time under the high penetration of PV systems. A numerical simulation requires a model representing the key characteristics and behaviors of the physical systems and processes [6]. It is worth mentioning that using a model to generate synthetic data instead of using the recorded real measurement data for numerical simulation has the following benefits: 1.
In contrast to recorded measurement data and to approaches used in commercial software [7], a model is more general and non-biased against the occurrence of a specific scenario.

2.
For stochastic numerical simulation, an infinite amount of data can be generated. 3.
To investigate the sensitivity of a numerical simulation to specific model parameters, we can change the parameters of the model.
In this paper, we address the problem of finding a suitable model for PV systems' power production that may be used for the numerical simulation of an electrical distribution grid. Since the power production of a PV system in an electrical distribution grid strongly correlates with the solar irradiance data, we have to model the high frequency variations of solar irradiance at the site coordinate. The solar irradiance variations depend on the Earth's orbit and irradiance absorption/scattering from the clouds.
An exemplary application of the numerical simulation of an electrical distribution grid has been presented in [8], which is based on bucket testing. The robustness of a treatment strategy in electrical distribution grids for solving the issues of high PV penetration has been evaluated in [8], using a numerical simulation based on a stochastic model of PV systems power production. Numerical simulation has been used in [9] to perform a predictive analysis of the grid imbalance risks in the presence of a Markov Chain (MC) model of PV systems power production. Finally, the applications of numerical simulation have been reviewed in [10] with the objective of optimizing PV and battery storage systems integration.
The performance of data-driven decision-making based on numerical simulation for an electrical distribution grid highly depends on the accuracy of the PV system model. The uncertainties of PV systems power production, autocorrelation, and correlation with other variables directly affect the quality of data-driven decision-making in electrical distribution grids. To model and calculate the PV systems power production variations, the uncertainty related to the solar irradiance is normally modeled through the Clear Sky Index (CSI). The CSI is defined as the ratio between the Global Horizontal Irradiance (GHI) and corresponding clear sky GHI at the same time and at the same coordinate [11]. The CSI considers the condition of the sky, particularly cloud movements in a short-term scale, depending on the location and seasonal time.
In what follows, a review of previous works dealing with models proposed to reproduce PV power production variations is presented.

Literature Review
A recent literature survey has been given in [4] for describing the models of PV systems power production variations, which are used in the numerical simulation and data analysis studies. The most important features of PV systems power production variations are Probability Density Functions (PDFs) and Temporal Autocorrelation Functions (TAFs). The Beta PDF as presented in [12][13][14] and Gaussian PDF as given in [15,16] have been used for modeling the PV systems power production variations. The TAF of CSI time-series has been analyzed in [17][18][19], where the following four frequencies of TAFs have been discovered in the measurement data: (i) very short-term variations with a period of one minute; (ii) short-term variations with a period of one day; (iii) mid-term variations with a period of one month; and (iv) long-term variations with a period of one year.
The PV systems power production may be simulated with limited accuracy when we have both PDF and TAF data of solar irradiance variations. In order to improve the accuracy of numerical simulation, more sophisticated models of PV systems power production variations have been developed in recent years with the objective of accurate modeling. In this regard, there are three directions for improving the models of the PV systems power production variations: (i) using white-box models that return numerical weather predictions; (ii) adding other sources of data such as recorded meteorological input data; and (iii) using a grey-or black-box modeling strategy.
In the direction of the first strategy (i.e., using white-box models), a prediction model is presented in [20] that takes into account the thermal effects that occur in a PV system as well as the temperature-dependent nature of the PV system efficiency. In the direction of the second strategy (i.e., adding other sources of meteorological input data), Bright et al. [21] used the MC model to produce stochastic time-series of cloud cover depending on the seasons and prevailing pressure system; then, the transition probability matrices were trained using 10 years of coarse meteorological data (excluding the solar irradiance). Bright et al. [22] demonstrated that the synthetic one-minute resolution of solar irradiance time-series, varying on a spatial dimension, may be generated with the following inputs: (i) hourly average meteorological observations of okta (in meteorology, an okta is a unit of measurement used to describe the amount of cloud cover at any given location such as a weather station); (ii) wind speed; (iii) cloud height; and (iv) atmospheric pressure. A synthetic downscaling simulation of real CSI data based on clustering and an MC model has been developed in [23], in which the hidden patterns and trends in very short-term resolution have been discovered. The output of [23] is a simulated one-minute GHI dataset downscaled from 10 to 20 min real recorded data. (The standard of available meteorological data of next-generation satellites would be in 10 to 15 min.) Finally, the prerequisite on long historical data for downscaling and data discovery of solar irradiance variation has been relaxed in [24], which has used the datasets existing in the World Radiation Monitoring Center (WRMC), the central archive of the Baseline Surface Radiation Network (BSRN) [25]. In the direction of the third strategy (i.e., using grey-or black-box modeling strategy), the most recent models are copula [26], neural-network based [27], ensemble methods [28], deep learning [29], and MC [30,31]. Moreover, a synthetic long-term dataset of CSI timeseries has been generated in [32] that is statistically indistinguishable from the observed data. The model of [32] requires an input from 10 to 15 annual time-series of hourly values that may be retrieved from satellite-based GHI data. The Hidden Markov Models (HMMs) with Gaussian observation have been used in [33] to generate synthetic CSI data. In [34], a PV systems power production model based on MC has been used for optimizing the size of a battery storage system.
The third strategy has two main advantages compared to the first and second ones as follows: (i) regarding the third strategy, the use of white-box models is computationally intensive, and numerical weather predictions are typically obtained from third-party providers; and (ii) there is not any presumption on the behaviour of PV systems power production in the third strategy and, as a result, it is more accurate. Due to the abovementioned advantages, we concentrate on the third strategy for modeling PV systems power production variations. In this paper, we use a grey-box modeling strategy, which is only based on solar irradiance data input. (The solar irradiance data at a location may be measured easily by a pyranometer.) Therefore, it is in the direction of the mentioned third strategy. According to the dependency of PV systems power production and the solar irradiance, the data of PV systems power production are generated to be used for the numerical simulation of electrical distribution grids.

Contribution
We propose a model based on a number of individual N-state MCs for ensuring the consistency of synthetic and real data across different time resolutions, i.e., one minute, daily, and monthly. Since the proposed model is suitable for short-to mid-term variation frequencies, we concentrate on one-minute, daily, and monthly temporal variations of solar irradiance, whereas we neglect the yearly variations.
Compared to the models of [21][22][23]26], where many variables such as wind speed, cloud cover, and weather temperature are needed, the model proposed in this paper only requires the measurement of GHI. Compared to [24], where the data of many locations are used for training the model simulating the downscaled one-minute PV systems power production of one day, the proposed model of this paper requires the data of one location and they may be used for the numerical simulation of PV systems power production throughout one year. Compared to the neural network model in [27], where the data of many years are required for training the model, the proposed model of this paper needs minimum measurements of one year. Compared to [30], where a two-state MC model has been developed, the proposed model of this paper uses N-state MC, which is more accurate. In addition, compared to [31], we develop an N-state MC model for each group of days with (i) cloudy, (ii) intermittent cloudy, and (iii) clear sky behavior, whereas a single N-state MC model has been developed in [31]. Compared to [32], where the synthetic data of PV systems power production have been used for analyzing the profitability/risk of financing PVs in long-term resolution by means of a model based on 10 to 15 years of real data, the model proposed in this paper is advantageous for numerical simulation in shortto mid-term analysis and it is trained by a limited amount of input measurements.
The contributions of this paper are as follows: • We provide a comprehensive model of PV systems power production that is suitable for the numerical simulation of electrical distribution grids. • The dependency of temporal weather variations, including one-minute, daily, and monthly variations on the PV systems power production, is considered. A model is developed to generate a sequence of PV systems power production with a random average based on the N-state MC model of that day. • A distinct N-state MC model is proposed for each group of days characterized by cloudy, intermittent cloudy, and clear sky. The proposed model is tested on two locations with "warm and temperate" and "tropical wet and dry/savanna" climates.

Organization
The structure of the paper is organized as follows: the proposed model, which is used for generating the synthetic data of PV systems power production, is explained in Section 2. Possible applications of the proposed model are described in Section 3. The experimental results, analyses, and discussions are presented in Section 4. In addition, the proposed model is compared with the state-of-the-art one in Section 4. The conclusions and suggestions for future lines of work are given in Section 5.

Proposed Model
The details of the proposed model for generating synthetic data of PV systems power production are described in Figure 1. The procedure is represented in three steps, namely data preprocessing, model training, and synthetic data generation. Afterwards, as depicted in Figure 1, there is an evaluation step to analyze the performance of the proposed model.
In this study, it is assumed that we have data of measured GHI, i.e., G(m, d, t), in oneminute resolution as an input of the model. The month is indexed by m ∈ {1, . . . , 12}, the day is denoted by d ∈ {1, . . . , D m }, where D m is the number of days in month m, and the time step of a day is indexed by t ∈ {1, . . . , 1440}. The objective is to generate synthetic one-minute data of GHI, i.e.,Ĝ(m, d, t), capturing the main features of original time-series, including the impacts of (i) one-minute, (ii) daily, and (iii) monthly weather variations on GHI. Then, the PV systems power production data are generated based on the model of the PV system, including the power production efficiency.
Step 1 (Preprocess Data): -Prepare CSI data; -Eliminate outliers; -Add feature of weather type, which may be cloudy, intermittent cloudy, or clear.
Step 2 (Train Model): -Calculate statistics of average daily CSI; -Calculate three different N-state transition probability matrices for days with cloudy, intermittent cloudy, or clear sky.
Step 3 (Generate Synthetic Data): -Generate average daily CSI for each day based on the associated PDF; -Determine the weather type (cloudy, intermittent cloudy, or clear) for each day based on considered thresholds; -Generate MC state time-series based on the associated transition probability matrix; -Generate GHI based on the synthesized CSI; -Generate power production of PV system based on its model and generated synthetic data of GHI.
Evaluate Synthetic Data -Compare synthetic GHI time-series with real test data.

Step 1 (Preprocess Data)
In most cases, the input data are incomplete, inconsistent, and contain outlier errors. Furthermore, measurement equipment may have inaccuracies. In order to make the proposed model resistant to flaws in the input data, we remove the outliers by means of Tukey's test [35]. Based on this test, the tolerance band is between G and G, where G 0.25 and G 0.75 are the 0.25 quantile (25 percentile) and the 0.75 quantile (75 percentile) of the input data of measured GHI, i.e., G(m, d, t), respectively. Then, we evaluate the missing values. Since the aim is to capture temporal variations of GHI in the final model, we calculate the average of two points around a missing point and replace the missing point with it. If there are consecutive missing points, we replace them all with the average of two points around the missing points. (Note that if there are more than six consecutive missing points, we neglect the data of that specific day for building our model.) Next, to model the stochastic temporal variations of instantaneous GHI, the CSI data are generated, defined as the uncertain variable of the solar irradiance when the impact of deterministic variations of the sun's position in the sky has been removed. Mathematically, it is defined as where G c (m, d, t) is the deterministic clear sky GHI over time step t of day d of month m, calculated for latitude/longitude coordinates of the PV site.
Then, we split the CSI data of each month m into training and testing datasets. We build our model using the training datasets of all months to capture the monthly weather variations of CSI.
Finally, we define the feature of day type. Each day may be clustered into cloudy, intermittent cloudy, or clear. (We may have a different number of clusters (other than three) depending on the input data. Here, we present the model for three clusters without loss of generality as we may present a similar model with another number of clusters.) To this end, we define the following rule for determining the type of each day.
where η(m, d) and σ(m, d) are the average and standard deviation of CSI(m, d, t) over day d at month m; η th1 and η th2 are two thresholds to group the day type based on the CSI average; σ th1 and σ th2 are two thresholds to group the day type based on the CSI standard deviation.
The thresholds η th1 , η th2 , σ th1 , and σ th2 are tuned such that the cloudy, intermittent cloudy, and clear days are distinguished. To this end, the following clustering problem must be solved.
arg min where Problem (5) is a conventional clustering problem with a predetermined number of clusters [36]. In the objective, the distances of CSI mean and standard deviation throughout the year from the clusters' centers, i.e., 1+η th2 2 , and 1+σ th2 2 , are minimized.
It is worth noting that the formulation of (5) does not imply that weather variations throughout the day are ignored. The intraday weather variations are captured in the model with MC in another layer of the model described in the second step.

Step 2 (Train Model)
The average daily CSI is an uncertain variable, depending on the daily weather variation. We capture this uncertain variable by 12 different PDFs referring to the months of the year. Using parametric PDFs fitted upon the available data for the given location, the parameter η(m, d) of each day is determined.
The selection of the most appropriate PDF family for the average daily CSI at a given location is an open research topic. In this paper, we concentrate on investigation of several PDF families taken from the relevant literature and choose those that best fit the empirical samples. The parameters of the PDFs are estimated through a Maximum Likelihood Estimation (MLE). The most adequate PDF for the specific location is then chosen through a Goodness of Fitting (GoF) evaluation score based on a Determination Coefficient (DC), i.e., R 2 [37,38].

Pool of Selected Probability Density Functions
Several PDFs have been considered for modeling the average daily CSI. For the sake of brevity, six of them are presented here, which are more precise for this type of data. Three of them (Beta, Kumaraswamy (Kuma) and Logit-Normal (LogitN)) are suitable for fitting random variables bounded within [0, 1], which is the typical range of average daily CSI values [13,39]. The other three (Logistic (Logist), Log-Logistic (Log-Log), and Generalized Extreme Value (GEV)) are suitable instead for modeling unbounded random variables; however, they may also be useful to characterize bounded variables such as average daily CSI [40]. The formulations of the PDFs are briefly recalled below.
(i) The Beta PDF is where a and b are shape parameters.
(ii) The Kumaraswamy PDF is where c and d are shape parameters as well.
(iii) The Logit-Normal PDF is where µ and σ are the average and standard deviation of normally distributed logit(x).
(iv) The Logistic PDF is where λ and s are the average and scale parameter, respectively.
(v) The Log-Log PDF is related to the Logistic PDF. The logarithm of generated variable by Log-Log has Logistic PDF. The Log-Log PDF is where µ and v are the Log average and Log scale parameter.
(vi) The GEV PDF is where ρ is the location parameter, c is the scale parameter, and k is the shape parameter.

Parameter Estimation of Probability Density Function and Goodness of Fitting Test
The parameters of the PDFs are estimated through an MLE, differentiated for each month m. For example, the parameters a(m) and b(m) of the Beta PDF are estimated by solving the following optimization problem: that maximizes the likelihood of the monthly data on the considered PDF. It is trivially extended to all other PDFs for consistency. The DC is calculated for each PDF to assess the GoF score quantitatively. The DC is defined as where B is the number of bins in which the domain of the random variable is divided for the GoF score assessment, p bi is the probability of the random variable to lie within the bin determined from the estimated PDF, N bi is the number of observed samples within the b th bin, and N bi = 1 B ∑ B i=1 N bi . A greater value of the DC determines better fit and indicates preferable PDF selection.

Markov Chain Training
For each group of days, an N-state (It is important to note that the number of states, i.e., N, must be less than 1/CSI res , where CSI res is the maximum possible resolution at which we can measure CSI.) MC model is developed. Each model, developed for a day type, is trained on data including CSI time-series. We discretize the data based on the number of states, i.e., N. Thus, we define the State of CSI (SCSI) as follows: Based on this configuration of states, the transition matrix is calculated as where n ij is the number of transitions from state i to state j from one time step to the next. It is worth mentioning that the transition matrix P(Type) is a function of the day type (cloudy, intermittent cloudy, or clear). Thus, three transition matrices P(Type) are computed using three separate sets of data for days with cloudy, intermittent cloudy, and clear sky. To compute n ij , the data of the day type are built. Then, the number of transitions from state i to state j in these data are enumerated.

Step 3 (Generate Synthetic Data)
Once the models for the three mentioned clusters of days are trained, they are used to generate the synthetic data with the following procedure.
First of all, for generating the data of day d of month m, we have to decide on the average weather of that day, whether the sky is cloudy, intermittent cloudy, or clear. To this end, we generate a random number using the PDF with the associated parameters of month m. This generated random number is the average η(m, d). Then, using the computed thresholds resulting from the optimization problem (5), the type of day d of month m is determined.
Next, for that day, we consider the initial CSI(m, d, 0) equal to the random number η(m, d) generated by the PDF. We use the associated transition probability matrix of that day type for calculating the state transition for the next time steps until the whole 1440 time steps are generated. It is worth mentioning that with an initial value CSI(m, d, 0) equal to the random number η(m, d) and generating the time-series based on the transition matrix, the average CSI of that day would be η(m, d) [31].

Evaluate Synthetic Data
Upon generating the synthetic data of CSI, the GHI data and PV systems power production data are generated using the CSI model of GHI and the model of the PV system. The generated synthetic GHI is compared with the real test data in terms of: • similarity of PDFs in different months to evaluate the impacts of monthly weather variation on GHI; • TAF similarity of average daily GHI for evaluating the impacts of daily weather variation on GHI; and • TAF similarity of GHI time-series for evaluating the impacts of one-minute-resolution weather variations.
By comparing the standard TAF for real and synthetic data, three mentioned similarities are evaluated. The standard TAF is defined as whereσ is the standard deviation of CSI over the year, τ is the lag of TAF in minutes, andμ is the average of CSI over the whole data. In addition, E[.] is the expectation operator. Since TAF(τ) does not capture monthly and daily variations, the standard TAF is not the correct index for our study.
To validate the performance of the proposed model, we define the following metric, which we refer to as the modified version of TAF.
where η(m, d) and σ(m, d) are the average and standard deviation of CSI at day d of month m, respectively. The difference between this modified version of TAF and the standard one is that in the modified version, the average daily CSI, i.e., η(m, d), and the standard deviation of CSI at each day, i.e., σ(m, d), are used, whereas the averageμ and standard deviation σ over the whole year are used in the standard TAF. Since the model of each month for weather variations and the transitional behavior of each day are different, the modified version of TAF is the best representative index for evaluating the synthetic data introduced in this study.

Benchmark Model
To evaluate the performance, the proposed model is compared with the state-ofthe-art model in [31]. The procedure for generating synthetic data based on this model includes three steps: (i) data preprocessing; (ii) model training; and (iii) synthetic data generation. In the data preprocessing step, the CSI data are prepared and the outliers are eliminated. In the model training step, a single transition probability matrix is calculated with N-state for all days. In the synthetic data generation step, the MC state time-series is generated based on the calculated transition probability matrix. Finally, the GHI and power production of PV systems are calculated based on the synthesized CSI.

Exemplary Applications of the Proposed Model
A possible application of the proposed model is Probabilistic Load Flow (PLF) analysis for studying the uncertainties of voltage/current profiles. The PV systems power production and other uncertain inputs such as Electric Vehicle (EV) charging and load demand are the electrical distribution grid's variables, whose models are useful for numerical simulation purposes. Figure 2 depicts a typical illustration of the process of PLF analysis. The PLF simulation requires a number of synthetic samples, extracted on the basis of predefined PDFs and TAFs. Then, the PDF of voltage and current magnitudes of medium-voltage (MV) and low-voltage (LV) distribution grids are determined using the power flow equations. Another application of the proposed model is the validation of developed algorithms in future electrical distribution grids. Such an application is investigated in the project "Grid Data Digger" [41]. The innovative aspect of this project is that it valorizes available data to system operators through a data-driven and automated process. It particularly aims at providing to the system operators all necessary information and analysis for a secure and optimal operation of the electrical distribution grids. It is achieved by developing a big-data platform with descriptive, diagnostic, predictive, and preventive data analysis applications dedicated to the electrical distribution grid's operation. A part of this emulated big-data platform is the modeling of PV systems power production, which is based on the proposed N-state MC model.

Input Data
To test the proposed procedure, the measured data of a PV system installed at Yverdonles-Bains, Switzerland (location (a)) were used. In addition, the proposed model was tested on the data obtained from the National Renewable Energy Laboratory (NREL) radiometer array in Oahu, Hawaii, USA (location (b)) [42].
The PV system at location (a) is a solar carport, i.e., a car shelter equipped with PV systems and with charging stations as illustrated in Figure 3. In total, 85 PV systems Bisol 295 W are installed. Twelve of them are dedicated to smart home production (i.e., 3540 W) and the rest is connected to a smart-grid laboratory via three-phase converters with the possibility to generate capacitive or inductive current. This infrastructure is used for validating data-driven decision-making such as PLF analysis.
The input data include one-minute time-series of measured GHI from 1 January 2016 to 31 December 2017 at location (a) and one-minute time-series of measured GHI from 1 April 2010 to 31 October 2011 at location (b). In addition, using the "pvlib" package [43] in "Python", we obtained the clear sky Ineichen model of GHI, i.e., G c (m, d, t), for the same locations and timings. Based on (3), we calculated CSI(m, d, t) for the time-series. The data of 2016 were used for training the model of location (a) and the data of 2017 were used as a testing dataset. For location (b), the data of 1 April 2010 to 31 March 2011 were used for training and the rest was used for testing.

Model Training on Data of Location (a)
The DC values calculated on the estimated PDFs for the average daily CSI at location (a) are shown in Table 1; bold values indicate the best fit for each month. The Beta, LogitN, and Kuma PDFs are the best fit in eleven of twelve months, with the GEV being instead the most suitable PDF for the remaining month.
The (a) PDFs and (b) Cumulative Density Functions (CDFs) of the average daily CSI for the months of February and June are illustrated in Figures 4 and 5, respectively. The figures include the Beta, LogitN, Kuma, and GEV PDFs. Note that the Beta and Kuma PDFs and CDFs practically overlapped due to the theoretical similarity between the two PDFs.   Three transition matrices are calculated with N = 21, i.e., the number of states is equal to 21. The transition matrices are presented in Figure 6a-c with colorbar plots for days with cloudy, intermittent cloudy, and clear sky, respectively. We use the "quantecon" package in "Python" for producing the transition probabilities from the input measured data.
The sum of probabilities of each column and row of transition matrices in Figure 6a-c must add up to unity based on the Markov requirement for transition probabilities. In Figure 6a, the state of CSI for the values less than 0.33 is persistent since the CSI is close to zero. On the other hand, the state of CSI for the days with clear sky is persistent for values more than 0.81. Finally, by observing the transition probabilities for the days with intermittent cloudy sky, one may see that the CSI transition within zero to one is possible. It is noticeable that strong persistence of CSI, i.e., high values of the diagonal elements and of those close to the diagonal elements in the transition matrix, is expectable for one-minute data since the weather variations in one-minute resolution are not considerable.  We calculated one transition matrix with N = 21 for the whole dataset based on the state-of-the-art model in [31]. The calculated transition matrix is presented in Figure 7. Here, we do not distinguish among the days with cloudy, intermittent cloudy, and clear sky.
In addition, one model has been developed for the whole year and the monthly weather variations have been neglected. This state-of-the-art model is used as a benchmark in this study.

Synthetic Data Generation at Location (a)
We generated synthetic data of CSI and GHI based on the two models (i.e., our proposed model and the state-of-the-art models in [31]). The data of real GHI and the synthetic GHI based on the two models for month June 2017 are illustrated in Figure 8a By comparing the real data with the two synthetic data visually, one may observe that the daily weather variations are not captured in the state-of-the-art model as all days behaved similarly. However, in our proposed model, some days have small CSI as they are cloudy and some of them have persistently high CSI as they are days with clear sky. It is worth mentioning that the CSI at each time step is calculated as the ratio of G(m, d, t), given by the red lines, and G c (m, d, t), given by the blue lines in Figure 8a-c.

Evaluation of Synthetic Data of Location (a)
In Figure 9a,b, the standard TAF and the defined performance index (modified version of the TAF) are presented, respectively, for synthetic data based on the state-of-the-art model [31] and the proposed model of this paper. From the viewpoint of the modified metric, the synthetic data of the proposed model have a TAF closer to the real data. As one may see in Figure 9b, the modified TAF of the synthetic data based on the proposed model perfectly matches the real data. This was expected, since three different transition matrices were developed for days with cloudy, intermittent cloudy, and clear sky, while in the state-of-the-art model [31], one transition matrix was developed for the whole year data. In addition, the monthly weather variations are not considered in the state-of-the-art model. Finally, the introduced modified version of TAF captures all mentioned temporal variabilities of the data for comparison.

Model Training on Data of Location (b)
Results of the model training on the location (b) data are presented in a compact form for conciseness. The DC values calculated on the estimated PDFs for the average daily CSI at the location (b) are shown in Table 2; bold values indicate the best fit for each month. The GEV is the best fit in four months, followed by the Beta (three months) and by the LogitN, Kuma, and Logist PDFs (two months each). Compared to the outcome evidence for location (a), the different statistical behavior of the average daily CSI at location (b) is due to the generally better weather conditions. This is particularly clear by observing the (a) PDFs and (b) CDFs of the average daily CSI for June, which are illustrated in Figure 10. The figure includes the Beta, LogitN, Kuma, and GEV PDFs only, for readability.  The real data of GHI for month June 2011, the synthetic data based on the state-of-theart model, and the synthetic data based on the proposed model are depicted in Figure 11a-c, respectively. As one may see, there is not a daily weather variation in the real data due to the climatic condition. In other words, the weather type of most of the days is classified as intermittent cloudy in location (b). As a result, the advantage of using our proposed model is marginal compared to the state-of-the-art model. In Figure 12a,b, the standard TAF and the defined performance index (modified version of the TAF) are presented, respectively, for the data of location (b).

Conclusions and Future Works
In this work, we present a model for the synthetic generation of data of PV systems power production for the numerical simulation of electrical distribution grids. The generated synthetic data are supposed to be used in the project "Grid Data Digger" for testing the data-driven algorithms in future electrical distribution grids. The proposed model is based on three different N-state MCs and a hierarchy of different temporal-scale weather variations, i.e., one-minute, daily, and monthly. The synthetic data based on the proposed model have high similarity with the real data in terms of PDF and TAF metrics compared to the synthetic data based on the state-of-the-art model. The added value of the proposed model is that it considers the seasonal, intra-month, and intra-day variability of the solar irradiance. Therefore, the proposed model is particularly suitable for the numerical simulation of electrical distribution grids.
In future works, the spatiotemporal correlation of CSI can be modeled using the real data of more locations. The Heliosat model [44], which is based on satellite data processing instead of local measuring stations, can be used to repeat the evaluation of proposed models for different geographical positions. In addition, the yearly weather variations can be addressed in future models using the long-term data retrieved from the satellite estimations. Finally, future studies may look into the effects of other meteorological factors such as ambient temperature and wind velocity on the efficiency of PV systems for generating synthetic datasets.

Conflicts of Interest:
The authors declare no conflict of interest.