Semiparametric Semivariogram Modeling with a Scaling Criterion for Node Spacing: A Case Study of Solar Radiation Distribution in Thailand

: Geostatistical interpolation methods, sometimes referred to as kriging, have been proven effective and efﬁcient for the estimation of target quantity at ungauged sites. The merit of the kriging approach relies heavily on the semivariograms in which the parametric functions are prevalently used. In this work, we explore the semiparametric semivariogram where no close-form semivariogram is required. By additionally enforcing the monotonicity condition in order to suppress the presence of spurious oscillation, a scaling of the nodes of the semiparametric kriging is proposed. To this end, the solar radiation estimates across extensive but unmeasured regions in Thailand using three different semivariogram models are undertaken. A cross validation analysis is carried out in order to justify the performance of each approach. The best results are achieved by the semiparametric model with an improvement of around 7–13% compared to those obtained from the parametric semivariograms.


Motivation and Related Work
Spatial interpolation methods are essential tools for manipulating available but inadequate data to infill missing information. A wide spectrum of spatial interpolation approaches applied to several disciplines can be found in the literature [1][2][3]. Although most techniques are somewhat similar in the sense that estimations are calculated as the weighted average of known locations, they can be classified into two categories according to their mathematical underlying concepts; deterministic and stochastic methods. Prominent examples of deterministic approaches include spline functions [4], inverse distant weight [5], multiple linear regression [6]. Whilst those of stochastic-based methods which are usually regarded as kriging are ordinary kriging, universal kriging and regression kriging. Several studies attempted to compare between these two interpolation classes [7,8]. In general, it appears that if we take accuracy as a prime concern, kriging interpolation is a preferable option for reliable estimation. This stochastic-based process model has been applied to numerous types of climate variables including temperature [9], rainfall [10], soil properties [11] and solar radiation [12,13].
Kriging is a geostatistical method where an attribute value from each location is interpreted as a single realization of some random field [14,15]. Under this assumption, statistical properties such as unbiased 1. The node selection criterion in the semiparametric semivariogram is proposed in accordance with the enforced monotonicity constraint so that the estimator fulfills the additional theoretical properties as well as eliminate spurious fluctuation due to node misspecification. 2. A performance of semiparametric semivariogram is conducted and compared with parametric models in the case of the estimation of the spatial distribution of the monthly average daily solar radiation in Thailand. 3. The GA is employed as a means to automate the search for optimal parameter values via the objective function.

Article Structure
The rest of the paper is organized as follows. Section 2 provides a brief description of seimvariogram estimators, both parametric and nonparametric models, and corresponding ordinary kriging method. A mathematical framework for the semiparametric semivariogram and our proposed node selection criterion are presented in Section 3. In Section 4, a case study of Thailand's solar radiation is adopted to illustrate and compare the performance of three distinct semivariograms. The conclusion and discussion are given in Section 5.

Ordinary Kriging
Due to insufficient information about the variable of interest, kriging exploits the probabilistic concept where the uncertainty should be introduced in the description of the target quantity. Suppose that {Z(s) : s ∈ D ⊂ R d } is a spatial random process where D is the spatial domain and d ≥ 1. The random variable Z(s) can be writte as where µ(s) is the trend component of Z(s), and R(s) is the residual component with zero mean and stationary covariance. Let {Z(s 1 ), Z(s 2 ), . . . , Z(s n )} be a collection of samples at locations s 1 , s 2 , . . . , s n . The kriging estimate, Z * (s), at any location s can be expressed as a linear combination of available n observations where λ i indicates the kriging weight assigned to Z(s i ). Particularly, the ordinary kriging assumes the random process to be intrinsically stationary. This leads to the expected difference of the samples being zero and the variance of the estimation can be formulated in terms of the semivariogram To evaluate the minimum variance of estimation (3) subject to the unbiasedness constraint ∑ n i=1 λ i = 1 [26], we apply the Lagrange multiplier method to this optimization problem which gives where µ is the Lagrange multiplier. Our goal is to determine the kriging weights in the Equation (2) which can be accomplished by using the semivariogram model described in the following subsection.

Semivariogram
The semivariogram quantifies the covariance structures of measured sample points with distance. The widely used semivariogram estimate,γ(h), was put forth by [14] where N(h) denotes the number of distinct pairs at a separation lag vector h. Henceforward, we assume the isotropy thereby leading to the semivariogram estimator,γ(h) being replaced byγ(h) where h is the Euclidean norm of h. Based on this resulting empirical semivariogram, the continuous spatial variability across the domain can be assumedly represented by smooth functions which is known as parametric semivariograms. Various prescribed functions including exponential, spherical and Gaussian functions have been exploited [27,28]. In this work, we employ two parametric krigings for the two dimensional space. One is the exponential model which is explicitly represented by and the spherical model where θ = (c 0 , c 1 , c 2 ). The parameter c 0 represents the nugget value induced by the spatial error when the distance smaller than the shortest imposed lag distance. Whereas the variance of the process is denoted by c 1 and a combination between c 0 and c 1 is referred to as sill. The parameter c 2 is the range in which it indicates a distance from the origin to the point of sill achieved. A choice of these parameters undoubtedly has a considerable impact upon the accuracy of kriging estimators. As opposed to the parametric semivariogram, a specification of the nonparametric semivariogram model is not required, this therefore provides a flexibility to compute estimate at any location. Under the isotropic and second-order stationary assumptions, the nonparametric semivariogram estimate can be written as a series from Yaglom's representation of Bochner's theorem [21,22]. In two-dimensional space, the approximated semivariogram can be expressed as where p = (p 0 , p 1 , p 2 , . . . , p m ) is a vector of nonnegative coefficients, the scalars t j , for j = 1, 2, . . . , m, are the jump points or nodes and J 0 is the Bessel function of the first kind of order zero. A goal is to find a vector p so that it can minimize the weighted sum squared errors (WSSE) [29] where L is the number of discrete lags. The weight is defined as in which this weight manifestation suggests that the short-lag spatial correlations induces strong weights. The Equation (9) is subject the following constraint where the nonnegative value b corresponds to the nugget effect.

Genetic Algorithm
Genetic algorithm primarily introduced by John Holland [30] is a class of stochastic optimization technique wherein its underlying principles are inspired by the biological evolution of living organisms. Besides its main contribution to the optimization and search problems, the method is often used as a tool for parameter identification. The process is carried out by a population of chromosomes which represents a collection of solutions to a problem. The chromosomes at each generation are assigned fitness values which indicate how likely they can survive. To produce a population of chromosomes for the subsequent generation with a higher probability of achieving optimal solutions, there are three genetic operations involved; selection, crossover and mutation. The process of chromosome evolution based on such operations is repeated until acceptable criteria are satisfied. The generic procedure of GA can be illustrated in Figure 1.

Semiparametric Semivariogram and Node Selection Criterion
On the other hand, by considering the semivariogram structures of both parametric and nonparametric models, Carmack et al. [25] purposed the semiparametric semivariogram where h in the nonparametric model is replaced by h α , for α ∈ [0, 1]. This gives rise to the semivariogram estimate An imperative part of the nonparametric semivariogram is to choose the right choice of node selection,t j in the Equation (8). Shapiro and Botha [20] and Cherry [31] utilized improptu means by using regularly spaced nodes so that t j = δj for j = 1, 2, . . . , m where δ is the positive number which can be chosen. However, it appears that equispaced nodes can lead to spurious oscillations due to the fact that estimators are not guaranteed to be positive definite in continuum [24]. Another ubiquitous suggestion for the node selection was made by Gorsich and Genton ([23,24]) who proposed that the nodes should be the roots of Bessel functions. This node representation results in the coefficients in Equation (8) being nonnegative as well as the convergence being ensured. Whilst Carmack et al. adopted the concept of the first root of the Bessel function of the first kind of order zero, t * 2 to define the nodes in the semiparametric kriging so that t j = t * 2 /h α j , for j = 1, 2, . . . , m. Despite the fact that, in practice, semivariogram structure can possess the hole effect whereby fitted semivariogram exhibits sinusoidal-wave form [32], we here restrict our study to a generally intuitive perception in which objects that are physically close to each other, should have a stronger correlation than those farther away. Given this assumption, the fitted semiparametric semivariogram might not be necessarily satisfied in some certain situations as illustrated in Figure 2. An alternative way for the node scaling is thus proposed by considering the monotonicity requirement for the fitted semivariogram. This precondition implies the first derivative of the estimated semivariogram (12) should be nonnegative, that is for i = 1, 2, . . . , L, where J 1 is the Bessel function of the first kind of order one. By assuming the node scaling in an adhoc manner for i = 1, 2, . . . , L, where t * β is the first root of J 1 , h 1 and h L are the first and final separation lag distances respectively. Since 0 < α < 1 and 0 < h 1 < h 2 < · · · < h m < · · · < h L , it consequently gives This leads to which ultimately results in for i = 1, 2, . . . , L and k = 0, 1, . . . , m − 1. By defining t j as in Equation (14), the Equation (13) can be written in terms of a series expansion: By this node restriction, the monotonicity condition is proven to be satisfied and this can ultimately prevent noisy behavior in the estimation.

A Case Study: An Estimation of Solar Radiation Distribution in Thailand
Renewable energy deployment has grown rapidly in the power sector in the last decade. In an attempt to maintain energy security and environmental sustainability over the coming years, the Thai government sets up a blueprint for 20-year Thailand's energy plan which is known as the Alternative Energy Development Plan-AEDP (2015-2036). By 2036, renewable energies are planned to account for 30% of net power generation capacity. In particular, giving its location in the equatorial regions, solar energy power has been an undoubtedly favorable renewable source in Thailand due to a large amount of incoming solar radiation all over the country throughout the year. Solar energy production is targeted to reach around 10 gigawatts or nearly 50% of the total alternative energy power capacity by 2036.
To attain the optimal benefits of such resource, one of the key factors for efficient planning and management is to examine the availability of spatially continuous data over the region of interest. Nonetheless, in practice, direct measurements are typically scarce and derived from point sources that are irregularly distributed. For these reasons, spatial interpolation techniques capable of quantitatively capturing the distribution of solar radiation are of great importance.

Data Description
The kriging technique with different types of semivariograms previously described has been applied to the estimation of solar radiation in Thailand. The study region is carried out across the whole country which extends over the area around 518,000 km 2 . The latitude ranges from 5 • 37 N to 20 • 27 N, whereas longitude 97 • 22 E to 105 • 37 E. In this study, a variation of solar radiation characteristics in Thailand is classified according to seasonal periods, namely the summer season (March-June), rainy season (July-October) and winter season (November-February).
The source of solar radiation is obtained from the website of the department of alternative energy development and efficiency, www.dede.go.th. Instead of using a collection of hourly solar radiation data to justify our models, we use monthly averages of daily solar radiation which took place during the period from January 2015 to December 2015, hereby covering changes in annual radiation. In this dataset, the radiation intensity is within the spectral range 14 to 24 MJ/m 2 day −1 in summer. Whilst those in the rainy and winter seasons are around 8-20 MJ/m 2 day −1 and 11-23 MJ/m 2 day −1 respectively. The number of solar radiation monitoring stations is 37 sites that are sparsely situated across Thailand as shown in Figure 3 and details of station positions are displayed in Appendix A.

Accuracy Assessment
Quantitative evaluation of the estimation ability of each approach is performed by using the classical k-fold cross-validation method [33]. The dataset is chronologically divided into 9 folds. On each iteration, about 10% of the data is chosen for the validation, while the remaining data are used for the training. The process is repeated until all folds are picked as a validation set. This implies every individual data is utilized for both training and validating steps.
A combination of these two conventional statistics, the root mean squared error (RMSE) and the mean absolute percentage error (MAPE), are utilised for the evaluation of the quality of forecasting models. They are defined as follows: where M is the number of validation sites andZ(s i ) and Z * (s i ) are the actual and estimated (kriged) solar radiation values at site s i respectively. Despite its popularity for the assessment of model performance in climate and environmental studies, having to compute the RMSE value through the square of the discrepancies between observed and interpolated values can lead to significant weights being placed on outliers. On the contrary, the MAPE is a scale-independent measure resulting in an ease of interpretability. Some problems associated with the MAPE can be arisen due to a division by close-to-zero value. It is accordingly required to apply at least two metrics in order to guarantee consistency in evaluation.
In our numerical experiment, three different semivariogram models are compared; exponential model, spherical model and semiparametric model. Following the recommendation by [20], the number m in the Equation (8) is chosen to be equal to L − 1. For a fair comparison, all these distinct semivariograms are fitted based on the weighted sum squared errors in Equation (9) and GA is applied to identify the optimal set of parameters. The GA configuration is as follows: a population size of 30, a crossover probability of 0.8, a mutation probability of 0.2 and 5000 maximum generations. A section operator is chosen to be a stochastic uniform with a scattered crossover and Gaussian mutation mechanisms. The computational simulations of all approaches are performed in the MATLAB. Furthermore, when an optimal set of parameters of parametric models are found using GA, the nugget value, c 0 , always approaches zero. To simplify the time-intensive computation, we thus set the value of the nugget being zero for all semivariograms.
To evaluate the overall performances of different kriging approaches for each month, the MAPE and RMSE values obtained from nine folds are averaged and their corresponding standard deviation is also computed. Table 1 shows the mean and standard deviation of MAPE derived from three different semivariograms. According to the MAPE index, all approaches provide excellent performances as the mean values of MAPE in almost every month are less than 10% with standard deviation ranging from 1.7067% to 4.9115 %, except in July. Especially, the semiparametric semivariogram consistently outperforms both parametric models with a relative improvement in MAPE ranging from around 2% to 28 %. The maximum improvement can be marked in March with MAPE values of 6.0779% for the semiparametric model and 7.6585% and 8.4843% for exponential and spherical models respectively.  As presented in Table 2, similar patterns can be observed for the RMSE values where the semiparametric model exhibits lower RMSE values than other methods for almost every month. This can also be elucidated by the average values of the means in which both parametric semivariograms present higher average values of every month being above 2 MJ/m 2 day −1 whereas that of the semiparametric model is 1.8700 MJ/m 2 day −1 .  Under different seasonal scenarios, relatively low values of the mean and standard deviation of both RMSE and MAPE for all models are produced throughout the summer months. In particular, the lowest mean values of MAPE ranging from 3.9667% to 4.2075% are acquired in April with corresponding standard deviation varying from 1.7067 % to 2.0254 %. This might due to the fact that considerable solar radiation together with high frequencies of the apparent sky are prominent during the dry season, giving rise to low variation in solar radiation across the country. In contrast with the wet season, especially in July when the maximum values of mean error and standard deviation are obtained, the variability between the wet and dry areas throughout the country becomes distinct due to the southwest monsoon together with Inter-Tropical Convergence Zone (ITCZ). This results in significant uncertainty in measurement as well as high fluctuation of solar radiation. Figure 4 displays a visual comparison of the estimations of solar radiation distribution in March, July and November in Fold 4. The reason for selecting these three months is because each of them represents an individual season in Thailand. Overall, these three semivariograms produce fairly similar distribution patterns of solar radiation with no significant difference among them. A plausible explanation for this might be due to a small local variation of solar radiation, even with a collection of measured data itself. Nonetheless, in Figure 4d-f, little differences can be observed in central Thailand where a relatively higher intensity of solar radiation is predicted by both parametric semivariograms. This similar tendency is also exhibited in November in which lower values of the monthly average of daily solar radiation estimated by exponential and spherical models are found in northern Thailand. Figure 5 illustrates a comparison regarding the fitted curves generated by each method. Broadly speaking, both exponential and semiparametric curves exhibit somewhat similar trends of behavior for these three selected months. Whereas the fitted curves derived from the spherical model tend to have a steep slope. The above-mentioned observation can be clearly seen in November where the estimated values of sill and range of the exponential semivariogram that are produced by GA are around 0.0260 and 10, while those of the spherical semivariogram are 0.0299 and 9.8739 respectively. On the other hand, the optimal α value of the semiparametric semivariogram appears to be 0.3431.
It can nevertheless be inconclusive just by examining merely the diagrams attained from each model. Additional investigation is required to assess the model quality. The weighted sum squared errors between empirical and estimated semivariograms are carried out and presented in Table 3. In March and July, the lowest WSSE values are achieved by the semiparametric model, followed by the exponential model while the highest WSSE is obtained from the spherical model.
To investigate the effects of judicious parameter α on kriging estimates, we have varied the values of α in the range of 0.1-0.9 with a spacing of 0.1 (not shown here). It is found that the semiparametric estimates with α being greater than 0.5 result in performance deterioration. This agrees well with the best-tuned α obtained from GA in which it tends to be smaller than 0.5 thoroughly. Figure 6 depicts the results of the empirical semivariogram and the associated semiparametric fits for α being 0.1, 0.5, 0.8 and optimally tuned α derived from the GA.
In July, the curves obtained from α = 0.1 and the best-tuned α which is equal to 0.2227 almost coincide with the values of WSSE as shown in Table 4 being 20.0320 and 16.9381, respectively. On the contrary, in November, a suitable choice for α is around 0.3-0.5. The values of WSSE obtained from α = 0.3431 is 25.9583 and that of α = 0.5 is 27.4183.

Conclusions and Discussion
As having been stated earlier, the crucial factor of the traditional nonparametric kriging is to correctly specify the nodes otherwise the method can suffer from spurious oscillation due to the nature of the basis function. The primary objective of this study is to adopt the semiparametric semivariogram which is a variant of the nonparametric model. Regardless of the fact that non-monotonous semivariograms (hole effect) are applicable in some certain cases, the supplementary monotonicity constraint is imposed so that the node selection criterion is established with the aim to meet the theoretical characteristics of semivariograms as well as to prevent the spurious oscillation. This results in the appropriate node space being formulated in terms of the first root of the Bessel function of the first kind of order one, as shown in Equation (14).
The comparison of exponential, spherical and semiparametric models is demonstrated in the case of solar radiation estimation in Thailand. The GA technique is also coupled to all semivariograms for the purpose of identifying optimal parameters. The estimates derived from the semiparametric model are more accurate than those from the other two parametric semivariograms, with a relative improvement of around 7-13% according to the averages values of MAPE and RMSE. Regarding seasonal distribution, high errors are achieved during the monsoon months. This is associated with big changes in cloudiness in some particular areas which causes large fluctuations in solar radiation and ultimately becomes difficult to estimate. Although all three semivariograms produce almost identical maps of solar radiation for the selected months but, from the accuracy measures point of view, it is shown that the best performance is achieved by the semiparametric model. This might due to the fact that irregular patterns of the data arise under different seasonal climatic regimes, so this viable method allows the flexibility to modeling the semivariogram, rather than the predetermined function. Furthermore, following this study, the emphasis is also put on the effect of the tuning parameter α to the semiparametric model. Based on our dataset, the range of fixed values of α that results in sufficiently good estimates is restrained to relatively low values (≤0.5) which agree well with the optimally-tuned α obtained from the GA.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: