Estimating the Precipitation Amount at Regional Scale Using a New Tool, Climate Analyzer

: Different methods are known for interpolating spatial data. Introduced a few years ago, the initial version of the Most Probable Precipitation Method (MPPM) proved to be a valuable competitor against the Thiessen Polygons Method, Inverse Distance Weighting and kriging for estimating the regional trend of precipitation series. Climate Analyzer, introduced here, is a user-friendly toolkit written in Matlab, which implements the initial and modiﬁed version of MPPM and new selection criteria of the series that participate in estimating the regional precipitation series. The software provides the graphical output of the estimated regional series, the modeling errors and the comparisons of the results for different segmentations of the time interval used in modeling. This article contains the description of Climate Analyzer, accompanied by a case study to exemplify its capabilities.


Introduction
An extensive Implementation Plan for the Grand Challenge on Understanding and Predicting Weather and Climate Extremes should focus on integrated observations, improved models, new process understanding of the physical drivers of extremes and fast-track attribution [1]. In this context, the concept of "regional area" allows the precise examination of the phenomena's impacts [2], incorporating all regional scale aspects, as the observations quality and model simulations impacts representation, processes study, region climate variability and change based on available data. Efficient management of the water resources depends on the knowledge about the climatic vectors, among which precipitation is one of the most important [3]. In arid or semi-arid regions, where high precipitation episodes follow long drought periods, the determination of the precipitation pattern at the regional scale is essential in the water resources allocation for the households, agricultural and industrial use [4], land-use planning, water control design, precipitation forecasting and downscaling [5].
Interpolation of hydrological series provides essential information for making informed decisions for managing water resources, especially in the climate change context when water scarcity became an acute issue worldwide [6]. Classical and artificial intelligence methods, such as the Thiessen Polygons (TPM), Inverse Distance Weighting (IDW), Kriging (ordinary-OK and universal-UK), have been used to interpolate the hydrological data at ungauged stations [6][7][8][9]. Ly et al. [9] reviewed TPM, IDW, polynomial and spline interpolation, Moving Window Regression (MWR), OK and UK. Wu et al. [10] compared the performances of IDW, OK, Local Polynomial Interpolation, Radial Basis Function and some versions of IDW and UK for interpolating data from the Mississippi River Basin.

•
Introducing new selection criteria of the precipitation series values that participate in fitting the regional series, with the aim at improving the algorithm's performances. Climate Analyzer is designed to work with precipitation quantity. It is easy to use, does not suppose advanced statistical and computational knowledge, and will be freely available on request from the authors.
For the interested readers, the study of the precipitation occurrence has been performed on daily precipitation [25,28,29] for the same region.

Methods and Implementation
Climate Analyzer, proposed here, is a user-friendly toolkit written in MatlabR2019b (MathWorks, Inc., Natick, MA, USA) whose main functionality is to implement the MPPM method and its version introduced in [22,23] and described in Sections 2.1-2.3. The interface and the implementation details are presented in Section 2.4.

Method I
Suppose that the data series registered at k sites in n consecutive periods are given and let us denote by (y ji ) (j = 1, . . . , n) the series registered at the station i (i = 1, . . . , k).
Method I has the following stages [22].
(I1) Establish the working intervals: compute the minimum (y j min ) and maximum values (y j max ), recorded at the jth moment (j = 1, . . . , n) and their amplitudes (A j ). (I2) Divide each interval [y j min , y j max ] into m j subintervals with the length L j = A j /m j , (j = 1, . . . , n). The number of intervals is selected by the user, based on the user's experience or different objective criteria. Each sub-interval should contain a sufficient number of values. (I3) Denote by Y jl , the subinterval l at the moment j and define its frequency, f jl , to be the number of values in Y jl , l = 1, . . . , m j , j = 1, . . . , n. Choose the interval whose frequency is maximum, denote it by I j max and its frequency by f j max .
If there is more than one subinterval whose frequency is equal to f j max , then I j max will be chosen to be that one whose average is closest to the average of the entire interval [y j min , y j max ] (j = 1, . . . , n).
(I4) Choose the average of the values in I j max to be the representative value for the period j (j = 1, . . . , n). (I5) Compute the Mean Absolute Error (MAE) and Mean Standard Error (MSE) corresponding to all the observation sites. (I6) Represent graphically the results.
In the initial version of the algorithm, at stage (I3), the existence of at least two intervals with the same maximum frequencies and the same averages was not considered. Therefore, the initial algorithm might choose any of these intervals to be I j max . Thus, many combinations may appear. To avoid this issue, at stage (I5), the average of all the values in the intervals with the equal maximum frequency will be the representative value for the period j. The implementation of the algorithm takes into account this situation.

Method II
Keeping the same notations as in Method I, Method II has the following stages [23].
(II1) Choose the number of clusters, k and perform the k-means clustering to group the data series into clusters. This step returns an n-by-1 vector (idx) containing the cluster index for each observation site, the locations of the clusters' centroids, within-cluster sums of point-to-centroid distances and distances from the points to the centroids. (II2) Determine the cluster containing the highest number of elements and build a matrix using the data series recorded at the sites from that cluster. In this version of the algorithm, the number of clusters was chosen using the silhouette method, and the cluster selection has been made based on maximizing the ratio BSS/TSS×100, where BSS is the between sum of squares and TSS is the total sum of squares computed in the k-means algorithm [30][31][32]. If two or more clusters have the same number of elements (the highest one), the representative value computed at (II3) is the average of all the elements belonging to the cluster with the highest ratio.

Comparison of the Results
The model quality evaluation uses MSE, MAE and MAPE (mean absolute percentage error) computed for each series, and the average MSE, MAE and MAPE. The smaller the values of the indicators are, the better the estimation is. At least two indicators should be utilized for comparing the results (for cross-validation). Graphical representations of the fitted data and errors are also recommended.

Implementation
Climate Analyser was initially designed to implement the Most Probable Precipitation Method (Method I). In its actual form, it implements the methods described in Sections 2.1 and 2.2. Based on the input data, it computes the maxima, minima and amplitude series, and to provide their graphical representation. Temperature and pollution trends over a region can also be modeled by utilizing this software.
After performing the spatial interpolation, the fitted series are displayed, together with the MAE and MSE series. If two methods have been performed, comparisons of the The work is in progress to implement the Thiessen Polygons Method and IDW interpolation and to provide comparisons between these methods. Another module will provide the frequency analysis for daily precipitation and temperature series, and the computation of the WMO indicators [28,29].
Data series are introduced in a Table (each series in a column) in a .csv file. The user can upload the file by clicking on the Select button and must provide the number of sub-intervals or clusters for running the algorithm. The methods that will be run may also be selected-the first one, the second one or both. After making this selection, one should click on Start button to perform the analysis. Figure 1 presents the interface of the software. The results are displayed in the six windows ( Figure 1). The flowchart of the algorithm is shown in Figure 2.
The algorithm has the following steps: 1. Compute_Amplitude step involves: the calculation of the extreme values (y j min , respectively, y j max ) for each j; -the amplitude computation for each j; 2. Amplitude Representation step: amplitude 2D Graphical Representation.

3.
Choosing the number of subintervals or clusters.
Performing the modeling using Method I and Method II. Figure 3 shows the methods flowcharts.

4.
Collecting the results in the Data Processing from Matlab destination tables. 5.
2D graphical representations of "Trend series that fit the regional one", MAE and MSE using both methods (presented in the next section) ( Figure 4). 6.
Export to file: the modeling output is exported to the xlsx files. 7.
Performing a comparative study, using both methods for different numbers of subintervals, respectively, different numbers of clusters and the graphical representation of the results for "Trend series that fit the regional one", MAE and MSE ( Figure 5). The work is in progress to implement the Thiessen Polygons Method and IDW interpolation and to provide comparisons between these methods. Another module will provide the frequency analysis for daily precipitation and temperature series, and the computation of the WMO indicators [28,29].
Data series are introduced in a Table (each series in a column) in a .csv file. The user can upload the file by clicking on the Select button and must provide the number of sub-intervals or clusters for running the algorithm. The methods that will be run may also be selected-the first one, the second one or both. After making this selection, one should click on Start button to perform the analysis. Figure 1 presents the interface of the software. The results are displayed in the six windows ( Figure 1). The flowchart of the algorithm is shown in Figure 2. The algorithm has the following steps: 1. Compute_Amplitude step involves: -the calculation of the extreme values (yj min, respectively, yj max) for each j; -the amplitude computation for each j; 2. Amplitude Representation step: amplitude 2D Graphical Representation. 3. Choosing the number of subintervals or clusters.
Performing the modeling using Method I and Method II. Figure 3 shows the meth- 7. Performing a comparative study, using both methods for different numbers of subintervals, respectively, different numbers of clusters and the graphical representation of the results for "Trend series that fit the regional one", MAE and MSE ( Figure  5).

Data Series
Data used for modeling consist of the annual precipitation series (recorded during 41 years at 10 meteorological stations situated in the southeastern part of Romania, in the Dobrogea region, between the Danube and the Black Sea ( Figure 6).

Data Series
Data used for modeling consist of the annual precipitation series (recorded during 41 years at 10 meteorological stations situated in the southeastern part of Romania, in the Dobrogea region, between the Danube and the Black Sea ( Figure 6).

Data Series
Data used for modeling consist of the annual precipitation series (recorded during 41 years at 10 meteorological stations situated in the southeastern part of Romania, in the Dobrogea region, between the Danube and the Black Sea ( Figure 6).  Most of Dobrogea has an arid climate, with a mean annual temperature of about 10-11 • C and a summer mean annual temperature of 22-23 • C. The number of days with temperatures higher than 25 • C varies between 72 in Constanta and 95 in Tulcea. The average relative humidity varies between 78% and 85%, the lowest humidity being registered in the central and southern parts of the region. The mean annual precipitation is 350-450 mm. In the zone with the highest altitude, the mean annual precipitation values increase to about 450 mm. The 400 mm isohyet, parallel to the Black Sea Shore, delimitates the coastal region from the continental Dobrogea. Years with mean annual precipitation under 250 mm have been recorded as well [33]. The study series are represented in Figure 7. Among them, Sulina is situated 12 km offshore in the Danube Delta. temperatures higher than 25 °C varies between 72 in Constanta and 95 in Tulcea. The average relative humidity varies between 78% and 85%, the lowest humidity being registered in the central and southern parts of the region. The mean annual precipitation is 350-450 mm. In the zone with the highest altitude, the mean annual precipitation values increase to about 450 mm. The 400 mm isohyet, parallel to the Black Sea Shore, delimitates the coastal region from the continental Dobrogea. Years with mean annual precipitation under 250 mm have been recorded as well [33]. The study series are represented in Figure 7. Among them, Sulina is situated 12 km offshore in the Danube Delta. For this dataset, the ANOVA was performed in previous studies [22,25] to test if there is a difference between the series, followed by the Scheffe posthoc test. This test found that the Sulina series forms a separate group.
In [25,28,29], climate change in the area has been studied utilizing the daily data and the indices recommended by WMO. To give an idea about the precipitation occurrences, Table 1 contains the statistical distributions that best fit the annual precipitation series. Readers may find information about the monthly and seasonal series trend in [29] and Dobrogea and its climate in [33][34][35]. Therefore, to exemplify how the software works, the annual data was chosen. All the mentioned results and related works, together with For this dataset, the ANOVA was performed in previous studies [22,25] to test if there is a difference between the series, followed by the Scheffe posthoc test. This test found that the Sulina series forms a separate group.
In [25,28,29], climate change in the area has been studied utilizing the daily data and the indices recommended by WMO. To give an idea about the precipitation occurrences, Table 1 contains the statistical distributions that best fit the annual precipitation series. Readers may find information about the monthly and seasonal series trend in [29] and Dobrogea and its climate in [33][34][35]. Therefore, to exemplify how the software works, the annual data was chosen. All the mentioned results and related works, together with the present article, give a global picture of the precipitation dynamics in the Dobrogea region.

Results and Discussion
The optimal number of clusters has been determined to be three by the silhouette method and two by the elbow method. For comparison reasons, we present the results obtained when Method I was run with the number of intervals equal to the number of clusters-two and three. When using three clusters, the cluster with the highest number of elements contains all but Mangalia, Medgidia and Sulina (nos. 5, 6 and 9 in Table 1) series (Figure 8a). When using two clusters, the cluster with the highest number of elements contains all but Mangalia and Medgidia series (nos. 5 and 6, in Table 1) (Figure 8b). The clustering quality can be observed in Figure 8.
The optimal number of clusters has been determined to be three by the silhouette method and two by the elbow method. For comparison reasons, we present the results obtained when Method I was run with the number of intervals equal to the number of clusters-two and three. When using three clusters, the cluster with the highest number of elements contains all but Mangalia, Medgidia and Sulina (nos. 5, 6 and 9 in Table 1) series (Figure 8a). When using two clusters, the cluster with the highest number of elements contains all but Mangalia and Medgidia series (nos. 5 and 6, in Table 1) (Figure 8b). The clustering quality can be observed in Figure 8.
Based on the presented algorithm, the series Mangalia, Medgidia and Sulina or only the first two do not participate in the computation of the regional trend (when using three or two clusters). Therefore, the regional series is determined using the group containing homogenous data series (the cluster with the highest number of elements). Figures 9 and 10 presents the graphical results of fitting the regional trend of annual precipitation ("Trend series that fit the regional one"), the resulted mean absolute errors (MAE), the mean, standard errors (MSE) and Amplitude, when working with two (three) sub-intervals/clusters, using both methods. The numbers from 1 to 10 represent the series as they are ordered in Table 1  Based on the presented algorithm, the series Mangalia, Medgidia and Sulina or only the first two do not participate in the computation of the regional trend (when using three or two clusters). Therefore, the regional series is determined using the group containing homogenous data series (the cluster with the highest number of elements). Figures 9 and 10 presents the graphical results of fitting the regional trend of annual precipitation ("Trend series that fit the regional one"), the resulted mean absolute errors (MAE), the mean, standard errors (MSE) and Amplitude, when working with two (three) sub-intervals/clusters, using both methods.
The numbers from 1 to 10 represent the series as they are ordered in Table 1.
In Figures 9a and 10a, Meth. I and Meth. II refer to the results obtained using the first or the second method, and Avg. is the average series computed for each year by averaging the annual recorded values of all the 10 series. In Figures 9f and 10f, Min and Max are the minimum and maximum precipitation series, respectively, and Amplitude is the difference between Max and Min series.
Notice the similar shapes of MAEs and MSEs in Figure 9c,e and Figure 10c,e, showing a similar pattern of the residual series. In most cases, MAEs and MSEs from Method I are higher than those from Method II (Figure 9b,d and Figure 10b,d). Tables 2 and 3 displays the MAE, MSE and MAPE computed using Method I (Method  II). MAPE was utilized as extra goodness of fit indicator because it is non-dimensional, so it is recommended to compare different types of models [8]. In Figures 9a and 10a, Meth. I and Meth. II refer to the results obtained using the first or the second method, and Avg. is the average series computed for each year by averaging the annual recorded values of all the 10 series. In Figures 9f and 10f, Min and Max are the minimum and maximum precipitation series, respectively, and Amplitude is the difference between Max and Min series.
Notice the similar shapes of MAEs and MSEs in Figures 9c,e and 10c,e, showing a similar pattern of the residual series. In most cases, MAEs and MSEs from Method I are higher than those from Method II (Figures 9b,d and 10b,d). Tables 2 and 3 displays the MAE, MSE and MAPE computed using Method I  (Method II). MAPE was utilized as extra goodness of fit indicator because it is non-dimensional, so it is recommended to compare different types of models [8].     MAPEs resulted from Method II are, in most cases, smaller than those from Method I. The average goodness of fit indicators (MAE, MSE and MAPE) are the smallest when using two clusters.
In Method I, MSE and MAPE are the smallest when working with two intervals, whereas MAE is the smallest in the case of three intervals. In Method II, all indicators are the smallest in the case of two clusters. Figure 11 shows the comparisons of MAEs and MSEs for two and three sub-intervals/ clusters. It confirms the best quality of the models when utilizing Method II with two clusters.
Comparison of the best model (Method II, two clusters) with Inverse Distance Weighting (IDW) and kriging are provided in Table 4. The best variogram model for the kriging was the exponential one, with the optimized parameters nugget = 1, sill = 2 and range = 24 [24]. The

Conclusions
In this article, two algorithms for modeling the regional trend of precipitation have been presented, together with their implementation in Climate Analyser. To exemplify the functioning of this toolbox, a complete set of annual precipitation data, recorded for 41 years at 10 meteorological stations has been utilized, and the modeling results have been discussed. The main contributions of this work are twofold.

1.
The optimizing of finding the regional series, in the following two directions: a. The selection (in the k-means algorithm) of the clusters whose elements participate in the construction of the regional series, based on maximizing the ratio BSS/TSS×100. In the previous algorithm version [22,23], when two clusters have the same number of elements (maximum), each could be selected to build the regional series. In this version, choosing the maximum BSS/TSS×100 ensures a higher distance between the clusters and a smaller one inside them. Thus, the homogeneity degree of the series in the clusters is higher, so the estimation of the regional series is better than in the previous version of Method II. b.
The choice of the values that participate in building the regional series when at least two subintervals have the same maximum frequency and absolute value of the difference between the subinterval average and the series average.
Let consider that I 1 and I 2 are such intervals, with the same maximum frequency, f max , m is the average precipitation in a specific year, m 1 and m 2 are the averages of I 1 and I 2 and |m 1 − m| = |m 2 − m| . Given that the values in the intervals are in ascending order, this means that m 1 − m = − (m 2 − m), so m = (m 1 + m 2 )/2. Therefore, selecting I 1 would underestimate the regional series (since only the smallest values would participate in the evaluation of the regional series), while selecting I 2 would overestimate it; so the best choice is using the values in both intervals for building the regional series.

2.
The second one is implementing the algorithm in user-friendly software, Climate Analyzer, freely available on request. It facilitates the computation of the regional trend. It also provides the graphical visualization of the output of the selected algorithm, offering the facility to compare the results for different segmentations of the series.
In the future, we intend to extend the software in the following directions: a. Implementing algorithms for determining the optimal number of clusters, given that the fitting quality depends on the number of groups involved in the regional series computation. b.
implementing the IDW and Thiessen Polygons Methods. c.
Extending the algorithm for analyzing the occurrence of precipitation events.