An Automated Approach to Groundwater Quality Monitoring—Geospatial Mapping Based on Combined Application of Gaussian Process Regression and Bayesian Information Criterion

: Sustainable management of the environment is based on the preservation of natural resources, ﬁrst of all, freshwater—both surface and groundwater—from exhaustion and contamination. Thus, development of adequate monitoring solutions, including fast and adaptive modelling approaches, are of high importance. Recent progress in machine learning techniques provide an opportunity to improve the prediction accuracy of the spatial distribution of properties of natural objects and to automate all stages of this process to exclude uncertainties caused by handcrafting. We propose a technique to construct the weighted Water Quality Index (WQI) and the spatial prediction map of the WQI in tested area. In particular, WQI is calculated using dimensionality reduction technique (Principal Component Analysis), and spatial map of WQI is constructed using Gaussian Process Regression with automatic kernel structure selection using Bayesian Information Criterion (BIC). We validate our approach on a new dataset for groundwater quality in the New Moscow region, where groundwater is mostly used for drinking purposes. According to estimated WQI values, groundwater quality across the study region is relatively high, with few points, less than 0.5% of all observations, severely contaminated. Estimated WQIs then were used to construct spatial distribution models, GPR-BIC approach was compared with ordinary Kriging (OK), Universal Kriging (UK) with exponential, Gaussian, polynomial and periodic kernels. Quality of models was assessed using cross-validation scheme, according to which BIC-GPR approach showed better performance on average with 15% higher R 2 score comparing to other Kriging models. We show that the proposed geospatial interpolation is a potentially powerful and adaptable tool for predicting the spatial distribution of properties of natural resources.


Introduction
It is hard to overestimate the importance of role of freshwater resources for our planet. Global climate change and the rise of human population lead to extensive landuse changes, expanding urbanisation and industrial activity, have already heightened the risk of pollution of freshwater sources [1][2][3]. Assessment and monitoring of fresh water resources are necessary for decision-makers at all levels regarding economic, health, scientific, and ecological spheres of life [4][5][6][7].
Two principal approaches to the assessment of natural water bodies' status can be distinguished. The first one operates the concept of water reservoirs' "vulnerability". It independent interpolation decision tree [32]. More comprehensive solutions were proposed, for example, Approximate Bayesian Computation (ABC), which allows to automatically deal with kernel selection and estimation of hyper-parameters and to avoid the dependence on dimensionality when operating large numbers of kernels with different dimensions [33], with simultaneous comparing of competing models.
Therefore, an automated kernel selection is highlighted as a promising research direction to avoid manual operations, raise robustness to obtain optimal results, speed up calculations, especially when most of the existing solutions are still balancing between the accuracy of interpolation and computational speed.
In this paper, we discuss the approach to water quality spatial prediction based on the modified GPR and exploratory assessment via overall variance, on the example of groundwater monitoring net. We believe that successful implementation of this approach is especially important to subsurface water resources. Groundwater is directly connected with both surface water resources and terrestrial ecosystems at the same, but hidden from rapid observation [34]. Moreover, due to restricted access to direct measurements, groundwater monitoring systems are often represented by relatively small observation nets, where kriging techniques are also the most powerful. The core idea is to introduce the Water Quality Index (WQI) that will first include the most influential environmental parameters representing environmental pollution and second, take into account the variability characteristics of these parameters that can serve as the measure for risk assessment. To observe the patterns of spatial distribution, we developed an extension to the standard Gaussian process regression (GPR) model based on automated composite kernel search implemented on a greedy algorithm with the usage of Bayesian Information Criterion; hyper-parameters selection for each elementary kernel in the optimal composite kernel was done by standard approaches. By doing this, we exclude handcrafting in the selection of potentially effective kernels. BIC criteria allows to find a simple kernel structure among possible with the fewest number of parameters but with highest accuracy at the same time according to the validation sample, compared to other variants. In contrast, without BIC criteria there will be much more parameters which will lead to overfitting and much more complex kernels.
The paper is organised as follows: Section 2 is devoted to the description of the utilised experimental and computational methods which we exploit for the organisation and processing of the dataset. In particular, we describe the details about the target region and the available dataset in Section 2.1. The details of data processing and machine learning techniques (PCA, Gaussian processes, and Bayesian methods) are discussed in Sections 2.3 and 2.4. The main results are presented in Section 3 and discussed in Section 4. To be more specific, we present an automated approach for optimisation of parameters for the kriging procedure which allows to improve significantly the quality of reconstructed freshwater maps in comparison with popular standard tools widely used by community (the details can be found therein). The final discussion sums up the advantages and drawbacks of the proposed approach for freshwater quality map reconstruction and discovers the possible directions of future research.

Site Description and Available Dataset
The data used in the current study was collected in the territory of The New Moscow district, located adjacent to the city of Moscow in the Central European part of Russia. The area of New Moscow district extends over 1480 km 2 , latitude and longitude ranges are approximately from 55 • 09 to 55 • 40 N and 36 • 48 to 37 • 36 E, respectively.
The climate of the region is moderately continental. According to the data from local weather stations (available at https://www.ncdc.noaa.gov/cdo-web/datatools/ findstation) through the period of last five years, from 2015 to 2020, across the territory of study, the mean daily temperature in the coldest month of the year, January, was −5.4 • C, while in the warmest month, July, the mean daily temperature was +17.8 • C. The average annual precipitation was approximately 440 mm, with circa two thirds rainfall and the rest snow. The territory is mostly located on the verge of the southern taiga change to the zone of temperate broad-leaf forests.
New Moscow is located in the central part of East European Plain. It is composed by the thick sedimentary rocks covering Precambrian cristallic basement. Sedimentary rocks include dolomite, lime, marl, sandstone. Bedrock rarely outcrops and mostly covered by Quaternary deposits of glacial and fluvioglacial genesis with occurrence of alluvial deposits [35]. Thus, across the region there are two main aquifers: the first is in Quaternary sediments, with thickness in a range from 2 to 12 m. This is the main source of the drinking water for the cities and households. The artesian basin is consists of the aquifers of coal age composed by limestone and dolomite. In general, across the New Moscow region aquifers are composed by different-grained, well-permeable sands and sandy loams, periodically covered by the clay of different ages [36]. Main land-use types are as follows: forests (50% from total area), arable land (21%) and low density discontinuous urban fabric (19%) [37]. The predominant types of natural vegetation are coniferous and broad-leaved forests, while agricultural lands include pastures and arable land, mostly growing feed crops and cereals. The main specific of New Moscow is that it has been rapidly urbanising during the last decade.
Sample net, using in this study, covers almost all territory of New Moscow. A total of 1600 water samples were collected during 2017-2018, mostly from May to September, from wells (1215 samples), rivers (225 samples) and springs (160 samples) in the region (see Figure 1). Water samples were collected from the wells, rivers, or springs by using a 2-L stainless-steel container. The samples were bottled and then immediately transported to the laboratory for chemical analysis, eliminating the need for conservation methods. For each water sample, 25 parameters were measured. The pH was measured by using a HANNA pH-meter 213. Anions (NO 3 , NO 2 , PO 4 , SO 4 and Cl) were measured by ion chromatography using a Dionex 1100 instrument. NH4 content was obtained on an HACH DR2800 using colorimetric determination with Nessler's reagent. Cation (K, Cr, Ni, Ca, Zn, Fe, Mn, Na, Cu, Mg) contents were obtained by inductively coupled plasma atomic emission spectroscopy on an ICP-OES Agilent 5110 spectroscope. Mineralization was measured by gravimetric analysis consisting of evaporation at 105 • C in a drying chamber. Alkalinity was obtained by titration with 0.05 NHCl. Hardness was measured by titration with Trilon B and eriochrome black. Despites being minor but not the least the important advantage of this work is relatively large size of the dataset which contains more than 1600 samples (each with 25 measured chemical parameters). We hope that it might be useful for validation and other methodological research in community and share it as open data [38].

Data Preparation and Methodology
In this study, an end-to-end solution for geospatial water quality assessment using ML methods such as Gaussian Process Regression and Bayesian Information Criterion was proposed and evaluated. Figure 2 presents a brief summary of the steps involved in this procedure. Our approach use machine learning methods for weighted WQI calculation (Steps 1 and 2) and geospatial WQI prediction by using Gaussian process regression with automatic kernel search (Steps 3-5).

Water Quality Index Calculation Based on PCA and Weighted Factors
In order to construct WQI we performed PCA and used the following outputs: principal components (PCs), eigenvalues, loadings, described variance. The brief description of the theory behind PCA is given in the Section 2.3.1. Description of the approach to WQI calculation is given in the Section 2.3.2.

PCA Theory
The full theory behind the PCA is quite comprehensive, so we address the reader to the relevant literature sources for the full explanation [39][40][41]. In order to reveal the main ideas of using of PCA for current study, we discuss the main outputs extracting in the analysis. PCA is the linear unsupervised method of dimensionality reduction of multivariate data, proposed by Hotelling [42], where original dimensions are initial (measured) variables. PCA projects the data into new coordinate system and representing it in lower dimensions set in a way retaining the maximum of the variation of the original data, so keeping the most of the information.
The new set of dimensions -principal components, PCs -is mutually orthogonal, so PCs are uncorrelated to each other. PCs are the linear combinations of the original variables and ranked by the variance of the data along. To obtain PCs, at first, the covariance matrix of the original data is estimated. This covariance matrix is symmetric and consists of set of orthogonal components-eigenvectors, or "principal directions", expressed by their own eigenvalues reflected to the measure of the variance related to the PC. Thus, the first PC reflects to the maximum of the data variance and so on in the descending order. The contributions of the each initial variables into the PC are loadings. To strengthen the interpretability of the PCs an additional manipulation can be performed, such as Varimax rotation [43,44]. In this purpose the variance of squared loadings is seeking to be maximized, thus each PC has only few, non-overlapping variables with the highest loadings.

Construction of WQI Based on PCA
As was discussed previously, it might be found very useful to express all of the multivariate water quality complexity in one conscientious parameter, which is WQI, keeping as much information as possible.
A PCA model was used to assess the pollutant loads integral to water quality and to avoid data redundancy. Raw data was filtered to eliminate anomalies: missing coordinates, incorrect record type etc. After this initial pre-processing step, the total number of useful samples decreased from 1600 to 1569. We decided to remove Hg, Cd, Co and Pb from the dataset for further analysis, as their concentrations were insignificant (much lower than toxic levels) and did not exceed the required water quality standards in Russia. Twenty-one water quality parameters were included in the PCA model. Only those components for which the corresponding eigenvalue was higher than or equal to 1 following Varimax rotation, and PCs that explained at least 5% of the observed data variation were considered for further examination. Moreover, those parameters that were correlated with other significant parameters (correlation between two particular parameters is more than 0.6) were eliminated only if they had the smallest loadings among correlated parameters. The weight scores (w i ) derived from PCA were used as weighted factors for the significant variables (indicators) from the respective PCs, and the WQI was calculated by using the Equation (1): where S is the number of significant principal components, L i denotes the loading values of each selected water property included in the particular principal component, and w i denotes the weight of the corresponding component, which is defined as the part of the described variance by each component. To scale WQI to the [0,1] range, we normalized the weight scores (w i ) to a summarized score value by using Equation (2): To perform geospatial modeling of multiple water properties from the collected dataset, we referred to the Gaussian Process Regression (GPR) framework [45], more commonly known as kriging in geostatistics. A Gaussian process is completely determined by its mean µ(·) and covariance (kernel) k(·, ·) functions: where x ∈ R 2 is a vector of d input parameters. In our case, d = 2 and x represents a vector of spatial coordinates. Let us consider a simple GPR model with additive Gaussian noise: where ∼ N (0, σ 2 ). Given the training data X = (x 1 , . . . , x N ) ∈ R n×d , y = (y 1 , . . . , y n ) ∈ R n , where n is the number of samples and (·) denotes a transpose, the predictive distribution at the unobserved point x * is given bŷ where I is an identity matrix, K = k(X, X) = k(x i , x j ), i, j = 1, . . . , N is a spatial covariance matrix between all of the training points, k * = k(X, x * ) is a spatial covariance between training points and the single prediction point and µ(X) = µ(x i ), i = 1, . . . , n is the mean function calculated at the training points. The particular choice of the kernel function depends on the assumptions about the model and a particular application, e.g., widely used Gaussian kernel (corresponding to Gaussian variogram). Kernel hyper-parameters are usually optimized using Maximum Likelihood Estimation (MLE) [46] or its variations. Figure 3 shows an example of GPR using Gaussian kernel with a constant mean function over the observations sampled from the sigmoid function with random noise. The predictive variance increases at points with missing observations, and increases significantly outside of the interpolation region with the mean failing to capture the true function trend. This emphasizes the need for a better method to select kernel hyper-parameters.

Hyper-Parameters Selection Using Bayesian Information Criterion
Common approaches to hyper-parameter optimisation are Maximum Likelihood Estimation (known model, continuous parameters), and Cross-Validation (model is unknown, discrete parameters). Typically, one could select multiple combinations of different kernels, perform MLE for each of them and then compare the models using cross-validation to choose the best overall model. In our work we follow the approach from [47] using Bayesian Information Criterion (BIC), which considers kernel function as a combination of a small number of base covariance functions using sum and product operations, and can be represented as: where n is the number of samples, m is the total number of optimised parameters, and Σ is defined as in the Equation (3). To construct the optimal kernel, we consider a basic set of operations, such as plus and multiplication, and apply them to the following kernel functions: polynomial (Equation (5)), Gaussian (Equation (6)), periodic (Equation (7)) and exponential (Equation (8)). Thus, the final automatically constructed kernel, for example, can be the multiplication of the polynomial kernel on Gaussian, plus periodic, etc. Optimal kernel structures can include the multiplication of the same types of elementary kernels. The best kernel is a combination (structure) of the elementary kernels with optimized parameters that gives the minimal BIC value. This way, we are able to model a variety of stationary kernels and control the accuracy by selecting basic kernels and boundary values for their hyper-parameters. The main goal for introducing such boundary values is to avoid over-fitting and ensure the robustness of the performance of the obtained optimal kernel (composition of the basic kernels). Moreover, we aim to reduce the model complexity by decreasing the number of tuned hyper-parameters in the optimal kernel.
where d = 2 for our task (coordinates), polynomial was taken of degree 2; θ 1 , θ 4 , θ 5 , θ 6 are the variances; θ 2 , , l and s-length scales; θ 3 -bias; T-period. In our experiments all of the kernels were considered isotropic and we applied the following constraints during hyper-parameter optimization: , where n is the number of training data points. Hence, instead of brute-force search of the best kernels, a greedy search was implemented in the current study. Greedy search in general means that the extension with the lowest BIC is selected for each extension of the current kernel. The main advantage of this approach is that it does not require any handcrafting of potentially effective kernels but instead enables an automatic search for the best kernel structure and hyper-parameter optimisation.

Universal and Ordinary Kriging
To compare our method with baseline geospatial modelling techniques we performed Ordinary Kriging (OK) and Universal Kriging (UK) using the GPy library. Since it only allows to perform Gaussian Process Regression and not kriging as is, we draw a connection between these methods as follows: (a) basic kernel functions k poly , k gaussian , k periodic , k exp correspond to respectively named variograms, (b) GPR with a constant mean function µ(x) = µ corresponds to OK, and (c) GPR with linear mean function µ(x) = β 0 + β 1 x 1 + β 2 x 2 corresponds to UK with linear trend. Hyper-parameters of the kernel and mean functions are optimized using MLE approach during the training phase.

Approach to Geospatial Modelling
First, we converted the spatial coordinates from EPSG:4326 (latitude, longitude) format to EPSG:32637 (UTM zone 37 N) format and scaled them down to the [0,10] range. Some measurements of the water quality measurements were taken spatially far from the main investigated area (Moscow region). Thus, to filter outliers, we performed clustering of the water sampling locations using the density-based DBSCAN method [48] with neighbourhood size = 1.0. The parameter allows to tune the size of a cluster and serves as the measure of the permissible distance to the point to be included into the cluster. After clustering and removing of the outliers, we again re-scaled the coordinates to the [0,10] range. Finally, it was decided to use the data only from the major class (wells, 1215 data points). Totally, 391 data points were removed from the dataset (37 data points out of them were removed by DBSCAN) and 1178 data points left for further investigation. WQI was calculated (methodology is presented in Section 2.2) for each data sample and a rectangular 100 × 100 grid was used for geospatial modelling and mapping. The boundaries of the selected grid were defined by the minimum and maximum coordinates of the kept water sampling locations.

Validation Procedure
To validate our model, we applied a standard cross-validation scheme with 5 random splits into training and testing data sets of relative size 90% and 10%, respectively. For each training/testing split, we (a) fit the model to the training data, then, (b) predict the values of WQI for the test data point locations and (c) subsequently calculate the root mean square error (RMSE) and the coefficient of determination R 2 : whereŷ i , y i are the predicted and observed values, respectively, and y is the average across the observed values. The RMSE is a good comparative statistic for assessing model output, as it provides a global indication of how similar the interpolated values are to the observed or measured data point values [49]. When analysing the RMSE statistics, a small RMSE value indicates that the interpolated values for the output model are more similar to the observed data point values, whereas a large RMSE value suggests that the interpolated model values are less similar to the observed data points. Thus, RMSE was used here to determine how well the model fits the observed data values, with low RMSE values indicting a high degree of model accuracy [50,51].

Software
All the calculations were carried out in Python programming language using the following libraries: scikit-learn, [52], GPy, [53] and Folium.

PCA-Based Weighted Water Quality Index
We applied PCA to reveal the significant contaminants among samples and calculate weighted-loads of tested parameters in WQI. In total, we observed five PCs with loads above 1 and a cumulative variance of about 61% (see Table 1). Then, we eliminated the parameters of each PC that were correlated significantly with others and had the lowest loading's among them (see Figure 4). Finally, our WQI includes only non-correlated parameters with loadings greater than 0.3 to the contributed PCs. Varimax rotation was used for PCA calculation and helped us to reveal the PCs with the exact chemical properties of water, which were clearly interrelated and signalled specific types of pollution. As an example, the chemical indicators usually linked with organic pollution were coupled to PC3, whereas parameters of water mineralization were coupled to PC1 (please see Table 1). In fact, each PC contributed to a series of chemical parameters in the tested dataset. For example, the PC1 was linked to the chloride content, overall mineralization and sodium content of water (with loading's greater than 0.3). However, all three of these parameters were correlated: r(Na & Cl) = 0.856; r(Cl & Mineralization) = 0.819; and r(Na & Mineralization) = 0.800. Thus, the final shortlisted parameters from these PCs were a subset of the co-correlated parameters to prevent overlooked results and include only Cl. A similar case with co-correlated parameters was observed in parameters attributed to PC2. The PC2 revealed three main characteristics of water pollution: hydrocarbonates (HCO 3 ), alkalinity and pH. At the same time, only HCO 3 & Alkalinity were characterized by r as 1.0 , while two other parameters revealed low values of co-correlaton R 2 (pH & HCO 3 ) = 0.227, r(pH & Alkalinity) = 0.228 and were included in the shortlisted parameters. All correlations among significant parameters for PC3, PC4 and PC5 were low (see Figure 4); thus, all parameters with loads > 0.3 (Table 1)  Finally, our WQI was presented as a combination of 12 parameters with different normalized weighted factors: WQI = 0.2912 · (Cl) + 0.0979 · (pH + Alkalinity) + 0.0884 · (NH 4 + PO 4 )+ +0.0735 · (Cr + Fe + Mn) + 0.0589 · (Cu + SO 4 + K + NO 3 )  All the parameters (concentrations) in the Equation (11) should be scaled to [0,1]. The resulting WQI after applying the Equation (11) should be again re-scaled to [0,1] before using it for the geospatial modeling. The distribution of the calculated WQIs among the tested samples is presented in the Figure 5. The mean WQI was 0.24 in the tested locations, and the median was 0.22. These values signalled that less than 0.4% of the tested samples were actually characterized as highly polluted, with a WQI > 0.75. Distribution of WQIs across the spatial coordinates-latitude and longitude-does not show any significant trends.

Geospatial Modeling
In this research, we proposed and validated technique based on GPR and kernel structure selection using BIC. The optimal kernel structure obtained by the BIC method was found to be a sum of Gaussian and periodic kernels (see Equation (12)). The optimized hyper-parameters can be found in Table 2 with  To validate our approach and compare it to baseline methods (OK and UK with different kernels), we applied the cross-validation scheme with 5 different train/test splits (90% and 10% train/test split). Table 3 shows corresponding R 2 and RMSE values obtained for different cross-validation splits. From the Table 3 it can be noticed that on the 2-nd split our proposed Kriging with BIC method gives much better results comparing to other approaches. Results, close to our Kriging with BIC approach were obtained by the exponential kernel in average, however on the second split Kriging with BIC outperformed exponential kernel, thus being more robust. The worst result represented by the negative coefficient of determination values was obtained by using the polynomial kernel. In general, the coefficient of determination can also be negative. The negative values of the coefficient of the determination indicate an inappropriate model which means that a simple averaging will give better interpolation results than by using the proposed model. The length scale of the kernels fitted in the standard methods is 1, while the variance for Gaussian kernels varies in range from 0.0032 to 0.0054. The variance for exponential kernel is 0.04. It can be seen, that our approach with optimal kernel selection gave the best R 2 compared to the standard kriging methods. RMSE for our model was comparable to other methods, however, the standard deviation of errors on the different validation data subsets was minimal compare to the other approaches which make our proposed method beneficial. In the case of RMSE assessment, it is also important to compare the obtained RMSE values with the average value for WQI in the tested dataset. As can be seen, the average value of WQI was 0.24 ( Figure 5A); therefore, the calculated RMSE 0.065 indicated that proposed GPR model coupled with BIC is suitable for modelling. Finally, Figure 6 shows the results of geospatial modelling of WQI values and corresponding uncertainty maps, obtained by different approaches. The results clearly demonstrate the advantages of automatic kernel selection using BIC, allowing to recognize the local pollutant areas.  From obtained results, we can see, that the spatial distribution of estimated WQI is not uniform, which can be explained by two main reasons: natural and artificial. e.g., as was described in Section 3.1 the ions of Cl and Na, Mineralization, and HCO 3 , Alkalinity, pH mainly contribute to overall variance according to the PCA. We can suggest, that the driving factor of high variability of these characteristics is the aquifer composition itself and basic rocks interacting with waters. The occurrence of trace elements (heavy metals), phosphate, nitrates (PC3, PC4, PC5 loadings) can also significantly contribute to low WQI. Their presence in waters can be explained by various reasons, but more likely, they are associated with the agricultural activity since all of the listed parameters are contained in macro-and micronutrients, and livestock wastes. Additionaly, Quaternary horizons, in general, are highly permeable, so they are exposed to the high risk of contamination by filtrates and pollutants' further spread.

PCA-Weighted Approach in WQI Construction
The WQI was proposed for the first time in 1965 by Horton [54]; nowadays, its variations are widely used in ecological studies. The implementation of weighted factors for quality index construction is currently becoming very popular in environmental science. This procedure was applied earlier in the environmental sustainability index [55] and the Langat River water quality index [56]. Nevertheless, the large diversity of approaches to WQI development shows a list of vulnerabilities of the idea, and many details remain unclear: the high diversity of types of water resources on the global scale that cannot be described by the same measure, the consequent diverse number of parameters used [12], and, finally, the high level of subjectivity. For these reasons, most existing WQIs are not universal and may be used only in case studies [57]. As a suggestion, the WQI should be based on an algorithm, thus excluding subjectivity, including parameters with maximum loads into general variability, thus being adaptive, so policy-makers, water users and managers may be able to implement it according to local object features and purposes.
Application of different multivariate statistical techniques, such as PCA, helps to summarise complex or multi-dimensional issues in view of supporting decision-makers, and make the process less subjective [58]. The goal of the PCA is to reveal how different variables change in relation to each other, or how they are associated. This is achieved by transforming correlated original variables into a new set of uncorrelated variables using the covariance matrix, or its standartized form-the correlation matrix. The new variables are linear combinations of the original ones and are sorted into descending order according to the amount of variance they account for in the original set of variables. The proposed PCA-weighted WQI, which involves the most influential parameters, allows us to model the comprehensive environmental situation in the region. Obviously, this simplification is a logical step toward the description such complicated object as water resources, convenient for use in both scientific and practical applications. In the case of the New Moscow area with 1569 sampling points, our PCA-based approach helps to reveal the 12 crucial parameters of water quality (Cl, pH, Alkalinity, NH 4 , PO 4 , Cr, Fe, Mn, Cu, SO 4 , K, and NO 3 ) instead of the 25 parameters initially measured. All selected parameters, alone and in mixtures, may negatively affect human health. For example, pH is a crucial water-quality parameter that affects water chemistry, including alkalinity, speciation and solubility. Alkalinity in groundwater exceeding 200 mg/L gives an unpleasant taste and, thus, limits acceptance as potable water. Health concerns regarding sulphates in drinking water have been raised because of the reports that diarrhoea, catharsis, dehydration and gastro-intestinal irritation may be associated with the ingestion of water containing higher levels of sulphate.
A similar PCA-weighted approach has also been proposed by [59] for the water quality index. The authors applied the PCA with Varimax rotation to select the most important features of water quality and reduced the original dataset from 13 parameters to 9. These authors used all important features of water quality; however, unlike our approach, included even co-correlated parameters, which in practice meant overestimation of the final values.
In addition, we can highlight at least one possible disadvantage of the proposed approach, which may be connected with the required data sizes for PCA. For example, ref. [60] recommended that at least 150 cases are needed to obtain satisfactory results by using this method. At the same time, not every study of water quality assessment includes more than 150 collection points (as examples, [61,62]) due to high installation, operational, and maintenance costs for each sampling representative of the whole water system conditions.

Automatic Approach to Geospatial Mapping
It is a well-known fact that water legislation worldwide requires adequate and rigorous monitoring on different spatial and temporal scales, including different ecosystem components; usually, these monitoring programs are time-consuming and costly. A process and tool that can be used to perform an accurate and automatic geospatial interpolation of locally collected data are in high demand.
In this research, we propose an advanced method for geospatial modelling based on Gaussian Process Regression and Bayesian Information Criterion. The proposed approach allowed us to detect multiple local foci of environmental contamination, compared with the commonly used ordinary kriging and universal kriging, which were less accurate and effective for solving our particular problem ( Figure 6).
Our approach permits determination of the most realistic spatial distribution of the WQI due to the application of an automatically constructed kernel algorithm, which consists of basic non-linear kernels. Actually, when it comes to the end-to-end implementation in operational data processing chains, like geospatial modelling, it is mandatory to invest in models that are both accurate and robust but also require minimal user intervention for fitting parameters. An automatic kernel search helps to solve the problem of manual hyperparameter and kernel structure selection. According to the applied cross-validation, our model showed lack of over-fitting and provided an accurate prediction on the test dataset according to the used metrics. Recently, this approach of automatic kernel selection was used successfully in several cases, e.g., the estimation of chlorophyll-a concentrations from remote sensing data, [63], delineation referents of city centres from topographic data [64], soft-sensor modelling for algal bloom monitoring [65]. However, to date, it has not been transferred to geological modelling.

Conclusions
We developed an end-to-end framework that allowed us to automatically reconstruct the geospatial distribution of WQI with high accuracy. Our approach states the clear methodology from the step of initial data pre-processing and construction of the generalizing Water Quality Index using PCA to the automatic kernel structure search for geospatial mapping. We apply and show the feasibility and robustness of our proposed methodology in the case of WQI estimation in the New Moscow region. The novel approach of an automatic kernel structure search was adapted and applied in this framework, and this approach allowed us to achieve detailed results for geological modelling, compared with ordinary and universal kriging methods. Overall, our proposed methodology opens up wide possibilities for solving similar problems in which it was demonstrated in this paper, in the most accurate and efficient way. Despite being a minor but not the least, important advantage of this work is the relatively large size of the dataset which contains more than 1600 samples (each with 25 measured chemical parameters). We hope that it might be useful for validation and other methodological research in community and share it as open data [38].

Data Availability Statement:
The data that support the findings of this study are openly available at https://doi.org/10.6084/m9.figshare.10283225.