Distribution of Groundwater Arsenic in Uruguay Using Hybrid Machine Learning and Expert System Approaches

: Groundwater arsenic in Uruguay is an important environmental hazard, hence, predicting its distribution is important to inform stakeholders. Furthermore, occurrences in Uruguay are known to variably show dependence on depth and geology, arguably reﬂecting different processes controlling groundwater arsenic concentrations. Here, we present the distribution of groundwater arsenic in Uruguay modelled by a variety of machine learning, basic expert systems, and hybrid approaches. A pure random forest approach, using 26 potential predictor variables, gave rise to a groundwater arsenic distribution model with a very high degree of accuracy (AUC = 0.92), which is consistent with known high groundwater arsenic hazard areas. These areas are mainly in southwest Uruguay, including the Paysand ú , R í o Negro, Soriano, Colonia, Flores, San Jos é , Florida, Montevideo, and Canelones departments, where the Mercedes, Cuaternario Oeste, Raig ó n, and Cret á cico main aquifers occur. A hybrid approach separating the country into sedimentary and crystalline aquifer domains resulted in slight material improvement in a high arsenic hazard distribution. However, a further hybrid approach separately modelling shallow (<50 m) and deep aquifers (>50 m) resulted in the identiﬁcation of more high hazard areas in Flores, Durazno, and the northwest corner of Florida departments in shallow aquifers than the pure model. Both hybrid models considering depth (AUC = 0.95) and geology (AUC = 0.97) produced improved accuracy. Hybrid machine learning models with expert selection of important environmental parameters may sometimes be a better choice than pure machine learning models, particularly where there are incomplete datasets, but perhaps, counterintuitively, this is not always the case.


Introduction
Arsenic in groundwater utilized as drinking water constitutes a major public health hazard in many parts of the world [1], most notably in Asia [2][3][4][5][6], but also in South America [7,8]. While the countries most impacted in South America have been reported as Argentina [9,10] and Chile [11], groundwater arsenic has also been identified as a public health concern in Uruguay since 2007 [12,13]. High groundwater arsenic occurrences in Uruguay have been documented in the Raigón and Mercedes aquifers in southwest Uruguay [12,14]. Strong positive correlations were reported between arsenic, vanadium, fluoride, and sodium in shallow wells up to 50 m in depth [15,16]. Quaternary ash deposits were regarded as a possible primary source of geogenic arsenic in aquifers in the southern part of the country [14,17,18].
The utilization of groundwater resources has increased rapidly since the 1950s. In Uruguay, domestic drinking water demands in urban areas are mainly satisfied by the Table 1. Names, types, and descriptions of the models generated in this study.

No.
Model Code Name of Model Pure/ Hybrid Combination of the percentage (%) of OSE arsenic concentrations by pixel >10 µg/L in each aquifer in Uruguay, used to compared with the above models

Study Area
Uruguay is a country in the southeast part of South America, bordered by Argentina to the west, Brazil to the east and north, the Río de la Plata estuary to the south, and the Atlantic Ocean to the southeast. Uruguay has a population of around 3.3 million, of which 1.8 million live in its capital and largest city, Montevideo [33].
In Uruguay, the geological environments of the major aquifers used for water supply reveal a great deal of variability, ranging from a sedimentary basin fill with good matrix porosity and permeability to a crystalline basement with permeability largely related to interconnected faults and fractures [17,34].
The country consists of three hydrogeological systems: the Paranaense, Meridional, and Costero systems [35]. The Paranaense system, located in the northcentral and northeast parts of the country, includes the Arapey fissured basaltic aquifer and the Guaraní, Mercedes, and Salto sedimentary aquifers. The Guaraní aquifer is one of the largest freshwater reservoirs in the world, and one of the most important aquifers in the country. The Guaraní aquifer contains approximately 37,000 km 3 of water, covering 1.2 million km 2 in Argentina, Brazil, Paraguay, and Uruguay [16]. The Meridional system underlies most of the country's area, and is comprised largely of fissured aquifers of the Uruguayan Precambrian cratons [34,36,37]. The Costero system is formed by: (i) the Raigón sedimentary aquifer of the Santa Lucía Basin, which is the largest reserve of groundwater in the southern part of the country; (ii) quaternary sediments of the Merín Basin; and (iii) the Chuy aquifer, located on the eastern Uruguayan coast [17,34,35].
Groundwater arsenic, in general, may be associated with some possible anthropogenic sources, such as fertilizer and pesticides, mining activities, and industrial/urban pollution of large population centers [38]. However, to date, there has been little if any definite evidence for anthropogenic arsenic in Uruguayan aquifers; rather, the natural origins of groundwater arsenic have received widespread attention in Uruguay. The bulk concentration of arsenic in the upper continental crust is between 1.5 to 2 mg/kg [39], although it can be much higher in certain igneous and metamorphic rocks, as well as some types of sedimentary rocks, such as mudstones [40]. In Uruguay, high arsenic concentrations have been found in the following geological units: (i) Precambrian basement rocks related to sulfur-bearing minerals, notably pyrite and arsenopyrite; (ii) sedimentary aquifers (e.g., Raigón and Salto Aquifers) from the Tertiary Period to the present impacted by ash-fall deposits from Andes volcanic activities [41]; arsenic occurs within a large variety of minerals, as well as being absorbed into clay and oxide-hydroxide minerals; and (iii) the Guaraní aquifer, inferred from high arsenic concentrations detected in some water wells [16] and suspected to be related to the alteration of sulfur minerals.
Dissolved arsenic species in Uruguay aquifers are dominated by inorganic As(III) and As(V), but with contrasting As(III)/As(V) ratios in different aquifer types [15]. As(V) has been found to be the predominant species in the majority of the aquifers, particularly those with elevated arsenic concentrations; low As(III)/As(V) ratios are typically associated with elevated pH, possibly reflecting pH-dependent desorption processes as a major mechanism for arsenic mobilization. As(III) has been found to be the predominant species in groundwater from the Guaraní aquifer, while groundwater from the Mercedes aquifer typically has been found to have intermediate As(III)/As(v) ratios [15].

Dataset Compilation
Geolocated (longitude/latitude) ( Figure 1) groundwater total arsenic concentration (tAs) data (n = 504), mostly (n = 432) accompanied by depth data, were kindly provided by the Uruguay public water supply company, Obras Sanitarias del Estado (OSE). For each well, the average of the arsenic concentrations from sampling carried out in 2018 and 2019 was used, although no systematic substantial variations in arsenic concentration with the year of sampling were noted: (i) 140 arsenic data points exceeded a concentration of 10 µg/L; and (ii) 347 data points were located in the sedimentary aquifers and 157 data points were located in the crystalline aquifers. Figure S1 shows the frequency of groundwater arsenic concentrations.  [34]. Note that the Cretácico Arapey Formation is composed of basalts, with vacuolar porosity and some interbedded sandstones. Although arguably ambiguous, we classify these here as crystalline. Groundwater arsenic data are from OSE. Aquifer map (a) modified from [17].

Dataset Preparation
Prior to modelling, 504 arsenic concentrations were assigned to one square kilometer pixels. Where more than one data point was available within a pixel, the geometric mean of those values was taken to represent the groundwater arsenic concentration of the pixel, resulting in a reduction of the number of groundwater arsenic data to 434. Of these, 26% (115) groundwater arsenic concentrations exceeded 10 µg/L. The prediction of a binary dependent variable can avoid some related uncertainties, therefore improving the accuracy and effectiveness of models. The averaged arsenic concentrations were therefore converted into a binary variable (0 or 1) according to whether the arsenic concentration was less than or equal to 10 µg/L or greater than 10 µg/L. The proportions of arsenic concentrations by pixel exceeding 10 µg/L in each department (administrative division) and the whole of the country are listed in Table 2.   The above dataset was used to develop a pure overall model to predict the distribution of groundwater arsenic in the whole country. For comparison to hybrid models with expert parameter selection and robust accounting for potential differences in groundwater arsenic distribution between shallow and deep and between sedimentary and crystalline aquifers in Uruguay, four other subsets of the data were prepared and modelled: (i) shallow aquifers (≤50 m, 234 arsenic concentrations); (ii) deep aquifers (>50 m, 153 arsenic concentrations); (iii) sedimentary aquifers (300 arsenic concentrations); and (iv) crystalline aquifers (134 arsenic concentrations). Then, the five datasets were randomly split into training (80%) and testing (20%) datasets, maintaining the same ratio of low to high values in the entire set and the subsets [24,31].

Machine Learning (Random Forest) Modelling
A random forest model is an ensemble of decision trees, and is a classification model that can be used to predict binary target variables. In a random forest model, the binary dependent variable (groundwater arsenic concentration in this study) is split based on both independent variables at each branch and their cutoff values. In a random forest, randomness is introduced into the growth of each individual decision tree by: (i) only two or three data rows in the training dataset being used to grow an individual tree (some of which are selected multiple times due to random selection with replacement of data rows); and (ii) a restricted number of predictors and randomly chosen predictor combinations at each branch, leading to trees developing differently. Introduced randomness and taking the classification average of the class prediction results of all trees as the final result effectively eliminates multicollinearity amongst predictor variables, producing a more robust model. [50,51] In order to comprehensively understand the distribution of arsenic in groundwater, compare the prediction performance between pure and hybrid machine learning models, and discover differences in groundwater arsenic distribution between shallow and deep and between sedimentary and crystalline aquifers in Uruguay, we first established five models (Table 1)  The optimal number of predictors at each branch of the trees grown in a random forest was determined by trying values between 1 and 26 (the total number of potential predictors) and comparing the out-of-bag (OOB) error results. The number with the smallest OOB error, which produced the most accurate model, was then used as the optimal number [24]. The random forest models produced in this study encompassed 1001 trees. These models were applied to create the probability maps of groundwater arsenic concentrations exceeding 10 µg/L in Uruguay.
Then, the probability maps were converted into occurrence maps of groundwater arsenic concentrations exceeding 10 µg/L by using two approaches: (i) a default probability cutoff value of 0.5, and (ii) the cutoff where sensitivity was equal to specificity [24,31], representing a model that arguably classifies high and low arsenic values equally well. The accuracy of each random forest model was evaluated by the area (AUC) under the receiver operating characteristic (ROC) curve calculated on its testing dataset, which is produced in turn by plotting sensitivity and specificity against the probability cutoff from 0 to 1. The area under the ROC curve (AUC) in general varies between 0.5 (for a random model) and 1 (for a perfect model) [52]. An AUC < 0.5 can theoretically be achieved for a model that is worse than random.

The Importance of the Predictors
The importance of the predictors in the random forest models were assessed by two statistical indices: (i) the decrease in accuracy, and (ii) the decrease in Gini node impurity. Both decreases in the two indexes were normalized by their largest values, respectively. Higher positive values of the decreases in accuracy and Gini node impurity indicated a greater relative importance of the predictor. Nevertheless, predictors with a negative decrease in value do not benefit the model, and were therefore removed from the models.

The Comparison of Prediction Performance between Pure and Hybrid Machine Learning Models
Hybrid machine learning models using expert selection for important factors, in this case depth and geology, are worthy of being studied to determine whether they have better predictive performance than pure ones. We combined shallow with deep and sedimentary with crystalline testing datasets to calculate AUC values of hybrid depth (2C-HML-Depth) and geology (3C-HML-Geol) models, respectively, and then compared with the AUC value of the pure overall model (1A-ML-Pure). Furthermore, another pure overall model with sedimentary or crystalline as the new predictor (1B-ML-Pure with Geol) was created to compare with the hybrid geology (sedimentary and crystalline; 3C-HML-Geol) models, so that the impact of expert selection of groundwater geology (sedimentary and crystalline) could be evaluated intuitively and effectively. Hybrid machine learning models were also compared with a simple expert system aquifer model (4-ES-Aqui), which consisted of the percentage (%) of OSE arsenic concentrations by pixel exceeding 10 µg/L in each aquifer in Uruguay (Table 1).

Pure Overall Machine Learning Model
A total of 434 averaged groundwater arsenic concentrations (26% > 10 µg/L) and their matching potential predictors were used to develop the pure overall model (1A-ML-Pure) for the whole of Uruguay. The optimal number (producing the lowest out-of-bag (OOB) error) of predictors at each branch of the trees grown in the pure overall random forest was 19. The comparison of the OOB error results by varying the branch predictor number between 1 to 26 is shown in Table S2. Two potential predictors, calcisols and fluvisols, with negative mean decreases in accuracy and/or Gini node impurity did not contribute to the model, hence, they were removed from the model (1A-ML-Pure).
The cross-validation results of the pure overall model (1A-ML-Pure) as applied to its testing dataset are plotted in Figure S2a. The area under the ROC (receiver operator characteristic) curve (AUC) was 0.92. This was somewhat better than that (AUC = 0.89) for a recent global groundwater arsenic random forest model, the dataset of which does not contain any arsenic concentration from Uruguay [20]. The pure overall model (1A-ML-Pure) presented here, therefore, provided a better prediction result at the country-specific scale for Uruguay. This is perhaps not a surprising result, given the global diversity of processes leading to elevated groundwater arsenic concentrations.
The prediction model ( Figure 2) using cutoff values of 0.5 and 0.71 ( Figure S3a) captured a markedly large area of high hazard levels of groundwater arsenic in southwest Uruguay, including the Paysandú, Río Negro, Soriano, Colonia, Flores, San José, Florida, Montevideo, and Canelones departments, where the Mercedes, Cuaternario Oeste, Raigón, and Cretácico main aquifers exist (Figure 1). About 35-60% of Mercedes, 53-88% of Cuaternario Oeste, 25-57% of Raigón, and 2-15% of Cretácico aquifer areas were modelled as high groundwater arsenic (>10 µg/L) hazard areas (Table 3). Meanwhile, relatively small areas of high groundwater arsenic hazard were also indicated in the Tacuarembó, Durazno, and Rocha departments. The Raigón aquifer, located in the San José department, is the most exploited groundwater resource in the country. The areas exposed to arsenic concentrations greater than 10 µg/L in the departments/country and aquifers generated by the pure overall model (1A-ML-Pure) are listed in Tables 2 and 3, respectively.  The high groundwater arsenic occurrence in southwestern Uruguay is likely related to continental sediments containing volcanic ashes, as in the Puelche aquifer in Argentina. This is consistent with positive correlations between arsenic and chemical elements, such as vanadium, typically encountered in volcanic ashes [18,53].
The importance of the predictors in the pure overall model (1A-ML-Pure) was assessed by the normalized mean decreases in accuracy, as well as Gini node impurity, which are displayed together in Figure 5a. The importance of aridity significantly ranks above the others, followed by the soil pH and topographic wetness index. However, the least important predictors of those included for consideration were soil water capacity and solonchaks. Aridity may accelerate the evaporation of surface waters, and, therefore, increase arsenic concentrations in groundwater recharge. In general, large-scale geogenic groundwater, particularly those related to oxidizing conditions, tend to occur in inland or closed basins in arid or semi-arid areas. Soil pH can impact the desorption of arsenic from mineral oxides, especially Fe oxides [42].

Hybrid Machine Learning Model with Expert Selection of Depth (Shallow and Deep)
In Uruguay, high arsenic concentrations (defined here as >10 µg/L) were more likely to be observed in shallow (31%; n = 72 of 234) than deep (24%; n = 36 of 153) aquifers. For some aquifers, notably the Basamento Cristalino, Raigón, and Devónico-pérmico, 50 m might be able to be used as a reasonably effective single depth cutoff to distinguish whether high arsenic is likely to occur ( Figure S4). Two datasets, 234 averaged arsenic concentrations with depths ≤ y50 m and 153 averaged arsenic concentrations with depths >50 m, were therefore used to produce the hybrid shallow (2A-HML-Shal; ≤50 m) and deep (2B-HML-Deep; >50 m) aquifer machine learning models.
For the hybrid shallow model (2A-HML-Shal), the number of predictors at each branch of the decision trees was three, which produced the most accurate model (for the comparison, see Table S2). The predictors-coarse fragments, soil water capacity, and landform-with negative mean decreases in accuracy and/or Gini node impurity did not benefit the model and were removed. However, for the hybrid deep model (2B-HML-Deep), the number of predictors at each branch was 15, and the predictors-calcisols, slope, and landform-were excluded from the model due to their negative mean decreases in accuracy and/or Gini node impurity. The AUC values of hybrid shallow (2A-HML-Shal; ≤50 m) and deep (2B-HML-Deep; >50 m) aquifer models were 0.92 ( Figure S2d) and 0.96 ( Figure S2e), respectively. These values are close to 1 (perfect predictive model), reflecting that both of these models have an excellent predictive ability, even better than the pure machine learning overall model (1A-ML-Pure).
The arsenic probability and occurrence maps using cutoffs of 0.50 or 0.67 ( Figure  S3c) for shallow aquifers and 0.50 or 0.73 ( Figure S3d) for deep aquifers, developed by the hybrid shallow (2A-HML-Shal) and deep (2B-HML-Deep) models, are displayed in Figure 3. Both shallow and deep aquifers possessed similar distributions of high arsenic groundwater in the four departments in southwestern Uruguay: Paysandú, Río Negro, Soriano, and Colonia, which are provided for by the Mercedes and Cuaternario Oeste aquifers. Furthermore, high arsenic occurs more in shallow groundwater in the San José (where the Raigón aquifer is located), Canelones (cf. Cretácico aquifer), the northwest corner of Florida, west Durazno, and Flores departments. On the contrary, the junction of the Treinta y Tres and Rocha departments, which is underlain by the Cuaternario Este aquifer, and the junction of the Tacuarembó and Durazno departments, underlain by the Arapey aquifer, tended to be characterized by higher arsenic in the deep aquifers. The areas exposed to groundwater arsenic concentrations greater than 10 µg/L in the departments/country and aquifers generated by hybrid shallow (2A-HML-Shal) and deep (2B-HML-Deep) models are listed in Tables 2 and 3, respectively. High hazard areas generated by the hybrid deep aquifer model (2B-HML-Deep) were almost completely overlapped by high hazard areas predicted by the pure overall model (1A-ML-Pure), however, high hazard areas of the hybrid shallow model (2A-HML-Shal) far exceeded those of the pure overall model (1A-ML-Pure), especially in the Flores, Durazno, and the northwest corner of Florida departments. The hybrid shallow model (2A-HML-Shal) had better utility in identifying potential high hazard areas, which was also one of the purposes of modelling at different depths. Moreover, the AUC value (0.95) of hybrid depth models (2C-HML-Depth; Figure S2c), using combined testing datasets, was greater than that of the pure overall model (1A-ML-Pure; Figure S2a), showing a better accuracy of the hybrid models, with expert selection of depth as a key variable. The normalized mean decreases in accuracy and Gini node impurity of predictors reflecting their importance for both hybrid shallow (2A-HML-Shal) and deep (2B-HML-Deep) models are shown together in Figure 5c,d. Aridity, precipitation, soil pH, and the topographic wetness index were of relatively high importance in both models. Meanwhile, the ranks of importance of soil cation exchange capacity, sand, soil organic carbon density, clay, and soil and sedimentary deposit thickness in the two models were obviously different (rank difference >5).

Hybrid Machine Learning Model with Expert Selection of Geology (Sedimentary and Crystalline)
High arsenic groundwater tended to appear more in sedimentary (36%; n = 108 of 300) than crystalline (5%; n = 7 of 134) aquifers, this potentially being related to different predominant processes of mobilization. Separate datasets of averaged arsenic concentrations in sedimentary aquifers and in crystalline aquifers were therefore used to develop hybrid sedimentary (3A-HML-Sed) and crystalline (3B-HML-Cry) models, respectively. The hybrid sedimentary model (3A-HML-Sed) performing most accurately was an ensemble of decision trees with two predictors at each branch (Table S2). Calcisols with negative importance indexes (mean decreases in accuracy and/or Gini node impurity) were removed from the hybrid sedimentary model (3A-HML-Sed). For the hybrid crystalline model (3B-HML-Cry), the number of predictors at each branch of the decision trees was three, which produced the most accurate model (Table S2). The predictors soil and sedimentary deposit thickness, coarse fragments, slope, solonchaks, and landform were excluded from the hybrid crystalline model (3B-HML-Cry) because of their negative importance values. The AUC values of the two models were 0.91 ( Figure S2g) and 0.98 ( Figure S2h), respectively, indicating good classification ability (close to 1).
The probability and occurrence maps, defined by cutoffs of 0.5 or 0.63 ( Figure S3e) for sedimentary aquifers and 0.5 or 0.85 ( Figure S3f) for crystalline aquifers, of groundwater arsenic exceeding 10 µg/L in the sedimentary and crystalline aquifers are plotted in Figure 4. Comparing the results of the hybrid sedimentary (3A-HML-Sed) and crystalline (3B-HML-Cry) models, almost all (>99%) high arsenic hazard areas were in sedimentary aquifers. Only limited areas of high groundwater arsenic were modelled in crystalline aquifers in the southwest corner of the Colonia department and the junction of Florida and Flores departments. The two prediction maps of the hybrid sedimentary (3A-HML-Sed) and crystalline (3B-HML-Cry) aquifer models were also combined to form a complete prediction for the whole of Uruguay (Figure 4e,f). The distribution of high arsenic hazard areas of the combined maps was similar to that of the pure overall model (1A-ML-Pure). From the occurrence maps (Figure 4f vs. Figure 2b), only slight differences existed in the Durazno, Flores, and northwest corner of Florida departments. The so-derived hybrid model (3C-HML-Geol) made a more accurate prediction than the pure overall model (1A-ML-Pure), as indicated by its great AUC value of 0.97 ( Figure S2f). The areas exposed to arsenic concentrations greater than 10 µg/L in the departments/country and aquifers generated by hybrid sedimentary (3A-HML-Sed) and crystalline (3B-HML-Cry) models are listed in Tables 2 and 3, respectively.
The concentrations of arsenic in sedimentary rocks typically ranged between 5 and 10 mg/kg [54], slightly above average terrestrial abundance. Arsenic in sedimentary rocks can be released into groundwater through water-rock interactions in the aquifers. Under the arid conditions in South America, both silicate and carbonate reactions are prominent, and the pH of groundwater often tends to be high. In such oxidizing groundwater, arsenic predominantly exists as As(V) [55]. Metal oxides in the sediments, particularly Fe and Mn oxides and hydroxides, are regarded as the main sources of dissolved arsenic as a result of desorption in the high pH groundwater environment [42].
The importance of the various predictors in the hybrid sedimentary (3A-HML-Sed) and crystalline (3B-HML-Cry) models were assessed by the normalized mean decrease in accuracy, as well as in Gini node impurity (Figure 5e,f). Aridity, precipitation, soil cation exchange capacity, potential evapotranspiration, and the topographic wetness index were of relatively high importance in both models. Meanwhile, the ranks of importance of silt, temperature, and soil water capacity in the two models were obviously different (rank difference >5).
In order to effectively evaluate the impact of expert selection of groundwater geology (sedimentary and crystalline) in the hybrid models, another pure overall model with sedimentary or crystalline as the new predictor (1B-ML-Pure with Geol) was created to compare with the hybrid geology (sedimentary and crystalline; 3C-HML-Geol) models. The probability and occurrence maps of this new pure overall model (1B-ML-Pure with Geol) are shown in Figure S5. These are very similar to those of the previous pure overall model (1A-ML-Pure; Figure 2), and this may be mainly because the importance of sedimentary or crystalline is relatively low in the new pure overall model (1B-ML-Pure with Geol), ranking as the ninth least important of the predictor variables considered (Figure 5b). However, we found that high groundwater arsenic was obviously more likely to occur in sedimentary (36%; n = 108 of 300) than crystalline (5%; n = 7 of 134) aquifers (Figure 1b). Sedimentary or crystalline as a predictor in the pure model (1B-ML-Pure with Geol) could not reflect its actual individual importance, so hybrid models with expert selection for sedimentary and crystalline as a categorical variable may be a better model choice.
Two pure overall (with (1B-ML-Pure with Geol)/without (1A-ML-Pure) sedimentary or crystalline as the new predictor), hybrid depth (2C-HML-Depth), and hybrid geology (3C-HML-Geol) models were validated based on their combined testing datasets. Their AUC values are shown in Table 4 and Figure S2. The AUC values of hybrid models were slightly higher than that of the pure models. Although the differences in AUC values were not large, the hybrid machine learning models had slightly better accuracy and predictive performance than the pure models. Therefore, hybrid machine learning models with expert selection of important environmental parameters warrant further study and use for predicting groundwater contaminants, such as arsenic.

Conclusions and Limitations
Geostatistical models of the distribution of groundwater arsenic in Uruguay were generated by a variety of basic expert system and machine learning approaches in order to provide an overview of high arsenic hazard areas in Uruguay. A pure random forest model (1A-ML-Pure) using 26 potential predictor variables gave rise to a groundwater arsenic distribution model with a very high degree of accuracy (AUC = 0.92), and was consistent with known high groundwater arsenic hazard areas and with the higher prevalence of high arsenic groundwater in sedimentary rather than crystalline aquifers. Obvious differences in high groundwater arsenic hazard areas were also modelled between shallow and deep aquifers. The modelled distribution of groundwater arsenic concentrations was more accurate, and gave rise to a more detailed spatial resolution of groundwater arsenic hazard areas than simple expert system models, based upon aquifer classification alone.
A hybrid approach (3C-HML-Geol model), separating the country into sedimentary and crystalline aquifer domains, resulted in improved accuracy (AUC = 0.97) and a slight material improvement in the modelled distribution of high arsenic hazard areas compared to the pure machine learning model (1A-ML-Pure). A further hybrid approach (2C-HML-Depth model) separately modelling shallow (≤50 m) and deep (>50 m) aquifers also resulted in marginally improved accuracy (AUC = 0.95).
Hybrid machine learning models with expert selection for sedimentary and crystalline (3C-HML-Geol) with higher AUC values (0.97) was a better choice than the pure overall model with sedimentary or crystalline as the new predictor (1B-ML-Pure with Geol; AUC = 0.92). Moreover, hybrid models take the dependence of groundwater arsenic on depth and geology into account more substantially and comprehensively.
Therefore, hybrid machine learning models with expert selection of important environmental parameters may sometimes be a better choice than pure machine learning models, particularly where there are incomplete datasets, and where the processes controlling (and, hence, the predictors better modelling) groundwater arsenic concentrations are materially different for different areas and/or depths. Perhaps, counterintuitively, this is not always the case. Hybrid geospatial models deserve to be further studied and used for predicting groundwater contaminants, such as arsenic.
Although this study is based on arsenic concentration data widely distributed across the country, aquifer heterogeneity can cause material changes in the concentration of arsenic in groundwater within a short distance, limiting the small-scale predictive accuracy of the models, therefore, targeted testing is still needed to determine whether a particular well is highly contaminated with arsenic.
While the data available have supported the development of country-wide models, it is noted that the dataset is dominated by shallow wells, with some areas of the country having very limited deep well data. Obtaining further data from deeper wells in these areas would help to improve the model in these areas, although it is recognized that there will be required costs and inputs in terms of money, time, human resources, and technical support to achieve this. of high arsenic (>10 µg/L) in the aquifers, and geometric mean of arsenic concentrations was taken within a pixel (1 km 2 ). Modelled high groundwater hazard areas defined as those with a probability of arsenic concentration being than 10 µg/L exceeding a cutoff value of 0.5 and a specific cutoff value where sensitivity is equal to specificity (overall: 0.71; shallow: 0.67; deep: 0.73; sedimentary: 0.63; crystalline: 0.85). The mean of modelled high groundwater hazard areas defined by two cutoffs were used to produce the (b-e) and been shown in Table S3, Table S1: Potential predictors used in the machine learning models. Descriptions and data sources are listed. Predictors are grouped into 4 categories: climate, soil, topography, and lithology, Table S2: Comparison of number of available predictors at each branch in random forest models from 1 to 26 (the total number of potential predictors). The number shown in grey shadow is the optimum number of predictors at each branch in the model, Table S3  Data Availability Statement: Data presented in this study not otherwise available from the references and organisations indicated in the text may be available on request from the corresponding authors.