Assessing the Suitability of Boosting Machine-Learning Algorithms for Classifying Arsenic-Contaminated Waters: A Novel Model-Explainable Approach Using SHapley Additive exPlanations

Abstract: There is growing tension between high-performance machine-learning (ML) models and explainability within the scientific community. In arsenic modelling, understanding why ML models make certain predictions, for instance, "high arsenic" instead of "low arsenic", is as important as the prediction accuracy. In response, this study aims to explain model predictions by assessing the influence of input variables, i.e., pH, turbidity (Turb), total dissolved solids (TDS), and electrical conductivity (Cond), on arsenic mobility. The two main objectives of this study are to: (i) classify arsenic concentrations in multiple water sources using novel boosting algorithms such as natural gradient boosting (NGB), categorical boosting (CATB), and adaptive boosting (ADAB) and compare them with other existing representative boosting algorithms, and (ii) introduce a novel SHapley Additive exPlanation (SHAP) approach for interpreting the performance of ML models. The outcome of this study indicates that the newly introduced boosting algorithms produced efficient performances, comparable to the state-of-the-art boosting algorithms and a benchmark random forest model. Interestingly, the extreme gradient boosting (XGB) model proved superior to the remaining models in terms of overall and single-class performance metrics. Global and local interpretation (using SHAP with XGB) revealed that high pH water is highly correlated with high arsenic water and vice versa. In general, high pH, high Cond, and high TDS were found to be the potential indicators of high arsenic water sources. Conversely, low pH, low Cond, and low TDS were the main indicators of low arsenic water sources. This study provides new insights into the use of ML and explainable methods for arsenic modelling.


Introduction
In recent years, elevated concentrations of arsenic in water bodies have increasingly become a global health challenge due to their toxic nature and adverse effects on human health [1]. It is estimated that more than 200 million people worldwide are chronically exposed to elevated arsenic concentrations in drinking water [2]. The real danger lies in situations where an exposed community is unaware of arsenic contamination because the long-term health effects of arsenic poisoning are caused by chronic exposure [3]. Such instances resulted in the largest arsenic poisoning in Bangladesh and West Bengal, India, where 35-77 million people were estimated to be at risk of ingesting arsenic-contaminated water above the WHO recommended limit of 10 µg/L [4]. The fundamental intervention is to test all drinking water sources in order to identify and provide arsenic-free drinking water [4]. Unfortunately, this positive call is mostly a burden, especially for rural communities in developing countries, because the quantification of arsenic usually involves very expensive methods such as atomic absorption spectrophotometry, together with highly trained technicians and high maintenance costs. Hence, an indirect way of estimating arsenic concentration from physicochemical parameters such as pH, electrical conductivity (Cond), total dissolved solids (TDS) and turbidity (Turb), which are much simpler to measure, could be vital in detecting arsenic contamination.
Over the past few years, many studies have demonstrated the applicability of machine learning (ML) methods in modelling arsenic concentrations in water sources and have found that ML can achieve consistent or even better results than traditional methods [5][6][7][8][9]. Among other ML methods, boosting algorithms have emerged as robust and competitive techniques that have been consistently placed among the top contenders in most Kaggle competitions [10,11]. Their performances are strongly justified and backed up theoretically by their ability to improve the performance of weak classification models by combining the outputs of many "weak" classifiers [12]. Furthermore, these models can absorb more input variables and adequately describe non-linear and complicated relations between variables. For example, boosted regression trees (BRT) have been widely adopted in modelling arsenic concentration in water sources [5][6][7][8]. Ayotte et al. [13] applied BRT and logistic regression (LR) to predict the probability of arsenic exceeding 10 µg/L or 5 µg/L in drinking water wells in the Central Valley, California. They reported higher predictive accuracy for BRT compared to LR. A recent study by Ibrahim et al. [9] also introduced two novel boosting variants, extreme gradient boosting (XGB) and light gradient boosting (LGB), to the arsenic modelling domain. Their study indicated that boosting algorithms are very efficient and produce satisfactory predictions comparable to state-of-the-art ML methods. It is evident that the applicability and superiority of boosting algorithms in modelling complex underlying relationships are well justified. However, limited studies have investigated the performance of other new variants of boosting algorithms in arsenic modelling. Interestingly, new variants of boosting algorithms such as natural gradient boosting (NGB), categorical boosting (CATB), and adaptive boosting (ADAB) exist in the literature and have been used successfully in various hydrological modelling tasks such as estimating daily reference evapotranspiration [14,15], runoff probability prediction [16] and pan evaporation estimation [17]. Given that no single boosting algorithm is consistently better than another [18], it is critical to investigate the suitability of these new boosting models for modelling arsenic concentration in water sources.
Aside from their predictive ability, ML or boosting models use complex algorithms with difficult-to-understand decisions and processes. In critical areas such as arsenic modelling, the inability of end users to understand these models appears problematic, because entrusting critical decisions to a system that cannot explain itself poses obvious dangers. ML models frequently struggle to generalise beyond the circumstances seen during training, yet when given out-of-distribution samples, they still make (mostly incorrect) predictions with high confidence [19]. To make these models safe, fair, and reliable, explainable ML models are needed to assist end-users, decision-makers and regulators in accurately determining why models make certain predictions [20]. Existing research [5][6][7][8][9] has used methods such as variable importance scores and partial dependency plots to determine the influence of predictor variables on arsenic mobility. Such methods, however, are incapable of determining the importance of predictor variables for individual classes in a classification problem and/or their impact on individual predictions. Again, such methods assume feature independence even when features are correlated [20]. The recent introduction of the novel SHapley Additive exPlanation (SHAP) approach has proved to be useful in understanding the response of the ML output to its inputs [21]. SHAP has shown satisfactory results in several scientific research projects on feature analysis [22][23][24][25][26][27]. To the best of our knowledge, only a few studies have used explainable ML algorithms to analyse variable attribution in hydrology. Hence, more research on model explainability and interpretability is required to decipher the complex hydrogeochemical controls of arsenic mobilisation.
In the quest to introduce more efficient boosting algorithms while also explaining why certain decisions are made by these models, the current study therefore seeks to: (i) determine the efficacy of six boosting algorithms, i.e., gradient boosting machine (GBM), XGB, LGB, NGB, CATB and ADAB, in classifying low (<5 µg/L), medium (>5 to ≤10 µg/L) and high (>10 µg/L) arsenic water; and (ii) identify the importance and contribution of each predictor variable in each individual class (high, low and medium) and visually interpret the complex non-linear behaviour underlying arsenic mobility. The comparison is extended to a non-boosting model, i.e., random forest (RF), which can be considered a benchmark model due to its success in arsenic modelling [8,9,28,29].
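As a minimal illustration of the class definitions above, the three arsenic categories can be encoded with a simple thresholding function. This is an illustrative sketch: the assignment of a concentration of exactly 5 µg/L to the medium class is an assumption, since the quoted ranges leave that boundary value unstated.

```python
def arsenic_class(conc_ug_per_l: float) -> str:
    """Map an arsenic concentration (ug/L) to the three classes used in the study.

    Thresholds follow the ranges quoted in the text: low (<5 ug/L),
    medium (>5 to <=10 ug/L) and high (>10 ug/L); exactly 5 ug/L is
    assigned to the medium class here by assumption.
    """
    if conc_ug_per_l < 5.0:
        return "low"
    elif conc_ug_per_l <= 10.0:
        return "medium"
    return "high"

# Example concentrations spanning the three ranges.
labels = [arsenic_class(c) for c in (2.0, 7.5, 10.0, 28.5)]
```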

Study Area
In this study, samples were collected in a few selected mining communities near Ayanfuri in Ghana's Upper Denkyira West district (Figure 1). The district is located between latitudes 6°0′ N and 5°55′ N and longitudes 1°57′ W and 1°52′ W, and it borders the districts of Bibiani-Anhwiaso-Bekwai to the north-west, Amansie West and Amansie Central to the north-east, Wassa Amenfi East and Wassa Amenfi West to the south-west, and Upper Denkyira East Municipal to the south [30]. The study area's average temperature, specifically at Ayanfuri, was 26.4 °C, with the warmest and coldest temperatures recorded in March (average of 27.6 °C) and August (average of 24.6 °C), respectively [30]. As shown in Figure 1, the majority of the research area is underlain by Paleoproterozoic Birimian flysch-type metasediments composed of dacitic volcaniclastics, greywackes, and argillaceous (phyllitic) sediments that have been strongly folded, faulted, and metamorphosed to upper greenschist facies [31,32]. The sediments have been intruded by a variety of small granite masses as well as several regional formations. Gold mineralisation occurs within the Ashanti Gold Belt's granitic plugs and sills or dykes in southern Ghana, West Africa, as well as two or three regional shear structures [31]. Mineralisation occurs as extremely fine grains, frequently at sulphide grain boundaries and in sulphide fractures, primarily at or near vein edges, with coarse visible gold seen in the quartz on occasion [31]. High-grade gold intercepts are frequently associated with very coarse arsenopyrite ± sphalerite, chalcopyrite, and galena [31]. Groundwater (boreholes, pipe-borne, and hand-dug wells) and surface water (rivers and streams) supply more than 70% and 13% of the area's drinking water, respectively [33].

Data Description
A total of 597 water samples, 354 from groundwater sources and 243 from surface water sources, were obtained within the study area between 2011 and 2018. The sampling locations are shown in Figure 1. Groundwater samples were mostly taken from residential and public boreholes as well as hand-dug wells, while surface water samples were obtained from rivers and streams. Prior to sampling, 1 L polythene bottles were first cleaned with dilute HNO3 and rinsed with distilled water. Physicochemical parameters such as pH, Turb, TDS and Cond were measured in the field using the HQ40d18 Series Portable Meter, while arsenic was determined using inductively coupled plasma mass spectrometry (ICP-MS). Here, H2 was used in the collision cell at a flow rate of 80 mL/min to maximise sensitivity while minimising potential polyatomic interferences with the target element (arsenic). The flow was optimised by monitoring the counts of the target element in a 2.0 M nitric acid eluate. Finally, determinations were made using aqueous standards in 2.0 M nitric acid with metal concentrations ranging from 0 to 0.25 mg/L. Triplicate analyses were conducted to ensure the accuracy of the arsenic measurements, and the mean of each measurement was recorded as the representative concentration. The dataset obtained consists of 324 (54%) low, 100 (17%) medium, and 173 (29%) high arsenic concentrations.

Model Development
A total of 417 samples (70%) were used to train the ML models, with the remaining 180 (30%) used to assess (test) the models' performance. The dataset was split using stratified sampling to ensure a uniform proportion of the target classes in the training and testing datasets. The input variables pH, Turb, TDS, and Cond were used to classify the low, medium, and high arsenic concentrations. Figure 2 presents the model development workflow. It is important to note that the hyperparameter search space was limited to only two hyperparameters to avoid model complexity while also ensuring fast computational time. Table 1 shows the libraries and optimal parameters used to build the various models. Using standard performance metrics, the overall predictive efficiency of the developed models was evaluated and compared. Finally, the relationship between the predictor variables and arsenic mobility was explained using the best-performing boosting model and SHAP. The theoretical concepts of SHAP and the boosting algorithms used in the study are presented briefly in the following subsections. Table 2 summarises the main advantages and limitations of the boosting algorithms investigated. The RF model is not discussed because it has been extensively treated in the literature [9,[38][39][40].
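The stratified 70/30 split described above can be sketched as follows. This is an illustrative pure-Python version using the class sizes reported for this dataset; in practice, a library routine such as scikit-learn's train_test_split with its stratify option would typically be used, and the exact per-class counts depend on rounding.

```python
import random
from collections import defaultdict

def stratified_split(n_samples, labels, train_frac=0.70, seed=42):
    """Split sample indices so each class keeps the same proportion in train and test."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)                      # randomise within each class
        cut = round(len(idxs) * train_frac)    # 70% of this class to training
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return train, test

# Class sizes from the study: 324 low, 100 medium, 173 high (597 total).
y = ["low"] * 324 + ["medium"] * 100 + ["high"] * 173
train_idx, test_idx = stratified_split(len(y), y)
```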

LGB. Advantages: (i) fast learning [41]; (ii) lower memory consumption [41]. Limitations: (i) can lose predictive performance due to gradient-based one-side sampling (GOSS) approximations [20]; (ii) sensitive to noisy data [34].
XGB. Advantages: (i) higher execution speed [41]; (ii) less prone to overfitting; (iii) supports parallelisation; (iv) scalable. Limitations: (i) performs sub-optimally on sparse and unstructured data.
CATB. Advantages: (i) less prone to overfitting [17,42]. Limitations: (i) the setting of different random seeds has a certain impact on the model prediction results [34].
ADAB. Advantages: (i) easier implementation [41]; (ii) less prone to overfitting [34]; (iii) simpler feature selection [41]. Limitations: (i) sensitive to outliers and noisy data [41].
GBM. Advantages: (i) insensitive to missing data [14]; (ii) reduced bias [14]; (iii) reduced overfitting [14]. Limitations: (i) computationally expensive.
NGB. Advantages: (i) flexible and scalable [37]; (ii) performs probabilistic prediction [37]; (iii) efficient for joint prediction [37]; (iv) modular with respect to base learners. Limitations: (i) limited for some skewed probability distributions [43].

Categorical Boosting
CATB, introduced by Dorogush et al.
[42], is a tree-based gradient boosting algorithm that is capable of handling categorical data in both regression and classification problems. The algorithm introduces two advances to the boosting implementation in order to combat the prediction shift caused by target leakage, which is found in all previously existing gradient boosting algorithms. The first is ordered boosting, a permutation-driven alternative to the conventional boosting method, while the second is a novel technique for dealing with categorical data [35]. Unlike other gradient boosting techniques, which require converting categorical variables before implementation, CATB uses oblivious decision trees (OBT) as base predictors when building a tree [44]. The OBT is very balanced, less prone to overfitting, and speeds up the implementation of CATB. In dealing with categorical variables, CATB employs a more efficient technique that reduces overfitting and allows all of the data to be used in training: it randomly permutes the dataset and computes an average label value for each sample, using only the samples with the same category value placed before it in the permutation. Further theoretical intuition is presented by Dorogush et al. [42].
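The ordered, permutation-driven treatment of categorical variables can be illustrated with a small sketch: each sample's category is encoded from the mean target of only the samples that precede it in a random permutation, which is what prevents target leakage. This is a simplified version of the ordered target statistics idea; the smoothed prior term is an assumption about the exact form used.

```python
import random

def ordered_target_encoding(categories, targets, prior=0.5, seed=0):
    """Encode each sample's category from the targets of *earlier* samples only.

    Processing samples in a random permutation and conditioning only on the
    past avoids the prediction shift (target leakage) described above.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    running_sum = {}   # per-category sum of targets seen so far
    running_cnt = {}   # per-category count of samples seen so far
    encoded = [0.0] * len(categories)
    for pos in order:
        cat = categories[pos]
        s = running_sum.get(cat, 0.0)
        n = running_cnt.get(cat, 0)
        encoded[pos] = (s + prior) / (n + 1)   # smoothed mean of past targets
        running_sum[cat] = s + targets[pos]
        running_cnt[cat] = n + 1
    return encoded

# Hypothetical source-type feature ("gw" groundwater, "sw" surface water).
enc = ordered_target_encoding(["gw", "sw", "gw", "sw", "gw"], [1, 0, 1, 1, 0])
```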

Natural Gradient Boosting
NGB, introduced by Duan et al. [37], is a key innovation in the gradient boosting family, which enables predictive uncertainty estimation with gradient boosting by employing a probabilistic forecast.The NGB algorithm has the advantage of being simpler as it requires relatively less expertise to implement.The major contribution to the boosting family is that it uses multiparameter boosting and natural gradients to integrate any choice of base learner (e.g., regression tree), parametric distribution (normal), and scoring rule (MLE), which are all chosen during configuration [37].In this study, the base learner used was a decision tree with a maximum depth of 3. The categorical distribution was used with the default Friedman MSE error as a criterion.
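The multiparameter boosting idea can be sketched for a Normal distribution scored by the negative log-likelihood: each distribution parameter receives its own gradient, and a separate base learner would be fitted to each. This is a conceptual sketch only; NGB additionally rescales these gradients into natural gradients using the Fisher information, which is omitted here.

```python
import math

def nll_gradients(y, mu, log_sigma):
    """Per-parameter gradients of the Normal negative log-likelihood.

    NLL = 0.5 * ((y - mu) / sigma)**2 + log_sigma + 0.5 * log(2 * pi)
    d(NLL)/d(mu)        = -(y - mu) / sigma**2
    d(NLL)/d(log_sigma) = 1 - ((y - mu) / sigma)**2
    """
    sigma = math.exp(log_sigma)
    z = (y - mu) / sigma
    return (-z / sigma, 1.0 - z * z)

# One gradient step on both parameters for a single observation.
y_obs, mu, log_sigma, lr = 3.0, 0.0, 0.0, 0.1
g_mu, g_ls = nll_gradients(y_obs, mu, log_sigma)
mu -= lr * g_mu          # mean moves toward the observation
log_sigma -= lr * g_ls   # spread grows when the residual is large
```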

Adaptive Boosting
ADAB [45] creates an ensemble by focusing on previously misclassified cases. Like all ensembles, it generates a set of classifiers and then votes on them to classify test examples. However, here, the various classifiers are built in a sequence by focusing the underlying learning algorithm on those training examples that were misclassified. The algorithm's efficiency depends on building a diverse, yet accurate, collection of classifiers [46]. The key idea behind ADAB is to use weighted versions of the same training samples, rather than random subsamples [12].
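The reweighting step at the heart of ADAB can be sketched as follows: after each round, misclassified samples are up-weighted so the next weak learner concentrates on them. The sketch uses the discrete AdaBoost update for a binary problem; multiclass variants such as SAMME differ only in the alpha term.

```python
import math

def adaboost_reweight(weights, correct, eps=1e-10):
    """One AdaBoost round: compute the learner weight alpha and update sample weights."""
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    err = min(max(err, eps), 1 - eps)          # guard against degenerate errors
    alpha = 0.5 * math.log((1 - err) / err)    # vote weight of this weak learner
    new_w = [w * math.exp(-alpha if ok else alpha)
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]   # renormalise to sum to 1

# Four equally weighted samples; the last one was misclassified this round.
w0 = [0.25, 0.25, 0.25, 0.25]
alpha, w1 = adaboost_reweight(w0, correct=[True, True, True, False])
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is driven to classify it correctly.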

Light Gradient Boosting
LGB [34] is a gradient boosting algorithm based on decision trees and the idea of combining weak learners into powerful learners. It uses histogram-based algorithms [15,47] that discretise continuous feature values into p bins and construct a histogram of width p, thereby resulting in enhanced speed and lower memory usage. LGB grows trees with a leaf-wise technique [15,34,48], which improves efficiency by selecting the leaf with the highest branching gain but also renders the algorithm prone to overfitting. The maximum depth parameter, on the other hand, is used to restrict the tree depth in order to prevent overfitting while guaranteeing efficiency. Furthermore, LGB employs gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) for faster training [11].
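The histogram idea can be sketched simply: continuous feature values are discretised into a fixed number of equal-width bins, and split search then runs over bin boundaries rather than raw values. This is a simplified equal-width version for illustration; LGB's actual histograms also accumulate gradient statistics per bin.

```python
def histogram_bins(values, n_bins=8):
    """Discretise continuous values into n_bins equal-width bins and count them."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant features
    counts = [0] * n_bins
    bin_index = []
    for v in values:
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        counts[b] += 1
        bin_index.append(b)
    return counts, bin_index

# e.g., binning a handful of pH readings before a histogram-based split search
counts, idx = histogram_bins([3.9, 5.7, 6.3, 6.4, 7.3, 8.51], n_bins=4)
```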

Extreme Gradient Boosting
XGB is an efficient technique created by Chen et al. [49] that is built on the gradient boosting framework and can handle both regression and classification problems [50,51]. In an iterative procedure, the algorithm learns the functional connection between the input and target features by training individual trees successively on the residuals from the preceding trees [6]. In this way, it iteratively merges weak base learners into a stronger learner in order to maximise the objective function.

Gradient Boosting Machine
The GBM technique by Friedman [50] iteratively generates new base learners, each trained to correct the errors of the current ensemble. Unlike ADAB, which reweights misclassified data, GBM derives its targets from the negative partial derivatives of the loss function at each training observation. These partial derivatives are also known as pseudo-residuals, and they are used to iteratively expand the ensemble. As a result, the feature space is partitioned, with similar pseudo-residuals grouped together [52].
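For squared-error loss the pseudo-residuals reduce to the ordinary residuals, which makes the idea easy to sketch: each boosting round fits the negative gradient of the loss at the current prediction. The toy round below uses the mean as the initial prediction and a perfect fit to the residuals; a real GBM would fit a regression tree to them.

```python
def pseudo_residuals(y_true, y_pred):
    """Negative gradient of 0.5 * (y - F)**2 with respect to F: simply y - F."""
    return [yt - yp for yt, yp in zip(y_true, y_pred)]

y = [2.0, 4.0, 6.0]
F0 = [sum(y) / len(y)] * len(y)               # initial constant prediction (the mean)
r = pseudo_residuals(y, F0)                   # targets for the next base learner
F1 = [f + 0.5 * ri for f, ri in zip(F0, r)]   # shrunken update (learning rate 0.5)

# The squared error shrinks after one boosting round.
sse0 = sum((yt - yp) ** 2 for yt, yp in zip(y, F0))
sse1 = sum((yt - yp) ** 2 for yt, yp in zip(y, F1))
```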

SHapley Additive exPlanation
SHAP, introduced by Lundberg and Lee [53], is a game theory-based model explainability method for explaining predictions from an ML model. The main goal of SHAP is to interpret individual predictions of an instance (e.g., high arsenic) by computing the contribution of each predictor variable (e.g., pH, Turb, Cond, and TDS) using coalitional game theory. Here, each predictor variable acts as a player in a coalition. The contribution of each input variable, also known as the payoff, is the increase in the probability of a particular class occurring when conditioning on that feature. The outcome of the model is explained using the concept of additive feature attribution. SHAP specifies the explanation in Equation (1) as:

g(x′) = φ0 + ∑(i = 1 to M) φi x′i, (1)

where g represents the explanation model, x′ is a coalition vector whose ith entry indicates whether the ith predictor is present (x′i = 1) or absent (x′i = 0), φ0 is the base value when all inputs are unavailable, and M is the maximum coalition size. φi ∈ R is the Shapley value, which represents the attribution of feature i. Shapley values, φi, have properties that make them suitable for evaluating feature importance [20,54]:
Dummy: If a feature i does not contribute any marginal value, φi = 0.
Additivity: If a model S is an ensemble of m submodels, the contributions of a feature i in the submodels add up: φi(S) = ∑(k = 1 to m) φi(k).
Efficiency: All Shapley values must add up to the difference between the prediction and the expected value.
Substitutability (symmetry): If two given features i and j contribute equally to all their possible subsets, then their Shapley values are equal: φi = φj.
Since SHAP computes Shapley values, it is the only additive feature attribution method that satisfies all the properties of efficiency, symmetry, dummy, and additivity [54]. Interestingly, SHAP is justified to provide a unique solution with three vital properties [20,53,54]:
Local accuracy: Equivalent to the Shapley efficiency property.
Consistency: Follows from the additivity, substitutability, and dummy properties of the Shapley values.
Missingness: Missing features get a Shapley value of zero. In theory, a missing feature can have an arbitrary Shapley value without affecting local accuracy; in practice, this matters only when features are constant.
The precise determination of Shapley values is difficult since it requires an exponential computation over every feasible subset of variables. Hence, Lundberg et al. [55] introduced TreeSHAP to efficiently compute SHAP values for tree-based ML models such as XGB, LGB, and RF. Therefore, this study employed the TreeSHAP variant with XGB for model explainability.
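For a handful of features, the Shapley values in Equation (1) can be computed exactly by enumerating all coalitions, which also makes the efficiency property easy to check. The value function below is a hypothetical toy model output over two features; TreeSHAP exists precisely to avoid this exponential enumeration for tree ensembles.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: weighted marginal contributions over all coalitions."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(frozenset(subset) | {f})
                                   - value(frozenset(subset)))
        phi[f] = total
    return phi

# Toy value function: model output for each coalition of present features.
v = {frozenset(): 0.0, frozenset({"pH"}): 4.0,
     frozenset({"TDS"}): 2.0, frozenset({"pH", "TDS"}): 10.0}
phi = shapley_values(["pH", "TDS"], lambda s: v[frozenset(s)])

# Efficiency: attributions sum to the full prediction minus the base value.
assert abs(sum(phi.values()) - (v[frozenset({"pH", "TDS"})] - v[frozenset()])) < 1e-9
```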

Statistical Evaluation of Model Performance
The selection of suitable metrics for discriminating the optimal solution is an important step towards obtaining an optimised classifier [56]. In this study, accuracy (Acc), kappa, precision, F1, sensitivity, the area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC) were used to measure the effectiveness of the classifiers on unseen (testing) data. These metrics allow a clear and intuitive interpretation of the performance of the classifiers across all classes [57]. The metrics and their mathematical representations (Equations (2)-(8)) are presented in Table 3.
Table 3. Evaluation metrics with formula and description.

Acc = c/s (2). Measures the ratio of correct predictions over the total number of instances evaluated.
Kappa = (c·s − ∑k pk·tk)/(s² − ∑k pk·tk) (3). The kappa coefficient demonstrates the agreement between the observed classes and the predicted classes [58].
AUC = ∫ Rtp dRfp (4). The AUC value indicates how well the probabilities of the positive class are separated from the negative class.
MCC = (c·s − ∑k pk·tk)/√[(s² − ∑k pk²)(s² − ∑k tk²)] (5). MCC is generally known to be a balanced metric for evaluating classification performance on data with varying class sizes [59]. It is a good indicator for unbalanced prediction models [60].
Sensitivity = TP/(TP + FN) (6). The percentage of the relevant data sets that were correctly identified.
Precision = TP/(TP + FP) (7). Precision represents the proportion of predicted positive cases that are correctly real positives [56].
F1 = 2TP/(2TP + FP + FN) (8). This measures the harmonic mean between the recall and precision values [56].
Note(s): TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively. The true positive rate, Rtp, is a function of the false positive rate, Rfp, along the receiver operating characteristic curve. For kappa and MCC, c represents the total number of elements correctly predicted, s is the total number of elements, pk is the number of times class k was predicted, and tk represents the number of times class k truly occurs.
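The multiclass kappa and MCC formulas in Table 3 can be computed directly from a confusion matrix using the quantities defined in the note (c, s, pk, tk). A small sketch with an illustrative (hypothetical) confusion matrix:

```python
import math

def overall_metrics(cm):
    """Acc, kappa and multiclass MCC from a confusion matrix cm[true][pred]."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                          # total samples
    c = sum(cm[k][k] for k in range(n))                      # correctly predicted
    t = [sum(cm[k]) for k in range(n)]                       # times class k truly occurs
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # times class k predicted
    acc = c / s
    pe = sum(pk * tk for pk, tk in zip(p, t))
    kappa = (c * s - pe) / (s * s - pe)
    mcc = (c * s - pe) / math.sqrt((s * s - sum(pk * pk for pk in p)) *
                                   (s * s - sum(tk * tk for tk in t)))
    return acc, kappa, mcc

# Rows: true low/medium/high; columns: predicted low/medium/high (toy counts).
cm = [[50, 3, 2],
      [4, 10, 2],
      [1, 2, 26]]
acc, kappa, mcc = overall_metrics(cm)
```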

Hydrogeochemistry of Input Parameters and Arsenic Pollution
Table 4 shows the summary data for the major hydrochemical parameters. The pH of the samples obtained from surface water sources varied in the range of 3.90 to 8.51 with a mean of 6.35, whereas samples obtained from groundwater sources varied in a pH range of 4.23 to 7.30 with a mean of 5.73. This means that the pH of the samples ranges from very acidic to mildly alkaline. The acidic pH in some of the samples, notably in the surface water samples, may be due to the presence of sulfur-bearing minerals in the aquifer system, which encouraged the accumulation of acidity from rainwater and other sources, lowering the pH [30,61,62]. Furthermore, the sample locations contain carbonate rocks (argillitic and volcaniclastic deposits), which can cause carbonate minerals to dissolve and mix with surface and groundwater, increasing the pH. The Cond values of the surface water samples are in the range of 206 to 2040 µS/cm with a mean value of 183.50 µS/cm, whereas the groundwater samples are in the range of 83 to 1070 µS/cm with a mean value of 245.91 µS/cm. When compared with the guideline value (2500 µS/cm) of Cond in drinking water [3], the Cond values of all the samples in both surface water and groundwater are below the guideline value (Table 4). The higher Cond values are attributable to inputs from anthropogenic activities in the region, such as aquaculture and indiscriminate garbage dumping [30].
The TDS values of the surface water samples vary from 8440 to 2,390,000 µg/L, with a mean value of 104,169 µg/L, whereas the groundwater samples vary from 48,300 to 934,000 µg/L, with a mean value of 150,003 µg/L (Table 4). It is interesting to note that the majority of the surface water samples have TDS concentrations above the WHO guideline value of 1,000,000 µg/L. The increased TDS concentrations in the surface water samples indicate that pesticide and herbicide runoff from agricultural operations is a serious issue in the area. Additionally, leachate from adverse mining and mineral processing activities in the area could contribute to the elevated TDS concentrations.
The surface water samples' turbidity readings vary from 0.60 to 292,600 NTU, with a mean value of 1312.72 NTU, whereas the groundwater samples are in the range of 0.20 to 142 NTU, with a mean value of 18.17 NTU. The turbidity values of the majority of the samples in both surface water and groundwater exceed the WHO guideline value of 5 NTU (Table 4). Surface water samples had higher turbidity values than groundwater samples, presumably due to severe rainfall or disturbances to land near raw water sources caused by undesirable farming and mining operations.
In recent years, the adverse mining activities in the area have resulted in elevated arsenic concentrations in water sources [63], leaving a majority of the population highly exposed to arsenic contamination [64]. The arsenic concentrations of the surface water samples are in the range of 2.0-620 µg/L with a mean value of 28.51 µg/L, whereas the groundwater samples are in the range of 2.0-88.29 µg/L with a mean value of 4.23 µg/L. When compared with the guideline value (10 µg/L) of arsenic in drinking water [3], the arsenic concentrations of a majority of the samples in both surface water and groundwater exceed the guideline value (Table 4). Surface water samples contained significant levels of arsenic, which might be attributed to the extensive surface mining of gold-bearing rocks containing sulphide minerals such as pyrite and arsenopyrite. Elevated arsenic content in groundwater samples can also be related to sulphidic aquifer oxidation. Figure 3 depicts the low, medium and high arsenic concentrations in surface water and groundwater.
Water 2022, 14, 3509

Overall Model Performance
The models are analysed and compared in this section based on how well they maximised the performance metrics. The overall performance on the testing dataset was assessed and compared using the kappa, Acc, MCC, and AUC measures. Table 5 displays the overall performance measures of the developed models, and Figure 4 depicts a plot of this overall performance. As shown in Figure 4, XGB is the most efficient (with an Acc of 0.86) in classifying the various water classes, followed closely by LGB (0.83), NGB (0.82), GBM (0.82), CATB (0.81), and ADAB (0.76). In terms of the AUC score, all the models except ADAB (AUC of 0.83) obtained more than 0.9 (Figure 4). In interpreting the strength of agreement, kappa values of 0.01-0.20 are considered minor, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 virtually perfect [65]. The kappa results presented indicate that LGB, XGB, CATB, NGB, and GBM reached substantial agreement with kappa values of 0.71, 0.75, 0.67, 0.68, and 0.69, respectively, whereas ADAB achieved only moderate agreement with a score of 0.58 (Figure 4).
MCC has a range of [−1, 1], with values close to 1 indicating a very good correlation between the predicted and observed classes and values close to 0 indicating poor correlation. XGB produced the highest correlation of 0.75, followed, sequentially, by LGB, GBM, NGB, CATB, and ADAB with correlation values of 0.72, 0.69, 0.69, 0.68, and 0.58, respectively (Figure 4).
In terms of overall evaluation (Acc, kappa, MCC, and AUC), XGB outperformed the other models by receiving the highest MCC, Acc, and kappa scores, followed by LGB.ADAB had the worst overall performance across all metrics.In comparison to the standard RF model, all boosting models developed in this study performed well in terms of arsenic classification in surface water and groundwater.In most cases, XGB performed admirably; however, these findings should be further investigated in future studies using a large number of environmental datasets representing a wide range of environmental settings and compartments in order to draw broad conclusions.
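The four overall metrics above can be reproduced with scikit-learn. The sketch below uses invented labels and class probabilities for a toy three-class (low/medium/high) problem, not the study's data, assuming scikit-learn is available:

```python
# Toy illustration of the overall metrics (Acc, kappa, MCC, AUC).
# Labels 0/1/2 stand in for low/medium/high arsenic classes.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 2, 2, 0, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 0, 2])
# Per-class probability scores (rows sum to 1), needed for multi-class AUC.
y_prob = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.6, 0.2],
                   [0.1, 0.3, 0.6], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7],
                   [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])

acc = accuracy_score(y_true, y_pred)            # fraction correct
kappa = cohen_kappa_score(y_true, y_pred)       # agreement beyond chance
mcc = matthews_corrcoef(y_true, y_pred)         # multi-class correlation
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest AUC
```

Kappa and MCC both correct for chance agreement, which is why they can diverge from Acc on imbalanced classes such as the medium arsenic group here.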

Single-Class Model Performance
Single-class metrics are less sensitive to class imbalance, making them ideal for evaluating classifiers in skewed data domains [57]. Precision, sensitivity, and F1 measures were used to assess comparative single-class performance. Table 6 displays the single-class performance results. The identification of water samples with low arsenic concentrations is critical in quantifying safe drinking water sources for human consumption [8]. Figure 5a shows that all of the models performed admirably in identifying waters with low arsenic concentrations. In terms of precision, more than 0.77 of the predicted low arsenic concentrations were found to be truly low. The precision score for XGB was 0.87, followed by GBM (0.86), LGB (0.84), CATB (0.84), NGB (0.80), and ADAB (0.78). This is critical for arsenic modelling because it lowers the number of false positives (predicting a high or medium arsenic concentration as low). This means that the models are safe to use because they rarely misclassify high or medium arsenic water sources as low. Again, all of the developed models had a sensitivity greater than 0.85. The highest sensitivity score was 0.93, achieved by XGB, followed by NGB (0.91), LGB (0.89), CATB (0.89), ADAB (0.87), and GBM (0.86). All of the models have a high F1 score (>0.81), which represents a balance between sensitivity and precision. XGB had the highest predictive efficiency (F1 score of 0.90), while ADAB had the lowest (F1 score of 0.82). Overall, XGB demonstrated the greatest predictive efficiency for low arsenic concentrations, with the highest precision, sensitivity, and F1 score.
The boosting models performed poorly in estimating the medium arsenic concentration, as shown in Figure 5b. XGB had the highest precision (0.85), while NGB had the highest sensitivity (0.72) and F1 score (0.72). The poor performance on the medium class can be attributed to its small sample size (17% of the total data used).
Typically, the primary goal of arsenic modelling is to identify and accurately classify high arsenic areas in order to reduce arsenic contamination and pollution [13]. Thus, sensitivity is critical because it serves as a protective buffer for the population. High sensitivity indicates few false negatives (predicting a low arsenic concentration in a water sample when it is actually high) and vice versa. From Figure 5c, it can be seen that all the models except ADAB and NGB could correctly predict the high arsenic class in the testing dataset with a sensitivity score greater than 0.802. Again, all the models could predict the high arsenic concentration with very good precision (>0.83). In terms of the F1 score, which represents the balance between precision and sensitivity, LGB is the most efficient in predicting the high arsenic waters.
In terms of single-class assessment (precision, sensitivity, and F1), all the models achieved very good performance in classifying the high and low arsenic waters but relatively poor performance in classifying the medium arsenic waters. XGB achieved the highest sensitivity for the high and low arsenic waters, the highest precision for the low and medium arsenic waters, and the highest F1 score for the low arsenic class. NGB obtained the highest sensitivity and F1 score for the medium class. Previous comparative studies [41] have similarly justified XGB's superiority over other boosting variants. Generally, ADAB performed the worst in terms of precision, sensitivity, and F1 score across all classes, which is consistent with previous comparisons [14,41].
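The per-class metrics discussed above can be obtained in one call with scikit-learn by disabling averaging. The labels below are invented for illustration:

```python
# Single-class precision, sensitivity (recall), and F1 per arsenic class.
# 0=low, 1=medium, 2=high; labels are toy values, not the study's data.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 0])

# average=None returns one score per class rather than a pooled value,
# which is why these metrics are less sensitive to class imbalance.
prec, sens, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)
```

Here `support` exposes the class sizes, which makes imbalances like the small medium-arsenic class (17% of the data) directly visible next to the scores.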

Relative Importance of Predictor Variables
The relative importance of the predictor variables can be used to identify the primary input factors influencing the predictions [13]. The variable importance plots in Figure 6 show a similar trend, albeit with slight variation. Overall, pH and Cond had the greatest impact on arsenic distribution in water sources, while TDS and Turb had a moderate influence. These findings are consistent with the domain knowledge that arsenic mobility in water sources is often controlled by pH and Cond [66].
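Variable importance of this performance-based kind can be sketched with scikit-learn's permutation importance: a feature's importance is the drop in model score when its values are shuffled. Synthetic data stands in for the pH/Turb/TDS/Cond measurements, and the random forest here is only a stand-in model:

```python
# Minimal permutation-importance sketch on synthetic 4-feature data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each feature is shuffled n_repeats times; importance is the mean
# decrease in the model's score caused by the shuffling.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

A plot of `result.importances_mean` in `ranking` order corresponds to the style of bar chart shown in Figure 6.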


SHAP Global Interpretation
The global importance of the individual predictor variables can be explored using the SHAP feature importance plot in Figure 7. Unlike the variable importance plots in Figure 6, which are based on the decrease in model performance, SHAP feature importance is based on the magnitude of feature attributions. Here, the contribution of the predictor variables to the individual classes can be verified (Figure 7). Such insight has not yet been explored in previous studies; studies such as Lombard et al. [8] mostly adopt variable importance plots, which only account for the influence on overall classification and not on individual classes.

From Figure 7, it can be seen that the higher the mean SHAP value, the more important the predictor variable. It is easy to see that Cond is the most influential variable overall, followed by pH, TDS, and Turb. In predicting the high and low arsenic classes, pH is the most important variable, followed by Cond. With regard to the medium class, Cond is the most important variable, followed by TDS. The plot is very insightful as it establishes how variable importance varies according to the concentration of arsenic (high, medium, or low) in water sources.
The directional marginal contribution of each predictor variable to the various classes is presented in Figure 8. Here, each point on the plot corresponds to a row in the dataset. The gradient colour of each point represents the magnitude of the input variable, i.e., red or blue points represent higher or lower values of the inputs, respectively. The y-axis lists the variable names, ranked from top to bottom in order of importance, and the x-axis depicts the SHAP value. The presence of coloured points on both sides for all features indicates how much a feature impacts the model negatively (left) or positively (right). Overlapping points are jittered in the y-axis direction to show the distribution of the SHAP values per feature.
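The Figure 7-style bars are derived from such attributions by taking the mean absolute SHAP value per feature for each class. The aggregation itself needs only NumPy; random arrays stand in here for real per-class SHAP matrices (samples × features) of the kind a tree explainer would return for a multi-class model:

```python
# Aggregating per-sample SHAP attributions into global importance bars.
# Random numbers stand in for real SHAP values; shapes mirror the study's
# four predictors (pH, Turb, TDS, Cond) and three arsenic classes.
import numpy as np

rng = np.random.default_rng(42)
features = ["pH", "Turb", "TDS", "Cond"]
shap_per_class = {c: rng.normal(size=(200, 4)) for c in ["low", "medium", "high"]}

def global_importance(shap_values):
    """Mean |SHAP| per feature: the bar lengths in a SHAP importance plot."""
    return np.abs(shap_values).mean(axis=0)

for cls, sv in shap_per_class.items():
    imp = global_importance(sv)
    order = np.argsort(imp)[::-1]   # rank features, most influential first
```

Because the aggregation is done per class, the ranking can differ between the low, medium, and high classes, which is exactly the per-class insight Figure 7 provides.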
From Figure 8a, it is evident that water sources have a high probability of containing a low arsenic concentration when pH values are low (i.e., high values correlate negatively and vice versa). Low Cond values generally increase the chance of having water with a low arsenic concentration. Turb is the least important variable in determining low arsenic water: although its low values have an undefined relationship with arsenic mobility, high Turb values generally indicate low arsenic water. On the other hand, high values of pH and TDS reduce the chances of water being low in arsenic. Overall, low pH and low Cond mostly prevent arsenic mobility in water, whereas the converse, together with high TDS, encourages arsenic mobility.
According to Figure 8b, the most important determinant of medium arsenic waters is Cond, followed by TDS, Turb, and pH. It is observed that a source of water with low Cond is likely to contain a medium arsenic concentration. Similarly, low TDS values mostly indicate medium arsenic waters. The spread of the low values on opposite sides of the plot is indicative of how difficult it was for the model to learn the classification rules between the input and target variables. On the other hand, high Cond, high TDS, high Turb, and high pH of a water source make the occurrence of medium arsenic water unlikely.
Figure 8c shows that high pH water is highly correlated with high arsenic water. Similarly, high Cond is a potential indicator of a high arsenic water source. Additionally, high TDS is mostly associated with high pH. On the other hand, low pH, Cond, Turb, and TDS generally indicate the absence of a high arsenic concentration. Figures 9-11 show the relationship between SHAP values and changes in the individual predictor variables for low, medium, and high arsenic waters, respectively. The plots depict the variation of SHAP values in relation to the input variables. The colour coding on the right corresponds to the interaction term values. Where vertical dispersion of points is observed, interaction among predictor variables can be identified.
Figure 9a-c show that there is no clear relationship between SHAP values and TDS, Turb, or Cond, respectively. However, beyond a TDS of 2500, the probability of low arsenic water occurring increases with increasing TDS, as shown in Figure 9b. Waters with a high Cond (>250) and a low pH are also likely to be low in arsenic, as shown in Figure 9c. The inverse relationship between pH and SHAP values is depicted in Figure 9d. It is important to note that, here, high SHAP values indicate a high likelihood of low arsenic water occurrence. As can be seen, increasing pH reduces the likelihood of low arsenic water occurrence and vice versa. Interaction with TDS is also observed: according to the interaction effect, water with a high pH and medium TDS is less likely to be low in arsenic.
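The information behind a dependence plot of this kind is simply each sample's raw feature value paired with its SHAP value; vertical scatter at a fixed x hints at interactions with a second feature. The sketch below uses synthetic pH values and a made-up monotone attribution, not the study's data:

```python
# What a SHAP dependence plot (Figures 9-11 style) is built from:
# (feature value, SHAP value) pairs per sample. Both arrays are toy data.
import numpy as np

rng = np.random.default_rng(0)
ph = rng.uniform(4.0, 9.0, size=100)                     # raw pH values
shap_ph = 0.3 * (ph - 6.5) + rng.normal(0, 0.05, 100)    # toy attributions

# Sorting by the feature value recovers the trend seen in the plot;
# residual spread at a given pH would reflect interaction effects.
order = np.argsort(ph)
trend_x, trend_y = ph[order], shap_ph[order]
```

With the positive slope assumed here, the trend mirrors the Figure 11d pattern (higher pH, higher SHAP value for the high arsenic class); a negative slope would mirror Figure 9d.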
In Figure 10a-d, it is very difficult to decipher a defined relationship between SHAP values and the various input parameters. This explains why the models performed relatively poorly in classifying medium arsenic water. Moreover, variable interaction appears to be relatively weak. However, beyond a Cond of 300, SHAP values can be observed to increase with increasing Cond in Figure 10c.
In Figure 11a,b, TDS and Turb have no clear relationship with the presence of high arsenic water. Waters with zero Turb and high pH, on the other hand, are highly likely to contain high arsenic, according to the interaction effect in Figure 11b. Figure 11c shows that high Cond values rule out the possibility of arsenic-rich water. Figure 11d demonstrates why pH is the most important variable in determining arsenic levels in water sources: there is a direct relationship between pH and SHAP values, indicating that the probability of the occurrence of high arsenic water increases with an increase in pH. This observation is consistent with past studies in which higher concentrations of arsenic were detected in groundwater of high pH [67,68] and vice versa [69].
The global interpretations made thus far are critical in deciphering why the sample was predicted as high rather than low or medium. In terms of the probability of the sample being low in arsenic (Figure 12a), Cond, Turb, and TDS all work in favour (positively). However, the pH value makes it highly unlikely that the sample is low in arsenic, driving the prediction in the opposite direction. Furthermore, the sample was not classified as medium because all of the input variables drove the prediction negatively, reducing the likelihood of medium arsenic water occurrence (Figure 12b). According to Figure 12c, the variables driving the prediction positively (to the right) are high pH and high Cond, which were previously established to have the greatest global influence on high arsenic. Their influence is stronger than that of Turb and TDS, which drive the prediction in the opposite direction. This explains why the sample was overall classified as high in arsenic.
In Figure 13, a water sample with pH = 6, Cond = 55, Turb = 28, and TDS = 33 was predicted as medium. All of the input features in Figure 13a, except TDS, drove the prediction negatively. It is clear that the pH and Cond values significantly reduced the likelihood of the sample being low in arsenic. The Cond and pH pushed the prediction to the right, increasing the likelihood of encountering medium arsenic water, as shown in Figure 13b. Low Cond values, as shown in Figure 8b, increase the likelihood of medium arsenic waters. Low Cond is found to be the major variable driving the prediction in favour of medium arsenic water, followed by pH. Concerning the possibility of the sample containing high arsenic (Figure 13c), it is easy to see that pH, TDS, and Turb greatly reduced such a possibility.

Contribution and Limitations
Arsenic in drinking water is becoming more widely recognised as a potential health risk for rural populations in developing countries such as Ghana, West Africa. Recognising the potential health risks, it has become critical to regularly monitor arsenic levels in drinking water supply systems such as surface water and groundwater. However, traditional testing and monitoring approaches are somewhat costly and time-consuming. This study evaluated, for the first time, the predictive efficiency of various boosting algorithms as ML techniques for classifying low, medium, and high arsenic concentrations in surface water and groundwater. A standard ML technique known as RF was used as a benchmark model. The findings suggest that the concentrations of arsenic in some samples obtained from the study area exceeded the WHO limit of 10 µg/L, with some samples showing a maximum concentration of about 620 µg/L. This is a major environmental concern because several potential sources of arsenic pollution in the area are increasing, including mining, fuel combustion, wood preservation, and the use of As-based pesticides in agriculture.
This study provides a significant contribution to the existing knowledge by developing ML algorithms that can be used as a cost-effective and quicker approach for monitoring and classifying low, medium, and high arsenic concentrations in various water supply systems. In terms of overall and single-class performance, all the developed boosting algorithms showed excellent performance in classifying arsenic concentrations. More importantly, the XGB model exhibited exceptional performance compared with the other boosting models and can be adopted in future studies for classifying and predicting arsenic concentrations in water supply systems.
The study also, for the first time, employed SHAP to identify important variables influencing arsenic mobilisation. The models' predictions were explained using SHAP to promote transparency in ML modelling, and interaction effects among the input variables were also assessed.
Despite the high predictive tendencies of the ML algorithms developed in this study, the models were built using predictor variables available at a regional scale or smaller geographic area (a specific region in Ghana, West Africa), and hence caution should be applied in directly transferring this knowledge. Furthermore, the dataset used in this study was obtained through the analysis of water samples collected between 2011 and 2018, so care should be taken when interpreting the results at present because of potential uncertainty due to temporal variation. Including other relevant predictors, such as redox conditions, could also lead to better performance and should be considered in future studies.

Conclusions
This study presented an empirical comparison of six representative categories of the most popular boosting algorithms, namely extreme gradient boosting (XGB), gradient boosting machine (GBM), light gradient boosting (LGB), natural gradient boosting (NGB), categorical boosting (CATB), and adaptive boosting (ADAB), for arsenic modelling. SHapley Additive exPlanation (SHAP) was also used to explain model decisions in order to decipher the complex underlying non-linear relationship between the influencing input variables (pH, Turb, TDS, and Cond) and arsenic mobility. The major findings are as follows:
• In terms of overall assessment metrics (Acc, MCC, kappa, and AUC), all the boosting models developed (XGB, NGB, LGB, ADAB, CATB, and GBM) proved efficient in the arsenic modelling task, with minimum AUC, MCC, kappa, and Acc scores of 0.83, 0.58, 0.58, and 0.76, respectively.

• The single-class assessment metrics (precision, sensitivity, and F1 score) indicate that the boosting models are more efficient at recognising high and low arsenic contaminated waters.
• Essentially, the XGB algorithm outperformed the remaining models in terms of overall and single-class assessment metrics, whereas ADAB obtained the lowest performance.
• High pH water was found to be highly correlated with high arsenic water, and vice versa. Water with high pH, Cond, and TDS increases the likelihood of encountering high arsenic water sources. Low pH, Cond, and TDS levels are all indicators of low arsenic water. Medium arsenic waters are mostly associated with low Cond and low TDS.
Overall, this study provides a comprehensive evaluation of boosting algorithms and explainable ML that may be useful for the future prediction, categorisation, and control of arsenic concentrations in various water supply systems. Although the models used in this study are reasonably predictive, the data used for validation and testing were limited to the study region and timeframe. As a result, future studies should validate these models using larger and more current datasets.

Figure 1. Location and geological map of the study area.

Figure 2. Model development and workflow of the study.

Figure 3. Arsenic concentrations in surface water and groundwater illustrated using a box and whisker plot.

Figure 4. Overall performance evaluation of developed models using AUC, MCC, Kappa, and Acc.

Figure 5. Single-class performance of developed models in classifying: (a) low arsenic concentrations; (b) medium arsenic concentrations; and (c) high arsenic concentrations.

Figure 7. SHAP feature importance plot for different arsenic concentrations.

4.6. SHAP Local Interpretation
Since the global interpretations are based on the training dataset, it is critical to understand how the model makes decisions when an unknown dataset (the testing dataset) is introduced. In most ML predictive tasks, the ultimate goal is generalisable predictive performance, i.e., performance on unknown datasets. In this regard, SHAP local interpretation was used to justify decision-making on the testing dataset. SHAP waterfall plots are used to explain how individual predictions were made and help to show how the input variables contribute to the model's prediction. In the waterfall plots, f(x) indicates the predicted water occurrence probability, and E[f(x)] indicates the expectation of water occurrence. The x-axis represents the range of responses, and the y-axis represents the variable name and the corresponding observed value. The bottom of the plot starts with the expected value of the model output, and each row then shows how the positive (red) or negative (blue) contributions of each feature move the value from the expected model output to the final prediction. For instance, in Figure 12, a water sample with pH = 7.3, Cond = 415, Turb = 335, and TDS = 207 is considered. The sample has a high arsenic concentration, which was correctly predicted by the XGB model. Figure 12a-c illustrate the probabilities of the sample occurring as low, medium, and high arsenic, respectively. The individual probabilities (f(x)) are indicated at the top of each plot.
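The additive logic behind these waterfall plots, in which the base value E[f(x)] plus the per-feature SHAP contributions equals the prediction f(x), can be illustrated with a linear model, for which SHAP values (under feature independence) have the closed form φ_j = w_j(x_j − E[x_j]). The sketch below uses hypothetical coefficients and background means, not the study's fitted XGB model; only the sample values (pH = 7.3, Cond = 415, Turb = 335, TDS = 207) are taken from the text:

```python
import numpy as np

# Hypothetical linear model f(x) = w . x + b over the four predictors.
features = ["pH", "Cond", "Turb", "TDS"]
w = np.array([0.30, 0.002, 0.001, 0.0015])    # assumed coefficients
b = -2.0                                      # assumed intercept

# Assumed background (training-set) means, and the sample to explain.
background_mean = np.array([6.5, 300.0, 200.0, 150.0])
x = np.array([7.3, 415.0, 335.0, 207.0])

# For a linear model with independent features, SHAP values are exact:
# phi_j = w_j * (x_j - E[x_j]); the base value is f at the background mean.
phi = w * (x - background_mean)
base_value = w @ background_mean + b
prediction = w @ x + b

for name, contrib in zip(features, phi):
    print(f"{name:>5}: {contrib:+.4f}")

# Additivity: base value plus contributions recovers the prediction --
# exactly the decomposition a SHAP waterfall plot visualises row by row.
assert np.isclose(base_value + phi.sum(), prediction)
```

For tree ensembles such as XGB, the same additive decomposition holds but the φ_j are computed by TreeSHAP rather than in closed form; the waterfall plot then stacks those contributions from E[f(x)] up to f(x).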

Figure 12. Local interpretations of a correct prediction considering the various possibilities of occurrence for different As concentrations: (a) low; (b) medium; and (c) high.

Figure 13. Local interpretations of a correct prediction considering the various possibilities of occurrence for different As concentrations: (a) low; (b) medium; and (c) high.

Table 1. Optimal hyperparameters for building various ML models.

Table 2. Summary of advantages and limitations of the boosting algorithms used in the study.

Table 4. Statistical summary of all the measured parameters and WHO guideline values.

Table 5. Overall performance of developed models on testing data.

Table 6. Single-class performance of developed models on testing data.