Groundwater Arsenic Distribution in India by Machine Learning Geospatial Modeling

Groundwater is a critical resource in India for the supply of drinking water and for irrigation. Its usage is limited not only by its quantity but also by its quality. Among the most important contaminants of groundwater in India is arsenic, which naturally accumulates in some aquifers. In this study we create a random forest model with over 145,000 arsenic concentration measurements and over two dozen predictor variables of surface environmental parameters to produce hazard and exposure maps of the areas and populations potentially exposed to high arsenic concentrations (>10 µg/L) in groundwater. Statistical relationships found between the predictor variables and arsenic measurements are broadly consistent with major geochemical processes known to mobilize arsenic in aquifers. In addition to known high arsenic areas, such as along the Ganges and Brahmaputra rivers, we have identified several other areas around the country that have hitherto not been identified as potential arsenic hotspots. Based on recent reported rates of household groundwater use for rural and urban areas, we estimate that between about 18 and 30 million people in India are currently at risk of high exposure to arsenic through their drinking water supply. The hazard models here can be used to inform prioritization of groundwater quality testing and environmental public health tracking programs.


Introduction
Around the world, but particularly so in India, there is an ever-increasing dependence on groundwater for drinking water supplies and irrigation [1,2]. This is related, in part, to population and economic growth as well as to climate change.
Groundwater is generally much less susceptible to biological and other sources of anthropogenic contamination than is surface water. Its longer residence time and exposure to varying geochemical environments in an aquifer, however, can subject groundwater to the accumulation of various chemical elements in sufficiently high concentrations to pose a health risk to those using it for drinking or cooking [3]. Examples of such naturally occurring (geogenic) contaminants include arsenic, fluoride, manganese, and uranium. Of these, arsenic is one of the most serious contaminants, both in terms of toxicity and ubiquity, which is due to its widespread presence in trace amounts in minerals found in all types of rocks and sediments [3,4]. Many of the large-scale occurrences of geogenic arsenic contamination of groundwater are found in Asia and are due to the release of arsenic found in recently deposited sediments that are exposed to geochemical conditions that are either predominantly reducing (e.g., Ganges-Brahmaputra and Mekong deltas [5]) or oxidizing (e.g., Indus plain [6]). The dominant mobilization mechanisms involve microbially mediated reductive dissolution of host Fe(III) oxyhydroxide minerals and/or reduction of arsenic [7] in reducing environments, while in general intra-aquifer concentrations may be strongly modified by pH- and competitive anion-dependent reversible sorption processes [8]. Other common but less widespread sources or mechanisms of arsenic release into aquifers include the oxidation of sulfide minerals and geothermal activity [9,10].
Long-term exposure to arsenic can lead to various skin diseases, cancers, and cardiovascular diseases [3]. The most common intake pathways include drinking arsenic-contaminated groundwater or consuming high inorganic arsenic crops, particularly rice and those grown in high arsenic soils and/or irrigated with arsenic-contaminated water [11]. Although at least some of the arsenic found in food is present in a less toxic organic form, the arsenic present in groundwater predominantly occurs as one of the more toxic inorganic species, that is, arsenate or arsenite [3]. For this reason, the World Health Organization (WHO) recommends keeping arsenic concentrations in drinking water as low as possible. Although it has a guideline concentration of 10 µg/L, this is only provisional due in part to the difficulties of removing arsenic from water [12]. Likewise, India has set the concentration of 10 µg/L as a requirement ("acceptable limit"), whereas 50 µg/L is kept as a permissible limit in the absence of alternate sources [13].
Aside from some of the well-known arsenic-contaminated areas of India, such as along the Ganges and Brahmaputra rivers in parts of Assam, Bihar, Uttar Pradesh, and West Bengal, groundwater arsenic is not comprehensively tested throughout the country. The complete picture of arsenic contamination in India may therefore not be fully understood. There may also exist in the country areas of arsenic groundwater contamination that have not yet been identified. In order to help determine where high concentrations of arsenic in groundwater may exist in India, we have employed machine learning with a comprehensive dataset of arsenic concentrations and various environmental parameters to produce a model of arsenic-contaminated groundwater for the whole of India. This should indicate where in the country high groundwater arsenic concentrations are likely to be found where no arsenic measurements currently exist. Such an approach has previously been carried out for the states of Gujarat [14] and Uttar Pradesh [15], as well as the entire world [16], but never solely for the whole of India or with a dataset of arsenic concentrations as large as the one used in this study. All other factors being equal, a modeling study conducted on a smaller area allows the model to focus on, and better characterize, the arsenic occurrences in that area and thereby produce a more accurate model.

Arsenic Concentration Measurements
A total of 145,099 geographically distinct arsenic concentration measurements in groundwater were assembled from a multitude of sources, including a systematic compilation of published sources (Table 1). Although the focus of this study is on India, data from adjoining countries, where available, were incorporated to help characterize the occurrence of arsenic in border areas. As such, data acquisition was concentrated on India, which contributed 91% of the data, whereas available datasets from the neighboring countries of Bangladesh (3%), Nepal (5%), and Pakistan (1%) were also included. Reported concentrations were mainly determined by ICP-MS or AA, although some field kit test measurements assured by cross-calibration with laboratory measurements were also included, particularly for areas with otherwise limited data. However, the model sensitivity to the precision of the arsenic measurements in the dataset is lessened by the fact that we convert them to binary format before modeling (see below). As many data are heavily concentrated in a few areas in West Bengal and to a lesser degree Bihar and the Terai (Nepal), an effort was made to reduce the disproportionately high frequency of data coming from these areas and thereby temper their influence on the model. To this end, wherever more than one original data point was located within a 1-km × 1-km pixel, the concentration measurements were averaged to generate a single data point corresponding to the 1-km × 1-km resolution of the predictor variables. This considerably reduced the size of the dataset to 23,799 data points (Figure 1), with the breakdown by country being: India (74%), Bangladesh (15%), Nepal (8%), and Pakistan (3%). The resulting cumulative distribution of arsenic concentrations is shown in Figure S1, with 42% of the concentrations exceeding 10 µg/L.
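The pixel-averaging step can be sketched as follows. This is only an illustration in Python (the study's own modeling was done in R), assuming a regular 30 arc-second grid anchored at the coordinate origin; the function name `average_to_pixels` and the sample coordinates are hypothetical.

```python
import math
from collections import defaultdict

CELL_DEG = 30 / 3600  # 30 arc-seconds in degrees (about 1 km at the equator)

def average_to_pixels(points, cell=CELL_DEG):
    """Average the concentrations of all points that share a grid cell.

    points: iterable of (lon, lat, arsenic_ug_per_l) tuples.
    Returns a dict mapping (col, row) cell indices to the mean concentration.
    """
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [running total, count]
    for lon, lat, conc in points:
        key = (math.floor(lon / cell), math.floor(lat / cell))
        sums[key][0] += conc
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

# Hypothetical measurements: the first two fall in the same pixel.
samples = [
    (88.3501, 22.5701, 40.0),
    (88.3502, 22.5702, 20.0),
    (85.1234, 25.6123, 5.0),
]
pixels = average_to_pixels(samples)
```

With these made-up inputs the two neighboring measurements collapse into one averaged data point, which is how dense sampling clusters lose some of their weight in the modeling table.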
As explained below, a binary target variable was modeled, for which the arsenic concentrations were first recoded to either 0 or 1 according to them being either less than or equal to 10 µg/L or greater than 10 µg/L.

Predictor Variables
Although the target of the modeling is the concentration of arsenic located at some depth from the surface, only parameters determined at the surface are available in a spatially continuous sense across all of India. This is due to the generally high cost and/or difficulty of obtaining relevant subsurface data (e.g., geophysical measurements, drill logs) across the entire country.
In total, 26 different spatially continuous parameters were used as predictor variables in modeling (Table 2). These variables were selected based on their known or perceived function as proxies for the accumulation of arsenic in groundwater [16]. Most of the variables are related to climate or the surface geology, which includes metamorphic and sedimentary rocks in the Himalayas in the north, volcanic and metamorphic units in the Deccan Plateau in the south, and extensive unconsolidated sediments along the Ganges and Brahmaputra rivers in between (Figure 1b). Also included are many soil parameters, which are influenced by both climate and geology, as well as land cover, topography, and water table depth. All but two of the variables are available as 1-km × 1-km rasters (30 arc-second resolution, about 1 km at the equator). The two exceptions are the categorical variables of land cover and lithology, both of which are provided as polygon files that were then converted to the same 30 arc-second resolution. In preparation for modeling, the geographical coordinates of the 23,799 arsenic data points were used to retrieve the corresponding values of these predictor variables, which were added to a modeling table.

Prediction Modeling
The arsenic concentration dataset and predictor variables described above were used to create a statistical prediction model of the occurrence of arsenic in groundwater exceeding the WHO and national India guideline concentration of arsenic in drinking water of 10 µg/L. A binary rather than a continuous response variable was chosen because the anticipated application of the resulting prediction model is to support progress toward fuller compliance with the 10 µg/L regulatory standard. Furthermore, as we were restricted by necessity to using surface parameters to predict arsenic concentrations at depth, modeling a binary target variable circumvents some of the associated uncertainty and as a consequence should improve the model's effectiveness.
Although other statistical learning approaches, such as logistic regression and support vector machines, were initially attempted, the random forest method [59] was ultimately adopted due to its superior prediction performance in initial tests and was implemented using the R programming language [60]. Random forests are ensembles of decision trees that are grown with elements of introduced randomness. The data used to grow an individual tree are randomly selected by sampling with replacement from the full (training) dataset, which results in about 2/3 of the data rows being utilized, some of them multiple times. Furthermore, the number of predictor variables made available at each branch is restricted and the variables are randomly chosen. Because not all variables are considered simultaneously, any possible effects of multicollinearity among predictors can generally be disregarded in a random forest [61].
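The "about 2/3" figure follows from drawing n rows with replacement: the expected share of distinct rows is 1 − (1 − 1/n)^n, which tends to 1 − 1/e ≈ 0.632 for large n. A short check in Python (used here purely for illustration, since the study's model was implemented in R); the helper names and the seed are arbitrary choices, not part of the original analysis.

```python
import math
import random

def expected_in_bag_fraction(n):
    """Expected share of distinct rows in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_in_bag_fraction(n, rng):
    """Draw n row indices with replacement and count the distinct ones."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

n_rows = 23_799            # number of averaged data points in the study
rng = random.Random(42)    # arbitrary seed for reproducibility
expected = expected_in_bag_fraction(n_rows)
simulated = simulated_in_bag_fraction(n_rows, rng)
limit = 1.0 - math.exp(-1.0)  # large-n limit, about 0.632
```

The rows left out of a given tree (roughly the remaining third) are that tree's out-of-bag data.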
The typical number of variables made available at each branch of the trees grown in a random forest is the square root of the number of predictors, which in this case would have been five (taking the square root of 26). This parameter was tuned in order to find the optimal value to use with our dataset by trying values between 1 and 26 (the total number of variables) and comparing the results. This showed that making 10 variables available at each branch produces the most accurate model as measured against the out-of-bag (OOB) data that are randomly sorted out of each tree grown.
For the actual model, the full dataset of 23,799 data points was randomly split into training (80%) and testing (20%) datasets. This was done by stratified sampling so as to maintain the same balance between low and high cases (0 or 1) of the binary target variable, for which 42% of values were greater than 10 µg/L. The training dataset was used to develop the model (encompassing 10,001 trees), which was then cross validated with the testing dataset to determine its accuracy in predicting low (≤10 µg/L) and high (>10 µg/L) arsenic concentrations on new data. The model was then applied to the 26 spatially continuous predictor variables to create a map of the probability of the occurrence of the concentration of arsenic in groundwater exceeding 10 µg/L for all of India.
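A stratified 80/20 split of the kind described above can be sketched in a few lines. This is an illustrative Python version under simple assumptions (splitting each class separately and rounding the cut point); the function name and the toy label vector are hypothetical and do not reproduce the authors' R code.

```python
import random

def stratified_split(labels, train_frac=0.8, seed=0):
    """Return (train_idx, test_idx) that preserve the class balance of labels."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # randomize within each class
        cut = round(train_frac * len(idx))    # class-wise 80% cut point
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the 42% high-arsenic share reported in the text.
labels = [1] * 420 + [0] * 580
train_idx, test_idx = stratified_split(labels)
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
```

Because each class is split on its own, both subsets keep the same proportion of high cases as the full dataset.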

Importance of Predictor Variables
The effect of the predictor variables on the random forest model was evaluated directly through two statistics: (i) decrease in accuracy and (ii) decrease in Gini node impurity. In calculating both of these, the values of each predictor variable were randomly shuffled in turn and the resulting decrease in accuracy and decrease in Gini node impurity (how well the target variable is split at a branch) were measured on the OOB data of each tree and averaged. A higher (positive) value indicates a greater relative importance of a variable, whereas a negative value (corresponding to greater accuracy or node purity when the values are reassigned) shows that a variable does not benefit a model and should be removed.
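The permutation idea behind the decrease-in-accuracy statistic can be illustrated with a deliberately simplified sketch in Python. It shuffles one predictor at a time and measures the accuracy lost on a single dataset, rather than on the out-of-bag data of each tree as in a real random forest; the toy model, data, and function names are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, col, rng):
    """Accuracy lost when the values of column `col` are shuffled across rows.

    rows must be tuples so that a permuted copy can be rebuilt per row.
    """
    baseline = accuracy(predict, rows, labels)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)

def toy_model(row):
    """Stand-in for a fitted model: predicts 'high' when feature 0 exceeds 0.5."""
    return int(row[0] > 0.5)

rng = random.Random(1)
rows = [(rng.random(), rng.random()) for _ in range(2000)]
labels = [int(x0 > 0.5) for x0, _ in rows]  # labels depend only on feature 0

imp_informative = permutation_importance(toy_model, rows, labels, 0, rng)
imp_irrelevant = permutation_importance(toy_model, rows, labels, 1, rng)
```

Shuffling the informative feature removes about half of the accuracy in this toy setup, whereas shuffling the irrelevant one changes nothing, which mirrors how high and near-zero importance values are read.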
As the measures of random forest variable importance described above do not indicate how a predictor variable relates to the target variable, e.g., with a positive or negative trend, Pearson correlations between each (continuous) predictor variable and the proportion of arsenic measurements exceeding 10 µg/L were calculated. This was done by first ordering the values of each predictor and placing them into 16 bins, each containing the same number of values. The proportion of arsenic measurements in each bin greater than 10 µg/L was then calculated. The number of bins was determined using Sturges' formula (1 + log2 n) [62]. The correlation was calculated between the average value of the predictor in each bin and the proportion of high arsenic concentrations.
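The binning procedure can be made concrete with a small Python sketch. Rounding Sturges' formula up is an assumption here (rounding to the nearest integer gives the same 16 bins for 23,799 points), and the function names and synthetic predictor are hypothetical.

```python
import math

def sturges_bins(n):
    """Number of bins from Sturges' formula, 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

def binned_exceedance(values, high_flags, n_bins):
    """Sort by predictor value, cut into near-equal-count bins, and return
    (mean predictor value, proportion of high-arsenic flags) for each bin."""
    pairs = sorted(zip(values, high_flags))
    n = len(pairs)
    out = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        mean_x = sum(x for x, _ in chunk) / len(chunk)
        prop_high = sum(f for _, f in chunk) / len(chunk)
        out.append((mean_x, prop_high))
    return out

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

n_bins = sturges_bins(23_799)

# Synthetic predictor: high arsenic (flag 1) only in the upper half of values.
values = list(range(160))
flags = [int(v >= 80) for v in values]
bins = binned_exceedance(values, flags, n_bins)
r = pearson([m for m, _ in bins], [p for _, p in bins])
```

The sign of `r` indicates whether high arsenic becomes more or less frequent as the predictor increases, which the importance statistics alone do not show.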

Estimating Potentially Affected Population
The prediction map generated from the random forest model was subsequently used to estimate the population potentially exposed to high concentrations of arsenic in drinking water. The first step of this procedure was to determine the areas at high risk of having arsenic concentrations greater than 10 µg/L. This was done using two approaches that have been described in more detail by Podgorski and Berg [16]. The sensitivity and specificity, that is, the correct classification of high and low values, respectively, were plotted for 100 probabilities between 0 and 1. Where both curves intersect indicates the probability cutoff at which the model classifies low and high values equally well. The second approach that was used follows a similar procedure that instead finds the intersection of the positive predictive value (PPV) and negative predictive value (NPV), which are defined as rate of correct positive and negative predictions, respectively. These analyses for determining how to interpret the model were conducted with all available data (training and testing datasets). Each probability threshold found with these two approaches was subsequently used to identify high hazard areas on the probability map. The two sets of high hazard areas were then used for further calculations to produce a range of values of populations at risk.
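Both cutoff searches amount to scanning a grid of probability thresholds and picking the one where two metrics are closest. The following Python sketch illustrates this under stated assumptions: an evenly spaced grid of interior cutoffs, a "greater than or equal" rule for classifying a cell as high, and made-up probabilities and labels. The function names are hypothetical and this is not the authors' implementation.

```python
def confusion(probs, labels, cutoff):
    """Counts (tp, fp, tn, fn) when probabilities >= cutoff are called high."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        if p >= cutoff:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def sens_spec(tp, fp, tn, fn):
    """Sensitivity and specificity (assumes both classes are present)."""
    return tp / (tp + fn), tn / (tn + fp)

def ppv_npv(tp, fp, tn, fn):
    """Positive and negative predictive values; None where undefined."""
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv

def crossing_cutoff(probs, labels, metric_pair, n_steps=100):
    """First cutoff on the grid at which the two metrics are closest."""
    best_cut, best_gap = None, float("inf")
    for i in range(1, n_steps):
        cut = i / n_steps
        a, b = metric_pair(*confusion(probs, labels, cut))
        if a is None or b is None:
            continue  # metric undefined at this cutoff
        if abs(a - b) < best_gap:
            best_cut, best_gap = cut, abs(a - b)
    return best_cut

# Made-up model probabilities and observed classes (1 = >10 ug/L).
probs = [0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
cut_sens_spec = crossing_cutoff(probs, labels, sens_spec)
cut_ppv_npv = crossing_cutoff(probs, labels, ppv_npv)
```

On real data the two criteria generally give different cutoffs, which is what yields a range of high hazard areas and, in turn, a range of population estimates.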
The populations living in the high hazard areas determined according to the above procedure were then multiplied by the modeled probabilities. The population figures were taken from a population model for 2020 based on existing patterns of development [63]. The at-risk population was further refined by differentiating between rural and urban areas [55] and multiplying by recent (2016) estimated nationwide usage rates of untreated groundwater in rural (0.637) and urban (0.238) areas, respectively [64].
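The exposure calculation described above reduces to a weighted sum over high hazard grid cells. A minimal Python sketch, assuming each cell carries a population count, a modeled probability, and an urban/rural flag; the cell values are invented for illustration, while the two usage rates are those quoted in the text.

```python
RURAL_GW_USE = 0.637  # rural usage rate of untreated groundwater [64]
URBAN_GW_USE = 0.238  # urban usage rate of untreated groundwater [64]

def population_at_risk(cells, cutoff):
    """Sum population x probability x groundwater-use rate over all cells
    whose modeled probability exceeds the cutoff.

    cells: iterable of (population, probability, is_urban) tuples.
    """
    total = 0.0
    for population, prob, is_urban in cells:
        if prob > cutoff:
            rate = URBAN_GW_USE if is_urban else RURAL_GW_USE
            total += population * prob * rate
    return total

# Invented 1-km cells: (population, probability of >10 ug/L, urban flag).
cells = [
    (1000, 0.80, False),
    (5000, 0.52, True),
    (2000, 0.30, False),
]
at_risk_low_cutoff = population_at_risk(cells, 0.49)
at_risk_high_cutoff = population_at_risk(cells, 0.55)
```

The lower cutoff admits more cells and therefore gives the upper end of the estimated range, and the higher cutoff gives the lower end.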

Arsenic Prediction Model
The cross-validation results of the final random forest model as applied to the test dataset are shown in Table 3. The area under the ROC (receiver operating characteristic) curve (AUC), which in general can range between 0.5 (random model) and 1 (perfect model) and represents how well a binary model can predict both low and high values as assessed over numerous probability cutoff values [65], is 0.86. The AUC of this model is generally a better result than that of similar regional or country-scale groundwater quality studies (where reported), for example, 0.84 for fluoride in India [66], 0.71–0.83 for arsenic in Gujarat [14], 0.82 for arsenic in the USA [67], 0.80 for arsenic in Pakistan [6], or 0.74 for arsenic in Uttar Pradesh [15]. The overall accuracy of the model as applied to the test dataset of 0.79 is significantly higher than the no information rate of 0.58 (p value < 2.2 × 10⁻¹⁶) and is comparable to the average accuracy with OOB samples of 0.77. The no information rate refers to the accuracy that would be achieved without a model and is simply the proportion of the more frequent class of the dataset, i.e., 58% of the arsenic measurement points are equal to or less than 10 µg/L. Similarly, Cohen's kappa statistic [68] (0.5606) is an indicator of accuracy beyond what could be expected by chance and varies from 0 (no agreement) to 1 (perfect agreement). Despite the spatial averaging of data to 1-km pixels, there still remains a much higher density of data points in a few regions, particularly in West Bengal where nearly 50% of the averaged data points are located. In order to test if such an imbalance considerably biases the model by allocating excessive weight to the conditions found in a particular region, the data from West Bengal were randomly split into 10 subsets that were each modeled separately with the rest of the dataset.
The results of the 10 different models were averaged and compared with the single model using the full dataset (containing all West Bengal data). As the model performance was essentially identical, e.g., AUC of 0.86 and balanced accuracy of 0.78, the simpler approach of using a single model was retained and that of splitting the West Bengal data was not further pursued.
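For readers less familiar with the accuracy measures reported above, the overall accuracy, no information rate, and Cohen's kappa can all be derived from a 2 × 2 confusion matrix. The Python sketch below uses a hypothetical matrix chosen only for illustration; it is not the paper's Table 3, and the function name is an assumption.

```python
def classification_summary(tp, fp, tn, fn):
    """Accuracy, no information rate, and Cohen's kappa from a 2x2 table."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    actual_pos, actual_neg = tp + fn, tn + fp
    pred_pos, pred_neg = tp + fp, tn + fn
    # Accuracy of always predicting the more frequent observed class.
    nir = max(actual_pos, actual_neg) / n
    # Agreement expected by chance from the row and column totals.
    expected = (actual_pos * pred_pos + actual_neg * pred_neg) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, nir, kappa

# Hypothetical confusion matrix (not the values from Table 3).
acc, nir, kappa = classification_summary(tp=150, fp=50, tn=230, fn=50)
```

Kappa rescales accuracy so that 0 corresponds to chance-level agreement, which is why it is lower than the raw accuracy for the same table.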
Early attempts to separately model areas corresponding to reducing environments, arid-oxidizing environments, and sulfide oxidation arsenic mobilization processes did not result in considerably different predictions or accuracy and were therefore abandoned in favor of a single model for India. It appears that the single model is able to effectively account for different geochemical environments due to utilizing the same parameters of climate, geology, and soil pH that were used to define the different environments. For example, in the ten thousand trees that make up the random forest, splits would have been made on significant differences in these parameters, which is similar in principle to having done so manually.
The arsenic prediction map generated from the final random forest model is displayed in Figure 2a. As most of the data used in the model is from India with the data from neighboring countries having been incorporated merely to help characterize border areas, the analysis that follows will focus only on India. Nonetheless, the arsenic prediction map including the rest of South Asia is presented in Figure S2, which, as seen with model continuity across international borders, also serves as a reminder that political boundaries do not necessarily coincide with those of natural systems. The prediction model (Figure 2a) captures known arsenic-prone areas in the alluvial sediments along the Ganges and Brahmaputra plains [27] and in Gujarat [14] and Punjab [43]. It also identifies less well-known or previously undocumented areas such as Haryana, Jammu and Kashmir, and central Madhya Pradesh. Elevated arsenic hazard also appears in Himachal Pradesh, Kerala, and Uttarakhand, from where no arsenic concentration data were available to us. Consistent with the findings of Sovann and Polya [69] for Cambodia, this confirms that small alluvial systems, in addition to the larger systems, such as those of the Ganges, Brahmaputra, and Mekong, may also host elevated groundwater arsenic concentrations. This highlights some of the advantages and uses of a prediction model based on machine learning rather than spatial interpolation. That is, predictions can be made in areas removed from where concentration measurements exist by relying on statistical relationships between measurements in other areas and predictor variables that cover the entire model domain. The model can then be used to prioritize areas for testing where there is a dearth of data. The results of new water quality testing can then in turn be fed back into the model to further improve it.

Influence of Predictor Variables
The importance of the predictor variables in the random forest model was assessed as the mean decrease in accuracy and mean decrease in Gini impurity. Each of these was normalized by the maximum value calculated among the predictor variables, and both are displayed together in Figure 3. This shows both silt variables (subsoil and topsoil) placing markedly above the others in importance, followed by aridity and both actual and potential evapotranspiration. The least important variables relative to the others are topographic wetness index, water wilting point, gleysols, and land use. Nevertheless, none of the predictors has a negative importance value, which suggests that they are all beneficial to the model.
In order to better understand how the predictor variables may relate to high arsenic concentrations (>10 µg/L), the proportion of high arsenic measurements was plotted against the averages of binned predictor values, for which rank-order correlations (Kendall Tau-b) were also calculated (Figure 4). The strongest rank-order correlations were found with topsoil coarse fragments (Figure 4x), which highlights why it can be important to use a nonlinear classifier for modeling, such as random forest, to capture such relationships.
The importance of gleysols and their positive correlation with high arsenic concentrations can be linked to chemically reducing conditions, which are conducive to arsenic release [7,9], brought about by the poor drainage associated with this soil type. The negative relationship between soil cation exchange capacity (CEC) and groundwater arsenic is at first glance counterintuitive given that higher CEC may give rise to higher pH values at which anionic, deprotonated As(III) or As(V) species will be more likely to desorb. Accordingly, this relationship might instead reflect the typically higher clay/silt contents of high CEC soils. The inverse correlation with water table depth is likely related to the occurrence of higher arsenic concentrations in the shallower Holocene layers of alluvial sedimentary systems [70], which tend to contain higher concentrations of labile reactive organics associated with arsenic mobilization [71][72][73]. Furthermore, the strong dependence of the model on the silt content of soil, including a positive relationship with high arsenic concentrations, points toward the occurrence of high arsenic concentrations beneath floodplains, as do similar relationships with fluvisols.
Many of the observations made above are compatible with each other and are consistent with the reductive dissolution of arsenic in known high arsenic-hazard areas of Figure 2a found within unconsolidated sediments (Figure 1b), particularly along the Ganges and Brahmaputra rivers. Notable exceptions to this are the arsenic areas in central Madhya Pradesh (Figure 2a), which occur primarily within mafic volcanic rocks as well as the high hazard areas located in acidic plutonic rocks (e.g., granite) in Kerala and northeast Karnataka. The last example may be due to the oxidation of arsenic-bearing sulfide minerals, which is likely responsible for high arsenic concentrations in areas of historical gold mining activity in Karnataka [74].

Populations at Risk
The estimated arsenic-risk areas based on the probability cutoffs of 0.49 and 0.55 (see Figure 5) are shown in Figure 2b, in which most of the risk areas are seen to be concentrated along the Brahmaputra river (Assam) and the lower half of the Ganges river (Uttar Pradesh, Bihar, and West Bengal). Other notable risk areas are found in Jammu and Kashmir, Punjab, Gujarat, Madhya Pradesh, and most of the states neighboring Assam. The proportion of land area and populations that are potentially affected are broken down by state/territory in Table 4. Because only grid cells with a probability exceeding 0.49 or 0.55 were considered, underestimation may occur where there is localized arsenic contamination. In total, we estimate that between about 18 and 30 million people in India may be currently consuming arsenic in groundwater at concentrations exceeding 10 µg/L. This population estimate is substantially lower than that estimated (~41 million) for the five most impacted Indian states by Chakraborti et al. [75] and is consistent with the estimate of 31 million derivable from Ravenscroft et al. [9]. It is difficult to state possible reasons for these differences as the methods used to calculate these other estimates are not clear. Our range of potentially affected population is also toward the lower end of the range of 18-90 million estimated for India as part of a global groundwater arsenic prediction model [16], which highlights how a separate study, such as this one, concentrated on a single area can lead to more precise results.

Populations at Risk
The estimated arsenic-risk areas based on the probability cutoffs of 0.49 and 0.55 (see Figure 5) are shown in Figure 2b, in which most of the risk areas are seen to be concentrated along the Brahmaputra river (Assam) and the lower half of the Ganges river (Uttar Pradesh, Bihar, and West Bengal). Other notable risk areas are found in Jammu and Kashmir, Punjab, Gujarat, Madhya Pradesh, and most of the states neighboring Assam. The proportions of land area and population that are potentially affected are broken down by state/territory in Table 4. Because only grid cells with a probability exceeding 0.49 or 0.55 were considered, underestimation may occur where there is localized arsenic contamination. In total, we estimate that between about 18 and 30 million people in India may be currently consuming arsenic in groundwater at concentrations exceeding 10 µg/L. This population estimate is substantially lower than that estimated (~41 million) for the five most impacted Indian states by Chakraborti et al. [75] and is consistent with the estimate of 31 million derivable from Ravenscroft et al. [9]. It is difficult to state possible reasons for these differences as the methods used to calculate these other estimates are not clear. Our range of potentially affected population is also toward the lower end of the range of 18–90 million estimated for India as part of a global groundwater arsenic prediction model [16], which highlights how a separate study concentrated on a single area, such as this one, can lead to more precise results.

Positive predictive value (PPV) and negative predictive value (NPV) were found to be equivalent at a probability cutoff of 0.55, with a corresponding accuracy of 96%.

Table 4. Area and population potentially exposed to arsenic concentrations greater than 10 µg/L by state/territory. Based on probabilities in Figure 2a.
Table columns: State/Territory; Percentage of Land Area Exposed
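To illustrate how a cutoff at which PPV and NPV are equivalent might be located, the following minimal Python sketch scans candidate probability cutoffs and returns the one at which the two values are closest. The function names (ppv_npv, crossover_cutoff) and the toy probabilities and labels are assumptions made purely for illustration; they are not the code or data used in this study.

```python
import math

def ppv_npv(probs, labels, cutoff):
    """Return (PPV, NPV) when samples with probability > cutoff are classed as high arsenic."""
    tp = fp = tn = fn = 0
    for p, is_high in zip(probs, labels):
        predicted_high = p > cutoff
        if predicted_high and is_high:
            tp += 1
        elif predicted_high:
            fp += 1
        elif is_high:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else math.nan
    npv = tn / (tn + fn) if (tn + fn) else math.nan
    return ppv, npv

def crossover_cutoff(probs, labels, steps=100):
    """Scan cutoffs from 0 to 1 and return the first one minimizing |PPV - NPV|."""
    best_cutoff, best_gap = None, math.inf
    for i in range(steps + 1):
        cutoff = i / steps
        ppv, npv = ppv_npv(probs, labels, cutoff)
        if math.isnan(ppv) or math.isnan(npv):
            continue  # one class is never predicted at this cutoff
        gap = abs(ppv - npv)
        if gap < best_gap:
            best_cutoff, best_gap = cutoff, gap
    return best_cutoff

# Hypothetical toy data: modeled probabilities and measured exceedance of 10 µg/L (1 = yes)
toy_probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff = crossover_cutoff(toy_probs, toy_labels)
print(cutoff, ppv_npv(toy_probs, toy_labels, cutoff))
```

In practice, the scan would be run on held-out test data rather than on the data used to train the model, so that the resulting cutoff is not overly optimistic.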

Conclusions
The purpose of the hazard models produced here is to offer an overview of where high concentrations of geogenic arsenic are likely to be found in groundwater across the whole of India, with some insight into the physical processes at work, as well as to assess the size of the populations potentially affected. Because arsenic is often not routinely analyzed in areas without a known problem, the maps should serve as a guide to identifying where additional testing should be conducted as well as to assessing health impacts. In addition to drinking water supplies, the hazard map is relevant to the utilization and siting of wells for the irrigation of crops.
Although this study is based on over 145,000 groundwater samples, targeted testing is still required to determine whether a specific well is highly contaminated with arsenic. This is due, at least in part, to the heterogeneous nature of aquifers, which can lead to great variability of groundwater arsenic concentrations over short distances. Further, additional groundwater arsenic concentration data would help improve the model, particularly in areas (e.g., in parts of Madhya Pradesh) in which anthropogenic processes may have had a material influence on groundwater arsenic concentrations and thereby the rendered model. Although the hazard model presented here is very effective in predicting high groundwater arsenic concentrations, it is unable to account for the depth dependency of arsenic in an aquifer, which, for example, can vary according to sediment age and/or redox conditions as well as 3-D heterogeneous permeability structures [76]. As such, it can be assumed that the predictions become less accurate with greater depth, as was demonstrated in a recent global study [16]. Although ~3/4 of the modeled data points used here have an associated well depth, essentially all of these data were confined to just a few areas in West Bengal and Bihar. Given the size of India and its variations in climate and geology, it was not feasible to generalize depth relationships based on data from just a small part of the country. However, incorporating depth as a predictor variable could be effective in a smaller-scale study of a single region.
Another dimension that could enhance the modeling is that of time. For example, accounting for fluctuations in arsenic concentrations relative to the monsoon season could possibly improve the accuracy of the model, particularly in areas with extensive hyporheic zones undergoing surface water-groundwater exchange, including those proximal to rivers with large differences between post- and pre-monsoonal stages. Accounting for any long-term secular trends in arsenic concentration related to aquifer exploitation [77–80], aquifer depletion, or climate change could also produce useful results. This was not possible in this study because the arsenic measurement points each represent spatially distinct locations and generally lacked information on the timing of sampling.
The population estimates rely on a single country-wide set of groundwater-use rates for rural and urban areas. Knowing this rate at a finer scale (e.g., state, division, and district) could be expected to lead to more accurate results. In order to get a better grasp of who is physically impacted by high groundwater arsenic concentrations, exposure studies and environmental public health tracking [81] leveraging the arsenic hazard map produced here would be very helpful.
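The sensitivity of the exposure estimate to the scale of the groundwater-use rates can be sketched in a few lines of Python. All values below are hypothetical: the grid cells, the rural and urban use rates, the state labels, and the function name population_at_risk are assumptions made for illustration. The sketch also assumes that the full population of each cell above the probability cutoff is multiplied by the applicable use rate, which may differ in detail from the procedure used in this study.

```python
def population_at_risk(cells, cutoff, national_rates, regional_rates=None):
    """Sum population * groundwater-use rate over cells whose probability exceeds the cutoff.

    regional_rates optionally maps a state name to its own rural/urban rates;
    cells in states without an entry fall back to the national rates.
    """
    regional_rates = regional_rates or {}
    total = 0.0
    for cell in cells:
        if cell["prob"] <= cutoff:
            continue  # cell is not classed as a risk area at this cutoff
        rates = regional_rates.get(cell["state"], national_rates)
        total += cell["population"] * rates[cell["setting"]]
    return total

# Hypothetical grid cells (probability of As > 10 µg/L, population, setting, state)
toy_cells = [
    {"prob": 0.70, "population": 1000, "setting": "rural", "state": "A"},
    {"prob": 0.52, "population": 2000, "setting": "urban", "state": "B"},
    {"prob": 0.30, "population": 5000, "setting": "rural", "state": "A"},
]
# Hypothetical country-wide fractions of households using groundwater
toy_national = {"rural": 0.6, "urban": 0.3}
# Hypothetical finer-scale rates available for state "B" only
toy_regional = {"B": {"rural": 0.5, "urban": 0.5}}

low = population_at_risk(toy_cells, 0.55, toy_national)
high = population_at_risk(toy_cells, 0.49, toy_national)
refined = population_at_risk(toy_cells, 0.49, toy_national, toy_regional)
print(low, high, refined)
```

Evaluating the same cells at both cutoffs gives a lower and an upper estimate, analogous to the range reported here, while supplying regional rates shows how a finer-scale rate can shift the total.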

Acknowledgments:
We thank NERC (UK) and DST (India) for their joint funding of the Newton Bhabha Indo-UK Water Quality project, FAR-GANGA (www.farganga.org). We thank colleagues at the FAR-GANGA stakeholder partner, the Central Ground Water Board (CGWB, Faridabad, India), for the acquisition and provision of many of the groundwater arsenic data utilized in this study. We acknowledge the further support of and/or discussions with Benjamin Ambühl, Michael Berg, Abhijit Mukherjee, Prashant Rai, Laura A. Richards and Dipankar Saha. The views expressed in this work do not necessarily reflect those of any of the funders, organizations, or individuals that we acknowledge here.