Groundwater Arsenic-Attributable Cardiovascular Disease (CVD) Mortality Risks in India

Cardiovascular diseases (CVDs) have been recognized as the most serious non-carcinogenic detrimental health outcome arising from chronic exposure to arsenic. Drinking arsenic-contaminated groundwaters is a critical and common exposure pathway for arsenic, notably in India and other countries in the circum-Himalayan region. Notwithstanding this, there has hitherto been a dearth of data on the likely impacts of this exposure on CVD in India. In this study, CVD mortality risks arising from drinking groundwater with high arsenic (>10 µg/L) in India and its constituent states, territories, and districts were quantified using the population-attributable fraction (PAF) approach. Using a novel pseudo-contouring approach, we estimate that between 58 and 64 million people are exposed to arsenic exceeding 10 µg/L in groundwater-derived drinking water in India. On an all-India basis, we estimate that 0.3–0.6% of CVD mortality is attributable to exposure to high arsenic groundwaters, corresponding to annual avoidable premature CVD-related deaths attributable to chronic exposure to groundwater arsenic in India of between around 6500 and 13,000. Based on the reported reduction in life of 12 to 28 years per death due to heart disease, we calculate value of statistical life (VSL)-based annual costs to India of arsenic-attributable CVD mortality of between USD 750 million and USD 3400 million.


Introduction
Geogenic or anthropogenic arsenic has long been recognized as an important human carcinogen [1,2]. Globally, and particularly in India, the use of high arsenic hazard groundwaters for drinking and irrigation has posed, and continues to pose, a serious threat to public health [3][4][5]. The major human exposure pathways to groundwater arsenic are drinking arsenic-contaminated groundwaters and consuming inorganic arsenic-bearing crops, notably rice [6,7], the arsenic content of which is due, albeit only in part, to irrigation with high arsenic groundwater [8]. Chronic exposure to inorganic arsenic may give rise to internal cancers (bladder, lung, kidney, and liver), skin cancers, skin lesions, and, importantly, cardiovascular diseases (CVDs), which are considered the most serious non-carcinogenic detrimental health outcome [9][10][11][12].
In India, arsenic contamination of groundwaters takes place mainly in the alluvial deposits of major rivers (e.g., Ganges and Brahmaputra) flowing from the Himalayas and Tibetan plateau [13][14][15]. The dominant mechanism of arsenic enrichment and mobilization in these groundwaters in alluvial aquifers is the microbially mediated reductive dissolution of arsenic-bearing Fe(III) oxide (oxyhydroxide, hydroxide, and oxide) minerals [16,17]. Other processes, including complexation of arsenic by dissolved humic substances, competitive desorption, oxidation of sulfide minerals, geothermal activities, and mining-related activities, may also influence arsenic concentrations in groundwaters [18,19]. Spatial geostatistical models of the distribution of groundwater arsenic have been produced at both state (Gujarat [20], Uttar Pradesh [21]) and national scales [22,23] to better understand the distribution of groundwater arsenic hazard in India, particularly given the lack of all-India systematic groundwater arsenic testing. Podgorski et al. [22] generated a random forest model with over a hundred thousand arsenic concentration observations and over two dozen environmental parameter predictors to identify high groundwater arsenic (>10 µg/L) in well-known arsenic-contaminated areas in the alluvial sediments along the Ganges and Brahmaputra plains and in Gujarat and Punjab states and to capture less-known or previously unrecorded areas. Estimates of the population in India exposed to high arsenic through drinking groundwaters range from 18-30 million [22] to 90 million [23].
Notwithstanding recent spatial distribution models of groundwater arsenic for India, a country-wide and country-specific public health risk model, notably for cardiovascular diseases, has not yet been created. The association of CVDs with chronic exposure to arsenic in drinking water has been widely discussed [24][25][26][27][28]. Meta-analysis is a recommended method to quantify the dose-response relationship between CVDs and arsenic exposure, suitably weighting the results of a range of studies [24,29]. Meanwhile, the population-attributable fraction (PAF) approach is a common method used to quantify the potential reduction in disease burden in a population from eliminating a certain risk factor [30,31].
The aim of this study was to estimate the magnitude of groundwater arsenic-attributable cardiovascular disease (CVD) mortality in India. This aim was achieved by (i) generating machine learning models of the probability distribution of groundwater arsenic concentrations exceeding four different thresholds (10, 20, 50, and 80 µg/L); (ii) thereby creating a pseudo-contour map of groundwater arsenic concentrations; (iii) estimating the state/district-level population exposed to four arsenic concentration ranges (10-20, 20-50, 50-80, >80 µg/L) based on simple exposure models; (iv) determining the state/district-level population-attributable fraction (PAF) of CVD mortality attributable to exposure to groundwater arsenic; and hence (v) determining the state-level CVD mortality attributable to groundwater arsenic. Lastly, (vi) the costs of arsenic-attributable CVD mortality were estimated for the whole country.

Hazard Models of Groundwater Arsenic
A pseudo-contour map of groundwater arsenic concentrations for the whole of India was produced, based on the machine learning approach of Podgorski et al. [22] and incorporating the novel pseudo-contour approach of Wu et al. [20]. The reader is referred to those studies [20,22] and the references therein for a brief account of the advantages, disadvantages, and limitations of machine learning models. In this study, four random forest models, comprising 1001 trees, were built using the package 'Random Forest' in R statistical software to characterize the distribution of groundwater arsenic relative to four different thresholds (10, 20, 50, and 80 µg/L). The predictions of the four models were used to map the probability of groundwater arsenic concentrations exceeding 10, 20, 50, and 80 µg/L in India. The models were validated using the area under the receiver operating characteristic (ROC) curve (AUC; 0.5 for a random model, 1 for a perfect model) [32][33][34] calculated on the testing dataset, while the importance of the predictors in the random forest models was assessed by considering the mean decrease in accuracy as well as in Gini node impurity.
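The AUC used for validation can be illustrated with a minimal Python sketch. It uses the rank-sum identity (the AUC equals the probability that a randomly chosen exceedance sample scores higher than a randomly chosen non-exceedance sample) and is not the study's R workflow. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def exceeds(concentrations, threshold):
    """Binary labels: 1 where measured arsenic (µg/L) exceeds the threshold."""
    return [1 if c > threshold else 0 for c in concentrations]


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count correctly ordered positive/negative pairs; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical testing-set values (NOT data from the study).
measured = [3.0, 8.0, 12.0, 25.0, 60.0, 95.0]          # µg/L
predicted_p10 = [0.10, 0.40, 0.35, 0.70, 0.80, 0.90]   # P(As > 10 µg/L)

auc_10 = roc_auc(exceeds(measured, 10), predicted_p10)
```

With these illustrative values, seven of the eight exceedance/non-exceedance pairs are ranked correctly, so the sketch gives an AUC of 0.875.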
Compared with the random forest model of groundwater arsenic exceeding 10 µg/L for India of Podgorski et al. [22], 716 more geolocated (longitude/latitude) groundwater arsenic concentration measurements [35][36][37][38][39][40][41][42][43][44][45][46][47] (Figure 1a and Table S1) and 5 more potential environmental predictors (Table S2) were included in the dataset. The 24,297 spatially averaged arsenic data points were then classified into 0-10 (59%, n = 14,192), 10-20 (9%, n = 2233), 20-50 (14%, n = 3456), 50-80 (6%, n = 1536), and >80 (12%, n = 2880) µg/L concentration categories (Figure S1). Considering these five arsenic concentration categories, which differ from the arsenic classification of Podgorski et al. [22], the entire dataset was randomly split into training (80%) and testing (20%) datasets by stratified sampling to ensure that the same ratios of the arsenic concentration categories were present in the entire, training, and testing datasets.
The probability maps of groundwater arsenic exceeding the four thresholds were then converted into a pseudo-contour map of groundwater arsenic concentrations by using a default probability cutoff value of 0.5. Such a machine learning approach may give rise to maps in which the areas ascribed to higher concentration categories do not lie entirely within those ascribed to lower concentration categories. In this study, this was minimized by appropriate stratified sampling when splitting the overall dataset into training and testing datasets, but in the limited number of areas for which this still arose, higher concentration areas were still recognized to exist for the purposes of modelling arsenic-attributable health risks.
The district-wise proportions of areas with groundwater arsenic concentration ranges of 10-20, 20-50, 50-80, and >80 µg/L were calculated based on the pseudo-contour map of groundwater arsenic concentrations, and the error boundaries of the area proportions were estimated by Monte Carlo simulations (1000 times) [48].
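The conversion from four exceedance probabilities to a concentration category, and from categories to district-wise area proportions, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: each map cell is assumed to have equal area, "exceeding the cutoff" is taken as strictly greater than 0.5, and a cell is assigned to the highest threshold whose probability exceeds the cutoff, which is one reading of the statement that higher concentration areas were still recognized where the maps were not nested. All names and cell values are hypothetical.

```python
from collections import Counter

THRESHOLDS = (10, 20, 50, 80)  # µg/L, as in the text
CATEGORIES = ("0-10", "10-20", "20-50", "50-80", ">80")


def pseudo_contour(probs, cutoff=0.5):
    """Map exceedance probabilities for (10, 20, 50, 80) µg/L to a category.

    Assumption: the highest threshold whose probability exceeds the cutoff
    wins, even if a lower threshold is not predicted to be exceeded.
    """
    if len(probs) != len(THRESHOLDS):
        raise ValueError("expected one probability per threshold")
    category = 0
    for k, p in enumerate(probs, start=1):
        if p > cutoff:
            category = k
    return CATEGORIES[category]


def area_proportions(cells, cutoff=0.5):
    """Fraction of (equal-area) cells in a district falling in each category."""
    counts = Counter(pseudo_contour(p, cutoff) for p in cells)
    return {c: counts.get(c, 0) / len(cells) for c in CATEGORIES}


# Hypothetical district of four cells (NOT data from the study).
district_cells = [
    (0.2, 0.1, 0.0, 0.0),  # no threshold exceeded -> 0-10
    (0.7, 0.3, 0.1, 0.0),  # only 10 µg/L exceeded -> 10-20
    (0.8, 0.6, 0.2, 0.1),  # 10 and 20 µg/L exceeded -> 20-50
    (0.4, 0.3, 0.2, 0.6),  # non-nested case -> still >80
]
proportions = area_proportions(district_cells)
```

In this toy district each of the 0-10, 10-20, 20-50, and >80 µg/L categories covers a quarter of the area, and the 50-80 µg/L category covers none.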

Hazard and Population Exposure Maps
The district-wise proportions of areas with various groundwater arsenic concentration ranges were combined with total/rural/urban population [49] and the proportion of households using groundwater as drinking water [50] to estimate the district-wise total/rural/urban populations exposed to arsenic concentration ranges of 10-20, 20-50, 50-80, and >80 µg/L by drinking groundwaters. The proportion of households using groundwater as drinking water was calculated by two model approaches: (exposure model 1) regarding water from hand pumps and tube wells/boreholes as groundwater and (exposure model 2) considering water from wells (covered + uncovered), hand pumps, and tube wells/boreholes as groundwater [51]. The spatial distribution of these values, together with that of the difference between the two values, is shown in Figure 1b-d.
Assuming that the fractions of district rural ($A_{d,r,i}$) and urban ($A_{d,u,i}$) areas with groundwater arsenic level $i$ ($i$ = 1 (0-10 µg/L), 2 (10-20 µg/L), 3 (20-50 µg/L), 4 (50-80 µg/L), and 5 (>80 µg/L)) are equal, both being given by the district-wide area fraction $A_{d,i}$, the fractions of the district rural ($F_{d,r,gw,n,i}$) and urban ($F_{d,u,gw,n,i}$) populations exposed to groundwater arsenic exposure level $i$ were calculated using Equations (1) and (2):

$$F_{d,r,gw,n,i} = A_{d,i} \times X_{d,r,gw,n} \tag{1}$$

$$F_{d,u,gw,n,i} = A_{d,i} \times X_{d,u,gw,n} \tag{2}$$

where $X_{d,r,gw,n}$ and $X_{d,u,gw,n}$ are the fractions of the households drinking groundwater in rural and urban areas of a district, respectively, and $n$ is the number of the exposure model approach used to calculate the proportion of households using groundwater as drinking water, $n$ = 1 (groundwater = hand pump + tube well/borehole) or 2 (groundwater = well (covered + uncovered) + hand pump + tube well/borehole). The fraction of the total district population exposed to groundwater arsenic exposure level $i$ ($F_{d,t,gw,n,i}$) was then calculated by:

$$F_{d,t,gw,n,i} = \frac{F_{d,r,gw,n,i} \times P_{d,r} + F_{d,u,gw,n,i} \times P_{d,u}}{P_{d,t}} \tag{3}$$

where $P_{d,t}$, $P_{d,r}$, and $P_{d,u}$ are the total, rural, and urban populations, respectively, of a district. Thus, the district-wise rural ($P_{d,r,gw,n,i}$), urban ($P_{d,u,gw,n,i}$), and total ($P_{d,t,gw,n,i}$) populations exposed to groundwater arsenic exposure level $i$ were estimated by Equations (4)-(6), respectively:

$$P_{d,r,gw,n,i} = F_{d,r,gw,n,i} \times P_{d,r} \tag{4}$$

$$P_{d,u,gw,n,i} = F_{d,u,gw,n,i} \times P_{d,u} \tag{5}$$

$$P_{d,t,gw,n,i} = F_{d,t,gw,n,i} \times P_{d,t} \tag{6}$$
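The exposure calculation for a single district, arsenic level, and exposure model can be sketched in Python as below. The sketch assumes, as the definitions above suggest, that the exposed population fraction is the product of the district area fraction at that arsenic level and the fraction of households drinking groundwater, and that the total population is the sum of the rural and urban populations. The function name, argument names, and numbers are hypothetical.

```python
def district_exposure(area_fraction, x_rural, x_urban, p_rural, p_urban):
    """Exposure fractions and exposed populations for one arsenic level.

    area_fraction: fraction of district area at this arsenic level
                   (assumed equal for rural and urban areas).
    x_rural, x_urban: fractions of households drinking groundwater.
    p_rural, p_urban: rural and urban populations of the district.
    """
    # Assumed form of Equations (1) and (2): area fraction x groundwater use.
    f_rural = area_fraction * x_rural
    f_urban = area_fraction * x_urban
    # Assumption: total population = rural + urban.
    p_total = p_rural + p_urban
    # Equation (3): population-weighted mean of the rural and urban fractions.
    f_total = (f_rural * p_rural + f_urban * p_urban) / p_total
    return {
        "f_rural": f_rural,
        "f_urban": f_urban,
        "f_total": f_total,
        "exposed_rural": f_rural * p_rural,   # Equation (4)
        "exposed_urban": f_urban * p_urban,   # Equation (5)
        "exposed_total": f_total * p_total,   # Equation (6)
    }


# Hypothetical district (NOT data from the study): 20% of the area at this
# arsenic level, 80% rural and 50% urban households drinking groundwater.
example = district_exposure(0.2, 0.8, 0.5, 800_000, 200_000)
```

For these illustrative inputs the sketch gives about 128,000 exposed rural and 20,000 exposed urban inhabitants, i.e., 14.8% of the district population.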

Pooled Relative Risks of Cardiovascular Disease (CVD) Related to Arsenic Concentrations in Drinking Water
Incorporating the dose-response relationships between CVD mortality risk and arsenic in drinking water from a recent meta-analysis by Xu et al. [24], the pooled log-linear and non-linear relative risks for each CVD endpoint were calculated for arsenic concentrations of 15 µg/L (mid-point of the 10-20 µg/L category), 35 µg/L (mid-point; 20-50 µg/L), 65 µg/L (mid-point; 50-80 µg/L), and 131 µg/L (a weighted mid-point for the >80 µg/L category), using a drinking water arsenic concentration of 5 µg/L (mid-point for the 0-10 µg/L category) as the reference exposure. The goodness-of-fit of the model was evaluated using deviance tests, the coefficient of determination (R² and adjusted R²), and Akaike's information criterion (AIC) values. Heterogeneity was assessed using P-heterogeneity (p < 0.05 for significant), I² statistics (<25% for low heterogeneity, 25-50% for moderate, and >50% for high), and Cochran's Q-statistic [52].
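How a log-linear dose-response model translates category mid-points into relative risks can be sketched as follows. The sketch assumes the common form in which the logarithm of the relative risk is linear in concentration, so that RR(x) = exp(β(x − x_ref)). The slope used here is a purely hypothetical placeholder and is not the pooled estimate of Xu et al. [24]; only the mid-point and reference concentrations are taken from the text.

```python
import math

# Mid-point concentrations (µg/L) for each category, as given in the text.
MIDPOINTS = {"10-20": 15.0, "20-50": 35.0, "50-80": 65.0, ">80": 131.0}
REFERENCE = 5.0  # µg/L, mid-point of the 0-10 µg/L reference category

# Hypothetical slope per µg/L, for illustration only (NOT from Xu et al.).
HYPOTHETICAL_BETA = 0.001


def log_linear_rr(concentration, beta, reference=REFERENCE):
    """Relative risk under an assumed log-linear dose-response model."""
    return math.exp(beta * (concentration - reference))


rr_by_category = {
    name: log_linear_rr(x, HYPOTHETICAL_BETA) for name, x in MIDPOINTS.items()
}
```

By construction the relative risk equals 1 at the reference concentration and increases monotonically with concentration whenever the slope is positive.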

Cardiovascular Disease (CVD) Health Risk Estimation
The contribution of drinking groundwater with high arsenic (>10 µg/L) to CVD mortality was quantified using the population-attributable fraction (PAF) approach. The state-wise and district-wise PAFs of high groundwater arsenic-attributable CVD mortality were calculated by combining the population exposure distribution and dose-response relationship using Equation (7), based on the formula given by Rockhill et al. [30] and Mondal et al. [31]:

$$\mathrm{PAF}_{d,n} = \frac{\sum_{i=1}^{5} F_{d,t,gw,n,i}\, RR_i - \sum_{i=1}^{5} F'_{d,t,gw,n,i}\, RR_i}{\sum_{i=1}^{5} F_{d,t,gw,n,i}\, RR_i} \tag{7}$$

where $RR_i$ is the relative (to the reference exposure level, $i$ = 1 (0-10 µg/L)) risk of CVD mortality at groundwater arsenic exposure level $i$; $i$ = 1 (0-10 µg/L), 2 (10-20 µg/L), 3 (20-50 µg/L), 4 (50-80 µg/L), and 5 (>80 µg/L); $F_{d,t,gw,n,i}$ and $F'_{d,t,gw,n,i}$ are the fractions of the total district population exposed to groundwater arsenic exposure level $i$ in the actual current situation and the ideal situation, respectively; and $n$ is the number of the exposure model used for calculating the fraction of the population drinking groundwater. The error boundaries of the PAF values were estimated by 1000 Monte Carlo simulations. The PAF of groundwater arsenic-attributable CVD mortality was combined with published data [53] on CVD mortality in India on a state-wise basis to quantify the state-wise annual CVD mortality due to consuming high arsenic groundwater. Finally, the total annual hospitalization costs due to high groundwater arsenic-attributable CVD mortality were estimated for the whole country based on the estimated PAFs, cost of hospitalization [54], prevalent cases [55], and deaths [55] for CVDs. Value of statistical life (VSL)-based annual costs of arsenic-attributable CVD mortality were estimated for India based on our estimated CVD mortality, the reported Indian individual VSL [56], and estimates of years of life lost due to heart diseases [57,58].
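The PAF calculation can be illustrated with a short Python sketch. It assumes the generalized form in which the PAF is the relative difference between the exposure-weighted relative risk under the current exposure distribution and that under the ideal distribution, here taken as everyone in the 0-10 µg/L reference category. The relative risks, exposure fractions, and baseline death count are hypothetical and are not values from Table 1 or from the study.

```python
def paf(current, ideal, rr):
    """Population-attributable fraction for categorical exposure levels.

    current, ideal: fractions of the population in each exposure level
                    (each must sum to 1) for the current and ideal scenarios.
    rr: relative risk of each level relative to the reference level.
    """
    if not (len(current) == len(ideal) == len(rr)):
        raise ValueError("inputs must have one entry per exposure level")
    for fractions in (current, ideal):
        if abs(sum(fractions) - 1.0) > 1e-9:
            raise ValueError("exposure fractions must sum to 1")
    risk_current = sum(f * r for f, r in zip(current, rr))
    risk_ideal = sum(f * r for f, r in zip(ideal, rr))
    return (risk_current - risk_ideal) / risk_current


# Hypothetical inputs (NOT values from the study), levels 1..5:
# 0-10, 10-20, 20-50, 50-80, >80 µg/L.
rr = [1.0, 1.05, 1.10, 1.20, 1.40]
current = [0.90, 0.05, 0.03, 0.01, 0.01]
ideal = [1.0, 0.0, 0.0, 0.0, 0.0]  # everyone in the reference category

paf_value = paf(current, ideal, rr)

# Attributable deaths = PAF x baseline CVD deaths (hypothetical baseline).
baseline_cvd_deaths = 100_000
attributable_deaths = paf_value * baseline_cvd_deaths
```

With these illustrative inputs the exposure-weighted relative risk is 1.0115, giving a PAF of about 1.1%, or roughly 1100 attributable deaths per 100,000 baseline CVD deaths.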

Random Forest Models for Hazard Maps
The assembled dataset of 24,297 averaged arsenic concentrations with potential environmental predictors was used to create four random forest models of the probability of groundwater arsenic exceeding 10, 20, 50, and 80 µg/L for the whole of India. The optimal numbers of predictors at each individual tree branch, i.e., those giving the smallest out-of-bag (OOB) error, were 16, 18, 22, and 21, respectively (Table S3). No predictor had negative mean decreases in both accuracy and Gini node impurity, so all predictors benefited the models and were retained. The cross-validation of the models on their testing datasets showed that the areas under the ROC curves (AUC) were 0.84, 0.84, 0.83, and 0.83, indicating good model prediction performance (Figure S2). The importance of the predictors in the four random forest models is shown in Figure S3, showing that some climate (e.g., temperature, actual evapotranspiration, aridity, precipitation) and soil (e.g., fluvisols) predictors tend to have relatively higher importance and rank than others.
The probability maps (Figure 2a-d) of arsenic concentration exceeding 10, 20, 50, and 80 µg/L in groundwaters were transformed into a pseudo-contour map (Figure 2e) of groundwater arsenic concentrations using a probability cutoff value of 0.5. The concentration map captured a notably high hazard of groundwater arsenic in north and northeast India, particularly in Assam and West Bengal, largely consistent with previously documented high hazard areas [22]. The district-wise proportions of areas with various groundwater arsenic concentration ranges were calculated (Figure 3 and Table S4), showing over 50% of the areas of several districts in the Ganges and Brahmaputra alluvial plains, in Punjab, and in less known areas of Haryana and NCT of Delhi with groundwater arsenic greater than 10 µg/L, the provisional guideline value of arsenic in drinking water recommended by the World Health Organization (WHO) [59].
Water 2021, 13, 2232
The district-wise statistics of modelled high groundwater arsenic areas were associated with greater errors than those at the state and country levels; however, most of those districts having high groundwater arsenic listed by the India Central Ground Water Board (CGWB) [60] were captured by our models. The random forest models produced are 2D rather than 3D models, and common limitations of such 2D models are discussed by Podgorski et al. [22] and Wu et al. [20,61]. Clearly, in areas where there are multiple aquifers, substantial differences in arsenic concentrations between aquifers may exist, indicating the need for detailed local studies and testing. Nevertheless, the models here may broadly reflect the relative sampling densities in these different aquifers, so they could arguably be used to estimate current exposures. As the balance of sampling/supply between different aquifers changes, however, these 2D models would necessarily become dated and unrepresentative.

Population Exposure Maps
The district-wise total/rural/urban populations exposed to arsenic concentration ranges of 10-20, 20-50, 50-80, and >80 µg/L by drinking groundwater are shown in Figure 4 (exposure model 1), Figure S4 (exposure model 2), and Table S4 (exposure model 1). For India as a whole, we estimated that between 58 (exposure model 1) and 64 (exposure model 2) million inhabitants are exposed to arsenic exceeding 10 µg/L in drinking water. This estimate is higher than the estimates of 31 million people by Ravenscroft et al. [62] and 18-30 million people by Podgorski et al. [22], whilst it is lower than the estimate of 90 million people exposed to high arsenic in groundwater by Mukherjee et al. [23]. We prefer our estimate of 58 million to 64 million as it is based on a more granular and detailed assessment of district-level exposures at a range of arsenic concentration levels than the previous estimates; we also note that this estimated range is consistent with the arithmetic mean of those of Podgorski et al. [22] and Mukherjee et al. [23]. Meanwhile, between 39.1 (exposure model 1) and 43.4 (exposure model 2) million (i.e., 3.2-3.6% of the population of India), 15.1-17.0 million (1.3-1.4%), 1.5-1.6 million (0.1%), and 2.3-2.4 million (0.2%) people are exposed to high groundwater arsenic of 10-20 µg/L, 20-50 µg/L, 50-80 µg/L, and >80 µg/L, respectively, with more rural populations (32.9-36.6 million for 10-20 µg/L; 12.3-13.8 million for 20-50 µg/L; 1.2 million for 50-80 µg/L; 1.6-1.7 million for >80 µg/L) currently consuming high arsenic groundwaters than urban populations (6.1-6.8 million for 10-20 µg/L; 2.9-3.1 million for 20-50 µg/L; 0.3 million for 50-80 µg/L; 0.6 million for >80 µg/L). A large number of districts in Assam, Bihar, and West Bengal states had a large population (>1 million people) drinking groundwater with high arsenic concentration (>10 µg/L).
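As a simple consistency check, the category-level totals quoted above can be summed and compared with the headline range of 58 to 64 million. The following Python sketch uses only the total-population figures given in the text; the variable names are illustrative.

```python
# Total population exposed (millions) per concentration category,
# taken from the figures quoted in the text.
exposed_millions = {
    "model_1": {"10-20": 39.1, "20-50": 15.1, "50-80": 1.5, ">80": 2.3},
    "model_2": {"10-20": 43.4, "20-50": 17.0, "50-80": 1.6, ">80": 2.4},
}

# Sum over categories to obtain the population exposed to >10 µg/L.
totals = {
    model: round(sum(by_category.values()), 1)
    for model, by_category in exposed_millions.items()
}
```

The sums are 58.0 million (exposure model 1) and 64.4 million (exposure model 2), in agreement with the quoted range once rounded to the nearest million.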
The population exposures were based on two contrasting end-member assumptions as to the arsenic status of shallow wells: (exposure model 1) shallow dug wells are largely free of groundwater arsenic and (exposure model 2) shallow dug wells contain groundwater arsenic, posing a threat to human health. Thus, the range of values presented here reflects in part uncertainties in the relative validities of exposure models 1 and 2. It is recognized both that (i) the validity of the models may be a function of geology and consumer behavior and (ii) the validity of the models may change with time, and indeed may have changed materially since 2011, the date of the district-wise rural/urban population data used in this study [49,50].
Figure 4. District-wise rural (a-d), urban (e-h), and total (i-l) population exposed to arsenic concentration ranges of 10 to 20 µg/L, 20 to 50 µg/L, 50 to 80 µg/L, and >80 µg/L by drinking groundwater.

Pooled Relative Risks of Cardiovascular Disease (CVD) Related to Arsenic Concentrations in Drinking Water
The linear and non-linear positive associations between CVD mortality risk and arsenic concentrations in drinking water used here, both with significant overall trends (p < 0.05), are based on the recent meta-analysis by Xu et al. [24] (Table 1). The relative mortality risks of CVDs for arsenic concentration ranges of 10-20, 20-50, 50-80, and >80 µg/L compared with the reference level of 0-10 µg/L are listed in Table 1. The association models show high heterogeneity owing to considerable variation among the selected studies in terms of study area, study design, exposure assessment, ascertainment of outcome, and adjustment for confounders.
Table 1. Dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water. Pooled relative risks (95% CIs) for cardiovascular disease (CVD) mortality for the median concentration in each arsenic concentration category relative to the median concentration in the reference category of 0-10 µg/L. Calculations after Xu et al. [24].
Although the dose-response relationship used does not explicitly take into account any of the factors known to be strongly associated with CVD (for example, gender, age, obesity, smoking habits, and income), the PAF approach used in this study means that these factors are largely implicitly accounted for in the baseline CVD mortality risks. Nevertheless, we expect more refined dose-response models to be published in due course, and the estimates should then be updated accordingly.
Based upon the PAF estimates from the two exposure models and the linear and non-linear dose-response relationships, together with the recent annual hospitalization cost of USD 14.4 billion [54], 54.5 million prevalent cases [55], and 2.8 million deaths [55] for CVDs, the annual total hospitalization costs due to high groundwater arsenic-attributable CVD mortality were estimated to be USD 1.95-3.74 million (exposure model 1) or USD 2.14-4.13 million (exposure model 2) in India. Meanwhile, the value of statistical life (VSL)-based annual costs of arsenic-attributable CVD mortality were estimated to be USD 749-3127 million (exposure model 1) or USD 813-3408 million (exposure model 2) in India, according to our estimated mortality, the reported individual VSL of USD 0.64 million [56], and years of life lost due to heart diseases per death of 12 years [58] or 28 years [57]. The VSL-based costs are much higher than our estimated hospitalization costs; this may be because many people, notably in poorer rural areas, do not receive hospital treatment, or receive only limited and incomplete treatment, for their CVD. Although CVD cases that are untreated by hospitals do not give rise to hospitalization costs, they still impact the country's economy, and this is more accurately reflected in VSL-based cost estimates. Lastly, on this point, we note that the costs estimated here relate only to CVD mortality; costs related to morbidity, though not quantified here, are also likely to be very substantial.
At the state level, the PAFs (Tables S5 and S6) of groundwater arsenic-attributable CVD mortality were calculated to be highest in Assam (2.5% to 5.7%) and West Bengal (1.6% to 2.7%). The PAFs for most states as a whole were estimated to be below 1%. Combined with the recently reported disease burden of CVDs for India [53], the annual state-wise CVD mortalities due to consuming high arsenic groundwater were estimated and are listed in Table S7 (exposure model 1) and Table S8 (exposure model 2). For the whole of India, the modelled annual CVD mortality attributable to chronic consumption of high arsenic groundwater was estimated to be about 6510-12,132 (exposure model 1) or 7067-13,224 (exposure model 2). The country-wise cost and state-wise mortality estimates were based on reported data from somewhat different years (PAF: 2011; hospitalization cost, deaths, and cases: 2016; disease burden: 2019), which may cause some estimation bias; however, the magnitude of this bias is considerably less than the combined uncertainties arising from the hazard distribution, exposure, and dose-response models and the documented all-cause CVD disease burden in India.
At a district level (Figures 5 and S5 and Table S4), the estimated PAF of groundwater arsenic-attributable CVD mortality was as high as between 6.0% (exposure model 1, non-linear dose-response relationship) and 9.8% (exposure model 1, linear dose-response relationship) or between 6.5% (exposure model 2, non-linear dose-response relationship) and 10.7% (exposure model 2, linear dose-response relationship) in Dhemaji district (Assam), indicating that drinking high arsenic groundwater is likely to be one of the most important factors causing CVD mortality in some Indian districts. Relatively higher PAFs (>3%, about 5 to 10 times the country-wise overall PAF) were identified in some further districts in Arunachal Pradesh, Assam, Bihar, Punjab, Nagaland, and West Bengal states. This is consistent with the recorded distribution of well-known relatively higher groundwater arsenic [22]. District-wise PAF estimates exhibit a wider range of modelled CVD mortality attributable to drinking high arsenic groundwater than state-wise PAF estimates, in part because there are more districts than states; it is also reiterated that the reliability of the model decreases as it becomes more granular. Overall modelled mean hazard and ultimately CVD risk for the whole of India are associated with smaller relative errors than the same estimates at the state level, and more so at the district level, whilst very substantial errors might be expected if the model were used to predict arsenic concentrations at the level of an individual well or household, an application of the model that we do not recommend unless supplemented by well-specific testing.

Conclusions
By combining machine learning models of groundwater arsenic distribution with basic exposure route models and CVD-specific dose-response models, we provide for the first time an estimate of both the magnitude and distribution of groundwater arsenic-attributable CVD mortality risks in India. The data were mapped at country, state, and district scales, although the higher resolution (e.g., district-level) mapping and modelling is likely to be associated with higher relative model errors.
As the basis of this, we generated for the first time a pseudo-contour map of model groundwater arsenic concentrations for the whole of India, based upon the machine learning approach of Podgorski et al. [22] and the pseudo-contour approach of Wu et al. [20]. Incorporation of CVD dose-response relationships after a recent meta-analysis by Xu et al. [24], using the PAF approach rather than an absolute risk vs. dose-response (cf. Mondal et al. [31]) has enabled major spatially variable drivers of CVD in India, to a large extent, to be implicitly taken into account in the health-risk modelling.

On a country-wise basis, groundwater arsenic is calculated to contribute to around 0.3-0.6% of all CVD mortality, but this figure is much higher (>3%) for many individual districts in Arunachal Pradesh, Assam, Bihar, Punjab, Nagaland, and West Bengal, indicating that the continued focusing of government and NGO resources on mitigation of groundwater-derived drinking water supplies in these areas is warranted to protect public health.
On a country-wise basis, we estimate around 6500 to 13,000 premature avoidable CVD deaths may be ascribed to chronic exposure from groundwater-derived drinking water supplies with high arsenic concentrations. The cost of this on a VSL (value of a statistical life) basis is conservatively estimated to be at least on the order of USD 750 million to USD 3400 million per annum. These costs could be used to inform what magnitude of investment might reasonably be warranted to mitigate high arsenic groundwater-derived drinking water supplies in India.
Lastly, we note that the models presented are limited both by data availability and by the nature of the models themselves. In particular, (i) the hazard models are dependent on the quality of the published data upon which they are based [20,22,61]; (ii) the geospatial hazard models produced here are 2D rather than 3D models and so do not comprehensively consider the often substantial differences between arsenic concentrations in shallow aquifers, often exploited via dug wells, and in deeper aquifers, often exploited by tube wells or boreholes [22,61]; (iii) in the absence of more detailed, readily available public domain data, the exposure models make plausible but untested assumptions about the distribution of groundwater arsenic concentrations within districts and, in particular, between areas dominated by rural and urban populations; (iv) the data on district-wise drinking water supply resources and total/rural/urban populations were produced in 2011 [49,50], so the exposure and health risk estimates may be dated as a result of changes in population distribution and water supply type since that time; and (v) model imprecision is likely to be greater at higher spatial resolutions, so country-wise estimates are likely to be more precise than state-wise estimates, which in turn are likely to be more precise than district-wise estimates. Extending the model to individual wells and households is not recommended; for these, any material decision-making would best be based on due consideration by appropriate professionals of individual/household-level circumstances and data.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/w13162232/s1.
Figure S1: Frequency distribution of averaged groundwater arsenic concentrations used in this study.
Figure S2: Cross-validation results. ROC curves and AUC values of random forest models of groundwater arsenic exceeding (a) 10, (b) 20, (c) 50, and (d) 80 µg/L.
Figure S3: Normalized importance of predictors in terms of mean decrease values in accuracy and in Gini node impurity in the random forest models of groundwater arsenic exceeding (a) 10, (b) 20, (c) 50, and (d) 80 µg/L. Both mean decrease in accuracy and mean decrease in Gini node impurity were normalized by their respective largest values.
Figure S4: Maps of modelled district-wise population exposure to groundwater arsenic in specific concentration ranges, based on exposure model 2 (see text). District-wise rural (a-d), urban (e-h), and total (i-l) population exposed to arsenic concentration ranges of 10 to 20 µg/L, 20 to 50 µg/L, 50 to 80 µg/L, and >80 µg/L by drinking groundwater.
Figure S5: District-level maps showing (a,d) population-attributable fraction (PAF) of high groundwater arsenic-attributable CVD mortality with error boundaries (lower (b,e) and upper (c,f) limits), based on exposure model 2 (see text). PAF estimations were based on both linear (a-c) and non-linear (d-f) dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water.
Table S1: Groundwater arsenic concentration measurements in random forest models. Existing arsenic measurements were taken from over 50 sources, mainly from India but also from some neighboring South Asian countries. Summaries are given for before and after spatial averaging (geometric mean of arsenic concentrations within a 1 km² pixel).
Table S2: Potential predictors used in the machine learning models. Brief descriptions and data sources are listed.
Table S3: Comparison of out-of-bag (OOB) errors as a function of the number of available predictors at each branch in random forest models (note that 31 is the total number of potential predictors). The number shown with grey shading is the optimum number of predictors at each branch in the model.
Table S4: Modelled district-wise exposures and cardiovascular disease (CVD) health risk. District-wise rural, urban, and total population exposed to arsenic concentration ranges of 10 to 20, 20 to 50, 50 to 80, and >80 µg/L by drinking groundwater. Population-attributable fraction (PAF) of high groundwater arsenic-attributable CVD mortality. The estimations were based on exposure model 1, regarding water from hand pumps and tube wells/boreholes as groundwater, and on both linear and non-linear dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water.
Table S5: Modelled state-wise cardiovascular disease (CVD) health risk. State-wise rural, urban, and total population exposed to arsenic concentration ranges of 10 to 20, 20 to 50, 50 to 80, and >80 µg/L by drinking groundwater. Population-attributable fraction (PAF) of high groundwater arsenic-attributable CVD mortality. The estimations were based on exposure model 1, regarding water from hand pumps and tube wells/boreholes as groundwater, and on both linear and non-linear dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water.
Table S6: Modelled state-wise cardiovascular disease (CVD) health risk. State-wise rural, urban, and total population exposed to arsenic concentration ranges of 10 to 20, 20 to 50, 50 to 80, and >80 µg/L by drinking groundwater. Population-attributable fraction (PAF) of high groundwater arsenic-attributable CVD mortality. The estimations were based on exposure model 2, considering water from wells (covered + uncovered), hand pumps, and tube wells/boreholes as groundwater, and on both linear and non-linear dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water.
Table S7: State-wise modelled annual CVD mortality attributable to chronic consumption of high arsenic groundwater in India based on exposure model 1, including total CVD mortality for all high arsenic concentrations and CVD mortality for four different arsenic concentration categories (10-20, 20-50, 50-80, and >80 µg/L). These mortality estimations were based on both linear and non-linear dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water. By definition, there is no modelled arsenic-attributable CVD mortality in the reference arsenic concentration category. Exposure model 1 assumes that modelled groundwater arsenic hazard data apply to, and exposure arises from, only hand pumps, tube wells, and boreholes. Although modelled mortalities for model 1 are somewhat lower than for model 2, these differences are of the same order as the uncertainties in baseline all-cause CVD mortality data for the states.
Table S8: State-wise modelled annual CVD mortality attributable to chronic consumption of high arsenic groundwater in India based on exposure model 2, including total CVD mortality for all high arsenic concentrations and CVD mortality for four different arsenic concentration categories (10-20, 20-50, 50-80, and >80 µg/L). These mortality estimations were also based on both linear and non-linear dose-response relationships between cardiovascular disease mortality and arsenic concentration in drinking water. By definition, there is no modelled arsenic-attributable CVD mortality in the reference arsenic concentration category. Exposure model 2 assumes that modelled groundwater arsenic hazard data apply to, and exposure arises from, covered/uncovered wells, hand pumps, tube wells, and boreholes. Although modelled mortalities for model 1 are somewhat lower than for model 2, these differences are of the same order as the uncertainties in baseline all-cause CVD mortality data for the states.
Author Contributions: Conceptualization, D.A.P., R.W. and L.X.; methodology, software, validation, and formal analysis in relation to As-CVD dose-response relationships, L.X.; all other methodology, software, validation, and formal analysis, R.W.; data curation, R.W.; writing (original draft preparation), R.W.; writing (review and editing), R.W., D.A.P. and L.X.; supervision and project administration, D.A.P.; funding acquisition, D.A.P. and L.X. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: Data presented in this study that are not otherwise available from the references and organizations indicated in the text may be available on request from the corresponding authors.