The Link between Soil Geochemistry in South-West England and Human Exposure to Soil Arsenic

Abstract: The aim of this research is to use the whole soil geochemistry and selected bioaccessibility measurements, obtained using the BioAccessibility Research Group of Europe (BARGE) method on the same soils, to identify the geochemical controls on arsenic (As) bioaccessibility and to gain an understanding of its spatial distribution in south-west England. The total element concentrations of 1154 soils were measured, with As concentrations ranging from 4.7 to 1948 mg·kg−1; the bioaccessible As of 50 selected soils ranged from 0.6 to 237 mg·kg−1. A Self Modelling Mixture Resolution approach was applied to the total soil element chemistry to identify the intrinsic soil constituents (ISCs). The ISCs were used as predictor variables and As bioaccessibility as the dependent variable in a regression model that predicts As bioaccessibility at all soil locations, so that its regional spatial distribution could be examined. This study has shown that bioaccessibility measurements can be directly linked to the geochemical properties of soils. In summary, the primary source of bioaccessible As appears to be soils developed directly over the mineralised areas surrounding the granite intrusions. Secondary sources of bioaccessible As are derived from As that has been mobilised from the primary mineralised source and then re-absorbed onto clay material, Fe oxides and carbonate coatings. This information can be of direct use for land development, since land contamination can affect the health of people living, working, visiting or otherwise present on a site.


Introduction
The geology and mineral deposits of south-west England are diverse [1]. A recent review of the geology [2], arising from a geochemical survey of stream sediments, identifies 16 geological domains (Figure 1), with five major granites playing an important role in the formation of mineralised zones where mining operations have been located. Exploitation of mineral deposits containing economically attractive elements (Cu, Pb, Zn, Sn, Au, Ag etc.) and some that are considered to be harmful, such as As and Pb [3,4], has been undertaken since the Roman era, peaking in the mid-nineteenth century. For example, the Devon Great Consols Copper (Cu) and Arsenic (As) mine in the Tamar valley produced over 70,000 t of As between 1848 and 1909 [5]. The presence of such mineral deposits influences the geochemical nature of the overlying ground (soil), so that some areas are considered to be contaminated; this can be exacerbated by extraction during mining activities.
Over the last century geochemical surveys of land masses have been used to identify sites for mineral exploration and, more recently, to provide a useful indicator of areas of soil contamination, e.g., from the presence of elements of interest and subsequent extraction and smelting activities (e.g., [6][7][8][9][10]). Measurement of soil geochemistry provides a snapshot of chemical elements (hazardous and beneficial) in the surface environment at a given point in time (e.g., the time of sampling) [11]. Geochemical soil surveys have been carried out across the world as part of national [12] and continental surveys, e.g., [13], on regional and urban scales (e.g., [14][15][16][17][18][19]) and for different land uses (e.g., agricultural soils [20,21]). The detailed geochemical information that they provide highlights the spatial distribution of elements [22] and gives a reference point by which, for example, environmental change may be monitored and previous land activities identified (e.g., mineral exploration) [17,23].
The usefulness of baseline geochemistry data to health was highlighted in 1998 [24], and more recently, soil geochemistry from baseline data sets has been used to make linkages with the potential hazards to human [25][26][27][28] and animal health [29].
The south-west of England has a population of ca. 5 million, accounting for about 8.5 per cent of the UK population [30], and is a popular tourist destination with around 500,000 visitors a year. Through daily activities, residents and tourists alike will interact with the natural environment and be exposed to soil and any potentially harmful elements (PHE) within it. Humans, particularly children as the most vulnerable in society, are exposed to soil PHE by oral, inhalation and dermal pathways. Ingestion is the primary exposure route, with an estimated ingested mass of up to 100 mg per day for children [31] and 50 mg per day for adults [32]. Measures of exposure to soil PHEs are made by in vitro (laboratory-based) simulation of the human gastro-intestinal (ingestion) pathway [32]. The fraction of PHE that is soluble in this environment is classed as bioaccessible [33] and has the potential to cause harm. The past two decades have seen the development, comparison, refinement and validation of laboratory-based bioaccessibility methods for soil contamination. The priority contaminants As and Pb have been the main focus of interest to workers across the world, along with evaluations of PHE availability in a range of soil types and contamination scenarios, e.g., [33][34][35][36][37][38][39][40][41]. Performance validation studies, rather than comparison between in vivo (animal) and in vitro (laboratory simulation) data for the same soils [33], have been undertaken, focussing on assessing agreement of model predictions using independent data sets [42].
To predict PHE bioaccessibility over different spatial scales, rather than for soil type or contamination scenarios, soil geochemical data (along with bioaccessibility data) for urban and regional environments have been incorporated into regression modelling [62,63]. Similarly, studies of spatial variation in bioaccessible soil PHE concentrations have been used to assist in the identification of contaminant sources [25,64,65], the main factors controlling contaminant bioaccessibility [66,67] and map hazard quotients for human health risk from soil contaminants [68]. However, there is little work available that spatially predicts and maps PHE bioaccessibility on a regional or urban scale [69].
In this study, the total element concentrations of 1154 soils from south-west England were measured along with the bioaccessibility of As in 50 selected soils (details are given in the Methods section). The aim of this research is twofold: (i) to use the whole soil geochemistry data and selected bioaccessibility measurements on the same soils to identify the geochemical controls on As bioaccessibility; and (ii) to use regression modelling, with derived data from the whole soil geochemistry as predictor variables and As bioaccessibility as the dependent variable, to predict As bioaccessibility at all soil locations and examine its regional spatial distribution.

Materials and Methods
The British Geological Survey's (BGS) geochemical baseline mapping programme (GBASE) began in the late 1960s using the geochemical mapping methods and techniques established by Professor Jane Plant [17]. The geochemical soil survey of south-west England [70] was completed in 2014, with the collection and preparation (drying and sieving to <2 mm) of 1154 shallow soils (5-20 cm) (Figure 2). The soils were collected at varying densities (ranging from one per 2 km² to one per 5 km², depending on underlying parent material homogeneity). Each sample was collected at a depth of 5-20 cm and made up of a composite of material from auger flights taken from five holes distributed within an area of approximately 20 m × 20 m [71]. The survey of south-west England also included high resolution sampling of the Tamar catchment in 2002 [72], shown by the more densely sampled area in Figure 2. X-Ray Fluorescence Spectrometry (XRFS) was used to determine the elemental content (Ag, Al, As, Ba, Bi, Br, Ca, Cd, Ce, Cl, Co, Cr, Cs, Cu, Fe, Ga, Ge, Hf, I, In, K, La, Mg, Mn, Mo, Na, Nb, Nd, Ni, P, Pb, Rb, S, Sb, Sc, Se, Si, Sm, Sn, Sr, Ta, Te, Th, Ti, Tl, U, V, W, Y, Yb, Zn, Zr) of each soil sample [73]. Details of the reference material quality assurance data obtained whilst running the soil samples are given in the supplementary information (Table S1) along with the XRFS detection limits (Table S2).
A subset of 50 representative samples was selected from the 1154 soils for bioaccessibility testing. The selection was carried out using clustering of the samples based on their elemental data measured by XRFS. The geochemistry data set was mean centred and scaled to unit variance, and hierarchical clustering was conducted using Euclidean distance and Ward's method. Nine clusters were found to be a parsimonious representation of the variation in geochemistry over the region and represented the underlying geology (Figure 1).
Between 5 and 6 samples were randomly chosen from each group to make up 50 samples for bioaccessibility testing. The identified samples, covering a range of As concentrations, were sieved to <250 µm (the size fraction considered to adhere to children's hands [74][75][76]). The total As concentration of the 50 subsamples (<250 µm) was determined by digestion of 0.25 g in a mixture of nitric, perchloric and hydrofluoric acids followed by inductively coupled plasma mass spectrometry (ICP-MS) analysis [77,78]. Five of the <250 µm subsamples were digested and analysed in duplicate along with the reference materials NIST 2711a and BGS 102. The digestion repeatability (RSD) was <5% and the extraction recovery of NIST 2711a and BGS 102 was 100% ± 10%. The locations of the samples selected for bioaccessibility testing and an interpolated plot of the total XRFS-measured As from the 1154 samples are shown in Figures 2 and 3, respectively.
Bioaccessibility testing using the two-stage (stomach and intestine) BioAccessibility Research Group of Europe (BARGE) Unified Bioaccessibility Method (UBM) [33,79] was carried out on the 50 subsamples (<250 µm). As part of the extraction procedure, 8 samples were extracted in duplicate along with 8 samples of the guidance material BGS 102. The repeatability of duplicate samples was <10% for all As measurements. The measured As and Pb bioaccessibility in BGS 102 was within reported guidance values [41,69,80].
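The sample-selection step described above (scale each element, cluster, then draw 5 or 6 samples at random from each of the nine clusters) can be illustrated with a minimal Python sketch. It assumes the cluster labels have already been produced by a separate Ward clustering step, which is not reproduced here. The function names, the random seed, the input values and the rule for deciding which clusters contribute six rather than five samples are all hypothetical and are not taken from the study.

```python
import random
import statistics

def standardise(column):
    """Mean-centre one element's concentrations and scale to unit variance."""
    mu = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(x - mu) / sd for x in column]

def pick_per_cluster(labels, total, seed=0):
    """Randomly draw `total` sample indices spread evenly over the clusters."""
    rng = random.Random(seed)
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    base, extra = divmod(total, len(groups))
    chosen = []
    for rank, label in enumerate(sorted(groups)):
        # Assumption: the first `extra` clusters contribute one more sample.
        size = base + (1 if rank < extra else 0)
        chosen.extend(rng.sample(groups[label], size))
    return sorted(chosen)

# Hypothetical inputs: 1154 samples already assigned to nine clusters.
labels = [i % 9 for i in range(1154)]
picked = pick_per_cluster(labels, total=50)
per_cluster = [sum(1 for i in picked if labels[i] == c) for c in range(9)]

# Hypothetical concentrations for one element, scaled before clustering.
scaled = standardise([4.7, 12.0, 35.0, 150.0, 1948.0])
```

With nine clusters and a target of 50, the even split gives five samples from four clusters and six from the remaining five, which matches the "between 5 and 6" description.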
A Self Modelling Mixture Resolution (SMMR) approach [69] was applied to the total soil element chemistry to identify the intrinsic soil constituents (ISCs) of south-west England (Figure 4) using the Matlab programming language. An ISC is an assemblage of soil particles from a common biogenic, geogenic or anthropogenic input, with a consistent chemical composition present at varying concentrations in a number of similarly developed soils [81]. Application of the SMMR algorithm to the multi-element XRFS data set resulted in the identification of 22 ISCs for the south-west of England (using the Akaike Information Criterion [82,83]). The SMMR algorithm gives the chemical composition of each ISC and the proportion of each ISC in each sample. The ISCs are named using a combination of the chemical symbols of the elements that make up more than 10% of the ISC. Using a combination of the chemical composition and the spatial distribution of each ISC, it is possible to tentatively identify the source of specific ISCs. An example of this for the south-west is given in the Supplementary Information (Figures S1 and S2).
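The ISC naming rule (join the symbols of the elements that make up more than 10% of the constituent) can be sketched in Python as follows. The function name is hypothetical, ordering the symbols by decreasing proportion is an assumption, and the composition values are illustrative round numbers rather than measured data. Numeric suffixes such as the ".1" in Si.Fe.1 are not handled.

```python
def isc_name(composition, threshold=10.0):
    """Join the symbols of elements above `threshold` percent, largest first."""
    major = [(pct, element) for element, pct in composition.items() if pct > threshold]
    major.sort(key=lambda pair: -pair[0])  # assumed ordering: largest share first
    return ".".join(element for _, element in major)

# Illustrative composition in percent (hypothetical values).
example = {"Fe": 50.0, "Al": 40.0, "As": 2.0, "Si": 1.5, "Sn": 0.8, "Cu": 0.7, "W": 0.6}
name = isc_name(example)
print(name)
```

Only Fe and Al exceed the 10% threshold in this example, so minor elements such as As do not appear in the name even when they are of primary interest.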
The statistical analysis was carried out using the R programming language [84]. Spatial interpolation and visualisation of the total As, the ISCs associated with bioaccessible As and the predicted bioaccessible As were carried out using the Thin Plate Spline regression (TPS) as implemented in the "fields" package in R [85].

Total As Concentrations
The range of total As concentrations (mg·kg−1) for this study, as measured by XRFS, is shown as an interpolated map in Figure 3. Summary statistics for As in the whole data set and the selected samples are shown in Table 1. A Wilcoxon signed rank test between the selected sample XRFS data and the digest data gave a p-value of 0.003.
The median and the MAD (median absolute deviation) for all three data sets are similar, but skew values larger than 1 indicate large positive tails on the data distributions, which is why the mean values are considerably larger than the medians. Although a Wilcoxon signed rank test (used to test equality of non-normally distributed data [86]) shows a significant difference between the selected sample XRFS data and the digest data, a linear regression of the digest data on the XRFS data gives a slope not significantly different from 1 and an intercept not significantly different from 0, with an r-square of 0.987. Figure 5 shows comparative combined box and whisker plots and violin plots, with the former showing the medians and interquartile ranges and the latter providing a visualisation of the data density. Figure 6 provides information on the relationship between total and bioaccessible As. The summary data in Table 1 and the data distribution plots in Figure 7 suggest that the 50 samples selected for bioaccessibility testing are representative of the larger data set.
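The robust summary statistics discussed above can be illustrated with a short Python sketch. It shows how a single high value in a right-skewed data set pulls the mean well above the median while the median and MAD barely respond. The concentrations are hypothetical, the MAD is left unscaled, and the skewness is the simple moment-based estimate, which may differ from the estimator used for Table 1.

```python
import statistics

def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    centre = statistics.median(values)
    return statistics.median([abs(v - centre) for v in values])

def skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mu = statistics.fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed As concentrations in mg/kg.
arsenic = [5, 8, 10, 12, 15, 18, 20, 25, 40, 300]

summary = {
    "median": statistics.median(arsenic),
    "mad": mad(arsenic),
    "mean": statistics.fmean(arsenic),
    "skew": skewness(arsenic),
}
print(summary)
```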

Table 2 compares measured As against generic assessment criteria and shows that 25-32% of samples were above the guideline value for residential land uses. When compared to Normal Background Concentrations (NBC) information [87], 35.4-38% of the total As concentrations were greater than the principal domain NBC and 2.17-8% were above the mineralised domain NBC. Mean total As concentrations measured by XRFS in the <2 mm fraction for all of the GBASE samples (n = 1154) and for the subset of 50 samples identified for further bioaccessibility testing (XRFS and acid digestion) were compared to both generic assessment criteria (Category 4 Screening Level (C4SL)) and NBC, as shown in Table 2.
Figure caption: Relationship between total As in the <250 µm particle size fraction and the bioaccessible As fraction. The histograms on the plot show the separate marginal distributions of the bioaccessible As fraction and the total As in the selected soils.

Bioaccessible As Concentrations
The bioaccessible As concentration in the soil represents the amount of As that would become solubilised in the human gastrointestinal tract and be available for absorption into the systemic circulation [32]. Whilst this is a conservative estimate of the true bioavailable concentration, i.e., the actual amount of As absorbed into systemic circulation, it is a much more realistic concentration than the total As concentration in the soil for the purposes of human health risk assessment. The bioaccessible fraction (BAF) is the amount of bioaccessible As expressed as a percentage of the total As in the soil. Whilst this is a useful measurement of the relative mobility of the As in a particular soil, it does not provide any information on the absolute magnitude of total or bioaccessible As.
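The definition of the BAF given above can be written as a one-line calculation. The following Python sketch uses a hypothetical function name and two hypothetical soils, and shows why the BAF alone says nothing about absolute magnitude: both soils have the same BAF although their bioaccessible As differs by a factor of 50.

```python
def bioaccessible_fraction(bioaccessible, total):
    """BAF in percent: bioaccessible As as a share of total As (same units)."""
    if total <= 0:
        raise ValueError("total As must be positive")
    return 100.0 * bioaccessible / total

# Two hypothetical soils, concentrations in mg/kg.
low_soil = bioaccessible_fraction(bioaccessible=2.0, total=20.0)
high_soil = bioaccessible_fraction(bioaccessible=100.0, total=1000.0)
print(low_soil, high_soil)
```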
The summary statistics of the bioaccessible As in the stomach and intestine are given in Table 1. A Wilcoxon signed rank test shows that there is a significant difference between the stomach and intestinal phases. A linear regression of intestine against stomach As yields an intercept not significantly different from 0, a slope of 0.85 ± 0.1 and an r-square of 0.987, suggesting that on average the intestinal As concentration is 85% of the stomach As. Table 2 shows that a far smaller percentage of the bioaccessible As values exceed the NBC and C4SL screening levels. Figure 6A shows the relationship between total As in the <250 µm fraction and the As in the stomach phase, displaying a gradual increase in bioaccessible As between total As values of 0-200 mg·kg−1 followed by a steep rise in bioaccessible As above 200 mg·kg−1. Figure 6B shows that below 200 mg·kg−1 As there is a relatively linear relationship between total and bioaccessible As, with on average 17% of the total As being bioaccessible, although the data become more scattered at total As values of more than 50 mg·kg−1. Figure 7 shows that there is no correlation between the bioaccessible fraction (BAF) and the total As in the <250 µm fraction (Spearman correlation <0.01). Most of the samples have BAF values between 4% and 16%, but there are five samples with BAF values higher than 20%. From consideration of the As data alone, we conclude that there is no simple relationship between the bioaccessible As fraction and the total As present in the soil, and that BAF values are not consistent across different concentration ranges (Figure 7). The absolute bioaccessible concentration is therefore a much better measure for comparison of As mobility between sites, particularly where there is a high variation in total As concentrations from site to site.
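The regression of intestinal on stomach phase As described above can be sketched with an ordinary least squares fit in plain Python. The data are hypothetical and constructed so that the intestinal value is exactly 85% of the stomach value, so the fit returns a slope of 0.85 and a zero intercept; real data would scatter about the line and would also need the uncertainty on the slope, which is not computed here.

```python
def ols(x, y):
    """Ordinary least squares for y = intercept + slope * x, with r-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical stomach phase As (mg/kg) and a noise-free intestinal response.
stomach = [1.0, 5.0, 20.0, 80.0, 200.0]
intestine = [0.85 * s for s in stomach]

slope, intercept, r_squared = ols(stomach, intestine)
print(slope, intercept, r_squared)
```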
In order to get a better understanding of the controls on the bioaccessible As in these soils we need to consider how the overall geochemical make-up of the soils affects As mobility.

Intrinsic Soil Constituents
Whilst it is possible to use the individual element concentrations in the soils as possible predictors of bioaccessible As, each individual element will be distributed across different physico-chemical components of the soils, e.g., minerals, mineral coatings, clay, organic matter. As already explained, the ISC data represent these soil components, and for this reason we preferred to use the ISC data, derived from the total element data, as they provide a better link to the individual physico-chemical components of the soils.
The SMMR algorithm used to calculate the ISCs also allows us to calculate the proportion of total As, summed over all samples, associated with each ISC. This provides information on the solid phase fractionation of As in soils from south-west England, which is summarised in an ordered bar plot (Figure 8). We are now interested in how much of the As associated with these ISCs is bioaccessible.

Figure 8. Solid phase distribution of total As between the intrinsic soil constituents (ISCs) in south-west England.

Regression Modelling of Bioaccessible As
An efficient way of modelling the data and selecting which ISCs, if any, control the bioaccessible As is the "lasso" linear regression algorithm as implemented in the "glmnet" package in the R programming language [88]. Lasso regression shrinks large regression coefficients in order to reduce overfitting, and performs covariate selection by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value. This shrinks certain coefficients to zero, effectively choosing a simpler model that does not include those predictors.
The masses of each ISC in each sample were used as the predictor variables. Figure 6A shows that the bioaccessible As has a nonlinear response to total As, and the summary statistics for bioaccessible As in Table 1 show that the data are right skewed. For this reason we have used a log10 transform of the stomach phase bioaccessible As as the dependent variable. A lasso regularised regression was then carried out, selecting the optimum tuning parameter λ by cross validation [88]. The glmnet lasso model was validated with a 10-fold cross validation, which gives a cross-validated mean square error of 0.0318 log10 bioaccessible As, equivalent to 1.08 mg·kg−1 bioaccessible As. Figure 9A shows a plot of the predicted vs measured bioaccessible As. Figure 9B shows a bar plot of the ISCs that were selected by the lasso model; the magnitude of each coefficient (since all predictor ISCs were mean centred and scaled prior to modelling) gives an approximate measure of the importance of the selected ISCs in predicting bioaccessible As.
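To make the shrinkage and selection behaviour of the lasso concrete, the following Python sketch fits a lasso by cyclic coordinate descent on a tiny synthetic data set. It is a hand-written illustration, not the glmnet implementation used in the study: the function names, the penalty value, the design matrix and the response are all hypothetical, and the tuning parameter is fixed rather than chosen by cross validation. The predictors are mean centred with unit variance, as in the text, and the response stands in for log10 bioaccessible As.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning exactly 0.0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*RSS + lam*sum|beta| by cyclic coordinate descent.

    Assumes every column of X is mean centred, so the intercept is mean(y).
    """
    n, p = len(y), len(X[0])
    intercept = sum(y) / n
    beta = [0.0] * p
    resid = [yi - intercept for yi in y]
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes predictor j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_ss[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return intercept, beta

# Hypothetical design: three centred, unit-variance, mutually orthogonal predictors.
x0 = [1, 1, 1, 1, -1, -1, -1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
X = [list(row) for row in zip(x0, x1, x2)]
# Response with a strong effect of x0, a weak effect of x1 and no effect of x2.
y = [1.0 + 2.0 * a + 0.1 * b for a, b, _ in X]

intercept, beta = lasso_cd(X, y, lam=0.25)
print(intercept, beta)
```

In this example the strong coefficient is shrunk from 2 to 1.75, while the weak and absent effects are set exactly to zero, which is the covariate selection described in the text.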
In a classical linear regression the p-value for each coefficient tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (<0.05) indicates that the null hypothesis can be rejected. In other words, a predictor that has a low p-value is likely to be a meaningful addition to the model because changes in the predictor's value are related to changes in the response variable. In a lasso regularised regression, the classical p-value cannot be applied. However, estimates of the p-values for this work [89] were obtained by bootstrapping the residuals of the lasso regression 1000 times and counting the number of times out of the 1000 realisations that a particular ISC predictor was given a zero coefficient (i.e., not selected for the model). Using this method (Figure 10), only four of the ISC coefficients had p-values <0.05, these being Fe.Al, Al.K.Si.Fe, Si.Fe.1 and Si.Al.1. All four are among the top seven total As containing ISCs, with the Fe.Al ISC containing the most As (Figure 8). Using the interpolated spatial distribution (Figure 11) and chemical composition of these four ISCs (see Supplementary Material Figures S3-S6) we can make tentative assignments of the source of these components and discuss why their As content contributes to the bioaccessible content of the soils.
Fe.Al: This ISC has the highest concentration of total As associated with it and has the third largest coefficient in the lasso prediction model. The composition plot (Figure S3) shows that it is made principally of Fe (ca. 50%) and Al (ca. 40%) with a low Si (<2%) content. This suggests that it is probably a mixed Fe/Al oxide. It also contains a high concentration of As (ca. 2%) as well as ca. 0.5-1% of Sn, Cu and W, which are all elements associated with mining in the south-west of England [90]. The interpolated spatial distribution (Figure 11) shows high values in the mineralised zones surrounding the granite intrusions (Figure 1), with hotspots surrounding known mining sites (e.g., Devon Great Consols and Hemerdon mines [90]). This suggests that this ISC is derived from in situ weathering of mineralised rock material and mine spoil.
Al.K.Si.Fe: This ISC has the 4th highest concentration of total As associated with it and has the largest coefficient in the lasso prediction model. The composition plot (Figure S4) shows that it is made principally of Al (ca. 30%), K (ca. 30%) and Fe (ca. 15%) with a Si content that is not well defined. It also contains ca. 0.3% Rb and 0.03% As. The high content of K and Rb suggests that this ISC is a clay related geochemical component. The spatial distribution of this component (Figure 11) shows that it is less localised to the mineralised area around the granite intrusions, suggesting that the As content of this component is derived from As mobilised from the high concentration mineralised rock/soil material (possibly by reductive dissolution [91]) which has then been re-absorbed onto soil clay minerals [92].
Si.Fe.1: The Si.Fe.1 ISC has the 3rd highest concentration of total As and the 5th highest coefficient in the lasso prediction model. The composition plot (Figure S5) shows that the Si content is not well defined (it has a very large uncertainty), with an Fe content of ca. 30-70%, suggesting that this is an Fe oxide dominated component with ca. 0.3% As and high concentrations of other metals (Pb 1%, Cu and Zn 0.5%). Like the Al.K.Si.Fe ISC, the Si.Fe.1 ISC is not localised to the mineralised areas (Figure 11), suggesting that it is a secondary Fe oxide component that has adsorbed previously mobilised As arising from the mineralised rock/soil.
Si.Al.1: The Si.Al.1 ISC has the 7th highest concentration of total As and the 2nd highest coefficient in the lasso prediction model. The composition plot (Figure S6) shows that it is probably a silica dominated component with a Si composition of ca. 40-70%. The presence of relatively high concentrations of Ca and Mg (ca. 7% and ca. 1.5% respectively) suggests carbonate is present as a coating on or within the silicate matrix. The bioaccessible As in this component can be explained by studies [93] which show that carbonate in groundwater can release As from Fe oxides into solution, which is then precipitated as As enriched carbonates.
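The residual bootstrap used above to estimate p-values for the lasso coefficients can be sketched as follows. This is a minimal illustration in Python rather than the R/glmnet workflow used in the study: the coordinate-descent solver, the synthetic data, the fixed penalty `lam` and all function names are assumptions made for the example, and 200 resamples are used instead of 1000 to keep it quick.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def standardise(X):
    """Mean-centre each column and scale it to unit (population) variance."""
    n, p = len(X), len(X[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        for i in range(n):
            out[i][j] = (col[i] - mean) / sd
    return out


def lasso_fit(X, y, lam, max_iter=500, tol=1e-9):
    """Coordinate-descent lasso for standardised X. Returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n          # optimal intercept when columns are centred
    beta = [0.0] * p
    resid = [yi - intercept for yi in y]
    for _ in range(max_iter):
        biggest_change = 0.0
        for j in range(p):
            # Covariance of predictor j with the partial residual (column variance is 1)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
                biggest_change = max(biggest_change, abs(delta))
        if biggest_change < tol:
            break
    return intercept, beta


def bootstrap_zero_fraction(X, y, lam, n_boot=200, seed=0):
    """Fraction of residual-bootstrap refits in which each coefficient is exactly zero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    intercept, beta = lasso_fit(X, y, lam)
    fitted = [intercept + sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    resid = [y[i] - fitted[i] for i in range(n)]
    zero_counts = [0] * p
    for _ in range(n_boot):
        # New response: fitted values plus residuals resampled with replacement
        y_star = [fitted[i] + rng.choice(resid) for i in range(n)]
        _, beta_star = lasso_fit(X, y_star, lam)
        for j in range(p):
            if beta_star[j] == 0.0:
                zero_counts[j] += 1
    return [count / n_boot for count in zero_counts]


# Synthetic demonstration (hypothetical data): only predictor 0 drives the response.
rng = random.Random(42)
X_demo = standardise([[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(60)])
y_demo = [1.0 + 2.0 * row[0] + rng.gauss(0.0, 0.1) for row in X_demo]
zero_frac = bootstrap_zero_fraction(X_demo, y_demo, lam=0.1, n_boot=200, seed=1)
print(zero_frac)
```

In this sketch a predictor that is dropped in fewer than 5% of the resamples would be treated as significant, mirroring the p < 0.05 threshold in the text. In a real analysis the penalty would be chosen by cross validation rather than fixed in advance.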

Bioaccessible Arsenic Prediction
Having used the regression model to provide information on the geochemical soil properties controlling As bioaccessibility, the model (based on the 50 selected samples) can be used to predict the bioaccessible As at all 1154 sampling locations, and these predictions can be interpolated and mapped (Figure 12). The As bioaccessibility for all sampling locations was predicted using the lasso regression model (see the regression modelling section) and the interpolation was carried out by Thin Plate Spline (TPS) regression as implemented in the "fields" package [85], in a similar manner to that used for the total As in soil map shown in Figure 3. The resulting map shows a very similar spatial distribution to the total soil As (Figure 3). This is not surprising, since the main source of both total As and bioaccessible As is the mineralised areas around the granite intrusions, which relate directly to the Fe.Al ISC (Figure 11), the ISC containing the highest proportion of the total As summed over all samples (Figure 8).
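The interpolation step described above can be sketched as follows. This is a minimal pure-Python illustration of two-dimensional thin plate spline interpolation, not the routine from the "fields" package: the coordinates, the predicted log10 values, the function names and the fixed `smooth` parameter are all hypothetical, and a production implementation would normally estimate the amount of smoothing from the data.

```python
import math


def tps_kernel(r):
    """Two-dimensional thin plate spline radial basis, r^2 * log(r), with value 0 at r = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


def fit_tps(points, values, smooth=0.0):
    """Fit spline weights and affine terms; smooth=0 interpolates the data exactly."""
    n = len(points)
    A = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            A[i][j] = tps_kernel(math.hypot(xi - xj, yi - yj))
        A[i][i] += smooth
        for k, term in enumerate((1.0, xi, yi)):
            A[i][n + k] = term  # affine block
            A[n + k][i] = term  # its transpose, enforcing the side conditions
    coef = solve_linear(A, list(values) + [0.0, 0.0, 0.0])
    return coef[:n], coef[n:]


def predict_tps(points, weights, affine, x, y):
    """Evaluate the fitted spline at location (x, y)."""
    a0, ax, ay = affine
    total = a0 + ax * x + ay * y
    for (xi, yi), w in zip(points, weights):
        total += w * tps_kernel(math.hypot(x - xi, y - yi))
    return total


# Hypothetical sampling locations (arbitrary units) and model-predicted log10 bioaccessible As
locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.3), (0.2, 0.8)]
log10_pred = [0.5, 1.2, 0.8, 1.9, 1.0, 0.7]
weights, affine = fit_tps(locations, log10_pred)

# Interpolate on a coarse 5 x 5 grid and back-transform from log10 to mg/kg
grid = [(gx / 4, gy / 4) for gy in range(5) for gx in range(5)]
surface = [10 ** predict_tps(locations, weights, affine, gx, gy) for gx, gy in grid]
```

Interpolating on the log10 scale and back-transforming afterwards keeps the mapped concentrations positive, which is one reason this ordering was chosen for the sketch.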
It must be emphasised that the bioaccessible map is only a first order approximation since: (i) The regression model only looks at linear relationships and does not take into account interaction effects; (ii) The regression model does not take into account spatial correlation; (iii) The thin plate spline is not a full spatial model which takes into account the actual spatial variability in the data (e.g., as measured by a variogram [94]); (iv) No attempts have been made to quantify the overall uncertainty in the predicted values.
Further studies will be required to address these issues when considering the production of a hazard map for As bioaccessibility in soil where accuracy of magnitude and spatial location will be required to delineate areas which are likely to exceed screening guidelines ( Table 2).

Conclusions
This study has shown that bioaccessibility measurements can be directly linked to the geochemical properties of soils and that, by taking a regional view, an understanding can be gained of the main processes controlling the bioaccessibility and the magnitude of bioaccessible As concentrations in soils from south-west England. In summary, it seems that the primary source of bioaccessible As is soils developed directly over the mineralised areas surrounding the granite intrusions. Secondary sources of bioaccessible As are derived from As that has been mobilised from the primary mineralised source and then re-absorbed onto clay material, Fe oxides and carbonate coatings. This information can be of direct use for land development, since land contamination can affect the health of people living, working, visiting or otherwise present on a site. The range of bioaccessible As concentrations found provides an important input parameter to the risk assessment process which is used to establish whether there is an unacceptable risk to humans. As well as the overall magnitude of the bioaccessible As, the information on chemical form can also be useful for evaluating how land use may affect the mobility of As. For example, if land use practices provide reducing soil conditions, As can be mobilised from Fe oxides [91], and if the pH of the soil is lowered, As could be liberated from carbonate coatings.
In addition to the information on the magnitude and mobility of As in soil, this study provides a template for spatial prediction of bioaccessible element concentrations from geochemical soil surveys where it is not feasible to measure the bioaccessibility on every sample by: (i) Taking a representative selection of soil samples, based on their geochemistry, from the total geochemical survey; (ii) Measuring the bioaccessibility for the element(s) under study in the selected samples using a robust, validated and well documented method [79]; (iii) Establishing a predictive relationship between the bioaccessible element concentration and the total geochemical composition of the soil using a suitable modelling method; (iv) Predicting the bioaccessible element concentration for all of the soil samples; (v) Spatially modelling or interpolating the predicted bioaccessible element concentrations over the region covered by the geochemical soil survey.
Further work is required to refine the spatial modelling process to take into account non-linearity, interaction effects, spatial correlation and prediction uncertainty.
Supplementary Materials: The following are available online at http://www.mdpi.com/2075-163X/8/12/570/s1, in a Word document containing: an example ISC from the south-west England data (Figures S1 and S2); plots of the four geochemical components which contribute to the bioaccessible As content of the soils (Figures S3-S6); and quality assurance data for XRFS analysis of soils, consisting of reference material data and detection limit information (Tables S1 and S2).
Author Contributions: J.W. and M.C. designed the project; T.R.L. was responsible for overseeing the soil sampling and the XRFS analysis and associated data quality assurance; E.H. undertook the bioaccessibility testing; M.C. carried out the statistical analysis and modelling; J.W. wrote the initial draft of the paper, which was further modified by J.W. and M.C. after further discussions with all the authors.
Funding: This research received no external funding.