Spectroscopic Diagnosis of Arsenic Contamination in Agricultural Soils

This study investigated the abilities of pre-processing, feature selection and machine-learning methods for the spectroscopic diagnosis of soil arsenic contamination. The spectral data were pre-processed using Savitzky-Golay smoothing, first and second derivatives, multiplicative scatter correction, standard normal variate, and mean centering. Principal component analysis (PCA) and the RELIEF algorithm were used to extract spectral features. Machine-learning methods, including random forests (RF), artificial neural network (ANN), and radial basis function- and linear function-based support vector machines (RBF-SVM and LF-SVM), were employed to establish diagnosis models. The model accuracies were evaluated and compared using overall accuracies (OAs). The statistical significance of the difference between models was evaluated using McNemar's test (Z value). The results showed that the OAs varied with the combination of pre-processing, feature selection, and classification methods. Feature selection methods could improve the modeling efficiencies and diagnosis accuracies, and RELIEF often outperformed PCA. The optimal models established by RF (OA = 86%), ANN (OA = 89%), RBF-SVM (OA = 89%) and LF-SVM (OA = 87%) showed no statistically significant difference in diagnosis accuracy (Z < 1.96, p > 0.05). These results indicated that it was feasible to diagnose soil arsenic contamination using reflectance spectroscopy. An appropriate combination of multivariate methods was important for improving diagnosis accuracies.


Introduction
Soil heavy metal contamination demands effective methods for diagnosing suspected contaminated areas and controlling the rehabilitation process. There is increasing interest in using visible and near-infrared reflectance spectroscopy (VNIRS, 350-2500 nm) to measure soil heavy metal contents and to map their spatial distribution [1], since this technique provides a non-destructive, rapid, and cost-effective method for measuring several soil properties from a single scan, and requires minimal sample preparation and hazardous chemicals [2].
The spectroscopic measurement of heavy metals is usually feasible because of their indirect relationships with spectrally active soil properties, such as organic matter, iron oxides or clays [1]. Therefore, the spectral information for soil heavy metal estimation is weak and indirect. For soil heavy metal monitoring, contamination remediation, or digital soil mapping, the diagnosis of soil heavy metal contamination may be sufficient, and accurate heavy metal content estimates may not be required. However, at present, soil reflectance spectroscopy is rarely employed to qualitatively diagnose soil heavy metal contamination. To the best of our knowledge, Bray et al. [23] were the first to employ an ordinal logistic regression technique to diagnose Cd, Cu, Pb and Zn contamination in urban soils from reflectance spectra. Therefore, it is interesting and necessary to extend the knowledge about the diagnosis of soil heavy metal contamination using soil reflectance spectroscopy.
In China, arsenic content has continuously increased in agricultural soils during the past 30 years, because of some anthropogenic activities, such as chemical fertilizers, arsenic-bearing pesticides, animal manures, mining, smelting, and irrigation with arsenic-contaminated water [26]. Excessive arsenic accumulation in agricultural soils can hinder the crops' growth and decrease the yield and quality of agricultural products. Moreover, as a potent carcinogen, arsenic might pose a serious health threat to the human body, such as malignant arsenical skin lesions, respiratory disease, gastrointestinal disorder, liver malfunction, nervous system disorder and haematological diseases [27].
Given the importance of monitoring arsenic contamination in agricultural soils, this study aimed to compare the abilities of pre-processing techniques (derivative transformations, MSC, SNV, MC) and machine-learning techniques (random forests (RF), ANN, and SVM) in diagnosing soil arsenic contamination from soil reflectance spectroscopy, and to investigate whether the feature selection approaches (PCA and RELIEF) could improve the diagnosis accuracy by using different machine-learning methods. The result of this study is expected to establish a technical process for diagnosing soil heavy metal contamination by using soil reflectance spectroscopy.

Soil Samples
In total, 195 historical soil samples collected in the Yixing and Zhongxiang regions were used for this work. Yixing (Figure 1b) is located in the south of Jiangsu Province, China, with a mean annual temperature of 15.7 °C and a mean annual precipitation of 1177 mm. Zhongxiang (Figure 1c) is situated in the middle of Hubei Province, China, and its mean annual temperature is 15.0 °C with a mean annual precipitation of 961 mm. Yixing's dominant soil types are dystric cambisols, lixisols, anthrosols, alisols, calcaric fluvisols, calcisols, cambisols and gleysols for different crop cultivation [20]. The soils collected from Zhongxiang mainly belong to anthrosols for rice planting [28]. At each sample site, surface soils (0-10 cm) were collected. The industrial wastewater, exhaust gas or waste residues produced by local chemical factories are the major causes of arsenic contamination in agricultural soils in the Zhongxiang region [28]; in Yixing, the contamination may mostly result from sewage irrigation, parent materials or vehicle exhausts [29].

Laboratory Spectrum and Soil Arsenic Content Measurement
Soil samples were air-dried and ground in a mechanical agate grinder to a particle size of ≤2 mm. The diffuse reflectance spectra were measured using the FieldSpec3 portable spectroradiometer (ASD Inc., now PANalytical Company, Boulder, CO, USA) with a spectral range of 350 to 2500 nm. The spectral measurements were conducted in a dark room. The air-dried and ground soil sample was placed in a 10 cm diameter petri dish with a thickness of approximately 15 mm. A 50 W halogen lamp was used as the light source, positioned 30 cm away from the soil sample at a 15° zenith angle [20]. The optical probe was installed about 15 cm above the soil sample. A Spectralon panel (Labsphere, North Sutton, NH, USA) was used for white referencing once every six measurements.
After spectral measurement, soil samples were further ground and passed through a 100-mesh sieve (0.15 mm). The finely ground soil samples were digested with HF-HClO4-HNO3. The arsenic contents of the digested samples were then analyzed using a hydride generation atomic fluorescence spectrometry (HG-AFS) method [30]. Certified soil reference materials (GBW 07401, GBW 07402, and GBW 07407, National Research Center for Certified Reference Materials of China) were used to verify the precision of the HG-AFS method.
For the purpose of diagnosis, the measured soil arsenic contents were coded into binary 0 or 1, describing uncontaminated or contaminated samples, respectively. The index of geo-accumulation (I_geo) [31] was applied to assess the arsenic contamination in the soils:
$$I_{geo} = \log_2\left(\frac{M_{As}}{1.5\,B_{As}}\right) \qquad (1)$$
where M_As is the measured arsenic content in the soil, B_As is the geochemical background value of arsenic (13 mg·kg⁻¹), and the constant of 1.5 is used to eliminate fluctuations caused by regional differences and anthropogenic influences [31]. I_geo ≤ 0 indicates practically uncontaminated, whereas I_geo > 0 means contaminated [31].
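The binary coding described above can be sketched in a few lines. This is only an illustration: the function names `igeo` and `is_contaminated` are hypothetical, and the formula assumed is the standard geo-accumulation index, log2 of the measured content divided by 1.5 times the background value, with the 13 mg·kg⁻¹ background taken from the text.

```python
import math

B_AS = 13.0  # geochemical background value of arsenic in mg/kg (from the text)


def igeo(m_as, b_as=B_AS):
    """Index of geo-accumulation: log2(M_As / (1.5 * B_As))."""
    return math.log2(m_as / (1.5 * b_as))


def is_contaminated(m_as, b_as=B_AS):
    """Return 1 (contaminated) if I_geo > 0, otherwise 0 (uncontaminated)."""
    return 1 if igeo(m_as, b_as) > 0 else 0


# Minimum, mean and maximum arsenic contents reported for the 195 samples (mg/kg).
for value in (1.91, 18.13, 133.36):
    print(value, round(igeo(value), 2), is_contaminated(value))
```

Under this coding, the contamination threshold corresponds to an arsenic content of 1.5 × 13 = 19.5 mg·kg⁻¹.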

Pre-Processing Transformations
The measured soil arsenic contents and their corresponding spectra were divided into training (n = 98) and test (n = 97) data sets using the Kennard-Stone algorithm [32], which is effective for selecting spectrally representative samples for model development. The reflectance spectra were first trimmed to 400-2450 nm to remove the noisy wavelengths at the spectral edges. The reflectance spectra were then SG smoothed with a moving window of 9 nm. The smoothed spectra were resampled to 10 nm intervals (i.e., 400, 410, 420 nm, and so on) using a Gaussian model to eliminate data redundancy [4]. Moreover, first and second derivatives, MSC, SNV and MC were applied to the soil spectra to enhance spectral features and to support robust diagnosis models. Reflectance spectra were transformed into log(1/Reflectance) before MSC and SNV were performed.
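Two of the steps above, the SNV transformation and the Kennard-Stone selection, can be sketched as follows. This is a minimal pure-Python illustration on toy data, assuming Euclidean distances between spectra and at least two samples to select; the function names are hypothetical and it is not the implementation used in the study.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]


def kennard_stone(spectra, n_select):
    """Return the indices of n_select (>= 2) samples chosen by the Kennard-Stone rule."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(spectra)
    d = [[dist(spectra[i], spectra[j]) for j in range(n)] for i in range(n)]
    # Start with the two most distant samples.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda pair: d[pair[0]][pair[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select and remaining:
        # Add the sample whose nearest already-selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy "spectra" with two bands each; real spectra would have one value per band.
toy = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0], [9.0, 0.0]]
train_idx = kennard_stone(toy, 3)
test_idx = [k for k in range(len(toy)) if k not in train_idx]
print(train_idx, test_idx)
print(snv([1.0, 2.0, 3.0]))
```

Samples that are not selected form the test set, so the training set spans the spectral space as evenly as possible.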

Feature Selection
PCA and the RELIEF algorithm were applied to extract features from the spectral variables of the training data-set. PCA is an optimal linear scheme for extracting several principal components (PCs) from high-dimensional variables, and the extracted components retain the majority of the variables' information. The RELIEF algorithm, first described by Kira and Rendell [33], was used as a simple, fast and effective approach to weigh variables. Its output is a ranking weight between −1 and 1 for each spectral variable, in which more positive weights indicate more predictive spectral variables. In this study, PCA and the RELIEF algorithm were implemented in Weka (Waikato Environment for Knowledge Analysis). The number of PCs was determined by the diagnosis accuracy of the calibration. The threshold for the RELIEF weight value was set to 0, and scattered spectral bands with local extreme weights were selected as spectral features to avoid multicollinearity among RELIEF-selected features.
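The idea behind RELIEF weighting can be illustrated with a simplified two-class sketch. It assumes range-normalized absolute differences and visits every sample once instead of drawing random samples, so it is a deterministic variant for illustration and not the Weka implementation; the function name and the toy data are hypothetical. Each class needs at least two samples.

```python
def relief_weights(X, y):
    """Relief-style weights for a two-class problem; one weight in [-1, 1] per variable."""
    n, p = len(X), len(X[0])
    # Scale differences by each variable's range so that they fall in [0, 1].
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(p)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / span[j]

    def nearest(i, same_class):
        candidates = [k for k in range(n)
                      if k != i and (y[k] == y[i]) == same_class]
        return min(candidates,
                   key=lambda k: sum(diff(j, X[i], X[k]) for j in range(p)))

    w = [0.0] * p
    for i in range(n):  # every sample is used once (deterministic variant)
        hit, miss = nearest(i, True), nearest(i, False)
        for j in range(p):
            # Reward variables that differ at the nearest miss,
            # penalize variables that differ at the nearest hit.
            w[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return w


# Toy data: variable 0 separates the classes, variable 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
selected = [j for j, w in enumerate(weights) if w > 0]  # threshold of 0, as in the text
print(weights, selected)
```

The additional step of keeping only bands with local extreme weights is not shown here.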

Multivariate Diagnosis Analysis
Machine-learning methods, namely RF, ANN and SVM, were employed for calibrating diagnosis models using the training data set. For brevity, only short summaries of these techniques are provided here, together with key references in which interested readers may find more details. In this study, the machine-learning methods were implemented using the R-based Rattle package developed by Williams [34].

Random Forests (RF)
RF, introduced by Breiman [35], is an ensemble learning method that constructs a multitude of decision trees. For the RF learner, each tree is independently trained on a randomized bootstrap sample of the entire training data set, and a subset of explanatory variables is randomly selected for the node-splitting rules in each tree [36]. In classification, the predicted class is decided by a majority vote of the trees [35]. RF depends on only two user-defined parameters: the number of variables in each random subset (nv) and the number of trees in the forest (nt). In this study, nv was optimized from 1 to the total number of variables in increments of 1, and nt from 0 to 1000 in increments of 10. The variables that are important for RF modeling can be identified by their mean decrease GINI values.
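To make the mean decrease GINI measure more concrete, the sketch below computes the Gini impurity of a node and the impurity decrease produced by one split, plus the majority vote used for classification. In a forest, these decreases are accumulated for each variable over all nodes that split on it and averaged over the trees. The function names are hypothetical and the data are toy values.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(values, labels, threshold):
    """Impurity decrease when a node is split into values <= threshold and values > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - weighted


def majority_vote(tree_predictions):
    """Class predicted by most trees for one sample."""
    return max(set(tree_predictions), key=tree_predictions.count)


# A perfectly separating split removes all impurity from a balanced node.
print(gini_decrease([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], 0.5))
print(majority_vote([1, 0, 1]))
```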

Artificial Neural Network (ANN)
The concept of the ANN learner dates back to the 1940s, when McCulloch and Pitts [37] initially planned to develop a virtual "central nervous system" for computer modeling. The design of an ANN simulates data processing in biological nervous systems. The structure of an ANN consists of a set of interconnected neurons. Some neurons receive information, others forward and store it, and another group releases information outward [38]. Neurons are connected to each other through weighted synapses. In an ANN, the number of hidden layers and the number of neurons in each hidden layer ought to be optimized [21]. In this study, the number of hidden layers was optimized by iterating this parameter from 1 to 20, and the number of neurons in each layer was set to the total number of variables.

Support Vector Machine (SVM)
SVM is a kernel-based machine-learning method developed on the basis of statistical learning theory [39]. SVM applies a kernel function to map training data into a higher-dimensional feature space, and computes a separating hyperplane that achieves maximum separation (margin) between the classes [40]. The maximum-margin hyperplane is determined by the training samples that lie on the margin, which are called support vectors. The quality of the SVM classifier is affected by the type of kernel function, the kernel width (γ) and the regularization parameter (C) [40]. In this study, the radial basis function (RBF) and the linear function (LF) were adopted as kernel functions.
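The two kernel functions can be written down directly. The sketch below assumes the common parameterization K(x, z) = exp(−γ‖x − z‖²) for the RBF kernel and the plain dot product for the linear kernel; the exact parameterization used by the software in this study is not stated in the text, so this form is an assumption, and the function names are hypothetical.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = [0.2, 0.5, 0.1], [0.3, 0.4, 0.1]
print(linear_kernel(x, z))
print(rbf_kernel(x, z, gamma=0.06))
```

A larger γ makes the RBF kernel decay faster with distance, so each support vector influences a smaller neighbourhood, while C controls the penalty for misclassified training samples.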

Validation and Comparison of Diagnosis Models
The calibrated models were applied to diagnose the contaminated and uncontaminated soil samples of the test data-set. The overall accuracy (OA, Equation (2)) [38] of the test data-set was calculated and used to compare the diagnosis abilities of the multivariate methods. The same computer environment was kept for running the different machine-learning algorithms.
$$OA = \frac{pp + nn}{pp + np + pn + nn} \times 100\% \qquad (2)$$
where the meanings of pp, np, pn and nn are displayed in Table 1. The statistical significance of the difference between diagnosis models was evaluated using McNemar's test [41], which is based on a binary distinction between correct and incorrect class allocations (Table 2). McNemar's test is based on the standardized normal test statistic expressed in Equation (3):
$$Z = \frac{f_{12} - f_{21}}{\sqrt{f_{12} + f_{21}}} \qquad (3)$$
where f_12 is the number of test samples correctly diagnosed by the first model but misdiagnosed by the second, and f_21 is the number misdiagnosed by the first model but correctly diagnosed by the second. Therefore, the test is focused on the cases that are correctly diagnosed by one classifier but misdiagnosed by the other. Two diagnosis models exhibit different accuracies at the 95% level of confidence if |Z| > 1.96.
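Both validation measures are easy to compute from predicted labels. The sketch below assumes the form of McNemar's statistic commonly used for classifier comparison, Z = (f12 − f21)/√(f12 + f21), where f12 and f21 count the samples that only the first or only the second model diagnoses correctly. The function names and the toy labels are hypothetical.

```python
import math


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the observed class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mcnemar_z(y_true, pred_a, pred_b):
    """McNemar's Z for two models evaluated on the same test samples."""
    triples = list(zip(y_true, pred_a, pred_b))
    f12 = sum(1 for t, a, b in triples if a == t and b != t)  # only model A correct
    f21 = sum(1 for t, a, b in triples if a != t and b == t)  # only model B correct
    if f12 + f21 == 0:
        return 0.0  # the two models never disagree on correctness
    return (f12 - f21) / math.sqrt(f12 + f21)


# Toy labels: 1 = contaminated, 0 = uncontaminated.
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
pred_a = [1, 1, 0, 0, 1, 0, 1, 1]
pred_b = [1, 0, 0, 1, 1, 0, 1, 0]
z = mcnemar_z(y_true, pred_a, pred_b)
print(overall_accuracy(y_true, pred_a), overall_accuracy(y_true, pred_b))
print(z, abs(z) > 1.96)
```

Samples that both models diagnose correctly, or both misdiagnose, do not enter the statistic, so two models with clearly different OAs can still be statistically indistinguishable on a small test set.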

Soil Arsenic and the Spectra
The percent mean standard error of the HG-AFS method for arsenic determination was 2.9%. The descriptive statistics of soil arsenic of the 195 soil samples are shown in Table 3. For the total data set, the soil arsenic contents varied from 1.91 to 133.36 mg·kg⁻¹, with a mean of 18.13 mg·kg⁻¹ and a standard deviation of 18.67 mg·kg⁻¹. Considering I_geo values, 27%, 26% and 29% of samples were contaminated by arsenic in the total, training and test data sets, respectively. The mean value and standard deviation of original and pre-processed spectra for contaminated and uncontaminated soil samples are shown in

Principal Components and RELIEF Selected Features
The first three loadings of the PCA analysis for original reflectance and pre-processed spectra were displayed in Figure 2. The score plots showed that the spectral space of the contaminated samples fell into those of uncontaminated samples. This meant that the linear classifier might be unable to effectively diagnose contaminated or uncontaminated soil samples by using principal components.
The RELIEF weights and the selected spectral features are displayed in Figure 3. The RELIEF weights of the MC spectra (Figure 3b

Comparison of the Abilities of Different Methods
The operation times, parameter settings, and validated OAs for the diagnosis models using different methods are listed in Table 4. The results showed that (1) a suitable combination of pre-processing and feature selection was vital to improve the OAs of each machine-learning method; (2) the feature selection methods, PCA and RELIEF, could improve modeling accuracies and decrease the operation times of modeling, and RELIEF often outperformed PCA; (3) derivative transformation often resulted in the best diagnosis models. The optimal models for RF, ANN, LF-SVM and RBF-SVM are described in the following subsections.

Figure 3. RELIEF weights and the selected spectral features for original reflectance spectra (a), mean centering spectra (b), standard normal variate spectra (c), multiplicative scatter correction spectra (d), first derivative spectra (e), and second derivative spectra (f). The threshold of RELIEF weight was set to 0 (horizontal dashed lines).

Table 4. The operation times, parameter settings, and overall accuracies for diagnosis models using different pre-processing, feature selection and machine-learning methods.

RF
The optimal pre-processing method for the RF model was the second derivative. The best RF model was calibrated using 12 RELIEF-selected spectral features, and the optimized nv and nt of the RF model were 3 and 50, respectively. The mean decrease GINI values (Figure 4) showed the importance of the spectral features for RF modeling in descending order as 2150, 810, 1400, 670, 1890, 2220, 1290, 570, 990, 750, 1570 and 1990 nm. The validated OA for the RF model was 86%, which means that the RF model correctly diagnosed 86% of soil samples in the test data-set (Figure 5a).

ANN
The optimal pre-processing method employed for ANN modeling was the first derivative; PCA was selected as the feature selection method, and the number of hidden layers was three. The factor number for modeling was eight, and the first eight PCs explained approximately 99% of the variation of the spectral data. The ANN model correctly diagnosed 89% of soil samples in the test data-set (Figure 5b).

SVM
The second derivative was the optimal pre-processing method for RBF-SVM, and the first derivative was the optimal pre-processing method for LF-SVM. The optimized C and γ for RBF-SVM were 1 and 0.06, respectively, while the optimized C for LF-SVM was 1. By adopting 12 RELIEF-selected spectral features, the RBF-SVM model correctly diagnosed 89% of soil samples in the test data-set (Figure 5c), and the LF-SVM model correctly diagnosed 87% of soil samples using the RELIEF-selected spectral features (Figure 5d). Figure 5 displays the predicted values of samples in the test data-set for the four optimal diagnosis models. McNemar's test applied to these diagnosis models showed that the Z values were all less than 1.96 (Table 5), which indicated that there was no statistically significant difference in the diagnosis abilities of these optimal diagnosis models (p > 0.05).

Discussion
In this study, with the combination of pre-processing, feature selection and machine-learning methods, the OAs for soil arsenic contamination diagnosis reached a satisfactory level (OA > 85%). This result demonstrated that VNIRS could be applied to diagnose soil arsenic contamination, although developing the diagnosis models still depended on conventional methods to provide the ground truth of soil heavy metal contamination. Compared with conventional methods, VNIRS might allow faster and cheaper classification of soil heavy metal contamination over a larger spatial coverage, as suggested by Bray, Viscarra Rossel and McBratney [23].
This study demonstrated that, to establish robust diagnosis models, trial and error with various pre-processing methods was vital. Pre-processing methods, including SNV, MSC, first derivative, and second derivative, can be employed to eliminate the baseline drift caused by differences in particle size and optical setups [6]. Derivative transformations also enhance minor absorption features, which may be useful for improving the diagnosis abilities of models. Nevertheless, derivative transformation adds noise to the spectral data, and the noise grows with increasing derivative order [20]. Therefore, derivative transformations are often applied in conjunction with a smoothing algorithm to suppress noise [6]. Our research suggested that, compared with other pre-processing methods, derivative transformation was a more suitable pre-processing method for developing diagnosis models.
Feature selection methods could improve modeling accuracies by eliminating uninformative spectral variables and increase modeling efficiency by reducing the number of independent variables for modeling [10]. PCA extracted principal components from spectral variables without consideration of the dependent variable (i.e., soil arsenic contamination in this study). In contrast, RELIEF selected spectral features based on their contributions to the classification of the dependent variable [33]. This may explain why RELIEF often outperformed PCA for diagnosing soil arsenic contamination from hyperspectral spectra in this study. Based on these factors, we considered the RELIEF algorithm a more suitable method for selecting spectral features. Moreover, van Groenigen et al. [43] demonstrated that pre-processing methods could strongly influence the reflectance spectra, and they will therefore have an impact on the spectral features. Consistent with this, the results of this study indicated that pre-processing methods affected the RELIEF-selected spectral features (Figure 3).
The establishment of robust diagnosis models using different machine-learning methods (i.e., RF, ANN, LF-SVM, RBF-SVM) depends on the selection of appropriate pre-processing and feature selection methods. Our study demonstrated that the optimal models for these machine-learning methods had no statistically significant difference in diagnosis abilities; nevertheless, RF was preferable to the other machine-learning methods because of its simpler parameter optimization and better model interpretability. In this study, based on mean decrease GINI values, wavelengths at 2150, 810, 1400, and 670 nm can be identified as the four most important wavelengths for diagnosing arsenic contamination with the RF model. Wavelengths near 2150 and 810 nm relate to organic matter, and spectral features near 1400 and 670 nm coincide with wavelengths related mostly to iron oxides [42]. This suggests that the diagnosis of arsenic contamination might depend on its surrogate correlations with organic matter and iron oxides.
Over-fitting is a common problem in modeling. It means that the best diagnosis model for the training data-set will not work well for the test data-set. RF is robust against over-fitting. Breiman [35] observed that the generalization error of RF converged to a limit as the number of trees in a forest increased. Nevertheless, in the case of ANN, over-fitting is a serious problem [40]. RF is easily accessible to non-specialists because of its simplicity in parameter optimization. However, for SVM, a number of hyper-parameters need to be optimized for each kernel function [40], and this optimization also requires considerable knowledge of the frequently non-trivial underlying mathematics [40]. Moreover, complex machine-learning algorithms, such as SVM and ANN, are not easily interpretable in terms of the relationships between independent and dependent variables [44]. In contrast, RF, a method that performs a majority vote of tree-based classifiers, is explicit and comprehensible, revealing the important spectral variables for modeling [40]. Variable importance in RF can be evaluated by the increase in prediction error when the out-of-bag data are permuted for a certain variable, while keeping all other data constant. Considering these advantages, we regarded RF as a more efficient machine-learning method for modeling soil arsenic contamination levels.
This study investigated the abilities of laboratory reflectance spectroscopy to diagnose soil arsenic contamination. The field and air-/space-borne imaging spectroscopy have the potential to rapidly map heavy metal contamination over large areas [17,45,46]. Compared with laboratory spectroscopy, the application of field or imaging spectroscopy faces some constraints, such as soil surface, atmospheric and illumination conditions [47]. Therefore, the principles of this study should be further tested with field and imaging data.

Conclusions
The spectroscopic diagnosis of soil arsenic contamination is feasible, and the appropriate combination of pre-processing, feature selection and machine-learning methods is important for diagnosis accuracies. The RELIEF algorithm is a simple and efficient method for extracting spectral features to improve modeling efficiency and diagnosis accuracy. Compared with ANN and SVM, RF is a more suitable machine-learning method for developing diagnosis models, because of its simpler parameter optimization and better model interpretability.