Predicting Bioaccumulation of Potentially Toxic Element in Soil–Rice Systems Using Multi-Source Data and Machine Learning Methods: A Case Study of an Industrial City in Southeast China

Potentially toxic element (PTE) pollution in farmland soils and crops is a serious cause of concern in China. To analyze the bioaccumulation characteristics of chromium (Cr), zinc (Zn), copper (Cu), and nickel (Ni) in soil-rice systems, 911 pairs of top soil (0–0.2 m) and rice samples were collected from an industrial city in Southeast China. Multiple linear regression (MLR), support vector machines (SVM), random forest (RF), and Cubist were employed to construct models to predict the bioaccumulation coefficient (BAC) of PTEs in soil–rice systems and determine the potential dominators for PTE transfer from soil to rice grains. Cr, Cu, Zn, and Ni contents in soil of the survey region were higher than corresponding background contents in China. The mean Ni content of rice grains exceeded the national permissible limit, whereas the concentrations of Cr, Cu, and Zn were lower than their thresholds. The BAC of PTEs kept the sequence of Zn (0.219) > Cu (0.093) > Ni (0.032) > Cr (0.018). Of the four algorithms employed to estimate the bioaccumulation of Cr, Cu, Zn, and Ni in soil–rice systems, RF exhibited the best performance, with coefficient of determination (R2) ranging from 0.58 to 0.79 and root mean square error (RMSE) ranging from 0.03 to 0.04 mg kg−1. Total PTE concentration in soil, cation exchange capacity (CEC), and annual average precipitation were identified as top 3 dominators influencing PTE transfer from soil to rice grains. This study confirmed the feasibility and advantages of machine learning methods especially RF for estimating PTE accumulation in soil–rice systems, when compared with traditional statistical methods, such as MLR. Our study provides new tools for analyzing the transfer of PTEs from soil to rice, and can help decision-makers in developing more efficient policies for regulating PTE pollution in soil and crops, and reducing the corresponding health risks.


Introduction
Soil contamination arouse from potentially toxic element (PTE) such as chromium (Cr), lead (Pb), cadmium (Cd), mercury (Hg), arsenic (As), copper (Cu), zinc (Zn), nickel (Ni) in agricultural land has raised serious concerns worldwide [1][2][3][4][5][6][7], particularly in Land 2021, 10, 558 3 of 17 was constructed using the correlated quantitative covariates as predictors for comparison. The objectives of current research were to: (1) analyze the concentration and transfer characteristics of PTEs in soil-rice system; (2) build models to predict the transfer of PTEs from soil to rice grains using MLR, SVM, RF, and Cubist methods; (3) identify potential dominators of PTE bioaccumulation in soil-rice systems. The results are expected to aid the control of PTE contamination in soil and rice, as well as reduce the health risk of human consumption of such rice.

Study Area
The survey region was situated in Eastern China (120 • 55 -122 • 16 E, 28 • 51 -30 • 33 N) (Figure 1), occupying an area of around 9,800 square kilometers and with a population of 8.5 million. The survey city is one of the most developed cities in China, with the 12th highest gross domestic product (GDP) in 2019. The total industrial output value of Ningbo in 2019 was 578.3 billion Yuan. Three pillar industries in Ningbo are chemical industry, textile and clothing, and machinery industry. The survey region was situated in the lower reach of the Yangtze River in China, and has been an important area for rice cultivation, because of the favorable climate and terrain conditions. Rice plantations occupy more than 100,000 ha in this region and the rice yield was around 660,000 tons in 2018. Details of the study area have been provided Hu et al. [30].
Therefore, in the present study, we used three machine learning algorithms to build models to predict the bioaccumulation coefficient (BAC) of Cr, Cu, Zn, and Ni from soil to rice grains. A total of 20 covariates, including total soil PTE content, pH, CEC, and soil texture, were used as predictors to improve model performance. An MLR model was constructed using the correlated quantitative covariates as predictors for comparison. The objectives of current research were to: (1) analyze the concentration and transfer characteristics of PTEs in soil-rice system; (2) build models to predict the transfer of PTEs from soil to rice grains using MLR, SVM, RF, and Cubist methods; (3) identify potential dominators of PTE bioaccumulation in soil-rice systems. The results are expected to aid the control of PTE contamination in soil and rice, as well as reduce the health risk of human consumption of such rice.

Study Area
The survey region was situated in Eastern China (120°55'-122°16'E, 28°51'-30°33'N) (Figure 1), occupying an area of around 9,800 square kilometers and with a population of 8.5 million. The survey city is one of the most developed cities in China, with the 12 th highest gross domestic product (GDP) in 2019. The total industrial output value of Ningbo in 2019 was 578.3 billion Yuan. Three pillar industries in Ningbo are chemical industry, textile and clothing, and machinery industry. The survey region was situated in the lower reach of the Yangtze River in China, and has been an important area for rice cultivation, because of the favorable climate and terrain conditions. Rice plantations occupy more than 100,000 ha in this region and the rice yield was around 660,000 tons in 2018. Details of the study area have been provided Hu et al. [30].

Sampling and Chemical Analysis
Overall, 911 pairs of surface soil (0-0.2 m) and rice samples were collected in 2013. The rice samples were collected during the harvest season. The surface soil samples were collected after the harvest season at the same locations as rice samples. Each soil sample was a composite of 5 sub-samples collected from five locations within five meters using a stainless steel shovel. The sampling positions were documented using a GPS ( Figure 1). Here, we presented the number of surface soil and rice sample collected from the study area in grid with size of 2 kilometers (Figure 1). The pH was determined by a pHS-3C digital pH meter (Shanghai REX Sensor Technology Co., Ltd., China). Soil samples were digested in acid (HCl-HNO 3 -HClO 4 ), whereas the corresponding rice grain samples were digested using the dry ashing method, and the concentration of Cr, Cu, Zn, and Ni were measured by inductively coupled plasma optical emission spectrometry (ICP-OES 6300, Thermo Fisher Scientific, Waltham, MA, USA). Quality control and standard quality assurance were conducted using soil standard reference materials (GBW07403 and GBW07404 obtained from the National Standard Detection Research Center, Beijing, China), duplicate samples, and procedure blanks during the measurements. Reagent blanks were also included in the analyses for quality control, and element recoveries were in the range of 90% to 110% [39]. MLR [50] can predict the outcome of a response variable using multiple explanatory variables. MLR is used to construct model to describe linear relationships between the independent variables and the dependent variable. MLR has been described in detail by Myers (1990) [50].

Support Vector Machines (SVM)
SVMs aim to build a hyperplane or a set of hyperplanes in a high-or infinitedimensional space, which can then be used for classification and regression [51]. For regression analyses, SVMs can take the non-linearities of a natural system into account, including kernel functions, which function as constructing blocks for SVMs [52]. SVMs have previously been described by Cortes and Vapnik (1995) [51] and Bordoni et al. (2018) [53].

Random Forest (RF)
The RF is a non-linear, non-parametric algorithm developed by Breiman (2001) [54], which comprises a sequence of numerous individual predictor tree models trained from bootstrap samples of the data. The splits of each tree are decided based on a subset of predictor variables were chosen randomly from all available predictors [37]. The prediction results of all trees are then averaged to obtain the final prediction [54]. RF can also calculate the relative importance of a variable from the prediction error of out-of-bag (OOB) predictions [55]. RF perturbs each variable and estimates its importance as the change in the OOB error [54].

Cubist
Cubist is a rule-based data mining and prediction-oriented regression model [56,57]. It firstly creates a tree structure and then folds each path through the tree into a rule. After that, it fitted a regression model for each rule based on the subset of data described by the rules, and the prediction model is selected based on these rules. Prediction can be improved by generating several rule-based models, referred to as "committees", through boosting [58]. Subsequent models are built based on the corrections of the predictions made by previous models to minimize the prediction error.

Data Collection
We used 20 auxiliary variables, which have been proved to have effects on the transfer of PTEs from soil to the rice by previous studies to build a model to predict PTE bioaccumulation in soil-rice systems (Table 1) [36,40,[45][46][47][48][49]. The auxiliary variables included soil properties (such as soil organic matter, pH, soil bulk density, content of PTE in soil, cation exchange capacity, soil sand content, soil clay content, soil silt content, soil coarse fraction), climatic factors (such as annual temperature, annual precipitation), terrain attributes (such as elevation), agricultural management practices (such as amount of phosphate fertilizer applied annually, amount of organic fertilizer applied annually, amount of nitrogen fertilizer applied annually, amount of potash fertilizer applied annually), geology (soil group, parent material), and land use (land use types), and population density are detailed in Table 1.

Data Analysis
In this study the BAC for different PTEs was calculated to represent the migration of PTEs from soil to rice. The BACs were determined by: where C rice and C soil represent the total PTE content in rice grain and soil, respectively. Exploratory data analysis was performed using R (R Core Team, 2015). The MLR, RF, SVM, and Cubist algorithms in the caret package [62] for R were used to construct models for predicting BACs of PTEs in soil-rice systems. The flowchart of this study was presented in Figure 2.

PTE Content in Soil and Rice Grains
The summary characteristics of PTEs in soil and rice grains are described in Table 2. Zn exhibited the highest mean content (115.50 mg kg −1 ) in soil, followed by Cr (71.87 mg

PTE Content in Soil and Rice Grains
The summary characteristics of PTEs in soil and rice grains are described in Table 2. Zn exhibited the highest mean content (115.50 mg kg −1 ) in soil, followed by Cr (71.87 mg kg −1 ), Cu (35.76 mg kg −1 ), and Ni (29.94 mg kg −1 ). The coefficient of variation (CV, %) indicates the variation degree of the PTE content [39]  . The CVs of the soil PTE concentrations were in the order of: Ni (50.17%) > Cu (36.89%) > Cr (36.81%) > Zn (29.76%). All the soil samples have Cr and Cu content lower than risk screening value regulated by Chinese government (GB 15618-2018). Additionally, 1.32% of soil samples with Zn content higher than risk screening value meanwhile 0.88% of soil samples with Ni content higher than risk screening value (GB 15618-2018). The mean PTE contents in rice grains kept the sequence: Zn (23.89 mg kg −1 ) > Cu (2.98 mg kg −1 ) > Cr (0.79 mg kg −1 ) > Ni (0.64 mg kg −1 ). As outlined by the Chinese Government, the national standard values for Cr [63], Cu [64], Zn [65], and Ni [66] in rice grains are 1.00, 10.00, 50.00, and 0.40 mg kg −1 , respectively. The results revealed that the mean Ni concentration in rice grains exceeded the national threshold, whereas Cr, Cu, and Zn contents were below their national limits. The CVs of PTEs in the rice grain were in the order of: Cr (117.08%) > Ni (75.79%) > Cu (26.92%) > Zn (18.04) ( Table 2). The percentage of rice samples with a concentration of Cr, Cu, Zn, and Ni higher than corresponding national standard value is 20.75%, 0, 0.11%, and 65.75%, respectively.

BAC of PTEs from Soil to Rice
The BACs of the different PTEs in rice grains exhibited remarkable variations ( Figure 3). The BACs of Cr, Cu, Zn, Ni ranged from 0.001 to 0.890, 0.002 to 0.485, 0.013 to 0.729, and 0.003 to 0.783, respectively. The average BACs of the different PTEs followed the sequence: Zn (0.219) > Cu (0.093) > Ni (0.032) > Cr (0.018) (Figure 3). The CVs of the BACs of Cr, Cu, Zn, and Ni in rice grains were 270.34%, 45.90%, 29.87%, and 168.68%, respectively, demonstrating that the BACs of Cr and Ni varied greatly in the study area.

Modeling the Transfer of PTEs from Soil to Rice
We employed four different methods (Cubist, RF, SVMs, and MLR) to build models for predicting PTE bioaccumulation from soil to rice grains with input from multiple ancillary sources. The sample dataset was firstly randomly divided as training dataset and validation dataset with ratio of 4:1. Firstly the models were trained using a training dataset and 5-fold cross-validation was employed to evaluate the model performance on training dataset. Then the constructed models were validated independently using a validation dataset. Performance metrics of the constructed models are listed in Table 3. The R 2 of the Cubist models for the different PTEs varied between 0.05 and 0.72, whereas the RMSE varied between 0.04 and 3.22 mg kg −1 . The R 2 of RF models varied between 0.58 and 0.79, whereas the RMSE varied between 0.03 and 0.04 mg kg −1 (Figure 4). For the SVM models, the R 2 spanned 0.05 and 0.69, and the RMSE ranged from 0.05 to 2.95 mg kg −1 . For the MLR models, the R 2 varied between 0.46 and 0.67, and the RMSE ranged from 0.05 to 2.95 mg kg −1 . The RF algorithms markedly outperformed the other three methods, with the highest R 2 and Lin's concordance correlation coefficient (CCC) and the lowest RMSE and bias for all the tested PTEs (Table 3)

BAC of PTEs from Soil to Rice
The BACs of the different PTEs in rice grains exhibited remarkable variations ( Figure  3). The BACs of Cr, Cu, Zn, Ni ranged from 0.001 to 0.890, 0.002 to 0.485, 0.013 to 0.729, and 0.003 to 0.783, respectively. The average BACs of the different PTEs followed the sequence: Zn (0.219) > Cu (0.093) > Ni (0.032) > Cr (0.018) (Figure 3). The CVs of the BACs of Cr, Cu, Zn, and Ni in rice grains were 270.34%, 45.90%, 29.87%, and 168.68%, respectively, demonstrating that the BACs of Cr and Ni varied greatly in the study area.

Modeling the Transfer of PTEs from Soil to Rice
We employed four different methods (Cubist, RF, SVMs, and MLR) to build models for predicting PTE bioaccumulation from soil to rice grains with input from multiple ancillary sources. The sample dataset was firstly randomly divided as training dataset and validation dataset with ratio of 4:1. Firstly the models were trained using a training dataset and 5-fold cross-validation was employed to evaluate the model performance on training dataset. Then the constructed models were validated independently using a validation dataset. Performance metrics of the constructed models are listed in Table 3. The R 2 of the Cubist models for the different PTEs varied between 0.05 and 0.72, whereas the RMSE varied between 0.04 and 3.22 mg kg −1 . The R 2 of RF models varied between 0.58 and 0.79, whereas the RMSE varied between 0.03 and 0.04 mg kg −1 (Figure 4). For the SVM models, the R 2 spanned 0.05 and 0.69, and the RMSE ranged from 0.05 to 2.95 mg kg −1 . For the MLR models, the R 2 varied between 0.46 and 0.67, and the RMSE ranged from 0.05 to 2.95 mg kg −1 . The RF algorithms markedly outperformed the other three methods, with the highest

Variable Importance for Modeling PTE Bioaccumulation in Soil-Rice Systems
We used 20 auxiliary variables (Table 1), including terrain attributes, climatic factors, soil properties, geological factors, and geomorphic factors, to estimate PTE bioaccumulation from soil to rice grains. As described in Section 3.3, the RF model yielded the best estimation of PTE bioaccumulation from soil to rice grains. Therefore, the potential dominators for PTE bioaccumulation were based on the relative variable importance determined by RF.
The relative importance of variables for estimation of PTE bioaccumulation in soil-rice systems is presented in Figure 5. The total concentration of Cr in soil, CEC, and the amount of organic fertilizer applied annually were identified as the three most important variables for predicting the BAC of Cr. Soil Cu content, annual average precipitation, and the amount of potash applied annually were the three principle variables for modeling the BAC of Cu. Soil Zn content and annual average precipitation and temperature were the three primary variables for modeling the BAC of Zn. Soil Ni content, CEC, and the amount of nitrogen fertilizer applied annually were the three most critical variables for modeling the BAC of Ni. variables for predicting the BAC of Cr. Soil Cu content, annual average precipitation, and the amount of potash applied annually were the three principle variables for modeling the BAC of Cu. Soil Zn content and annual average precipitation and temperature were the three primary variables for modeling the BAC of Zn. Soil Ni content, CEC, and the amount of nitrogen fertilizer applied annually were the three most critical variables for modeling the BAC of Ni.

PTE Content in Soil-Rice Systems
In this study, the mean contents of all PTEs in the region under survey were clearly higher than their national background concentrations in China, indicating PTE accumulation resulting from anthropogenic activities (Table 4). In comparison with other surveys

PTE Content in Soil-Rice Systems
In this study, the mean contents of all PTEs in the region under survey were clearly higher than their national background concentrations in China, indicating PTE accumulation resulting from anthropogenic activities (Table 4). In comparison with other surveys conducted in regions like Liaoning, Jilin, Jiangsu, Fujian, Xinjiang, Shanghai, Guangdong and Hunan Province in China [42,[67][68][69][70] (listed in Table 4), the soil Cr content was relatively high, whereas soil Cu, Zn, and Ni were present in moderate levels. The mean PTE contents in rice grains from the survey region were notably lower than the national permissible values, except Ni. The concentration of Cr in rice grains observed in this study was the highest among those reported in the studies listed in Table 4. The Zn content of rice grains was also higher than that noted in other surveys, whereas the Cu content of rice grains was lower. The maximum values of Cr, Zn, and Ni contents exceeded the national permissible limits, indicating that some of the rice crops are already affected by PTE pollution (Table 2). Especially, 20.75% of rice samples have Cr content higher than national threshold, moreover 65.75% of rice samples have Ni content higher than national threshold. This indicates that it is urgent task to take measures to control the bioaccumulation of PTEs in rice.  Taking the results of other studies into consideration, the BACs of Cr, Cu, Zn, and Ni ranged from 0.003 to 0.012, 0.038 to 0.386, 0.086 to 0.241, and 0.004 to 0.103, respectively (Table 4). Our results found that migration capacity of Zn is the strongest while Cr is the weakest in soil-rice system which is consistent with the study reported by Li et al. (2021). The BAC of Cr calculated in this research was higher than that reported by previous study listed in Table 4, whereas the BACs of Cu, Zn and Ni were mid-range. The BAC of Zn was evidently higher than that of the other PTEs, which was consistent with previous study reported by Wang [39]. Cakmark (2008) [77] noted that application of Zn fertilizers enhances the amount of available Zn in the soil solution, leading to bioaccumulation in grains during the productive growth stage. Nan et al. (2002) [78], Mohammad and Moheman (2010) [79]; demonstrated the interactions occurring during the accumulation of Cd and Zn and the synergistic effects of the bioaccumulation of these two PTEs in soil-crop systems. These factors may contribute to the higher BAC of Zn observed in soil-rice systems.

Model Performance Employing the Different Methods and Elements
Clear differences were observed in the model performances. Generally, RF had the best performance, followed by MLR, SVM and Cubist. MLR aims to model the linear relationship between the explanatory variables and response variable. MLR has previously been used to predict soil organic carbon (SOC) [80], soil pH [81], soil bulk density [82], and soil texture [50]. However, MLR is not suitable if non-linear relationships are present in the data [83]. This may explain the poor performance of MLR when compared with that of RF and Cubist. Furthermore, the relationship between the covariates and BAC of PTEs in soil-rice systems is highly complex and difficult to model using linear relations alone, which further lower the performance of MLR. SVMs have also previously been employed to predict soil SOC stock [84], phosphorus [85], and clay [85]. They were reported to be more suitable for classification than quantitative prediction [86]. Cubist has proven to be a feasible method for predicting soil properties, such as SOC [87], salinity [88], silt [89], and concentration of soil Cr, Pb, Cu, Zn, Ni, and As [90]. The RF algorithm is based on decision trees and is currently the most widely used machine learning method for predicting soil properties. It has several advantages, such as ease of application, insensitivity to data size, and the ability to model non-linear relationships in the data set [37,60,91], which may contribute to the better performance of RF.
The model constructed using RF yielded the best estimation of Cr bioaccumulation in soil-rice systems. The R 2 of this model was 0.79 (Table 3 and (Table 5). Our models for predicting Cu, Zn, and Ni were also more accurate than those reported previously in similar case studies ( Table 5). The results confirmed the ability of RF to efficiently predict PTE bioaccumulation in soil-rice systems with multi-source covariates.

Potential Dominators for PTE Bioaccumulation in Soil-Rice Systems
The total PTE content of soil was found to be the principal variable for estimating the BAC of PTEs in soil-rice systems, as evident from Figure 4. In addition, CEC and annual average precipitation contributed significantly to BAC estimation. Cr, Cu, Zn, and Ni are indispensable elements for rice growth. The PTEs can be absorbed by rice plants from the soil. Several studies have demonstrated the strong relationship between PTEs in the soil and in crops, including rice [39,41,47,49,93,94], which may explain the importance of SC in modeling PTE bioaccumulation in rice.
The close relationship between CEC and absorption of PTEs by crops from the soil has been reported by several researchers [37,[95][96][97]. Vega et al. (2010) [98] revealed that CEC plays important roles in the sorption and retention processes of PTEs in soil-crop systems. Gupta et al. (2008) [99] reported that high CEC could significantly slow down PTE uptake by crops from the soil. Shen et al. (1998) [100] also found that high CEC allows the minerals in soil to absorb PTE ions from soil solutions through ion-exchange processes. Gu et al. (2011) [101] noted that an increase in soil CEC could promote PTE precipitation and complexation in the soil solution.
The annual average precipitation may affect PTE bioaccumulation in soil-rice systems by promoting crop growth, subsequently enhancing PTE absorption by the crop. Other covariates, such as SOM, appeared to be less important than expected, and these warrant further investigation.

Policy Recommendations
Our results indicate that some measures could be taken to curb the bioaccumulation of PTEs in rice. Firstly, great effort still necessary to reduce accumulation of PTEs in soil which could then reduce the bioaccumulation of PTEs from soil to rice. Secondly, more reasonable agricultural land management measures for example reasonable application of fertilization, reasonable crop rotation were expected to curb transfer of PTEs from soil to rice. After that, liming of acidic soils is recommended especially in the areas serous polluted by PTEs such as Cd, Pb since some PTEs such as Cd and Pb are more easily absorbed by crops in acidic soil [102]. Finally, breeding crops, such as rice cultivars with low accumulation, is another efficient way to reduce the human health risk caused by PTEs pollution in food or vegetables.

Limitations and Perspectives
Although the model developed using RF could efficiently predict PTE bioaccumulation in soil-crop systems, some limitations remain to be addressed. The present study did not take the rice variety into consideration while building the model. Some studies have reported significant differences in the BAC of PTEs among different rice varieties [47,103,104]. Furthermore, some of our ancillaries, such as CEC, soil bulk density, annual average temperature, annual average precipitation, and soil texture, were sourced from maps available online. The resolution of these data is 1000 m, which is still coarse to provide accurate information on the spatial variation of these variables (Table 1). This may negatively affect model performance. In addition, the agricultural practices, such as application of lime, duration of flooding, can affect PTE bioaccumulation in soil-rice systems [105,106].
Several measures could be employed to improve the estimation accuracy. Firstly, the differences between the rice varieties must be taken into consideration in future works. Secondly, maps with high resolution data of CEC, soil bulk density, annual average temperature, annual average precipitation, and soil texture, recorded during additional surveys should be used as covariates. Greater information related to soil management practices should also be included in the model to improve prediction accuracy. Finally, in our current study, the model was built using the total PTE content as one of the core covariates. However, the available form of PTEs in soil was reported to be more closely related to the PTEs absorbed by crops from the soil [39,[107][108][109][110]. Therefore, future studies should employ the available form of PTEs in soil for constructing the model, instead of the total PTE content.

Conclusions
Results obtained in current study revealed that the mean contents of Cr, Cu, Zn, and Ni in farmland soils in the region under survey were higher than their background contents in China. The average Ni content in rice grains clearly exceeded the national permissible limit, whereas the Cr, Cu, and Zn contents were lower than their thresholds. The mean value of BAC of Zn (0.219) was the highest, followed by that of Cu (0.093), Ni (0.032), and Cr (0.018).
The model developed using RF significantly outperformed the MLR, SVM, and Cubist models, and could efficiently predict the bioaccumulation of Cr, Cu, Zn, and Ni in soil-rice systems, with an R 2 varying between 0.58 and 0.79 and RMSE varying between 0.03 and 0.04 mg kg −1 . Total PTE content in soil, CEC, and annual average precipitation were the principal components for the estimation of BACs of PTEs in soil-rice systems. This study confirmed the feasibility of RF for predicting PTE bioaccumulation in soil-rice systems and identifying potential dominators for the transportation of PTEs in soil-rice systems. The findings of this study are expected to enhance our knowledge of the PTEs accumulation in soil and rice grains, contributing to food safety and reducing the human health risk of consuming PTE-polluted rice.