Mapping Topsoil Total Nitrogen Using Random Forest and Modified Regression Kriging in Agricultural Areas of Central China

Accurate understanding of spatial distribution and variability of soil total nitrogen (TN) is critical for the site-specific nitrogen management. Based on 4337 newly obtained soil observations and 33 covariates, this study applied the random forest (RF) algorithm and modified regression kriging (RF combined with residual kriging: RFK, hereafter) model to spatially predict and map topsoil TN content in agricultural areas of Henan Province, central China. According to the RFK prediction, topsoil TN content ranged from 0.52 to 1.81 g kg−1, and the farmland with the topsoil TN contents of 1.00–1.23 g kg−1 and 0.80–1.23 g kg−1 accounted for 48.2% and 81.2% of the total farmland area, respectively. Spatially, the topsoil TN in the study area was generally higher in the west and lower in the east. By using the Boruta variable selection algorithm, soil organic matter (SOM) and available potassium contents in topsoil, nitrogen deposition, average annual precipitation, livestock discharges, and topsoil pH were identified as the main factors driving the spatial distribution and variation of soil TN in the study area. The RF and RFK models used showed the expected performance and achieved acceptable TN prediction accuracy. In comparison, RFK performed slightly better than the RF model. The R2 and RMSE achieved by the RFK model were improved by 4.5% and 4.5%, respectively, compared with that by the RF model. However, the results suggest that RFK was inferior to the RF model in quantifying prediction uncertainty and thus may have a slight disadvantage in model reliability.


Introduction
Soil total nitrogen (TN) is one of the most important indicators of soil productivity and the biogeochemical cycle, and plays an essential role in agroecosystem functioning and climate change mitigation [1][2][3][4]. Low soil TN content suggests that nitrogen may become a crucial limiting factor for primary productivity in agroecosystems, while excessive soil TN content implies the risk of agricultural non-point source pollution and greenhouse gas emissions [5][6][7][8][9][10]. Spatially predicting the distribution and variability of soil TN and determining its main controlling factors are of great significance for understanding the carbon-nitrogen cycle in agroecosystems, implementing site-specific nitrogen management, and maintaining nitrogen dynamic balance at regional, landscape, and field scales, which help improve soil quality, increase food production, prevent agricultural non-point source pollution, and reduce greenhouse gas emissions [7,[11][12][13].
It is well known that soil TN content, especially in arable soils, is not only affected by natural factors such as topography, parent material, climate, and biology, but also

Descriptive Statistics of Soil Total Nitrogen Observations
Summary statistics of the topsoil TN contents observed in the agricultural areas of the study area are presented in Table 1. The observed topsoil TN content ranged from 0.16 to 2.11 g kg −1 , with a mean of 1.06 g kg −1 . The coefficient of variation (CV) of the entire sample set was 27.00%, indicative of a moderate variability. Smaller kurtosis and skewness values indicate that the dataset was close to a normal distribution with a slight right (positive) skewness. There was no significant difference in the statistical characteristics of the entire set, calibration set, and validation set, indicating that all were well representative.

Relative Importance of Covariates
Boruta's quantitative evaluation showed that, except for aspect in the topographic attribute category, the relative importance of all the remaining 32 covariates was greater than the maximum value of the shadow variables (maximum Z-score), that is, they had an important influence on the spatial prediction of topsoil TN in the study area, and were involved in modeling as predictors. As shown in Figure 1, in addition to soil organic matter (SOM), which ranked first in the relative importance list by absolute dominance, the covariates associated with soil nitrogen sources (e.g., application of livestock manure and N-fertilizer, atmospheric N-deposition), soil nutrient-holding capacity (e.g., available K and P contents), and the climatic covariates closely related to soil water availability (e.g., evaporation, precipitation, and relative humidity) ranked higher ( Figure 1).

Relative Importance of Covariates
Boruta's quantitative evaluation showed that, except for aspect in the topograp attribute category, the relative importance of all the remaining 32 covariates was gre than the maximum value of the shadow variables (maximum Z-score), that is, they an important influence on the spatial prediction of topsoil TN in the study area, and w involved in modeling as predictors. As shown in Figure 1, in addition to soil organic m ter (SOM), which ranked first in the relative importance list by absolute dominance, covariates associated with soil nitrogen sources (e.g., application of livestock manure N-fertilizer, atmospheric N-deposition), soil nutrient-holding capacity (e.g., availabl and P contents), and the climatic covariates closely related to soil water availability (e evaporation, precipitation, and relative humidity) ranked higher ( Figure 1).

Spatial Distribution and Variability of Topsoil TN
Based on the covariate set established by the variable selection using the Boruta algorithm, the spatial distribution of topsoil TN content predicted by the RF model was shown in Figure 2b. As tested, the residues from the RF prediction had spatial autocorrelation (Moran's I = −0.06, Z-score = −7.06, p < 0.01) and matched the normal distribution (K-S test p > 0.05). The optimal semi-variance model parameters are shown in Table 2. The results showed that the best-fitting model for the RF residuals was an exponential model. The nugget and sill values were 0.0018 and 0.0447, respectively. The nugget effect was 4.02%, indicating that the RF residuals exhibited strong spatial dependence. Then, the spatial distribution of topsoil TN residues was estimated by OK interpolation. The final TN prediction by the RFK model was generated by adding the deterministic component from the RF model with the residual interpolation ( Figure 2e). According to the RFK prediction, the topsoil TN content in the study area ranged from 0.52 to 1.81 g kg −1 , with a mean of 1.06 g kg −1 . Compared with the TN observations in the calibration set, the distribution range of predicted TN content was significantly narrowed, reflecting the apparent smoothing effect of the RFK prediction. The agricultural lands with topsoil TN content of 1.00-1.23 g kg −1 were the most widely distributed in the study area, accounting for 48.2% of the total agricultural area, followed by the lands with TN content of 0.80-1.00 g kg −1 , covering 33.0% of the total agricultural area. The agricultural lands with topsoil TN > 1.37 g kg −1 were mainly distributed in the mountainous areas of western Henan Province, while the lands with topsoil TN contents ≤0.48 g kg −1 were concentrated in the Huang-Huai-Hai plain within the study area. Spatially, the topsoil TN in the study area showed considerable spatial variability.

Spatial Distribution and Variability of Topsoil TN
Based on the covariate set established by the variable selection using the Boruta algorithm, the spatial distribution of topsoil TN content predicted by the RF model was shown in Figure 2b. As tested, the residues from the RF prediction had spatial autocorrelation (Moran's I = −0.06，Z-score = −7.06, p < 0.01) and matched the normal distribution (K-S test p > 0.05). The optimal semi-variance model parameters are shown in Table 2. The results showed that the best-fitting model for the RF residuals was an exponential model. The nugget and sill values were 0.0018 and 0.0447, respectively. The nugget effect was 4.02%, indicating that the RF residuals exhibited strong spatial dependence. Then, the spatial distribution of topsoil TN residues was estimated by OK interpolation. The final TN prediction by the RFK model was generated by adding the deterministic component from the RF model with the residual interpolation ( Figure 2e). According to the RFK prediction, the topsoil TN content in the study area ranged from 0.52 to 1.81 g kg −1 , with a mean of 1.06 g kg −1 . Compared with the TN observations in the calibration set, the distribution range of predicted TN content was significantly narrowed, reflecting the apparent smoothing effect of the RFK prediction. The agricultural lands with topsoil TN content of 1.00-1.23 g kg −1 were the most widely distributed in the study area, accounting for 48.2% of the total agricultural area, followed by the lands with TN content of 0.80-1.00 g kg −1 , covering 33.0% of the total agricultural area. The agricultural lands with topsoil TN > 1.37 g kg −1 were mainly distributed in the mountainous areas of western Henan Province, while the lands with topsoil TN contents ≤ 0.48 g kg −1 were concentrated in the Huang-Huai-Hai plain within the study area. Spatially, the topsoil TN in the study area showed considerable spatial variability.

Comparison of Model Performance
The independent validation showed that in predicting topsoil TN content in the study area (Figure 3), the R 2 achieved by the RF and RFK models was 0.44 and 0.46, and the RMSE was 0.22 and 0.21, respectively. The RFK model outperformed the RF model in terms of predictive performance. Based on the calculation of the CI width, the uncertainty of topsoil TN predictions were quantitatively evaluated by counting the percentage of topsoil TN observations that fell within the specified 90% CI, according to the technical specifications of GlobalSoilMap [44,45] (Table 3). Approximately 92.4% of the topsoil TN observations in the validation set fell into the 90% CI of the RF model, demonstrating an acceptable reliability of the predictions. In comparison, the CI coverage probability of RFK model was higher than that of the RF model, and the percentage of soil observations in the validation set falling into 90% CI was 98.2%, significantly deviating from the theoretical range, indicating that the uncertainty of model prediction was overestimated.

Comparison of Model Performance
The independent validation showed that in predicting topsoil TN content in the study area (Figure 3), the R 2 achieved by the RF and RFK models was 0.44 and 0.46, and the RMSE was 0.22 and 0.21, respectively. The RFK model outperformed the RF model in terms of predictive performance. Based on the calculation of the CI width, the uncertainty of topsoil TN predictions were quantitatively evaluated by counting the percentage of topsoil TN observations that fell within the specified 90% CI, according to the technical specifications of GlobalSoilMap [45,46] (Table 3). Approximately 92.4% of the topsoil TN observations in the validation set fell into the 90% CI of the RF model, demonstrating an acceptable reliability of the predictions. In comparison, the CI coverage probability of RFK model was higher than that of the RF model, and the percentage of soil observations in the validation set falling into 90% CI was 98.2%, significantly deviating from the theoretical range, indicating that the uncertainty of model prediction was overestimated.

Covariate Contributions
The relative importance of the covariates derived from the variable selection algorithm refers to the relative influence of the covariates on the spatial prediction of the target soil variables. If the model prediction was reliable, then the relative importance of the covariates largely implied the ability of the covariates to drive the spatial distribution and variation of the target soil variables. Therefore, although the RF used in this study was not an explanatory model, it could reveal the driving factors of spatial distribution and variation of topsoil TN content in the study area from the relative importance of covariates involved in modeling.

Covariate Contributions
The relative importance of the covariates derived from the variable selection algorithm refers to the relative influence of the covariates on the spatial prediction of the target soil variables. If the model prediction was reliable, then the relative importance of the covariates largely implied the ability of the covariates to drive the spatial distribution and variation of the target soil variables. Therefore, although the RF used in this study was not an explanatory model, it could reveal the driving factors of spatial distribution and variation of topsoil TN content in the study area from the relative importance of covariates involved in modeling. As expected, SOM content dominated the spatially explicit estimation of the topsoil TN in the study area, which was consistent with most other studies [12,16,[46][47][48]. The statistics of soil observations in the calibration set showed that the content of SOM and topsoil TN was positively correlated at the p < 0.01 level (Figure 4), which indicated that most topsoil TN existed in organic form, and perhaps some inorganic nitrogen was adsorbed on the SOM functional groups. The relative importance of available potassium content in topsoil ranked second among all covariates, which may be due to two causes. First, the widespread use of compound fertilizer containing N, P, K elements increased the possibility of the coexistence of available nitrogen and available potassium. Secondly, and most importantly, there was a close correlation between available potassium and TN content. If the topsoil TN content in the calibration set is divided into seven grades according to the legend grading standard in Figure 2, and the scatterplot of topsoil TN against available potassium is made according to the average content of each grade, then there is an almost perfect linear correlation between them (Figure 5a). This was most likely attributed to the fact that the available potassium in soil was mainly adsorbed to organic colloid in the form of exchangeable cations. Such a perfect correlation also existed between SOM and topsoil TN (Figure 5b), and available potassium ( Figure 5c). As expected, SOM content dominated the spatially explicit estimation of the topso TN in the study area, which was consistent with most other studies [12,16,[47][48][49]. Th statistics of soil observations in the calibration set showed that the content of SOM an topsoil TN was positively correlated at the p < 0.01 level (Figure 4), which indicated th most topsoil TN existed in organic form, and perhaps some inorganic nitrogen was a sorbed on the SOM functional groups. The relative importance of available potassiu content in topsoil ranked second among all covariates, which may be due to two cause First, the widespread use of compound fertilizer containing N, P, K elements increase the possibility of the coexistence of available nitrogen and available potassium. Secondl and most importantly, there was a close correlation between available potassium and T content. If the topsoil TN content in the calibration set is divided into seven grades a cording to the legend grading standard in Figure 2, and the scatterplot of topsoil T against available potassium is made according to the average content of each grade, the there is an almost perfect linear correlation between them (Figure 5a). This was most like attributed to the fact that the available potassium in soil was mainly adsorbed to organ colloid in the form of exchangeable cations. Such a perfect correlation also existed betwee SOM and topsoil TN (Figure 5b), and available potassium (Figure 5c).  Both forms of N deposition ranked third and seventh, respectively in the relative importance of covariates, indicating that they played a very important role in the topsoil TN prediction, which has rarely been reported in other soil TN estimation studies conducted in China [18,34,48,49]. In fact, few studies have included N deposition as a covariate for soil TN prediction, possibly because the intensity of N deposition has shown a dramatic decline in most parts of the country over the past two decades. Nevertheless, at least in this study area, N deposition seemed to remain an important source of soil nitrogen, and had a significant contribution to topsoil TN content. The relative importance of average annual precipitation ranked fourth among the covariates, and had a significantly positive correlation with topsoil TN at the p < 0.05 level in the calibration set ( Figure 4), which was consistent with the identification of influencing factors of soil TN in other studies [34,[49][50][51]. We believe that precipitation, together with evaporation, has a dual impact on soil TN: one was to affect SOM accumulation and thus TN content; the other was to alter soil water availability to drive nitrogen behavior, such as leaching and volatilizing.
Two covariates characterizing the potential nitrogen output of the local livestock industry, namely the pig equivalent per unit area and the risk index of livestock manure pollution, were also relatively high in the relative importance ranking, among which the risk index of livestock manure pollution was significantly and positively correlated with the topsoil TN content at the p < 0.05 level. In previous studies of soil TN prediction, almost none included the livestock-related data layer as a covariate, possibly due to the small size of the livestock industry in these study areas. In Henan province, however, the comprehensive production capacity of animal husbandry has been continuously enhanced over the past decade. In 2021, the output value of animal husbandry in the province ranked the second in the country, accounting for 28.7% of the total agricultural output value of the province. The impact of livestock waste discharge on soil nitrogen should not be ignored.
Topsoil pH also ranked in the top 10 covariates, and was significantly negatively correlated with topsoil TN at the p < 0.01 level. This correlation has also been found by previous studies [48]. Many studies have shown that the entry of exogenous nitrogen, such as N fertilizer application and N deposition, could increase the SOM and soil TN contents while leading to the decrease in soil pH [48,[52][53][54][55].

Prediction Accuracy
As mentioned above, soil TN is one of the most difficult soil attributes to be spatially predicted due to the high diversity and great spatial-temporal variability of influencing Both forms of N deposition ranked third and seventh, respectively in the relative importance of covariates, indicating that they played a very important role in the topsoil TN prediction, which has rarely been reported in other soil TN estimation studies conducted in China [18,33,47,48]. In fact, few studies have included N deposition as a covariate for soil TN prediction, possibly because the intensity of N deposition has shown a dramatic decline in most parts of the country over the past two decades. Nevertheless, at least in this study area, N deposition seemed to remain an important source of soil nitrogen, and had a significant contribution to topsoil TN content. The relative importance of average annual precipitation ranked fourth among the covariates, and had a significantly positive correlation with topsoil TN at the p < 0.05 level in the calibration set (Figure 4), which was consistent with the identification of influencing factors of soil TN in other studies [33,[48][49][50]. We believe that precipitation, together with evaporation, has a dual impact on soil TN: one was to affect SOM accumulation and thus TN content; the other was to alter soil water availability to drive nitrogen behavior, such as leaching and volatilizing.
Two covariates characterizing the potential nitrogen output of the local livestock industry, namely the pig equivalent per unit area and the risk index of livestock manure pollution, were also relatively high in the relative importance ranking, among which the risk index of livestock manure pollution was significantly and positively correlated with the topsoil TN content at the p < 0.05 level. In previous studies of soil TN prediction, almost none included the livestock-related data layer as a covariate, possibly due to the small size of the livestock industry in these study areas. In Henan province, however, the comprehensive production capacity of animal husbandry has been continuously enhanced over the past decade. In 2021, the output value of animal husbandry in the province ranked the second in the country, accounting for 28.7% of the total agricultural output value of the province. The impact of livestock waste discharge on soil nitrogen should not be ignored.
Topsoil pH also ranked in the top 10 covariates, and was significantly negatively correlated with topsoil TN at the p < 0.01 level. This correlation has also been found by previous studies [47]. Many studies have shown that the entry of exogenous nitrogen, such as N fertilizer application and N deposition, could increase the SOM and soil TN contents while leading to the decrease in soil pH [47,[51][52][53][54].

Prediction Accuracy
As mentioned above, soil TN is one of the most difficult soil attributes to be spatially predicted due to the high diversity and great spatial-temporal variability of influencing factors. Given that the study area covered 167,000 km 2 , the performance of the RF and RFK models used in this study and the achieved accuracy of topsoil TN prediction were in line with expectations. Under the conditions of comparable soil sample density, covariate availability, and landscape complexity, the smaller the geographical scope of the study area, the better the prediction performance of the model used. Liu et al. [55] successfully predicted soil TN content using a multiple linear regression (MLR) model in a small watershed of 4.2 km 2 in Shandong Province, China, and achieved a prediction R 2 of 0.69. In the study conducted by Wadoux et al. [56] in the metropolitan territory of France covering about 540,000 km 2 , based on the soil observations from the LUCAS dataset, the topsoil TN prediction using the RF model just obtained an R 2 of 0.20, while the RMSE was as high as 1.52. In Zhejiang Province (located in East China and with a total area of 104,300 km 2 ), Deng et al. [47] used the RF model to spatially predict topsoil TN content and achieved an R 2 of 0.65. The density of the soil observations in Deng et al.'s study was about seven times that of our soil observations. We believed that the much higher soil observation density might be one of the key reasons for the significantly higher R 2 . However, the prediction RMSE achieved by Deng et al.'s study was 0.45, much higher than the 0.22 obtained in this study. Considering the differences in topsoil TN levels between the two study areas, the normalized RMSE (NRMSE) was calculated by dividing the RMSE by the mean of the TN observations. It was found that the NRMSE of topsoil TN prediction in Zhejiang study area was 0.25, while that in our study area was 0.20. It seems that increasing the sample observations could significantly promote the model capacity to explain the spatial variation of topsoil TN, but it might not effectively reduce the prediction deviation.
In terms of R 2 and RMSE, the accuracy of the RFK model was better than the RF model for topsoil TN prediction in the study area. The R 2 and RMSE obtained by the RFK model improved by 4.5% and 4.5%, respectively, compared with those obtained by the RF model. The superiority of RFK over the RF model is visually demonstrated by the plots of predicted against measured values of the topsoil TN contents ( Figure 3). As shown in Figure 3, although both models display a similar pattern, RFK scatter is less tight around the 1:1 line, and overestimated lower and underestimated higher TN content values to a lesser extent than RF. This finding was close to the studies conducted by Takoutsing and Heuvelink [37].
Many studies have reported that the RK model and its modified visions were superior to competitors to varying degrees in spatially predicting soil TN [18,48,57,58]. In comparison, the performance advantage of RFK over RF in this study was smaller than in most previous studies. First, the relatively large study area increased the terrain diversity, landscape complexity and the soil heterogeneity, leading to the decrease in effective control scope of the spatial autocorrelation of soil TN [26,59]. Therefore, the existing soil observations were not enough to predict the spatial stochastic variation of soil TN well. With the same calibration dataset, using the OK model to predict the topsoil TN in this study area, the achieved R 2 was only 0.21, but the RMSE was as high as 0.25. Obviously, the performance of OK was inferior to the RF and RFK models. Second, the model structure and the used covariates largely influenced the residual spatial autocorrelation of deterministic prediction. The RF used in this study was a tree-based ML model populated with all relevant variables, which usually leaves no or weak residual spatial autocorrelation [26]. Thus, the substantial superiority of RFK performance could not achieved by OK of the residuals from the RF model.

Prediction Uncertainty
One of the main advantages of the DSM approach is that it allows for quantitative analysis of prediction uncertainties. Based on the statistical results of the validation sample points (Table 3), the CI of RF (92.40%) is closer to the theoretical value of 90% compared to RFK (CI of 98.21%), indicating that RF outperforms RFRK in terms of quantitative estimation of spatial prediction uncertainty. Similarly, Takoutsing and Heuvelink [37] found in a recent study at the landscape scale that regression kriging (RK) was better at predicting a variety of soil properties by achieving lower RMSE values, but worse at quantifying prediction uncertainty than the RF model. In this study, the results showed that the performance of the RK and RF models did not change in terms of both prediction accuracy and quantification of prediction uncertainty when the trend term in the RK was fitted with the RF model instead of the regression model.

Study Area
Henan province (31 • 23 -36 • 22 N and 110 • 21 -116 • 39 E) is located in the middle and lower reaches of the Yellow River in central China (Figure 6), covering a total land area of 167,000 square kilometers, of which 7.51 million hectares are arable land. Henan Province is generally high in the west and low in the east, with an altitude range of 23.2-2413.8 m. The province has a variety of landforms, among which mountains and hills account for 44.3% and plains and basins account for 55.7% of the total land area. Most of the province is in the warm temperate zone, belongs to a continental monsoon climate with a transition from the northern subtropical zone to the warm temperate zone, and features four distinct seasons and simultaneous rain and heat. The average annual temperature of the province from south to north is 10.5-16.7 • C, the average annual precipitation is 464.2-1193.2 mm, the most rainfall is from June to August, the average annual sunshine is 1285.7-2292.9 h, and the annual frost-free period is 208.7-290.2 days, which is suitable for a wide range of crops. The cropping system in Henan Province mainly adopts a winter wheat-summer maize (northern region) and a rice-winter wheat (southern region) crop rotation. As a major agricultural province, grain production in Henan Province plays an important role in China's food security strategy. In 2022, the grain output of the province in a recent study at the landscape scale that regression kriging (RK) was better at predicting a variety of soil properties by achieving lower RMSE values, but worse at quantifying prediction uncertainty than the RF model. In this study, the results showed that the performance of the RK and RF models did not change in terms of both prediction accuracy and quantification of prediction uncertainty when the trend term in the RK was fitted with the RF model instead of the regression model.

Study Area
Henan province (31°23′-36°22′ N and 110°21′-116°39′ E) is located in the middle and lower reaches of the Yellow River in central China (Figure 6), covering a total land area of 167,000 square kilometers, of which 7.51 million hectares are arable land. Henan Province is generally high in the west and low in the east, with an altitude range of 23.2-2413.8 m. The province has a variety of landforms, among which mountains and hills account for 44.3% and plains and basins account for 55.7% of the total land area. Most of the province is in the warm temperate zone, belongs to a continental monsoon climate with a transition from the northern subtropical zone to the warm temperate zone, and features four distinct seasons and simultaneous rain and heat. The average annual temperature of the province from south to north is 10.5-16.7 °C, the average annual precipitation is 464.2-1193.2 mm, the most rainfall is from June to August, the average annual sunshine is 1285.7-2292.9 h, and the annual frost-free period is 208.7-290.2 days, which is suitable for a wide range of crops. The cropping system in Henan Province mainly adopts a winter wheat-summer maize (northern region) and a rice-winter wheat (southern region) crop rotation. As a major agricultural province, grain production in Henan Province plays an important role in China's food security strategy. In 2022, the grain output of the province reached 67.89 billion kg, ranking the second in China, and exceeding 50 billion kg for 16 consecutive years and 65 billion kg for the sixth consecutive year. According to the Chinese Soil Taxonomy, the types of major agricultural soils in Henan Province consist of several suborders of Cambosols (WRB: Vertic Cambisols, Calcaric Cambisols), Argosols (WRB: Calcic Luvisols, Haplic Luvisols) and Primosols (WRB: Fluvic Cambisols, Calcaric Fluvisols), and Stagnic Anthrosols (WRB: Hydragric Anthrosols). Figure 6. Geographical location of the study area and spatial distribution of the soil sampling sites. Figure 6. Geographical location of the study area and spatial distribution of the soil sampling sites.

Soil Sampling and Measurement
For the purpose of monitoring cultivated land quality and promoting formulated fertilization, a total of 4337 topsoil samples were collected in the agricultural areas of Henan Province from 2017 to 2019. Taking the data layers of topography, land use and soil type as the basic strata, the soil sample sites were generated through a stratified random strategy and located using a global positioning system (GPS). At each location, the topsoil sample was taken at a depth of 0-20 cm, which weighed about 1 kg and was composed of the subsamples gathered from the corners and center of a 20 × 20 m quadrat. All the soil samples were carefully packed into cotton bags, labeled, and transported to the laboratory. After air-drying at room temperature for three weeks, the soil samples were removed from plant roots, litter, stones, and alien items, and sieved with a 0.25 mm mesh of stainless steel. The soil TN content was measured using an automatic Kjeldahl analyzer and the laboratory operations followed the relevant technical regulations in Agricultural Industry Standards of the People's Republic of China No. NY/T1121.
The soil samples (n = 4337) were split into calibration (n = 3470, 80%) and validation (n = 867, 20%) sets using the createDataPartition function in the caret package [60] in R 4.0.3 [61]. The calibration set was used to train the RF and RFK models, while the validation set was prepared for independent validation. The spatial distribution of soil sampling sites in calibration and validation sets is shown in Figure 1.

Covariates and Variable Selection
A total of 33 covariates that had pedogenetic associations with soil nitrogen or explanatory capacity for soil nitrogen behavior were collected and prepared as potential predictors of topsoil TN content. These 33 covariates could be roughly regarded as six categories, namely, nitrogen sources, soil properties, topographic attributes, climate characteristics, organism features, and management practices. Nitrogen fertilizer use, atmospheric nitrogen deposition, the pig equivalent per unit area, and straw returning to field were classified into the category of nitrogen sources. Soil property category included organic matter, available phosphorus and available potassium contents in topsoil, soil type, soil parent material, soil profile morphology, topsoil pH, topsoil texture, topsoil clay content, soil temperature regime, and soil moisture regime. Terrain attribute category mainly comprised elevation, slope and aspect. The climate characteristics category included average annual temperature, average annual precipitation, average annual evaporation, relative humidity, average annual sunshine, and annual cumulative temperature. The organism features included the normalized difference vegetation index (NDVI), net primary productivity index (NPP), and crop yield. The management practices category was composed of land use, cropping system, irrigation condition, drainage capability, and risk index of manure pollution. The brief descriptions of 33 covariates and their sources were listed in Table 4. To achieve the uniformity of spatial reference and resolution, all covariates were converted to WGS1984_UTM_49N projection coordinates and resampled to 1000 m resolution in ArcGIS 10.7.
For the vast majority of ML models, the prediction accuracy does not entirely depend on the number of covariates involved in modeling. Redundant, irrelevant covariates usually have a negative impact on the model performance. Variable selection, or feature selection, thus becomes an important aspect of model building and helps in building predictive models free from correlated variables, biases, and unwanted noise [34,62]. In this study, Boruta, an algorithm as a wrapper around RF, was chosen to conduct variable selection and valuation of covariate relative importance on the R statistical computing and analysis platform [63].

Predictive Models
The RF algorithm is a typical bagging algorithm (bootstrap aggregation) in ensemble learning [64]. It contains a number of decision trees and uses bootstrap resampling methods to perform put-back sampling of the dataset to train each decision tree in the model. Finally, the results of each tree are integrated. To generate a predictive model, the RF algorithm needs two user-defined parameters to be set, namely the number of trees to grow in the forest (ntree) and the number of covariates selected at each split (mtry). Many cases have demonstrated that 150 trees were sufficient to generate stable outcomes [65,66]. In the present study, we fixed ntree = 200. By default, we settled mtry to the rounded down square root of the total number of covariates. This study carried out the RF modeling using the randomForest package [67] in R 4.0.3 and the final prediction of topsoil TN content was presented as the average value of all the tree predictions generated based on a bootstrap sample of the calibration set.
The residuals from the RF model were obtained by subtracting the predicted TN content from the measured TN content at the same site. Then, ordinary kriging (OK) was used to obtain the spatial distribution of the RF residuals, and finally the interpolated results of the RF residuals were added to the RF prediction results to obtain the RFK prediction results. TN prediction from the hybrid model RFK can be described as follows: whereŶ RFK(s) is the predicted TN by the hybrid model RFK at the locations,Ŷ RF(s) is the predicted TN by the RF model, andέ OK(s) is the residual estimation by ordinary kriging (OK) interpolation. It should be emphasized that before fitting the semi-variance function, Spatial autocorrelation of RF residuals using the global Moran's I index test according to the requirements of ordinary kriging for data. If there is spatial autocorrelation and the residuals of RF conform to a normal distribution, OK interpolation can be used. If there is no spatial autocorrelation, the predicted topsoil TN by RF will be the output result. In the present study, spatial autocorrelation analysis, semi-variance analysis, and OK interpolation of RF residuals were all implemented in the ArcGIS 10.7 environment [18,68].

Evaluation of Model Performance
An independent validation approach was applied to assess performance of the prediction models in spatially predicting topsoil TN. Two commonly used assessment metrics, namely, the root mean square error (RMSE) and the coefficient of determination (R 2 ), were chosen to compare the accuracy of topsoil TN prediction by the RF and RFK models.
where n is the validation sample size, o i and p i represent the observed and predicted values, respectively, of topsoil TN content by a given method at the ith locations, and o is the average of the observed values of topsoil TN for the validation samples. Of the metrics used, RMSE summarizes the magnitude of the residuals, and a smaller RMSE indicates a higher accuracy of model prediction, while R 2 indicates the proportion of the topsoil TN variance explained by the covariate set. RMSE and R 2 can evaluate the accuracy of a model, but they lack the ability to quantify the uncertainty of the model. In this study, the 5% and 95% quantiles of the quantile regression forest (QRF) [69] prediction were regarded as the lower and upper limits of the 90% confidence interval (CI) width of the RF model, respectively. Assuming that the kriging interpolation of deterministic residuals followed the normal distribution, the upper and lower limits of 90% CI of the residual kriging was calculated at µ ± 1.645σ, where µ and σ were the mean and standard deviation of the predicted residuals, respectively [44]. Then, the 90% CI width of RFK model can be jointly determined by the upper and lower limits of 90% CI of RF model and kriging interpolation. Finally, we calculated the percentage of topsoil TN observations that fell at 90% CI to evaluate the ability of RF and RFK to quantify the uncertainty in spatial predictions of total soil nitrogen.

Conclusions
Under the combined effect of SOM, available potassium contents, nitrogen deposition, average annual precipitation, livestock discharges and topsoil pH, the TN content of agricultural soils in central China ranged from 0.52 to 1.81 g kg −1 . The agricultural land with topsoil TN content between 1.00 g kg −1 and 1.23 g kg −1 was the most widely distributed, accounting for approximately half of the total agricultural land area. The spatial variability of topsoil TN in the study area was significant, and was overall high in the west and low in the east. Accurately predicting the spatial distribution of soil TN on a regional scale and understanding the drivers of soil TN provides the basis and technical support for site-specific nitrogen management and dynamic change control. In terms of R 2 and RMSE achieved, RFK slightly outperformed the RF model. However, RFK was inferior to the RF model in quantifying prediction uncertainty. Overall, model performance evaluation should not be limited to the commonly used accuracy metrics, but should also consider the uncertainty of the quantitative prediction results.