Comparison of Soil Total Nitrogen Content Prediction Models Based on Vis-NIR Spectroscopy

Visible-near-infrared (Vis-NIR) spectroscopy is one of the most important methods for non-destructive and rapid detection of soil total nitrogen (STN) content. In order to find a practical way to build an STN content prediction model, three conventional machine learning methods and one deep learning approach are investigated and their predictive performances are compared and analyzed using a public dataset, LUCAS Soil (19,019 samples). The three conventional machine learning methods are ordinary least square estimation (OLSE), random forest (RF), and extreme learning machine (ELM), while for the deep learning method, three different structures of convolutional neural network (CNN) incorporating the Inception module are constructed and investigated. To clarify the effectiveness of different pre-treatments on predicting STN content, the three conventional machine learning methods combined with four pre-processing approaches (baseline correction, smoothing, dimensionality reduction, and feature selection) are investigated, compared, and analyzed. The results indicate that the baseline-corrected and smoothed ELM model reaches practical precision (coefficient of determination (R2) = 0.89, root mean square error of prediction (RMSEP) = 1.60 g/kg, and residual prediction deviation (RPD) = 2.34), while among the three differently structured CNN models, the one with more 1 × 1 convolutions performs better (R2 = 0.93, RMSEP = 0.95 g/kg, and RPD = 3.85 in the optimal case).
In addition, in order to evaluate the influence of dataset characteristics on the models, the LUCAS dataset was divided into different subsets according to dataset size, organic carbon (OC) content, and country. The results show that the deep learning method is more effective and practical than the conventional machine learning methods and, given enough data samples, it can be used to build a robust, highly accurate STN content prediction model for the same type of soil under similar agricultural treatment.


Introduction
Soil total nitrogen (STN) has a significant impact on plant growth [1,2], and predicting STN content is vital for crop production as well as income generation for farmers. Spectroscopic techniques have the advantages of fast detection speed, low cost, and non-destructive measurement [3], and they have been widely used in evaluation of crop yield [4,5], detection of fruit maturity [6][7][8], monitoring of plant diseases and pests [9,10], and other areas of agricultural production. The spectral absorptions in the visible-near-infrared region are associated with the stretching and bending of the bonds forming O-H, N-H, and C-H groups, so the Vis-NIR method can be used to establish an STN content prediction model when massive soil sample data are available.
From a practical point of view, different modeling approaches may be suitable for different application conditions and lead to different performances. In order to investigate which modeling approach can achieve higher accuracy without complicated spectral transformation, and to clarify model effectiveness for different data characteristics, several mainstream conventional STN content modeling methods, along with deep learning modeling techniques, are implemented and compared in this paper, and their effectiveness is explored according to data characteristics. The objectives of this paper are: (1) to establish reasonable STN prediction models based on Vis-NIRS using both conventional machine learning and deep learning methods with the LUCAS topsoil dataset; (2) to compare the performance of the built models to find which method is suitable for predicting STN content under different conditions; and (3) to evaluate the effectiveness and applicability of the models with respect to data characteristics. Based on the results, we believe that this paper can help researchers and engineers quickly build a robust and effective STN content prediction model for their own soil samples. Figure 1 shows the workflow for the STN prediction models based on Vis-NIRS. Firstly, the raw data are read in, and data screening is carried out to cleanse and segment the dataset. Secondly, the datasets are subjected to preprocessing in the conventional machine learning approaches. Finally, through each modeling process, the best model is obtained from each specific method according to the performance evaluation. The following sections expand upon the workflow in greater detail.

Materials and Methods
All the models are implemented on the Python platform using the Keras and Scikit-learn libraries. The process is run on a Linux server (Ubuntu 16.04) with 64 GB of RAM and an Nvidia GeForce GTX 1080Ti graphics card with 11 GB of video memory.

Dataset Description
In this paper, we use the open-access Land Use/Cover Area Frame Statistical Survey (LUCAS) Soil dataset. The project collected 19,036 topsoil samples, determined their soil parameters, and measured their Vis-NIR spectra after oven-drying. The published dataset consists of soil properties (clay, silt, and sand content, coarse fragments, pH, organic carbon content, nitrogen, etc.) and Vis-NIRS absorbance. The spectra cover the range from 400 to 2499.5 nm with a resolution of 0.5 nm [23]. Table 1 describes the LUCAS Soil dataset, which is available at: https://esdac.jrc.ec.europa.eu/content/lucas-2009-topsoil-data#tabs-0-description=0. Given the changes of experimental environment and the variability of sample origins, various noise signals and artifacts derive from sample pretreatment, instruments, and experiments. To reduce noise, data cleansing was carried out and samples with STN = 0 g/kg were removed. Accordingly, 19,019 samples were obtained as the raw experimental data used to establish the prediction models.
In most cases, only two data subsets are needed to calibrate and validate a model in conventional machine learning methods. In this paper, in order to make the most of reasonably distributed data and obtain the most reliable model, instead of random data division, the samples were first sorted in descending order according to their STN content. A set of samples was then extracted at regular intervals (every 5 samples) as the test subset (3804 samples), and the remaining samples formed the train subset (15,215 samples). These two subsets were used in the conventional methods-based modeling. When CNN modeling was investigated, however, the above train subset was further randomly partitioned into a validation subset (25% of the data, 3804 samples) and a train subset (75% of the data, 11,411 samples).
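As a sketch, the interval-based split described above can be reproduced in a few lines of NumPy (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def interval_split(X, y, step=5):
    """Sort samples by target value (descending), take every `step`-th
    sample as the test subset, and keep the rest as the train subset."""
    order = np.argsort(y)[::-1]            # indices sorted by descending STN content
    test_idx = order[::step]               # every 5th sample -> test subset
    mask = np.ones(len(y), dtype=bool)
    mask[test_idx] = False
    train_idx = np.arange(len(y))[mask]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

With 19,019 samples and step = 5, this yields 3804 test samples and 15,215 train samples, matching the counts reported above.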

Baseline Correction and Smoothing
Preprocessing is usually required to enhance calibration model accuracy by reducing un-modeled variations such as instrumental and experimental artifacts [24,25]. Baseline correction and smoothing are typical preprocessing approaches for Vis-NIRS analysis. In this paper, we selected the asymmetric least squares baseline correction method (AsLS) and the Savitzky-Golay smoothing method (SGS) to remove the noise from the raw spectral data.
Eilers et al. [26] proposed the AsLS approach, whose target function is presented in Equation (1):

S = Σ_i w_i (y_i − z_i)^2 + λ Σ_i (Δ^2 z_i)^2, (1)

where z represents the fitted spectrum; y_i is the raw datum at wavelength i nm; z_i is the fitted value at wavelength i nm; and w_i is the weight at wavelength i nm. λ is an adjustment coefficient, and Δ^2 z_i is the second-order difference operator presented in Equation (2):

Δ^2 z_i = (z_i − z_{i−1}) − (z_{i−1} − z_{i−2}) = z_i − 2z_{i−1} + z_{i−2}. (2)

When y_i is greater than z_i, w_i = p; when y_i is equal to or less than z_i, w_i = 1 − p. Here p is introduced as another adjustment coefficient, so there are two free parameters to tune: λ and p. In general, p takes values in the range 0.001-0.1, and λ takes values in the range 10^2-10^9. In this paper, p is set to 0.01 and λ is set to 10^9.
The SGS convolution smoothing algorithm is one of the most commonly used spectral smoothing methods. It fits a polynomial regression to the data in a local window and evaluates the polynomial at the center of the window. It not only inherits the speed and simplicity of the moving average algorithm for denoising, but also effectively retains the relative maxima, minima, and change information of the signal. Compared to traditional filters, SGS is faster and handles missing data quite well [27]. In this work, the window size is set to 9 and a third-order polynomial is used.
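A minimal sketch of this two-step pretreatment, assuming a common sparse-matrix implementation of AsLS with the paper's settings (λ = 10^9, p = 0.01, window size 9, third-order polynomial); the function names are illustrative:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve
from scipy.signal import savgol_filter

def asls_baseline(y, lam=1e9, p=0.01, niter=10):
    """Asymmetric least squares baseline (Eilers-style), Equations (1)-(2)."""
    L = len(y)
    # Second-order difference matrix, i.e. the Delta^2 operator of Equation (2)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(L - 2, L))
    w = np.ones(L)
    for _ in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        z = spsolve((W + lam * D.T @ D).tocsc(), w * y)
        w = p * (y > z) + (1 - p) * (y <= z)   # asymmetric reweighting
    return z

def pretreat(spectrum, window=9, polyorder=3):
    """Baseline correction followed by Savitzky-Golay smoothing."""
    corrected = spectrum - asls_baseline(spectrum)
    return savgol_filter(corrected, window_length=window, polyorder=polyorder)
```

The iteration count (`niter=10`) is an assumption; AsLS typically converges in a handful of reweighting steps.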

Dimensionality Reduction
Conventional machine learning approaches attempt to assign a weight to each feature or band (in the case of Vis-NIRS) based on how much information it contains. However, they do not perform well on high-dimensional spectral data, so dimensionality reduction is required. Principal component analysis (PCA) is one of the most popular and effective dimension reduction methods used to explore variation in complex datasets [28].
PCA can transform a high-dimensional dataset into a lower-dimensional orthogonal feature set while retaining the maximum information from the original high-dimensional dataset. To this end, the samples x_i are projected as W^T x_i onto a new hyperplane W, and the locations of the projection points carry the information of the new samples. The optimization function (Target) is expressed as Equation (3):

max_W tr(W^T X X^T W), s.t. W^T W = I, (3)

where I is the identity matrix and XX^T is the covariance matrix. Eigen-decomposition is performed on the covariance matrix to obtain the eigenvectors and their corresponding eigenvalues, and the eigenvalues are sorted in descending order. The eigenvectors corresponding to the required number of largest eigenvalues form the projection matrix W, which is the solution of PCA. In this paper, the number of principal components is set to 2-50.
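For example, the reduction to a small number of components can be done with scikit-learn's PCA; the data sizes below are synthetic stand-ins for the LUCAS spectra:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spectra = rng.normal(size=(100, 420))      # stand-in for Vis-NIR absorbance matrix X

pca = PCA(n_components=10)                 # the paper scans 2-50 components
scores = pca.fit_transform(spectra)        # samples projected onto W (the new basis)
explained = pca.explained_variance_ratio_.sum()  # information retained
```

Scanning `n_components` over 2-50 and keeping the best-performing downstream model reproduces the paper's tuning procedure.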

Feature Selection
Feature selection is another way to reduce a high-dimensional feature space: it ranks the features by a metric and eliminates all features that do not achieve an adequate score. Usually, feature selection methods are classified into three general groups: filter, wrapper, and embedded methods [29]. Because wrapper and embedded methods are tied to specific prediction models, we instead chose the mutual information (MI) method, which falls under the filter category, to select features.
With the MI method, the feature subset is obtained by calculating the mutual information between the features themselves and between the features and the target variable [30]. To measure the value of mutual information, information entropy is incorporated. Given a discrete random variable X, the entropy of X is denoted H(X) (Equation (5)), where p(x) is the prior probability of X:

H(X) = −Σ_x p(x) log p(x). (5)

MI has different expressions, and this paper chooses the KNN-based estimator (Equation (6)) proposed by Ross [31]:

I(X; Y) = ψ(N) + ψ(k) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩, (6)

where ψ(·) is the digamma function; k is the number of neighbors in the KNN (k-nearest neighbors) algorithm; n_x and n_y are the numbers of samples falling within the KNN range along X and Y, respectively; ⟨·⟩ denotes the average over all samples; and N is the number of samples.
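scikit-learn's `mutual_info_regression` implements a KNN-based MI estimator of this family, so the filter-type selection can be sketched as follows (the data sizes and `k=5` are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                             # 30 synthetic "bands"
y = X[:, 3] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=200)   # two informative bands

# Keep the 5 bands with the highest estimated MI with the target
selector = SelectKBest(mutual_info_regression, k=5)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)                  # indices of retained bands
```

As the paper notes, any such subset necessarily discards information from the full spectrum, which explains the comparatively weak downstream results.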

Ordinary Least Square Estimation (OLSE) Regression
Linear regression is one of the most commonly used models, being simple and easy to build. Given Vis-NIRS data x = (x_1; x_2; . . . ; x_d) with d wavelengths, the linear regression model of STN content can be presented as Equation (7):

f(x) = w^T x + b = w_1 x_1 + w_2 x_2 + . . . + w_d x_d + b, (7)

where w = (w_1; w_2; . . . ; w_d) is the weight vector and b is the bias. The model is determined once w and b are obtained, and the most common approach to obtain them is to minimize the mean-squared error. In this paper, the linear regression model was solved with the ordinary least square estimation (OLSE).
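A minimal OLSE sketch with scikit-learn, using synthetic data in place of the preprocessed spectra:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))            # stand-in for d = 20 wavelength features
w_true = rng.normal(size=20)
y = X @ w_true + 1.5                      # noiseless linear target, intercept b = 1.5

olse = LinearRegression().fit(X, y)       # solves w, b by minimizing the MSE
pred = olse.predict(X)
```

`LinearRegression` computes exactly the ordinary least squares solution of Equation (7), so on this noiseless example the fit is exact.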

RF Regression
Random forest is an ensemble algorithm that learns a collection of decision trees [32], and it is characterized by ease of modeling, fast calculation, and low computational overhead. Using decision trees as base learners, RF builds several randomized trees, each trained to predict the STN content: at each node, RF randomly chooses a specific number of wavelengths and finds the best split among them. Finally, RF generates predictions by averaging the outputs of the collection of randomized decision trees.
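The RF regressor just described can be sketched with scikit-learn; the hyperparameters (100 trees, square-root feature subsampling) are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 10))           # synthetic stand-in for spectral features
y = 3 * X[:, 0] + np.sin(4 * X[:, 1])     # nonlinear synthetic target

# max_features="sqrt": a random subset of wavelengths is considered at each split
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
score = rf.score(X, y)                    # training R^2 of the averaged ensemble
```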

ELM Regression
Extreme learning machine is a kind of feed-forward neural network. Once its parameters (such as the weights between the input layer and the hidden layer, and the thresholds of the hidden layer) are randomly determined, they do not change to adapt to new data; only the output weights need to be solved, so ELM learns very fast. In this paper, a three-layer ELM was built, the number of hidden layer nodes ranged from 2 to 1000, and the sine function was selected as the activation function.
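A minimal ELM sketch with the paper's sine activation; the random-weight scale (2.0) and hidden-layer size are illustrative choices, not the paper's tuned values:

```python
import numpy as np

class SineELM:
    """Three-layer ELM: fixed random input weights, least-squares output weights."""

    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.sin(X @ self.W + self.b)     # sine activation, as in the paper

    def fit(self, X, y):
        d = X.shape[1]
        # Input weights and biases are drawn once and never updated
        self.W = self.rng.normal(scale=2.0, size=(d, self.n_hidden))
        self.b = self.rng.uniform(-np.pi, np.pi, size=self.n_hidden)
        H = self._hidden(X)
        # Only the output weights are solved, via a least-squares fit
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

Because training reduces to one linear solve, scanning the hidden-layer size from 2 to 1000 (as the paper does) is computationally cheap.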

STN Prediction Models Based on Convolutional Neural Networks
Different from conventional machine learning, in a CNN a convolutional layer is used for feature extraction, together with a pooling layer to compress the features. Figure 2 presents an example of feature processing with convolution and max pooling; the convolutional layer is labeled Conv and the max pooling layer is labeled Pooling.
Figure 2. Feature processing of the receptive spectra region by one convolutional layer and one max pooling layer. This is an example of two layers, with the convolutional layer having a filter size of three and a stride of three and the max pooling layer having a pooling size of three. A green rectangle in the input layer represents one of the input spectral variables; a rectangle in the Conv or Pooling layer represents a neuron. One neuron in the pooling layer covers a receptive field of 9 original spectral variables.
The prediction performance of a CNN can be affected by the structure of the network. To evaluate the performance of different models, we built three STN prediction models incorporating the Inception module [33], whose structures are shown in Figure 3. The major difference among these models lies in the number of 1 × 1 convolutions, which affects the nonlinear fitting ability.

The input of each model is the 4200-dimensional raw spectroscopic data, and the output is the estimated STN content. The convolutional layers are labeled as Conv, the flatten layer as Flatten, and the fully connected layer as F1. The purple and yellow modules represent general convolutions of different sizes used to extract features. The green module represents the 1 × 1 convolution used to reduce the network parameters and enhance model fitting. The gray module represents max pooling (pool_size = 3) used to compress the features. The first CNN model (Model 1) is the simplest one, introducing the minimum number of 1 × 1 convolutions.
The second CNN model (Model 2) and the third CNN model (Model 3) have the same number of 1 × 1 convolutions, but Model 3 is deeper than Model 2. In this paper, zero-padding is used to retain the edge information, and the rectified linear unit (ReLU) is selected as the activation function for the convolutional layers and the fully connected layer in each model. To search for the local minimum of the objective function, we selected the Adam optimizer to train the models. Mean squared error (MSE) is adopted as the loss function, as presented in Equation (8):

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2, (8)
where y_i and ŷ_i are the measured and predicted values, respectively; n is the number of samples in the training set; and i indexes the i-th sample. The detailed hyperparameter settings are shown in Table 2.
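As a hedged sketch, an Inception-style 1-D CNN for the 4200-band spectra can be assembled in Keras as follows; the filter counts, kernel sizes, and depth are illustrative assumptions, since the paper's exact values live in Table 2 and Figure 3:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def inception_1d(x, filters=16):
    """Parallel 1x1, 3x1, 5x1 convolutions plus a pooling branch (sizes assumed)."""
    b1 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)  # 1x1 bottleneck
    b3 = layers.Conv1D(filters, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)  # 1x1 bottleneck
    b5 = layers.Conv1D(filters, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling1D(3, strides=1, padding="same")(x)
    bp = layers.Conv1D(filters, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

def build_stn_cnn(n_bands=4200):
    inp = layers.Input(shape=(n_bands, 1))          # raw 4200-band spectrum
    x = layers.Conv1D(16, 7, strides=3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(3)(x)                    # pool_size = 3, as in the paper
    x = inception_1d(x)
    x = layers.MaxPooling1D(3)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)       # fully connected layer F1
    out = layers.Dense(1)(x)                         # estimated STN content
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")      # Adam + MSE, Equation (8)
    return model
```

Stacking additional `inception_1d` blocks before the Flatten layer mimics the Model 2 → Model 3 deepening described above.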

Model Evaluation
The model performance is evaluated by the root mean squared error of prediction (RMSEP), the coefficient of determination (R^2), and the residual prediction deviation (RPD); their calculations are shown in Equations (9)-(11):

R^2 = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)^2 / Σ_{i=1}^{n} (y_i − ȳ)^2, (9)

RMSEP = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2), (10)

RPD = s_y / RMSEP, (11)

where y_i and ŷ_i are the measured and predicted values, respectively; n is the number of evaluated samples; and s_y is the standard deviation of the observed values, which is calculated using Equation (12):

s_y = sqrt(Σ_{i=1}^{N} (y_i − ȳ)^2 / (N − 1)), (12)

in which ȳ is the arithmetic mean of y_i and N is the number of samples in the test set. Note that R^2 in Equation (9) can vary from −∞ to 1 (i.e., it can yield negative values).
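The three metrics can be transcribed directly from Equations (9)-(12):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return (RMSEP, R^2, RPD); R^2 can be negative for poor predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmsep = np.sqrt(np.mean((y_true - y_pred) ** 2))
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rpd = np.std(y_true, ddof=1) / rmsep       # s_y uses the N - 1 denominator
    return rmsep, r2, rpd
```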

Impact of Preprocessing on Conventional Machine Learning Approaches
To study the effect of different pre-treatments on the conventional machine learning approaches, different pre-treatments (baseline correction (marked as Baseline), smoothing, dimensionality reduction (marked as Reduction), and feature selection (marked as Selection)) are adopted to pretreat the Vis-NIRS data. After preprocessing, the prediction models (ordinary least square estimation (OLSE), random forest (RF), and extreme learning machine (ELM)) are built on the new independent variable data, and the results are shown in Table 3. By comparing the impact of the different pretreatments on the three modeling methods, we can conclude that baseline correction is optimal, while MI feature selection performs worst. MI feature selection, as a preprocessing stage, selects a subset of relevant features for model construction; however, no subset can contain the full information of the raw spectral data, so the models built on MI feature selection have the worst results among all models. PCA transforms the raw data into a lower-dimensional orthogonal feature set while retaining the maximum amount of information from the original high-dimensional dataset; it performs better than MI feature selection, but worse than the raw data without any pre-treatment. Baseline correction and smoothing filter the noise without reducing the information of the raw data. Minor differences were observed between them with the OLSE model, but baseline correction works better than smoothing with the RF and ELM models, and both perform better than the models based on the raw data.
Different prediction models with the same pre-treatment result in different performance. To observe and evaluate the stability of the different models, the coefficients of determination (R^2) of the three models are plotted in Figure 4. As can be seen in Figure 4, ELM shows the best adaptability and OLSE the worst, owing to the non-linearity of ELM.
To observe and evaluate the usability of the different models, the RPD heatmap is plotted in Figure 5. If RPD is below 1.5, the model cannot be used for STN prediction due to poor performance. If it is between 1.5 and 1.8, the model needs to be improved to obtain better results. For RPD values in the range of 1.8-2.0, the STN prediction model is considered practical, and if it is higher than 2.0, the model performance is considered very good [28]. For the OLSE model, the RPD value is 1.62 with the raw data, but it decreases to 0.56 with MI feature selection; in this case, OLSE cannot be used for STN prediction. Baseline correction and PCA pretreatments can improve the usability of the RF model, but the RPDs are significantly reduced if MI feature selection is involved. The best usability is obtained from the ELM model: its RPD value reaches 2.34 with baseline correction, so it can be used for prediction.


The Comparison among Three CNN Models and Conventional Models
To evaluate the model performance of the different CNNs, the training process was repeated 25 times for each model. The results are shown in Table 4. Overall, the CNNs based on raw data obtain better results than the conventional machine learning models. Model 3 provides the best prediction for STN (R^2 = 0.93, RMSEP = 0.95 g/kg, and RPD = 3.85 in the optimal case). The feature processing with convolution and pooling retains the information of the raw data without excising features (or wavelengths). Meanwhile, the nonlinear model (ELM) performs better than the linear model (OLSE), and accuracy is improved by introducing nonlinear activation functions. Thus, among all the built models, the deep learning method shows encouraging results, and the CNN model with deeper convolutional layers achieves the best performance among the three CNN structures built in this paper.

Impact of Data Properties on Prediction Models
Data properties, such as sample size, may have an impact on the prediction models. To study and reveal this impact, this paper formed datasets with different data properties (Table 5). It should be noted that these datasets are randomly sampled from the raw dataset, in order to avoid introducing instrumental and experimental artifacts due to different collections. According to the above modeling results, four models with different pre-treatments are selected to compare performance on different data properties: baseline-corrected OLSE, baseline-corrected RF, baseline-corrected and smoothed ELM, and Model 3 of the CNN. The performance of each model is shown in Figure 6, where R^2 is marked by the lines and RMSEP by the histograms; the Inception (CNN) model is labeled red, the ELM model blue, the RF model yellow, and the OLSE model purple.
The prediction accuracy of the models is poor with small sample data, and the accuracy improves as the amount of data increases. The performance of ELM and RF changes little across the different datasets, but OLSE fluctuates significantly. To evaluate the usability of the different models on different datasets, the RPD heatmap is plotted in Figure 7. Figure 7 shows that, as the amount of data increases, the RPD value increases. The CNN incorporating the Inception module performs better than the other models across the different datasets.
In order to clarify the effectiveness and applicability of STN prediction models according to data characteristics, we separated the samples by country to form new datasets (Table 6). Of these, France has the largest number of samples (2807) and Luxembourg the smallest (three). Based on the country datasets, we compared the performance of the different models (baseline-corrected OLSE, baseline-corrected RF, baseline-corrected and smoothed ELM, and Model 3 of the CNN); the results are shown in Figure 8. The exception is the 15th dataset, whose results are set to 0 due to its small number of samples. The performance of the OLSE model is the worst among the four models. The RF model predicts via binary tree splits and captures some nonlinearity, but its performance needs improvement. For the CNN incorporating the Inception module, the accuracy depends on sample size and data distribution, and its credibility decreases as the sample size decreases.
ELM model preforms the best in the four predicted models with the dataset divided based on different countries. From a practical point of view, the CNN model is always trained by the large size of samples and it can be used in each country soil samples. So, we selected the CNN model with total samples in optimal case (R 2 is 0.93; RMSEP is 0.95 g/kg; and RPD is 3.85) to predict the STN content with different datasets from different countries, and compared with ELM model (Figure 9). Overall, the CNN model preforms better than ELM, but it is apparently worse in smaller size. The performance of the OLSE model is the worst among four models. RF model is predicted by dichotomy and presents some nonlinearity, but its performance needs to be improved. For the CNN incorporated Inception module, the accuracy depends on sample size and data distribution, and its credibility decreases as the sample size decreases. ELM model preforms the best in the four predicted models with the dataset divided based on different countries.
From a practical point of view, the CNN model is always trained by the large size of samples and it can be used in each country soil samples. So, we selected the CNN model with total samples in optimal case (R 2 is 0.93; RMSEP is 0.95 g/kg; and RPD is 3.85) to predict the STN content with different datasets from different countries, and compared with ELM model (Figure 9). Overall, the CNN model preforms better than ELM, but it is apparently worse in smaller size. From a practical point of view, the CNN model is always trained by the large size of samples and it can be used in each country soil samples. So, we selected the CNN model with total samples in optimal case (R 2 is 0.93; RMSEP is 0.95 g/kg; and RPD is 3.85) to predict the STN content with different datasets from different countries, and compared with ELM model (Figure 9). Overall, the CNN model preforms better than ELM, but it is apparently worse in smaller size. According to the research [34], a close relationship exists between organic carbon (OC) and N levels in soil. The higher OC concentration, the greater the N concentration. Moreover, the C-to-N ratio is relatively stable across different soil types. In order to investigate reliability of different STN prediction model, we divided the LUCAS dataset into 24 soil groups according to different OC concentrations by K-means clustering algorithm (Table 7), and the modeling results are shown in Table 8.  Figure 9. Effects of predicted with CNN and ELM.
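The three evaluation metrics quoted throughout (R2, RMSEP, and RPD) can all be computed from a set of reference values and predictions. A minimal NumPy sketch follows; the STN arrays are invented for illustration and are not taken from the LUCAS dataset:

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Return R2, RMSEP, and RPD for a prediction set.

    RPD is the standard deviation of the reference values divided
    by the RMSEP; RPD > 2 is commonly taken to indicate a usable
    quantitative model.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmsep = np.sqrt(np.mean(residuals ** 2))
    rpd = y_true.std(ddof=1) / rmsep
    return r2, rmsep, rpd

# Hypothetical STN values (g/kg), for illustration only
y_true = [1.2, 2.5, 3.1, 4.8, 6.0, 7.4]
y_pred = [1.0, 2.7, 3.0, 5.1, 5.8, 7.6]
r2, rmsep, rpd = evaluation_metrics(y_true, y_pred)
```

Because RPD normalizes the prediction error by the spread of the reference values, it allows the comparisons across datasets of different STN ranges made in Figure 7.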
According to the research [34], a close relationship exists between organic carbon (OC) and nitrogen levels in soil: the higher the OC concentration, the greater the N concentration. Moreover, the C-to-N ratio is relatively stable across different soil types. In order to investigate the reliability of the different STN prediction models, we divided the LUCAS dataset into 24 soil groups according to OC concentration using the K-means clustering algorithm (Table 7); the modeling results are shown in Table 8.
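Grouping samples by OC concentration with K-means amounts to one-dimensional clustering. A minimal sketch of the idea using plain NumPy Lloyd iterations follows; the OC values, the quantile initialisation, and the three groups are illustrative assumptions (the paper clusters the full LUCAS dataset into 24 groups):

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on a 1-D array with quantile initialisation."""
    values = np.asarray(values, dtype=float)
    # Spread initial centroids across the value range
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        # Assign each value to its nearest centroid
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned values
        new_centroids = np.array([
            values[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical OC concentrations (g/kg), clustered into 3 groups
oc = np.array([5.0, 6.1, 5.5, 20.0, 22.3, 21.1, 80.0, 85.5, 79.2])
labels, centroids = kmeans_1d(oc, k=3)
```

The quantile initialisation makes the toy example deterministic; a production run would typically use a library implementation with multiple random restarts.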
According to Table 8, the accuracy of every prediction model becomes worse when the data are divided by OC content. We do not think this is because no relationship exists between OC and STN; rather, it is because different countries have different agricultural production practices and farming histories, so their soil physical structures and properties differ. It is therefore unreasonable to divide the data by OC concentration alone; country must be combined with OC concentration to obtain a high-accuracy STN prediction model. Another conclusion that can be drawn is that, given the poor performance of OLSE, the original soil spectra do not have a linear relationship with the STN content.
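The OLSE model discussed above fits a single linear map from the spectrum to the STN value, which is why its failure is evidence against a linear spectrum-nitrogen relationship. A minimal sketch of such a fit with NumPy on synthetic data (the "spectra", dimensions, and coefficients are invented for illustration; on a genuinely linear relation the fit succeeds, whereas a nonlinear one would leave a poor R2, consistent with the OLSE results):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "spectra": 50 samples x 10 wavelengths with a known
# linear relation plus noise, purely for illustration
X = rng.normal(size=(50, 10))
true_coef = rng.normal(size=10)
y = X @ true_coef + rng.normal(scale=0.1, size=50)

# Ordinary least squares: append an intercept column and solve
# min ||Xb - y||^2 with lstsq
X_design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ coef
```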

Discussion
Agricultural production [35][36][37], soil structure [38], soil properties [39], the characteristics of the soil sample data [40], and the data preprocessing methods [16,41] all affect STN concentration prediction modeling and model performance. According to the results achieved in this paper, the STN prediction models with good performance and generalization come from datasets that are larger and more evenly distributed within a country. Data size [22] and data distribution [42] have a major impact on predictive performance: prediction accuracy is poor when the sample size is small and/or the data are unevenly distributed, and it improves as the amount of data increases and/or the distribution becomes more even. Since different countries have different farming histories and their soil physical structures differ, more suitable models can be built for each country, or even for each farmland within a country. The clustering results in this paper show that the C-to-N ratio is relatively stable across countries, since the distribution of STN content follows the OC concentration. However, because this data segmentation breaks national boundaries, the resulting models perform significantly worse. This indicates that, for different types of soil from different countries, a robust STN prediction model is hard to build even when the carbon content falls in the same range. The ideal dataset therefore contains enough samples of the same soil type with similar physical structure, evenly distributed, with accurate STN concentrations and accurate soil spectra. When such an ideal dataset is not available, a deep learning modeling approach is very helpful.
As for the modeling of soil total nitrogen concentration, this study shows that both conventional and deep learning approaches can achieve practical accuracy. Data preprocessing is very important for conventional modeling approaches [43]. Among the pre-treatments applied to the conventional machine learning methods, baseline correction performed best and MI feature selection performed worst; the reason is that MI feature selection removed useful information, which worsened model performance. For the deep learning approach, network construction is critical to model performance. In terms of CNN structure, the predictive accuracy of the CNN increases with the number of convolution layers, owing to the nonlinear activation functions. Comparing the modeling approaches, the deep learning method, CNN, provides good performance and robust generalization for STN content modeling, and it can serve as a benchmark model for all soil types and all countries as long as there are enough training samples from the different soil types and countries.
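The two pre-treatments that helped most, baseline correction and smoothing, can each be sketched in a few lines. Here the baseline is approximated by a low-order polynomial fitted to the spectrum and subtracted, and smoothing is a simple moving average; the test spectrum, polynomial order, and window size are illustrative choices, not the exact settings used in the paper:

```python
import numpy as np

def baseline_correct(spectrum, order=2):
    """Subtract a fitted low-order polynomial baseline."""
    x = np.arange(len(spectrum))
    baseline = np.polyval(np.polyfit(x, spectrum, order), x)
    return spectrum - baseline

def smooth(spectrum, window=5):
    """Moving-average smoothing with edge padding (length-preserving)."""
    kernel = np.ones(window) / window
    padded = np.pad(spectrum, window // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# Hypothetical reflectance spectrum: a narrow peak on a sloping baseline
x = np.linspace(0.0, 1.0, 200)
spectrum = 0.5 * x + np.exp(-((x - 0.5) ** 2) / 0.002)
corrected = smooth(baseline_correct(spectrum))
```

In practice a Savitzky-Golay filter is a common choice for the smoothing step, since it preserves peak shapes better than a plain moving average.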

Conclusions
STN content prediction based on Vis-NIR spectroscopy is becoming increasingly feasible, but its performance is affected by various factors, such as soil properties, the modeling method, and the characteristics of the dataset. In order to find a practical way to build an STN content prediction model, three conventional machine learning methods and one deep learning approach were investigated, and their predictive performances were compared and analyzed. Based on the results, the following conclusions are drawn: (1) Misuse of preprocessing (such as feature selection) may introduce artifacts or remove useful patterns and worsen model performance, whereas some preprocessing (such as baseline correction) may improve the results. (2) Different data characteristics affect the modeling approaches differently, so the ideal dataset contains enough samples of the same soil type with similar physical structure, evenly distributed, with accurate STN concentrations and accurate soil spectra.