Multi-Feature Optimization Study of Soil Total Nitrogen Content Detection Based on Thermal Cracking and Artificial Olfactory System

To improve the accuracy of detecting soil total nitrogen (STN) content with an artificial olfactory system, this paper proposes a multi-feature optimization method for STN detection based on an artificial olfactory system. Ten different metal-oxide semiconductor gas sensors were selected to form a sensor array that collects soil gas and generates response curves. Six features of each sensor response curve (response area, maximum value, average differential coefficient, standard deviation value, average value, and 15th-second transient value) were extracted to construct an artificial olfactory feature space (10 × 6). Backpropagation neural network (BPNN), extreme learning machine (ELM), and partial least squares regression (PLSR) models were established to relate the feature space to STN content, and the coefficient of determination (R²), root mean square error (RMSE), and ratio of performance to deviation (RPD) were selected as prediction performance indicators. Monte Carlo cross-validation (MCCV) and K-means improved leave-one-out cross-validation (K-means LOOCV) were each used to identify and remove abnormal samples in the feature space, and a BPNN model was established after each rejection. Both rejection methods improved the model significantly; MCCV was superior, with R², RMSE, and RPD values of 0.75671, 0.33517, and 1.7938, respectively. After the abnormal samples were removed, feature optimization and dimensionality reduction were performed using principal component analysis (PCA) and a genetic algorithm-optimized backpropagation neural network (GA-BP). The test results showed that the models performed better after feature optimization than before, and the PLSR model with GA-BP feature optimization had the best prediction effect, with an R² of 0.93848, an RPD of 3.5666, and an RMSE of 0.16857 on the test set. Compared with the values before optimization, R² and RPD improved by 14.01% and 50.60%, respectively, and RMSE decreased by 45.16%, which effectively improved the accuracy of the artificial olfactory system in detecting STN content and enabled more accurate quantitative prediction.


Introduction
The sum of the various forms of nitrogen in the soil is called soil total nitrogen (STN). For arable soils, fertilization systems, crop rotations, and utilization patterns all have a strong influence on STN content. STN is also an essential indicator for maintaining crop yield and plays a vital role in crop development and agroecosystems [1][2][3].
In precision agriculture, obtaining information on dynamic changes in STN is important for improving nitrogen fertilizer utilization and cropping patterns [4,5]. It is therefore important to improve the accuracy of STN measurement and to obtain STN information more rapidly [6].
The Kjeldahl and Dumas combustion methods are the classical methods for determining STN. They have high measurement accuracy but are time-consuming and laborious, and the chemical reagents used can cause secondary contamination. Elemental analyzers based on the Dumas combustion method are fast but expensive, require high-precision analytical balances, and the copper powder normally required for the reduction reaction is environmentally hazardous. In recent years, STN detection methods based on remote sensing and spectral analysis have received attention from many scholars because they are non-destructive, accurate, and efficient. Zhang et al. [7] used the CASI-1500 aerial hyperspectral imaging system to capture soil spectral information and three models to predict total nitrogen values in black soils, demonstrating that hyperspectral remote sensing is an efficient method for estimating soil nutrient content. Li et al. [8] applied hyperspectral techniques, extracted characteristic wavelengths using uninformative variable elimination (UVE) and the successive projection algorithm (SPA), and then combined partial least squares (PLS) and extreme learning machine (ELM) to build an STN prediction model with improved prediction results. Although these methods compensate for the shortcomings of the classical methods to a certain extent, the high cost of the analytical instruments and the influence of the atmosphere and of iron oxide in the soil severely limit their application [9].
Thermal cracking breaks large-molecule compounds into volatile small-molecule gases. By using a gas sensor array to capture information on the cracking gas, an artificial olfactory system can detect STN content. This method is convenient, fast, and inexpensive, and the gas sensors are cheap and reusable. However, redundant samples and the curse of dimensionality reduce machine learning efficiency, pattern recognition accuracy, and data mining efficiency, and increase the experimental workload to some extent [10]. Shi et al. [11] used various methods to reject abnormal samples in near-infrared (NIR) detection to improve model prediction performance. Ji Ma et al. [12] introduced principal component analysis for dimensionality reduction to make it easier for deep learning to extract image features, and verified its feasibility with simulation experiments. Xu K et al. [13] used mean analysis, coefficient of variation analysis, cluster analysis, and correlation analysis to obtain an optimized electronic nose feature matrix for detecting hickory, and built PLSR and backpropagation neural network (BPNN) regression models, showing that the optimization improved the performance of the electronic nose and reduced the dimensionality of the data. Vung Pham et al. [14] proposed an interactive visualization method for portable X-ray fluorescence (pXRF) data analysis of soil profiles and introduced a model, RDNet, that accurately predicted pH(H2O) and pH(KCl). Antonios Morellos et al. [15] compared the predictive performance of two linear multivariate methods (principal component regression and partial least squares regression) and two machine learning methods (least squares support vector machines and Cubist) for total soil nitrogen, organic carbon, and moisture, based on near-infrared spectral data collected from 140 soil samples.
To address the above problems, this paper explores how to improve the performance of an STN determination method based on thermal cracking and an artificial olfactory system, using the coefficient of determination (R²), root mean square error (RMSE), and ratio of performance to deviation (RPD) on the test set as measures. The first stage of optimization (abnormal sample rejection) identifies abnormal samples in the dataset by comparing the Monte Carlo cross-validation (MCCV) method with the K-means improved leave-one-out cross-validation (K-means LOOCV) method. The better rejection method is selected by comparing the performance of the BPNN model before and after these abnormal samples are rejected. In the second stage of optimization (feature dimensionality reduction), the dimensionality of the soil olfactory space is reduced using principal component analysis (PCA) and a genetic algorithm-optimized backpropagation neural network (GA-BP), and BPNN, ELM, and partial least squares regression (PLSR) models are established. The experimental results showed that the method proposed in this study (MCCV + GA-BP) effectively improves the performance indicators of the artificial olfactory system for detecting STN.

Study Area and Soil Samples
The study area is located in northeastern China (44°50′ N, 121°38′ E, 46°19′ N, 131°19′ E), as shown in Figure 1, with a sampling area of 187,400 km². It has a temperate continental monsoon climate with an average annual temperature of 5.1 °C and an average annual rainfall of 400-600 mm. It lies in one of the world's three prime maize belts and is one of the most important grain-producing regions in China, particularly for maize and rice. The main soil types in the region include dark brown loam, black calcium soil, white pulp soil, herbaceous soil, and black soil. Due to the crop rotation pattern and improper fertilization, the total nitrogen content of the soil has decreased; this study therefore aims to help optimize fertilization to improve the soil nutrient structure and protect the black soil.
Soil sampling was conducted from 1 September to 20 October 2020. Based on land-use status, topographic features, and similar factors, and following the principles of randomness and representativeness, a total of 121 sampling points were selected, as shown in Figure 1. The latitude and longitude of the sampling points were recorded by GPS, soil was collected at a depth of 0-20 cm with a soil extractor [16], and stones, plant debris, and roots were removed. The collected samples were placed in self-sealing bags and brought back to the laboratory for processing. Each sample was divided into two parts. One part was naturally dried at 25 °C, crushed, passed through a 0.2 mm nylon sieve to filter out impurities, bagged, and set aside, and its STN content was measured by the Kjeldahl method; the total nitrogen content obtained in this way was taken as the actual value. The other part was used for measuring STN content with the artificial olfactory system and required no special treatment.

Research on Artificial Olfactory System
The artificial olfactory system is divided into three main parts [17]: sample preparation, the detection system, and the data processing system. Figure 2 shows the hardware components of the detection system, which mainly include the muffle furnace, gas sensor array, reaction chamber, valve, gas pump (for cleaning the reaction chamber), signal processing module, data acquisition card, and computer. The muffle furnace, manufactured by Thermo Scientific Lindberg, USA, is used to crack large-molecule compounds in the soil. The gas sensor array is located in the reaction chamber and communicates with the signal processing module through a flexible flat cable (FFC). The data acquisition card connects to the signal processing module through a DuPont cable and transfers the acquired data to the computer via USB for display and storage. Power is supplied to the signal processing module by a 12 V power adapter. The gas sensor array is the basis of the detection system. As shown in Table 1, this study uses MOS sensors produced by Figaro (Japan) specifically for high-precision detection of low-concentration gases. This sensor array has high specificity and some cross-sensitivity, which improves accuracy compared with an array of a single sensor type. The signal processing module powers the sensor array and provides the measurement circuit output. A USB-6210 acquisition card from National Instruments (NI) was used to acquire the gas sensor array response data.

Table 1. Sensor type table.

| Sensor type | Gases detected | Measurement range |
|---|---|---|
| TGS826 | Ammonia | 30-300 ppm |
| TGS2602 | Toluene, ammonia, hydrogen sulfide | 1-30 ppm |
| TGS2610 | Propane, butane | 500-10,000 ppm |
| TGS2620 | Ethanol, organic solvent | 50-5000 ppm |
| TGS821 | Hydrogen | 100-1000 ppm |
| TGS2603 | Trimethylamine, methyl mercaptan, etc. | 1-10 ppm |
| TGS2611 | Methane, natural gas | 500-10,000 ppm |
| TGS823 | Methane, ethanol vapor | 50-300 ppm |
| TGS2600 | Hydrogen, alcohol, etc. | 1-30 ppm |
| TGS2612 | Methane, propane, isobutane | 3000-9000 ppm |

After the system was started, 3 g of soil sample was weighed with an electronic scale and placed inside a quartz boat, which was placed in the middle of the quartz tube and sealed with vacuum flanges at both ends. The cracking temperature and time were 450 °C and 2 min, respectively [18]. Once cracking was complete, the flanges on both sides were opened and the vacuum pump fed the soil gas into the reaction chamber as detection started. The sampling time was 80 s, and the sensor array converted the soil gas information into a voltage signal through the signal processing circuit to generate a soil sample response curve. After each test, the gas chamber was cleaned with clean air at 1200 mL·min⁻¹, and the quartz boat and quartz tube were washed with water; the interval between two consecutive samplings was 5 min. The tests were completed sequentially according to the soil sample number.

Feature Extraction
The ten response curves obtained for each soil sample were first smoothed by Savitzky-Golay convolution filtering. Six characteristic values were then extracted from each sensor response curve: response area (V_RAV), mean differential coefficient (V_MDC), standard deviation value (V_SDV), mean value (V_MV), maximum value (V_MAX), and 15th-second transient value (V_15TH), giving a total of 60 features per soil sample. Since differences in magnitude among the features are not conducive to model building, the z-score method was chosen to standardize the data. V_RAV, V_MDC, V_SDV, and V_MV are computed from the following quantities: X_i is the i-th data point collected by the sensor, ∆t is the time interval between two adjacent sampling points (0.1 s), and N is the total number of collected data points.
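The defining equations for the four computed features are not reproduced in the text, so the sketch below uses plausible forms built only from X_i, ∆t, and N. These forms are assumptions: a rectangle-rule sum for the response area, the average first difference divided by ∆t for the mean differential coefficient, and the sample (N - 1) standard deviation. The function names are hypothetical, and the input is assumed to be already smoothed.

```python
import statistics


def extract_features(x, dt=0.1, t_transient=15.0):
    """Return six features of one smoothed sensor response curve.

    x is a list of voltages sampled every dt seconds, starting at t = 0.
    The formulas are assumed forms, not taken from the paper.
    """
    n = len(x)
    return {
        "V_RAV": sum(x) * dt,  # response area, rectangle rule (assumed)
        "V_MDC": sum((x[i + 1] - x[i]) / dt for i in range(n - 1)) / (n - 1),
        "V_SDV": statistics.stdev(x),  # sample standard deviation (assumed N - 1)
        "V_MV": statistics.fmean(x),
        "V_MAX": max(x),
        "V_15TH": x[round(t_transient / dt)],  # value at t = 15 s
    }


def z_score(column):
    """Standardize one feature column (one value per soil sample)."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)  # sample deviation, an assumption
    return [(v - mean) / std for v in column]
```

Applying `extract_features` to each of the ten sensor curves of a sample and concatenating the results gives the 60-element feature vector; `z_score` is then applied column by column across the samples.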

Training Set and Test Set Division
To find the optimal tuning parameters, prevent overfitting, and improve the generalization ability of the model, the data set was divided by the Kennard-Stone method in a ratio of 7:3, giving 85 training samples and 36 test samples. Table 2 shows the statistics of the STN content of the samples measured by the Kjeldahl method. The variance and mean values were 0.60 g·kg⁻¹ and 1.64 g·kg⁻¹ for the test set and 0.51 g·kg⁻¹ and 1.56 g·kg⁻¹ for the verification set, respectively, so the two sets can be regarded as showing no significant difference.
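As a rough illustration of the 7:3 division, the following is a minimal sketch of the standard Kennard-Stone selection on Euclidean distances. The function name and the plain-list data layout are hypothetical, and the paper's own implementation is not shown in the text.

```python
import math


def kennard_stone_split(X, n_train):
    """Return (train_indices, test_indices) for feature rows X.

    Starts from the two most distant samples, then repeatedly adds the
    remaining sample whose nearest selected neighbour is farthest away.
    Assumes 2 <= n_train <= len(X).
    """
    n = len(X)
    dist = [[math.dist(a, b) for b in X] for a in X]
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = [i for i in range(n) if i not in (i0, j0)]
    while len(selected) < n_train:
        nxt = max(remaining, key=lambda r: min(dist[r][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining
```

For the 121 samples in this study, `n_train = round(0.7 * 121)` gives the 85 training samples stated above, leaving 36 for testing.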

Sensor Array for Full Nitrogen Feature Space Response
To verify whether the composition of the sensor array was reasonable, cracking gas data were selected from samples with total nitrogen contents of 0.2 mg·kg⁻¹ and 4.10 mg·kg⁻¹, respectively, and Figure 3 was obtained. As shown in the figure, each sensor responded very differently to the two soil gases, indicating that the array is sensitive to differences in the cracking gases. The response results indicate that the selected sensor array is reasonable.

Abnormal Sample Removal Method
The main causes of abnormal samples are design errors in the artificial olfactory system itself, the complexity of the samples, and instability of the instrument state [19]. During modeling, these abnormal samples heavily interfere with the prediction performance of the model. This study therefore uses MCCV and K-means LOOCV separately to identify and remove abnormal samples, and selects the better removal method for the artificial olfactory system by comparing the prediction performance indicators of the BPNN model after each treatment.

Monte Carlo Cross Validation Method
Monte Carlo cross-validation (MCCV) is a hypothesis-based method [20,21]. In this study, MCCV is used to identify abnormal samples in the olfactory feature space. First, 70% of the samples are randomly selected as the calibration set to construct a BPNN model, and the remaining 30% are predicted. This process is repeated to construct multiple BPNN models. Finally, the models are ranked in ascending order of the sum of squared prediction residuals (PRESS), and the cumulative probability (f_ac) of each sample is calculated to identify abnormal samples. PRESS and f_ac are defined in Equations (5) and (6). In Equation (5), k is the number of predicted samples, and ŷ_i and y_i are the predicted and observed values of the i-th sample. In Equation (6), m is the sample ordinal number, n is the ordinal number of the sorted model, f_mn indicates whether sample m appears in the calibration set of model n (1 if it appears and 0 otherwise), and N is the total number of samples (121 in this study). Because the models have been sorted by PRESS, the change of f_ac with the model ordinal number reflects how often each sample appears in the calibration sets of the ranked models. As the model ordinal number increases, the f_ac of a normal sample remains close to the 70% sampling rate, whereas the f_ac of an abnormal sample deviates from it.
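The procedure above can be sketched as follows. Equations (5) and (6) are not reproduced in the text, so the exact form of f_ac used here is an assumption: the fraction of the best-ranked models whose calibration set contains a given sample. The function name, the `fit_predict` callback (a stand-in for training and applying a BPNN), and the `top_fraction` parameter are hypothetical.

```python
import random


def mccv_calibration_frequency(y, fit_predict, n_runs=200, cal_ratio=0.7,
                               top_fraction=0.5, seed=0):
    """Estimate, per sample, how often it sits in the calibration set of
    the lowest-PRESS models (an assumed form of f_ac).

    fit_predict(cal, val) must train on the indices in cal and return
    predictions for the indices in val, in the same order.
    """
    rng = random.Random(seed)
    n = len(y)
    n_cal = round(cal_ratio * n)
    runs = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        cal, val = idx[:n_cal], idx[n_cal:]
        y_hat = fit_predict(cal, val)
        # PRESS: sum of squared prediction residuals on the held-out part
        press = sum((p - y[i]) ** 2 for p, i in zip(y_hat, val))
        runs.append((press, set(cal)))
    runs.sort(key=lambda r: r[0])  # ascending PRESS
    top = runs[:max(1, int(top_fraction * n_runs))]
    return [sum(i in cal for _, cal in top) / len(top) for i in range(n)]
```

A sample whose frequency sits far from the 0.7 sampling rate among the best-ranked models is a candidate abnormal sample; the threshold for "far" is left to the analyst here.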

K-Means LOOCV Cross Validation
Leave-one-out cross-validation (LOOCV) treats each sample in turn as a potential abnormal sample and trains one prediction model per sample, which is a computationally intensive process [22]. K-means LOOCV improves on LOOCV for abnormal sample identification, where plain LOOCV is time-consuming and prone to misclassification. The olfactory space is first clustered with the K-means method using a preset number of clusters. The clusters are then screened: because normal samples are more concentrated while abnormal samples are more scattered, the clusters containing fewer samples are taken as suspicious abnormal samples. The remaining samples, with the suspicious abnormal samples removed, are used as the training set to train a BPNN prediction model. The LOOCV step is then carried out on this basis, which reduces the time needed to train the models.

Principal Component Analysis
Principal component analysis (PCA) is a mathematical dimensionality reduction method [23][24][25][26]. The covariance matrix of the olfactory space is first calculated, and its eigenvalues and corresponding eigenvectors are found. The eigenvectors are ranked by the magnitude of their eigenvalues to obtain the eigenvector matrix, the first k (1 ≤ k < 60) vectors of this matrix are selected, and the original olfactory space is reduced to k dimensions. The value of k is determined from the cumulative contribution of variance information G(k) in Equation (8). In Equation (7), p_i is the variance contribution rate, and γ_i and γ_j denote the i-th (i < k) and j-th (j < 60) sorted eigenvalues, respectively.
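A minimal NumPy sketch of this reduction is given below. It assumes the usual definitions, in which the contribution rate is an eigenvalue divided by the sum of all eigenvalues and G(k) is the running sum of those rates; Equations (7) and (8) themselves are not reproduced in the text. The 0.95 threshold and the function name are assumptions for illustration only.

```python
import numpy as np


def pca_reduce(X, threshold=0.95):
    """Project X (samples x features) onto the first k principal
    components, with k the smallest value for which G(k) >= threshold."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    p = eigvals / eigvals.sum()                 # contribution rates
    G = np.cumsum(p)                            # cumulative contribution
    k = min(int(np.searchsorted(G, threshold)) + 1, len(G))
    return Xc @ eigvecs[:, :k], p, G, k
```

For the standardized 60-dimensional olfactory space, the returned score matrix would have one row per soil sample and k columns.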

GA-BP Optimization
To preserve the useful information in the soil olfactory feature space to a greater extent, the genetic algorithm-backpropagation neural network (GA-BP) optimization method is used in soil total nitrogen detection. The process is shown in Figure 4. A total of 30 individuals are randomly selected to form a population. The chromosome coding length equals the dimension of the original olfactory space (60), the coding is binary, and each feature vector corresponds to one gene on the chromosome. If the value of a gene is 1, the corresponding feature vector participates in BPNN modeling; if the gene is 0, it does not. The feature vectors selected by an individual are used to build a BPNN model, the model is trained with the data in the training set, and the inverse of the sum of squared errors on the test set is used as the fitness function. Letting g(x) be the fitness function, ŷ_i the value predicted by the BPNN model for the i-th sample in the test set, y_i the observed value of that sample, and n the number of test set samples, the fitness can be expressed as follows:

$$g(x) = \frac{1}{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (9)$$

The algorithm first determines whether the termination condition is satisfied. If it is, the preferred feature vector is output and the run ends; otherwise, the genetic operations of selection, crossover, and mutation are performed to generate a new population [27,28], and the above steps are repeated until the output condition is satisfied.
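The loop described above can be sketched as follows. The population size of 30 and the 60-bit binary chromosome come from the text; everything else is an assumption made for illustration: tournament selection, single-point crossover, bit-flip mutation, elitism, the probabilities, and a fixed number of generations as the stopping rule. `evaluate_sse` is a hypothetical callback that would train a BPNN on the features selected by the chromosome and return the sum of squared errors.

```python
import random


def fitness(sse):
    """Inverse of the sum of squared errors, as in Equation (9)."""
    return 1.0 / sse if sse > 0 else float("inf")


def run_ga(evaluate_sse, n_genes=60, pop_size=30, generations=40,
           p_cross=0.8, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]

    def score(ch):
        return fitness(evaluate_sse(ch))

    best = max(pop, key=score)
    history = [score(best)]
    for _ in range(generations):
        scores = [score(ch) for ch in pop]

        def pick():
            # binary tournament selection (assumed operator)
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        new_pop = [best[:]]  # elitism: carry the best individual over unchanged
        while len(new_pop) < pop_size:
            c1, c2 = pick()[:], pick()[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_genes)  # single-point crossover
                c1, c2 = c1[:cut] + c2[cut:], c2[:cut] + c1[cut:]
            for child in (c1, c2):
                for g in range(n_genes):
                    if rng.random() < p_mut:
                        child[g] ^= 1  # bit-flip mutation
                new_pop.append(child)
        pop = new_pop[:pop_size]
        best = max(pop, key=score)
        history.append(score(best))
    return best, history
```

With a deterministic `evaluate_sse`, elitism keeps the best fitness from decreasing between generations. A real BPNN evaluation is stochastic and expensive, so caching the score of each chromosome and rejecting the all-zero chromosome would be sensible additions.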

Pattern Recognition Prediction Model for Artificial Olfactory System

BPNN Prediction Algorithm
The backpropagation neural network (BPNN) is a multilayer feedforward neural network with a simple structure and strong nonlinear mapping capability [29,30]. According to Kolmogorov's theorem, a three-layer network containing one hidden layer can approximate any nonlinear function [31]. This paper uses the neural network toolbox in MATLAB (2019a), a mathematical software package produced by MathWorks (Natick, MA, USA). The BPNN is created with the linear transfer function as the output layer function and the logarithmic transfer function as the hidden layer function. A number of hidden-layer nodes H that is too large or too small degrades the model. An approximate range for H can be obtained from the empirical Equation (10), in which α is a regulation constant between 1 and 10 and m and n are the numbers of input and output nodes, respectively; the optimal number of hidden-layer nodes is then chosen by comparing the modeling metrics. The BPNN created to predict soil total nitrogen content was trained for 1000 iterations with a learning rate of 0.001 and a convergence condition of 0.00004. Based on the numbers of model input and output nodes and the RMSE, the optimal number of hidden-layer nodes for direct modeling was determined to be 8.
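Equation (10) itself is not reproduced in the text. A widely used rule of thumb that matches the symbols defined for it is shown below; treating it as the intended formula is an assumption.

```latex
H = \sqrt{m + n} + \alpha, \qquad 1 \le \alpha \le 10
```

Under this assumed form, the 60 input features and the single output (STN content) give \sqrt{61} \approx 7.81, so the candidate values of H start at about 9. The value of 8 selected from the RMSE comparison sits at the lower edge of this approximate range.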

ELM Prediction Model
The extreme learning machine (ELM) is a special feedforward neural network developed from the single-hidden-layer feedforward neural network [32]. Unlike traditional feedforward neural networks trained by gradient descent, ELM randomly generates the connection weights between the input layer and the hidden layer and the thresholds of the hidden-layer neurons, and these do not need to be adjusted during training. Only the number of hidden nodes needs to be set, which turns the search for the optimal solution into a simple least-squares problem that is unlikely to fall into local minima and generalizes well. The hidden-layer function must be chosen by taking the prediction accuracy on the test set into account. The elmtrain function is used to create and train the ELM model: TYPE is set to 0, indicating regression fitting, and the activation function is set to sigmoid. The elmpredict function is used to produce the predicted output of the model, with its settings kept consistent with the elmtrain parameters.
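The training scheme described above can be illustrated with a short NumPy sketch. `elm_train` and `elm_predict` below are hypothetical Python stand-ins, not the MATLAB elmtrain and elmpredict functions named in the text; the uniform weight range and the use of the pseudoinverse for the least-squares step are assumptions.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_train(X, y, n_hidden, seed=0):
    """Fit a regression ELM: random hidden layer, least-squares output."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden thresholds
    H = _sigmoid(X @ W + b)                                  # hidden outputs
    beta = np.linalg.pinv(H) @ y                             # least-squares solution
    return W, b, beta


def elm_predict(model, X):
    W, b, beta = model
    return _sigmoid(X @ W + b) @ beta
```

Only the output weights are fitted; the random input weights and thresholds stay fixed, which is why training reduces to one linear least-squares solve.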

PLSR Prediction Model
Partial least squares regression (PLSR) is a method for regressing multiple dependent variables on multiple independent variables. The regression is built by extracting principal components from both the dependent and the independent variables and maximizing the correlation between the components extracted from each [33]. In the modeling process, linear regression models are constructed by projecting the predictor variables and the observable variables into a new space, instead of finding the hyperplane of maximum variance between the response and independent variables. By extracting principal components from the variables, PLSR reduces collinearity among the predictor variables while addressing the problem of having too many predictor variables.
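For a single response such as STN content, the component extraction can be sketched with the classical NIPALS form of PLS1. This is a generic textbook variant written with NumPy, not necessarily the implementation used in the study, and the function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector: direction of max covariance
        t = Xr @ w                      # score vector
        tt = t @ t
        p = Xr.T @ t / tt               # X loading
        c = (yr @ t) / tt               # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)  # regression vector in original X space
    return coef, x_mean, y_mean


def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean
```

The number of components is the main tuning parameter; the preliminary modeling reported later in this paper uses six.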

Model Evaluation Metrics
To objectively evaluate the pretreatment methods, compare the models before and after treatment, and determine which model benefits most from optimization, the R², RMSE, and RPD indicators are introduced to evaluate the soil property prediction models [34][35][36]. R² is generally used to evaluate the prediction accuracy of a model, and a value closer to 1 indicates stronger prediction ability. RPD can be used to further evaluate the prediction effectiveness and accuracy of the model, and compensates for the shortcomings of R² in evaluating nonlinear models. When RPD is less than 1.5 and R² is less than 0.5, the model is not usable; when RPD is 1.5-2.0 and R² is 0.5-0.66, the model can be used to distinguish between high and low values; when RPD is 2.0-2.5 and R² is 0.67-0.81, the model can make rough quantitative predictions; when RPD is 2.5-3.0 and R² is 0.82-0.90, the model can make good quantitative predictions; and when RPD is greater than 3.0 and R² is greater than 0.90, the model can make excellent predictions [37]. In the formulas for these indicators, n is the number of samples, y_i is the observed value of the i-th sample, f_i is the predicted value of the i-th sample, and SD is the standard deviation of y_i.
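The three indicators can be computed as in the sketch below. The formulas are the common textbook definitions and are assumptions here, since the equations are not reproduced in the text: R² is taken as 1 minus the ratio of residual to total sum of squares (the study may instead use the squared correlation coefficient), RMSE uses n in the denominator, and SD is the sample standard deviation. The `classify_rpd` helper is hypothetical and applies only the RPD thresholds quoted above.

```python
import math
import statistics


def rmse(y, f):
    return math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y))


def r2(y, f):
    y_bar = statistics.fmean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot  # assumed definition


def rpd(y, f):
    return statistics.stdev(y) / rmse(y, f)  # SD of observed values over RMSE


def classify_rpd(value):
    """Map an RPD value to the bands quoted in the text."""
    if value < 1.5:
        return "not usable"
    if value < 2.0:
        return "distinguishes high and low values"
    if value < 2.5:
        return "rough quantitative prediction"
    if value <= 3.0:
        return "good quantitative prediction"
    return "excellent prediction"
```

The preliminary test-set results reported in this paper are consistent with RPD being SD divided by RMSE: 1.3762 × 0.52902 and 2.3682 × 0.30742 both give an SD of about 0.728.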

Preliminary Modeling Results
Initial modeling refers to developing a prediction model from the training set (121 samples × 60 features) of the initial soil total nitrogen feature space (ISTNFS) and the chemically measured total nitrogen content of the corresponding soil samples, and then applying a test set to validate the prediction performance of the model. The test-set performance metrics of the three prediction models without optimization were obtained as a baseline for later comparison. To check that the optimization effect is generally applicable, this study investigated the relationship between soil olfactory characteristics and soil total nitrogen content through the initial modeling results of three commonly used prediction models: BPNN, ELM, and PLSR.
The BPNN prediction model was constructed on ISTNFS with the number of hidden-layer neurons H set to 8 and was evaluated on the test set. Figure 5a shows the test-set results: R²_V = 0.62413, RMSE_V = 0.52902, and RPD_V = 1.3762. According to the RPD classification for soil properties, RPD_V < 1.5, so the model is not usable. For the ELM model, the number of hidden-layer neurons was set to 20, and the results are shown in Figure 5b; since 1.5 < RPD_V < 2.0, the model can only be used to distinguish between high and low values. Six pairs of principal component factors were selected to construct the PLSR prediction model, which was evaluated on the test set; Figure 5c gives R²_V = 0.82309, RMSE_V = 0.30742, and RPD_V = 2.3682. Since 2.0 < RPD_V < 2.5, the PLSR model can be used for rough quantitative prediction. The preliminary modeling results showed that all three models, BPNN, ELM, and PLSR, had some ability to predict soil total nitrogen content, with R²_V greater than 0.5 on the test set. This indicates that there is some correlation between ISTNFS and soil total nitrogen content. However, ISTNFS had not yet been optimized, so further analysis is needed to determine whether other interferences exist.

Abnormal Sample Rejection Results
The soil total nitrogen feature space contains 121 samples in total. To eliminate the influence of abnormal samples on later model predictions, two abnormal sample identification methods, MCCV and K-means LOOCV, were used to detect abnormal samples within this feature space.
To identify abnormal samples in the soil olfactory feature space with the MCCV method, 85 (121 × 70%) samples were randomly selected from the feature space for training and the remaining 36 samples were used for prediction; 1000 BPNN models were constructed in this way. Figure 6 shows, for each sample, the curve of f_ac against the model number after sorting, and the inset shows f_ac for each of the 121 samples. As the model number increases (i.e., as PRESS increases), f_ac converges to the 70% sampling rate for each sample in the training set. The f_ac curves for samples 6, 23, 38, 86, 91, and 93 differ from the others in that their f_ac values remain above 80% over a wide range (model number ≥ 300) as the model number increases. These six samples were therefore identified as outliers and treated as abnormal samples.
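The MCCV screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an ordinary least-squares model stands in for the BPNN used in the study, the data are synthetic, f_ac is taken to be the cumulative frequency with which a sample appears in the training set of the PRESS-sorted models, and the names `mccv_cumulative_frequency` and `check_at` are hypothetical.

```python
import numpy as np

def mccv_cumulative_frequency(X, y, n_models=500, train_frac=0.7, seed=0):
    """Return f_ac with shape (n_models, n_samples), models sorted by PRESS."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    in_train = np.zeros((n_models, n), dtype=bool)
    press = np.empty(n_models)
    for m in range(n_models):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        # Assumption: linear least squares replaces the BPNN for brevity.
        coef, *_ = np.linalg.lstsq(A[tr], y[tr], rcond=None)
        press[m] = np.sum((A[te] @ coef - y[te]) ** 2)
        in_train[m, tr] = True
    order = np.argsort(press)                   # ascending PRESS
    counts = np.cumsum(in_train[order], axis=0)
    return counts / np.arange(1, n_models + 1)[:, None]

# Synthetic data with one deliberately corrupted sample (index 0).
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=80)
y[0] += 10.0

f_ac = mccv_cumulative_frequency(X, y)
check_at = 150                                  # inspect the curves at model 150
flagged = np.flatnonzero(f_ac[check_at - 1] > 0.8)
print("flagged samples:", flagged)
```

Models that hold the corrupted sample out for prediction have a large PRESS and are sorted to the end, so the sample's f_ac stays high over the low-PRESS models. This is the same pattern the text reports for samples 6, 23, 38, 86, 91, and 93.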

When K-means LOOCV is used to detect abnormal samples in the soil olfactory feature space, the samples must first be clustered; here they were grouped into 10 classes, as shown in Table 3. Following the principle that abnormal samples are more dispersed than normal samples, class 6 and class 7, which contain the fewest samples, were regarded as suspicious abnormal samples and used as prediction samples, and the relative prediction error of each suspicious sample was obtained with the LOOCV method. The results of K-means LOOCV abnormal sample detection are shown in Figure 7. With the threshold for abnormal sample determination set to 0.2, only sample number 88 was found to be abnormal.
According to empirical Formula (10) and RMSE validation, the number of hidden-layer neurons was set to 8 for the BPNN model, and Table 4 was obtained on this basis. MCCV and K-means LOOCV rejected 6 and 1 abnormal samples from the data set, respectively, and all model indexes improved in both cases. The MCCV method gave the better rejection effect, with R²_V, RMSE_V, and RPD_V of 0.75671, 0.33517, and 1.7938, respectively.
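The K-means-based screening (cluster, treat the smallest classes as suspects, train on the rest, and threshold the relative prediction error) can be sketched as follows. This is a hedged illustration rather than the study's implementation: a linear least-squares model replaces the BPNN, the K-means routine is a small NumPy version with farthest-point initialization, the data are synthetic, and `kmeans_labels` and `kmeans_screen` are hypothetical names.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50):
    """Plain Lloyd's K-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def kmeans_screen(X, y, k=3, n_suspect_clusters=1, threshold=0.2):
    """Return (abnormal indices, suspect indices, relative errors of suspects)."""
    labels = kmeans_labels(X, k)
    sizes = np.bincount(labels, minlength=k)
    suspect = np.isin(labels, np.argsort(sizes)[:n_suspect_clusters])
    A = np.column_stack([np.ones(len(X)), X])
    # Assumption: linear least squares stands in for the BPNN.
    coef, *_ = np.linalg.lstsq(A[~suspect], y[~suspect], rcond=None)
    rel_err = np.abs(A[suspect] @ coef - y[suspect]) / np.abs(y[suspect])
    idx = np.flatnonzero(suspect)
    return idx[rel_err > threshold], idx, rel_err

# Synthetic data: two large clusters, one small cluster, one corrupted target.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),
               rng.normal([10, 0], 0.5, (30, 2)),
               rng.normal([0, 10], 0.5, (3, 2))])
y = 1 + X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.01, 63)
y[61] *= 1.5                                    # measured value off by 50%

abnormal, suspects, rel_err = kmeans_screen(X, y)
print("suspects:", suspects, "abnormal:", abnormal)
```

As in the text, membership of a small cluster only makes a sample a suspect; it is rejected only if its relative error exceeds the threshold (0.2 here, matching the value used in Figure 7).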

Feature Optimization Results
The soil olfactory space obtained after MCCV removed the abnormal samples (115 samples × 60 features) is referred to as the updated soil total nitrogen feature space (USTNFS). USTNFS was optimized by PCA, with p_i denoting the variance contribution of the i-th principal component and the target cumulative variance contribution G(k) set to 95%; the results are shown in Figure 8. The first 15 principal components have a cumulative variance contribution of 94.32%, which largely reflects the information in the original feature space. The original feature space can therefore be reduced to 15 dimensions, giving a reconstructed sample set (115 samples × 15 features).
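The selection of the number of principal components from the cumulative contribution G(k) can be illustrated with the short sketch below. It is an assumption-laden example: the data are synthetic, the features are only mean-centered (the text does not state the preprocessing), and `pca_reduce` is a hypothetical name. The sketch takes the smallest k whose cumulative contribution reaches the target, whereas the study retained 15 components at 94.32%.

```python
import numpy as np

def pca_reduce(X, target=0.95):
    """Return (scores, p, G, k) for the smallest k with G(k) >= target."""
    Xc = X - X.mean(axis=0)                     # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    p = s ** 2 / np.sum(s ** 2)                 # contribution p_i of each component
    G = np.cumsum(p)                            # cumulative contribution G(k)
    k = min(int(np.searchsorted(G, target)) + 1, len(G))
    scores = Xc @ Vt[:k].T                      # samples projected onto k components
    return scores, p, G, k

# Synthetic 115 x 60 feature space driven by 5 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(115, 5))
X = latent @ rng.normal(size=(5, 60)) + 0.01 * rng.normal(size=(115, 60))

scores, p, G, k = pca_reduce(X, target=0.95)
print("components kept:", k, "cumulative contribution:", round(float(G[k - 1]), 4))
```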

Similarly, the GA-BP method was applied to USTNFS to optimize its features, with the termination condition set to 100 iterations. Figure 9 shows the evolution curve of the fitness function: the best fitness remains unchanged once the number of generations exceeds 32, indicating that the optimum has been reached. The indices of the selected optimal features are 1, 2, 4, 6, 7, 10, 12, 16, 18, 20, 21, 23, 24, 25, 27, 29, 30, 32, 33, 34, 36, 40, 41, 43, 45, 46, 47, 51, 52, 53, 55, and 58, so the original feature space is reduced from 60 dimensions to 32 dimensions. A sample set (115 samples × 32 features) based on GA-BP optimization was reconstructed.
To compare the two feature optimization methods, PCA and GA-BP, the features selected by each method were used to train prediction models with the BPNN, ELM, and PLSR algorithms, and the test set (34 samples) was used to verify the models.
When constructing the BPNN model, the reduced feature dimension means that the number of hidden-layer neurons (H) falls in the range 6-15 according to the formula, and H = 10 was selected for modeling. As shown in Figure 10a,b, the difference between GA-BP and PCA optimization is small, with GA-BP slightly better than PCA. For the ELM model (Figure 10c,d), GA-BP optimization gives RPD_V and R² values that are 0.1542 and 0.03728 higher, respectively, than those of the PCA-processed model, and an RMSE_V that is 0.0182 lower. For the PLSR model, the number of principal component factors (PCF) was set to 4 after evaluation by cross-validated RMSE_C and the Akaike information criterion (AIC). The prediction results are shown in Figure 10e,f: compared with PCA treatment, GA-BP optimization raises RPD_V and R² by 0.5058 and 0.02637, respectively, and lowers RMSE_V by 0.02786.

Discussion
To verify whether the optimized processing (i.e., MCCV abnormal sample rejection and GA-BP feature dimensionality reduction) generalizes across artificial olfaction-based models for total nitrogen detection, three models, BPNN, ELM, and PLSR, were developed, and the test-set evaluation metrics of each model are given in Table 5.

Figure 1. Study area and sampling locations.
2.6. Multi-Feature Optimization Methods and Pattern Recognition Prediction Models
2.6.1

(1) Spatial clustering of soil olfaction based on K-means clustering with a set number of clusters.
(2) The classes with fewer samples in the clustering are treated as suspect samples and used as the test set for the BPNN model.
(3) The remaining samples with suspected anomalies removed are taken as the training set and used to train a BPNN prediction model.
(4) The input prediction of the test set with the trained model gives the corresponding prediction results and the relative error δ between the predicted and measured values is calculated.
(5) Set the threshold value. If the value is greater than the threshold, it is considered an abnormal sample, otherwise it is considered a normal sample.

2.7. Feature Dimensionality Reduction Methods
2.7.1. Principal Component Analysis

Figure 5. Graph of prediction results of three models for initial modeling. (a) Initial modeling BPNN model prediction results. (b) Initial modeling ELM model prediction results. (c) Initial modeling PLSR model prediction results.

Figure 6. Cumulative frequency profile of each sample.

Figure 8. Cumulative contribution results of PCA principal components.

Figure 9. Evolutionary plot of optimal fitness function.

Figure 10. Results of three models after optimization of PCA and GA-BP features. (a) Prediction results of BPNN model after PCA optimization. (b) Prediction results of BPNN model after GA-BP optimization. (c) Prediction results of ELM model after PCA optimization. (d) Prediction results of ELM model after GA-BP optimization. (e) Prediction results of PLSR model after PCA optimization. (f) Prediction results of PLSR model after GA-BP optimization.

Table 1. Sensor types.

Table 2. Organic matter concentrations in soil samples.
was obtained. As shown in the figure, each sensor responded quite differently to the different soil gases, indicating that the array has good sensitivity to differences in the thermal cracking gases. The response results indicate that the selected sensor array is reasonable.
2.5. Sensor Array for Full Nitrogen Feature Space Response

Table 4. Comparison results of different abnormal sample rejection methods.

Table 5. Test set data.