Modeling and Prediction of Soil Organic Matter Content Based on Visible-Near-Infrared Spectroscopy

: In order to explore the ever-changing law of soil organic matter (SOM) content in the forest of the Greater Khingan Mountains, a prediction model of the SOM content with a high accuracy and stability has been developed based on visible near-infrared (VIS-NIR) technology and multiple regression analysis. A total of 105 soil samples were collected from Cuifeng forest farm in Jagdaqi City, Greater Khingan Mountains region, Heilongjiang Province, China. Five classical preprocessing algorithms, including Savitzky − Golay convolution smoothing (S-G smoothing), standard normal variate transformation (SNV), multiplicative scatter correction (MSC), ﬁrst derivative, second derivative, and the combinations of the above ﬁve methods were applied to the raw spectra. Wavelengths were optimized with ﬁve methods of competitive adaptive reweighted sampling (CARS), successive projections algorithm (SPA), uninformative variable elimination (UVE), synergy interval partial least square (SiPLS), and their combinations, and PLS models were developed accordingly. The results showed that when S-G smoothing is combined with SNV or MSC, both preprocessing strategies can improve the performance of the model. The prediction accuracy of SiPLS-PLS model and SiPLS-UVE-PLS model for the SOM content is higher than for other models, withan Rc 2 of 0.9663 and 0.9221, RMSEC of 0.0645 and 0.0981, Rv 2 of 0.9408 and 0.9270, and RMSEV of 0.0615 and 0.0683, respectively. The pretreatment strategies and characteristic variable selection methods used in this study could signiﬁcantly improve the model performance and predicting efﬁciency.


Introduction
Forests and soil are important terrestrial ecosystems, they have the role of regulating climate, improving soil quality, protecting biodiversity, and mitigating geologic hazards [1][2][3][4]. Soil plays an important role in the material cycle, energy flow, and information transfer in forest ecosystems, providing the necessary nutrients required for plant growth [5][6][7]. Soil nutrients directly affect the distribution, growth, and yield of forest timber and have an important impact on the distribution pattern of plant species within forest communities, thus affecting both forest ecosystems and terrestrial ecosystems [8][9][10][11]. The content of soil organic matter (SOM) is an important indicator of soil nutrients, which plays an important role in improving soil quality, increasing soil productivity, and enhancing soil erosion resistance [12].
Visible-near-infrared spectroscopy (VIS-NIRS) is a green, eco-friendly, simple, and effective qualitative and quantitative spectroscopic analytical technology. Coupled with suitable chemo-metric methods, NIR has been successfully applied in many fields such as the petrochemical, agriculture, food, pharmaceutical, and traditional Chinese medicine industries [13]. VIS-NIRS is the absorption spectrum of frequency multiplication and frequency synthesis of different H-base groups [14,15]. The application of visible near infrared reflectance spectroscopy (VIS-NIRS) to characterize the physical and chemical In the study area, seven typical sample plots of 30 m × 30 m were selected from coniferous and broad-leaved mixed forests, each sample plot was divided into 15 6 ×10 m sub-plot. The soils were collected from each sub-plot by the five-point method [20], with a total of 105 soil samples collected. At each sub-plot, about 1 kg of soil from the depth of 10~20 cm in the soil profile was taken back to the laboratory for determination of the SOM content. Additionally, cutting rings with a volume of 100 cm 3 were used to take soil samples at each sampling point in the sub-plot for determination of the physical properties of the soil, including bulk density, capillary porosity, and non-capillary porosity, etc. (Table  1). Statistical analysis was completed in IBM SPSS Statistics 26.0 (IBM, Armonk, NY, USA).  In the study area, seven typical sample plots of 30 m × 30 m were selected from coniferous and broad-leaved mixed forests, each sample plot was divided into 15 6 × 10 m sub-plot. The soils were collected from each sub-plot by the five-point method [20], with a total of 105 soil samples collected. At each sub-plot, about 1 kg of soil from the depth of 10~20 cm in the soil profile was taken back to the laboratory for determination of the SOM content. Additionally, cutting rings with a volume of 100 cm 3 were used to take soil samples at each sampling point in the sub-plot for determination of the physical properties of the soil, including bulk density, capillary porosity, and non-capillary porosity, etc. (Table 1). Statistical analysis was completed in IBM SPSS Statistics 26.0 (IBM, Armonk, NY, USA). Soil samples were air-dried in the laboratory to remove impurities, were ground and sieved (60 mesh), and the SOM content was determined by the wet oxidation method. The soil organic carbon (SOC) was completely oxidized to CO 2 using potassium dichromate (K 2 Cr 2 O 7 ) standard solution and concentrated sulfuric acid (chemically pure), and the remaining K 2 Cr 2 O 7 was titrated with ferrous sulfate (FeSO 4 ) standard solution, the amount of FeSO 4 was recorded and the SOC content was calculated, and then the SOM content was obtained indirectly through conversion. The specific operation steps were guided by GB 7857-87 "Determination of organic matter in forest soils and calculation of carbon to nitrogen ratio".

Visible-Near-Infrared Spectra Collection
The VIS-NIR spectra were collected using a LabSpec Pro FR/A114260 (Analytical Spectral Devices, Inc., Boulder, CO, USA). The wavelengths ranged from 350 to 2500 nm, with a minimum interval range of 2 nm, and a total of 2151 wavelength points were collected. The sampling time was 10 times/s. In order to eliminate the noise signal and improve the accuracy of spectral acquisition, the number of scans was increased to 30 times, and the average value of the repeated measurement spectra was taken as the original spectra of the soil samples, and the spectral signal-to-noise ratio was subsequently improved. The raw spectra of 105 soil samples were imported into the spectral data processing software ViewSpecPro to realize the conversion of diffuse reflectance into absorbance values according to Kubelka−Munk theory.

Modeling and Optimization
MATLAB R2016a (MathWorks, Natick, MA, USA) was used in this study for model development and the corresponding calculations.
The soil samples' spectra were partitioned into calibration set and validation set based on joint x−y distances (SPXY). SPXY is derived from the classical sample classification algorithm Kennard−Stone method (KS), which first selects the two samples with the farthest Euclidean distances into the calibration set, then selects the samples with the largest and smallest distances into the calibration set by calculating the Euclidean distances from each remaining sample to each known sample in the calibration set, and so on, until the number of samples in the calibration set reaches the specified number [21]. SPXY is similar to the selection process of KS, with the difference being that SPXY integrates the spectral response matrix and concentration column vectors of the samples to achieve effective coverage of the multidimensional vector space, increase the variability and representativeness of the samples, and improve the stability of the established model [22]. The distance calculation equation of SPXY dividing the sample set is as follows: where X p and X q denote two different samples and N is the number of wavelength points of the samples.

Spectral Preprocessing and Band Optimization
Pretreatment of the sample VIS-NIRS is aimed at eliminating systematic and random errors. Differential processing can reduce or eliminate some systematic errors, and methods such as smoothing can reduce random errors, so five classical preprocessing algorithms, namely, Savitzky−Golay convolution smoothing (S-G smoothing), standard normal variate transformation (SNV), multiplicative scatter correction (MSC), first derivative, and second derivative, were used. This experiment also explores the effect of combined strategy preprocessing to attenuate or even eliminate non-target information [23]. According to the combination of background correction and scattering correction [24], six combinations of S-G smoothing-SNV, S-G smoothing-MSC, first derivative-SNV, first derivative-MSC, second derivative-SNV, and second derivative-MSC are used to further eliminate irrelevant information and noise. The effect of PLS modeling on the pretreatment spectra is evaluated in order to select the best pretreatment method.
VIS-NIRS are characterized by high dimensionality and redundancy, and have problems of high correlation, loss of rank, and covariance when applied to quantitative analysis, leading to poor accuracy and robustness of SOM content prediction models constructed based on the global spectra. In order to find the weak interference information that can characterize the samples and integrate multivariate information to eliminate the strong interference information, wavelength variable selection becomes a key issue in the applied analysis of SOM content of VIS-NIRS [25]. Wavelength variable selection can be divided into two types of point screening and band screening, and this experiment explored the effect of three wavelength point screening methods (CARS, SPA, and UVE) and one band screening method (SiPLS) on the improvement of model quality, and tried to couple different types of methods to screen the characteristic variables. In this experiment, the screening effects of six coupling strategies, UVE-CARS, UVE-SPA, CARS-SPA, SiPLS-UVE, SiPLS-CARS, and SiPLS-SPA, were investigated.
CARS based on the principle of "survival of the fittest" in evolutionary theory has the characteristics of fast computation and a high screening efficiency [26]. This experiment sets the number of Monte Carlo samples to 100 and selects the subset with the smallest RMSECV by 10-fold cross-validation to find the set of optimal solution variables [27].
The successive projections algorithm (SPA) is capable of extracting the few characteristic wavelengths in the full waveband that contain the least redundant information and the least covariance, thus eliminating a larger number of irrelevant variables in the original spectral matrix, and is widely used in the field of high-precision analysis of NIRS [28]. For the original eigenspectral matrix X m×n (m is the number of samples and n is the number of wavelengths), x n(0) and K are the initial iteration vector and the number of wavelengths to be extracted, respectively. SPA is a forward cyclic variable selection method that minimizes vector space covariance [29].
According to the relationship between the spectral matrix and the concentration matrix, the uninformed variable elimination method (UVE) was used to select characteristic wavelengths based on the regression coefficients of the PLS model [30]. Based on the PLS model, the randomly generated noise matrix with the same number of variables as the dependent variable matrix was introduced and combined with the original spectral matrix to form a new matrix. C i in the interval of [1, n], which is less than C max , is excluded and the remaining variables are extracted to form X UVE .The specific formula is as follows: SiPLS is an extension and improvement of the interval partial least square (iPLS) method. iPLS splits the full-band data into several equal-width intervals, builds a PLS model for each subinterval separately, and selects the optimal modeling interval using RMSECV as the model indicator [31,32]. However, iPLS is sensitive to the interval width and the selected subinterval may not be exactly the information interval. SiPLS overcomes the drawback of iPLS single interval modeling by calculating all possible PLS combination models for two or more subintervals in the same interval division based on the idea of permutation and combination, and the optimal joint subinterval is used for PLS regression.
Each of the wavelength selection methods have their limitations. In order to improve the reliability of the characteristic wavelengths, this experiment explored the band optimization effects of six combination strategies: UVE-CARS, UVE-SPA, UVE-SiPLS, CARS-SPA, SiPLS-CARS, and SiPLS-SPA [33,34].

Modeling Methods and Model Evaluation
Partial least square regression (PLSR) combines the advantages of principal component analysis (PCA), typical correlation analysis (CCA), and multiple linear regression (MLR) to achieve high-dimensional data structure simplification, regression modeling, and analysis of multiple correlations of independent variables [35]. The main idea of PLSR-based NIR prediction model for SOM is to reduce the dimensionality and reveal the main influencing factors of organic matter content variation from the spectral data with a much smaller number of samples than the dimensionality, so that the model is robust. A partial least square (PLS) model was established by using NIRS combined with the multivariate selection method to quantitatively analyze the SOM content in mixed coniferous low-quality forest in the Jagdaqi region. Coefficient of determination (R 2 ) and root mean square error (RMSE) were selected for the model evaluation. The coefficient of determination (R2) is a concept in analysis of variance and regression analysis. It is a measure of the proportion of explained variance present in the data. The larger the R 2 is in the range of 0 to 1, the better the stability of the model, and the higher the fitting degree [36]. RMSE is used to detect the deviation between the estimated value and the true value. The smaller the RMSE, the better the prediction ability of the model. The computation equations for these criteria are as follows [37]:

Partition of Sample Sets
Based on SPXY, 105 soil samples were divided into the calibration set (70 soil samples) and validation set (35 soil samples), with a ratio of 2:1 ( Table 2).

Spectral Preprocessing
Taking the original spectra as the control set, five pretreatment methods and six combination methods were investigated in this study in terms of the model performance ( Figure 2). The results were compared with the original VIS-NIR spectral for model optimization and the best pretreatment methods were selected for VIS-NIR based SOM modeling and prediction. Figure 2A shows the original spectrum of the soil powder, and each color corresponds to a spectrum.
( Figure 2G) and S-G smoothing -MSC ( Figure 2H) compared with SNV and MSC, the S-G smoothing followed by scattering correction results in significant differences in absorbance between different samples and stronger characteristics of multiplicity and ensemble. As shown in Figure 2I-L, the combination of SNV and MSC correction of the first derivative and second derivative processing to extract the differences between adjacent wavelength variables amplified the noise contained in the original spectra, which cannot be effectively corrected for errors and has a lower signal-to-noise ratio [38]. In the process of PLS modeling, the optimal number of latent variables (nLVs) needs to be determined by 10-fold cross-validation before further analysis of the coefficient of determination (R 2 ) and the root mean square error of prediction (RMSEP). The root mean square error of cross-validation (RMSECV) of the PLS model varies with the number of latent variables, and the selection of the optimal number of PLS latent variables follows Comparing the S-G smoothing spectrum ( Figure 2B) with the original spectrum shows that the S-G smoothing with a filter window width of 5 and a quadratic polynomial fit to the data points completely retains the effective information of the spectrum, and the shape of the spectrum tends to be smooth at this point. The effects of uneven solid particle size, surface scattering, and optical path changes on diffuse reflectance spectra were eliminated in the NIRS pretreated with SNV and MSC, and the spectral characteristic peaks were also highlighted. It can be seen from Figure 2C,D that the characteristic variables were mainly concentrated in the vicinity of 550~600 nm, 1000~1100 nm and 1400~1900 nm. The soil spectra after the first derivative and the second derivative processing are shown in Figure 2E,F, respectively. At this time, the wavelength sampling points are dense and the differential spectrum is extremely noisy.
The different combinations of preprocessing strategies follow the optimization window parameters used in the preprocessing of a single algorithm. S-G smoothing -SNV ( Figure 2G) and S-G smoothing -MSC ( Figure 2H) compared with SNV and MSC, the S-G smoothing followed by scattering correction results in significant differences in absorbance between different samples and stronger characteristics of multiplicity and ensemble. As shown in Figure 2I-L, the combination of SNV and MSC correction of the first derivative and second derivative processing to extract the differences between adjacent wavelength variables amplified the noise contained in the original spectra, which cannot be effectively corrected for errors and has a lower signal-to-noise ratio [38].
In the process of PLS modeling, the optimal number of latent variables (nLVs) needs to be determined by 10-fold cross-validation before further analysis of the coefficient of determination (R 2 ) and the root mean square error of prediction (RMSEP). The root mean square error of cross-validation (RMSECV) of the PLS model varies with the number of latent variables, and the selection of the optimal number of PLS latent variables follows the principle of minimum RMSECV. The number of principal components that minimizes or is almost invariant to the RMSECV is chosen under the premise that the number of components is as small as possible. The results of the 10-fold cross-validation of different pretreatment methods are shown in Figure 3. The RMSECV of the soil organic matter correction model established by original spectrum with five latent variables selected is the minimum. The optimal number of latent variables for the PLS correction model was 6, 4, 4, 1, and 1 after S-G smoothing, SNV, MSC, first derivative, and second derivative treatments, respectively. The optimal number of latent variables was 6, 6, 1, 1, 1, and 2 after S-G-SNV, S-G-MSC, first derivative -SNV, first derivative -MSC, second derivative -SNV, and second derivative -MSC, respectively. the principle of minimum RMSECV. The number of principal components that minimizes or is almost invariant to the RMSECV is chosen under the premise that the number of components is as small as possible. The results of the 10-fold cross-validation of different pretreatment methods are shown in Figure 3. The RMSECV of the soil organic matter correction model established by original spectrum with five latent variables selected is the minimum. The optimal number of latent variables for the PLS correction model was 6, 4, 4, 1, and 1 after S-G smoothing, SNV, MSC, first derivative, and second derivative treatments, respectively. The optimal number of latent variables was 6, 6, 1, 1, 1, and 2 after S-G-SNV, S-G-MSC, first derivative -SNV, first derivative -MSC, second derivative -SNV, and second derivative -MSC, respectively.    A comprehensive comparison of the PLS models with five single preprocessing and six different combinations of preprocessing shows that the optimal preprocessing methods are S-G-SNV and S-G-MSC. The PLS model with S-G smoothing outperforms other PLS models with a single pre-treatment, indicating that S-G smoothing, either alone for NIRS or in combination with scattering correction, improves modeling to some extent compared to the original spectra. However, the overall noise reduction effect is weaker when S-G smoothing is performed alone. Therefore, it is necessary to preprocess the soil spectra using a suitable combination algorithm. After the NIRS data were preprocessed by a single algorithm, SNV had a stronger correction capability than MSC because MSC operates based on the spectral array of each sample group, while SNV operates on each spectrum. However, smoothing improves the model performance more than baseline correction and scattering correction. The PLS models of 1D-SNV, 1D-MSC, 2D-SNV, and 2D-MSC all performed lower than the original spectra and the single preprocessed spectra, indicating that preprocessing with two or more methods is not necessarily superior to a single preprocessing method. In this experiment, both the S-G-SNV-PLS correction model and the S-G-MSC-PLS correction model outperformed the other models, and the S-G-SNV and S-G-MSC treated soil fullspectrum data were used for variable screening.

Wavelength Selection
In this study, four single methods and six combined strategies were used to select wavelength variables for VIS-NIRS of SOM content, which are CARS, SPA, UVE, SiPLS, UVE-CARS, UVE-SPA, UVE-SiPLS, CARS-SPA, SiPLS-CARS, and SiPLS-SPA.
CARS screens the pretreated SOM content spectra of S-G-SNV and S-G-MSC, as shown in Figure 4. The S-G-SNV pretreatment spectra showed an overall decreasing trend of RMSECV although fluctuating when the sampling times was increased from 0 to 60, and its value reached the minimum at the 56th sampling. Then, 73 and 17 variables were selected separately for the calibration set and validation set, accounting for 3.39% and 0.79% of the original variables, respectively. The 52nd CARS sampling selected 63 and 14 variables, respectively, in the S-G-MSC calibration and validation set spectra, each representing 2. A comprehensive comparison of the PLS models with five single preprocessing and six different combinations of preprocessing shows that the optimal preprocessing methods are S-G-SNV and S-G-MSC. The PLS model with S-G smoothing outperforms other PLS models with a single pre-treatment, indicating that S-G smoothing, either alone for NIRS or in combination with scattering correction, improves modeling to some extent compared to the original spectra. However, the overall noise reduction effect is weaker when S-G smoothing is performed alone. Therefore, it is necessary to preprocess the soil spectra using a suitable combination algorithm. After the NIRS data were preprocessed by a single algorithm, SNV had a stronger correction capability than MSC because MSC operates based on the spectral array of each sample group, while SNV operates on each spectrum. However, smoothing improves the model performance more than baseline correction and scattering correction. The PLS models of 1D-SNV, 1D-MSC, 2D-SNV, and 2D-MSC all performed lower than the original spectra and the single preprocessed spectra, indicating that preprocessing with two or more methods is not necessarily superior to a single preprocessing method. In this experiment, both the S-G-SNV-PLS correction model and the S-G-MSC-PLS correction model outperformed the other models, and the S-G-SNV and S-G-MSC treated soil full-spectrum data were used for variable screening.

Wavelength Selection
In this study, four single methods and six combined strategies were used to select wavelength variables for VIS-NIRS of SOM content, which are CARS, SPA, UVE, SiPLS, UVE-CARS, UVE-SPA, UVE-SiPLS, CARS-SPA, SiPLS-CARS, and SiPLS-SPA.
CARS screens the pretreated SOM content spectra of S-G-SNV and S-G-MSC, as shown in Figure 4. The S-G-SNV pretreatment spectra showed an overall decreasing trend of RMSECV although fluctuating when the sampling times was increased from 0 to 60, and its value reached the minimum at the 56th sampling. Then, 73 and 17 variables were selected separately for the calibration set and validation set, accounting for 3.39% and 0.79% of the original variables, respectively. The 52nd CARS sampling selected 63 and 14 variables, respectively, in the S-G-MSC calibration and validation set spectra, each repre-  The accuracy and stability of the established PLS models vary greatly when SPA extracts different numbers of wavelength points [39]. S-G-SNV and S-G-MSC pretreatment both minimize the RMSE when 30 feature wavelengths were extracted, with RMSE of 0.2602 and 0.2329. The extraction of feature wavelengths is shown in Figure 5, where the red quadrilateral is used to mark the specific positions of the feature wavelengths.
The analysis shows that the feature wavelengths extracted by SPA overlap well in two different pretreatment backgrounds, especially those near 2200~2500 nm. For instance, 2194 nm are highly correlated with ArCH aromatic hydroxyl groups, and 2227, 2271, 2307, 2330, 2343, 2452, 2476 and 2481 nm are correlated with methyl groups (CH2), methylene group (CH1) and hypomethyl group (CH) [39]. The characteristic wavelengths which accounted for 1.  The accuracy and stability of the established PLS models vary greatly when SPA extracts different numbers of wavelength points [39]. S-G-SNV and S-G-MSC pretreatment both minimize the RMSE when 30 feature wavelengths were extracted, with RMSE of 0.2602 and 0.2329. The extraction of feature wavelengths is shown in Figure 5, where the red quadrilateral is used to mark the specific positions of the feature wavelengths. The analysis shows that the feature wavelengths extracted by SPA overlap well in two different pretreatment backgrounds, especially those near 2200~2500 nm. For instance, 2194 nm are highly correlated with ArCH aromatic hydroxyl groups, and 2227, 2271, 2307, 2330, 2343, 2452, 2476 and 2481 nm are correlated with methyl groups (CH2), methylene group (CH1) and hypomethyl group (CH) [39]. The characteristic wavelengths which accounted for 1. The effect of UVE screening is shown in Figure 6, where the left side of the vertical dividing line shows the original wavelength variables, the right side of the red area shows the randomly introduced noise variables, and the upper and lower solid lines parallel to the horizontal axis are the threshold lines for assessing the stability. From the principle of UVE, it can be seen that the wavelength points between the threshold lines do not contribute to the modeling and are uninformative variables. The parts marked with green "*" beyond the threshold are the feature variables containing the modeling information. The number of variables remaining after the UVE screening of S-G-SNV and S-G-MSC preprocessed spectra are 52 and 53, each accounting for 2.42% and 2.46%, respectively, of the global variables. Under the condition that the number of latent variables was 14, the S-G-MSC-UVE-PLS model was slightly more effective than the PLS model constructed after S-G-SNV treatment, with Rc 2 = 0.8173, Rv 2 = 0.7580, RMSEC = 0.1502, and RMSEV = 0.1243. The effect of UVE screening is shown in Figure 6, where the left side of the vertical dividing line shows the original wavelength variables, the right side of the red area shows the randomly introduced noise variables, and the upper and lower solid lines parallel to the horizontal axis are the threshold lines for assessing the stability. From the principle of UVE, it can be seen that the wavelength points between the threshold lines do not contribute to the modeling and are uninformative variables. The parts marked with green "*" beyond the threshold are the feature variables containing the modeling information. The number of variables remaining after the UVE screening of S-G-SNV and S-G-MSC preprocessed spectra are 52 and 53, each accounting for 2.42% and 2.46%, respectively, of the global variables. Under the condition that the number of latent variables was 14, the S-G-MSC-UVE-PLS model was slightly more effective than the PLS model constructed after S-G-SNV treatment, with Rc 2 = 0.8173, Rv 2 = 0.7580, RMSEC = 0.1502, and RMSEV = 0.1243.
This study explored the effect of the number of interval combinations on the performance of the SOM content PLS models. The effect of SiPLS on the band preference of the different pretreatment spectra when the number of interval combinations was 2, 3, and 4 is shown in Table 4 [40][41][42]. The results show that the optimal number of intervals was 3. The RMSECV of both types of SiPLS models tendes to decrease when the number of selected intervals to be combined increased from 2 to 3. However, as the number of combinations increased to 4, more of the input variables led to noise being introduced into the model, which increased the error and reduced the prediction accuracy. As can be seen from Figure 7, the PLS factor was 6 for both the S-G-SNV and S-G-MSC pretreated spectra, and the selected intervals were the same, 458~565 nm, 1430~1537 nm, and 1645~1751 nm, at which the RMSE obtained the minimum values of 0.2131 and 0.2120, respectively, when the number of interval combinations was equal to 3. After S-G-SNV pretreatment, the Rc 2 and Rv 2 of the SiPLS-PLS model were 0.9663 and 0.9408, and the RMSEC and RMSEV were 0.0645 and 0.0615, respectively. The characteristic spectra treated with S-G-MSC showed Rc 2 = 0.9659, Rv 2 = 0.9442, RMSEC = 0.0649, and RMSEV = 0.0597. This study explored the effect of the number of interval combinations on the performance of the SOM content PLS models. The effect of SiPLS on the band preference of the different pretreatment spectra when the number of interval combinations was 2, 3, and 4 is shown in Table 4 [40][41][42]. The results show that the optimal number of intervals was 3. The RMSECV of both types of SiPLS models tendes to decrease when the number of selected intervals to be combined increased from 2 to 3. However, as the number of combinations increased to 4, more of the input variables led to noise being introduced into the model, which increased the error and reduced the prediction accuracy. As can be seen from Figure 7, the PLS factor was 6 for both the S-G-SNV and S-G-MSC pretreated spectra, and the selected intervals were the same, 458~565 nm, 1430~1537 nm, and 1645~1751 nm, at which the RMSE obtained the minimum values of 0.2131 and 0.2120, respectively, when the number of interval combinations was equal to 3. After S-G-SNV pretreatment, the Rc 2 and Rv 2 of the SiPLS-PLS model were 0.9663 and 0.9408, and the RMSEC and RMSEV were 0.0645 and 0.0615, respectively. The characteristic spectra treated with S-G-MSC showed Rc 2 = 0.9659, Rv 2 = 0.9442, RMSEC = 0.0649, and RMSEV = 0.0597.   Different wavelength selection methods have their limitations. To further improve the reliability of the characteristic wavelength, on the one hand, multiple variable selection methods can be used to further refine the classical algorithm and make up for the defects of the classical algorithm. On the other hand, the noise and non-informative intervals can be eliminated by using the interval selection algorithm, and then other variable selection methods can be used to refine the selection. When the two variable selection methods are coupled, the optimization effect is not the same due to the different principles of wavelength selection between these two methods [33].
To compensate for the weakness of UVE in eliminating invalid variables, CARS was used to further refine the UVE, where 16 spectral feature variables of the S-G-SNV preprocessed correction set and 23 spectral feature variables of the S-G-MSC preprocessed correction set were selected by UVE-CARS, accounting for 30.77% and 43.40% of the UVE alone, respectively, which served to streamline the model through the input variables. The UVE-CARS-PLS models under the two preprocessing methods were established and compared with the UVE-PLS model, and the results are shown in Table 5. It was found that the combined screening did not affect the number of latent variables selected for the calibration model, but the model accuracy and stability were improved due to the optimization of the input variables, with Rc 2 increasing to 0.8344 and 0.8735; Rv 2 increasing to 0.8503 and 0.8990; RMSEC reduced to 0.1430 and 0.1250; and RMSEV reduced to 0.0978 and 0.0803, respectively. Different wavelength selection methods have their limitations. To further improve the reliability of the characteristic wavelength, on the one hand, multiple variable selection methods can be used to further refine the classical algorithm and make up for the defects of the classical algorithm. On the other hand, the noise and non-informative intervals can be eliminated by using the interval selection algorithm, and then other variable selection methods can be used to refine the selection. When the two variable selection methods are coupled, the optimization effect is not the same due to the different principles of wavelength selection between these two methods [33].
To compensate for the weakness of UVE in eliminating invalid variables, CARS was used to further refine the UVE, where 16 spectral feature variables of the S-G-SNV preprocessed correction set and 23 spectral feature variables of the S-G-MSC preprocessed correction set were selected by UVE-CARS, accounting for 30.77% and 43.40% of the UVE alone, respectively, which served to streamline the model through the input variables. The UVE-CARS-PLS models under the two preprocessing methods were established and compared with the UVE-PLS model, and the results are shown in Table 5. It was found that the combined screening did not affect the number of latent variables selected for the calibration model, but the model accuracy and stability were improved due to the optimization of the input variables, with Rc 2 increasing to 0.8344 and 0.8735; Rv 2 increasing to 0.8503 and 0.8990; RMSEC reduced to 0.1430 and 0.1250; and RMSEV reduced to 0.0978 and 0.0803, respectively.   The number of variables selected for SiPLS-UVE increased compared to UVE. A total of 117 variables were selected for three subintervals in the S-G-SNV pretreatment context, while for S-G-MSC treated NIRS, the number of variables selected was 126. UVE coupled with SiPLS largely optimized the UVE-PLS model, and the optimal number of latent variables established by cross-testing was consistent with the results when SiPLS was performed independently, with both nLVs being 20. Modeled after S-G-SNV pretreatment, the SiPLS-UVE-PLS model worked slightly better than the S-G-MSC pretreatment, with the metrics shown in Table 5  SiPLS-SPA selected more than 90% fewer variables than SiPLS, which greatly refined the results of the single SiPLS method to select feature variables. The UVE-SPA-PLS model had the worst selection effect of the wavelength variables, and compared with performing UVE alone, the percentage of the original variables decreased by 42.56% and 43.50%, respectively.
The results of the wavelength variable selection for a comprehensive comparison of the SOM content are shown in Table 6. The effects of SPA and UVE in the single wavelength variable selection method were relatively poor, and CARS and SIPLS had relatively good effects. The correction set and verification set of R 2 for the two preprocessing spectra of SIPLS were greater than 0.9 and the RMSE was less than 0.1. For CARS-PLS only under the preprocessing spectrum of SG-SNV, the R 2 of correction set and verification set were greater than 0.9, and the RMSE was less than 0.1. Among the six different combination strategies, only the R 2 of the correction set and verification set of SiPLS-UVE-PLS were greater than 0.9, and the RMSE of the SiPLS-UVE-PLS model pretreated by SG-SNV was less than 0.1. Through the comprehensive evaluation of the preprocessing method and wavelength selection method, the optimal PLS models were SiPLS-PLS and SiPLS-UVE-PLS after SG-SNV preprocessing. Directly using SiPLS or SiPLS coupled with UVE, the feature variables preferred by both algorithms were input into the SOM content prediction model, and the validation set was also input to evaluate the model. The specific PLS modeling results are shown in Figure 8. The results visually reflect the correlation between the measured and predicted values of SOM content, and the trend line of its calibration model overlaps with the target trend line nearly 1:1, indicating that the SiPLS-PLS and UVE-SiPLS-PLS models can achieve an effective prediction of SOM content within the sampling area. The fit between the trend line of its validation model and the target trend line is also high, indicating that the above two types of models can guarantee good prediction robustness under the premise of model accuracy. Directly using SiPLS or SiPLS coupled with UVE, the feature variables preferred by both algorithms were input into the SOM content prediction model, and the validation set was also input to evaluate the model. The specific PLS modeling results are shown in Figure 8. The results visually reflect the correlation between the measured and predicted values of SOM content, and the trend line of its calibration model overlaps with the target trend line nearly 1:1, indicating that the SiPLS-PLS and UVE-SiPLS-PLS models can achieve an effective prediction of SOM content within the sampling area. The fit between the trend line of its validation model and the target trend line is also high, indicating that the above two types of models can guarantee good prediction robustness under the premise of model accuracy.

Conclusions and Discussion
PLS models were constructed based on NIRS combined with various chemometric algorithms to quantify the organic matter composition of low-quality mixed coniferous forest soils in the Jagdaqi region. To ensure that the calibration set samples were fully representative and all samples were uniformly distributed within each set, SPXY was used to divide the data sets.
In this study, the effects of preprocessing by five single methods, as well as six combined methods were discussed. The results show that not all preprocessing methods are effective at eliminating noise and reducing errors for spectral data of complex sample systems. Compared with the original spectral modeling, there are three preprocessing strategies to improve the performance of the calibration model: S-G smoothing, S-G-SNV, and S-G-MSC. S-G-SNV and S-G-MSC with R 2 greater than 0.8 were selected as the optimal preprocessing.
The full spectrum of SOM content processed by S-G-SNV and S-G-MSC was used for wavelength variable screening. In this study, we selected 11 methods for feature variable selection (global PLS, CARS-PLS, SPA-PLS, UVE-PLS, SiPLS-PLS, UVE-CARS-PLS, UVE-SPA-PLS, UVE-SiPLS-PLS, CARS-SPA-PLS, SiPLS-CARS-PLS, and SiPLS-SPA-PLS) and the results were compared accordingly, the best ones were selected for model development.
When the four classical algorithms were executed individually, CARS was better at screening strong information variables and SiPLS was better at determining band combinations. Although SiPLS has the longest computation time and a weak ability to simplify the model inputs, the PLS regression analysis was significant. CARS selects the wavelength of two different pretreated VIS-NIR spectra. The differences in its computed speed, lower feature variable screening, and difference in the correction model were small, but with a larger difference in Rv 2 , explaining that CARS does not fully consider the synergy between adjacent variables under different backgrounds, which is consistent with the research results of Li Pao [43].
When the feature variable selection methods are used in combination, SiPLS-UVE is the best wavelength selection strategy. The ability of UVE-SPA to extract effective wavelengths was the worst and did not have the effect of improving model accuracy and precision. SiPLS-CARS and SiPLS-SPA further refined the variables ultimately involved in modeling. UVE-CARS and CARS-SPA simplified and optimized the full-spectrum model to some extent. This shows that when the feature variable selection methods were used in combination, some of the combinations could refine the variable screening results and improve the model run speed, and some of the combinations could improve the model quality and enhance the model prediction ability. The model simplification usually occurred along with the improvement of the model prediction ability. In addition, the selection of feature variables in different spectral preprocessing contexts could also have an impact on the constructed models. Therefore, which preprocessing method is more beneficial for modeling needs to be analyzed and discussed in conjunction with the wavelength variable selection process.
The results show that the optimal models are SiPLS-PLS and SiPLS-UVE-PLS, but not all feature variable selection methods can find strong information variables and eliminate weak information variables and irrelevant information variables. SiPLS-PLS and SiPLS-UVE-PLS compress global variables, reduce model calculation time, and enhance model prediction capabilities. The finally constructed model has a high prediction accuracy and good robustness, which can realize the effective prediction of SOM content in the sampling area. The two methods can provide methodological assistance for the rapid and accurate prediction of SOM content in the same forest type, and can provide the theoretical basis for the implementation of key forest management technologies, thus promoting the research progress of SOM dynamic monitoring of low-quality coniferous and broad-leaved mixed forest.