agronomy Combining Variable Selection and Multiple Linear Regression for Soil Organic Matter and Total Nitrogen Estimation by DRIFT-MIR Spectroscopy

: The successful estimation of soil organic matter (SOM) and soil total nitrogen (TN) contents with mid-infrared (MIR) reﬂectance spectroscopy depends on selecting appropriate variable selection techniques and multivariate methods for regression analysis. This study aimed to explore the potential of combining a multivariate method and spectral variable selection for soil SOM and TN estimation using MIR spectroscopy. Five hundred and ten topsoil samples were collected from Quzhou County, Hebei Province, China, and their SOM and TN contents and reﬂectance spectra were measured using DRIFT-MIR spectroscopy (diffuse reﬂectance infrared Fourier transform in the mid-infrared range, MIR, wavenumber: 4000–400 cm − 1 ; wavelength: 2500–25,000 nm). Two multivariate methods (partial least-squares regression, PLSR; multiple linear regression, MLR) combined with two variable selection techniques (stability competitive adaptive reweighted sampling, sCARS; bootstrapping soft shrinkage approach, BOSS) were used for model calibration. The MLR model combined with the sCARS method yielded the most accurate estimation result for both SOM (R p2 = 0.72 and RPD = 1.89) and TN (R p2 = 0.84 and RPD = 2.50). Out of the 2382 wavenumbers in a full spectrum, sCARS determined that only 31 variables were important for SOM estimation (accounting for 1.30% of all variables) and 27 variables were important for TN estimation (accounting for 1.13% of all variables). The results demonstrated that sCARS was a highly efﬁcient approach for extracting information on wavenumbers and mitigating redundant wavenumbers. In addition, the current study indicated that MLR, which is simpler than PLSR, when combined with spectral variable selection, can achieve high-precision prediction of SOM and TN content. As such, DRIFT-MIR spectroscopy coupled with MLR and sCARS is a good alternative for estimating the SOM and TN of soils.


Introduction
Monitoring the soil status is in great demand in precision agriculture to adjust practices such as tillage, fertilization, and irrigation [1]. Soil organic matter (SOM) and total nitrogen (TN) are essential elements in agricultural soil and play an important role in many biological and chemical activities for plant growth. Therefore, the assessment of SOM and TN in the soil is crucial in precision agriculture [2]. A variety of agricultural sensors have been applied over recent decades to determine soil properties rapidly [3]. Spectroscopy, in particular, has increased in popularity because it is rapid, timely, cost-effective, nondestructive, and straightforward [4]. As effective alternatives to traditional chemical Thus, the objectives of our study were: (i) to study the benefits of spectral variable selection (with BOSS and sCARS) for calibration accuracy and model parsimony of SOM and TN estimation using MIR spectroscopy; (ii) to identify the information content of the statistically selected spectral variables by relating them-if possible-to fundamental bands of relevant molecules or functional groups; and (iii) to investigate whether the simpler MLR combined with the variable selection algorithm can achieve similar or better results than the PLSR model.

Study Area and Soil Sampling
The 1 km×1 km grid survey was used to collect soil samples in Quzhou County, Hebei Province, China, which covers about 667 km 2 . The average annual temperature and precipitation in the study region are approximately 13.1 • C and 556 mm, respectively. The soil type is mainly Fluvo-aquic soil. Five hundred and ten topsoil (0-20 cm) samples were obtained using a soil auger. The detailed position information of the sampling sites was designed and recorded using a hand-held GNSS device ( Figure 1). relationships based on, for example, peak intensity measurements [7]. The commonly applied multivariate calibration tools are partial least-squares regression (PLSR) [24,39], multiple linear regression (MLR) [40,41], and support vector machine regression (SVMR) [42]. Therefore, combining the multivariate method with spectral variable selection is expected to generate a better estimation model of SOM and TN based on MIR spectroscopy.
Thus, the objectives of our study were: (i) to study the benefits of spectral variable selection (with BOSS and sCARS) for calibration accuracy and model parsimony of SOM and TN estimation using MIR spectroscopy; (ii) to identify the information content of the statistically selected spectral variables by relating them-if possible-to fundamental bands of relevant molecules or functional groups; and (iii) to investigate whether the simpler MLR combined with the variable selection algorithm can achieve similar or better results than the PLSR model.

Study Area and Soil Sampling
The 1 km×1 km grid survey was used to collect soil samples in Quzhou County, Hebei Province, China, which covers about 667 km 2 . The average annual temperature and precipitation in the study region are approximately 13.1 °C and 556 mm, respectively. The soil type is mainly Fluvo-aquic soil. Five hundred and ten topsoil (0-20 cm) samples were obtained using a soil auger. The detailed position information of the sampling sites was designed and recorded using a hand-held GNSS device ( Figure 1).

Spectral Measurement and SOM and TN Content Analysis
The obtained soil samples were air-dried at room temperature and crushed before being passed through a 2 mm mesh sieve to remove any plant roots and debris prior to analysis. Each soil sample was divided into two portions: one for spectral measurement and the other for SOM and soil TN content analysis. The soil samples to be used for MIR analysis were finely ground, passed through a 0.15 mm (100-mesh) sieve [43,44], dried in oven at 105 °C for 2 h, and then stored in a desiccator (Nalgene, Thermo Fisher, Waltham,

Spectral Measurement and SOM and TN Content Analysis
The obtained soil samples were air-dried at room temperature and crushed before being passed through a 2 mm mesh sieve to remove any plant roots and debris prior to analysis. Each soil sample was divided into two portions: one for spectral measurement and the other for SOM and soil TN content analysis. The soil samples to be used for MIR analysis were finely ground, passed through a 0.15 mm (100-mesh) sieve [43,44], dried in oven at 105 • C for 2 h, and then stored in a desiccator (Nalgene, Thermo Fisher, Waltham, MA, USA) until they were ready to be scanned. SOM was chemically determined using a potassium dichromate oxidation colorimetry [45], with values ranging from 3.20 g/kg to 32.70 g/kg (Table 1). Soil TN content was determined by the semi-micro automated Kjeldahl method [46], with values ranging from 0.21 g/kg to 2.42 g/kg (Table 1). Soil MIR diffuse reflectance spectra were collected using a Fourier transform infrared spectrometer (FT-IR) Tensor II with an HTS-XT high-throughput diffuse reflectance accessory (Bruker Optics, Karlsruhe, Germany). The spectrometer was equipped with a mercury cadmium telluride detector cooled by liquid nitrogen. The measured wavebands ranged from 4000 to 600 cm −1 with a resolution of 4 cm −1 . Soil samples were loaded into 24-well spot aluminum microtiter plates. Prior to each scan, the sensor was calibrated using the background spectrum of the reference plate. Gold background measurements of the first well were taken before each single measurement to account for changes in temperature and air humidity. Soil samples were loaded into two replicate wells, each scanned 32 times, and the two spectra were averaged to account for within-sample variability and differences in packing density and particle size.

Spectral Pre-Processing
Original soil MIR spectra and spectra after pre-processing are shown in Figure 2. Spectral pre-processing with multiplicative scatter correction (MSC) [47] was carried out for SOM, and spectra pre-processed by MSC are shown in Figure 2b. For TN, spectral pre-processing was completed using a standard normalized variate (SNV) [48], and preprocessed spectra by SNV are shown in Figure 2c.
MA, USA) until they were ready to be scanned. SOM was chemically determined using a potassium dichromate oxidation colorimetry [45], with values ranging from 3.20 g/kg to 32.70 g/kg (Table 1). Soil TN content was determined by the semi-micro automated Kjeldahl method [46], with values ranging from 0.21 g/kg to 2.42 g/kg (Table 1). Soil MIR diffuse reflectance spectra were collected using a Fourier transform infrared spectrometer (FT-IR) Tensor II with an HTS-XT high-throughput diffuse reflectance accessory (Bruker Optics, Karlsruhe, Germany). The spectrometer was equipped with a mercury cadmium telluride detector cooled by liquid nitrogen. The measured wavebands ranged from 4000 to 600 cm −1 with a resolution of 4 cm −1 . Soil samples were loaded into 24-well spot aluminum microtiter plates. Prior to each scan, the sensor was calibrated using the background spectrum of the reference plate. Gold background measurements of the first well were taken before each single measurement to account for changes in temperature and air humidity. Soil samples were loaded into two replicate wells, each scanned 32 times, and the two spectra were averaged to account for within-sample variability and differences in packing density and particle size.

Spectral Pre-Processing
Original soil MIR spectra and spectra after pre-processing are shown in Figure 2. Spectral pre-processing with multiplicative scatter correction (MSC) [47] was carried out for SOM, and spectra pre-processed by MSC are shown in Figure 2b. For TN, spectral preprocessing was completed using a standard normalized variate (SNV) [48], and pre-processed spectra by SNV are shown in Figure 2c.

Stability Competitive Adaptive Reweighting Algorithm (sCARS)
sCARS mainly relies on the stability of the selected variables, so it also improves the stability of variable selection while continuing CARS′s operation of screening variables [35].
The steps involved in the sCARS algorithm operation is as follows [34]:

Stability Competitive Adaptive Reweighting Algorithm (sCARS)
sCARS mainly relies on the stability of the selected variables, so it also improves the stability of variable selection while continuing CARS s operation of screening variables [35].
The steps involved in the sCARS algorithm operation is as follows [34]: Step 1: Compute the stability of each wavenumber variable.
Step 2: On the basis of the exponentially decreasing function (EDF) and adaptive reweighted sampling (ARS) method, a subset of variables with strong stability is selected.
Step 3: Repeat the first step and the second step, and finally obtain K subsets of variables, and use the selected subsets of variables to establish a PLSR model, in which the subset of variables with the lowest root-mean-squared error of cross-validation (RMSECV) is the best characteristic variable. The algorithm procedure can be illustrated in a flow chart of Figure 3 [34].
Step 1: Compute the stability of each wavenumber variable.
Step 2: On the basis of the exponentially decreasing function (EDF) and adaptive reweighted sampling (ARS) method, a subset of variables with strong stability is selected.
Step 3: Repeat the first step and the second step, and finally obtain K subsets of variables, and use the selected subsets of variables to establish a PLSR model, in which the subset of variables with the lowest root-mean-squared error of cross-validation (RMSECV) is the best characteristic variable.
The algorithm procedure can be illustrated in a flow chart of Figure 3 [34]. The following parameters were used in the research: Monte Carlo sampling (MCS) number M = 500, MCS sampling ratio r = 0.7, and calculation loops number N = 100. The processes of the sCARS algorithm were conducted using MATLAB R2020b.

Bootstrapping Soft Shrinkage (BOSS)
BOSS uses a strategy called a soft-shrinking strategy in variable selection. The softshrinking strategy assigns smaller weights to the less informative variables, while the hard-shrinking strategy directly eliminates the less informative variables [30]. Thus, these The following parameters were used in the research: Monte Carlo sampling (MCS) number M = 500, MCS sampling ratio r = 0.7, and calculation loops number N = 100. The processes of the sCARS algorithm were conducted using MATLAB R2020b.

Bootstrapping Soft Shrinkage (BOSS)
BOSS uses a strategy called a soft-shrinking strategy in variable selection. The softshrinking strategy assigns smaller weights to the less informative variables, while the hard-shrinking strategy directly eliminates the less informative variables [30]. Thus, these variables still have the chance to participate in the models. The advantage of the softshrinking strategy is that it can reduce the risk of eliminating important variables in the optimization process and can choose a combination of variables with better prediction ability [36].
The main process of the BOSS algorithm operation is as follows [36]: Suppose there is a matrix X N×P which contains N samples and P variables. The prediction attribute vector is y N×1 .
Step 1: Apply bootstrap sampling (BSS) in variable space to generate K subsets. In this process, each variable has the same probability to be selected into subsets.
Step 2: K PLSR sub-models are established by using the obtained subsets. Additionally, the RMSECV of the sub-model is calculated, and the lowest RMSECV is the best model.
Step 3: Calculate the regression coefficient (RC) of the extracted model. The absolute values of the elements in each regression vector are normalized, and the normalized regression vectors are summed to obtain the new weights of variables.
Step 4: Weighted bootstrap sampling (WBS) is used to generate new subsets of variables and extract unique variables to establish sub-models. Repeat Step 2 to Step 4 until the number of variables in the new subset is 1, and the subset with the lowest RMSECV is taken as the optimal variable set during calculation.
The algorithm procedure can be illustrated in a flow chart of Figure 4.
ability [36]. The main process of the BOSS algorithm operation is as follows [36]: Suppose there is a matrix XN×P which contains N samples and P variables. The prediction attribute vector is yN×1.
Step 1: Apply bootstrap sampling (BSS) in variable space to generate K subsets. In this process, each variable has the same probability to be selected into subsets.
Step 2: K PLSR sub-models are established by using the obtained subsets. Additionally, the RMSECV of the sub-model is calculated, and the lowest RMSECV is the best model.
Step 3: Calculate the regression coefficient (RC) of the extracted model. The absolute values of the elements in each regression vector are normalized, and the normalized regression vectors are summed to obtain the new weights of variables.
Step 4: Weighted bootstrap sampling (WBS) is used to generate new subsets of variables and extract unique variables to establish sub-models. Repeat Step 2 to Step 4 until the number of variables in the new subset is 1, and the subset with the lowest RMSECV is taken as the optimal variable set during calculation.
The algorithm procedure can be illustrated in a flow chart of Figure 4.  The following parameters were used in the research: the maximal number of latent variables for cross-validation = 5; fold = 5; pretreatment method = center; the number of bootstraps = 1000; speed = 0 [30,36]. The whole process of the BOSS method was conducted in MATLAB R2020b with the BOSS toolbox, available at: http://www.mathworks.com/ matlabcentral/fileexchange/52770-boss (accessed on 16 September 2021).

Model Calibration
The whole dataset was divided into a calibration dataset (n = 357) and a validation dataset (n = 135) using the Kennard-Stone algorithm [49]. Two multivariate methods (PLSR and MLR) were used for model calibration.
PLSR is similar to principal component regression, as both employ statistical rotations to overcome the problems of high dimensionality and multicollinearity [23]. This method selected continuous orthogonal factors to maximize the covariance between prediction variables and response variables [33]. Additional details of PLSR have been described by Viscarra Rossel and Behrens (2010) [40].
MLR analysis is also the most basic and simplest in multivariate regression analysis, and the mathematical formula expressing this quantitative relationship is called the multivariate linear regression model. Under the condition of studying linear correlation, linear regression can be divided into two types: one is an independent variable and a dependent variable, and the relationship between them can be approximated using a straight line. This regression analysis is called unary linear regression. The other is the multiple linear regression analysis in the case of two or more independent variables versus one dependent variable [50]. In fact, when studying problems, a phenomenon is often associated with multiple factors. It is more effective and more realistic to predict or estimate the dependent variable using the optimal combination of multiple independent variables. Therefore, multivariate linear regression is more effective than univariate linear regression. In the process of applying a regression model, only the same model and data can be used, and the unique result can be calculated using a standard statistical method, which ensures the accuracy and stability of the result. MLR is a modeling technique commonly used in spectral prediction that usually gives relatively good prediction accuracy [40,41].

Model Evaluation
To assess the goodness of fit of the predictive models and their performance against independent validation sets, we used the coefficient determination of prediction (R p 2 ) (Equation (1)), the root-mean-square error of prediction (RMSEp) (Equation (2)), and the residual prediction deviation of prediction (RPD) (Equation (3)). R 2 represents the degree to which the dependent variable is fully interpreted. The RMSE measures the accuracy of predictions, being an easily interpreted statistic because it has the same data units. The RPD is the ratio of the standard deviation of the measured values to the RMSEp, and it was used to define the quality of the derived models [51]. These evaluation parameters could be specifically defined by the following formulae: where y i andŷ i refer to the measured value and the corresponding estimated value, respectively, y i is the average of the estimated values, n denotes the number of samples, and SD is the standard deviation of the measured values. In soil science, the RPD was used as a classifier of prediction results in the common scheme defined by Chang et al. (2001). When RPD > 2.0, the estimation model was more reliable. Values of RPD between 1.4 and 2.0 suggest that the predictive power could be improved, and RPD < 1.4 indicates that the model failed to predict the soil properties (i.e., it had no prediction capability) [52].

Results
This section is divided by subheadings. It provides a concise and precise description of the experimental results, their interpretation, as well as the experimental conclusions that can be drawn.

Estimation Accuracies and Model Parsimony
The performances of different variable selection models are illustrated in Table 2. All estimation models of SOM have a similar prediction accuracy with 0.65 < R p 2 < 0.75 and 1.4 < RPD < 2.0 (satisfactory predictions), while R p 2 and RPD values in all estimation models of TN are greater than 0.75 or 2.0 (good predictions). The R p 2 and RPD of SOM estimation models based on the BOSS and sCARS variables selection methods were equal to or higher than that of the full MIR spectrum model. BOSS selected 14 variables (out of 2382 spectral variables), whereas sCARS selected 31 variables (Table 2). For TN, the PLSR models developed with selected wavenumbers gave better estimations compared to the models developed with the full spectrum. The BOSS algorithm, due to its instability, needs to run 50 times, which usually takes several hours. On the contrary, the sCARS algorithm runs only once since it is stable, taking usually less than an hour. Hence, due to the long computation time of BOSS, one alternative was to use the sCARS method, which is in agreement with another study [35].

Key Wavenumbers
In the BOSS algorithm, sub-models are generated according to the weights of variables. The weights of variables are obtained based on the RC of sub-models. Each sub-model corresponds to a random combination of variables, in which the variables with larger weights have larger probabilities to participate. Five-fold cross-validation is applied to explore its predictability. The evolution of RMSECV (g/kg) in sub-models in each iteration of the BOSS algorithm is displayed in Figure 5. For SOM, the RMSECV reached the minimum value (0.18 g/kg) when the number of sampling runs was equal to 24 (Figure 5a). Fourteen spectral wavenumber variables were selected out of 2382 spectral variables for SOM, which was 0.58% of the number of the full spectrum. For TN, the RMSECV reached the minimum value (0.12 g/kg) when the number of sampling runs was equal to 19 (Figure 5b). Forty-four spectral wavenumber variables were selected out of 2382 spectral variables for TN, which is 1.85% of the number of the full spectrum. The final selected spectral variables for SOM and TN are illustrated in Figure 6a in the red region and Figure 6b in the blue region, respectively. 5a). Fourteen spectral wavenumber variables were selected out of 2382 spectral variables for SOM, which was 0.58% of the number of the full spectrum. For TN, the RMSECV reached the minimum value (0.12 g/kg) when the number of sampling runs was equal to 19 (Figure 5b). Forty-four spectral wavenumber variables were selected out of 2382 spectral variables for TN, which is 1.85% of the number of the full spectrum. The final selected spectral variables for SOM and TN are illustrated in Figure 6a in the red region and Figure  6b in the blue region, respectively.   Figure 7 shows the variable selection process using sCARS method. The number of sampling variables of SOM and TN decreased with the increasing number of sampling runs (Figure 7a,b). Figure 7c,d presents the changing trend of 10-fold RMSECV values with the increasing number of sampling runs. The RMSECV reached the minimum value (0.19 g/kg for SOM and 0.11 g/kg for TN) when the number of sampling runs was equal to 80 and 48 for SOM and TN, respectively. Figure 7e,f illustrates the variation trend of the regression coefficient path for different sampling runs. The optimal subset corresponded to the lowest RMSECV, marked by a vertical line with asterisks. Accordingly, 31  Figure 7 shows the variable selection process using sCARS method. The number of sampling variables of SOM and TN decreased with the increasing number of sampling runs (Figure 7a,b). Figure 7c,d presents the changing trend of 10-fold RMSECV values with the increasing number of sampling runs. The RMSECV reached the minimum value (0.19 g/kg for SOM and 0.11 g/kg for TN) when the number of sampling runs was equal to 80 and 48 for SOM and TN, respectively. Figure 7e,f illustrates the variation trend of the regression coefficient path for different sampling runs. The optimal subset corresponded to the lowest RMSECV, marked by a vertical line with asterisks. Accordingly, 31 and 27 spectral variables of SOM and TN were selected by sCARS in MIR spectrum, which accounted for 1.30% and 1.13% of the full spectrum (Figure 6c,d). The final effective wavenumber variables selected by BOSS and sCARS for estimating SOM and TN content are shown in Figure 6 and Table 3 The final effective wavenumber variables selected by BOSS and sCARS for estimating SOM and TN content are shown in Figure 6 and Table 3. Keys were found in the X−H stretching (C-H, N-H, and O-H groups containing hydrogen) region from 4000 to 2500 cm −1 , in the triple-and double-bond regions from 2500 to 1500 cm −1 , and in the fingerprint MIR region from 1500 to 600 cm −1 . Then, 2894-2890 cm −1 , 1760-1755 cm −1 , 1563 cm −1 , 1555 cm −1 , and 1554 cm −1 were determined individually to be the sensitive wavenumbers of SOM content (Figure 6a,c), and 3733 cm −1 , 3728 cm −1 , 3707 cm −1 , 3560 cm −1 , 3127 cm −1 , 2825-2818 cm −1 , 1704-1700 cm −1 , 1675 cm −1 , 1673 cm −1 , 1671 cm −1 , and 817 cm −1 were determined individually to be the sensitive wavenumbers of soil TN content (Figure 6b,d).

Discussion
Important SOM features are located at 1740-1698 cm −1 (C=O groups in carboxylic acids, aldehydes, and ketones), 1640-1600 cm −1 (amide I band (C-O, C-N) of proteins, C=O of carboxylic acids and ketones), and 1575-1400 cm −1 (amide II band, N-H) [23]. The intensity of the region 1740-1698 cm −1 can be used as a measure of hydrophilic organic components [53] and was identified as a key region for soil TN [38]. Key wavenumbers of TN in MIR identified from the CARS-PLSR runs were 1676-1672 cm −1 , 1260 cm −1 , and 1036 cm −1 [23]. The report is consistent with the characteristic variables selected in the present study.
BOSS is a strategy that combines MPA and WBS. The absolute value of RC is the criteria of variables. The advantage of MPA is that it considers the combined effect between variables. WBS reduces the effect of collinearity [36]. BOSS focuses more on using the information in the model RC. Yan et al. (2019) reported that BOSS selected fewer variables, gave lower a RMSEP value, and better predicted the starch content of corn using NIR spectra compared to other methods, including Monte Carlo uninformative variables elimination (MCUVE), GA, and CARS [35]. Moreover, Lao et al. (2021) reported that BOSS-ELM (Extreme Learning Machine) models outperformed MCUVE-ELM models in estimating the contents of soil salt and soluble ions using Vis-NIR spectroscopy [30].
The sCARS method enhances the stability of variable selection by using variable stability as a variable selection indicator and continues the variable selection process of the CARS method with fast calculation speed. When Yan et al. (2019) employed sCARS and MCUVE to predict the real extract concentration of beer in NIR spectroscopy, the authors reported that both methods had comparative performance; however, MCUVE retained fifteen times more variables than sCARS did [35]. The sCARS method was also employed to extract the characteristic bands of SOM via Vis-NIR spectroscopy, and 51 variables were selected (out of 2000 spectral variables). Compared to modeling based on the full bands, the combination of variable selection and regression methods could effectively improve the modeling efficiency while ensuring the accuracy of the model [54].
The variable selection algorithm combines the prediction model, and the scatter diagrams of PLSR and MLR, respectively, are displayed in Figures 8 and 9. The solid black line in each plot represents the 1:1 line. As shown in the figures, the measured value and predicted value of SOM and TN content in the MLR models were generally closer to the 1:1 line compared to the PLSR models. MLR is used to substitute independent variables into the regression equation in turn, and eliminate irrelevant or insignificant independent variables according to the score and the contribution rate of independent variables to dependent variables so as to obtain an accurate and stable prediction model [55]. Jia et al. (2014) used MCUVE and SPA combined with PLSR and MLR for the prediction of soil nitrogen and organic carbon using NIR diffuse reflectance spectroscopy [56]. They reported that although the application of the SPA-MLR method did not lead to satisfactory predictions, the variable number was much reduced and the calibration models were simplified, which made it possible to use a cheap and simple spectrometer for the measurement of soil nitrogen.

FOR PEER REVIEW
13 of 17 predictions, the variable number was much reduced and the calibration models were simplified, which made it possible to use a cheap and simple spectrometer for the measurement of soil nitrogen. .

Conclusions
In this study, two variable selection techniques (i.e., sCARS and BOSS) were employed to circumvent the problem of multicollinearity in MLR to improve the estimation accuracy of SOM and TN models developed with DRIFT-MIR spectra. The results showed that by using a few selected wavenumbers, BOSS and sCARS can give even better predictive performance in much less computation time when compared with the full spectrum model. The MLR model combined with the sCARS method yielded the most accurate estimation result for SOM and TN. The results of the present study indicated that MLR with sCARS could decrease the computation complexity and simplify the model, and consequently improve the prediction ability. Thus, using DRIFT-MIR spectroscopy together with MLR and sCARS is a good alternative for estimating the SOM and TN of soils.

Conclusions
In this study, two variable selection techniques (i.e., sCARS and BOSS) were employed to circumvent the problem of multicollinearity in MLR to improve the estimation accuracy of SOM and TN models developed with DRIFT-MIR spectra. The results showed that by using a few selected wavenumbers, BOSS and sCARS can give even better predictive performance in much less computation time when compared with the full spectrum model. The MLR model combined with the sCARS method yielded the most accurate estimation result for SOM and TN. The results of the present study indicated that MLR with sCARS could decrease the computation complexity and simplify the model, and consequently improve the prediction ability. Thus, using DRIFT-MIR spectroscopy together with MLR and sCARS is a good alternative for estimating the SOM and TN of soils.

Data Availability Statement:
The datasets and codes are available from the first author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.