Rapid and Accurate Measurement of Major Soybean Components Using Near-Infrared Spectroscopy

Li, Chenxiao; Yu, Jiatong; Wang, Sheng; Zhao, Qinglong; Song, Qian; Xu, Yanlei

doi:10.3390/agronomy15071505


by Chenxiao Li ¹, Jiatong Yu ¹, Sheng Wang ¹, Qinglong Zhao ¹, Qian Song ² and Yanlei Xu ¹,*

¹ College of Information Technology, Jilin Agricultural University, Changchun 130118, China

² College of Physics, Jilin University, Changchun 130012, China

* Author to whom correspondence should be addressed.

Agronomy 2025, 15(7), 1505; https://doi.org/10.3390/agronomy15071505

Submission received: 28 May 2025 / Revised: 19 June 2025 / Accepted: 19 June 2025 / Published: 21 June 2025

(This article belongs to the Special Issue Application of Machine Learning and Modelling in Food Crops)


Abstract

This study addresses the urgent need for the rapid, non-destructive assessment of key soybean components, including moisture, fat, and protein, using near-infrared (NIR) spectroscopy. It provides technical and theoretical support for achieving the efficient and accurate detection of major soybean components and for the development of portable NIR instruments. Thirty soybean samples from diverse sources were collected, and 360 spectral measurements were acquired using a 900–1700 nm NIR spectrometer after grinding and standardized sampling. To improve model robustness, preprocessing strategies such as standard normal variate (SNV), multiplicative scatter correction (MSC), and Savitzky–Golay derivatives were applied. Feature selection was conducted using competitive adaptive reweighted sampling (CARS), successive projections algorithm (SPA), and uninformative variable elimination (UVE), followed by model construction with partial least squares regression (PLSR), support vector regression (SVR), and random forest (RF). Comparative analysis revealed that the RF model consistently outperformed the others across most combinations. Specifically, the SPA–SNV + D₁–RF combination achieved an RPD of 14.7 for moisture, CARS–SNV + D₁–RF reached 5.9 for protein, and CARS–SG + D₂–RF attained 12.0 for fat, all significantly surpassing alternative methods and demonstrating a strong nonlinear learning capacity and predictive precision. These findings show that integrating optimal preprocessing and feature selection strategies can markedly enhance the predictive accuracy in NIR-based soybean analyses. The RF model offers exceptional stability and performance, providing both technical reference and theoretical support for the development of portable NIR devices and practical rapid-quality assessment systems for soybeans in industrial applications.

Keywords: near-infrared spectroscopy; soybean; chemometrics; variable selection; prediction model

1. Introduction

Soybean is an important food and economic crop worldwide. It is widely used in the food, feed, and oil industries and is a major source of plant protein and vegetable oil. The moisture, fat, and protein contents are core indicators for assessing soybean quality, directly influencing its nutritional value, processing suitability, and market grade [1]. In breeding selection, storage management, and quality grading, there is an urgent need for efficient and rapid detection technologies to achieve the precise measurement of key soybean components [2,3].

Currently, the detection of soybean quality components mainly relies on traditional physicochemical methods, such as the oven-drying method for moisture determination [4], Soxhlet extraction for fat determination [5], and the Kjeldahl method for protein determination [6]. Although these methods offer excellent measurement accuracy, they generally suffer from complex procedures and long detection cycles, making them inadequate to meet the urgent demand for rapid and accurate measurement—especially in on-site detection and high-throughput application scenarios. Therefore, the development of green, portable, and real-time detection methods has become a major research focus in this field [7,8].

In recent years, near-infrared (NIR) spectroscopy has attracted widespread attention for its rapid and efficient analytical capabilities. However, studies focusing on its application in simultaneous multi-component analysis remain relatively limited worldwide, highlighting the need for systematic investigations to advance its broader adoption [9,10,11]. By analyzing the absorption characteristics of functional groups such as O–H, C–H, and N–H in samples, NIR spectroscopy enables the rapid prediction of components such as moisture, protein, and fat [12,13]. Previous studies have shown that NIR spectroscopy can be used for the physicochemical assessment of various agricultural products, including fruits [14], leaves [15], dairy products [16], soils [17,18], grains [19,20], meats [21], and aquatic products [22]. Chadalavada et al. [20] used 328 multi-grain samples to compare benchtop and portable NIR devices, combining traditional statistical methods with machine learning approaches (including convolutional neural networks) to develop a rapid prediction model for grain protein content, thus providing a reliable quality assessment tool for agricultural research and grain trade. In addition, studies by Yu et al. [15] and Luo et al. [14] showed that combining modeling methods such as CARS, RBF networks, and explainable artificial intelligence can further improve prediction accuracy and model transparency.

Ferreira et al. used 40 soybean samples to compare NIR and MIR spectroscopy for predicting soybean components [23]. Although multiple quality traits were considered, their approach focused on two spectral methods and linear modeling, leaving challenges in spectral preprocessing, variable selection, and nonlinear modeling unaddressed. Current techniques still face limitations in improving model robustness and generalization, highlighting the need for a more systematic framework to enhance prediction accuracy.

Against this background, this study established a systematic near-infrared (NIR) spectral modeling framework for predicting multiple soybean components, including moisture, fat, and protein. Using 30 soybean samples, NIR spectral data were collected across the 900–1700 nm range. A total of nine preprocessing methods, three variable selection algorithms, and three modeling techniques were combined to construct 72 modeling pathways, comprehensively evaluating their performance in multi-index prediction. The specific objectives of this study are to clarify the synergistic effects of various preprocessing and variable selection methods under different modeling techniques, identify the optimal model-preprocessing combination, and enhance prediction accuracy and robustness. Particular emphasis is placed on comparing the performance differences between linear and nonlinear models in simultaneous multi-indicator prediction, aiming to provide a scientific basis for model selection in rapid soybean quality assessment. This study aims to identify the optimal modeling strategy and provide theoretical and technical support for the development of accurate and versatile rapid soybean detection systems.

2. Materials and Methods

2.1. Sample Preparation

The 30 soybean samples used in this study covered different varieties and growing locations in Jilin Province, China. There were no strict restrictions on the collection time or specific location, so that the samples reflect the natural variability of soybean components under actual production conditions. To ensure representativeness and consistency, all samples were pre-screened to remove impurities, retaining only plump, intact seeds with normal coloration. For each sample, 50 g of soybeans were ground into a fine powder using a laboratory grinder. The powder was then passed through a 30-mesh sieve to remove coarse particles, placed in a clean, dry, flat-bottomed glass container, and gently pressed flat to provide a smooth surface for subsequent measurements. Spectral measurements were performed using an OTO SW2860-050u spectrometer (wavelength range 900–1700 nm, resolution 4.5 nm) equipped with a halogen tungsten lamp light source (model OTO-LS-HA) covering 350–2500 nm. The integration time was set to 20 milliseconds, and each spectrum was averaged over 10 scans. As shown in Figure 1, to improve the comprehensiveness and repeatability of spectral acquisition, each container was divided into four regions (1–4 on the right side of Figure 1), and three independent near-infrared (NIR) spectra were acquired in each region of each sample. This resulted in a total of 360 spectra (30 samples × 4 regions × 3 spectra), providing a robust dataset for subsequent component analysis and model development.

2.2. Spectral Measurement

Under stable room temperature conditions, near-infrared (NIR) spectral measurements were performed on the ground soybean powder samples. Each sample was placed into a clean, dry container, divided into four equal sections, and multiple spectra were collected from each section to enhance data stability. Spectral data were acquired using an NIR spectrometer covering the 900–1700 nm wavelength range, and the data were transferred to a computer via the instrument’s software. During the experiment, the instrument was periodically calibrated using white reference and dark current adjustments, with barium sulfate standard plates serving as the white reference material. The sample reflectance (R) was calculated using the following equation:

\begin{matrix} R = \frac{I_{sample} - I_{dark}}{I_{white} - I_{dark}} \end{matrix}

(1)

Here, I_{sample} denotes the sample light intensity, I_{dark} denotes the dark current signal, and I_{white} denotes the white reference light intensity. The final reflectance spectra over the 900–1700 nm wavelength range were obtained in this way and subsequently used for soybean composition analysis and chemometric modeling.
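As an illustration of Equation (1), the following minimal sketch computes reflectance from raw intensities with NumPy. The function names `reflectance` and `absorbance` are hypothetical and not taken from the authors' code. The conversion to absorbance as log(1/R) assumes a base-10 logarithm, which is the usual convention but is not stated explicitly in the text.

```python
import numpy as np

def reflectance(i_sample, i_white, i_dark):
    """Equation (1): dark-corrected sample signal over dark-corrected white reference."""
    i_sample = np.asarray(i_sample, dtype=float)
    i_white = np.asarray(i_white, dtype=float)
    i_dark = np.asarray(i_dark, dtype=float)
    return (i_sample - i_dark) / (i_white - i_dark)

def absorbance(r):
    """Apparent absorbance log(1/R); base 10 is an assumption of this sketch."""
    return np.log10(1.0 / np.asarray(r, dtype=float))
```

Both functions operate element-wise, so a full spectrum (one intensity per wavelength) or a matrix of spectra can be passed directly.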

2.3. Chemometrics

The spectral data were analyzed and modeled using Python 3.7, with the scikit-learn library employed for model development and performance evaluation. Given that NIR spectra contain abundant latent information, effective data preprocessing and modeling are essential to accurately predict key soybean components, including moisture, fat, and protein. To this end, a chemometric approach was employed, following three main steps: (1) preprocessing the raw spectra to enhance signal quality; (2) selecting spectral features relevant to the target components; and (3) constructing predictive models and validating them on independent samples to evaluate model stability and predictive performance.

2.3.1. Spectral Preprocessing

Prior to modeling, raw near-infrared (NIR) spectral data typically require preprocessing to construct quantitative prediction models suitable for multivariate regression analysis [24]. The primary goals of preprocessing are to eliminate spectral variations unrelated to the modeling targets (such as background noise and scattering interference) and to strengthen the correlations between the spectra and target components, including moisture, fat, and protein [25]. To improve model stability and generalizability, this study systematically compared the effectiveness of multiple spectral preprocessing strategies, including scatter correction (SC), spectral derivatives (SDs), and their combinations.

The scatter correction (SC) methods included standard normal variate (SNV) transformation and multiplicative scatter correction (MSC), while the Savitzky–Golay (SG) smoothing filter was applied to compute first-order (D₁) and second-order (D₂) derivatives to enhance subtle absorption features and improve spectral resolution. The final preprocessing schemes adopted included SG, MSC, SNV, SG + D₁, SG + D₂, MSC + D₁, MSC + D₂, SNV + D₁, and SNV + D₂.
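The preprocessing operations named above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are hypothetical, SciPy's `savgol_filter` is assumed to be available (SciPy is not listed among the libraries in Section 2.5), and the default window of 5 points with a second-order polynomial follows the settings reported in Section 3.2. Spectra are assumed to be stored row-wise, one spectrum per row.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        # Fit s = intercept + slope * ref, then remove offset and scale.
        slope, intercept = np.polyfit(ref, s, 1)
        corrected[i] = (s - intercept) / slope
    return corrected

def sg_derivative(spectra, deriv=1, window_length=5, polyorder=2):
    """Savitzky-Golay smoothing (deriv=0) or derivative (deriv=1 or 2) along wavelengths."""
    return savgol_filter(np.asarray(spectra, dtype=float), window_length=window_length,
                         polyorder=polyorder, deriv=deriv, axis=1)
```

A combined scheme such as SNV + D₁ would then correspond to `sg_derivative(snv(X), deriv=1)`, assuming the scatter correction is applied before the derivative.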

In this study, the spectral range of 900–1700 nm was selected as the effective wavelength interval for model development, as it covers the near-infrared region rich in target information. The dataset was then randomly divided into a training set (80%) and a test set (20%) for model construction and performance evaluation.

2.3.2. Selection of Optimal Wavelength

Owing to the high dimensionality and substantial information redundancy in near-infrared (NIR) spectral data, directly applying full-spectrum modeling can lead to excessive model complexity, reduced computational efficiency, and diminished predictive performance due to the inclusion of irrelevant or interfering variables [26]. Additionally, in the development of portable detection systems, full-spectrum acquisition places greater demands on hardware design and energy consumption management [27]. Therefore, performing effective feature wavelength selection helps eliminate redundant variables and retain those highly relevant to target components (such as moisture, fat, and protein), thereby enhancing modeling accuracy, simplifying model structure, and providing technical support for the implementation of rapid detection systems.

This study employed three commonly used variable selection methods—CARS (competitive adaptive reweighted sampling), SPA (successive projections algorithm), and UVE (uninformative variable elimination)—to extract features from the preprocessed spectral data and compare their respective impacts on the modeling performance of major soybean components.

CARS: This method is inspired by Darwinian evolutionary principles, employing Monte Carlo sampling combined with partial least squares regression (PLSR) coefficients to dynamically select spectral variables. It gradually eliminates wavelengths with small regression coefficient weights through an exponential decay function, retaining feature bands that significantly contribute to the target variables. The optimal variable subset is determined by minimizing the root mean square error of cross-validation (RMSECV). While reducing data dimensionality, the CARS method preserves wavelength information sensitive to the target variables, making it well-suited for dimensionality reduction modeling of high-dimensional spectral datasets [28].

SPA: This algorithm uses vector projections to avoid selecting highly collinear wavelengths, thereby enhancing model stability and interpretability. Starting from an initial wavelength, SPA iteratively adds the variable whose projection onto the subspace orthogonal to the already selected variables is largest, until the preset number of variables is reached. Although SPA performs well when selecting a small number of variables, it has relatively high computational demands; therefore, in this study, it was primarily used to extract key wavelength feature sets to simplify the model structure [29].

UVE: This method calculates the stability index of each variable based on partial least squares regression (PLSR) modeling to identify and eliminate redundant wavelengths that have insignificant or highly fluctuating effects on the model. UVE can significantly improve model robustness and is suitable for eliminating invalid or noisy variables from the spectra; in this study, it was employed to optimize the full-spectrum modeling results [26].

By comparing the modeling performance of the three feature selection methods across different models—partial least squares regression (PLSR), random forest (RF), and support vector regression (SVR)—the optimal set of feature wavelengths was ultimately selected for constructing an efficient and accurate prediction model for major soybean components.
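To make the projection step of SPA concrete, a minimal sketch of the selection loop is given below. It is a simplified illustration under stated assumptions rather than the implementation used in this study: the function name `spa_select` is hypothetical, the columns are mean-centred first, and the starting wavelength defaults to the column with the largest norm, whereas a full SPA would normally choose the starting wavelength and the number of variables by validation error.

```python
import numpy as np

def spa_select(X, n_select, start=None):
    """Return indices of n_select minimally collinear columns of X."""
    Xp = np.asarray(X, dtype=float)
    Xp = Xp - Xp.mean(axis=0)                 # mean-centre each wavelength
    n_samples, n_vars = Xp.shape
    if not 1 <= n_select <= min(n_samples - 1, n_vars):
        raise ValueError("n_select is too large for this data set")
    if start is None:                          # assumption: start at the largest-norm column
        start = int(np.argmax(np.linalg.norm(Xp, axis=0)))
    selected = [start]
    for _ in range(n_select - 1):
        xk = Xp[:, selected[-1]]
        # Project every column onto the subspace orthogonal to the last selection.
        Xp = Xp - np.outer(xk, xk @ Xp) / (xk @ xk)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0                 # never pick the same column twice
        selected.append(int(np.argmax(norms)))
    return selected
```

Because each projection removes the component shared with the previously chosen wavelength, a column that is nearly a multiple of a selected one is left with a norm close to zero and is therefore not chosen.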

2.3.3. Development of Quantitative Prediction Models

This study employed one linear model—partial least squares regression (PLSR)—and two nonlinear machine learning models—random forest (RF) and support vector regression (SVR)—to analyze the preprocessed near-infrared (NIR) spectral data (900–1700 nm) of soybean samples, aiming to predict the contents of three major components: moisture, fat, and protein.

Partial least squares regression (PLSR) is one of the most widely used linear modeling approaches in near-infrared (NIR) spectral data analyses and is particularly suitable for problems involving high collinearity and multivariate coexistence between the independent variables (X) and the dependent variables (Y) [30]. This method constructs a set of latent variables (LVs) that simultaneously maximize the covariance between X and Y, enabling the effective dimensionality reduction of high-dimensional spectral data while preserving the structural information most relevant to the target components. To avoid model overfitting or underfitting, this study employed six-fold cross-validation to determine the optimal number of LVs. Specifically, the dataset was randomly divided into six subsets; in each iteration, five subsets were used for training and one for validation, rotating across all combinations. The optimal number of LVs was determined based on the minimum root mean square error of cross-validation (RMSECV), and this optimal configuration was used to construct the final model.

Random forest (RF) is a nonlinear regression model based on the ensemble learning framework, composed of multiple decision trees. It employs bootstrap sampling to repeatedly draw subsamples from the original training set, building multiple tree models whose averaged predictions serve as the final output, thereby significantly reducing the model’s sensitivity to individual training data and enhancing its generalization ability [31]. In this study, the main tuning parameters for the RF model were the number of trees (n_tree) and the number of features considered at each node split (mtry). The n_tree parameter was initially set to 100 and incremented in steps of 50 up to 2000; the optimal number of trees was selected based on the minimum root mean square error (RMSE) on the validation set. The mtry parameter was set to its default value, which was one-third of the total number of features. The RF model demonstrated strong robustness in handling high-dimensional features and nonlinear relationships in near-infrared spectral data.
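A minimal sketch of the tree-number search described above is shown below, using scikit-learn's `RandomForestRegressor`. The function name `tune_n_trees`, the random seed, and the use of a separate validation split passed in by the caller are assumptions of this example. The one-third feature fraction is passed explicitly through `max_features` so that the sketch does not depend on any library default.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tune_n_trees(X_train, y_train, X_val, y_val,
                 tree_grid=range(100, 2001, 50), seed=0):
    """Return (best model, best number of trees, validation RMSE)."""
    y_val = np.ravel(y_val)
    best_model, best_n, best_rmse = None, None, np.inf
    for n_trees in tree_grid:
        model = RandomForestRegressor(
            n_estimators=n_trees,
            max_features=1 / 3,      # one-third of the features at each split
            random_state=seed,
        )
        model.fit(X_train, np.ravel(y_train))
        rmse = float(np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2)))
        if rmse < best_rmse:         # keep the smallest forest that attains the minimum
            best_model, best_n, best_rmse = model, n_trees, rmse
    return best_model, best_n, best_rmse
```

The default grid reproduces the stated range of 100 to 2000 trees in steps of 50; a shorter grid can be supplied for quick experiments.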

Support vector regression (SVR) is an extension of the support vector machine (SVM) framework designed for regression tasks, based on the principle of structural risk minimization. It constructs an optimal hyperplane in the feature space to enable the prediction of continuous variables. SVR demonstrates excellent fitting ability and generalization performance, particularly in high-dimensional, small-sample scenarios [32]. In this study, the radial basis function (RBF) was employed as the kernel function to construct nonlinear mappings and capture complex feature patterns in the spectral data [33]. During SVR model training, grid search combined with five-fold cross-validation was employed to optimize two key parameters: the kernel parameter γ (which controls the width of the nonlinear mapping) and the penalty factor C (which balances the fitting error and model complexity). The parameter search ranges were set as γ ∈ [0.5, 100] and C ∈ [0.5, 100], with a step size of 20.5; the optimal parameter combination was selected based on the minimum validation root mean square error (RMSE).
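The grid search with five-fold cross-validation can be sketched with scikit-learn's `GridSearchCV` as below. The function name `tune_svr` is hypothetical. The default grid takes the reported range of 0.5 to 100 and the stated step size at face value, which is an assumption of this example; a different spacing can be supplied through the `grid` argument.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def tune_svr(X, y, grid=None):
    """Return (best RBF-SVR model, best parameters, cross-validated RMSE)."""
    if grid is None:
        # Assumption: linear grid from 0.5 to 100 with the step stated in the text.
        grid = np.arange(0.5, 100.0, 20.5)
    values = [float(v) for v in grid]
    search = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid={"C": values, "gamma": values},
        scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
        cv=5,
    )
    search.fit(X, np.ravel(y))
    return search.best_estimator_, search.best_params_, -search.best_score_
```

With `refit` left at its default, the returned estimator has already been retrained on all supplied data using the best (C, γ) pair.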

By synthesizing the modeling performance of the three approaches, this study evaluated their differences in predicting the moisture, fat, and protein contents in soybean samples, thereby identifying the optimal modeling strategy to support subsequent rapid and accurate measurement and modeling applications for soybean components.

2.4. Evaluation Metrics

After completing spectral preprocessing and modeling, this study employed five commonly used metrics to quantitatively evaluate the performance of various modeling methods in predicting the soybean moisture, fat, and protein contents. The correlation coefficient (Rc) of the calibration set is used to assess the model’s ability to capture the true variation among the samples. A value closer to 1 indicates a better fit. The root mean square error of calibration (RMSEC) represents the average prediction error on the calibration samples, where a smaller value suggests better model fitting performance. The prediction set correlation coefficient (Rp) is used to evaluate the generalization ability on the independent prediction set, reflecting the linear correlation between the predicted and actual values, calculated in the same way as Rc. The root mean square error of prediction (RMSEP) represents the actual average error on the prediction samples and serves as a key indicator of the model’s extrapolation capability. The residual predictive deviation (RPD) is defined as the ratio of the standard deviation of the reference values in the prediction set to the RMSEP, used to assess the model’s predictive ability; an RPD value greater than 2 generally indicates strong predictive performance. The calculation formulas for these metrics are provided in Equations (2)–(5).

\begin{matrix} R_{c} = \frac{\sum_{i=1}^{n} (y_{i,actual} - \bar{y}_{actual}) (y_{i,predicted} - \bar{y}_{predicted})}{\sqrt{\sum_{i=1}^{n} (y_{i,actual} - \bar{y}_{actual})^{2}} \times \sqrt{\sum_{i=1}^{n} (y_{i,predicted} - \bar{y}_{predicted})^{2}}} \end{matrix}

(2)

\begin{matrix} RMSEC = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i,actual} - y_{i,predicted})^{2}} \end{matrix}

(3)

\begin{matrix} RMSEP = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_{i,actual} - y_{i,predicted})^{2}} \end{matrix}

(4)

\begin{matrix} RPD = \frac{SD_{reference}}{RMSEP} \end{matrix}

(5)

Here, n and m denote the numbers of samples in the calibration and prediction sets, respectively; y_{i,actual} and y_{i,predicted} are the measured and predicted values of sample i; and \bar{y}_{actual} and \bar{y}_{predicted} are the corresponding means.
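For reference, the metrics in Equations (2)–(5) can be computed with a few lines of NumPy. The function names are hypothetical, and the use of the sample standard deviation (ddof = 1) for the reference values in the RPD is an assumption, since the text does not specify which estimator was used.

```python
import numpy as np

def correlation_coefficient(y_actual, y_predicted):
    """Equation (2): Pearson correlation between measured and predicted values."""
    return float(np.corrcoef(np.ravel(y_actual), np.ravel(y_predicted))[0, 1])

def rmse(y_actual, y_predicted):
    """Equations (3) and (4): RMSEC on calibration data, RMSEP on prediction data."""
    diff = np.ravel(y_actual).astype(float) - np.ravel(y_predicted).astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rpd(y_actual, y_predicted):
    """Equation (5): SD of the reference values divided by the RMSEP."""
    sd = np.std(np.ravel(y_actual).astype(float), ddof=1)  # assumption: sample SD
    return float(sd / rmse(y_actual, y_predicted))
```

Applying `correlation_coefficient` and `rmse` to the calibration set gives Rc and RMSEC, and applying all three functions to the prediction set gives Rp, RMSEP, and RPD.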

2.5. Software

This study employed Python 3.7 as the programming language, with model construction and performance evaluation conducted using the scikit-learn library. Data processing and visualization were performed using libraries such as NumPy 1.21.6, Pandas 1.3.5, and Matplotlib 3.5.3.

3. Results

3.1. Descriptive Statistics

To establish near-infrared (NIR) quantitative prediction models for soybean crude fat, moisture, and protein contents, the SPXY (spectral–physicochemical value co-distance) algorithm was employed to partition the 360 spectra, thereby enhancing sample representativeness and model stability. Based on the physicochemical measurements (Y variables) and corresponding near-infrared spectra (X variables), the dataset was divided at a 4:1 ratio into a calibration set (288 samples) and a prediction set (72 samples).
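The SPXY partition can be sketched as follows. This is an illustrative implementation of the general SPXY idea (a Kennard–Stone style selection on a joint distance built from normalized spectral and reference-value distances), not the authors' code; the function names are hypothetical, and the sketch assumes that the reference values are not all identical.

```python
import numpy as np

def _pairwise_distances(A):
    """Euclidean distance matrix between the rows of A."""
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.sqrt(np.clip(d2, 0.0, None))

def spxy_split(X, y, n_calibration):
    """Return (calibration indices, prediction indices)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx = _pairwise_distances(X)
    dy = _pairwise_distances(y)
    d = dx / dx.max() + dy / dy.max()        # joint x-y distance
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_calibration:
        # For each candidate, distance to its nearest already selected sample.
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)
```

For the 4:1 split described above, the call would be `spxy_split(X, y, 288)`, leaving the remaining 72 spectra as the prediction set.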

To assess the sample distribution, descriptive statistical analyses were conducted on three components across both datasets, including the mean, standard deviation (SD), maximum, minimum, and range. The statistical results are presented in Table 1. The results indicate that all indicators display wide value ranges in both datasets, which helps improve the model’s generalization ability and robustness. The contents of all components are expressed as percentages (%), and the data were provided by the Grain and Oil Testing Center of Jilin Province. The moisture content in soybeans was determined using the oven-drying method [4], while the crude fat and protein contents were measured using the Soxhlet extraction method and the Kjeldahl method, respectively [5,6].

3.2. Selection of the Optimal Spectral Preprocessing Method

Figure 2 illustrates the average absorbance spectra (log(1/R)) of soybean samples in the wavelength range of 900–1700 nm. The differently colored curves represent the raw reflectance spectra of individual soybean samples, with color variations used to distinguish between samples. The overall spectral curve exhibits multiple peaks and valleys, indicating that different molecular groups in the samples display distinct absorption characteristics toward near-infrared (NIR) radiation. Specifically, the absorption bands at 950–970 nm, 1140–1160 nm, and 1400–1440 nm are closely associated with the stretching vibrations and overtone absorptions of O–H groups in water. The N–H and C–N groups in proteins exhibit strong absorptions near 1180–1210 nm, 1320–1350 nm, and 1550–1600 nm. The C–H groups in fats mainly absorb in the regions of 1200–1250 nm and 1680–1700 nm. Notably, a marked increase in absorption is observed around 900–970 nm, consistent with the second overtone characteristic absorption regions of moisture and protein.

To improve model accuracy and robustness, the raw spectra were preprocessed to eliminate background interference and scattering effects and to enhance spectral feature resolution. First, standard normal variate (SNV) transformation and multiplicative scatter correction (MSC) were applied to correct scattering differences and baseline drift among the samples. Next, the Savitzky–Golay (SG) smoothing algorithm was applied to reduce noise, using a window width of 5 points and a polynomial order of two. Subsequently, first-order (D₁) and second-order (D₂) derivatives were calculated to enhance weak absorption bands and improve spectral resolution.

Since derivative operations can amplify noise, this study limited the derivative order to two to avoid excessive loss of spectral information. Figure 3 compares the near-infrared (NIR) spectra of soybean samples under the different preprocessing strategies, showing that preprocessing improves signal quality and enhances spectral details, thereby laying the foundation for subsequent feature selection and modeling. Each curve corresponds to an individual sample, with colors used solely to differentiate samples and aid in the visual comparison of spectral variations.

3.3. Feature Variable Extraction

3.3.1. CARS

To further identify the key wavelengths highly correlated with soybean moisture, protein, and fat contents, this study employed the competitive adaptive reweighted sampling (CARS) algorithm for feature variable selection on near-infrared spectral data. Figure 4 shows the CARS variable selection results across the full spectral range after preprocessing, with red dots indicating the final retained feature wavelengths.

The results showed that for moisture prediction, important bands were mainly concentrated in the characteristic O–H absorption regions at 950–970 nm, 1140–1160 nm, and 1400–1440 nm; for protein prediction, key bands included 1180–1210 nm, 1320–1350 nm, and 1550–1600 nm, corresponding to the N–H and C–N functional groups; for fat prediction, the C–H stretching and bending vibration absorption bands (1200–1250 nm, 1680–1700 nm) were prominently retained.

Compared with VIP selection, CARS not only effectively reduced the number of wavelengths but also enhanced modeling specificity and accuracy. In summary, CARS-based feature wavelength selection provided an information-rich, low-redundancy optimal variable subset for soybean multi-index prediction models, laying a solid foundation for subsequent multi-model optimization and the application of portable near-infrared detection devices.

3.3.2. SPA

To identify the feature wavelengths closely associated with soybean moisture, protein, and fat contents, this study employed the successive projections algorithm (SPA) for feature selection on near-infrared spectral data. SPA is a variable selection method based on vector projection principles that progressively eliminates redundant information and effectively reduces multicollinearity among variables.

As shown in Figure 5, in moisture prediction, the main wavelengths selected by SPA were distributed at 930–950 nm, 1140–1160 nm, and 1400–1440 nm, corresponding to the characteristic absorption of O–H groups. In protein prediction, the key wavelengths were concentrated at 1180–1220 nm, 1320–1350 nm, and 1550–1600 nm, reflecting the characteristic signals of the N–H and C–N functional groups. For fat, the characteristic wavelengths included 1200–1250 nm and 1680–1700 nm, corresponding to the C–H stretching vibration regions.

Compared with full-spectrum modeling, SPA significantly reduced the number of input variables, simplified the model structure, and maximally retained the spectral information closely related to the prediction targets. This not only enhanced the robustness and interpretability of the models but also provided a more efficient variable input for subsequent rapid near-infrared detection of multiple soybean components, laying an important technical foundation for portable and practical applications.

3.3.3. UVE

To eliminate redundant bands and improve model stability, this study employed the uninformative variable elimination (UVE) method for spectral variable selection. UVE calculates the stability index of each variable, retaining wavelengths that significantly contribute to modeling while eliminating ineffective or highly variable interference.

As shown in Figure 6, the key wavelengths selected by UVE based on first-derivative preprocessing were mainly concentrated in the regions of 1150–1230 nm, 1340–1360 nm, and 1440–1570 nm, corresponding to the C–H and O–H absorption features of fat and moisture, respectively. The red-marked wavelengths were mostly located at absorption peaks or edge regions, indicating their significant role in predictive modeling.

The UVE method effectively reduces the number of variables, enhances model performance, and facilitates subsequent applications in rapid detection systems.

3.4. Model Evaluation

3.4.1. Estimation of Moisture Content

To investigate the applicability of different feature selection methods and modeling approaches for predicting the soybean moisture content, this study performed a comparative analysis using three feature selection algorithms—SPA, CARS, and UVE—in combination with three modeling methods—PLSR, SVR, and RF. Model performance was evaluated using the prediction set correlation coefficient (Rp), root mean square error of prediction (RMSEP), and residual predictive deviation (RPD), with the results summarized in Table 2.

Among all combinations, the SPA–SNV + D₁–RF model demonstrated the best performance, achieving an Rp of 0.995, an RMSEP of 0.360%, and an RPD of 14.7, significantly outperforming the other combinations. This result indicates that combining SNV with first-order derivative processing effectively enhances the spectral response to moisture content, while the nonlinear learning capacity of the RF model maximizes the predictive accuracy.

In comparison, the SPA-PLSR and SPA-SVR models using the same SNV + D₁ preprocessing achieved RPD values of 7.1 and 5.0, respectively. These results demonstrate promising modeling potential; however, there remains room for improvement relative to the current best-performing model. This suggests that under this feature variable combination, both the linear PLSR model and SVR fail to fully capture the nonlinear structures among the variables. Although models using the CARS and UVE methods also achieved good results under certain preprocessing strategies (e.g., CARS–RF achieved an RPD of 14.3 with SNV + D₁), none surpassed the overall performance of the SPA–RF model.

In summary, the feature variables selected using the SPA method, combined with SNV + D₁ preprocessing and the RF modeling strategy, demonstrated the best performance in moisture content prediction and are recommended as the optimal approach for near-infrared (NIR) quantitative modeling of soybean moisture.

3.4.2. Estimation of Protein Content

To evaluate the performance of different feature selection methods and modeling approaches for predicting the soybean protein content, this study compared the calibration and prediction performance of models developed using three feature selection strategies—SPA, CARS, and UVE—in combination with three modeling methods—PLSR, SVR, and RF—as summarized in Table 3. Overall, the CARS–RF model demonstrated strong predictive ability under most preprocessing methods, performing best when using SNV combined with second-order derivative preprocessing (SNV + D₂), achieving an Rp of 0.971, the lowest RMSEP (0.547%), and the highest RPD (5.9) among all models. These results indicate that the wavelengths selected by CARS are highly correlated with protein content variations and that the RF model effectively captures their nonlinear characteristics. The RPD value exceeding 5.0 indicates excellent predictive performance, making the model suitable for quantitative analysis.

In comparison, the SVR model exhibited moderate performance across most combinations; although it achieved relatively high Rp values, some models showed high RMSEP and substantial RPD fluctuations, making its stability slightly lower than that of the RF and PLSR models.

In summary, the strategy combining CARS feature selection with RF modeling demonstrated the best performance for predicting the soybean protein content, outperforming comparable SPA- and UVE-based model combinations and proving particularly suitable for applications requiring high-precision modeling. Meanwhile, the PLSR model combined with UVE also exhibited good stability and can serve as an alternative modeling approach offering stronger interpretability.

3.4.3. Estimation of Fat Content

To evaluate the performance of different feature selection methods and modeling approaches in predicting the soybean fat content, this study compared the combinations of three variable selection methods (SPA, CARS, and UVE) with three modeling algorithms (PLSR, SVR, and RF), as summarized in Table 4.

Overall, the CARS–RF model demonstrated the best performance, particularly achieving an Rp = 0.993 and RPD = 12.0 under SG + D₂ preprocessing, showing outstanding modeling accuracy and generalization capability, making it suitable for high-precision quantitative analysis of the fat content. Under the SPA method, the best combination was SG + D₂ with the RF model, yielding an RPD = 11.9, which provided good predictive performance but was slightly lower than that of the CARS–RF model.

In contrast, the UVE method performed poorly across all combinations: its best result, the UVE–RF model under MSC + D₁, reached an RPD of only 3.8, and most other combinations yielded RPD values below 3, indicating relatively limited variable selection capability for fat.

From the model comparison perspective, RF consistently outperformed SVR and PLSR, exhibiting stronger nonlinear modeling ability and stability. Therefore, combining CARS feature selection with the RF model offers the greatest advantage for soybean fat content prediction and is recommended for future modeling applications.

Although some studies have employed near-infrared spectroscopy (NIRS) to develop predictive models for the nutritional components of crops and fruit trees, research on the rapid and accurate measurement of major quality components (moisture, protein, fat) in high-protein crops such as soybeans remains relatively limited, particularly regarding improvements in model accuracy, generalizability, and on-site adaptability. Therefore, this study combined multiple spectral preprocessing and variable selection methods to construct and compare different modeling strategies for predicting soybean quality components, aiming to provide technical support for establishing a more stable and efficient NIR-based intelligent evaluation system.

For spectral preprocessing, standard normal variate (SNV) and multiplicative scatter correction (MSC) were employed to correct scattering effects, while Savitzky–Golay derivative processing (D₁, D₂) was applied to enhance spectral resolution, resulting in eight combined strategies (e.g., SNV + D₁, MSC + D₂). The results showed that, compared to single methods, combined approaches such as SNV + D₁ and SG + D₂ performed better in removing background noise and enhancing weak absorption signals, significantly improving model robustness and sensitivity.
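The combined preprocessing strategies named above can be sketched as follows. This is a minimal illustration using NumPy and SciPy; the Savitzky–Golay window length (11) and polynomial order (2) are assumed values, since the settings used in the study are not given in this section, and the helper names `snv`, `msc`, and `sg_derivative` are hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, 1)  # row ≈ intercept + slope * ref
        corrected[i] = (row - intercept) / slope
    return corrected

def sg_derivative(spectra, order, window_length=11, polyorder=2):
    """Savitzky-Golay smoothed derivative along the wavelength axis (assumed settings)."""
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=order, axis=1)

# Synthetic spectra: one band with per-sample offset and gain (scatter-like effects)
rng = np.random.default_rng(1)
wavelengths = np.linspace(900, 1700, 200)
band = np.exp(-((wavelengths - 1450) / 60.0) ** 2)
gain = rng.uniform(0.8, 1.2, size=(12, 1))
offset = rng.uniform(-0.1, 0.1, size=(12, 1))
raw = offset + gain * band

snv_d1 = sg_derivative(snv(raw), order=1)  # "SNV + D1"
msc_d2 = sg_derivative(msc(raw), order=2)  # "MSC + D2"
```

In this idealized case, where spectra differ only by an additive offset and a multiplicative gain, both SNV and MSC map all samples onto a single curve, which is the scatter-correction effect the combined strategies rely on before differentiation.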

To reduce spectral redundancy, improve model efficiency, and enhance generalization ability, this study employed three feature selection methods: SPA, CARS, and UVE. Specifically, SPA minimizes collinearity to obtain stable variable subsets, CARS guides sampling based on PLSR regression coefficients, and UVE evaluates wavelength contributions based on variable stability. The results combined with random forest (RF) modeling are shown in Table 5.
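The collinearity-minimizing idea behind SPA can be illustrated with a minimal sketch of its projection step. It is a simplified version under stated assumptions: the starting column and the number of selected variables are fixed by the caller, whereas a full SPA implementation typically evaluates candidate starts and subset sizes against a validation error. The function name `spa_select` is hypothetical.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Successive projections: pick columns least collinear with those already chosen."""
    Xp = np.asarray(X, dtype=float).copy()
    Xp -= Xp.mean(axis=0)  # column-center before projecting
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]].copy()
        # Project every column onto the subspace orthogonal to the last selected one
        Xp -= np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never re-select a chosen column
        selected.append(int(np.argmax(norms)))
    return selected

# Synthetic demo: columns 1 and 3 are (nearly) collinear with column 0
rng = np.random.default_rng(7)
a, b, c = rng.normal(size=(3, 50))
X_demo = np.column_stack([a, 2.0 * a, b, a + 0.01 * c])
chosen = spa_select(X_demo, n_select=2, start=0)
print(chosen)
```

Starting from column 0, the projection removes everything the near-duplicate columns share with it, so the column carrying independent information is selected next.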

In moisture prediction, the SPA + SNV + D₁ + RF combination achieved the highest accuracy, with an RPD of 14.7, an Rp of 0.995, and an RMSEP of only 0.360%, demonstrating outstanding modeling capability and reliability. The CARS + SNV + D₁ + RF combination also achieved an RPD of 14.3, indicating that both methods effectively extracted key variables from the moisture-related absorption bands.

In protein modeling, the CARS + SNV + D₁ + RF combination achieved an RPD of 5.9, representing the best combination for protein prediction and outperforming the best SPA-based (RPD = 4.8) and UVE-based (RPD = 5.1) models. This indicates that the protein component’s response to near-infrared (NIR) signals exhibits certain nonlinear characteristics, making it suitable for integrated modeling that leverages the iterative selection mechanism of CARS and the nonlinear learning advantages of the RF model.

In fat modeling, both the CARS and SPA methods combined with the SG + D₂ + RF model achieved strong performance, with RPD values of 12.0 and 11.9, respectively; the scatter points in the fitting plots were mostly aligned along the ideal line, showing minimal error fluctuations. In contrast, the best UVE-based model achieved an RPD of only 3.8 for fat, indicating limited representativeness and contribution of its selected variables.

Combined with the scatter plot results of predicted versus measured values (Figure 7), it is further evident that the feature variables selected by CARS and SPA exhibit high correlations in fitting all three components, with tightly clustered fitting points and no obvious outliers. In contrast, the UVE-based models exhibited greater fitting dispersion and relatively limited improvements in protein and fat modeling performance. The shaded area around the regression line represents the confidence interval, indicating the range within which the true regression line is expected to lie with a given confidence level. A narrower confidence interval signifies higher precision and reliability of the model, while a wider interval indicates greater uncertainty.
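The confidence interval around a fitted regression line can be sketched as follows. This is a minimal illustration assuming ordinary least squares on predicted versus measured values and a 95% level; the method and level actually used for Figure 7 are not stated in this section, and the function name `regression_confidence_band` is hypothetical.

```python
import numpy as np
from scipy import stats

def regression_confidence_band(x, y, x_grid, level=0.95):
    """Fit y = a + b*x by least squares and return (fit, lower, upper) on x_grid."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_grid = np.asarray(x_grid, dtype=float)
    n = x.size
    b, a = np.polyfit(x, y, 1)  # slope, intercept
    resid = y - (a + b * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard error
    sxx = np.sum((x - x.mean()) ** 2)
    t_crit = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    # Half-width of the confidence interval for the mean response at each grid point
    half_width = t_crit * s * np.sqrt(1.0 / n + (x_grid - x.mean()) ** 2 / sxx)
    fit = a + b * x_grid
    return fit, fit - half_width, fit + half_width

# Synthetic illustration only (not data from this study)
rng = np.random.default_rng(3)
measured = np.linspace(6.0, 25.0, 40)
predicted = measured + rng.normal(0.0, 0.4, size=measured.size)
grid = np.linspace(measured.min(), measured.max(), 50)
fit, lower, upper = regression_confidence_band(measured, predicted, grid)
```

The band is narrowest near the mean of the measured values and widens toward the extremes, and it shrinks overall as the residual scatter decreases, which is why tightly clustered points produce a narrow shaded area.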

By leveraging enhanced preprocessing techniques such as SNV + D₁ and SG + D₂, combined with the SPA or CARS variable selection mechanisms and the RF modeling framework, high-precision prediction can be achieved for moisture, protein, and fat indicators. This approach offers excellent generalization ability and portability, making it suitable for practical applications in rapid screening, quality evaluation, and industrial grading of soybeans and other agricultural products while also providing a theoretical foundation and technical support for the development of on-site, portable near-infrared (NIR) detection systems.

4. Discussion

This study, based on 30 soybean samples, developed a systematic multi-path modeling framework that integrates nine spectral preprocessing methods, three feature selection algorithms, and three regression models (PLSR, SVR, RF) for the rapid and non-destructive prediction of multiple components, including moisture, fat, and protein. The experimental results showed that the nonlinear RF model outperformed the traditional linear PLSR model in most prediction tasks, demonstrating superior robustness and generalization ability, whereas SVR was less consistent across combinations. This aligns with previous studies showing the advantages of combining NIR with machine learning in modeling complex samples such as fruits, dairy products, and tea, further validating the effectiveness of multi-algorithm integration and feature selection optimization in improving model performance.

Specifically, different combinations of preprocessing and variable selection had significant impacts on model accuracy: with RF modeling, SNV + D₁ combined with CARS performed best in protein prediction, SNV + D₁ combined with SPA performed best in moisture prediction, and SG + D₂ combined with CARS performed best in fat prediction (Table 5). This suggests that in complex multi-component and multi-target scenarios, a single optimization approach is insufficient, and flexible strategies tailored to specific target variables are required. This finding not only fills a research gap in soybean NIR modeling but also provides a new paradigm for promoting the standardization of spectral modeling across multi-component and multi-scenario applications.

In terms of research significance, the integrated optimization framework proposed in this study offers important support for the algorithmic design of future portable, real-time NIR detection devices, contributing to the advancement of rapid, accurate, and intelligent high-throughput detection technologies in agriculture. However, this study has certain limitations, such as a relatively small sample size and concentrated sample sources, which may limit the model’s generalization ability. Future research is recommended to expand the sample library across a wider range, including soybean samples from different origins, varieties, and batches, and to explore more advanced modeling approaches such as deep learning to further improve prediction accuracy and applicability. In addition, integrating chemometrics with explainable AI approaches may help uncover deeper relationships between spectral signals and soybean components.

In summary, this study not only achieved rapid and high-precision modeling of multiple soybean indicators but also provided a transferable methodological framework and future research directions for agricultural spectral analysis, holding significant academic value and application prospects.

5. Conclusions

This study established a systematic near-infrared spectroscopy (NIRS) modeling framework for the rapid and accurate prediction of three major soybean components: moisture, protein, and fat. By comparing various spectral preprocessing methods, variable selection algorithms, and modeling strategies, the results showed that the SPA–RF model combined with SNV + D₁ preprocessing achieved the best performance for moisture prediction (RPD = 14.7), while CARS–RF models coupled with SNV + D₁ and SG + D₂ preprocessing excelled in protein and fat predictions (RPDs of 5.9 and 12.0, respectively). Overall, the random forest models outperformed the traditional linear methods, demonstrating superior nonlinear fitting ability and robustness.

The innovation of this study lies in developing and validating a comprehensive optimization workflow covering preprocessing, variable selection, and modeling, which significantly improves the accuracy and stability of multi-component simultaneous prediction. These findings provide theoretical and technical support for the development of portable NIRS devices and rapid soybean quality grading.

Future work will expand sample sources and sizes to enhance model generalizability. Additionally, the integration of deep learning and explainable artificial intelligence methods will be explored to further boost the predictive performance and transparency, thereby promoting the application of intelligent agricultural product quality detection technologies.

Author Contributions

C.L.: Conceptualization, Methodology, Funding acquisition. J.Y.: Data curation, Writing—original draft. S.W.: Visualization, Investigation. Q.Z.: Data curation, Supervision. Q.S.: Software, Validation. Y.X.: Writing—review and editing, Project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Scientific Research Foundation of Jilin Province Department of Education (JJKH20240421KJ).

Data Availability Statement

The authors confirm that the original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Workflow of near-infrared spectral acquisition and data processing for soybean samples.

Figure 2. Raw spectral plot.

Figure 3. Spectral plots of soybean samples under different preprocessing methods.

Figure 4. Identification of key wavelengths in soybean component modeling using the CARS method.

Figure 5. Identification of key wavelengths in soybean component modeling using the SPA method.

Figure 6. Identification of key wavelengths in soybean component modeling using the UVE method.

Figure 7. Comparison between measured and predicted values for the best models of three quality indicators.

Table 1. Descriptive statistics of physicochemical properties (protein, moisture, fat) of soybean samples.

Parameter	Calibration Set					Validation Set
	Mean	SD	Min	Max	Range	Mean	SD	Min	Max	Range
Moisture	15.506	5.2176	6.59	25	18.41	16.003	5.1745	6.59	25	18.41
Protein	39.159	3.2829	31.5	45.4	13.9	39.455	2.6735	31.5	45.4	13.9
Fat	21.65	1.659	14.5	23.8	9.29	21.524	1.8715	14.5	23.8	9.29

Table 2. Performance comparison of moisture content prediction models under different combinations.

Feature Selection	Pre-Processing	Model	Calibration Set		Prediction Set
			Rc	RMSEC (%)	Rp	RMSEP (%)	RPD
SPA	SG + D₁	PLSR	0.984	0.642	0.979	0.772	6.9
		SVR	0.981	0.711	0.985	0.655	8.1
		RF	0.999	0.187	0.995	0.386	13.7
	SG + D₂	PLSR	0.976	0.802	0.975	0.830	6.4
		SVR	0.975	0.819	0.977	0.809	6.6
		RF	0.997	0.282	0.982	0.716	7.4
	MSC + D₁	PLSR	0.985	0.632	0.980	0.751	7.1
		SVR	0.975	0.811	0.970	0.912	5.8
		RF	0.998	0.241	0.991	0.495	10.7
	MSC + D₂	PLSR	0.976	0.793	0.977	0.805	6.6
		SVR	0.982	0.688	0.982	0.712	7.4
		RF	0.998	0.219	0.985	0.642	8.3
	SNV + D₁	PLSR	0.985	0.636	0.980	0.750	7.1
		SVR	0.971	0.881	0.960	1.057	5.0
		RF	0.999	0.181	0.995	0.360	14.7
	SNV + D₂	PLSR	0.974	0.825	0.976	0.821	6.5
		SVR	0.981	0.708	0.980	0.748	7.1
		RF	0.998	0.226	0.986	0.631	8.4
CARS	SG + D₁	PLSR	0.984	0.649	0.977	0.808	6.6
		SVR	0.985	0.630	0.988	0.589	9.0
		RF	0.998	0.202	0.992	0.483	11.0
	SG + D₂	PLSR	0.984	0.659	0.974	0.856	6.2
		SVR	0.980	0.725	0.975	0.840	6.3
		RF	0.997	0.297	0.984	0.676	7.8
	MSC + D₁	PLSR	0.984	0.649	0.980	0.742	7.1
		SVR	0.976	0.800	0.968	0.943	5.6
		RF	0.998	0.208	0.990	0.521	10.2
	MSC + D₂	PLSR	0.983	0.665	0.976	0.815	6.5
		SVR	0.981	0.709	0.979	0.766	6.9
		RF	0.998	0.246	0.976	0.820	6.5
	SNV + D₁	PLSR	0.985	0.633	0.979	0.767	6.9
		SVR	0.974	0.822	0.970	0.918	5.8
		RF	0.998	0.211	0.995	0.370	14.3
	SNV + D₂	PLSR	0.983	0.661	0.980	0.755	7.0
		SVR	0.976	0.791	0.961	1.043	5.1
		RF	0.997	0.268	0.977	0.800	6.6
UVE	SG + D₁	PLSR	0.985	0.630	0.979	0.765	6.9
		SVR	0.981	0.703	0.979	0.768	6.9
		RF	0.999	0.189	0.994	0.425	12.5
	SG + D₂	PLSR	0.985	0.632	0.973	0.875	6.1
		SVR	0.986	0.615	0.987	0.593	8.9
		RF	0.998	0.208	0.994	0.423	12.5
	MSC + D₁	PLSR	0.982	0.688	0.978	0.782	6.8
		SVR	0.982	0.692	0.981	0.726	7.3
		RF	0.998	0.235	0.994	0.425	12.5
	MSC + D₂	PLSR	0.964	0.977	0.957	1.105	4.8
		SVR	0.966	0.946	0.957	1.100	4.8
		RF	0.995	0.347	0.980	0.746	7.1
	SNV + D₁	PLSR	0.981	0.708	0.977	0.806	6.6
		SVR	0.981	0.700	0.981	0.724	7.3
		RF	0.998	0.220	0.994	0.425	12.5
	SNV + D₂	PLSR	0.977	0.774	0.973	0.867	6.1
		SVR	0.980	0.731	0.984	0.672	7.9
		RF	0.997	0.264	0.980	0.750	7.1

Table 3. Performance comparison of protein content prediction models under different combinations.

Feature Selection	Pre-Processing	Model	Calibration Set		Prediction Set
			Rc	RMSEC (%)	Rp	RMSEP (%)	RPD
SPA	SG + D₁	PLSR	0.955	0.668	0.949	0.731	4.4
		SVR	0.947	0.726	0.938	0.805	4.0
		RF	0.990	0.321	0.954	0.692	4.7
	SG + D₂	PLSR	0.942	0.757	0.929	0.861	3.7
		SVR	0.943	0.750	0.925	0.881	3.7
		RF	0.989	0.330	0.921	0.907	3.5
	MSC + D₁	PLSR	0.926	0.855	0.931	0.845	3.8
		SVR	0.737	1.609	0.785	1.493	2.2
		RF	0.994	0.252	0.957	0.671	4.8
	MSC + D₂	PLSR	0.943	0.752	0.927	0.868	3.7
		SVR	0.927	0.849	0.908	0.977	3.3
		RF	0.983	0.408	0.925	0.881	3.7
	SNV + D₁	PLSR	0.940	0.772	0.945	0.753	4.3
		SVR	0.907	0.957	0.898	1.031	3.1
		RF	0.993	0.259	0.945	0.755	4.3
	SNV + D₂	PLSR	0.934	0.807	0.926	0.876	3.7
		SVR	0.935	0.799	0.929	0.858	3.8
		RF	0.980	0.447	0.873	1.149	2.8
CARS	SG + D₁	PLSR	0.965	0.583	0.954	0.688	4.7
		SVR	0.956	0.661	0.949	0.725	4.4
		RF	0.988	0.345	0.913	0.951	3.4
	SG + D₂	PLSR	0.963	0.602	0.936	0.816	3.9
		SVR	0.962	0.611	0.947	0.742	4.3
		RF	0.989	0.334	0.925	0.881	3.7
	MSC + D₁	PLSR	0.960	0.631	0.949	0.728	4.4
		SVR	0.954	0.672	0.952	0.707	4.6
		RF	0.988	0.342	0.944	0.759	4.2
	MSC + D₂	PLSR	0.958	0.645	0.937	0.810	4.0
		SVR	0.958	0.643	0.947	0.738	4.4
		RF	0.989	0.329	0.950	0.723	4.5
	SNV + D₁	PLSR	0.960	0.631	0.951	0.712	4.5
		SVR	0.955	0.662	0.950	0.717	4.5
		RF	0.994	0.250	0.971	0.547	5.9
	SNV + D₂	PLSR	0.957	0.650	0.932	0.843	3.8
		SVR	0.962	0.610	0.951	0.716	4.5
		RF	0.991	0.290	0.924	0.886	3.6
UVE	SG + D₁	PLSR	0.966	0.582	0.950	0.722	4.5
		SVR	0.947	0.721	0.944	0.765	4.2
		RF	0.991	0.290	0.958	0.664	4.9
	SG + D₂	PLSR	0.960	0.624	0.940	0.789	4.1
		SVR	0.942	0.757	0.937	0.810	4.0
		RF	0.995	0.231	0.961	0.638	5.0
	MSC + D₁	PLSR	0.956	0.656	0.949	0.724	4.5
		SVR	0.920	0.890	0.919	0.917	3.5
		RF	0.991	0.289	0.957	0.669	4.8
	MSC + D₂	PLSR	0.950	0.701	0.938	0.804	4.0
		SVR	0.951	0.698	0.946	0.749	4.3
		RF	0.992	0.288	0.959	0.655	4.9
	SNV + D₁	PLSR	0.956	0.655	0.946	0.745	4.3
		SVR	0.941	0.762	0.929	0.856	3.8
		RF	0.992	0.284	0.955	0.681	4.7
	SNV + D₂	PLSR	0.948	0.716	0.929	0.860	3.7
		SVR	0.919	0.892	0.915	0.936	3.4
		RF	0.993	0.263	0.962	0.627	5.1

Table 4. Performance comparison of fat content prediction models under different combinations.

Feature Selection	Pre-Processing	Model	Calibration Set		Prediction Set
			Rc	RMSEC (%)	Rp	RMSEP (%)	RPD
SPA	SG + D₁	PLSR	0.982	0.694	0.977	0.801	6.6
		SVR	0.979	0.752	0.972	0.880	6.0
		RF	0.998	0.199	0.992	0.475	11.2
	SG + D₂	PLSR	0.961	1.008	0.958	1.089	4.9
		SVR	0.978	0.767	0.981	0.728	7.3
		RF	0.998	0.215	0.993	0.446	11.9
	MSC + D₁	PLSR	0.981	0.716	0.975	0.831	6.4
		SVR	0.979	0.738	0.972	0.885	6.0
		RF	0.999	0.173	0.989	0.544	9.7
	MSC + D₂	PLSR	0.975	0.817	0.969	0.941	5.6
		SVR	0.975	0.819	0.979	0.767	6.9
		RF	0.998	0.238	0.991	0.507	10.5
	SNV + D₁	PLSR	0.983	0.673	0.977	0.801	6.6
		SVR	0.979	0.745	0.973	0.870	6.1
		RF	0.999	0.147	0.987	0.594	8.9
	SNV + D₂	PLSR	0.969	0.900	0.969	0.934	5.7
		SVR	0.970	0.887	0.972	0.883	6.0
		RF	0.998	0.254	0.970	0.921	5.8
CARS	SG + D₁	PLSR	0.985	0.622	0.978	0.786	6.7
		SVR	0.976	0.788	0.966	0.983	5.4
		RF	0.998	0.216	0.992	0.482	11.0
	SG + D₂	PLSR	0.973	0.849	0.958	1.082	4.9
		SVR	0.977	0.785	0.968	0.945	5.6
		RF	0.998	0.248	0.993	0.444	12.0
	MSC + D₁	PLSR	0.982	0.680	0.977	0.799	6.6
		SVR	0.976	0.798	0.970	0.922	5.8
		RF	0.998	0.229	0.983	0.695	7.6
	MSC + D₂	PLSR	0.982	0.680	0.975	0.845	6.3
		SVR	0.978	0.765	0.979	0.771	6.9
		RF	0.998	0.226	0.983	0.689	7.7
	SNV + D₁	PLSR	0.981	0.717	0.973	0.865	6.1
		SVR	0.981	0.716	0.975	0.841	6.3
		RF	0.998	0.230	0.985	0.639	8.3
	SNV + D₂	PLSR	0.979	0.753	0.971	0.898	5.9
		SVR	0.978	0.768	0.978	0.784	6.8
		RF	0.998	0.208	0.978	0.789	6.7
UVE	SG + D₁	PLSR	0.557	3.419	0.611	3.305	1.6
		SVR	0.610	3.207	0.688	2.959	1.8
		RF	0.917	1.476	0.780	2.485	2.1
	SG + D₂	PLSR	0.548	3.454	0.624	3.250	1.6
		SVR	0.589	3.293	0.732	2.744	1.9
		RF	0.811	2.232	0.724	2.786	1.9
	MSC + D₁	PLSR	0.881	1.770	0.877	1.859	2.9
		SVR	0.912	1.521	0.880	1.834	2.9
		RF	0.991	0.495	0.930	1.401	3.8
	MSC + D₂	PLSR	0.724	2.700	0.694	2.930	1.8
		SVR	0.822	2.169	0.778	2.497	2.1
		RF	0.978	0.759	0.859	1.991	2.7
	SNV + D₁	PLSR	0.883	1.758	0.879	1.846	2.9
		SVR	0.914	1.507	0.883	1.815	2.9
		RF	0.984	0.644	0.929	1.417	3.7
	SNV + D₂	PLSR	0.584	3.314	0.510	3.709	1.4
		SVR	0.622	3.160	0.497	3.759	1.4
		RF	0.958	1.059	0.589	3.398	1.6

Table 5. Developed prediction models for soybean component contents (moisture, protein, fat).

Element	Feature Selection	Model	RPD
Moisture	SPA	SNV + D₁ + RF	14.7
	CARS	SNV + D₁ + RF	14.3
	UVE	SG + D₂ + RF	12.5
Protein	SPA	MSC + D₁ + RF	4.8
	CARS	SNV + D₁ + RF	5.9
	UVE	SNV + D₂ + RF	5.1
Fat	SPA	SG + D₂ + RF	11.9
	CARS	SG + D₂ + RF	12.0
	UVE	MSC + D₁ + RF	3.8

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
