Article

Analysis of Nutritional Content in Rice Seeds Based on Near-Infrared Spectroscopy

1
Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
2
University of Chinese Academy of Sciences, Beijing 101408, China
3
College of Opto-Electronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
*
Author to whom correspondence should be addressed.
Photonics 2025, 12(5), 481; https://doi.org/10.3390/photonics12050481
Submission received: 14 April 2025 / Revised: 2 May 2025 / Accepted: 4 May 2025 / Published: 14 May 2025

Abstract

The nutritional quality of rice seeds is mainly determined by the content of key components such as protein, fat, and starch. Traditional chemical detection methods are time-consuming, labor-intensive, inefficient, and harmful to the environment. To overcome these limitations, this study developed a non-destructive detection method that combines near-infrared spectroscopy (1000–2200 nm), which exploits the anharmonic vibration of molecular bonds, with linear regression modeling to analyze multiple components efficiently and simultaneously. By combining nutrient data from chemical analysis with spectroscopic measurements, we established a comprehensive rice seed composition dataset. After preprocessing with denoising (Gaussian, Savitzky–Golay, or moving average), first-order derivative transformation, multiplicative scatter correction (MSC), and SPA wavelength selection, we constructed partial least squares regression (PLS), orthogonal partial least squares (OPLS), and artificial neural network (ANN) models. The OPLS model performed well in fat prediction (R2 = 0.971, Q2 = 0.926, RMSE = 0.175, RMSECV = 0.186), followed by starch (R2 = 0.956, Q2 = 0.907, RMSE = 0.159, RMSECV = 0.146) and protein (R2 = 0.967, Q2 = 0.936, RMSE = 0.164, RMSECV = 0.156). Our results confirm that combining moving average, first-order derivative, MSC, and SPA preprocessing with the OPLS model significantly improves the prediction. The developed non-destructive testing equipment provides a practical solution for automated, high-precision sorting of rice seeds based on nutrient composition.

1. Introduction

The main nutritional components of rice seeds include starch, fat, protein, cellulose, and trace minerals [1]. Among these, starch is the most predominant component, accounting for 70–80% of the dry weight, serving as the primary energy source for the human body. Protein content is relatively low (7–9%) but plays a crucial role in cell construction, repair, and vital biological activities. Fat content is minimal (1–3%) yet contributes to cholesterol regulation and supports brain and nervous system health. Consequently, the detection of nutritional components in rice seeds has become an important research focus. Currently, traditional detection methods for rice seeds include chemical analysis and microscopic observation. Chemical analysis can measure various components such as protein and minerals; however, the procedure is complex, time-consuming, and has significant environmental impacts. Microscopic observation primarily focuses on examining microstructures but has limited applicability and provides relatively restricted information. These methods struggle to meet the demands of modern agriculture for rapid and non-destructive detection of seed nutritional components.
Near-infrared spectroscopy (NIRS) is a rapid and non-destructive analytical technique that utilizes the near-infrared light spectrum (approximately 780 nm to 2526 nm) to obtain chemical and physical information about a sample by measuring its absorption and scattering characteristics at specific wavelengths. This technology has been widely applied in various fields, including agriculture, food processing, and environmental monitoring, due to its advantages of speed, accuracy, non-destructiveness, and environmental friendliness [2]. NIRS analyzes material composition and properties by measuring the absorption characteristics of near-infrared light by the sample. It leverages the vibrational absorption features of chemical bonds (e.g., C-H, O-H, and N-H) within molecular structures, offering benefits such as rapid analysis, high precision, operational convenience, and eco-friendliness. These attributes make it particularly suitable for non-destructive testing of rice seeds. Currently, NIRS has been extensively adopted in agriculture, especially for food quality assessment, soil analysis, and plant health monitoring [3]. For instance, Feng Haikuan et al. employed the Adaboost algorithm to monitor rice nitrogen nutrition and grain protein content, achieving an R2 of 0.960, RMSE of 0.175, and MAE of 0.150 [4]. Zhang Linxin et al. applied NIRS for rapid, non-destructive detection of hollow defects in hickory nuts, with a cross-validation accuracy of 86.44% [5]. Huang Deyao et al. integrated deep convolutional generative adversarial networks (DCGAN) with visible-near-infrared hyperspectral reflectance to enhance the prediction accuracy of anthocyanin content in rice seeds, yielding a coefficient of determination (R2) of 0.87, root mean square error (RMSE) of 9.40, and residual predictive deviation (RPD) of 2.88 [6].
Predictive models, such as partial least squares regression (PLS), orthogonal partial least squares (OPLS), and artificial neural network (ANN), are widely applied in metabolomics component analysis, near-infrared spectroscopy (NIRS), and numerous other fields. These methods perform well on complex high-dimensional data, particularly multivariate datasets like infrared spectra: they address collinearity issues, enhance predictive accuracy, and improve model interpretability. For example, Zhang Jing et al. successfully developed a model for predicting rice moisture content using a portable NIRS device combined with the successive projections algorithm (SPA) and PLS regression [7]. This approach exhibited robust predictive performance in soil moisture estimation, high-dimensional data analysis, and multicollinearity resolution. Guo Zhonghua et al. used a radial basis function artificial neural network (RBF-ANN) to establish prediction models for the protein and fat contents of four dairy products; the correlation coefficient and prediction-set mean square error reached 0.9997 and 0.0968, respectively [8]. P. R. Armstrong improved soybean moisture and protein content predictions by establishing a PLS regression model with multiplicative scatter correction (MSC). Using artificially humidified soybean samples with controlled moisture gradients for NIRS calibration, the model achieved a prediction set R2 of 0.992 and a standard error of 0.082, validating the efficacy of this sample preparation method [9]. Chayanid Sringarm et al. implemented an OPLS model with NIRS to classify industrial cassava starch hydrolysates based on Brix and glucose equivalence levels [10]. Ebrahim Taghinezhad et al. further improved the interpretability of the model by using a combination of decision trees and a learning automata metaheuristic algorithm (DT-LA) to screen for effective wavelengths [11]. Lingjiao Zhong et al. used three feature wavelength selection methods to achieve fast and non-destructive quality assessment of GLSP, with an RMSEP of 0.1981 and a qualified RPD of 2.826 [12].
At present, research on grain composition analysis, both nationally and internationally, has focused mainly on crops such as maize, soybean, and wheat. In the context of rice seeds, most studies have concentrated on protein content prediction, while investigations into multi-component prediction of different nutritional elements (e.g., fat and protein) remain inadequate. In addition, existing near-infrared spectroscopy (NIRS) instruments often face challenges such as high signal noise, redundant wavelength variables, and strong correlations between adjacent wavelengths, which limit the measurement accuracy of nutritional components in rice. To address these issues, this study uses NIRS technology combined with machine learning algorithms to accurately predict the nutritional content of rice seeds. We collected spectral data from rice seeds using a spectroscopic instrument, applied an optimized preprocessing approach that integrates classical Gaussian denoising with first-order derivative transformation and multiplicative scatter correction (MSC), and constructed an orthogonal partial least squares (OPLS) model. This methodology effectively mitigates common NIRS limitations, including signal noise, wavelength redundancy, and inter-wavelength correlation, thereby improving the accuracy of rice nutrient measurements. In addition, we developed an automated transmission spectral acquisition device to enable high-throughput, non-destructive detection of rice seeds, increasing the potential for large-scale agricultural applications.

2. Materials and Methods

2.1. Experimental Samples and Quantitative Analysis

The experimental samples included six rice varieties: Japanese Taisei rice, 9311 rice, and four novel colored rice cultivars developed by the Hubei Academy of Agricultural Sciences (classified by leaf color as white-leaf, red-leaf, purple-leaf, and yellow-leaf rice). Spectral data were collected separately for each variety. For chemical reference analysis, the seeds measured by spectroscopy were grouped and labeled by variety in packs of 10 seeds, for a total of 1200 seeds (20 packs × 10 seeds/pack × 6 varieties). The seeds were ground in groups of three grains and analyzed with reagent kits for protein, fat, and starch. A large number of additional seeds were analyzed in bulk by traditional chemical methods to calibrate the reagent kits and validate their detection accuracy. Data with a maximum error of 8% were retained for dataset construction. Quantitative chemical analyses were performed for starch, triglyceride, and protein content. Starch quantification: Soluble sugars were first separated using 80% ethanol. Starch was then hydrolyzed into glucose via acid hydrolysis, with glucose content determined by anthrone colorimetry to calculate starch concentration. Triglyceride measurement: The GPO-PAP method was employed, comparing the absorbance ratio between samples and standard tubes. Samples were homogenized and centrifuged, and the supernatant was used for analysis. Protein determination: The BCA assay was adopted, where proteins react with BCA reagent to form a colored complex. Protein concentration was calculated by measuring the complex's absorbance against a standard curve. These chemical measurements served as reference values for spectral model calibration.

2.2. Spectral Data Acquisition

As shown in Figure 1, the system primarily consists of three major components: the optical system, mechanical structure, and control system. The optical system uses a 22 W tungsten halogen lamp (spectral range 350–2500 nm) as the illumination source, and a near-infrared spectrometer (AvaSpec-NIR256-2.5-HSC-EVO, Avantes, Eindhoven, The Netherlands) as the spectral acquisition device, with a spectral collection range of 900–2500 nm. The mechanical structure includes a feeding device and a detection turntable. The control system uses a computer program to control two stepper motors, which drive the roller feeding mechanism and the rotation of the detection turntable, respectively. To ensure accurate detection of rice seeds, the system must address issues such as light source interference during spectral acquisition and uneven seed feeding. Therefore, the optical system adopts a transmission-mode spectral acquisition method, while the mechanical structure has been optimized to guarantee precise single-seed feeding, ensuring the system's accuracy and reliability.
In terms of the optical system, husk fragments adhering to the surface of rice seeds may reflect incident light, so reflectance spectra can lack internal seed information while being affected by surface husk interference [13]. We tested rice seeds with floating hulls. To address this issue, this study adopted a transmission-mode spectroscopy method. Although rice seeds appear opaque, transmitted light is able to penetrate their interior and provide more comprehensive information on nutrient content. Compared with reflectance spectroscopy, transmission spectroscopy can effectively avoid the interference of surface structures, reduce the instability of reflected signals, and obtain more accurate analytical results. In addition, transmission spectroscopy carries information from deep inside the sample, which is more suitable for multi-component analysis, and therefore has higher accuracy and reliability in detecting the nutrient content of rice seeds. Transmission spectroscopy not only enables the acquisition of whole-grain internal brown rice composition information, avoiding interference from surface husk reflections and external light, but also eliminates measurement errors caused by the heterogeneous distribution of internal components such as protein, fat, starch, and cellulose within different parts of the seed. During spectral data acquisition, light from the source is coupled through a fiber optic cable into an integrating sphere from top to bottom, and then guided by another fiber optic cable to illuminate the rice seeds. At the exit, a collimating lens straightens the light beam, and the transmitted light is collected and delivered to the spectrometer through an optical fiber, completing near-infrared spectrum acquisition. In the instrument design, the input and output ends of the two optical fibers are placed on a vertical line, ensuring that the light passes through the seed vertically.
In terms of mechanical structure design, a drum-type feeding structure was developed to enhance the accuracy, stability, and reliability of rice seed spectral detection. This structure optimizes seed flow and distribution to ensure each seed uniformly enters the spectral detection area, thereby improving detection accuracy. The drum design effectively controls seed feeding speed and frequency while minimizing external interference, ensuring process stability. Furthermore, the drum-type feeding structure exhibits high reliability by reducing fluctuations caused by equipment malfunctions or human factors, guaranteeing consistent and trustworthy detection results. The optimized trapezoidal groove design and adjustable opening settings effectively address seed jamming and uneven feeding issues, ensuring single-seed feeding for spectral detection to improve process repeatability and measurement accuracy. This design effectively supports efficient operation of the spectral system while meeting performance requirements for precise single-seed feeding.

2.3. Data Preprocessing

In this study, the objective is to accurately determine the protein, fat, and starch content in rice seeds. The primary challenges in conducting such analyses include the complexity and high-dimensional nature of the data, as well as the spectral similarity of different components in the near-infrared range, which makes their accurate differentiation difficult. To achieve efficient and precise compositional analysis, it is essential to select appropriate preprocessing methods and modeling approaches.
Regarding preprocessing, preliminary preprocessing methods are crucial for optimizing spectral data. Commonly used preliminary preprocessing techniques include the Savitzky–Golay filtering algorithm, Gaussian denoising, and the moving average algorithm. These methods primarily serve to remove noise and irregular fluctuations from spectral data, improve signal smoothness, and thereby enhance data quality [14]. Specifically, the Savitzky–Golay filtering algorithm performs local smoothing of spectral data through polynomial fitting, reducing noise while preserving key spectral features. This method effectively eliminates high-frequency noise while maintaining data structural information, making it widely used for signal smoothing. Gaussian denoising applies Gaussian function smoothing to spectral signals, transforming noisy signals into smoother profiles. This helps reduce signal fluctuations and improves the accuracy of subsequent analyses. The moving average algorithm employs a fixed window size to calculate average values for smoothing spectral data. It mitigates short-term fluctuations and enhances data stability, particularly for transient interference in signals. This systematic preprocessing approach ensures reliable spectral data for subsequent modeling and analysis. To reduce redundant information in the characteristic spectrum and lower the computational cost of the model, the spectral data are subjected to the successive projections algorithm (SPA) for wavelength selection after the above preprocessing. SPA is a forward feature selection algorithm that iteratively selects the wavelengths carrying the least redundant information, effectively reducing the collinearity between wavelengths.
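The three denoising filters described above can be illustrated with a minimal NumPy sketch. The function names, window sizes, and the Gaussian width are illustrative assumptions and are not the settings used in this study; edge handling by repeating the boundary value is also an assumption.

```python
import numpy as np

def moving_average(y, window=5):
    """Boxcar smoothing with an odd window; edges are padded by repetition."""
    kernel = np.ones(window) / window
    padded = np.pad(y, window // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def gaussian_smooth(y, sigma=2.0):
    """Smoothing with a normalized Gaussian kernel (sigma in sample units)."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(y, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def savitzky_golay(y, window=7, order=2):
    """Local polynomial least-squares smoothing (Savitzky-Golay)."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Columns are offsets**0, offsets**1, ...; row 0 of the pseudo-inverse
    # gives the weights that return the fitted value at the window center.
    design = np.vander(offsets, order + 1, increasing=True)
    weights = np.linalg.pinv(design)[0]
    padded = np.pad(y, half, mode="edge")
    # np.convolve flips its kernel, so reverse the weights to correlate.
    return np.convolve(padded, weights[::-1], mode="valid")

# Illustrative use on a synthetic noisy absorption band (not measured data).
wavelengths = np.linspace(1000.0, 2200.0, 256)
rng = np.random.default_rng(0)
noisy = np.exp(-0.5 * ((wavelengths - 1450.0) / 40.0) ** 2)
noisy = noisy + rng.normal(0.0, 0.05, wavelengths.size)
smoothed = {f.__name__: f(noisy)
            for f in (moving_average, gaussian_smooth, savitzky_golay)}
```

Wider windows suppress more noise but also flatten narrow absorption peaks, which is one reason the filters are compared empirically rather than chosen in advance.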
In addition to initial noise suppression, subsequent preprocessing operations such as derivative processing and multiplicative scatter correction (MSC) also play important roles. Derivative processing can effectively extract changing trends in spectra and enhance the resolution of spectral data. By calculating the first derivative of the spectrum, derivative processing eliminates effects caused by baseline drift and highlights subtle differences between components, thereby improving feature distinguishability. Furthermore, derivative processing makes changes in the data easier to identify and plays a key role in distinguishing components in complex samples. Multiplicative scatter correction (MSC) is a method used to eliminate scattering effects in spectra [15]. In actual measurements, factors such as rough sample surfaces and irregular shapes often cause light scattering effects that lead to distortion and instability in spectral data. The MSC method regresses each spectrum against a reference spectrum (typically the mean spectrum) and removes the fitted offset and slope, which eliminates scattering-induced interference and corrects spectral differences between samples, thereby improving data comparability and consistency. MSC can not only remove the influence of environmental factors but also reduce the impact of sample thickness, density, and other factors on the spectrum, improving data stability.
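As a hedged illustration of the two operations described above, the sketch below computes a first derivative with finite differences and applies MSC by regressing each spectrum on the mean spectrum. The function names are hypothetical, and the use of `np.gradient` instead of a Savitzky–Golay derivative is an assumption made for brevity.

```python
import numpy as np

def first_derivative(spectra, wavelengths):
    """Finite-difference first derivative along the wavelength axis.

    spectra has shape (n_samples, n_wavelengths).
    """
    return np.gradient(spectra, wavelengths, axis=1)

def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    Each spectrum s is modeled as s = intercept + slope * reference,
    and the corrected spectrum is (s - intercept) / slope.
    """
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty(spectra.shape, dtype=float)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)
        corrected[i] = (s - intercept) / slope
    return corrected

# Synthetic example: one band shape with sample-specific offset and gain.
wl = np.linspace(1000.0, 2200.0, 50)
band = np.exp(-0.5 * ((wl - 1450.0) / 100.0) ** 2)
raw = np.array([0.1 + 1.0 * band, 0.3 + 1.5 * band, -0.2 + 0.7 * band])
aligned = msc(raw)                      # rows now coincide
slopes = first_derivative(raw, wl)      # constant offsets vanish here
```

In this synthetic case the three spectra differ only by an additive offset and a multiplicative gain, so MSC maps all of them onto the same curve; real spectra also differ in composition, and that difference is what remains after correction.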
To select the optimal algorithm model, this study compared the effects of different preprocessing methods and modeling approaches, ultimately determining the best combination to obtain more accurate prediction models for protein, fat, and starch content in rice seeds. The application of this method can effectively improve analytical accuracy, providing strong technical support for rapid and reliable identification of rice seed components.
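The SPA wavelength selection step mentioned in this section can be sketched as a chain of orthogonal projections: starting from one wavelength, each step removes the direction of the last selected column from all columns and then picks the column with the largest remaining norm. The function name `spa_select` and the fixed starting index are assumptions; a complete SPA implementation usually also chooses the starting wavelength and the number of wavelengths by validation error, which is omitted here.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Return indices of n_select columns of X with low mutual collinearity.

    X has shape (n_samples, n_wavelengths). Assumes the selected columns
    are not all-zero after projection (i.e., enough independent variation).
    """
    Xp = (X - X.mean(axis=0)).astype(float)
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]]
        # Project every column onto the subspace orthogonal to v.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0          # never re-select a column
        selected.append(int(np.argmax(norms)))
    return selected

# Synthetic check: columns 1 and 3 are (nearly) copies of column 0,
# so the second pick should be the independent column 2.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X_demo = np.column_stack([a, 2.0 * a, b, a + 0.01 * b])
picked = spa_select(X_demo, 2, start=0)
```

Because a column that is almost a multiple of an already selected one has almost no norm left after projection, collinear neighbors are skipped, which matches the motivation given above for using SPA on strongly correlated adjacent wavelengths.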

2.4. Prediction Models

This study uses partial least squares regression (PLS), orthogonal partial least squares regression (OPLS), and artificial neural network (ANN) algorithms for quantitative analysis of rice seed components. These methods are chosen for their strengths in handling complex multivariate data and improving prediction accuracy. PLS reduces dimensionality and extracts key information to address highly correlated variables, while OPLS enhances PLS by removing irrelevant noise, improving model accuracy and stability. ANN, by simulating biological neural networks, learns nonlinear relationships in the data, making it effective for complex pattern recognition and large datasets. Combining these algorithms enables more accurate and reliable predictions for rice seed component analysis.

2.4.1. PLS Prediction Model

Partial least squares (PLS) is a multivariate statistical method used to address collinearity issues, enable simultaneous analysis of multiple dependent variables (Y), and study influence relationships with small sample sizes. The PLS model makes no distributional assumptions about the data, making conventional distribution-based statistical tests unsuitable for measuring its predictive validity. To evaluate the predictive validity of PLS models, a non-parametric test method—the Stone–Geisser test—is required. The resulting validity measure is generally expressed as Q2, indicating whether the observations reconstructed through the model and parameters are reasonable. PLS is primarily used to handle situations with multicollinearity (i.e., high correlations between variables) or when the number of independent variables far exceeds the number of observations. The core objective of PLS is to identify a set of new latent variables (called principal components) that can not only explain the variance in independent variables but also maximize the explanation of variance in dependent variables [16].
First, we treated the spectral data as independent variables (X) and the nutritional content as dependent variables (Y). Subsequently, we extracted the first pair of components from both X and Y variables, aiming to maximize the correlation between them.
$$X = TP^{\mathsf{T}} + E$$
$$Y = UQ^{\mathsf{T}} + F$$
In this formulation, P represents the loading matrix of the spectral matrix X and Q denotes the loading matrix of the measurement matrix Y; E is the residual matrix remaining after decomposing matrix X; T is the score matrix of the spectral matrix X and U is the score matrix of the measurement matrix Y; and F represents the residual matrix remaining after decomposing matrix Y. The second step of the PLS algorithm involves performing linear regression on the two score matrices T and U, yielding the following:
$$U = TB$$
$$B = (T^{\mathsf{T}}T)^{-1}T^{\mathsf{T}}U$$
where B is the regression coefficient matrix.
To predict unknown samples, we first calculate the score T1 of the spectral matrix X according to the above formula, and then substitute it into the corresponding formula to obtain the predicted value of the measurement through transformation. The specific formula is as follows:
$$\hat{Y} = T_1 B Q^{\mathsf{T}}$$
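The decomposition and regression steps above can be illustrated with a minimal NIPALS-style sketch for a single response (PLS1). This is an assumption-laden simplification: the study predicts several nutrients, the function names `pls1_fit` and `pls1_predict` are hypothetical, and the regression vector is formed directly in the wavelength space instead of through explicit T, U, and B matrices.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                 # direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xr @ w                    # X scores
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        c = yr @ t / tt               # y loading (regression of y on t)
        Xr = Xr - np.outer(t, p)      # deflate X
        yr = yr - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    """Predict the response for new spectra."""
    return (X - x_mean) @ coef + y_mean

# Synthetic data standing in for spectra (rows) and one nutrient value.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 4.0
y_demo = y_demo + 0.1 * rng.normal(size=40)
coef, xm, ym = pls1_fit(X_demo, y_demo, n_components=5)
y_hat = pls1_predict(X_demo, coef, xm, ym)
```

With as many components as there are independent variables, PLS reproduces ordinary least squares; the practical benefit for spectra comes from using far fewer components than wavelengths, with the number chosen by cross-validation.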

2.4.2. OPLS Prediction Model

OPLS is a variant of PLS that incorporates orthogonal signal correction. This method provides better differentiation between correlated and uncorrelated components of explanatory and response variables. OPLS demonstrates superior performance to conventional PLS in enhancing model interpretability and prediction accuracy, particularly when handling complex biological data [17].
The mathematical model of OPLS can be expressed as:
$$X = TP^{\mathsf{T}} + O$$
where T represents the components of independent variables correlated with dependent variable Y, P is the loading matrix, and O denotes the orthogonal components unrelated to Y. The OPLS method extracts principal components T to maximize the covariance between independent variables and dependent variable Y, while maintaining orthogonality between T and O. Each principal component is calculated with the objective of maximizing covariance, mathematically represented as:
$$\max\ \operatorname{cov}(T, Y)$$
Through an iterative process, the score matrix T is optimized so that each new T becomes more strongly correlated with dependent variable Y while eliminating unrelated components. Once the score matrix T is successfully calculated, we can construct a regression model based on T to predict values of dependent variable Y. The regression model can be expressed by the following formula:
$$Y = UQ^{\mathsf{T}} + F$$
where T is the score matrix of spectral matrix X, U is the score matrix of measurement matrix Y, and F represents the residual matrix remaining after decomposing matrix Y.
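The removal of Y-orthogonal variation can be sketched for a single response as follows. This follows the commonly described OPLS filtering procedure and is not taken from the authors' implementation; the name `opls_filter` and the returned tuple are assumptions. A PLS model would then be fitted on the filtered matrix.

```python
import numpy as np

def opls_filter(X, y, n_orthogonal=1):
    """Remove n_orthogonal Y-orthogonal components from centered X.

    Returns (X_filtered, T_o, P_o), where T_o and P_o hold the orthogonal
    scores and loadings as columns.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc                     # predictive weight vector
    w = w / np.linalg.norm(w)
    T_o, P_o = [], []
    for _ in range(n_orthogonal):
        t = Xc @ w                    # predictive scores
        p = Xc.T @ t / (t @ t)        # predictive loadings
        w_o = p - (w @ p) * w         # part of p orthogonal to w
        w_o = w_o / np.linalg.norm(w_o)
        t_o = Xc @ w_o                # orthogonal scores (uncorrelated with y)
        p_o = Xc.T @ t_o / (t_o @ t_o)
        Xc = Xc - np.outer(t_o, p_o)  # strip the orthogonal component
        T_o.append(t_o)
        P_o.append(p_o)
    return Xc, np.array(T_o).T, np.array(P_o).T

# Synthetic spectra: one y-related pattern plus a stronger nuisance pattern.
rng = np.random.default_rng(0)
y_demo = rng.normal(size=30)
nuisance = rng.normal(size=30)
X_demo = (np.outer(y_demo, rng.normal(size=8))
          + 3.0 * np.outer(nuisance, rng.normal(size=8))
          + 0.01 * rng.normal(size=(30, 8)))
X_filtered, T_o, P_o = opls_filter(X_demo, y_demo, n_orthogonal=1)
```

The orthogonal scores are uncorrelated with the response by construction, so subtracting them leaves the predictive information untouched while simplifying the model that is fitted afterwards.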

2.4.3. ANN Prediction Model

An artificial neural network (ANN) is a mathematical model that mimics the structure and function of biological neural networks, such as brains. It consists of a large number of nodes (or “neurons”) and the connections between them, and is used to model complex relationships between data. Its basic structure consists of an input layer, a hidden layer, and an output layer. The input layer receives external data, the hidden layer processes the data, and the output layer generates the final prediction result. Each neuron receives input from other neurons, makes a weighted sum, and generates an output via an activation function.
Artificial neural network (ANN) models can be divided into three main components: input weights, activation functions, and output. Firstly, input signals are summed through weights and biases to form a total. Then, this weighted sum undergoes a nonlinear transformation through an activation function, generating the output of the neuron. Finally, the activation values of the hidden or input layers are transmitted to the output layer, where a final network output is calculated through weighted summation and an activation function. This structure enables the ANN to learn complex patterns and make predictions.
Inputs and weights:
$$z = \sum_{i=1}^{n} w_i x_i + b$$
where $x_i$ is the i-th input signal, $w_i$ is the weight of that input signal, and $b$ is the offset (bias).
Activation function:
$$a = f(z)$$
where $f$ is the activation function (e.g., Sigmoid, ReLU, etc.) and $z$ is the weighted sum of the input signals.
Output:
$$y = g\left(\sum_{i=1}^{m} v_i a_i + c\right)$$
where $v_i$ are the weights of the neurons in the output layer, $a_i$ are the activation values of the previous layer, $c$ is the offset (bias), and $g$ is the activation function of the output layer (e.g., SoftMax, Sigmoid, etc.).
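The three equations above correspond to a single forward pass through a network with one hidden layer, sketched below. The layer sizes, the default activation choices, and the function name `forward` are illustrative assumptions and do not describe the network trained in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

def forward(x, W, b, v, c, f=sigmoid, g=identity):
    """One forward pass: hidden layer, then a single output neuron.

    x: inputs (n,), W: hidden weights (m, n), b: hidden offsets (m,),
    v: output weights (m,), c: output offset (scalar).
    """
    z = W @ x + b          # z_j = sum_i w_ji * x_i + b_j for each hidden neuron
    a = f(z)               # a = f(z)
    return g(v @ a + c)    # y = g(sum_i v_i * a_i + c)

# Tiny hand-checkable example with two inputs and two hidden neurons.
x_demo = np.array([1.0, 2.0])
y_demo = forward(x_demo, np.eye(2), np.zeros(2), np.array([1.0, 1.0]), 0.5,
                 f=relu, g=identity)
```

With identity hidden weights and ReLU, the hidden activations equal the inputs (1 and 2), so the output is 1 + 2 + 0.5 = 3.5. For regression targets such as nutrient content, an identity output activation is a common choice, which is why it is the default in this sketch.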

2.5. Evaluation Metrics

The model evaluation metrics adopted in this study primarily include root mean squared error (RMSE), root mean square error of cross-validation (RMSECV), coefficient of determination (R2), and model predictive ability (Q2). In addition, permutation tests were introduced to comprehensively assess model performance. Together, these criteria give a more complete and stable picture of predictive capability than any single metric and strengthen the validation of model assumptions, making model evaluation more rigorous.
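Root mean squared error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$
where $y_i - \hat{y}_i$ is the difference between the real value and the predicted value on the test set.
Root mean square error of cross-validation (RMSECV):
$$\mathrm{RMSECV} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n - 1}}$$
Coefficient of determination R2:
$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(\bar{y} - y_i\right)^2}$$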
Here, the numerator represents the sum of squared differences between the true values and predicted values in the test set, while the denominator represents the sum of squared differences between the true values and their mean in the test set. ȳ is the mean of the true values in the test set.
Predictive ability Q2:
$$Q^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(\bar{y} - y_i\right)^2}$$
Here, the numerator represents the sum of squared differences between the true values and predicted values in the test set, while the denominator represents the sum of squared differences between the true values and the mean of the training set; ȳ is the mean of the true values in the training set. Q2 therefore has the same form as R2 and differs only in the mean used in the denominator.
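The four metrics can be written directly from the definitions above. This is a minimal sketch: the function names are hypothetical, and `q2` takes the training-set mean as an explicit argument to reflect the description given for its denominator.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over m test samples."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsecv(y_true, y_cv_pred):
    """Cross-validation error with the n - 1 denominator used above."""
    y_true, y_cv_pred = np.asarray(y_true, float), np.asarray(y_cv_pred, float)
    return float(np.sqrt(np.sum((y_true - y_cv_pred) ** 2) / (y_true.size - 1)))

def r2(y_true, y_pred):
    """Coefficient of determination, using the mean of y_true."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def q2(y_true, y_pred, train_mean):
    """Predictive ability, using the training-set mean in the denominator."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - train_mean) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hand-checkable example: only the last prediction is wrong (by 2).
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.0, 2.0, 3.0, 6.0]
```

For this example the residual sum of squares is 4, so RMSE is 1, RMSECV is the square root of 4/3, R2 is 1 − 4/5 = 0.2, and Q2 with an assumed training mean of 2.0 is 1 − 4/6.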
Permutation Test:
The permutation test is a non-parametric statistical method used to evaluate whether an observed statistic is significant. Its core concept involves constructing an empirical distribution through random data shuffling (permutation) to calculate the significance (p-value) of the statistic. This method does not rely on distributional assumptions (e.g., normality) and is particularly suitable for small samples or complex data [18].
$$p = \frac{\sum_{i=1}^{N} I\left(\Delta_{\mathrm{perm}}^{(i)} \geq \Delta_{\mathrm{obs}}\right) + 1}{N + 1}$$
Permutation Test Procedure:
Calculate the actual statistic (e.g., the mean difference Δobs between two groups). Randomly shuffle the labels and recalculate the statistic Δperm. Repeat N times (e.g., 500 times) to generate the permutation distribution. I() is the indicator function (equals 1 when the condition is met, otherwise 0). Finally, compute the p-value by comparing the position of Δobs in the permutation distribution to evaluate statistical significance.
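The procedure above can be sketched for the mean difference between two groups. The one-sided comparison (permuted statistic greater than or equal to the observed one), the function name, and the fixed random seed are assumptions made for illustration; for model validation the same scheme is applied with a model statistic such as Q2 in place of the mean difference.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_perm=500, seed=0):
    """One-sided permutation p-value for mean(group_a) - mean(group_b)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    pooled = np.concatenate([group_a, group_b])
    n_a = group_a.size
    delta_obs = group_a.mean() - group_b.mean()
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)       # random relabeling
        delta_perm = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if delta_perm >= delta_obs:              # indicator I(...)
            count += 1
    return (count + 1) / (n_perm + 1)

# Clearly separated groups give a small p-value; identical groups do not.
high = np.arange(10.0, 20.0)
low = np.arange(0.0, 10.0)
p_separated = permutation_p_value(high, low)
p_identical = permutation_p_value(low, low)
```

Adding 1 to both numerator and denominator keeps the estimate strictly positive, so the smallest attainable p-value with N = 500 permutations is 1/501.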

3. Results

3.1. Analysis of Near-Infrared Absorption Characteristics in Rice Seeds

As shown in Figure 2, the measured data reveal that rice seeds exhibit significant absorption characteristics in specific near-infrared spectral bands (1150–1220 nm, 1410–1450 nm, 1510–1540 nm, 1660–1800 nm, and 1910–1950 nm). These absorption peaks are closely associated with various chemical components in rice seeds, such as methyl groups, proteins, amino acids, fats, and cellulose. Variations in these components can reflect seed quality and viability [19].
  • 1150–1220 nm Band
In this band, rice seeds display distinct absorption features, which are linked to the overtone vibrations of C-H bonds in methyl (CH3) and methylene (CH2) groups. Changes in absorption within this band can be used to evaluate the content of fatty acid compounds in rice seeds. Specifically, variations in fatty acid content lead to noticeable shifts in the absorption peaks within the 1150–1220 nm range.
  • 1410–1450 nm Band
Absorption in this band is associated with vibrations of C-H bonds and O-H bonds in proteins. The protein content in rice seeds significantly influences absorption in this range, and fluctuations in protein levels result in corresponding changes in the absorption peaks.
  • 1510–1540 nm Band
The absorption characteristics in this band are related to the overtone vibrations of N-H bonds in amino acids and proteins. Variations in amino acid and protein content directly affect the absorption peaks in this range, providing a basis for assessing amino acid levels in seeds.
  • 1660–1800 nm Band
Absorption in this band corresponds to the overtone vibrations of CH3 and CH2 groups in fats. Changes in fatty acid content cause significant shifts in the absorption peaks within the 1660–1800 nm range, making this band useful for evaluating fat content in rice seeds.
  • 1910–1950 nm Band
Absorption in this band is associated with the overtone vibrations of C=O bonds in lipids and cellulose, as well as O-H vibrations in starch and cellulose. Variations in the content of cellulose, starch, and lipids in rice seeds significantly influence absorption characteristics in this range, making it an important reference band for assessing seed quality.

3.2. Effects of Raw Spectral Preprocessing

In spectral analysis, the raw spectra of samples contain critical information about the target components but are also affected by electrical noise, stray signals, and background interference, which can compromise analytical accuracy. Therefore, to establish precise calibration models, spectral preprocessing methods must be employed to effectively eliminate these interfering signals. Techniques such as noise reduction, first-derivative processing, and multiplicative scatter correction (MSC) significantly enhance spectral data quality and analytical precision.

Advanced Preprocessing Results

Figure 3 shows (A) the raw spectral data, (B) the spectrum after Gaussian denoising, (C) the spectrum after moving average denoising, and (D) the spectrum after Savitzky–Golay filtering. Figure 4, Figure 5, and Figure 6 show the effects of secondary (advanced) preprocessing on the initially processed data from Figure 3.
Figure 4 shows the spectral data curve after Gaussian denoising, first-order derivative processing, and multiplicative scatter correction (MSC). Figure 5 shows the spectral data curve after Savitzky–Golay denoising, first-order derivative processing, and MSC. Figure 6 shows the spectral data curve after moving average denoising, first-order derivative processing, and MSC. These methods can effectively remove background noise and instrument noise, and improve the clarity of spectral signals. After first-derivative preprocessing, the baseline drift of the spectrum is effectively reduced, but the peaks of the spectral lines are not pronounced. Therefore, after this step, MSC is applied to the data to eliminate the influence of particle size and uneven distribution on spectral scattering, in order to simplify the model and improve the spectral signal-to-noise ratio.
Together, these preprocessing steps substantially improve the spectral data by reducing noise, baseline drift, and scattering effects. The resulting cleaner data, with a higher signal-to-noise ratio, make it possible to build more accurate and reliable prediction models. The rest of this section examines how these preprocessing methods improve model performance.
As shown in Table 1, denoising can improve the interpretability (R2) and prediction accuracy (Q2) of the models relative to models built on the raw spectra. For example, the OPLS fat model improves from R2 = 0.789 and Q2 = 0.583 on the raw data to R2 = 0.917 and Q2 = 0.649 after Gaussian denoising. However, no model in Table 1 reaches a Q2 much above 0.70, so denoising alone still leaves the models with low interpretability and low prediction accuracy.
As shown in Table 2, the interpretability and prediction accuracy of the models improve further when the initial denoising is combined with first-derivative processing, and they improve again once multiplicative scatter correction (MSC) is added. Among the combinations of the three denoising methods (Gaussian, Savitzky–Golay, and moving average) with first-derivative processing and MSC, the nutrient content models achieved their best results with the GAUSS + first derivative + MSC method and OPLS: the fat model achieved R2 = 0.959, Q2 = 0.903, RMSE = 0.1942, and RMSECV = 0.212; the starch model achieved R2 = 0.951, Q2 = 0.901, RMSE = 0.1789, and RMSECV = 0.201; and the protein model achieved R2 = 0.945, Q2 = 0.928, RMSE = 0.1887, and RMSECV = 0.235. These are good prediction results for the nutrient content of rice seeds, but there is still room for improvement.
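The four figures of merit reported here can be computed as in the following sketch. It assumes the common chemometric definitions: R2 and RMSE come from the calibration fit, while Q2 and RMSECV come from cross-validated predictions. An ordinary least-squares model stands in for OPLS, and the seven-fold split, the helper names, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients with an intercept term (stand-in for OPLS)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predict the response for the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def fit_metrics(y_true, y_pred):
    """Return (coefficient of determination, root mean squared error)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res / len(y_true))

def cross_validated_metrics(X, y, n_folds=7):
    """Q2 and RMSECV: the same formulas applied to k-fold held-out predictions."""
    folds = np.arange(len(y)) % n_folds
    y_cv = np.empty(len(y), dtype=float)
    for k in range(n_folds):
        test = folds == k
        coef = fit_ols(X[~test], y[~test])
        y_cv[test] = predict_ols(coef, X[test])
    return fit_metrics(y, y_cv)

# Synthetic example (assumed data): three predictors and a noisy linear response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

r2, rmse = fit_metrics(y, predict_ols(fit_ols(X, y), X))
q2, rmsecv = cross_validated_metrics(X, y)
print(f"R2={r2:.3f} Q2={q2:.3f} RMSE={rmse:.3f} RMSECV={rmsecv:.3f}")
```

Because Q2 and RMSECV are computed on samples the model has not seen, a large gap between R2 and Q2, or between RMSE and RMSECV, is a sign of overfitting.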
As shown in Table 3, the interpretability and prediction accuracy of the models improve further after the SPA wavelength selection method is introduced, particularly when it is combined with multiplicative scatter correction (MSC). Among the combinations of denoising (Gaussian, Savitzky–Golay, and moving average), first-derivative processing, MSC, and SPA, the fat model achieves its best prediction with the moving average + first-order derivative + MSC + SPA + OPLS method, with R2 = 0.971, Q2 = 0.926, RMSE = 0.175, and RMSECV = 0.186. For starch, the Gaussian + first-order derivative + MSC + SPA + OPLS method gives R2 = 0.964, Q2 = 0.913, RMSE = 0.298, and RMSECV = 0.237, whereas the moving average + first-order derivative + MSC + SPA + OPLS method gives R2 = 0.956, Q2 = 0.907, RMSE = 0.159, and RMSECV = 0.146; the latter has the smaller RMSE and RMSECV and is therefore the better model. For protein, the moving average + first-order derivative + MSC + SPA + OPLS method also gives the best prediction, with R2 = 0.967, Q2 = 0.936, RMSE = 0.164, and RMSECV = 0.156.
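The idea behind SPA wavelength selection can be sketched as follows. This minimal NumPy example covers only the projection phase of the successive projections algorithm as it is commonly described: starting from one wavelength, it repeatedly picks the wavelength whose column has the largest norm after projecting out the previously chosen one, so that the selected wavelengths are minimally collinear. Choosing the starting wavelength and the number of wavelengths by validation error is omitted. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Pick n_select column indices with low collinearity (SPA projection phase)."""
    residual = X - X.mean(axis=0)              # work on mean-centered columns
    selected = [start]
    for _ in range(n_select - 1):
        v = residual[:, selected[-1]]
        # Project every column onto the orthogonal complement of the last pick.
        residual = residual - np.outer(v, v @ residual) / (v @ v)
        norms = np.linalg.norm(residual, axis=0)
        norms[selected] = 0.0                  # never re-select a chosen column
        selected.append(int(np.argmax(norms)))
    return selected

# Toy example (assumed data): columns 1 and 3 are nearly copies of column 0,
# while column 2 carries independent information.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, 2 * a, b, a + 0.01 * b])

chosen = spa_select(X, n_select=2, start=0)
print(chosen)
```

In the toy matrix the second pick is the independent column rather than either of the redundant ones, which is the behavior that makes SPA useful for strongly correlated neighboring NIR wavelengths.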
To further validate the models, we performed permutation tests on the three best prediction models for protein, fat, and starch obtained with the moving average + first-order derivative + MSC + SPA + OPLS method. The results after 200 rounds of cross-validation are shown in Figure 7, Figure 8 and Figure 9. For the fat model, R2 = 0.971 with a Y-axis intercept of 0.168, which is less than 0.3, and Q2 = 0.926 with an intercept of −0.107, which is well below 0.05, indicating that the model is not overfitted. Similarly, for the starch model, R2 = 0.956 with an intercept of 0.114, which is less than 0.3, and Q2 = 0.907 with an intercept of −0.096, which is well below 0.05, indicating no overfitting. The protein model also performed well, with R2 = 0.967 and a Y-axis intercept of 0.196, which is less than 0.3, and Q2 = 0.936 with an intercept of −0.137, which is less than 0.05, again indicating no overfitting.
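The intercept criteria used above can be illustrated with a minimal permutation-test sketch. It assumes the usual procedure: the response is shuffled repeatedly, the model is refitted, and the R2 and Q2 of each permuted model are regressed on the correlation between the permuted and the original response. The intercepts at zero correlation are then compared with the 0.3 and 0.05 thresholds. An ordinary least-squares model stands in for OPLS. The seven-fold split, the use of the absolute correlation, the helper names, and the synthetic data are assumptions; the 200 permutations mirror the 200 rounds mentioned in the text.

```python
import numpy as np

def r2_q2(X, y, n_folds=7):
    """In-sample R2 and k-fold cross-validated Q2 of an OLS stand-in model."""
    A = np.column_stack([np.ones(len(y)), X])
    ss_tot = np.sum((y - y.mean()) ** 2)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - np.sum((y - A @ coef) ** 2) / ss_tot
    folds = np.arange(len(y)) % n_folds
    press = 0.0                                # sum of squared held-out errors
    for k in range(n_folds):
        test = folds == k
        coef_k, *_ = np.linalg.lstsq(A[~test], y[~test], rcond=None)
        press += np.sum((y[test] - A[test] @ coef_k) ** 2)
    return r2, 1.0 - press / ss_tot

def permutation_intercepts(X, y, n_perm=200, seed=0):
    """Return the intercepts of R2 and Q2 versus |corr(y, y_permuted)|."""
    rng = np.random.default_rng(seed)
    r2_0, q2_0 = r2_q2(X, y)
    corr, r2s, q2s = [1.0], [r2_0], [q2_0]     # the unpermuted model sits at corr = 1
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        r2, q2 = r2_q2(X, y_perm)
        corr.append(abs(np.corrcoef(y, y_perm)[0, 1]))
        r2s.append(r2)
        q2s.append(q2)
    r2_intercept = np.polyfit(corr, r2s, 1)[1]
    q2_intercept = np.polyfit(corr, q2s, 1)[1]
    return r2_intercept, q2_intercept

# Synthetic example (assumed data): a genuine linear relationship.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)

r2_int, q2_int = permutation_intercepts(X, y)
print(f"R2 intercept={r2_int:.3f}, Q2 intercept={q2_int:.3f}")
```

When the model captures a real relationship, models fitted to shuffled responses explain little and predict worse than the mean, so both intercepts stay low. Intercepts above the thresholds would suggest that the original fit could be reproduced by chance.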

4. Conclusions

In this study, we collected near-infrared spectra of rice seeds with a spectrometer and combined them with regression modeling to predict the nutrient composition of the seeds accurately. With moving average denoising, first-order derivative processing, SPA wavelength selection, and MSC as preprocessing, the OPLS fat model has R2 = 0.971, RMSE = 0.175, and RMSECV = 0.186, with a Y-axis intercept of 0.168, which is less than 0.3, and Q2 = 0.926, with an intercept of −0.107, which is less than 0.05. Similarly, the starch model has R2 = 0.956, RMSE = 0.159, and RMSECV = 0.146, with an intercept of 0.114, which is less than 0.3, and Q2 = 0.907, with an intercept of −0.096, which is less than 0.05. The protein model has R2 = 0.967, RMSE = 0.164, and RMSECV = 0.156, with a Y-axis intercept of 0.196, which is less than 0.3, and Q2 = 0.936, with an intercept of −0.137, which is much smaller than 0.05. The study shows that the OPLS model built on spectra preprocessed with moving average denoising, first-order derivative processing, SPA wavelength selection, and multiplicative scatter correction predicts the nutrient composition of rice accurately, and that it provides an effective, high-precision method for non-destructive testing of the nutritional composition of rice.

Author Contributions

Conceptualization, G.L.; Methodology, H.K., J.C. and J.W.; Software, H.K. and J.W.; Validation, H.K. and J.W.; Investigation, J.C., Z.X. and H.K.; Resources, G.L. and J.W.; Data curation, H.K. and Z.X.; Writing—original draft, H.K. and J.W.; Writing—review and editing, H.K. and J.W.; Visualization, J.W.; Supervision, G.L. and J.W.; Project administration, G.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research and Development Program of the Jilin Provincial Science and Technology Development Plan (No. 20220203195SF), the Youth Innovation Promotion Association CAS (No. 2023229), and the Open Bidding for Selecting the Best Candidates program of Changchun City (No. 23JG06).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tang, Y. Analysis of high-yield rice cultivation techniques and pest control measures. Seed Sci. Technol. 2025, 43, 141–143. [Google Scholar] [CrossRef]
  2. Yang, H.E.; Kim, N.W.; Lee, H.G.; Kim, M.J.; Sang, W.G.; Yang, C.; Mo, C. Prediction of protein content in paddy rice (Oryza sativa L.) combining near-infrared spectroscopy and deep-learning algorithm. Front. Plant Sci. 2024, 15, 1398762. [Google Scholar] [CrossRef] [PubMed]
  3. Li, Z.; Hou, M.; Cui, S.; Chen, M.; Liu, Y.; Li, X.; Chen, H.; Liu, L. Near-infrared analysis of flavonoid content in peanut kernels. Spectrosc. Spectr. Anal. 2024, 44, 1112–1116. [Google Scholar]
  4. Zhang, J.; Xu, B.; Feng, H.K.; Jing, X.; Wang, J.J.; Ming, S.K.; Fu, Y.Q.; Song, X.Y. Monitoring nitrogen nutrition and grain protein content in rice based on ensemble learning. Spectrosc. Spectr. Anal. 2022, 42, 1956–1964. [Google Scholar]
  5. Zhang, L.; Wang, H.; Cai, L.; Yu, C.; Sun, T. Rapid and nondestructive detection of hollow defects in pecan nuts based on near-infrared spectroscopy and voting method. J. Food Compos. Anal. 2025, 141, 107381. [Google Scholar] [CrossRef]
  6. Bao, X.; Huang, D.; Yang, B.; Li, J.; Opeyemi, A.T.; Wu, R.; Cheng, Z. Combining deep convolutional generative adversarial networks with visible-near infrared hyperspectral reflectance to improve prediction accuracy of anthocyanin content in rice seeds. Food Control 2025, 174, 111218. [Google Scholar] [CrossRef]
  7. Zhang, J.; Guo, Z.; Wang, S.; Yue, M.; Zhang, S.; Peng, H.; Yin, X.; Du, J.; Ma, C. Comparative study on portable near-infrared and visible light spectrometers for detecting water content in rice. Spectrosc. Spectr. Anal. 2023, 43, 2059–2066. [Google Scholar]
  8. Guo, Z.; Wang, L.; Jin, L.; Zheng, C. Detection of protein and fat content in dairy products based on near-infrared transmission spectroscopy. Optoelectron. Laser 2013, 24, 1163–1168. [Google Scholar]
  9. Armstrong, P.R. Rapid single-kernel NIR measurement of grain and oil-seed attributes. Appl. Eng. Agric. 2006, 22, 767–772. [Google Scholar] [CrossRef]
  10. Sringarm, C.; Numthuam, S.; Jiamyangyuen, S.; Kittiwachana, S.; Saeys, W.; Rungchang, S. Classification of industrial tapioca starch hydrolysis products based on their brix and dextrose equivalent values using near-infrared spectroscopy. J. Sci. Food Agric. 2024, 104, 7249–7257. [Google Scholar] [CrossRef] [PubMed]
  11. Taghinezhad, E.; Szumny, A.; Figiel, A.; Amoghin, M.L.; Mirzazadeh, A.; Blasco, J.; Mazurek, S.; Castillo-Gironés, S. The potential application of HSI and VIS/NIR spectroscopy for non-invasive detection of starch gelatinization and head rice yield during parboiling and drying process. J. Food Compos. Anal. 2025, 142, 107443. [Google Scholar] [CrossRef]
  12. Zhong, L.; Fan, Y.; Wu, Y.; Gao, Y.; Gao, Z.; Zhou, A.; Shao, Q.; Zhang, A. Data fusion strategy for rapid prediction of glyceryl trioleate and polysaccharide content in Ganoderma lucidum spore powder based on near-infrared spectroscopy and hyperspectral imaging. J. Food Compos. Anal. 2025, 141, 107403. [Google Scholar] [CrossRef]
  13. Jin, W.; Cao, N.; Zhu, M.; Chen, W.; Zhang, P.; Zhao, Q.; Liang, J.; Yu, Y. Study on the non-destructive classification of rice seed vitality based on near-infrared supercontinuum laser spectroscopy. Chin. Opt. 2020, 13, 1032–1043. [Google Scholar]
  14. Liu, Q.; Du, Y.; Jiang, J.; Chen, S.; Hu, S.; Xie, J. Application of Savitzky-Golay smoothing combined with multivariate scatter correction in infrared spectroscopy data preprocessing for formaldehyde-fixed deep vein thrombosis. Shandong Chem. Ind. 2024, 53, 126–130. [Google Scholar] [CrossRef]
  15. Li, C.; Wang, N.; Liu, J.; Wu, S.; Fu, W.; Ren, K. Spectral confocal shift measurement method analysis and research based on multivariate scatter correction. Infrared Laser Eng. 2025, 54, 231–240. [Google Scholar]
  16. Zhang, Z.; Yan, C.; Yue, C.; An, D.; Liu, X.; Wang, M.; Zhang, T.; Li, H. Prediction of methanol content in biodiesel using near-infrared spectroscopy combined with partial least squares method. Phys. Chem. Test.-Chem. Ed. 2024, 60, 606–611. [Google Scholar]
  17. Forsgren, E.; Bjorkblom, B.; Trygg, J.; Jonsson, P. OPLS-Based Multiclass Classification and Data-Driven Interclass Relationship Discovery. J. Chem. Inf. Model. 2025, 65, 1762–1770. [Google Scholar] [CrossRef] [PubMed]
  18. Danyluik, M.; Zeighami, Y.; Mukora, A.; Lepage, M.; Shah, J.; Joober, R.; Misic, B.; Iturria-Medina, Y.; Chakravarty, M.M. Evaluating permutation-based inference for partial least squares analysis of neuroimaging data. Imaging Neurosci. 2025, 3, p.imag_a_00434. [Google Scholar] [CrossRef]
  19. Chen, J.; Li, M.; Pan, T.; Pang, L.; Yao, L.; Zhang, J. Rapid and non-destructive analysis for the identification of multi-grain rice seeds with near-infrared spectroscopy. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2019, 219, 179–185. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Structure diagram of detection system.
Figure 2. Near-infrared spectral characteristics of rice seeds.
Figure 3. Primary preprocessing of spectral data. (A) Raw spectral data; (B) Gaussian noise-reduced spectra; (C) moving average noise-reduced spectra; (D) Savitzky–Golay filter noise-reduced spectra.
Figure 4. Gaussian noise reduction and subsequent treatment. (A) raw spectral data; (B) Gaussian denoising spectrum; (C) First derivative processing; (D) MSC processing.
Figure 5. Savitzky–Golay noise reduction and subsequent treatment. (A) raw spectral data; (B) Savitzky–Golay denoising spectrum; (C) First derivative processing; (D) MSC processing.
Figure 6. Moving average noise reduction and subsequent treatment. (A) raw spectral data; (B) Moving average denoising spectrum; (C) First derivative processing; (D) MSC processing.
Figure 7. Validation results of fat prediction model.
Figure 8. Validation results of starch prediction model.
Figure 9. Verification results of protein prediction model.
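Table 1. Modeling and analysis of raw data and initial denoising data.
| Method | Model | Fat R2 | Fat Q2 | Fat RMSECV | Fat RMSE | Starch R2 | Starch Q2 | Starch RMSECV | Starch RMSE | Protein R2 | Protein Q2 | Protein RMSE | Protein RMSECV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | PLS | 0.782 | 0.604 | 0.371 | 0.3834 | 0.760 | 0.639 | 0.365 | 0.3721 | 0.284 | 0.0799 | 0.3659 | 0.364 |
| | OPLS | 0.789 | 0.583 | 0.327 | 0.2865 | 0.714 | 0.615 | 0.303 | 0.2461 | 0.439 | 0.179 | 0.2523 | 0.311 |
| | ANN | 0.614 | 0.324 | 0.416 | 0.409 | 0.689 | 0.445 | 0.422 | 0.396 | 0.531 | 0.417 | 0.562 | 0.486 |
| Savitzky–Golay | PLS | 0.857 | 0.623 | 0.320 | 0.3312 | 0.790 | 0.652 | 0.356 | 0.3461 | 0.393 | 0.143 | 0.3417 | 0.323 |
| | OPLS | 0.885 | 0.659 | 0.321 | 0.2416 | 0.785 | 0.651 | 0.311 | 0.2662 | 0.595 | 0.216 | 0.2518 | 0.307 |
| | ANN | 0.659 | 0.433 | 0.359 | 0.368 | 0.736 | 0.452 | 0.364 | 0.287 | 0.611 | 0.469 | 0.371 | 0.364 |
| Moving Average | PLS | 0.588 | 0.313 | 0.416 | 0.2693 | 0.701 | 0.373 | 0.353 | 0.2756 | 0.388 | 0.115 | 0.2867 | 0.344 |
| | OPLS | 0.649 | 0.221 | 0.325 | 0.1998 | 0.603 | 0.406 | 0.274 | 0.2516 | 0.334 | 0.130 | 0.2398 | 0.269 |
| | ANN | 0.590 | 0.464 | 0.623 | 0.641 | 0.691 | 0.549 | 0.661 | 0.569 | 0.798 | 0.650 | 0.354 | 0.319 |
| GAUSS | PLS | 0.466 | 0.147 | 0.369 | 0.2856 | 0.690 | 0.345 | 0.424 | 0.2466 | 0.383 | 0.159 | 0.2859 | 0.295 |
| | OPLS | 0.917 | 0.649 | 0.257 | 0.1839 | 0.852 | 0.611 | 0.314 | 0.2062 | 0.790 | 0.346 | 0.2095 | 0.241 |
| | ANN | 0.813 | 0.701 | 0.531 | 0.485 | 0.793 | 0.636 | 0.307 | 0.341 | 0.641 | 0.596 | 0.264 | 0.326 |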
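Table 2. Modeling of Initial Denoising + First Derivative + MSC preprocessing method.
| Method | Model | Fat R2 | Fat Q2 | Fat RMSECV | Fat RMSE | Starch R2 | Starch Q2 | Starch RMSECV | Starch RMSE | Protein R2 | Protein Q2 | Protein RMSE | Protein RMSECV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GAUSS + First derivative | PLS | 0.719 | 0.554 | 0.276 | 0.2389 | 0.755 | 0.536 | 0.365 | 0.2453 | 0.431 | 0.219 | 0.2685 | 0.298 |
| | OPLS | 0.892 | 0.750 | 0.233 | 0.1943 | 0.830 | 0.741 | 0.259 | 0.1783 | 0.763 | 0.691 | 0.1693 | 0.192 |
| | ANN | 0.901 | 0.763 | 0.323 | 0.298 | 0.818 | 0.699 | 0.317 | 0.362 | 0.798 | 0.657 | 0.374 | 0.326 |
| Savitzky–Golay + First derivative | PLS | 0.858 | 0.759 | 0.332 | 0.2366 | 0.751 | 0.523 | 0.341 | 0.2913 | 0.268 | 0.146 | 0.2730 | 0.269 |
| | OPLS | 0.898 | 0.864 | 0.235 | 0.1144 | 0.821 | 0.801 | 0.249 | 0.1654 | 0.836 | 0.797 | 0.1215 | 0.189 |
| | ANN | 0.793 | 0.762 | 0.413 | 0.398 | 0.926 | 0.897 | 0.264 | 0.223 | 0.846 | 0.648 | 0.268 | 0.211 |
| Moving Average + First derivative | PLS | 0.862 | 0.678 | 0.293 | 0.2619 | 0.746 | 0.560 | 0.311 | 0.2633 | 0.494 | 0.125 | 0.2731 | 0.195 |
| | OPLS | 0.890 | 0.820 | 0.192 | 0.1139 | 0.925 | 0.829 | 0.295 | 0.1346 | 0.903 | 0.854 | 0.1548 | 0.233 |
| | ANN | 0.911 | 0.869 | 0.251 | 0.198 | 0.861 | 0.719 | 0.264 | 0.315 | 0.746 | 0.693 | 0.168 | 0.264 |
| GAUSS + First derivative + MSC | PLS | 0.919 | 0.780 | 0.314 | 0.2699 | 0.828 | 0.620 | 0.289 | 0.2366 | 0.425 | 0.092 | 0.2760 | 0.269 |
| | OPLS | 0.958 | 0.903 | 0.212 | 0.1942 | 0.951 | 0.901 | 0.201 | 0.1789 | 0.945 | 0.928 | 0.1887 | 0.235 |
| | ANN | 0.902 | 0.830 | 0.326 | 0.264 | 0.849 | 0.793 | 0.235 | 0.324 | 0.897 | 0.729 | 0.235 | 0.198 |
| Savitzky–Golay + First derivative + MSC | PLS | 0.913 | 0.614 | 0.344 | 0.2549 | 0.871 | 0.501 | 0.265 | 0.2651 | 0.745 | 0.658 | 0.2749 | 0.334 |
| | OPLS | 0.928 | 0.873 | 0.188 | 0.1697 | 0.916 | 0.869 | 0.223 | 0.1814 | 0.934 | 0.898 | 0.1988 | 0.196 |
| | ANN | 0.866 | 0.785 | 0.158 | 0.169 | 0.932 | 0.894 | 0.146 | 0.209 | 0.876 | 0.826 | 0.303 | 0.295 |
| Moving Average + First derivative + MSC | PLS | 0.399 | 0.222 | 0.261 | 0.2519 | 0.583 | 0.309 | 0.276 | 0.2417 | 0.184 | 0.052 | 0.2557 | 0.233 |
| | OPLS | 0.908 | 0.856 | 0.217 | 0.1986 | 0.935 | 0.881 | 0.212 | 0.1793 | 0.942 | 0.905 | 0.2160 | 0.134 |
| | ANN | 0.932 | 0.916 | 0.365 | 0.322 | 0.942 | 0.896 | 0.317 | 0.246 | 0.865 | 0.843 | 0.222 | 0.296 |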
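Table 3. Modeling of Initial Denoising + First Derivative + MSC preprocessing + SPA method.
| Method | Model | Fat R2 | Fat Q2 | Fat RMSECV | Fat RMSE | Starch R2 | Starch Q2 | Starch RMSECV | Starch RMSE | Protein R2 | Protein Q2 | Protein RMSE | Protein RMSECV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GAUSS + First derivative + SPA | PLS | 0.517 | 0.435 | 0.368 | 0.249 | 0.693 | 0.524 | 0.319 | 0.266 | 0.595 | 0.368 | 0.277 | 0.343 |
| | OPLS | 0.946 | 0.797 | 0.342 | 0.287 | 0.863 | 0.819 | 0.266 | 0.320 | 0.871 | 0.723 | 0.305 | 0.290 |
| | ANN | 0.922 | 0.843 | 0.298 | 0.247 | 0.908 | 0.789 | 0.301 | 0.296 | 0.867 | 0.799 | 0.335 | 0.276 |
| Savitzky–Golay + First derivative + SPA | PLS | 0.793 | 0.652 | 0.272 | 0.230 | 0.816 | 0.743 | 0.276 | 0.233 | 0.560 | 0.328 | 0.401 | 0.341 |
| | OPLS | 0.917 | 0.843 | 0.317 | 0.198 | 0.872 | 0.788 | 0.267 | 0.355 | 0.923 | 0.840 | 0.355 | 0.264 |
| | ANN | 0.866 | 0.783 | 0.275 | 0.257 | 0.938 | 0.865 | 0.236 | 0.270 | 0.872 | 0.754 | 0.332 | 0.286 |
| Moving Average + First derivative + SPA | PLS | 0.843 | 0.788 | 0.453 | 0.397 | 0.798 | 0.706 | 0.297 | 0.342 | 0.689 | 0.567 | 0.262 | 0.341 |
| | OPLS | 0.903 | 0.738 | 0.258 | 0.271 | 0.939 | 0.907 | 0.193 | 0.156 | 0.913 | 0.875 | 0.147 | 0.190 |
| | ANN | 0.877 | 0.759 | 0.256 | 0.204 | 0.902 | 0.813 | 0.244 | 0.306 | 0.896 | 0.842 | 0.156 | 0.149 |
| G + First derivative + MSC + SPA | PLS | 0.923 | 0.823 | 0.217 | 0.1965 | 0.759 | 0.682 | 0.294 | 0.278 | 0.689 | 0.369 | 0.178 | 0.253 |
| | OPLS | 0.964 | 0.916 | 0.264 | 0.205 | 0.974 | 0.933 | 0.217 | 0.196 | 0.958 | 0.930 | 0.168 | 0.249 |
| | ANN | 0.943 | 0.893 | 0.235 | 0.330 | 0.905 | 0.875 | 0.309 | 0.248 | 0.901 | 0.798 | 0.263 | 0.254 |
| SG + First derivative + MSC + SPA | PLS | 0.907 | 0.749 | 0.266 | 0.198 | 0.914 | 0.823 | 0.246 | 0.319 | 0.698 | 0.498 | 0.261 | 0.296 |
| | OPLS | 0.955 | 0.932 | 0.209 | 0.185 | 0.937 | 0.909 | 0.196 | 0.177 | 0.961 | 0.905 | 0.207 | 0.321 |
| | ANN | 0.899 | 0.865 | 0.241 | 0.238 | 0.904 | 0.846 | 0.269 | 0.310 | 0.921 | 0.893 | 0.224 | 0.156 |
| MA + First derivative + MSC + SPA | PLS | 0.486 | 0.369 | 0.356 | 0.301 | 0.780 | 0.656 | 0.314 | 0.291 | 0.653 | 0.526 | 0.266 | 0.197 |
| | OPLS | 0.971 | 0.926 | 0.186 | 0.175 | 0.956 | 0.907 | 0.146 | 0.159 | 0.967 | 0.936 | 0.156 | 0.164 |
| | ANN | 0.942 | 0.901 | 0.284 | 0.324 | 0.922 | 0.907 | 0.329 | 0.261 | 0.936 | 0.910 | 0.206 | 0.159 |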

Kong, H.; Wang, J.; Lin, G.; Chen, J.; Xie, Z. Analysis of Nutritional Content in Rice Seeds Based on Near-Infrared Spectroscopy. Photonics 2025, 12, 481. https://doi.org/10.3390/photonics12050481
