1. Introduction
In recent decades, China’s rapid development in animal husbandry has spurred corresponding growth in the feed industry. The healthy development of the feed sector relies on a stable raw material supply system. According to a report by the China Feed Industry Association, China’s protein feed gap—including soybeans—is projected to exceed 124 million tons by 2035. Feed demand continues to rise due to growing demand for livestock products and structural shifts, making shortages increasingly prominent [
1]. Cottonseed is rich in protein and oil, and the protein content of cottonseed meal after dehulling and oil extraction far exceeds that of major feed crops like corn and wheat; crude protein content varies significantly among different varieties of cottonseed meal, ranging from approximately 39.12% to 59.63% [
2]. China’s annual cottonseed production reaches approximately 10 million tons. According to a notice from the General Office of the Ministry of Agriculture and Rural Affairs regarding the “Three-Year Action Plan for Reducing Soybean Meal Replacement,” cottonseed protein has considerable potential to replace soybean meal protein [
3,
4]. According to an industry report (QYResearch, “Current Status and Future Development Trends of the Global and Chinese Cottonseed Protein Market, 2025–2031”), the global dephenolized cottonseed protein market size reached approximately
$441 million in 2024 and continues to grow steadily. Representative domestic enterprises include Xinjiang Chenguang Biotechnology Group Co., Ltd. (Tumxuk, China) and Xinjiang Jinlan Plant Protein Co., Ltd. (Shihezi, China). When procuring cottonseed, feed enterprises typically grade and price purchases by nutritional indicators, particularly crude protein content, and therefore favor cottonseed with higher crude protein levels to improve cost-effectiveness. However, no rapid method currently exists for predicting crude protein content in raw cottonseed, so purchase prices are set only roughly, based on variety and origin. A rapid detection technology for cottonseed protein content is therefore needed to protect the economic interests of feed and cottonseed processing enterprises.
Currently, to determine the protein content of raw cottonseed, the seed’s fluff and hull must be removed, and the cottonseed kernel ground into powder. The protein content is then measured using chemical methods [
5]. According to GB 5009.5-2016 “Determination of Protein in Foods” [
6], the primary chemical methods for measuring protein content in cottonseed include the Kjeldahl method [
7], spectrophotometric method [
8], and combustion method. Among these, the Kjeldahl method is the most widely applied. The procedure involves the following steps: First, prepare the necessary chemical reagents. Catalytically heat the protein in the cottonseed to decompose it, producing ammonia that reacts with sulfuric acid to form ammonium sulfate. Alkaline distillation releases the ammonia, which is absorbed by boric acid and then titrated with a standard hydrochloric acid solution. Finally, nitrogen content is calculated based on acid consumption, multiplied by a conversion factor to yield protein content. This method is complex, time-consuming, and destructive, failing to meet the demands of feed enterprises [
9].
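The final calculation step of the Kjeldahl procedure described above can be illustrated with a short Python sketch. It assumes that the whole digest is distilled and titrated, that a reagent blank is subtracted, and that the general nitrogen-to-protein conversion factor of 6.25 applies; the factor actually used for cottonseed is not stated in this section, and the function name and arguments are hypothetical.

```python
NITROGEN_MOLAR_MASS = 14.007  # g/mol, i.e. mg of nitrogen per mmol


def kjeldahl_protein_percent(v_sample_ml, v_blank_ml, acid_molarity,
                             sample_mass_g, factor=6.25):
    """Crude protein (% of sample mass) from a Kjeldahl titration.

    factor=6.25 is an assumed general conversion factor, not a value
    taken from the study.
    """
    if sample_mass_g <= 0:
        raise ValueError("sample mass must be positive")
    # mL * mol/L = mmol of HCl, equal to mmol of nitrogen from the sample
    mmol_nitrogen = (v_sample_ml - v_blank_ml) * acid_molarity
    nitrogen_mg = mmol_nitrogen * NITROGEN_MOLAR_MASS
    nitrogen_percent = nitrogen_mg / (sample_mass_g * 1000.0) * 100.0
    return nitrogen_percent * factor


# Illustrative (made-up) titration of a 0.4 g sample with 0.1 mol/L HCl.
protein = kjeldahl_protein_percent(18.0, 0.2, 0.1, 0.4)
```

With these made-up inputs the nitrogen content is about 6.23% and the crude protein about 38.96%.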
To achieve rapid non-destructive detection of protein content in cottonseed, researchers have conducted relevant studies using spectroscopic methods. Qin et al. [
10] employed partial least squares regression to construct a near-infrared calibration model for protein content in cottonseed meal powder using 207 samples of cottonseed material grown across multiple years and locations, achieving an R
2 value of 0.933. Huang et al. [
11] collected Fourier transform NIR reflectance spectra of ginned cotton seeds. They employed a nonlinear least-squares support vector machine combined with a Monte Carlo non-informative variable elimination method to construct a near-infrared calibration model for whole cotton seed protein content, achieving an R
2 of 0.959 for the protein model and enabling rapid analysis of whole cotton seed nutritional quality. Ma et al. [
12] collected Fourier-transform NIR reflectance spectra from fuzzy cottonseeds. They optimized the model using spectral preprocessing methods such as multiplicative scatter correction and first-order derivatives, and established a rapid NIR determination model for protein content in raw cottonseeds via random forest analysis, achieving an R
2 value as high as 0.9459. Yang et al. [
13] determined the chemical values of cotton seed protein content using the Kjeldahl method. They collected spectral characteristic information of three different cotton seed morphologies—raw seeds, polished seeds, and kernels—using a near-infrared spectrometer. After applying various mathematical and scattering corrections to the raw spectra, they established near-infrared spectral models for rapid non-destructive detection of protein content in raw seeds, polished seeds, and kernels using an improved partial least squares method. The models achieved R
2 values of 0.957, 0.971, and 0.989, respectively.
NIR reflection spectroscopy is an effective method for determining the protein content of cottonseed. Combining NIR reflectance spectroscopy with machine learning has also been successfully applied to the detection of other indicators in cottonseed, such as oil content [
14], fatty acid content [
15], gossypol content [
16], and phytic acid [
17]. The aforementioned studies for detecting cottonseed protein content all used Fourier NIR scanning spectroscopy, which collects more detailed data and provides theoretical support for protein content detection. However, this method has the drawbacks of slow detection speed and high equipment costs. Compared to the FT-NIR scanning method for samples, the sample collection approach using fiber optic probes and multi-grain sample pools employed in this study offers superior detection speed. While FT-NIR boasts higher resolution and stable optical paths, fiber optic NIR instruments are slightly less sophisticated in specifications but are more cost-effective and better aligned with industrial needs. However, there is limited research on its application in cottonseed prediction, primarily due to the difficulty in collecting NIR spectral data from multiple cottonseed samples using fiber optic spectroscopy [
18]. Additionally, fuzzy cottonseed significantly increases the difficulty of NIR fiber optic spectroscopy detection due to the presence of cotton lint. Only Li et al. [
4] employed near-infrared fiber optic spectroscopy to collect near-infrared spectral data of fuzzy cottonseed. Through spectral preprocessing and feature wavelength selection algorithms, they established a regression model for fuzzy cottonseed oil content, thereby achieving the detection of oil content in fuzzy cottonseed. However, no literature has been found regarding studies on detecting the protein content of fuzzy cottonseed using NIR fiber optic spectroscopy.
To this end, this study takes fuzzy cottonseed as the research subject, constructing a multi-sample pool of fuzzy cottonseed and collecting its near-infrared (NIR) spectra using an NIR fiber optic spectrometer. The spectra are then preprocessed with spectral preprocessing algorithms to reduce noise introduced during acquisition. Subsequently, a feature wavelength selection algorithm is used to choose a set of key wavelengths that reflect the protein content of fuzzy cottonseed. Finally, a regression model for detecting the protein content of fuzzy cottonseed is established using machine learning algorithms. Establishing a rapid, non-destructive, and efficient detection technique for fuzzy cottonseed protein is expected to accelerate the development of corresponding rapid detection devices.
2. Materials and Methods
2.1. Fuzzy Cottonseed Sample
To obtain fuzzy cottonseed samples with different protein content distributions, 3600 fuzzy cottonseeds from three varieties (1200 seeds per variety) were obtained from Xinlai Da Pengfeng Cotton Industry Co., Ltd. in Alar, China. The three varieties were Xinluzao-50, Xinluzao-49, and Xinluzao-27; all were sourced from Alar to maintain sample consistency. To ensure sample uniformity and stability, the fuzzy cottonseeds were screened using the boiling water scalding method: each variety was scalded with boiling water and stirred for 1 min, then mixed with three times its volume of cool water until evenly distributed. Healthy seeds of comparable size exhibiting deep brown or dark reddish-brown hues were selected, dried at 38–40 °C, allowed to reach moisture equilibrium for 2 days, and stored in sealed containers. The 1200 seeds selected for each variety were randomly divided into 40 groups of 30 seeds each, yielding 120 groups across the three varieties. The samples were stored at 20 °C and 60% relative humidity until NIR spectral data collection and protein content determination.
Table 1 shows the protein content distribution ranges of the fuzzy cottonseed samples from the three varieties. The average protein content for each variety was 39.326%, 40.619%, and 40.851%, respectively. By selecting samples from these three varieties, a range of protein content distributions can be obtained, thereby enhancing the reliability of subsequent modeling results.
2.2. NIR Spectroscopy Acquisition
2.2.1. Spectral Acquisition System
As shown in Figure 1, the NIR spectroscopy data acquisition system for cottonseed consists of four components: a computer, an NIR spectrometer, an optical fiber probe, and a sample cell. The instrument model is the SupNIR-1500 portable NIR analyzer (manufactured by Hangzhou Polytech Co., Ltd., Hangzhou, China), with a wavelength range of 1000 nm to 1800 nm and a data sampling interval of 1 nm. It employs a grating scanning configuration and is equipped with a fiber optic probe (featuring a 10 W NIR light source). The sample cell is a transparent cuvette measuring 6 cm in length, 5 cm in width, and 1 cm in thickness.
2.2.2. Spectral Analysis of Fuzzy Cottonseed
Thirty cottonseed grains were uniformly distributed in a 5 × 6 grid within the sample cell. NIR spectral data were collected from five distinct positions within the cell. The spectral average from these five positions was calculated as the spectral data for a single sample. A total of 120 cottonseed samples yielded NIR reflectance spectral data.
Figure 1 indicates the positions where cottonseed spectra were collected, marked by red circles.
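The averaging step described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and each spectrum is assumed to be a list of reflectance values on a common wavelength grid.

```python
def average_spectra(position_spectra):
    """Average spectra recorded at several positions of one sample cell.

    position_spectra: list of equal-length lists, one list per probe
    position (five positions per sample in the study).
    """
    n_positions = len(position_spectra)
    if n_positions == 0:
        raise ValueError("at least one spectrum is required")
    n_wavelengths = len(position_spectra[0])
    if any(len(s) != n_wavelengths for s in position_spectra):
        raise ValueError("all spectra must share the same wavelength grid")
    # Point-by-point mean across positions
    return [sum(s[j] for s in position_spectra) / n_positions
            for j in range(n_wavelengths)]
```

Averaging over positions reduces the influence of seed orientation and lint distribution at any single probe location on the spectrum assigned to the 30-seed sample.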
2.3. Determination of Protein Content in Samples
After collecting the spectral data of cottonseeds, a sample containing 30 cottonseeds was processed to remove lint and hulls, yielding cottonseed kernels, which were then ground into powder. The protein content of the cotton seed kernel powder was then determined using the Kjeldahl method, following the experimental procedures specified in GB 5009.5-2016 “Determination of Protein in Foods” [
6].
The main steps of the Kjeldahl nitrogen determination method are as follows: Using weighing paper, take approximately 0.4 g of each group of cottonseed samples and place them into digestion tubes. Add catalyst tablets (composed of copper sulfate and potassium sulfate) and an appropriate amount of concentrated sulfuric acid to the digestion tubes. Then label them and place them in the digestion furnace for digestion. Digestion occurs in two stages: first at 200 °C for one hour, then at 420 °C for one hour. After digestion, remove the tubes. The liquid inside will appear green and transparent. After cooling for one hour, it turns pale blue and transparent. During the digestion period, prepare the standard solutions required for the automatic Kjeldahl nitrogen analyzer. Finally, use the automated Kjeldahl nitrogen determinator (preheat the instrument for at least 15 min, set the standard solution concentration, configure the desired program, and input the sample weight into the program) to measure the protein content of the cooled liquid, thereby determining the protein content of the raw cottonseed sample. The Kjeldahl nitrogen determinator used in this study is an 8400 Kjeltec™ Automated Kjeldahl Nitrogen Determinator (FOSS Analytical A/S, Hillerød, Denmark).
2.4. Dataset Preparation
The partitioning of the fuzzy cottonseed NIR spectral dataset directly affects the reliability of the protein content detection model for fuzzy cottonseed established by subsequent machine learning algorithms. If the number or representativeness of the calibration samples is insufficient, the fuzzy cottonseed protein prediction model will suffer from underfitting; conversely, an excessive number of samples will lead to overfitting of the model [
19]. Common dataset partitioning methods in the field of spectral analysis include the Kennard-Stone (K-S) algorithm and the Sample Set Partitioning Based on Joint X-Y Distance (SPXY) method [
20]. The K-S algorithm assigns fuzzy cottonseed samples with relatively large Euclidean distances in the variable space to the calibration set; however, this method does not account for the influence of the sample concentration variable. The SPXY method, based on the K-S algorithm, redefines the relative Euclidean distance of fuzzy cottonseed samples by incorporating sample concentration (protein content), thus avoiding the problem of uneven sample distribution caused by the K-S method [
21]. Given that SPXY offers advantages in enhancing inter-sample variability and representativeness, this study used the SPXY method [
22] to partition the NIR data and protein content data of the 120 fuzzy cottonseed samples into calibration and prediction datasets at a 2:1 ratio. The calibration set consisted of 80 samples (29, 33, and 18 from Xinluzao-50, Xinluzao-49, and Xinluzao-27, respectively), and the independent prediction set consisted of 40 samples (11, 7, and 22 from the same varieties, respectively).
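A minimal Python sketch of SPXY-style partitioning is given below for illustration. It assumes the common formulation in which pairwise Euclidean distances in spectral space and absolute differences in protein content are each divided by their maximum and then summed; the function name and arguments are hypothetical, and this is not the study's MATLAB code.

```python
import math


def spxy_split(X, y, n_cal):
    """Return (calibration_indices, prediction_indices) chosen SPXY-style."""
    n = len(X)
    if not 2 <= n_cal <= n:
        raise ValueError("n_cal must be between 2 and the number of samples")
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    max_dx = max(map(max, dx)) or 1.0  # avoid division by zero
    max_dy = max(map(max, dy)) or 1.0
    # Joint x-y distance: both parts are scaled to [0, 1] before summing
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]
    # Seed the calibration set with the two most distant samples
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda pair: d[pair[0]][pair[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    # Repeatedly add the sample farthest from its nearest selected sample
    while len(selected) < n_cal:
        best = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining
```

For the dataset described here, the call would take the 120 averaged spectra, the 120 Kjeldahl protein values, and `n_cal=80`; the 40 indices left in the second list form the prediction set.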
2.5. Pretreatment of Spectral Data
During the collection of fuzzy cottonseed NIR spectral data, factors such as changes in ambient temperature and light, as well as the instrument and sample particle size, can result in the inclusion of noisy data in the fuzzy cottonseed NIR spectral data [
22,
23]. The aforementioned noisy data can cause significant interference in the subsequent establishment of the fuzzy cottonseed protein content prediction model, potentially leading to an unreliable model. Therefore, it is essential to use relevant spectral data preprocessing algorithms to eliminate or reduce noise interference [
24]. Common spectral data preprocessing methods include Standard Normal Variate transformation (SNV), Savitzky–Golay convolution smoothing (SG), normalization, first derivative (1D), and Multiplicative Scatter Correction (MSC) [
25,
26]. MSC and SNV are mainly used to eliminate surface scattering noise from fuzzy cottonseed samples. SG effectively removes random noise, 1D eliminates or reduces baseline and background interference, and normalization corrects spectral variations caused by slight optical path differences. This study employs MATLAB R2022a to preprocess spectral data, select characteristic wavelengths, and develop predictive models.
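Two of the preprocessing methods named above, SNV and the first derivative, can be sketched in Python as follows. The study performed preprocessing in MATLAB; the functions below are hypothetical illustrations, SNV uses the sample standard deviation, and the derivative is a plain finite difference rather than a Savitzky–Golay derivative.

```python
import math


def snv(spectrum):
    """Standard Normal Variate: center and scale one spectrum by its own
    mean and standard deviation, which reduces scatter-induced offsets
    and multiplicative effects."""
    n = len(spectrum)
    if n < 2:
        raise ValueError("a spectrum needs at least two points")
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    if sd == 0:
        raise ValueError("constant spectrum cannot be scaled")
    return [(v - mean) / sd for v in spectrum]


def first_derivative(spectrum, step_nm=1.0):
    """Finite-difference first derivative; removes constant baseline
    offsets. step_nm is the wavelength spacing (1 nm for the instrument
    described in Section 2.2.1)."""
    return [(spectrum[i + 1] - spectrum[i]) / step_nm
            for i in range(len(spectrum) - 1)]
```

Because SNV is applied to each spectrum independently, two spectra that differ only by a positive multiplicative factor and an additive offset map to the same corrected spectrum.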
2.6. Feature Wavelength Point Selection
The fuzzy cottonseed NIR spectral data collected in this study consists of 800 dimensions, containing both valid information for protein content detection and a significant amount of irrelevant information. This irrelevant information not only leads to data redundancy but also severely interferes with the speed and accuracy of the fuzzy cottonseed protein content prediction model established subsequently. Therefore, it is necessary to use relevant feature wavelength extraction algorithms to select the information in the NIR spectral data that reflects the protein content of fuzzy cottonseed, thereby reducing the dimensionality of the fuzzy cottonseed NIR data. This study used Uninformative Variable Elimination (UVE), Competitive Adaptive Reweighted Sampling (CARS), and Random Frog (RF) algorithms to extract the feature wavelengths that reflect the protein content of fuzzy cottonseed.
The UVE algorithm appends random noise variables of the same dimension to the original NIR spectral matrix of fuzzy cottonseed, constructing a composite matrix that combines real spectral variables with uninformative noise. A partial least squares regression model is then fitted to this matrix, and its regression coefficients are used to evaluate the stability of each wavelength variable. UVE compares these stability values with the maximum stability of the random noise variables and eliminates the variables whose stability is below that threshold [
27]. The remaining variables form a new feature subset, which is then used to establish a faster and more accurate prediction model. CARS performs adaptive reweighted sampling governed by an exponential decay function and uses the absolute values of the regression coefficients as variable weights. Following the principle of survival of the fittest, it eliminates wavelength variables with smaller weights in each sampling run. The cross-validation root mean square error (RMSECV) is computed for the variable subset retained in each run, and the subset with the minimum RMSECV is taken as the extracted set of spectral feature wavelengths [
28]. The RF algorithm is a feature wavelength selection method based on probabilistic sampling [
29]. It first randomly initializes a set of wavelength points to construct the initial prediction model, and then changes the wavelength combination state through operations such as “addition,” “deletion,” and “exchange” of variables, simulating the jumping behavior of frogs. After each jump, the acceptance probability is calculated based on the model’s prediction performance, and the new wavelength combination is accepted or rejected according to the Metropolis–Hastings criterion. Finally, the frequency at which each wavelength point is selected during the sampling process is counted, and the most characteristic wavelength set is chosen.
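The exponential decay function that drives variable elimination in CARS can be illustrated with a short Python sketch. It covers only the decay schedule, not the adaptive reweighted sampling or the PLS fitting, and it assumes the commonly used form in which all variables are kept in the first run and two remain in the last. The function name and the choice of 50 sampling runs are assumptions for illustration.

```python
import math


def cars_retained_counts(n_variables, n_runs):
    """Number of wavelengths kept in each CARS sampling run.

    The retained ratio in run i is r_i = a * exp(-k * i), with a and k
    chosen so that r_1 = 1 and r_N = 2 / n_variables.
    """
    if n_variables < 2 or n_runs < 2:
        raise ValueError("need at least two variables and two runs")
    a = (n_variables / 2.0) ** (1.0 / (n_runs - 1))
    k = math.log(n_variables / 2.0) / (n_runs - 1)
    counts = []
    for i in range(1, n_runs + 1):
        ratio = a * math.exp(-k * i)
        counts.append(max(2, round(n_variables * ratio)))
    return counts


# 800 spectral variables (as stated in this section), 50 assumed runs
schedule = cars_retained_counts(800, 50)
```

The schedule removes many wavelengths in early runs and few in later runs, so coarse elimination is followed by fine selection among the remaining candidates.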
2.7. Modeling Methods
The feature wavelengths selected by the aforementioned algorithms cannot by themselves predict the protein content of fuzzy cottonseed; a regression model is required. This study evaluated three regression algorithms, Partial Least Squares Regression (PLSR), Least Squares Support Vector Machine (LSSVM), and Support Vector Regression (SVR), to establish a protein content prediction model for fuzzy cottonseed. PLSR combines principal component analysis with multivariate linear regression. It projects spectral data and fuzzy cottonseed protein content data into a new space, where it seeks a multivariate linear regression model between the independent and dependent variables to predict protein content [
30]. LSSVM transforms the inequality constraints in SVM into equality constraints, thereby simplifying the solution of Lagrange multipliers. This converts the quadratic programming problem into a linear equation system, reducing computational complexity and improving calibration speed and convergence accuracy. Therefore, this paper employs a radial basis function (RBF) kernel with a penalty parameter of 100 and a kernel width of 25 to achieve an LSSVM with good generalization capability. Ten-fold cross-validation is used to tune the hyperparameters [
31,
32], with the 10-fold cross-validation applied exclusively within the calibration set for hyperparameter optimization. The SVR model employs the concept of support vectors to perform a nonlinear mapping of low-dimensional data into a high-dimensional space, thereby enabling regression analysis within that high-dimensional space. The advantage of the SVR model lies in its lack of restrictions on the distribution of the data, effectively addressing challenges associated with small sample sizes, nonlinear relationships, and high dimensionality [
33]. In this study, the SVR uses the RBF kernel.
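How LSSVM reduces training to a linear system can be shown with a small Python sketch. It assumes the kernel form exp(-||x - z||² / σ²) and treats the reported kernel width of 25 as σ²; the section does not state which convention was used, so both are assumptions. The function names are hypothetical, and the solver is a simple Gaussian elimination intended for small examples only.

```python
import math


def rbf_kernel(a, b, sigma2):
    # Assumed form: exp(-||a - b||^2 / sigma2)
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / sigma2)


def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lssvm_fit(X, y, gamma=100.0, sigma2=25.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma2) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # ridge term from the penalty parameter
        A.append(row)
    solution = solve_linear_system(A, [0.0] + list(y))
    return solution[0], solution[1:]  # bias, alphas


def lssvm_predict(X_train, bias, alphas, x_new, sigma2=25.0):
    return bias + sum(a * rbf_kernel(xt, x_new, sigma2)
                      for a, xt in zip(alphas, X_train))
```

Because the constraints are equalities, every calibration sample receives a nonzero coefficient, and the training residual of sample i equals its coefficient divided by the penalty parameter.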
2.8. Model Evaluation Indicators
The evaluation metrics for regression models primarily include the coefficient of determination (R
2), Root Mean Square Error (RMSE), Residual Predictive Deviation (RPD), and Range Error Ratio (RER). R
2 reflects the model’s ability to explain total variation, with values ranging from 0 to 1. The closer R
2 is to 1, the better the model’s performance. RMSE primarily measures the deviation between the observed and predicted values. A smaller RMSE indicates better predictive performance of the model. RMSE
C/RMSE
P represent the root mean square error for the calibration and prediction sets, respectively. RPD is the ratio of the standard deviation of the true sample values to the RMSE and characterizes the model's predictive capability. Generally, an RPD value exceeding 1.4 indicates the model can be used for rough sample evaluation, while an RPD value exceeding 2 signifies excellent predictive capability. RER is the ratio of the true range of the sample to the RMSE, reflecting the model's performance across the entire data range. When 7 ≤ RER < 10, the model is suitable for screening and calibration; when RER ≥ 10, the model is suitable for quantitative analysis.
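The metrics are calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$\mathrm{RPD} = \frac{SD}{\mathrm{RMSE}_P}$$

$$\mathrm{RER} = \frac{y_{\max} - y_{\min}}{\mathrm{RMSE}_P}$$

In these equations, $n$ denotes the number of samples in the calibration or prediction set, depending on which set the calculation is based on; $\hat{y}_i$ is the model's predicted value for the $i$-th cottonseed sample; $y_i$ is the actual value of the $i$-th cottonseed sample; $\bar{y}$ is the mean of all actual cottonseed sample values; and $SD$ denotes the standard deviation of the actual values in the prediction set. Additionally, $\mathrm{RMSE}_P$ represents the RMSE for the prediction set, and $y_{\max}$ and $y_{\min}$ are the maximum and minimum protein content of fuzzy cottonseed in the prediction set.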
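The four evaluation metrics can be computed with a few lines of Python. This sketch follows the verbal definitions in this section; the function name is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) for RPD is an assumption, since the section does not specify it.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE, RPD and RER for one sample set."""
    n = len(actual)
    if n < 2 or n != len(predicted):
        raise ValueError("need two equal-length lists with >= 2 values")
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_res == 0 or ss_tot == 0:
        raise ValueError("metrics are undefined for zero error or zero spread")
    rmse = math.sqrt(ss_res / n)
    sd = math.sqrt(ss_tot / (n - 1))  # assumed: sample standard deviation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "rpd": sd / rmse,
        "rer": (max(actual) - min(actual)) / rmse,
    }
```

Applied to the prediction set, the returned values can be compared directly with the thresholds above (RPD above 2, RER of at least 10).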
4. Discussion
In this study, NIR fiber optic spectral data for fuzzy cottonseed were first collected, and the SNV algorithm was applied to reduce surface scattering and noise in the spectra. The UVE, CARS, and RF algorithms, individually or in combination, were then used to select feature wavelengths associated with protein content from the NIR spectra of fuzzy cottonseed. Subsequently, PLSR, LSSVM, and SVR were employed to construct the corresponding protein content models, addressing the challenge of rapid protein content detection. Numerous studies have examined the detection of protein content in cottonseed. For instance, Qin et al. [
10] constructed a NIR calibration model for protein content in cottonseed meal using 207 samples of cottonseed materials grown across multiple years and locations. The model employed partial least squares regression and achieved an R
2 value of 0.933. By working with cottonseed meal powder, Qin et al. eliminated spectral interference from cottonseed hulls and fibers, which improved the characterization of key chemical bond signals and therefore the detection accuracy; the present study, in contrast, measures intact fuzzy cottonseed, for which that interference remains. Huang et al. [
11] gathered Fourier NIR spectral data from light cottonseed and developed a protein content prediction model using MC-UVE and LSSVM, achieving an R
2 of 0.959. Compared to studies employing FT-NIR, the LSSVM model based on the combination of UVE and CARS in this research achieved R
2 = 0.8571, RPD = 2.7078, and RER = 10.72, meeting the criteria for quantitative analysis while offering distinct advantages in terms of speed and low cost. Therefore, the methodology adopted by this research will play a significant role in the feed processing industry, effectively addressing the key issues in the detection of protein content in cottonseed.
Due to the small size of cottonseeds, determining the protein content of individual cottonseeds using an NIR spectrometer or chemometrics presents significant challenges. Therefore, protein content can typically only be measured for a group of cottonseeds. Collecting the NIR spectra of individual fuzzy cottonseeds in each group is not only time-consuming but also requires considerable manpower. In this study, 30 fuzzy cottonseeds were grouped as a single sample, and five NIR spectra were collected from different positions of each sample. The mean value of these spectra was used as the representative spectral data for the sample. Subsequently, these spectral data underwent preprocessing, followed by feature wavelength selection. Based on the evaluation of the constructed models, this data collection approach proves to be highly effective in detecting fuzzy cottonseed protein content while ensuring operational efficiency.
Due to the light-scattering effect of cottonseed fibers, direct PLSR modeling of raw cottonseed spectra after SNV preprocessing yielded poor protein content predictions, indicating that substantial redundant information persists even in preprocessed spectra. In contrast, the models established after UVE-CARS screening performed markedly better than those built on the full spectrum, which indicates that the selected feature wavelengths are more informative for protein prediction than the full spectrum. Moreover, UVE-CARS reduced the number of wavelengths by over ninety percent relative to the full spectrum, which greatly reduced redundant information and streamlined the predictive models.
This study employed only three cotton seed varieties, with a limited number of experimental samples. The portable spectrometer used supported only the 1000~1800 nm wavelength range, resulting in certain limitations. In future studies, if conditions permit, we plan to collect a larger number of cottonseed samples and conduct systematic comparisons across different seed kernel counts. This will enable quantitative assessment of accuracy at varying kernel densities. Additionally, we intend to acquire an advanced portable near-infrared spectrometer with a broader spectral coverage. These measures will help further optimize the existing model, enhancing its performance and generalization capabilities. The algorithms employed in this paper are relatively few. For preprocessing, techniques such as wavelet transforms could be explored. Regarding feature wavelength extraction algorithms, methods like the Successive Projection Algorithm (SPA) and Principal Component Analysis (PCA) could be tested. For modeling, subsequent studies could experiment with algorithms like Random Forests. Chemometrics research is advancing rapidly, with algorithms continually refined and optimized. Future research may explore new algorithmic modeling approaches.
5. Conclusions
This study combines NIR fiber optic spectroscopy with machine learning to establish a model for predicting the protein content of fuzzy cottonseed. Prediction models for raw cottonseed protein content were developed using PLSR, LSSVM, and SVR. The results indicate that SNV was the best-performing preprocessing algorithm and UVE-CARS the best-performing feature wavelength selection method, with LSSVM outperforming PLSR and SVR. The LSSVM model based on UVE-CARS achieved R², RMSE, RPD, and RER values of 0.8571, 0.0033, 2.7078, and 10.72, respectively. The feature wavelengths reflecting cottonseed protein content in the 1000~1800 nm near-infrared band were primarily distributed between 1070~1130 nm, 1160~1200 nm, 1250~1280 nm, 1380~1460 nm, 1480~1630 nm, and 1660~1740 nm. Furthermore, compared to FT-NIR, near-infrared fiber optic spectroscopy offers faster detection and lower cost, making it well suited to the rapid, non-destructive testing requirements of feed enterprises. This study therefore provides an effective method for determining crude protein content in cottonseed. In future applications, excluding uninformative spectral bands can reduce the cost of instrument development, laying a foundation for related equipment research and development.