Article

Detection of Soluble Solid Content in Xinyu Pears Using Near-Infrared Spectroscopy and Deep Fusion of Multi-Preprocessed Spectral Data

School of Information Engineering, Huzhou Normal University, Huzhou 313000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2026, 16(10), 4732; https://doi.org/10.3390/app16104732
Submission received: 9 April 2026 / Revised: 1 May 2026 / Accepted: 5 May 2026 / Published: 10 May 2026
(This article belongs to the Section Food Science and Technology)

Abstract

Xinyu pear is one of the important pear cultivars in China. Owing to its rich nutritional composition, high quality, and distinctive flavor, it is highly favored in the market. In this study, near-infrared spectroscopy was employed to determine the soluble solid content (SSC) of Xinyu pears. To investigate the influence of spectral preprocessing on SSC prediction, near-infrared spectra of two batches of Xinyu pear samples were collected using the same portable spectrometer under different acquisition parameters, resulting in differences in spectral bands. A linear interpolation method was introduced to the first batch to generate a new dataset to match the dimensionality of the second batch, and a total of three datasets were used. Five preprocessing methods, including moving average smoothing (MA), standard normal variate transformation (SNV), multiplicative scatter correction (MSC), first derivative (D1), and second derivative (D2), together with three regression models, namely partial least squares regression (PLSR), support vector regression (SVR), and convolutional neural network (CNN), were systematically evaluated and compared in terms of predictive accuracy. Overall, PLSR achieved the best prediction performance, followed by CNN and SVR. Certain differences in model performance were observed among the three datasets. In general, MA exhibited the best overall performance across different datasets and models. Although SNV and MSC were slightly inferior to MA, they showed relatively stable predictive accuracy. By contrast, prediction models based on derivative spectra generally performed poorly. To further exploit the complementary information contained in differently preprocessed spectra, a four-branch CNN model was constructed using raw spectra, MA-preprocessed spectra, SNV-preprocessed spectra, and MSC-preprocessed spectra as separate inputs. Based on the fused features extracted by the CNN, PLSR and SVR models were subsequently developed. 
The prediction correlation coefficients of the feature-fusion CNN model on the prediction sets of the three datasets were 0.8811, 0.8259, and 0.7064, respectively. For the original datasets of the first and second batches, the feature-fusion model outperformed all single-preprocessing models. For the dataset generated by linear interpolation, the predictive performance of the feature-fusion strategy was comparable across the three models; specifically, its accuracy in SVR exceeded that of all single-preprocessing models, while its accuracies in CNN and PLSR surpassed those of most preprocessing methods. These results demonstrate that integrating feature information from spectra subjected to different preprocessing methods is a feasible strategy for improving prediction accuracy. This study provides an effective reference for SSC prediction in Xinyu pears based on portable spectrometers.

1. Introduction

The pear is a deciduous fruit tree of the genus Pyrus in the family Rosaceae and is one of the important fruits widely cultivated worldwide. Its fruit is rich in nutrients, containing abundant water, soluble sugars, organic acids, vitamins, and various trace elements, and thus has high edible and medicinal value [1]. The pear is widely cultivated in China, and its production scale ranks third in China’s fruit industry, behind only oranges and apples [2]. Among pear cultivars, Xinyu pear is one of the important pear cultivars in China. It is rich in nutrients and is highly favored by the market for its high quality and unique taste, with broad market prospects.
Fruit quality directly affects consumer acceptance and market value; among the quality attributes, internal quality attributes are particularly important [3,4]. However, traditional fruit quality detection is currently inefficient and still relies mainly on manual observation of external appearance. It does not allow quantitative analysis, is easily affected by subjectivity, and also incurs high labor costs. At the same time, destructive biochemical experiments can be used to accurately measure a specific component of fruit, but this method severely damages fruit samples, involves cumbersome procedures, and has strong limitations in the indicators it can detect, making it unsuitable for multi-index detection [5].
Visible/near-infrared (Vis/NIR) spectroscopy is a nondestructive detection technique based on the spectral characteristics of visible and near-infrared light in different electromagnetic wavebands, and it is a nondestructive, rapid, stable, and accurate method [6]. Owing to individual differences in internal chemical composition and biological structure among fruits, different fruit samples produce different spectral responses in the absorption, scattering, and reflection of spectral radiation. Based on this property, spectroscopic techniques can obtain different spectral data from different samples [7]. By establishing and analyzing correlation models between spectral data and indicators such as soluble solid content (SSC), total acidity (TA), pH, firmness index (FI), and moisture content (MC) in fruits [8], spectroscopic techniques can rapidly determine the contents of multiple components, with the advantages of being rapid, nondestructive, and accurate. In recent years, Vis/NIR spectroscopy has been widely applied to the quality detection of various fruits, such as apples [9], peaches [10], and citrus [11], because of its rapid and nondestructive characteristics.
SSC is one of the most important parameters for evaluating fruit sweetness and maturity, and it is mainly composed of sucrose, glucose, and other soluble substances [12]. SSC is commonly used as a quality indicator for fruit in research. A higher SSC in pear fruit means a sweeter taste and greater popularity in the consumer market. Sweetness is therefore a critical factor for consumers when selecting pears, and SSC is strongly correlated with pear sales. Nowadays, an increasing number of researchers use Vis/NIR spectroscopy and machine learning methods to establish models between pear spectra and SSC [13,14,15,16], creating a growing demand for artificial intelligence technologies to assist in SSC detection in pears.
In recent years, machine learning and deep learning models have gradually become the main tools for multidimensional data modeling in combination with analytical instruments such as spectrometers. However, during the acquisition of spectral data, interference from external objective factors, such as physical differences among individual samples, surface light scattering, and instrument dark current, is unavoidable. Therefore, preprocessing raw spectra before establishing prediction models, in order to remove background noise and standardize the spectral baseline, has become a common method in the nondestructive detection of fruit quality [17].
A large number of researchers have conducted targeted studies on the modeling performance of different preprocessing algorithms. Jiang et al. [18] conducted a cross-comparison of five preprocessing methods and five algorithms in citrus quality detection and confirmed that Savitzky–Golay (SG) smoothing was the most suitable for improving the prediction accuracy of citrus SSC. Carvalho et al. [19] comprehensively compared seven preprocessing methods and six algorithms in spectral data analysis, and ultimately established that the combination of SNV and SVR was the optimal solution for eliminating interference and improving model prediction performance. Sharabiani et al. [20] conducted an in-depth comparison of multiple single and combined preprocessing techniques in the nondestructive detection of peaches and found that MSC produced the most obvious improvement in the prediction of peach SSC. Jiang et al. [21] effectively eliminated spectral scattering interference caused by differences in fruit diameter by using SNV in experiments on apple SSC detection, thereby improving the predictive performance of the model. Vega-Castellote et al. [22], in online quality detection of watermelon, likewise confirmed the necessity of spectral preprocessing in dealing with complex physical interference in industrial assembly lines. Zhou et al. [23] also pointed out, in a study using near-infrared spectroscopy to predict fig quality, that the combination of MSC and SG could objectively and effectively reduce baseline drift in spectral data. Zhao et al. [24] pointed out that, because models established on the basis of a single or small-scale dataset are often unable to cope with unknown sample variations in real-world scenarios, comparisons among different datasets of similar samples have important research value.
Although spectral preprocessing can improve data quality in most specific scenarios, some studies have also shown that preprocessing does not necessarily lead to better prediction results. Pornchaloempong et al. [25] found, in the quality evaluation of tropical fruits, that although the MSC algorithm was suitable for predicting acidity in mango, raw spectra without any preprocessing achieved better model performance when predicting SSC. Liu et al. [26] compared derivative-based and smoothing-based preprocessing methods in the early nondestructive diagnosis of diseases in Qiuyue pear, and the results likewise showed that the highest overall recognition accuracy could be achieved by using raw data or only a conservative smoothing treatment. This objectively demonstrates that, while reconstructing spectral features, complex preprocessing algorithms may also carry the potential risk of simultaneously amplifying high-frequency noise or masking originally useful information.
In summary, the introduction of appropriate spectral preprocessing algorithms has a clear positive effect on improving modeling stability. However, because different preprocessing algorithms differ fundamentally in their mathematical reconstruction mechanisms, their final intervention effects often vary depending on the object under study. Agustina et al. [27] pointed out that the effectiveness of a preprocessing strategy is highly dependent on the physical characteristics of the sample itself and the specific detection index, and that there is no single algorithm that is absolutely universal. To address this issue and improve robustness, recent studies have explored optimization at the variable level. For example, Yang et al. [28] used a combination of bootstrapping soft shrinkage (BOSS) and successive projections algorithm (SPA) to extract optimal wavelength variables, effectively improving the partial least squares (PLS) model for predicting SSC in pears. Similarly, Zhan et al. [29] enhanced the PLS model by integrating variables mathematically selected via LASSO regression with the chemical group response spectra. Although previous studies have generally recognized the variability of preprocessing effects, most work in the field remains limited to a single dataset or the framework of traditional linear regression models. For preprocessing strategies with different degrees of mathematical intervention, especially when pear spectra are the research object, there is still a lack of systematic and in-depth investigation into what adaptation patterns exist, what methodological strategies should be adopted, and whether there is potential for deeper feature fusion at the preprocessing level when different spectral datasets of similar samples are analyzed using different regression models.
To this end, based on Vis/NIR spectral data, this study aims to systematically explore and compare the application of combinations of multiple preprocessing methods and regression modeling methods in the nondestructive detection of SSC in pears, so as to provide systematic methodological strategies for subsequent research. The specific objectives of this study were as follows: (1) to systematically introduce and compare five preprocessing algorithms (MA, SNV, MSC, D1, and D2) on different datasets of Xinyu pear and to perform spectral preprocessing on the Xinyu pear data; (2) to establish PLSR, support vector regression (SVR), and convolutional neural network (CNN) models to evaluate the effects of different preprocessing methods on SSC prediction performance; and (3) to establish a multi-preprocessing feature fusion model to investigate the complementary ability of features derived from different preprocessing methods.
Although many studies have adopted combinations of different preprocessing methods and achieved good results [30,31,32], the number of possible combinations is prohibitively large for the present work. Therefore, to pursue the research objectives more reliably and systematically, this study focused only on comparison and fusion based on single preprocessing methods.
The novelty and contributions of this study are as follows: (1) From the perspective of spectral preprocessing strategies and their combination mechanisms, this study systematically analyzes the effects of different preprocessing methods under multiple datasets and multiple modeling conditions, thereby extending the existing research perspective, which has mainly focused on local methodological optimization. (2) This study introduces the idea of feature fusion at the preprocessing level and explores the potential complementarity among different preprocessing results, providing a new analytical perspective for the collaborative utilization of multi-source spectral information. (3) Comparative analyses are conducted based on multiple spectral datasets of similar samples, and the stability and generalization ability of the models are analyzed from the perspective of data variation, providing a reference for method selection in practical applications.

2. Materials and Methods

2.1. Pear Samples

The Xinyu pear samples used in this study were collected in two batches over two years. To ensure the scientific validity of the experiment and control of variables, the collected Xinyu pear samples were obtained during the same mature season in 2024 and 2025.

2.1.1. The First Batch of Samples

On 8 September 2024, 144 mature Xinyu pear samples were obtained from Anji Fumin Ecological Agriculture Development Co., Ltd., Huzhou, China. During the experiment, the on-site humidity was maintained at 50–70%, and the temperature was maintained at 24 °C. In the experiment, near-infrared spectra were first collected from the samples, and then their physicochemical indices were measured. During the experiment, a PLSR model was established based on all samples, and 8 samples with relatively large errors were manually removed. The remaining 136 samples were used for modeling.

2.1.2. The Second Batch of Samples

From 26 July to 3 August 2025, 600 Xinyu pear samples were obtained from Anji Fumin Ecological Agriculture Development Co., Ltd., Huzhou, China. To maintain the reliability and diversity of the experimental samples and ensure the generalization ability of the model, the maturity range of the collected samples was controlled from approximately 60% maturity to full maturity, with each maturity level accounting for a similar proportion of the total sample set. The sample data collection procedure and experimental environment were the same as those for the first batch. During the experiment, a PLSR model was established based on all samples, and 48 samples with relatively large errors were manually removed. The remaining 552 samples were used for modeling.

2.2. Instruments and Spectra Acquisition

Pear spectra were collected using an integrated device, which enabled the simultaneous acquisition of visible and near-infrared reflection spectra at the same measurement point, with the recorded values representing the spectral reflectance. The visible spectrometer used was the SE2030 from OtO Photonics (OtO Photonics SE2030-025-FUVN, Hsinchu, China), with a spectral measurement range of 180–1100 nm. The near-infrared spectrometer used was the SW2540 from OtO Photonics (OtO Photonics SW2540-025-NIRA, Hsinchu, China), with a spectral measurement range of 950–1700 nm.
It is worth noting that, although using an integrating sphere for reflection measurements can accurately capture scattered radiation and reduce the influence of pear surface roughness, such equipment is relatively bulky and restricted to a fixed laboratory environment [33,34]. Therefore, to meet the requirements for portable detection devices in subsequent applications, this study employed a fiber-optic probe for spectral measurement. During spectral acquisition, the near-infrared fiber-optic probe was fixed on a bracket. This probe was capable of both emitting the light source and receiving spectral information. Throughout the experiment, the fruit surface was kept in close contact with the fiber-optic probe, and four measurements were taken at fixed intervals around the equatorial plane of each pear sample. Only the spectra obtained by the near-infrared spectrometer were used for further analysis. A schematic of the measurement positions is shown in Figure 1, where the yellow dashed line denotes the equatorial plane of the sample and the red dots indicate the approximate positions of the fiber-optic probe measurements.
During spectral acquisition for the two batches of samples, the instrumental settings of the near-infrared spectrometer differed, so the two batches covered different spectral bands. Because the head and tail of each spectrum contained noise, only the 1086–1568 nm range was used for further analysis in both batches. Within this range, the first batch contained 74 spectral bands and the second batch contained 483.

2.3. SSC Measurement

SSC was measured using a handheld refractometer (ATAGO PR-101 digital refractometer; Fukaya-shi, Saitama, Japan) and expressed as °Brix. The measurement procedure included extracting pulp from cored pears using a manual juicer, followed by filtering the juice through clean gauze. After two rounds of filtration, the filtered pear juice was measured using the handheld refractometer. The average of three measurements was taken as the SSC value. After each measurement, the juicer, beaker, and refractometer were cleaned and wiped dry.

2.4. Spectral Resampling (Linear Interpolation)

The spectra of the two batches differed in the number of acquired spectral features: each first-batch spectrum contained 74 features, while each second-batch spectrum contained 483. This study therefore used linear interpolation to expand the dimensionality of the first-batch spectra, constructing a new dataset with the same feature dimensionality as the second batch, so that data from the same type of samples, collected by the same device but with different dimensions, could be compared directly.
Spectral resampling (linear interpolation) refers to a method for calculating the reflectance at any position between two adjacent wavelength points by using the reflectance values at known basic wavelength points and a linear formula. This method can expand or reduce the spectral dimensionality without changing the overall trend of the original spectrum. The linear interpolation formula used in this study is as follows:
y = y0 + [(y1 − y0) / (x1 − x0)] × (x − x0)
where y is the estimated reflectance at the target wavelength point x; x0 and x1 are two adjacent known wavelength points in the source spectrum; and y0 and y1 are the actual measured reflectance values at wavelengths x0 and x1, respectively.
In practical engineering scenarios, there may likewise be cases in which the same spectral acquisition device produces data of different feature dimensions for the same type of samples because of different settings. Therefore, this method has practical significance for real-world applications.
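The interpolation step above can be sketched with NumPy, which implements exactly this piecewise-linear formula in `np.interp`. The wavelength grids below are placeholders (evenly spaced points across the 1086–1568 nm window); the study's actual band positions are not given, so these grids are an assumption.

```python
import numpy as np

# Hypothetical band grids: 74 bands (batch 1) resampled onto 483 bands (batch 2),
# both spanning the 1086-1568 nm analysis window used in this study.
src_wavelengths = np.linspace(1086, 1568, 74)
dst_wavelengths = np.linspace(1086, 1568, 483)

def resample_spectrum(reflectance, src_wl=src_wavelengths, dst_wl=dst_wavelengths):
    """Linearly interpolate one spectrum onto the denser wavelength grid."""
    return np.interp(dst_wl, src_wl, reflectance)

spectrum_74 = np.random.rand(74)          # placeholder first-batch spectrum
spectrum_483 = resample_spectrum(spectrum_74)
```

Because both grids share the same endpoints, the resampled spectrum preserves the endpoint reflectances and the overall shape of the source spectrum.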

2.5. Dataset Division

The dataset constructed from the raw data of the first batch of experiments was named Dataset 1, and the dataset constructed by linear interpolation based on Dataset 1 was named Dataset 2. The dataset from the second batch of experiments was named Dataset 3.
Dataset 1 was divided into a training set, a validation set, and a prediction set at a ratio of 4:1:1, with 92 samples in the training set and 22 samples each in the validation and prediction sets. The training set was used for model training, the validation set for evaluating model robustness, and the prediction set for evaluating the generalization ability of the final model. Because inappropriate dataset division may aggravate overfitting in regression models, a PLSR model was used at the dataset division stage to evaluate the performance of different dataset partition combinations. After experimentation, a reasonable dataset division scheme under the PLSR model was obtained. This scheme was conducive to improving the reliability of subsequent model construction, did not tend to produce obvious overfitting, and could be used for further analysis.
Dataset 2 was obtained by linear interpolation based on Dataset 1, and therefore its sample division was identical to that of Dataset 1.
Because the number of samples contained in Dataset 3 was much larger than that in Dataset 2, a larger proportion of training samples was considered more reasonable for modeling. Therefore, in this study, the samples were divided into a training set, a validation set, and a prediction set at a ratio of 6:1:1, with 414 samples in the training set and 69 samples each in the validation and prediction sets. The training set was used for model training, the validation set for evaluating model robustness, and the prediction set for evaluating the generalization ability of the final model. The effectiveness of the division scheme for Dataset 3 was verified using the same PLSR model used for Dataset 1.
After dataset division, the summary statistics of the physicochemical indices of the training, validation, and prediction sets were calculated. The summary statistics of SSC are shown in Table 1, including the minimum, maximum, mean, and standard deviation values. The ranges of SSC in the divided datasets were scientifically reasonable, indicating that they were suitable for subsequent modeling analysis.
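The 6:1:1 division of Dataset 3 (414/69/69 samples) can be illustrated with scikit-learn. Note that this is only a sketch with random placeholder arrays: the study evaluated candidate partitions with a PLSR model rather than splitting purely at random, and the random seed here is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for Dataset 3: 552 samples x 483 bands.
X = np.random.rand(552, 483)
y = np.random.rand(552)

# First carve off 138 samples (validation + prediction), then split them 69/69.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=138, random_state=0)
X_val, X_pred, y_val, y_pred = train_test_split(X_tmp, y_tmp, test_size=69, random_state=0)
```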

3. Spectral Preprocessing Algorithms and Regression Models

3.1. Spectral Preprocessing Algorithms

In the actual acquisition of near-infrared spectra, the raw spectra usually contain a certain amount of physical background noise and spectral baseline drift due to the combined effects of objective factors such as differences in the surface curvature of pear samples, light scattering caused by internal pulp tissues, and the dark current inside the spectrometer. If these raw data containing interference signals are directly used for modeling, they often interfere with the fitting of regression models. Therefore, before establishing prediction models, it is necessary to perform mathematical transformations on the data through spectral preprocessing algorithms in order to weaken background interference and highlight effective chemical bond vibration information.
In the preprocessing-based modeling of this study, in order to achieve an objective comparison of the effects of different preprocessing methods on the models and to facilitate subsequent feature fusion based on appropriate preprocessing algorithms, each preprocessing algorithm was used independently without combining it with other algorithms. The preprocessing methods used in this study, together with their purposes, main effects, and parameter settings, are summarized in Table 2.

3.1.1. Moving Average Smoothing (MA)

MA is commonly used to reduce noise in spectral data [35]. Its basic idea is to set a sliding window with a fixed width, move it point by point along the entire spectral curve, and replace the reflectance at the center point with the arithmetic mean of the reflectance values of all wavelength points within the window.
Reasonable selection of the sliding window size is the key to the practical application of this algorithm. If the window is too small, the smoothing and denoising effect will be limited; if the window is too large, it may smooth out key narrow-band absorption peaks in the spectrum, resulting in the loss of effective physicochemical information. In this study, a sliding window with a window length of 7 was uniformly used for MA preprocessing in the modeling process.
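A minimal sketch of MA with the window length of 7 used here, implemented as a convolution with a uniform kernel. The paper does not specify how spectrum edges were handled; reflect-padding is one common convention and is an assumption of this sketch.

```python
import numpy as np

def moving_average(spectrum, window=7):
    """Smooth a 1-D spectrum with a centred moving-average window.
    Edges are handled by reflecting the signal (an assumed convention)."""
    pad = window // 2
    padded = np.pad(spectrum, pad, mode="reflect")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")
```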

3.1.2. Standard Normal Variate (SNV)

SNV is commonly used to reduce the effects caused by scattering [36]. During processing, this algorithm does not rely on other samples; instead, each spectrum is treated as an independent random variable sequence. By independently centering and standardizing each sequence, spectral baseline drift caused by surface scattering can be removed. After SNV correction, the mean of each spectrum is forced to zero and the variance is normalized to one, thereby effectively unifying the spectral scale among different samples.
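Because SNV treats each spectrum independently, it reduces to a per-row standardization. A minimal sketch (using the population standard deviation; some implementations use the sample standard deviation instead):

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) independently."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std
```

After correction, every spectrum has zero mean and unit variance, which removes additive baseline shifts and multiplicative scaling per sample.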

3.1.3. Multiplicative Scatter Correction (MSC)

MSC is also used to reduce the effects caused by scattering [37]. This algorithm uses an ideal reference spectrum as the baseline and directly calculates and removes the baseline drift and scaling deviation in each spectrum to be corrected through a univariate linear regression method, thereby improving the comparability among different samples. To strictly prevent data leakage during modeling, this study used only the mean spectrum of all samples in the training set as the reference spectrum, without introducing information from the validation set or prediction set.
Although SNV and MSC differ in their underlying computational logic, in practical spectral signal processing, their performance is usually similar in reducing interference caused by particle scattering and optical path differences.
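A sketch of MSC following the leakage-free convention described above: the reference is the mean spectrum of the training set only, and each spectrum is corrected by removing the slope and offset of its univariate regression against that reference.

```python
import numpy as np

def msc_fit(train_spectra):
    """Reference spectrum = mean of the training set only (avoids data leakage)."""
    return np.asarray(train_spectra, dtype=float).mean(axis=0)

def msc_apply(spectra, reference):
    """Regress each spectrum on the reference and remove slope/offset deviations."""
    spectra = np.asarray(spectra, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(reference, s, deg=1)
        corrected[i] = (s - intercept) / slope
    return corrected
```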

3.1.4. Derivative Preprocessing Algorithms (First Derivative and Second Derivative)

Derivative preprocessing can effectively eliminate baseline drift and background interference, separate overlapping peaks, and amplify subtle spectral feature changes [38]. The derivative preprocessing methods used in this study were D1 and D2, both implemented with a Savitzky–Golay (SG) filter.
In the experiment, the sliding window length of D1 was set to 7, and the polynomial order was set to 1. For D2, the sliding window length and polynomial order were set to 7 and 2, respectively.
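The D1 and D2 settings above map directly onto SciPy's Savitzky–Golay filter; a small wrapper mirroring the paper's parameters (window 7; polynomial order 1 for D1 and 2 for D2):

```python
import numpy as np
from scipy.signal import savgol_filter

def sg_derivative(spectra, deriv=1, window=7, polyorder=None):
    """Savitzky-Golay derivative along the wavelength axis.
    Defaults mirror the settings in this study: window length 7,
    polynomial order 1 for D1 and 2 for D2."""
    if polyorder is None:
        polyorder = deriv  # 1 for D1, 2 for D2, as in the study
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=-1)
```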
To ensure the reproducibility of the experiments under different preprocessing methods and the fairness of result comparisons among methods, a unified experimental procedure was adopted for analysis across the three datasets in this study. Specifically, the samples were first divided into the training set, validation set, and prediction set according to the predetermined division scheme. Then, the necessary parameters for each preprocessing method, such as the MA window length, the polynomial order for SG derivatives, and the reference spectrum for MSC, were calculated based only on the training set, and the same processing rules were subsequently applied to the validation set and prediction set to avoid information leakage. The preprocessed spectral data were then input into the PLSR, SVR, or one-dimensional CNN (1D-CNN) model for modeling, and the model hyperparameters were adjusted according to the performance on the validation set, thereby ensuring the comparability of different preprocessing methods under the same model.

3.2. Regression Prediction Methods

3.2.1. Partial Least Squares Regression (PLSR)

PLSR is a statistical modeling method that combines the features of principal component analysis and multiple linear regression, and it belongs to traditional linear regression algorithms. In near-infrared spectral modeling, PLSR achieves dimensionality reduction by constructing a latent variable space and retaining information that is highly correlated with the response variable, and can effectively adapt to dataset structures characterized by many variables and few samples [39].
The core hyperparameter of PLSR is the number of latent variables extracted by the model. In this study, the search range for the number of latent variables was set from 1 to 20, and 10-fold cross-validation was adopted. During the optimization process, the algorithm traversed this range in sequence and strictly used the validation correlation coefficient of the independent validation set (rv) as the evaluation criterion. Finally, the number of latent variables corresponding to the global maximum rv was selected as the optimal complexity parameter. The same hyperparameter tuning procedure and search ranges were applied to all PLSR models in this study.

3.2.2. Support Vector Regression (SVR)

SVR is a modeling method constructed based on the principle of structural risk minimization in statistical learning theory, and it belongs to nonlinear regression algorithms. This algorithm can solve nonlinear regression problems that are difficult to handle in low-dimensional space and can be used for spectral regression modeling tasks involving small samples, high-dimensional data, and nonlinear data [40].
In this study, the radial basis function (RBF) kernel was uniformly adopted to handle the complex nonlinear mapping relationships in high-dimensional spectral data. Before the spectral features were input into the model, standardization preprocessing (Standard Scaler) was uniformly performed to eliminate differences in scale among different bands. The predictive performance of SVR based on the RBF kernel is mainly determined by three core hyperparameters: the penalty coefficient C, the kernel parameter γ, and the width of the insensitive loss band ϵ. In the experiment, grid search combined with 10-fold cross-validation was used to optimize the parameters on the training set. Specifically, the search spaces of C and γ were both set from 10−7 to 107, with 15 logarithmically spaced nodes each, while the search space of ϵ was set from 10−6 to 10−1, with 6 logarithmically spaced nodes. The algorithm took the minimization of the root mean square error of cross-validation (RMSECV) as the optimization objective and exhaustively traversed the parameter grid matrix, thereby determining the globally optimal combination of hyperparameters. The same hyperparameter tuning procedure and search ranges were applied to all SVR models in this study.
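The SVR tuning setup can be sketched with a scikit-learn pipeline and grid search. The grid mirrors the ranges reported above (C and γ over 10⁻⁷–10⁷ with 15 log-spaced nodes each, ϵ over 10⁻⁶–10⁻¹ with 6 nodes, 1350 combinations in total); the RMSECV objective corresponds to the `neg_root_mean_squared_error` scorer. Variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Search grid mirroring the ranges reported in the text.
param_grid = {
    "svr__C": np.logspace(-7, 7, 15),
    "svr__gamma": np.logspace(-7, 7, 15),
    "svr__epsilon": np.logspace(-6, -1, 6),
}

# Standardization is fitted inside the pipeline, so each CV fold scales
# its own training portion and no validation information leaks in.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
search = GridSearchCV(pipeline, param_grid, cv=10,
                      scoring="neg_root_mean_squared_error")  # minimises RMSECV
# search.fit(X_train, y_train) would then exhaustively traverse the grid.
```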

3.2.3. Convolutional Neural Network (CNN)

CNN is a deep learning model based on local perception and multilayer feature extraction mechanisms, with strong nonlinear modeling capability. Its typical structure includes convolutional layers, activation functions, pooling layers, and fully connected layers, enabling it to automatically learn high-level mapping relationships between input and output in an end-to-end manner. It has been widely applied in data modeling tasks with complex structures [41].
In near-infrared spectral analysis, spectral data are presented as one-dimensional continuous sequences, and there is significant local correlation among bands, making them suitable for modeling with a one-dimensional convolutional structure. Without the need for explicit variable selection, CNN can automatically extract important spectral features, identify key absorption regions, and effectively capture nonlinear response patterns, thereby improving model prediction performance. Although it is sensitive to sample size and network design and has limited model interpretability, CNN has become an important complement to traditional linear methods in spectral modeling scenarios involving high-dimensional and complex structures, and it has significant application potential [42].
For the CNN models used to evaluate the effects of different preprocessing methods, multiple 1D-CNN architectures and parameter configurations were designed according to the characteristics of each preprocessed dataset. Because differently preprocessed spectra may differ in dimensionality and feature distribution, a single fixed CNN architecture may not suit all input conditions equally well. Accordingly, the CNN architectures were adjusted to the characteristics of each input spectrum, so that every type of preprocessed data could be modeled under a relatively appropriate network configuration. The CNN model diagrams, structural settings, and descriptions for the different preprocessed datasets are provided in the Supplementary Materials.
As representative examples, the CNN architectures used in this study ranged from a basic single-convolution architecture to deeper two- or three-convolution architectures according to the characteristics of the input data. For the raw spectra of Dataset 1, the CNN model consisted of a one-dimensional convolutional layer, followed by a one-dimensional batch normalization (BatchNorm1d) layer, a rectified linear unit (ReLU) activation function, a one-dimensional max pooling (MaxPool1d) layer, a flattening operation, and two fully connected layers for regression output. For more complex input conditions, such as high-dimensional spectra or derivative-preprocessed spectra, some CNN models adopted two- or three-layer convolutional structures combined with dropout, normalization layers, or adaptive pooling to better match the characteristics of the input spectra.
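The single-convolution architecture described for the raw spectra of Dataset 1 can be sketched as follows, assuming PyTorch. The channel count, kernel size, and hidden width are illustrative placeholders, since the exact settings are given in the Supplementary Materials.

```python
# Sketch of the single-convolution 1D-CNN described above (illustrative sizes).
import torch
import torch.nn as nn

class SpectraCNN(nn.Module):
    """Conv1d -> BatchNorm1d -> ReLU -> MaxPool1d -> Flatten -> two FC layers."""
    def __init__(self, n_bands: int, channels: int = 16, kernel: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=kernel, padding=kernel // 2),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),
        )
        flat = channels * (n_bands // 2)   # length halved by the pooling layer
        self.head = nn.Sequential(nn.Linear(flat, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                  # x: (batch, 1, n_bands)
        return self.head(self.features(x))

model = SpectraCNN(n_bands=100)
out = model(torch.randn(8, 1, 100))        # batch of 8 dummy spectra -> (8, 1)
```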
During CNN modeling, all CNN models were trained using the mean squared error (MSE) loss function and the Adam optimizer. The training set was used for model fitting, the validation set was used as a reference for model selection and hyperparameter adjustment, and the prediction set was reserved for final performance evaluation. To ensure fair comparison, the PLSR, SVR, and CNN models were evaluated using the same training, validation, and prediction sets within each dataset, as well as the same evaluation metrics.
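A minimal training-loop sketch matching this setup (MSE loss, Adam optimizer, validation-guided model selection) is shown below, assuming PyTorch; the tiny model and random tensors are placeholders for the actual networks and data splits.

```python
# Sketch of the training procedure described above: MSE loss + Adam,
# with the validation loss used as the model-selection criterion.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
X_train, y_train = torch.randn(40, 1, 100), torch.randn(40, 1)
X_val, y_val = torch.randn(10, 1, 100), torch.randn(10, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, best_state = float("inf"), None
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()                           # validation loss guides model selection
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, best_state = val_loss, model.state_dict()
# best_state holds the weights with the lowest validation loss;
# the held-out prediction set would be evaluated only once, at the end.
```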

3.2.4. Multi-Preprocessing Feature Fusion Model

After multiple preprocessing methods are applied, although the spectra all originate from the same samples, the different methods do not emphasize the retention and adjustment of spectral information in the same way. For example, raw spectra retain the original waveform characteristics of the sample most completely; MA-processed spectra help reduce local random fluctuations; and SNV- and MSC-processed spectra reduce, to a certain extent, the overall shifts among samples caused by physical differences. The different spectral representations are therefore not mere repetitions of one another but may contain complementary information. To investigate the complementarity and fusion capability of features across preprocessing methods, this study designed a fusion modeling experiment based on multi-preprocessing features.
In the specific modeling process, raw spectra, MA spectra, SNV spectra, and MSC spectra were used as multi-branch inputs of the same sample, and the corresponding SSC value of each sample was used as the unified output. Subsequently, a multi-branch CNN was used to extract and concatenate features from the four types of spectra separately, thereby forming a unified fused feature representation.
Dataset 1 is a small-sample, low-dimensional dataset. As shown in Figure 2, the feature fusion model for Dataset 1 adopts a multi-branch 1D-CNN structure. From left to right, the four branches correspond to raw spectra, MA spectra, SNV spectra, and MSC spectra, respectively. Each branch contains two convolutional layers, while Dropout layers are designed for the SNV and MSC branches. After feature extraction is completed in each branch, the model concatenates the 32-dimensional feature vectors obtained from the four branches to form a 128-dimensional feature vector, and then outputs the main regression result through a fully connected layer. At the same time, auxiliary output heads are separately set for the raw, MA, SNV, and MSC branches, so that each branch is constrained by the prediction target before participating in feature fusion, thereby enhancing the association between branch features and SSC.
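The four-branch structure for Dataset 1 can be sketched as follows, assuming PyTorch. The branch channel counts, kernel sizes, and the use of adaptive average pooling to obtain the 32-dimensional branch features are illustrative assumptions; the exact configuration is given in Figure 2.

```python
# Sketch of the multi-branch fusion CNN described above (illustrative sizes).
import torch
import torch.nn as nn

def branch(dropout: float = 0.0) -> nn.Module:
    """Two-convolution branch that reduces one spectrum to a 32-dim feature."""
    layers = [
        nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),        # -> (batch, 32)
    ]
    if dropout > 0:
        layers.append(nn.Dropout(dropout))
    return nn.Sequential(*layers)

class FusionCNN(nn.Module):
    """Four branches (raw, MA, SNV, MSC) with auxiliary heads and a main head."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList(
            [branch(), branch(), branch(0.3), branch(0.3)]  # Dropout on SNV/MSC
        )
        self.aux_heads = nn.ModuleList([nn.Linear(32, 1) for _ in range(4)])
        self.main_head = nn.Linear(128, 1)    # 4 x 32 concatenated features

    def forward(self, xs):                    # xs: list of 4 (batch, 1, n_bands)
        feats = [b(x) for b, x in zip(self.branches, xs)]
        aux = [h(f) for h, f in zip(self.aux_heads, feats)]
        return self.main_head(torch.cat(feats, dim=1)), aux

model = FusionCNN()
xs = [torch.randn(8, 1, 100) for _ in range(4)]
main_out, aux_outs = model(xs)
```

During training, the auxiliary outputs would each receive their own MSE loss against SSC, so that every branch is constrained by the prediction target before fusion, as described above.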
Dataset 2 is a high-dimensional dataset obtained by linear interpolation based on Dataset 1, and therefore can be further used to examine the performance of multi-preprocessing feature fusion modeling in the context of interpolation-based dimensionality expansion. As shown in Figure 3, the feature fusion model for Dataset 2 generally maintains the same network depth and fusion strategy as Dataset 1.
Compared with the first two datasets, Dataset 3 has a larger sample size, a higher native spectral dimensionality, and more complex individual differences among samples. Therefore, certain adjustments were required in feature fusion modeling. As shown in Figure 4, after the features from each branch are concatenated into a 128-dimensional vector, an additional fully connected layer is designed to map it to 256 dimensions, followed by a BatchNorm1d layer and a ReLU layer, and finally another fully connected layer is used to obtain the final output.

3.3. Software, Hardware, and Performance Evaluation

The preprocessing algorithms and network models used in this study were developed using PyCharm (version 2024.1.2; JetBrains, Prague, Czech Republic) and Python (version 3.10.1; Python Software Foundation, Wilmington, DE, USA). All data analyses were conducted on a computer equipped with an Intel(R) Core(TM) i5-12600KF processor (Intel Corporation, Santa Clara, CA, USA) operating at 3.7 GHz. The computer had 32 GB of memory and used an NVIDIA GeForce RTX 4060 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 16 GB of video memory.
To objectively evaluate the actual optimization effects of different spectral preprocessing algorithms on the predictive performance of subsequent regression models, this study mainly used the correlation coefficient (r) and root mean square error (RMSE) as the core evaluation metrics. In the actual experiments, these metrics were applied separately to the training set, validation set, and prediction set, yielding the training-set correlation coefficient (rc), root mean square error of calibration (RMSEC), validation-set correlation coefficient (rv), root mean square error of validation (RMSEV), and prediction-set correlation coefficient (rp) and root mean square error of prediction (RMSEP). The correlation coefficient reflects the degree of linear correlation between the model predictions and the true values. The closer its absolute value is to 1, the better the model fit. RMSE quantifies the absolute deviation between the predicted values and the true values. The closer it is to 0, the higher the predictive accuracy of the model. The specific formulas for these two core metrics are as follows:
$$ r = \frac{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)\left(\hat{y}_i-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}\sum_{i=1}^{n}\left(\hat{y}_i-\bar{\hat{y}}\right)^{2}}} $$
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}} $$
where $n$ is the total number of samples involved in the evaluation; $y_i$ is the true measured value of the physicochemical indicator (SSC) for sample $i$; $\hat{y}_i$ is the corresponding predicted value given by the model; $\bar{y}$ is the arithmetic mean of the true measured values of all samples involved in the evaluation; and $\bar{\hat{y}}$ is the arithmetic mean of the predicted values of all samples.
In addition, the residual predictive deviation (RPD) was introduced as an additional metric to further evaluate the predictive ability of the models. RPD is defined as the ratio of the standard deviation of the reference values, namely the measured SSC values, to the corresponding RMSE. A higher RPD value generally indicates better predictive performance. In this study, RPD was calculated for the prediction set and denoted as RPDp. The formula is as follows:
$$ \mathrm{RPD} = \frac{SD}{\mathrm{RMSE}} $$
where SD is the standard deviation of the reference values, namely the measured SSC values, in the corresponding dataset, and RMSE is the root mean square error of the corresponding dataset.
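The three metrics can be implemented directly from their definitions; a minimal NumPy sketch follows (the sample standard deviation with `ddof=1` is an assumption, as the text does not specify the divisor).

```python
# Evaluation metrics defined above: correlation coefficient r, RMSE, and RPD.
import numpy as np

def correlation(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    num = np.sum((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
    den = np.sqrt(np.sum((y_true - y_true.mean()) ** 2)
                  * np.sum((y_pred - y_pred.mean()) ** 2))
    return num / den

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return float(np.sqrt(np.mean((np.asarray(y_true, float)
                                  - np.asarray(y_pred, float)) ** 2)))

def rpd(y_true, y_pred):
    """Residual predictive deviation: SD of the reference values over RMSE."""
    return float(np.std(np.asarray(y_true, float), ddof=1) / rmse(y_true, y_pred))

y_meas = [10.0, 11.0, 12.0, 13.0]          # toy SSC reference values
y_pred = [10.1, 10.9, 12.2, 12.8]
print(correlation(y_meas, y_pred))          # ≈ 0.9908
print(rmse(y_meas, y_pred))                 # ≈ 0.1581
```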
To further analyze the contribution of different spectral bands to SSC prediction, variable importance in projection (VIP) was also employed in this study. VIP is a widely used variable-importance metric in PLSR models that evaluates the relative contribution of individual spectral variables to the prediction target. In general, a higher VIP value indicates a greater contribution of the corresponding wavelength variable to the model, and variables with VIP values greater than 1 are typically considered to have relatively high importance.

4. Results

4.1. Spectral Profiles

Figure 5, Figure 6 and Figure 7 show the spectral profiles of the three datasets under different preprocessing methods, respectively. The vertical axis represents reflectance, and the horizontal axis represents wavelength in nm, with a spectral range of 1086–1568 nm. Each curve in the figures represents a single sample. The different colors of the curves are used solely to improve visual distinction, while the variations between spectra in the same figure reflect the individual differences among the samples.
From the perspective of different datasets, the spectral profiles of the three datasets remained generally consistent after MA, SNV, and MSC preprocessing. This indicates that the two batches of acquired spectral data were consistent in their overall trends and were comparable for research purposes. Meanwhile, linear interpolation enriched the features of the original data while preserving the overall trend of the raw spectra well.
From the perspective of different preprocessing algorithms, after MA preprocessing, the spectral curves maintained the overall trend of the raw spectra, while the spectra became smoother overall. In contrast, the spectral profiles after SNV and MSC preprocessing showed a more clustered pattern. The spectral profiles after D1 and D2 preprocessing exhibited obvious oscillatory patterns in Dataset 2 and Dataset 3, which had higher feature dimensionality.

4.2. Experimental Results

4.2.1. Spectral Preprocessing and Modeling Results for Dataset 1

The spectral preprocessing and modeling evaluation experiment conducted on Dataset 1 compared data under six preprocessing conditions, namely raw spectra, MA, SNV, MSC, D1, and D2, and used three models, PLSR, SVR, and CNN, to validate predictive performance. The specific evaluation metrics of each model are shown in Table 3.
To provide a more intuitive comparison of the prediction performance of the regression models under different preprocessing methods on Dataset 1, the prediction-set rp values are shown as a bar plot in Figure 8.
Under the condition of raw spectra, the prediction correlation coefficient rp of the PLSR model was 0.8573, and its training-set correlation coefficient rc was 0.9355, showing better performance than the CNN model, whose rp was 0.7680, and the SVR model, whose rp was 0.7504, under the same condition. In terms of error metrics, the root mean square error of prediction (RMSEP) of the PLSR model was 0.5622, which was lower than 0.6922 for SVR and 0.6570 for CNN, and its RPDp reached 1.8321.
After MA preprocessing, the RMSEP of the PLSR model decreased slightly from 0.5622 to 0.5596, indicating stable accuracy, and its rp was 0.8334, which was the best result under this condition. At this point, the performance of the CNN model improved, with its rp increasing from 0.7680 without preprocessing to 0.7990, whereas the rp of the SVR model was 0.7464, showing only a small change.
For the data preprocessed by SNV and MSC, a certain degree of divergence appeared in the modeling performance between linear and nonlinear models. After SNV preprocessing, the CNN model achieved an rp of 0.7930, which was higher than 0.7789 for PLSR and 0.7507 for SVR. Comparison showed that SNV preprocessing reduced the rp of the PLSR model from 0.8573 without preprocessing to 0.7789, while its RMSEP increased from 0.5622 to 0.7078. The results after MSC preprocessing were similar to those after SNV preprocessing, with the CNN model achieving an rp of 0.7988, which was higher than 0.7851 for PLSR and 0.7495 for SVR.
For derivative-based preprocessing, D1 performed better overall than D2. Under D1 preprocessing, the prediction correlation coefficient rp of the CNN model was 0.7869, which was higher than 0.7739 for PLSR and 0.7715 for SVR. After D2 preprocessing, the metrics of all models declined numerically. Specifically, the rp of the PLSR model decreased to 0.7107, that of the SVR model decreased to 0.6970, and that of the CNN model decreased to 0.7426. In terms of error metrics, the RMSEP values under D2 preprocessing reached 0.7089 for PLSR, 0.7530 for SVR, and 0.8262 for CNN, all of which increased to varying degrees compared with those under D1.

4.2.2. Spectral Preprocessing and Modeling Results for Dataset 2

Compared with Dataset 1, Dataset 2 differs in that the raw spectra were linearly interpolated to the same feature dimensionality as Dataset 3 (483 dimensions). Therefore, its preprocessing performance should show a certain relationship with that of the original data before interpolation. The preprocessing methods and regression model settings used in the experiment were consistent with those for Dataset 1. Table 4 presents the specific evaluation metrics of each model under different preprocessing conditions for this dataset.
To provide a more intuitive comparison of the prediction performance of the regression models under different preprocessing methods on Dataset 2, the prediction-set rp values are shown as a bar plot in Figure 9.
Under the condition of raw spectra, the prediction correlation coefficient rp of the PLSR model was 0.8334, and the root mean square error of prediction (RMSEP) was 0.5712, representing the best performance among the three models. The prediction correlation coefficient rp of the CNN model was 0.8209, and its validation-set correlation coefficient rv was 0.8110, both of which were better than 0.7478 for the rp of the SVR model and 0.7632 for its rv. In terms of error metrics, the RMSEP of the PLSR model was 0.5712, which was lower than 0.6994 for SVR and 0.7915 for CNN.
After MA preprocessing, the rp of the PLSR model increased to 0.8682, its training-set correlation coefficient rc reached 0.9243, and its RMSEP decreased to 0.5236, with an RPDp of 1.9671, which was the best result under this condition. At this point, the performance of the CNN model also improved, with its RMSEP decreasing from 0.7915 without preprocessing to 0.6388, and its validation error RMSEV decreasing from 0.8407 to 0.7198. The rp of the SVR model increased slightly from 0.7478 to 0.7505, while its corresponding RMSEP decreased slightly from 0.6994 to 0.6957.
Under SNV and MSC preprocessing, the predictive performance of linear and nonlinear models likewise showed a certain degree of numerical divergence. After SNV preprocessing, the CNN model achieved an rp of 0.8501, which was higher than 0.8084 for SVR and 0.7897 for PLSR. At the same time, SNV increased the prediction correlation coefficient of SVR from 0.7478 without preprocessing to 0.8084, while its RMSEP decreased from 0.6994 to 0.5926. Similarly, under MSC preprocessing, the rp of the CNN model was 0.8190, which was higher than 0.8029 for SVR and 0.7961 for PLSR. At this point, the RMSEP values of the PLSR model under SNV and MSC preprocessing were 0.6905 and 0.6719, respectively, both of which were higher than 0.5712 under raw spectra.
For derivative-based preprocessing, D1 also performed better overall than D2. Under D1 preprocessing, the prediction correlation coefficient rp of the CNN model was 0.7889, which was higher than 0.7843 for SVR and 0.7027 for PLSR. After D2 preprocessing, the metrics of all models declined numerically, with the rp of CNN decreasing to 0.6115, that of SVR decreasing to 0.6097, and that of PLSR decreasing to 0.5426. In terms of error metrics, the RMSEP values after D2 preprocessing reached 0.8888 for PLSR, 0.8676 for SVR, and 1.0316 for CNN, all of which increased to varying degrees compared with 0.8669, 0.6951, and 0.6775 under D1 preprocessing.

4.2.3. Spectral Preprocessing and Modeling Results for Dataset 3

Compared with Dataset 1, Dataset 3 not only had a larger sample size, but also objectively introduced greater individual differences among samples and larger fluctuations in internal characteristics. The experiment continued to compare the dataset without preprocessing and the other five preprocessed datasets, combined with the three models of PLSR, SVR, and CNN to validate predictive performance. The specific evaluation metrics of each model for this dataset are summarized in Table 5.
To provide a more intuitive comparison of the prediction performance of the regression models under different preprocessing methods on Dataset 3, the prediction-set rp values are shown as a bar plot in Figure 10.
Under the condition of raw spectra, the prediction correlation coefficient rp of the PLSR model was 0.6764, representing the best performance among the three models, and the corresponding RMSEP was 0.4717, which was the lowest in the whole group, with an RPDp of 1.2983. The rp of the CNN model was 0.6631, which was lower than that of PLSR but higher than 0.6420 for the SVR model.
After MA preprocessing, the rp of the PLSR model was 0.6712, still maintaining the highest predictive accuracy in this group. Compared with the condition of raw spectra, the performance of the SVR model improved under MA preprocessing, with rp increasing from 0.6420 to 0.6576 and RMSEP decreasing from 0.4967 to 0.4920. In contrast, the performance of the CNN model declined, with rp decreasing from 0.6631 to 0.6415 and the error metric increasing from 0.4894 to 0.5540.
When SNV and MSC preprocessing were introduced, the predictive performance of all models was lower than that under the no-preprocessing condition. Under SNV preprocessing, the rp of the PLSR model was 0.6200, which was higher than 0.6090 for CNN and 0.5497 for SVR. The results of MSC were similar to those of SNV, with the rp of PLSR being 0.6230, again showing greater stability than 0.5509 for SVR and 0.5521 for CNN.
For derivative-based preprocessing, the evaluation metrics of both methods declined, but D1 still performed better than D2. Under D1 preprocessing, the rp of PLSR was 0.6290, which was higher than 0.5981 for SVR and 0.5610 for CNN. Under second-derivative (D2) preprocessing, the metrics of all models decreased substantially, among which the rp of the PLSR model was 0.5571, still higher than 0.3412 for CNN and 0.1241 for SVR.

4.2.4. Results of Multi-Preprocessing Feature Fusion Modeling

In this study, the investigation of multi-preprocessing feature fusion was conducted in two ways. On the one hand, the fusion model was directly used to perform CNN-based modeling analysis. On the other hand, after training the feature fusion model, the fused features obtained after concatenation were extracted and then used for regression modeling with PLSR and SVR, respectively, so as to compare the predictive performance of different models under the condition of multi-preprocessing feature fusion. To ensure the completeness of the result analysis, experiments were conducted on Dataset 1, Dataset 2, and Dataset 3. Table 6 presents the overall results of multi-preprocessing feature fusion modeling for the three datasets.
Results of Multi-Preprocessing Feature Fusion Modeling for Dataset 1
As shown in Table 6, multi-preprocessing feature fusion modeling achieved relatively good predictive results on Dataset 1 as a whole. Among the three fusion models, the CNN model showed the best performance on the prediction set, with an rp of 0.8811 and an RMSEP of 0.4978, and an RPDp of 2.0691. The SVR model ranked second, with an rp of 0.8758 and an RMSEP of 0.5006, whereas the PLSR model yielded an rp of 0.8686 and an RMSEP of 0.5155. Overall, the results of all three fusion models on the prediction set remained at a relatively high level, among which the fusion CNN model performed relatively better.
Comparison with the individual modeling results under different preprocessing methods for Dataset 1 further indicates that the predictive performance of multi-preprocessing feature fusion modeling on this dataset was better than that of the best-performing single-preprocessing model. As shown in Table 3, the raw spectra combined with the PLSR model achieved an rp of 0.8573 and an RMSEP of 0.5622, representing the best result among the individual modeling results under different preprocessing methods for Dataset 1. The MA-PLSR model yielded an rp of 0.8334 and an RMSEP of 0.5596, while the rp values of the SNV-CNN and MSC-CNN models were 0.7930 and 0.7988, respectively. In contrast, the rp values of the multi-preprocessing feature fusion CNN, PLSR, and SVR models increased to 0.8811, 0.8686, and 0.8758, respectively, while the RMSEP values decreased to 0.4978, 0.5155, and 0.5006, respectively. Compared with the individual modeling results under different preprocessing methods for Dataset 1, all three fusion models showed certain improvements in both the correlation coefficient and prediction error.
Overall, on Dataset 1, multi-preprocessing feature fusion modeling performed well as a whole, and all three models exceeded the highest predictive accuracy previously achieved by single-preprocessing models. Although certain differences still existed among the models, all three fusion models achieved relatively high rp values and relatively low RMSEP values, indicating that under the conditions of this dataset, multi-preprocessing feature fusion modeling had certain advantages over individual modeling under different preprocessing methods.
Results of Multi-Preprocessing Feature Fusion Modeling for Dataset 2
As shown in Table 6, on Dataset 2, multi-preprocessing feature fusion modeling still maintained a relatively good predictive level overall. Among the three fusion models, the SVR model achieved the best overall performance on the prediction set, with an rp of 0.8288 and an RMSEP of 0.5779, and an RPDp of 1.7823. The CNN model ranked second, with an rp of 0.8259 and an RMSEP of 0.5762, whereas the PLSR model yielded an rp of 0.8032 and an RMSEP of 0.6029. Overall, the differences among the three fusion models on the prediction set were small. The SVR model showed a relatively higher correlation coefficient, whereas the CNN model showed a relatively lower prediction error, and their overall performance was relatively close, with the PLSR model being slightly lower.
Comparison with the individual modeling results under different preprocessing methods for Dataset 2 further indicates that the multi-preprocessing feature fusion modeling results did not show a clear advantage. As shown in Table 4, the better results among the individual preprocessing-based models for Dataset 2 were mainly obtained by the MA-PLSR and SNV-CNN combinations. Among them, the combination of MA with the PLSR model achieved an rp of 0.8682 and an RMSEP of 0.5236, representing the best overall result. The SNV-CNN model achieved an rp of 0.8501 and, although its RMSEP was relatively higher, it still performed well among the CNN models. In addition, the combination of raw spectra with the PLSR model achieved an rp of 0.8334 and an RMSEP of 0.5712, which was also relatively close overall. In contrast, the rp values of the fusion CNN, PLSR, and SVR models were 0.8259, 0.8032, and 0.8288, respectively, while the RMSEP values were 0.5762, 0.6029, and 0.5779, respectively.
A comprehensive comparison shows that, on Dataset 2, the predictive accuracy of the three models was relatively similar. Among them, the predictive accuracy of SVR was better than that of all previous single-preprocessing models, whereas the performance of CNN and PLSR exceeded that of most single-preprocessing models.
Results of Multi-Preprocessing Feature Fusion Modeling for Dataset 3
As shown in Table 6, on Dataset 3, multi-preprocessing feature fusion modeling achieved relatively good predictive results overall. Among the three fusion models, the CNN model showed the best performance on the prediction set, with an rp of 0.7064 and an RMSEP of 0.5239. The PLSR model ranked second, with an rp of 0.6996 and an RMSEP of 0.4572, whereas the SVR model yielded an rp of 0.6570 and an RMSEP of 0.4809. Overall, the results of all three fusion models on the prediction set remained at a moderately good level, among which the fusion PLSR model performed relatively better in terms of error control.
Comparison with the individual modeling results under different preprocessing methods for Dataset 3 further indicates that the predictive performance of multi-preprocessing feature fusion modeling on this dataset was better than that of the best-performing single-preprocessing model. As shown in Table 5, the combination of raw spectra with the PLSR model achieved an rp of 0.6764 and an RMSEP of 0.4717, representing the best result among the individual modeling results under different preprocessing methods for Dataset 3. The MA-PLSR model yielded an rp of 0.6712 and an RMSEP of 0.4757, while the rp values of the SNV-CNN and MSC-CNN models were 0.6090 and 0.5521, respectively. In contrast, the rp values of the multi-preprocessing feature fusion CNN, PLSR, and SVR models increased to 0.7064, 0.6996, and 0.6570, respectively, while the RMSEP values decreased to 0.5239, 0.4572, and 0.4809, respectively. Compared with the individual modeling results, all three fusion models showed certain improvements in both the correlation coefficient and prediction error.
Overall, on Dataset 3, multi-preprocessing feature fusion modeling performed relatively well as a whole. Although certain differences still existed among the models, all three fusion models achieved relatively high rp values and relatively low RMSEP values, indicating that under the conditions of this dataset, multi-preprocessing feature fusion modeling had certain advantages over individual spectral modeling and was able to integrate information from different spectral representations relatively stably to improve model predictive performance.

4.3. VIP-Based Wavelength Importance Analysis

To further improve the wavelength-level interpretability of the models, variable importance in projection (VIP) analysis was performed based on the MA-PLSR models of the three datasets. MA-preprocessed spectra were selected because MA showed relatively good overall predictive performance and stability across different datasets and modeling methods in this study. In addition, compared with the multi-branch CNN fusion model, VIP analysis with the MA-PLSR model is more straightforward to implement and interpret. After the optimal number of latent variables was determined according to the validation-set correlation coefficient, VIP scores were calculated from the final PLSR model. Wavelength variables with VIP scores greater than 1 were considered to have relatively high contributions to SSC prediction.
As shown in Figure 11, the VIP curves indicated that the MA-PLSR models mainly relied on several specific wavelength regions rather than uniformly using all spectral variables. For Dataset 1, the important wavelength regions were mainly distributed at the shorter-wavelength end of the measured spectral range and in the longer-wavelength regions of approximately 1410–1480 nm and 1500–1568 nm. For Dataset 2, the important regions showed a generally similar distribution pattern, mainly appearing around 1380–1490 nm and 1520–1568 nm, while relatively high VIP values were also observed at the shorter-wavelength end. The VIP curve of Dataset 2 was more continuous, which may be related to the higher number of spectral variables after resampling and the smoothing effect of MA preprocessing. For Dataset 3, high-VIP regions were mainly observed around 1350–1500 nm and 1560–1568 nm, and relatively high VIP values were also present at the shorter-wavelength end. Compared with Dataset 1 and Dataset 2, the VIP curve of Dataset 3 showed stronger fluctuations, which may be associated with the larger sample variability in this dataset.
From the perspective of spectral interpretation, the relatively consistent high-VIP regions above approximately 1350 nm may provide useful spectral information for SSC prediction. In particular, the region around 1400–1500 nm is commonly associated with O-H overtone absorption and water-related spectral responses, while the longer-wavelength region near and above 1500 nm may contain overlapping information related to O-H and C-H overtones or combination bands [43]. Since SSC is closely related to soluble sugars and other soluble substances in pear fruit, these wavelength regions may contribute to the prediction of SSC. However, because near-infrared absorption bands are generally broad and overlapping, these interpretations should be regarded as supportive wavelength-level explanations rather than direct identification of specific sugar components.

5. Discussion

5.1. Effects of Different Preprocessing Methods

Different preprocessing methods had different effects on model predictive performance. MA showed relatively good predictive performance and stability across all three datasets while preserving the waveform characteristics of the raw spectra. Although SNV and MSC can reduce the influence of scattering effects, their predictive performance was inconsistent across the three datasets. While removing scattering interference, SNV and MSC may also have weakened, to some extent, the absorption features related to SSC, thereby leading to inconsistent performance across different datasets. D1 and D2 can effectively eliminate baseline drift and background interference and highlight spectral features, but they may also amplify spectral noise and cause the loss of some useful features, resulting in decreased predictive accuracy. In particular, when the spectral dimensionality is relatively high, derivative methods are more likely to amplify noise, which explains the general decline in predictive accuracy after derivative preprocessing in Dataset 2 and Dataset 3.
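The five preprocessing methods compared above can be sketched in a few lines of NumPy/SciPy; the window and polynomial settings below are illustrative, not the study's exact parameters (Savitzky-Golay differentiation is used here as a common way to realize D1/D2).

```python
# Minimal sketches of the five preprocessing methods (illustrative settings).
import numpy as np
from scipy.signal import savgol_filter

def moving_average(X, window=5):
    """MA smoothing: convolve each spectrum (row) with a flat window."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda s: np.convolve(s, kernel, mode="same"), 1, X)

def snv(X):
    """Standard normal variate: center and scale each spectrum individually."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against the mean spectrum."""
    ref = X.mean(axis=0) if reference is None else reference
    out = np.empty_like(X, dtype=float)
    for i, s in enumerate(X):
        slope, intercept = np.polyfit(ref, s, 1)   # fit s ~ slope*ref + intercept
        out[i] = (s - intercept) / slope
    return out

def derivative(X, order=1, window=7, poly=2):
    """D1/D2 via Savitzky-Golay differentiation along the wavelength axis."""
    return savgol_filter(X, window, poly, deriv=order, axis=1)
```

The per-spectrum normalization in `snv` and the regression against a common reference in `msc` make visible why both methods cluster the spectra, and why they can also attenuate absolute intensity differences that carry SSC-related information.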
In the multi-preprocessing feature fusion modeling, the overall performance on Dataset 2 showed only limited improvement compared with the results of multiple models built using individual preprocessing methods. Considering that this dataset was generated by expanding low-dimensional raw spectra through linear interpolation, the poorer modeling results may be attributed to the fact that the increase in dimensionality after interpolation does not necessarily mean a simultaneous enhancement of useful information, but may also introduce feature information unrelated to SSC prediction. Therefore, under this type of data condition, multi-preprocessing feature fusion did not further demonstrate a clear advantage. This also indirectly indicates that the effectiveness of multi-preprocessing feature fusion modeling is related to the structural characteristics of the data itself.
It should be noted that linear interpolation is a spectral resampling method and does not introduce new chemical information. Since the newly generated variables in Dataset 2 were calculated from adjacent original spectral variables, interpolation may increase local correlations among neighboring variables. However, the spectral profiles showed that Dataset 2 generally preserved the overall shape and main variation trends of Dataset 1, indicating that the original spectral morphology was not markedly distorted. The more noticeable fluctuations observed under D1 and D2 preprocessing suggest that the interpolated high-dimensional spectra may be more sensitive to derivative-based transformations. Therefore, in this study, Dataset 2 was used as an auxiliary resampled dataset to compare model performance under unified dimensionality, rather than as truly high-resolution spectral data containing additional chemical information.
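The resampling step described above can be sketched in a few lines. The evenly spaced wavelength grids below are assumptions for illustration (the instrument's actual band positions may differ); the point is that each new variable is a weighted average of adjacent original variables, so no new chemical information is introduced:

```python
import numpy as np

# Assumed evenly spaced grids: 74 source bands (as in Dataset 1) and a
# 483-point target grid (matching Dataset 3) over 1086-1568 nm.
wl_src = np.linspace(1086, 1568, 74)
wl_dst = np.linspace(1086, 1568, 483)

def resample_linear(spectra, wl_src, wl_dst):
    # Linear interpolation of each spectrum onto the denser grid
    return np.array([np.interp(wl_dst, wl_src, s) for s in spectra])
```

Because the two grids share their endpoints, the first and last interpolated values coincide exactly with the original ones, and the overall spectral shape is preserved, consistent with the spectral profiles discussed above.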

5.2. Effects of Different Modeling Methods

The modeling methods also differed in predictive performance. As a linear regression method, PLSR adapts well to the largely linear relationship between near-infrared spectra and SSC and therefore showed relatively stable predictive performance across all three datasets. CNN, by contrast, has strong nonlinear mapping capability, but this advantage typically emerges when the data have complex nonlinear structure; the spectral features in this study were relatively simple, so the strengths of CNN could not be fully exploited, and its predictive performance was slightly inferior to that of PLSR in most cases. SVR is built on the principle of structural risk minimization and is theoretically suited to small-sample nonlinear regression, but its results are sensitive to the choice of kernel function and hyperparameter settings, and its predictive performance in this study was not sufficiently stable. Notably, the CNN model did not achieve the best accuracy among the three models, and similar patterns have been observed in related studies [44,45,46].

5.3. Advantages of Multi-Preprocessing Feature Fusion

Because different preprocessing methods place different emphases on denoising and scatter correction, spectra preprocessed in different ways carry complementary information; fusing their features therefore yields a more comprehensive representation and improves the model’s ability to predict SSC. In Dataset 1 and Dataset 3, the fusion models outperformed the best-performing single-preprocessing model. Notably, in Dataset 1 the CNN model after feature fusion exceeded the best PLSR result obtained from single-preprocessing modeling on that dataset, indicating that the multi-branch CNN structure can more fully exploit the complementary information among spectra preprocessed by different methods. In Dataset 2, although the improvement in accuracy was limited in some models, all three models maintained comparably high prediction accuracy for Xinyu pear SSC, demonstrating that the fusion strategy remained feasible for interpolated data. Other researchers have likewise shown that preprocessing or transforming raw spectral features and then performing feature fusion can effectively improve model predictive performance [47,48,49,50].

6. Conclusions

Based on three datasets of Xinyu pear, this study systematically compared five preprocessing methods (MA, SNV, MSC, D1, and D2) and three regression modeling methods (PLSR, SVR, and CNN), evaluating the performance of each preprocessing method under each model across datasets of similar samples. A multi-preprocessing feature fusion model was also constructed to make fuller use of the spectral features produced by different preprocessing methods. The results showed that the preprocessing methods affected the SSC prediction performance of Xinyu pear differently depending on the modeling method. MA delivered the best overall performance across datasets and models; SNV and MSC were slightly inferior to MA but still showed relatively stable predictive accuracy; D1 degraded performance in some models, and the degradation under D2 was more frequent and more pronounced. The fusion results further showed that exploiting the complementary information among different preprocessing methods is a feasible way to improve the SSC prediction accuracy of Xinyu pear. In addition, VIP analysis based on the MA-PLSR models identified several wavelength regions with relatively high contributions to SSC prediction, providing supplementary wavelength-level interpretability for the modeling results. This study provides a reference for investigating combinations of preprocessing methods and modeling approaches across different datasets of similar samples for the nondestructive detection of SSC in Xinyu pear. Future work will expand the types of samples and modeling methods and explore further combinations of preprocessing and modeling under more conditions.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16104732/s1, Figure S1: CNN models for Dataset 1 under different preprocessing methods. (a) CNN model for raw spectra of Dataset 1; (b) CNN model for spectra preprocessed by MA of Dataset 1; (c) CNN model for spectra preprocessed by MSC/D1/SNV of Dataset 1; (d) CNN model for spectra preprocessed by D2 of Dataset 1. Figure S2: CNN model designs for Dataset 2 under different preprocessing methods. (a) CNN model for raw spectra and spectra preprocessed by MA/MSC of Dataset 2; (b) CNN model for spectra preprocessed by SNV of Dataset 2; (c) CNN model for spectra preprocessed by D1 of Dataset 2; (d) CNN model for spectra preprocessed by D2 of Dataset 2. Figure S3: CNN model designs for Dataset 3 under different preprocessing methods. (a) CNN model for raw spectra and spectra preprocessed by SNV of Dataset 3; (b) CNN model for spectra preprocessed by MA of Dataset 3; (c) CNN model for spectra preprocessed by MSC/D1 of Dataset 3; (d) CNN model for spectra preprocessed by D2 of Dataset 3.

Author Contributions

H.Q.: Conceptualization, Supervision, Funding Acquisition, Resources, Methodology, Project Administration, Writing—Original Draft, Writing—Review and Editing; H.W.: Investigation, Methodology, Software, Formal Analysis, Data Curation, Visualization, Writing—Original Draft, Writing—Review and Editing; Q.L.: Investigation, Methodology, Writing—Review and Editing; Z.H.: Investigation, Writing—Review and Editing; C.Z.: Conceptualization, Investigation, Methodology, Visualization, Validation, Writing—Original Draft, Writing—Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Huzhou Key R&D Program (Grant Number: 2023ZD2030).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, C.; Li, H.; Ren, A.; Chen, G.; Ye, W.; Wu, Y.; Ma, P.; Yu, W.; He, T. A Comparison of the Mineral Element Content of 70 Different Varieties of Pear Fruit (Pyrus ussuriensis) in China. PeerJ 2023, 11, e15328.
2. Zhang, Y.; Cheng, Y.; Ma, Y.; Guan, J.; Zhang, H. Regulation of Pear Fruit Quality: A Review Based on Chinese Pear Varieties. Agronomy 2025, 15, 58.
3. Li, J.; Zhang, M.; Li, X.; Khan, A.; Kumar, S.; Allan, A.C.; Lin-Wang, K.; Espley, R.V.; Wang, C.; Wang, R.; et al. Pear Genetics: Recent Advances, New Prospects, and a Roadmap for the Future. Hortic. Res. 2022, 9, uhab040.
4. Zhang, H.; Lai, L.; Gu, J.; Wen, L.; Li, X.; Wang, C. Applications of Near-Infrared Spectroscopy in Pear Quality Assessment: A Comprehensive Review. J. Food Process Eng. 2025, 48, e70086.
5. Anjali; Jena, A.; Bamola, A.; Mishra, S.; Jain, I.; Pathak, N.; Sharma, N.; Joshi, N.; Pandey, R.; Kaparwal, S.; et al. State-of-the-Art Non-Destructive Approaches for Maturity Index Determination in Fruits and Vegetables: Principles, Applications, and Future Directions. Food Prod. Process. Nutr. 2024, 6, 56.
6. He, Y.; Xiao, Q.; Bai, X.; Zhou, L.; Liu, F.; Zhang, C. Recent Progress of Nondestructive Techniques for Fruits Damage Inspection: A Review. Crit. Rev. Food Sci. 2022, 62, 5476–5494.
7. Nicolaï, B.M.; Beullens, K.; Bobelyn, E.; Peirs, A.; Saeys, W.; Theron, K.I.; Lammertyn, J. Nondestructive Measurement of Fruit and Vegetable Quality by Means of NIR Spectroscopy: A Review. Postharvest Biol. Technol. 2007, 46, 99–118.
8. Liu, J.; Sun, J.; Wang, Y.; Liu, X.; Zhang, Y.; Fu, H. Non-Destructive Detection of Fruit Quality: Technologies, Applications and Prospects. Foods 2025, 14, 2137.
9. Guo, Z.; Zhang, Y.; Wang, J.; Liu, Y.; Jayan, H.; El-Seedi, H.R.; Alzamora, S.M.; Gómez, P.L.; Zou, X. Detection Model Transfer of Apple Soluble Solids Content Based on NIR Spectroscopy and Deep Learning. Comput. Electron. Agric. 2023, 212, 108127.
10. Bu, Y.; Luo, J.; Tian, Q.; Li, J.; Cao, M.; Yang, S.; Guo, W. Nondestructive Detection of Internal Quality in Multiple Peach Varieties by Vis/NIR Spectroscopy with Multi-Task CNN Method. Postharvest Biol. Technol. 2025, 227, 113579.
11. Li, C.; Jin, C.; Zhai, Y.; Pu, Y.; Qi, H.; Zhang, C. Simultaneous Detection of Citrus Internal Quality Attributes Using Near-Infrared Spectroscopy and Hyperspectral Imaging with Multi-Task Deep Learning and Instrumental Transfer Learning. Food Chem. 2025, 481, 143996.
12. Gong, Z.; Zhi, Z.; Zhang, C.; Cao, D. Non-Destructive Detection of Soluble Solids Content in Fruits: A Review. Chemistry 2025, 7, 115.
13. Zhao, Y.; Li, Q.; An, C.; Tao, K.; Yu, Y.; Xu, H. Improving the Prediction Performance of Soluble Solid Content in Bagged “Cuiguan” Pear Using Vis/NIR Spectroscopy with Spectral Correction. Food Control 2026, 179, 111596.
14. Yu, Y.; Yao, M. Is This Pear Sweeter than This Apple? A Universal SSC Model for Fruits with Similar Physicochemical Properties. Biosyst. Eng. 2023, 226, 116–131.
15. Che, J.; Liang, Q.; Xia, Y.; Liu, Y.; Li, H.; Hu, N.; Cheng, W.; Zhang, H.; Lan, H. The Study on Nondestructive Detection Methods for Internal Quality of Korla Fragrant Pears Based on Near-Infrared Spectroscopy and Machine Learning. Foods 2024, 13, 3522.
16. Xin, Z.; Ju, S.; Zhang, D.; Zhou, X.-G.; Guo, S.; Pan, Z.; Wang, L.; Cheng, T. Construction of Spectral Detection Models to Evaluate Soluble Solids Content and Acidity in Dangshan Pear Using Two Different Sensors. Infrared Phys. Technol. 2023, 131, 104632.
17. Xue, H.; Zhang, H.; Liu, Y.; Wang, H.; Zhang, H.; Xu, Q.; Guo, J. An Accurate Firmness Detection Method for Korla Pears Based on Data Augmentation. J. Food Compos. Anal. 2026, 149, 108829.
18. Jiang, T.; Zuo, W.; Ding, J.; Yuan, S.; Qian, H.; Cheng, Y.; Guo, Y.; Yu, H.; Yao, W. Machine Learning Driven Benchtop Vis/NIR Spectroscopy for Online Detection of Hybrid Citrus Quality. Food Res. Int. 2025, 201, 115617.
19. Carvalho, J.K.; Moura-Bueno, J.M.; Ramon, R.; Almeida, T.F.; Naibo, G.; Martins, A.P.; Santos, L.S.; Gianello, C.; Tiecher, T. Combining Different Pre-Processing and Multivariate Methods for Prediction of Soil Organic Matter by Near Infrared Spectroscopy (NIRS) in Southern Brazil. Geoderma Reg. 2022, 29, e00530.
20. Sharabiani, V.R.; Saadati, N.; Alizadeh, F.; Szymanek, M. Non-Destructive Assessment of Quality Parameters in Javadi Cv. Peach Fruits Using Vis/NIR Spectroscopy and Multiple Regression Analysis. Food Chem. 2025, 495, 146401.
21. Jiang, X.; Zhu, M.; Yao, J.; Zhang, Y.; Liu, Y. Calibration of Near Infrared Spectroscopy of Apples with Different Fruit Sizes to Improve Soluble Solids Content Model Performance. Foods 2022, 11, 1923.
22. Vega-Castellote, M.; Pérez-Marín, D.; Torres-Rodríguez, I.; Sánchez, M.-T. Implementing Near Infrared Spectroscopy for the Online Internal Quality and Maturity Stage Classification of Intact Watermelons at Industry Level. Spectrochim. Acta A 2025, 339, 126254.
23. Zhou, J.; Liu, X.; Sun, R.; Sun, L. Rapid Nondestructive Detection of the Pulp Firmness and Peel Color of Figs by NIR Spectroscopy. Food Anal. Methods 2022, 15, 2575–2593.
24. Zhao, Y.; Zhou, L.; Wang, W.; Zhang, X.; Gu, Q.; Zhu, Y.; Chen, R.; Zhang, C. Visible/Near-Infrared Spectroscopy and Hyperspectral Imaging Facilitate the Rapid Determination of Soluble Solids Content in Fruits. Food Eng. Rev. 2024, 16, 470–496.
25. Pornchaloempong, P.; Sharma, S.; Phanomsophon, T.; Srisawat, K.; Inta, W.; Sirisomboon, P.; Prinyawiwatkul, W.; Nakawajana, N.; Lapcharoensuk, R.; Teerachaichayut, S. Non-Destructive Quality Evaluation of Tropical Fruit (Mango and Mangosteen) Purée Using Near-Infrared Spectroscopy Combined with Partial Least Squares Regression. Agriculture 2022, 12, 2060.
26. Liu, L.; Zhang, H.; Wu, L.; Gu, S.; Xu, J.; Jia, B.; Ye, Z.; Heng, W.; Jin, X. An Early Asymptomatic Diagnosis Method for Cork Spot Disorder in ‘Akizuki’ Pear (Pyrus pyrifolia Nakai) Using Micro Near Infrared Spectroscopy. Food Chem. X 2023, 19, 100851.
27. Agustina, S.; Devianti; Bulan, R.; Muslih, M.; Sitorus, A. Performance Evaluation of Pre-Processing and Pre-Treatment Algorithm for Near-Infrared Spectroscopy Signals: Case Study pH of Intact Mango “Arumanis”. Int. J. Des. Nat. Ecodynamics 2022, 17, 571–577.
28. Yang, X.; Zhu, L.; Huang, X.; Zhang, Q.; Li, S.; Chen, Q.; Wang, Z.; Li, J. Determination of the Soluble Solids Content in Korla Fragrant Pears Based on Visible and Near-Infrared Spectroscopy Combined with Model Analysis and Variable Selection. Front. Plant Sci. 2022, 13, 938162.
29. Zhan, B.; Li, P.; Li, M.; Luo, W.; Zhang, H. Detection of Soluble Solids Content (SSC) in Pears Using Near-Infrared Spectroscopy Combined with LASSO–GWF–PLS Model. Agriculture 2023, 13, 1491.
30. Li, P.; Jin, Q.; Liu, H.; Han, L.; Li, C.; Luo, Y. Determination of Soluble Solids Content in Loquat Using Near-Infrared Spectroscopy Coupled with Broad Learning System and Hybrid Wavelength Selection Strategy. LWT-Food Sci. Technol. 2024, 206, 116570.
31. Guo, Z.; Chen, X.; Zhang, Y.; Sun, C.; Jayan, H.; Majeed, U.; Watson, N.J.; Zou, X. Dynamic Nondestructive Detection Models of Apple Quality in Critical Harvest Period Based on Near-Infrared Spectroscopy and Intelligent Algorithms. Foods 2024, 13, 1698.
32. Xu, S.; Lu, H.; Liang, X.; Ference, C.; Qiu, G.; Fan, C. Modeling and De-Noising for Nondestructive Detection of Total Soluble Solid Content of Pomelo by Using Visible/Near Infrared Spectroscopy. Foods 2023, 12, 2966.
33. Liu, Y.; Huo, Y.; Wang, G.; Li, X. Optical Properties Combined with Convolutional Neural Networks to Predict Soluble Solids Content of Peach. J. Food Meas. Charact. 2023, 17, 5012–5023.
34. Xia, Y.; Lei, H.; Zhang, W.; Che, T.; Yang, Y.; Liu, W.; Kang, J.; Tang, W.; Fan, S. Recent Advances in Emerging Techniques for Optical Properties Analysis of Fruits and Vegetables: A Review. Food Bioprocess Technol. 2026, 19, 264.
35. Liu, S.; Huang, W.; Lin, L.; Fan, S. Effects of Orientations and Regions on Performance of Online Soluble Solids Content Prediction Models Based on Near-Infrared Spectroscopy for Peaches. Foods 2022, 11, 1502.
36. Cai, L.; Li, J.; Zhang, H.; Zhang, Y.; Zhang, J.; Hao, H. Determination of the SSC in Oranges Using Vis-NIR Full Transmittance Hyperspectral Imaging and Spectral Visual Coding: A Practical Solution to the Scattering Problem of Inhomogeneous Mixtures. Food Chem. 2025, 474, 143239.
37. Cai, L.; Zhang, Y.; Cai, Z.; Shi, R.; Li, S.; Li, J. Detection of Soluble Solids Content in Tomatoes Using Full Transmission Vis-NIR Spectroscopy and Combinatorial Algorithms. Front. Plant Sci. 2024, 15, 1500819.
38. Janaszek-Mańkowska, M.; Mańkowski, D.R. Hyperspectral Classification of Kiwiberry Ripeness for Postharvest Sorting Using PLS-DA and SVM: From Baseline Models to Meta-Inspired Stacked SVM. Processes 2025, 13, 3446.
39. Xia, Y.; Liu, Y.; Zhang, H.; Che, J.; Liang, Q. Study on Color Detection of Korla Fragrant Pears by Near-Infrared Spectroscopy Combined with PLSR. Horticulturae 2025, 11, 352.
40. Liu, Q.; Yu, C.; Ma, Y.; Zhang, H.; Yan, L.; Fan, S. Prediction of Key Quality Parameters in Hot Air-Dried Jujubes Based on Hyperspectral Imaging. Foods 2025, 14, 1855.
41. Ige, A.O.; Sibiya, M. State-of-the-Art in 1D Convolutional Neural Networks: A Survey. IEEE Access 2024, 12, 144082–144105.
42. Li, Z.-Y.; Huang, X.; Yang, J.-X.; Luo, S.-H.; Wang, J.; Fang, Q.-L.; Hui, A.-L.; Liang, F.-X.; Wu, C.-Y.; Wang, L.; et al. An Improved 1D CNN with Multi-Sensor Spectral Fusion for Detection of SSC in Pears. J. Food Compos. Anal. 2025, 144, 107732.
43. Golic, M.; Walsh, K.; Lawson, P. Short-Wavelength Near-Infrared Spectra of Sucrose, Glucose, and Fructose with Respect to Sugar Concentration and Temperature. Appl. Spectrosc. 2003, 57, 139–145.
44. Tang, Z.; Ma, S.; Qi, H.; Zhang, X.; Zhang, C. Nondestructive Detection of Rice Milling Quality Using Hyperspectral Imaging with Machine and Deep Learning Regression. Foods 2025, 14, 1977.
45. Park, S.; Yang, M.; Yim, D.G.; Jo, C.; Kim, G. VIS/NIR Hyperspectral Imaging with Artificial Neural Networks to Evaluate the Content of Thiobarbituric Acid Reactive Substances in Beef Muscle. J. Food Eng. 2023, 350, 111500.
46. Sharma, S.; Sirisomboon, P.; K.C., S.; Terdwongworakul, A.; Phetpan, K.; Kshetri, T.B.; Sangwanangkul, P. Near-Infrared Hyperspectral Imaging Combined with Machine Learning for Physicochemical-Based Quality Evaluation of Durian Pulp. Postharvest Biol. Technol. 2023, 200, 112334.
47. Song, Y.; Yi, W.; Liu, Y.; Zhang, C.; Wang, Y.; Ning, J. A Robust Deep Learning Model for Predicting Green Tea Moisture Content during Fixation Using Near-Infrared Spectroscopy: Integration of Multi-Scale Feature Fusion and Attention Mechanisms. Food Res. Int. 2025, 203, 115874.
48. Chen, C.; Wang, T.; Zhou, G.; Wu, Z.; Liu, J.; Yang, X.; Yan, H.; Duan, J. Near-Infrared Spectroscopy Combined with Multi-Source Feature Fusion and Transformer for Identifying the Extent of Sulfur Fumigation in Dried Ginger. Spectrochim. Acta A 2026, 350, 127396.
49. Tan, A.; Wang, H.; Zuo, Y.; Zhao, R.; Ma, W.; He, Y.; Zhao, Y. IFCNN-Based Fusion of GAF and MTF Encoded Near-Infrared Spectral Images for Quantitative Analysis of Microplastics. Spectrochim. Acta A 2026, 348, 127069.
50. Chen, B.; Li, Q.; Zhang, F.; Hu, Z. A Multi-Branch and Multi-Level Feature Extraction Network Model for Near-Infrared Spectroscopy Quantitative Analysis. Infrared Phys. Technol. 2026, 153, 106264.
Figure 1. Schematic diagram of the fiber-optic probe measurement positions on the equatorial plane of a Xinyu pear sample. The yellow dashed line denotes the equatorial plane, and the red dots indicate the measurement positions.
Figure 2. Structure of the multi-branch 1D-CNN feature fusion model for Dataset 1. The four branches correspond to raw, MA, SNV, and MSC spectra.
Figure 3. Structure of the multi-branch 1D-CNN feature fusion model for Dataset 2. The four branches correspond to raw, MA, SNV, and MSC spectra.
Figure 4. Structure of the multi-branch 1D-CNN feature fusion model for Dataset 3. The four branches correspond to raw, MA, SNV, and MSC spectra.
Figure 5. Raw and preprocessed spectral profiles of Dataset 1 (136 samples, 74 spectral bands, 1086–1568 nm). (a) Raw spectra; (b) MA; (c) SNV; (d) MSC; (e) D1; (f) D2.
Figure 6. Raw and preprocessed spectral profiles of Dataset 2 (136 samples, 483 spectral bands after linear interpolation, 1086–1568 nm). (a) Raw spectra; (b) MA; (c) SNV; (d) MSC; (e) D1; (f) D2.
Figure 7. Raw and preprocessed spectral profiles of Dataset 3 (552 samples, 483 spectral bands, 1086–1568 nm). (a) Raw spectra; (b) MA; (c) SNV; (d) MSC; (e) D1; (f) D2.
Figure 8. Comparison of prediction-set rp among different preprocessing methods and regression models (PLSR, SVR, CNN) for Dataset 1.
Figure 9. Comparison of prediction-set rp among different preprocessing methods and regression models (PLSR, SVR, CNN) for Dataset 2.
Figure 10. Comparison of prediction-set rp among different preprocessing methods and regression models (PLSR, SVR, CNN) for Dataset 3.
Figure 11. VIP scores of wavelength variables calculated from the MA-PLSR models for the three datasets. The vertical axis represents the VIP score, and the horizontal axis represents wavelength (nm). (a) Dataset 1; (b) Dataset 2; (c) Dataset 3. The dashed line indicates the VIP = 1 threshold.
Table 1. Summary statistics (minimum, maximum, mean, and standard deviation) of SSC (°Brix) for the training, validation, and prediction sets of Dataset 1 and Dataset 3.
| Dataset | Sample Set | Number | Minimum (°Brix) | Maximum (°Brix) | Mean (°Brix) | Standard Deviation (°Brix) |
|---|---|---|---|---|---|---|
| Dataset 1 | Training set | 92 | 9.7 | 13.3 | 11.3 | 0.8 |
| Dataset 1 | Validation set | 22 | 8.8 | 12.9 | 11.1 | 1.1 |
| Dataset 1 | Prediction set | 22 | 9.2 | 13.7 | 11.1 | 1.0 |
| Dataset 3 | Training set | 414 | 8.4 | 13.3 | 11.4 | 0.6 |
| Dataset 3 | Validation set | 69 | 10.1 | 13.4 | 11.6 | 0.6 |
| Dataset 3 | Prediction set | 69 | 10.0 | 13.0 | 11.5 | 0.6 |
Table 2. Purposes, main effects, and parameter settings of the five spectral preprocessing methods (MA, SNV, MSC, D1, D2) used in this study.
| Method | Purpose | Main Effect | Settings in This Study |
|---|---|---|---|
| MA | Noise reduction | Smooths the spectral curve while preserving the overall spectral trend. | Window length = 7 |
| SNV | Scatter correction | Centers and standardizes each individual spectrum, reducing baseline drift and scale differences among samples. | Applied independently to each spectrum |
| MSC | Scatter correction | Reduces baseline drift and scaling deviation relative to a reference spectrum. | Training-set mean spectrum used as reference |
| D1 | Baseline correction and feature enhancement | Enhances first-order spectral changes and local slope information. | SG filter; window length = 7; polynomial order = 1 |
| D2 | Feature enhancement | Highlights second-order spectral variations, but may be more sensitive to noise. | SG filter; window length = 7; polynomial order = 2 |
Table 3. Prediction performance (rc, rv, rp, RMSEC, RMSEV, RMSEP, and RPDp) of PLSR, SVR, and CNN models under different preprocessing methods for Dataset 1.
| Preprocessing | Model | rc | rv | rp | RMSEC | RMSEV | RMSEP | RPDp |
|---|---|---|---|---|---|---|---|---|
| Raw spectra | PLSR | 0.9355 | 0.8931 | 0.8573 | 0.2961 | 0.5783 | 0.5622 | 1.8321 |
| Raw spectra | SVR | 0.8308 | 0.7990 | 0.7504 | 0.4745 | 0.7851 | 0.6922 | 1.4880 |
| Raw spectra | CNN | 0.8222 | 0.8049 | 0.7680 | 0.4971 | 0.6993 | 0.6570 | 1.5677 |
| MA | PLSR | 0.9135 | 0.8877 | 0.8334 | 0.3410 | 0.6161 | 0.5596 | 1.8406 |
| MA | SVR | 0.7926 | 0.7696 | 0.7464 | 0.5151 | 0.8005 | 0.6934 | 1.4854 |
| MA | CNN | 0.7986 | 0.7982 | 0.7990 | 0.5076 | 0.7885 | 0.6750 | 1.5259 |
| SNV | PLSR | 0.9039 | 0.7870 | 0.7789 | 0.3585 | 0.7713 | 0.7078 | 1.4552 |
| SNV | SVR | 0.7960 | 0.7805 | 0.7507 | 0.5089 | 0.7614 | 0.6684 | 1.5410 |
| SNV | CNN | 0.8002 | 0.8042 | 0.7930 | 0.5297 | 0.6751 | 0.6397 | 1.6101 |
| MSC | PLSR | 0.9009 | 0.7913 | 0.7851 | 0.3638 | 0.7498 | 0.6945 | 1.4831 |
| MSC | SVR | 0.7957 | 0.7803 | 0.7495 | 0.5092 | 0.7627 | 0.6702 | 1.5368 |
| MSC | CNN | 0.8021 | 0.7957 | 0.7988 | 0.5366 | 0.6807 | 0.6404 | 1.6083 |
| D1 | PLSR | 0.8942 | 0.8190 | 0.7739 | 0.3753 | 0.6860 | 0.6796 | 1.5156 |
| D1 | SVR | 0.7809 | 0.7651 | 0.7715 | 0.5290 | 0.7950 | 0.6965 | 1.4788 |
| D1 | CNN | 0.7857 | 0.7842 | 0.7869 | 0.5245 | 0.7722 | 0.6560 | 1.5701 |
| D2 | PLSR | 0.7794 | 0.8328 | 0.7107 | 0.5252 | 0.6876 | 0.7089 | 1.4529 |
| D2 | SVR | 0.6996 | 0.6642 | 0.6970 | 0.6014 | 0.8798 | 0.7530 | 1.3678 |
| D2 | CNN | 0.7563 | 0.6796 | 0.7426 | 0.7214 | 0.9041 | 0.8262 | 1.2467 |
Table 4. Prediction performance (rc, rv, rp, RMSEC, RMSEV, RMSEP, and RPDp) of PLSR, SVR, and CNN models under different preprocessing methods for Dataset 2.
| Preprocessing | Model | rc | rv | rp | RMSEC | RMSEV | RMSEP | RPDp |
|---|---|---|---|---|---|---|---|---|
| Raw spectra | PLSR | 0.9137 | 0.8931 | 0.8334 | 0.3408 | 0.5599 | 0.5712 | 1.8032 |
| Raw spectra | SVR | 0.8098 | 0.7632 | 0.7478 | 0.5012 | 0.8237 | 0.6994 | 1.4727 |
| Raw spectra | CNN | 0.8233 | 0.8110 | 0.8209 | 0.5707 | 0.8407 | 0.7915 | 1.3013 |
| MA | PLSR | 0.9243 | 0.8958 | 0.8682 | 0.3199 | 0.6088 | 0.5236 | 1.9671 |
| MA | SVR | 0.8070 | 0.7639 | 0.7505 | 0.5041 | 0.8215 | 0.6957 | 1.4805 |
| MA | CNN | 0.8148 | 0.8126 | 0.8143 | 0.5793 | 0.7198 | 0.6388 | 1.6124 |
| SNV | PLSR | 0.9053 | 0.7921 | 0.7897 | 0.3560 | 0.7628 | 0.6905 | 1.4916 |
| SNV | SVR | 0.8219 | 0.7563 | 0.8084 | 0.4832 | 0.7645 | 0.5926 | 1.7381 |
| SNV | CNN | 0.8618 | 0.8275 | 0.8501 | 0.5030 | 0.7466 | 0.7480 | 1.3770 |
| MSC | PLSR | 0.9014 | 0.8003 | 0.7961 | 0.3629 | 0.7331 | 0.6719 | 1.5329 |
| MSC | SVR | 0.8238 | 0.7597 | 0.8029 | 0.4803 | 0.7611 | 0.6004 | 1.7155 |
| MSC | CNN | 0.8331 | 0.8269 | 0.8190 | 0.5319 | 0.7690 | 0.6475 | 1.5907 |
| D1 | PLSR | 0.9475 | 0.8627 | 0.7027 | 0.2681 | 0.6067 | 0.8669 | 1.1881 |
| D1 | SVR | 0.7661 | 0.6973 | 0.7843 | 0.5556 | 0.8576 | 0.6951 | 1.4818 |
| D1 | CNN | 0.9729 | 0.6080 | 0.7889 | 0.2335 | 0.8855 | 0.6775 | 1.5203 |
| D2 | PLSR | 0.7085 | 0.6921 | 0.5426 | 0.5916 | 0.8533 | 0.8888 | 1.1588 |
| D2 | SVR | 0.8661 | 0.5267 | 0.6097 | 0.4883 | 1.0037 | 0.8676 | 1.1872 |
| D2 | CNN | 0.7872 | 0.6215 | 0.6115 | 0.9234 | 1.0656 | 1.0316 | 0.9984 |
Table 5. Prediction performance (rc, rv, rp, RMSEC, RMSEV, RMSEP, and RPDp) of PLSR, SVR, and CNN models under different preprocessing methods for Dataset 3.
| Preprocessing | Model | rc | rv | rp | RMSEC | RMSEV | RMSEP | RPDp |
|---|---|---|---|---|---|---|---|---|
| Raw spectra | PLSR | 0.7218 | 0.7147 | 0.6764 | 0.4214 | 0.4808 | 0.4717 | 1.2983 |
| Raw spectra | SVR | 0.6696 | 0.6971 | 0.6420 | 0.4562 | 0.5108 | 0.4967 | 1.2330 |
| Raw spectra | CNN | 0.7176 | 0.7048 | 0.6631 | 0.4245 | 0.5067 | 0.4894 | 1.2513 |
| MA | PLSR | 0.7190 | 0.7143 | 0.6712 | 0.4231 | 0.4825 | 0.4757 | 1.2874 |
| MA | SVR | 0.6638 | 0.7024 | 0.6576 | 0.4596 | 0.5126 | 0.4920 | 1.2447 |
| MA | CNN | 0.7145 | 0.7264 | 0.6415 | 0.4441 | 0.5558 | 0.5540 | 1.1054 |
| SNV | PLSR | 0.6725 | 0.6838 | 0.6200 | 0.4505 | 0.5032 | 0.5189 | 1.1802 |
| SNV | SVR | 0.6649 | 0.5864 | 0.5497 | 0.4675 | 0.5545 | 0.5405 | 1.1330 |
| SNV | CNN | 0.7314 | 0.6211 | 0.6090 | 0.5154 | 0.6763 | 0.6819 | 0.8981 |
| MSC | PLSR | 0.6713 | 0.6813 | 0.6230 | 0.4512 | 0.5052 | 0.5164 | 1.1859 |
| MSC | SVR | 0.6650 | 0.5872 | 0.5509 | 0.4675 | 0.5542 | 0.5398 | 1.1345 |
| MSC | CNN | 0.7594 | 0.5695 | 0.5521 | 0.3985 | 0.5655 | 0.5610 | 1.0916 |
| D1 | PLSR | 0.6869 | 0.6825 | 0.6290 | 0.4424 | 0.5113 | 0.5006 | 1.2233 |
| D1 | SVR | 0.6100 | 0.6163 | 0.5981 | 0.4888 | 0.5430 | 0.5111 | 1.1982 |
| D1 | CNN | 0.6203 | 0.6103 | 0.5610 | 0.4857 | 0.5665 | 0.5531 | 1.1072 |
| D2 | PLSR | 0.7098 | 0.6238 | 0.5571 | 0.4289 | 0.5292 | 0.5329 | 1.1492 |
| D2 | SVR | 0.4016 | 0.2345 | 0.1241 | 0.5782 | 0.6499 | 0.6144 | 0.9968 |
| D2 | CNN | 0.3328 | 0.4321 | 0.3412 | 0.5864 | 0.5983 | 0.5807 | 1.0546 |
Table 6. Prediction performance (rc, rv, rp, RMSEC, RMSEV, RMSEP, and RPDp) of the multi-preprocessing feature fusion models (CNN, PLSR, SVR) for Dataset 1, Dataset 2, and Dataset 3.
| Dataset | Model | rc | rv | rp | RMSEC | RMSEV | RMSEP | RPDp |
|---|---|---|---|---|---|---|---|---|
| Dataset 1 | CNN | 0.9904 | 0.8854 | 0.8811 | 0.1260 | 0.6229 | 0.4978 | 2.0691 |
| Dataset 1 | PLSR | 0.9936 | 0.8853 | 0.8686 | 0.0945 | 0.6053 | 0.5155 | 1.9980 |
| Dataset 1 | SVR | 0.9937 | 0.8789 | 0.8758 | 0.0944 | 0.6149 | 0.5006 | 2.0575 |
| Dataset 2 | CNN | 0.9259 | 0.8546 | 0.8259 | 0.3300 | 0.6274 | 0.5762 | 1.7875 |
| Dataset 2 | PLSR | 0.9421 | 0.8514 | 0.8032 | 0.2811 | 0.5866 | 0.6029 | 1.7084 |
| Dataset 2 | SVR | 0.9226 | 0.8155 | 0.8288 | 0.3258 | 0.6729 | 0.5779 | 1.7823 |
| Dataset 3 | CNN | 0.7310 | 0.7251 | 0.7064 | 0.4416 | 0.5604 | 0.5239 | 1.1689 |
| Dataset 3 | PLSR | 0.7240 | 0.7058 | 0.6996 | 0.4200 | 0.4857 | 0.4572 | 1.3395 |
| Dataset 3 | SVR | 0.7576 | 0.7053 | 0.6570 | 0.3982 | 0.4823 | 0.4809 | 1.2735 |

Share and Cite

MDPI and ACS Style

Qi, H.; Wang, H.; Liao, Q.; Han, Z.; Zhang, C. Detection of Soluble Solid Content in Xinyu Pears Using Near-Infrared Spectroscopy and Deep Fusion of Multi-Preprocessed Spectral Data. Appl. Sci. 2026, 16, 4732. https://doi.org/10.3390/app16104732
