1. Introduction
As modern agricultural research and production practice become more closely linked, the precise measurement and in-depth analysis of crop physiological parameters have become essential for moving agriculture towards refined, scientifically informed management, thereby securing crop quality and yield [1,2].
As a fruit variety with distinctive characteristics and significant economic value in China, the Korla fragrant pear (Korla Xiangli) enjoys considerable popularity in both domestic and international markets. Leaf water content (LWC) is a key physiological index of crop growth, providing direct insight into growth trends, water metabolism levels, and the overall health status of plants. Understanding the mechanisms underlying the growth and development of Korla fragrant pear is therefore crucial for optimizing planting management strategies.
In recent years, spectral analysis technology has demonstrated significant potential in the detection of crop physiological parameters, owing to its advantages of non-destructive testing, rapid data acquisition, and the ability to reveal the chemical structural characteristics of substances [3,4,5]. Building on this foundation, in the present study, spectral data were collected from Korla fragrant pear leaves, characteristic information related to LWC was accurately extracted, and a reliable and efficient predictive model was developed.
Sample collection serves as the foundation of this research, with the quality of the dataset directly influencing the reliability and generalization capability of the model [6]. Korla fragrant pear exhibits significant variations in physiological characteristics and water requirements across different growth stages [7], particularly during the late fruit expansion phase (S1) and the early maturity phase (S2). During the S1 stage, the fruit undergoes rapid expansion, accompanied by active cell division and material accumulation, which heavily depend on nutrients synthesized through leaf photosynthesis and an adequate water supply [8]. LWC plays a pivotal role in regulating photosynthetic efficiency, nutrient transport, and overall plant metabolism, ultimately determining the quality of fruit expansion [9,10,11]. As the fruit transitions into the S2 stage, key quality attributes such as sweetness, color, and texture begin to stabilize [12].
However, existing research predominantly focuses on isolated growth stages, lacking a comprehensive exploration of the distinct key growth phases of Korla fragrant pear, such as the fruit enlargement and maturity stages. Furthermore, studies on the spectral detection of LWC in Korla fragrant pear remain limited, and a cohesive theoretical framework and technical methodology have yet to be established.
Given the unique characteristics of the S1 and S2 stages, this study employs a random splitting method to partition the dataset. This approach ensures an accurate capture of the data distribution patterns of pears across different growth phases while minimizing potential biases [13]. Specifically, 75% of the samples are allocated for model development, and the remaining 25% are reserved for testing model performance. Additionally, the sample size is carefully selected based on the distinct growth traits of the S1 and S2 stages, ensuring that the dataset comprehensively reflects the actual conditions of each phase. This approach lays a robust data foundation for subsequent research.
The spectral data obtained within the 4000–10,000 cm−1 range exhibit complex trends, with significant fluctuations in absorbance values. While these data contain rich spectral feature information, they are also plagued by interference factors such as baseline shifts and noise, which impede the precise extraction of LWC-related features [14,15,16]. Despite these challenges, existing research predominantly relies on a single preprocessing method, lacking a thorough exploration and optimization of the combined effects of multiple preprocessing techniques.
Spectral preprocessing plays a pivotal role in addressing the complexities of spectral data. The SG (Savitzky–Golay) convolution smoothing method enhances the smoothness of spectral curves, providing a stable foundation for subsequent analysis [17]. Meanwhile, the MSC (Multiplicative Scatter Correction) and SNV (Standard Normal Variate) techniques reduce baseline shifts and background interference, thereby enhancing spectral features and creating favorable conditions for identifying characteristic wavelengths [18]. On the other hand, FD (first derivative) and SD (second derivative) transformations excel at highlighting subtle spectral changes, such as the positions of peaks and valleys. However, these methods are prone to amplifying noise, potentially introducing spurious spikes (burrs) [19]. Given these considerations, the judicious selection and combination of preprocessing methods are crucial for optimizing spectral data quality.
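The two scatter-correction transforms named above can be illustrated with a minimal sketch. It assumes the spectra are stored as a NumPy array with one spectrum per row; the function names `snv` and `msc` are illustrative and not taken from the study, and the choice of the mean spectrum as the MSC reference is a common convention rather than something the text specifies.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) by its own mean and SD."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each spectrum is regressed on the reference (assumed here to be the mean
    spectrum), then the fitted offset is subtracted and the slope divided out.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, 1)  # row ~ intercept + slope * ref
        corrected[i] = (row - intercept) / slope
    return corrected
```

Both functions remove additive and multiplicative differences between spectra: SNV does so per spectrum without a reference, whereas MSC aligns every spectrum to a shared reference.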
Feature band selection is a critical factor influencing the overall performance of a model. However, research on the extraction of characteristic wavelengths for Korla fragrant pear LWC remains limited, and the variations in these wavelengths across different growth stages have yet to be fully elucidated. Notably, the LWC characteristic bands of Korla fragrant pear exhibit significant differences between the S1 and S2 periods, reflecting dynamic changes in leaf physiological traits, chemical composition, and spectral interactions. By precisely identifying these characteristic bands, researchers can optimize spectral models to better align with the actual growth patterns of pears, thereby enhancing the accuracy and reliability of leaf water content predictions. For instance, Ye et al. [20] employed a hyperspectral imaging system (866.4–1701.0 nm) combined with the CARS (Competitive Adaptive Reweighted Sampling) algorithm to extract spectral features, achieving 97.92% accuracy in shrimp freshness recognition using an ELM (Extreme Learning Machine) model. Tang et al. [21] utilized hyperspectral technology to detect soil nitrogen content in offshore environments. The researchers employed the SPA (Successive Projections Algorithm) to extract spectral features and developed a partial least squares regression (PLSR) model for soil total nitrogen (TN) content. The results demonstrated an R2 of 0.649 and an RPD (Residual Predictive Deviation) of 1.72, indicating that the SPA algorithm effectively extracted spectral features and achieved satisfactory performance. Inspired by these findings, this study adopts both the CARS and SPA algorithms to extract spectral feature bands and reduce the dimensionality of the spectral data.
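The projection step at the core of SPA can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes a fixed starting band and a fixed number of bands, whereas a full SPA typically also compares starting bands and subset sizes by validation error. The function name `spa_select` and its parameters are hypothetical.

```python
import numpy as np

def spa_select(spectra, n_bands, start=0):
    """Pick bands (columns) with minimal collinearity by successive projections.

    At each step all columns are projected onto the orthogonal complement of
    the most recently selected column, and the column with the largest
    remaining norm is chosen next.
    """
    projected = np.asarray(spectra, dtype=float).copy()
    selected = [start]
    for _ in range(n_bands - 1):
        v = projected[:, selected[-1]]
        # Remove the component along v from every column.
        projected = projected - np.outer(v, v @ projected) / (v @ v)
        norms = np.linalg.norm(projected, axis=0)
        norms[selected] = -1.0  # never re-select a band
        selected.append(int(np.argmax(norms)))
    return selected

# Toy example: column 1 is collinear with column 0, so column 2 is chosen.
toy = np.array([[1.0, 2.0, 0.0],
                [0.0, 0.0, 0.5],
                [0.0, 0.0, 0.0]])
chosen = spa_select(toy, n_bands=2, start=0)
```

Because a collinear band has no component left after projection, redundant bands are skipped, which is the dimensionality reduction the paragraph above refers to.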
In the model development phase, existing research predominantly relies on a single model for prediction, lacking a systematic comparative analysis of the applicability of different models across various growth stages. Additionally, studies on prediction models for Korla fragrant pear LWC remain scarce, and an efficient, reliable prediction model has yet to be established. Common prediction models, such as RFR (Random Forest Regression), BP (Backpropagation Neural Network), and SVR (Support Vector Regression), each possess unique strengths and limitations. Their performance metrics, such as R2 (coefficient of determination) and RMSE (root mean square error), vary considerably depending on the feature selection method employed. Therefore, a systematic comparative analysis is essential to evaluate the performance of each model under diverse conditions and to clarify their respective applicability and advantages.
In the model selection process in this study, we rigorously evaluated each model using key metrics, such as the relative deviation between the actual and predicted values. By identifying the most suitable models for different growth stages, the accuracy and reliability of the pear LWC model were ensured. This approach not only offers robust technical support for the precise irrigation and growth monitoring of Korla fragrant pear but also establishes a solid foundation for informed scientific decision-making in agricultural practices.
This study marks the first systematic application of spectral analysis technology to achieve the non-destructive and rapid detection of LWC in Korla fragrant pear, with a focus on two critical growth stages: fruit expansion and early maturity. By integrating advanced data preprocessing and feature selection methods, the quality of the data and the precision of feature extraction were significantly enhanced. Through a comparative analysis of three machine learning algorithms—RFR, BP, and SVR—the optimal LWC prediction model was identified. This research provides scientific support for the precise irrigation and growth monitoring of Korla fragrant pear while also offering a technical framework that can be adapted for physiological parameter detection in other crops.
3. Results and Analysis
3.1. Sample Collection Data Statistics
In the process of model construction and subsequent testing, to ensure that both the modeling and testing datasets accurately capture the inherent distribution characteristics of the entire dataset, we employed a scientifically robust data partitioning method. This approach effectively mitigates biases caused by specific data distributions, thereby enhancing the model’s generalization capability. Specifically, the dataset was divided according to a predefined ratio, with 75% of the total samples allocated to the modeling dataset for model construction and the remaining 25% reserved for the testing dataset to evaluate model performance.
For the two distinct growth stages of Korla fragrant pear, samples were carefully selected based on the aforementioned principles. During the S1 growth stage, 252 samples were assigned to the modeling dataset, while 108 samples were allocated to the testing dataset. Similarly, for the S2 growth stage, 137 samples were included in the modeling dataset, and 58 samples were allocated to the testing dataset.
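A random partition with the sample counts reported above could be produced along the following lines. This is a minimal sketch: the function name `random_split` and the fixed seed are assumptions for illustration, and the study does not state which tool or seed was used.

```python
import random

def random_split(n_samples, n_test, seed=0):
    """Randomly assign sample indices to a modeling set and a testing set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# Counts taken from the text: S1 has 252 + 108 samples, S2 has 137 + 58.
s1_train, s1_test = random_split(252 + 108, 108)
s2_train, s2_test = random_split(137 + 58, 58)
```

Fixing the seed keeps the modeling and testing sets identical across reruns, so differences between models cannot be attributed to a different partition.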
Figure 3 illustrates the distribution of samples across the modeling and testing datasets for both growth stages.
From Figure 3, the mean and standard deviation of LWC for each growth stage can be derived. In the S1 stage, the LWC in the modeling dataset ranged from 4.88% to 83.45%, with a standard deviation of 11.33%. In the testing dataset, the LWC ranged from 22.03% to 74.29%, with a standard deviation of 9.70% (Figure 3a). For the S2 stage, the LWC in the modeling dataset ranged from 28.77% to 76.52%, with a standard deviation of 6.09%, while in the testing dataset, the LWC ranged from 30.86% to 77.55%, with a standard deviation of 8.32% (Figure 3b).
The analysis of the mean and standard deviation values between the modeling and testing datasets for both growth stages reveals only minor differences. This outcome indicates that the dataset partitioning is reasonable and meets the requirements for subsequent model construction and performance evaluation.
3.2. Spectral Collection Data Visualization
The original spectral data usually suffer from problems such as noise interference, baseline drift, abnormal samples, high dimensionality and redundant information, uneven sample distribution, and overlapping spectral features. These problems may affect data quality and model performance. Because the Mahalanobis distance accounts for the distribution (covariance structure) of the data, it measures the distance between samples more accurately and can effectively detect abnormal samples, which improves data quality and model robustness. Therefore, the Mahalanobis distance method was used in this study to screen the original data.
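A minimal sketch of Mahalanobis-distance screening is given below. It assumes each sample has already been reduced to a few scores (for example, principal component scores), because the covariance matrix of full spectra with more variables than samples is singular; the text does not state how the distance was computed, so this is an illustration only, and the function name is hypothetical.

```python
import numpy as np

def mahalanobis_distances(scores):
    """Distance of each sample (row) from the sample mean, scaled by the covariance."""
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean(axis=0)
    cov = np.atleast_2d(np.cov(scores, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(d2, 0.0))

# Toy scores: five clustered samples and one sample far from the rest.
toy_scores = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                       [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]])
distances = mahalanobis_distances(toy_scores)
most_atypical = int(np.argmax(distances))
```

Samples whose distance exceeds a chosen cut-off would be treated as abnormal; the cut-off itself is a modeling decision that the sketch deliberately leaves open.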
As illustrated in Figure 4, the absorbance values exhibit a complex trend across the entire wavenumber range (4000 cm−1 to 10,000 cm−1). The absorbance values cover a wide dynamic range from 0 to 2.0, indicating significant variations in the light absorption capacity of the leaf samples within this spectral region. In the 4000 cm−1 to 5000 cm−1 range, the absorbance values gradually increase from a relatively low baseline. Notably, some curves exhibit a steep rise in this interval, suggesting the presence of specific absorption mechanisms or material components that interact strongly with light in this spectral range.
The 5000 cm−1 to 6000 cm−1 region displays more intricate curve shapes, with distinct peaks and valleys observed across multiple curves. This complexity likely arises from the diverse vibrational and rotational modes of molecular structures or chemical bonds within the material, leading to varied light absorption characteristics. The differences in the colored curves in this region may reflect subtle variations in molecular structure or composition among the samples.
In the 6000 cm−1 to 8000 cm−1 range, the absorbance curves generally exhibit a fluctuating trend. Although the overall absorbance decreases, the decline is not monotonic, with localized peaks and valleys still present. This suggests that, despite the overall reduction in absorption capacity, certain wavelengths still exhibit enhanced absorption, possibly due to the presence of specific functional groups or chemical bonds within the material.
Finally, in the spectral range of 8000 cm−1 to 10,000 cm−1, the absorbance values are relatively low with smoother spectral curves, indicating weaker light absorption by the material in this short-wavelength region. This phenomenon can be attributed to the higher photon energy in this range, which exceeds the required energy thresholds for effectively exciting most molecular transitions or chemical bond vibrations in the material.
3.3. Spectral Processing
Figure 5 illustrates the original spectral curves and their mathematical transformations for Korla fragrant pear leaf samples. As shown in Figure 5a, the original absorbance (A) varies significantly among samples, with noticeable baseline shifts and tilts. These variations can be attributed to differences in light absorption characteristics and optical path lengths within the leaves. After applying SG convolution smoothing (Figure 5b), the spectral curves become more centered, though the overall trend remains largely unchanged. Following MSC and SNV treatments (Figure 5c,d), the differences in absorbance are significantly reduced, resulting in more concentrated and consistent spectral curves.
These results demonstrate that MSC and SNV effectively address spectral offset issues, eliminate background interference and noise, and enhance spectral features, laying a solid foundation for the more accurate identification of characteristic wavelengths. In contrast, FD and SD transformations (Figure 5e,f) excel at highlighting subtle spectral changes, such as the positions of reflection peaks and valleys, thereby reducing background interference and improving feature extraction accuracy. However, in Figure 5e,f, a burr phenomenon is observed in specific wavelength ranges, primarily due to weak signal strength in those regions, which amplifies noise during the derivative conversion process. In summary, each spectral preprocessing method has unique characteristics and plays a vital role in analyzing Korla fragrant pear leaf spectra, offering diverse perspectives and effective data processing approaches for subsequent research and analysis.
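The smoothing and derivative transforms discussed above can be sketched with the classic 5-point, quadratic Savitzky–Golay convolution weights. The window length and polynomial order are assumptions chosen for illustration, since the text does not report the settings used, and the study's FD and SD transforms may have been computed differently.

```python
# 5-point quadratic Savitzky-Golay weights (unit spacing between points).
SG_SMOOTH_5 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
SG_DERIV1_5 = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]
SG_DERIV2_5 = [2 / 7, -1 / 7, -2 / 7, -1 / 7, 2 / 7]

def apply_window(signal, weights):
    """Slide the weights along the signal; only fully covered points are returned."""
    half = len(weights) // 2
    out = []
    for i in range(half, len(signal) - half):
        window = signal[i - half : i + half + 1]
        out.append(sum(w * v for w, v in zip(weights, window)))
    return out

# A quadratic test signal is reproduced exactly by a quadratic SG filter.
quadratic = [float(x * x) for x in range(7)]
smoothed = apply_window(quadratic, SG_SMOOTH_5)
first_derivative = apply_window(quadratic, SG_DERIV1_5)
second_derivative = apply_window(quadratic, SG_DERIV2_5)
```

The derivative weights are differences of neighboring points, which is why they amplify point-to-point noise in weak-signal regions and produce the burrs noted above; for real spectra the derivative outputs would also be divided by the wavenumber spacing (once for FD, squared for SD).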
3.4. Correlation Analysis Between Korla Fragrant Pear Leaves and LWC
Based on the actual measured LWC data from two key stages of Korla fragrant pear fruit development, the correlation between absorbance and LWC was analyzed before and after mathematical transformations of the spectral data. The results are shown in Figure 6. The correlation analysis between Korla fragrant pear LWC and different transformed spectra revealed diverse characteristics and trends. In the original spectrum (Figure 6a), fluctuations are evident, with distinct peaks and troughs. While these features may relate to LWC, the presence of baseline shifts, tilts, and various interference factors makes it challenging to directly and accurately extract relevant information.
The purpose of SG convolution smoothing is to enhance the smoothness of the spectrum. However, as shown in the correlation diagram (Figure 6b), there is no significant difference compared to the original spectrum, suggesting that SG smoothing primarily acts as a filter. While its effect on improving the correlation between LWC and spectral data is limited, it provides a stable foundation for subsequent analysis. In contrast, the MSC-treated spectrum (Figure 6c) and SNV-treated spectrum (Figure 6d) show significantly reduced absorbance differences, higher spectral concentrations, and more consistent curve characteristics. These methods effectively eliminate scattering and background interference, highlighting spectral features related to LWC, which facilitates the exploration of their correlation and provides a robust data basis for establishing quantitative analysis models.
The FD-transformed spectrum (Figure 6e) and SD-transformed spectrum (Figure 6f) excel at emphasizing subtle changes, reducing background interference, and improving feature extraction accuracy. However, due to weak signal strength in certain spectral regions, noise is amplified during the derivative conversion process, resulting in burrs. To address this, noise reduction measures should be incorporated to ensure accurate and reliable results.
Based on these findings, this study will focus on exploring correlations and screening characteristic wavelengths using MSC, SNV, and other preprocessed data. By refining and confirming relevant spectral characteristics and optimizing the data processing workflow, this research aims to provide a scientific foundation and practical guidance for developing a rapid detection model for LWC in Korla fragrant pear.
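The band-by-band correlation analysis described in this section can be sketched as a Pearson correlation between each spectral variable and the measured LWC. The array layout (one spectrum per row, one wavenumber per column) and the function name are assumptions made for illustration.

```python
import numpy as np

def bandwise_correlation(spectra, lwc):
    """Pearson correlation between every spectral column and the LWC values."""
    spectra = np.asarray(spectra, dtype=float)
    lwc = np.asarray(lwc, dtype=float)
    xc = spectra - spectra.mean(axis=0)  # centre each band
    yc = lwc - lwc.mean()                # centre the reference values
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (xc.T @ yc) / denom

# Toy data: band 0 rises with LWC, band 1 falls with it, band 2 is unrelated.
toy_lwc = np.array([1.0, 2.0, 3.0, 4.0])
toy_spectra = np.column_stack([toy_lwc, -toy_lwc, [1.0, -1.0, -1.0, 1.0]])
r = bandwise_correlation(toy_spectra, toy_lwc)
```

Plotting such a correlation vector against wavenumber for each preprocessing method gives curves of the kind compared in this section.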
3.5. Selection of the LWC Feature Zones of Korla Fragrant Pear
The LWC feature zones of Korla fragrant pear during the critical fruit development stages (S1 and S2) exhibit significant differences, as illustrated in Figure 7. The horizontal axis of each panel represents the wavenumber (cm−1), ranging from 4000 to 10,000 cm−1.
During the S1 stage (Figure 7a), the SNV-SPA feature bands, represented by green dots, are primarily concentrated around 5000 cm−1, 6000 cm−1, and 7000 cm−1, indicating specific correlations with LWC. The MSC-SPA feature bands, depicted by blue dots, are distributed around 5500 cm−1 and 7500 cm−1, reflecting the unique spectral characteristics of the Korla fragrant pear LWC. The SNV-CARS feature bands, shown as red dots, cover a broader range around 4500 cm−1, 5500 cm−1, and 8000 cm−1, suggesting a wider association with LWC. The MSC-CARS feature bands, marked by black dots, are concentrated near 6000 cm−1 and 7000 cm−1, highlighting their importance within specific wavenumber ranges.
In the S2 stage (Figure 7b), the SNV-SPA green dots are mainly located around 5500 cm−1 and 7000 cm−1. Compared to the S1 stage, the feature band near 5000 cm−1 disappears, and new positions emerge, reflecting changes in the growth characteristics of Korla fragrant pear. The MSC-SPA blue dots are predominantly located around 6000 cm−1 and 7500 cm−1, with the feature band near 5500 cm−1 disappearing and shifting towards higher wavenumbers (shorter wavelengths), indicating physiological and spectral response changes in Korla fragrant pear. The SNV-CARS red dots are distributed near 4500 cm−1, 6000 cm−1, and 8500 cm−1, with the emergence of a new feature band at 8500 cm−1 demonstrating dynamic changes in the characteristic wavelength zones. The MSC-CARS black dots are fewer in number, mainly around 7000 cm−1, showing a more concentrated distribution and positional changes.
Comparing the images from both stages, it is evident that the physiological characteristics, chemical composition, and spectral interactions of Korla fragrant pear leaves undergo dynamic changes during growth. These findings provide valuable references for researchers to optimize spectral models according to the characteristics of different growth stages, thereby improving the accuracy and reliability of LWC prediction. This, in turn, offers scientific support and technical guarantees for precise irrigation and growth monitoring in agricultural production.
3.6. Korla Fragrant Pear LWC Model Establishment
Figure 8 illustrates the performance indicators of the LWC predictive model for Korla fragrant pear, comprising four subfigures labeled as Figure 8a–d. Each subfigure features three curves representing distinct predictive models: RFR (denoted by a black dotted line), BP (indicated by a red dotted line), and SVR (represented by a blue dotted line). The x-axis of each subfigure corresponds to various feature selection methods, namely, MSC-CARS, SNV-CARS, MSC-SPA, and SNV-SPA. The y-axis in Figure 8a,c displays the coefficient of determination (R2), whereas Figure 8b,d depict the RMSE.
In Figure 8a, the RFR model demonstrates high and stable R2 values across different feature selection methods, consistently exceeding 0.75. The peak R2 value, nearing 0.85, is achieved using the SNV-SPA method, indicating robust predictive capability, minimal sensitivity to feature selection, and excellent stability and generalization. Conversely, the BP model exhibits significant variability in R2 values. While it achieves a higher R2 value of approximately 0.75 under the MSC-CARS method, this value drops to around 0.65 with the SNV-CARS method, highlighting the model’s sensitivity to feature selection. The SVR model also shows variability in R2 values across different methods, generally lower than those of the RFR model. It performs relatively well under the MSC-SPA method but is less effective than the RFR model overall.
Comparing Figure 8a,c, the RFR model maintains small variations in R2 values and better stability across both periods. The BP model shows reduced variability in Figure 8c compared to Figure 8a, though some fluctuations persist. The SVR model’s R2 values follow a similar trend in both periods, performing well under certain feature selection methods but poorly under others.
In Figure 8b, the RFR model’s RMSE values are consistently low and stable, all below 0.7%, indicating minimal deviation from actual values and high predictive accuracy. The BP model’s RMSE values remain around 0.6% across different methods, suggesting consistent performance in predictive error despite fluctuations in R2 values. The SVR model’s RMSE values vary more significantly, generally higher than those of the RFR model, with some values nearing 0.8% under certain methods, indicating larger prediction errors and a need for improved accuracy.
When comparing Figure 8b,d, the RFR model’s RMSE values show little change, maintaining stable predictive accuracy. The BP model’s RMSE values in Figure 8d are similar to those in Figure 8b, indicating consistent error stability across periods. The SVR model’s RMSE values exhibit a similar trend in both periods, with some fluctuations and generally higher error levels.
3.7. Model Selection and Verification
To further refine the model selection process, a comprehensive validation of all models was conducted. Figure 9 presents the validation results of the combined models constructed using each algorithm during the S1 period, comprising a total of 12 subplots. In each subplot, green scatter points represent the data points, the black line denotes the fitted regression line, and key statistical metrics, including the coefficient of determination (R2), RMSE, and RPD, are provided below each subplot.
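The three validation metrics can be computed from measured and predicted values as sketched below. The formulas are the commonly used definitions (RPD taken as the sample standard deviation of the measured values divided by the RMSE); the study does not spell out its exact formulas, so this is an assumption, and the function name is illustrative.

```python
import math

def regression_metrics(measured, predicted):
    """Return R2, RMSE, and RPD for paired measured and predicted values."""
    n = len(measured)
    mean_y = sum(measured) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    rmse = math.sqrt(ss_res / n)
    sd = math.sqrt(ss_tot / (n - 1))  # sample standard deviation of measured values
    return {"r2": 1.0 - ss_res / ss_tot, "rmse": rmse, "rpd": sd / rmse}

# Small worked example with made-up numbers (not data from the study).
example = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Under these definitions RPD is a unitless ratio, and a higher RPD means the prediction error is small relative to the natural spread of the measured values.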
Among the subplots, the models based on the BP algorithm demonstrate strong performance in most cases. For instance, the SNV-CARS-BP model achieves an R2 value of 0.81594, indicating an excellent fit, with an RMSE of 0.0311%, reflecting minimal prediction error, and an RPD of 2.3369, highlighting robust predictive capability and stability. Similarly, the SNV-SPA-BP model performs well, with an R2 of 0.8082, an RMSE of 0.0264%, and an RPD of 2.2913. In contrast, certain RFR-based models, such as the SNV-SPA-RFR model, exhibit slightly lower performance, with an R2 of 0.79328, an RMSE of 0.048503%, and an RPD of 2.2002. Another RFR model shows an R2 of 0.77447, an RMSE of 0.060111%, and an RPD of 2.1061, indicating room for improvement.
The SVR models, such as the MSC-SPA-SVR model, display comparatively weaker performance, with an R2 of only 0.76365, an RMSE of 0.052419%, and an RPD of 1.6326, reflecting similar limitations in predictive accuracy.
From the subplots and associated statistical metrics, it is evident that BP-based models generally outperform the others in most cases in terms of R2, RMSE, and RPD. In summary, the SNV-CARS-BP model emerges as the most suitable choice for predicting the LWC of Korla fragrant pears during the S1 period.
Figure 10 presents the validation results of the combined models constructed by each algorithm during the S2 period, comprising 12 subplots. The horizontal and vertical axes represent the measured and predicted percentages, respectively. Each subplot includes green scatter points, a black fitted regression line, and key statistical metrics, such as the coefficient of determination (R2), root mean square error of prediction (RMSEP), and RPD, displayed below.
Among the models, those based on RFR demonstrate strong performance. For instance, the SNV-SPA-RFR model achieves an R2 of 0.817756, indicating an excellent fit, with an RMSE of 0.02503%, reflecting minimal deviation, and an RPD of 2.3428, showcasing robust predictive capability and stability. Similarly, the MSC-CARS-RFR model, with an R2 of 0.79331, an RMSEP of 0.02933%, and an RPD of 1.9331, also provides valuable predictive insights.
In contrast, BP-based models exhibit mixed performance during the S2 period. Some of them, such as SNV-CARS-BP and MSC-SPA-BP, show relatively lower R2 values, higher RMSEP, and reduced RPD, indicating limitations in fitting accuracy and predictive performance, while others demonstrate potential but require further optimization. Similarly, SVR models, such as MSC-CARS-SVR and SNV-SPA-SVR, display scattered data points, with R2 values of 0.78685 and 0.76365, RMSEP values of 0.028041% and 0.030419%, and RPD values of 1.7152 and 1.6326, respectively. These results suggest that SVR models also need improvement or exploration of alternative methods to enhance their predictive accuracy.
In summary, the SNV-SPA-RFR model during the S2 period is well suited for data prediction, offering a balance of high R2, low RMSEP, and strong RPD. However, the performance of BP- and SVR-based models highlights the need for further optimization or exploration of new methodologies to improve their predictive capabilities. These findings provide valuable insights for subsequent research and model refinement, contributing to a deeper understanding of the performance characteristics and applicability of each algorithm in this context. Practical applications should consider these results comprehensively to select the most appropriate model for specific predictive tasks.
Model accuracy is a critical factor in determining the usability of a predictive model. To further validate the constructed models, we assessed the relative deviation between the actual values and the predicted values, which serves as a key indicator of model accuracy. As illustrated in Figure 11, the relative deviations for the SNV-CARS-BP model during the S1 period and the SNV-SPA-RFR model during the S2 period are notably small, ranging from 0.00041% to 2.96649% and 0.00039% to 0.13751%, respectively. These deviations fall well within the acceptable threshold of less than 5%, demonstrating that the LWC prediction models for Korla fragrant pear leaves exhibit high accuracy.
This result confirms the reliability of the models and their potential for practical applications in predicting the water content of Korla fragrant pear leaves. The minimal deviations further validate the robustness and precision of the selected models, highlighting their suitability for real-world use.
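The relative-deviation check can be sketched as follows, assuming the usual definition of the absolute difference between predicted and measured values divided by the measured value, expressed as a percentage. The study does not give its formula, so this definition, the function name, and the example numbers are illustrative assumptions.

```python
def relative_deviation_percent(measured, predicted):
    """Relative deviation |predicted - measured| / |measured| for each sample, in percent."""
    return [abs(p - m) / abs(m) * 100.0 for m, p in zip(measured, predicted)]

# Made-up LWC values (percent), not data from the study.
deviations = relative_deviation_percent([50.0, 60.0], [51.0, 58.2])
within_threshold = all(d < 5.0 for d in deviations)  # 5% threshold quoted in the text
```

Checking every sample against the threshold, rather than only the average, matches the way the deviation ranges are reported above.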
4. Discussion
In this study, a random splitting method was employed during the sample collection stage to partition the dataset, ensuring that both the modeling and test datasets accurately capture the inherent distribution characteristics of the data and enhance the model’s generalization ability. Specifically, in the S1 growth stage of Korla fragrant pear, 252 samples were allocated to the modeling dataset, while 108 samples were designated for the test dataset. Similarly, in the S2 stage, 137 and 58 samples were selected for the modeling and test datasets, respectively.
The distribution range and standard deviation of LWC were analyzed for both stages. In the S1 stage, the LWC of the modeling set ranged from 4.88% to 83.45%, with a standard deviation of 11.33%, while the test set ranged from 22.03% to 74.29%, with a standard deviation of 9.70%. The corresponding data for the S2 stage showed similarly small discrepancies in the mean values and standard deviations between the modeling and test sets. These results support the rationality of the dataset partitioning method. A well-partitioned dataset serves as the cornerstone for constructing high-quality models in subsequent analyses [41]. Compared with traditional methods, which usually rely on hand-designed feature extraction, machine learning algorithms such as RFR, SVR, and BP can effectively process high-dimensional data (such as spectral data), extract key information through feature selection and dimensionality reduction, and reduce the interference of redundant data. With random partitioning, the model is exposed to a diverse range of data during training, which helps avoid overfitting caused by data bias and supports robust predictive performance on new data [42]. Traditional methods are also sensitive to noise and outliers and are prone to overfitting or underfitting, whereas RFR is highly robust to noise and outliers, which helps it avoid overfitting, and SVR can improve generalization through regularization. It is worth noting, however, that while the current partitioning method has yielded satisfactory results, sample collection in practice may be influenced by factors such as limited sampling sites and the inherent randomness of individual samples. In the future, expanding the sampling range to include Korla fragrant pear samples from a wider variety of growth environments could enhance the dataset’s representativeness and further optimize model performance.
The spectral acquisition data exhibit complex trends. Within the range of 4000–10,000 cm−1, the absorbance varies significantly across different intervals, reflecting the diverse and intricate light absorption characteristics of the material [43]. Each spectral preprocessing method has its own strengths and limitations. SG convolution smoothing enhances spectral smoothness but has a limited effect on improving the correlation between LWC and spectral data. In contrast, MSC and SNV processing significantly reduce absorbance differences, enhance spectral features, and effectively address issues such as spectral offset and background interference. While FD and SD transformations excel at highlighting subtle changes to aid feature extraction, they tend to amplify noise and produce burrs in regions with weak signals.
In practical research, the comprehensive application of various preprocessing methods offers diverse perspectives and effective tools for subsequent analysis. However, this also highlights the need to carefully weigh the characteristics of different methods and research objectives when selecting preprocessing techniques. For instance, if the focus is on noise reduction and obtaining more stable spectral features, MSC and SNV methods may be more suitable. As demonstrated by Chenbo et al. [
44] in their construction of a hyperspectral monitoring model for oat grain β-glucan content, the model based on SNV-transformed spectra and SPA–multiple linear regression (SPA-MLR) achieved the highest accuracy, enabling effective hyperspectral monitoring of oat grain β-glucan content.
The wavelength range of near-infrared (NIR) light spans from 780 nm to 2500 nm, enabling it to penetrate various organic substances and interact with chemical bonds. The O-H bond in water molecules (H2O) exhibits characteristic absorption bands in the NIR region, particularly near 1450 nm and 1940 nm. When NIR light irradiates fruit tree tissue, photons are absorbed by the O-H bond, causing a transition in the molecular vibration energy levels and increasing vibrational energy. By measuring the light absorption intensity at different wavelengths, the absorption spectrum of water molecules can be obtained, with peaks corresponding to the vibrational modes of the O-H bond.
However, near-infrared spectroscopy detection may encounter interference from other functional groups, such as O-H bonds in alcohols and phenols, as well as N-H and C-H bonds, which also exhibit absorption bands in the near-infrared region. To mitigate these interferences, feature selection can be applied after spectral preprocessing to identify wavelength points most relevant to the target variable, such as moisture content. For instance, prioritizing water-specific absorption bands (e.g., 1450 nm and 1940 nm) can enhance the signal of water molecules while minimizing the influence of other functional groups. This approach significantly improves the accuracy and specificity of detection, providing reliable support for the application of near-infrared spectroscopy in analyzing fruit tree water content.
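Because the water bands above are quoted in nanometres while the spectra are reported in wavenumbers, a short conversion helps relate the two. The sketch below uses the standard relation (wavenumber in cm−1 = 10^7 / wavelength in nm). The helper `band_window`, the 50 cm−1 grid, and the 100 cm−1 half-width are assumptions for illustration only.

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm^-1 (1 cm = 1e7 nm)."""
    return 1e7 / wavelength_nm

def band_window(wavenumbers, centre_nm, half_width_cm=100.0):
    """Indices of grid points within half_width_cm of the band centred at centre_nm."""
    centre = nm_to_wavenumber(centre_nm)
    return [i for i, w in enumerate(wavenumbers) if abs(w - centre) <= half_width_cm]

# Hypothetical wavenumber grid spanning 4000 to 10,000 cm^-1 in 50 cm^-1 steps.
grid = list(range(4000, 10001, 50))

for nm in (1450, 1940):
    print(nm, "nm ->", round(nm_to_wavenumber(nm)), "cm^-1")
# 1450 nm -> 6897 cm^-1
# 1940 nm -> 5155 cm^-1

window_1450 = band_window(grid, 1450)
```

The two water bands therefore fall at roughly 6897 cm−1 and 5155 cm−1, both inside the 4000–10,000 cm−1 range, which corresponds to wavelengths of 2500 nm down to 1000 nm.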
In spectral analysis, accurately capturing subtle changes often requires advanced preprocessing techniques. FD and SD transformations, particularly when combined with noise reduction measures, play a crucial role in enhancing spectral features. Previous studies have demonstrated the effectiveness of these approaches; for instance, Li et al. [
45] analyzed β-glucan and total starch in oats using spectral data and chemometric methods. Their results showed varying model performance across components, with the total starch model achieving optimal results after SD-SPA treatment (Rp2 = 0.768, RMSEP = 2.057% relative to a mean response value of 78.3%). This corresponds to a prediction error of about 2.63% of the mean response value and a 62.5% improvement over previous approaches. These findings show that the technique enables accurate quantification and provides an effective approach for oat quality detection, and they underscore the importance of derivative transformations in spectral modeling, which we have further developed in our current study. Furthermore, with the ongoing advancement of spectral technology, exploring new and more targeted spectral preprocessing methods, or refining and optimizing existing ones, holds promise for further enhancing spectral data quality. This improvement would lay a stronger foundation for more accurate LWC prediction models.
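The relative error quoted for the total starch model can be checked directly from the two reported numbers. The short calculation below divides the RMSEP by the mean response value; the variable names are illustrative.

```python
rmsep = 2.057          # reported RMSEP, in percentage points of total starch
mean_response = 78.3   # reported mean response value, in percent

relative_error_pct = 100 * rmsep / mean_response
print(f"{relative_error_pct:.2f}%")   # 2.63%
```

The figure of 2.63% is therefore the RMSEP expressed relative to the mean response, not relative to the spread of the measured values.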
Notably, the selection of LWC characteristic bands for Korla fragrant pear during the S1 and S2 periods revealed significant differences. The positions of characteristic bands identified by various feature screening algorithms varied between the two stages. For instance, during the S1 period, the SNV-SPA characteristic bands were concentrated around 5000 cm−1, 6000 cm−1, and 7000 cm−1, with corresponding shifts observed in the S2 period. These variations reflect the dynamic changes in leaf physiological characteristics, chemical composition, and spectral interactions throughout the growth of Korla fragrant pear.
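For readers unfamiliar with SPA, the sketch below shows the projection step at the core of the successive projections algorithm: starting from one band, each new band is the one with the largest component orthogonal to the bands already chosen. This is a simplified textbook illustration, not the implementation used in this study. The starting column, the number of selected bands, and the synthetic data are assumptions, and a complete SPA would also compare starting columns and subset sizes by validation error.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Pick n_select column indices of X with low mutual collinearity.

    X is (samples x bands); start is a hypothetical initial column.
    """
    Xp = X - X.mean(axis=0)            # work on a column-centred copy
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]]
        # Remove from every column its component along the last selected column.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = 0.0          # never re-select a chosen column
        selected.append(int(np.argmax(norms)))
    return selected

# Synthetic demo: column 1 is an exact multiple of column 0, so it adds no new information.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 6))
X_demo[:, 1] = 2.0 * X_demo[:, 0]
chosen = spa_select(X_demo, n_select=3, start=0)
```

Because the selection depends on the covariance structure of the spectra, it is unsurprising that the bands chosen for S1 and S2 leaves differ when the leaf composition changes between stages.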
Accurately understanding these characteristic band differences is crucial for optimizing spectral models. Researchers can adjust model parameters or select appropriate model structures based on the unique characteristics of different growth stages, thereby enhancing the accuracy and reliability of LWC predictions. However, it is important to recognize that current feature band selection methods rely on existing data and algorithms, which may have inherent limitations. In the future, hyperparameter optimization research can be enhanced through multiple approaches to improve the performance and practical value of the Korla fragrant pear LWC prediction model. First, more efficient and intelligent optimization methods, such as Bayesian optimization or reinforcement learning-based algorithms, can be explored to reduce computational costs and enhance search efficiency. Second, by integrating dynamic feature selection methods, the feature selection strategy can be automatically adjusted according to the physiological characteristics of Korla fragrant pear at different growth stages, further improving the model’s adaptability.
In the model development process, the performance of machine learning models is significantly influenced by the selection of hyperparameters. Optimizing these hyperparameters directly impacts the model’s generalization ability and prediction accuracy [
46]. To ensure optimal model performance, grid search and cross-validation were employed in this study to fine-tune the hyperparameters of RFR, SVR, and BP. The results are illustrated in
Figure 12a,b.
For the S2 period of Korla fragrant pear, taking the SNV-SPA-RFR model as an example, R2 varies clearly with the number of decision trees and the number of leaf nodes. When the number of leaf nodes (cotyledons) is set to 8 and the number of decision trees is 100, the model achieves the highest R2 value of 0.81519, indicating optimal accuracy. Additionally, the RMSE reaches its minimum value of 0.02503% at this parameter combination, further confirming the best model fit. Therefore, the optimal parameters for this model are cotyledon = 8 and decision tree = 100.
For the SVR model, the SNV-SPA-SVR model in the S2 period of Korla fragrant pear was used as an example to optimize the hyperparameters C and γ. As shown in
Figure 12c,d, R2 varies clearly with the values of C and γ. When C = 0.3 and γ = 0.1, the model achieves the highest R2 value of 0.73719, indicating optimal generalization ability. The RMSE heat map is consistent with this result, further confirming the best model fit under this parameter combination. Therefore, the optimal parameters for this model are C = 0.3 and γ = 0.1.
For the BP model, the SNV-CARS-BP model of the S1 stage of Korla fragrant pear was used as an example, and grid search was employed to determine the optimal values of q and the constant α. As shown in
Figure 12e, variations in q and the constant α result in significant fluctuations in the model’s R2 and RMSE values. Notably, when q = 8 and α = 4, the model achieves the highest R2 value of 0.81594 and the lowest RMSE value of 0.0311%, indicating the best model fit. Therefore, the optimal parameters for this model are q = 8 and α = 4.
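The grid search with cross-validation described above can be sketched generically as follows. This is a minimal illustration under stated assumptions rather than the code used in this study. To stay dependency-light, the tuned model is an RBF kernel ridge regressor, a stand-in that shares the C and γ hyperparameters with SVR but is not SVR itself. The data, the grid values, the five folds, and all function names are hypothetical.

```python
import itertools
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split a random permutation of the sample indices into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rbf_ridge_fit_predict(X_train, y_train, X_test, C, gamma):
    """Stand-in model: RBF kernel ridge regression with ridge penalty 1/C."""
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)
    K = kernel(X_train, X_train)
    alpha = np.linalg.solve(K + np.eye(len(y_train)) / C, y_train)
    return kernel(X_test, X_train) @ alpha

def grid_search_cv(fit_predict, X, y, grid, k=5):
    """Score every parameter combination by mean k-fold R2 and return the best one."""
    folds = kfold_indices(len(y), k)
    results = {}
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx], **params)
            scores.append(r2_score(y[test_idx], y_pred))
        results[values] = float(np.mean(scores))
    best = max(results, key=results.get)
    return dict(zip(grid.keys(), best)), results[best], results

# Synthetic demo data (not LWC measurements).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(80, 3))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.05, 80)

grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_r2, all_scores = grid_search_cv(rbf_ridge_fit_predict, X, y, grid)
```

The dictionary `all_scores` holds one mean R2 per grid cell, which is the quantity that a heat map over two hyperparameters displays. Any other model can be tuned the same way by passing a different `fit_predict` callable.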
During the model selection phase, we compared the performance of RFR (Random Forest Regression), BP (Backpropagation Neural Network), and SVR (Support Vector Regression) under various feature selection methods. The results demonstrated that the RFR model exhibited superior stability and generalization capabilities, with minimal performance variations across different feature selection approaches. Conversely, the BP model showed significant sensitivity to feature selection, leading to substantial performance fluctuations. The SVR model, while also influenced by feature selection, displayed slightly lower overall predictive accuracy compared to RFR.
Further model validation revealed that the SNV-CARS-BP model performed exceptionally well during the S1 stage, while the SNV-SPA-RFR model excelled in the S2 stage. Both models maintained low relative deviations between predicted and actual values, confirming their high accuracy, as detailed in
Table 1 and
Table 2. However, this does not negate the value of other models. Different algorithms may exhibit unique strengths under specific application scenarios or data characteristics. For instance, the BP neural network shows potential in handling complex nonlinear relationships. By optimizing its algorithmic structure to address current limitations, the BP model could achieve enhanced performance in predicting LWC for Korla fragrant pears. Similarly, refining the kernel functions and parameter configurations of the SVR model may further improve its effectiveness.
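The relative deviation between predicted and actual values mentioned above, together with the RMSE, can be computed as in the following sketch. The measured and predicted values are invented for illustration and are not taken from Table 1 or Table 2.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_deviation_pct(y_true, y_pred):
    """Absolute deviation of each prediction as a percentage of the measured value."""
    return [100.0 * abs(p - t) / t for t, p in zip(y_true, y_pred)]

# Hypothetical LWC values expressed as fractions.
measured = [0.60, 0.65, 0.70]
predicted = [0.61, 0.64, 0.72]

error = rmse(measured, predicted)
deviations = relative_deviation_pct(measured, predicted)
```

Reporting the per-sample relative deviation alongside RMSE shows whether a low average error hides a few poorly predicted samples.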
Our results are consistent with those of several recent studies using machine learning to predict LWC. For example, Li et al. [
11] used RFR to analyze hyperspectral data and obtained R2 = 0.85, which is comparable to the result of our SNV-SPA-RFR model (R2 = 0.815). Similarly, Liu et al. [
26] used SVR to estimate the LWC of crops and reported an RMSE of 0.030%, slightly higher than the RMSE (0.025%) of our SNV-SPA-RFR model. These comparisons suggest that our method is robust and has potential for generalization to different crops and conditions. However, unlike previous studies, our work systematically evaluated a variety of preprocessing methods (SG, MSC, SNV, FD, SD) and machine learning algorithms (RFR, SVR, BP) to determine the best combination for LWC prediction in Korla fragrant pear, providing a more comprehensive framework for future research.
This study obtained important results in the key steps of dataset division, spectral preprocessing, feature band selection, model hyperparameter optimization, and model selection and verification. Random splitting was used to divide the dataset so that the modeling and test sets both reflect the internal distribution of the data, which improves the generalization ability of the model and lays the foundation for subsequent high-quality model construction. In terms of spectral preprocessing, SG, MSC, SNV, FD, and SD each have their own advantages and disadvantages, and their combined application provides diverse perspectives and effective means of improving spectral data quality. The selection of characteristic bands showed significant differences in the LWC characteristic bands of Korla fragrant pear between the S1 and S2 periods, reflecting the dynamic changes in leaf physiological characteristics during growth; accurately understanding these differences helps to optimize the spectral model. Hyperparameter optimization through grid search and cross-validation determined the optimal parameter combinations of the RFR, SVR, and BP models, directly improving their generalization ability and prediction accuracy. These results not only provide a solid foundation for the construction and optimization of Korla fragrant pear LWC prediction models but also highlight directions for future research, and thus have important theoretical and practical value.
Moreover, practical applications should not rely solely on a single evaluation metric for model selection. It is crucial to consider additional factors such as model complexity, computational efficiency, and interpretability [
47]. Future research should explore integrated strategies that leverage the strengths of different models, aiming to develop more robust and versatile LWC prediction models for Korla fragrant pears. Such advancements would provide more precise technical support for agricultural practices, including targeted irrigation and growth monitoring.
In conclusion, this study has made significant progress through a comprehensive analysis of multiple aspects related to Korla fragrant pear LWC. However, there remains ample room for optimization and expansion in various areas. Continued exploration and advancement in the aforementioned directions will further enhance research in this field, ultimately contributing to more effective agricultural production practices.