1. Introduction
Cucumber (Cucumis sativus L.) is an important vegetable crop worldwide. Global cucumber production reached 94.7 million tons in 2022, of which China, as a major producer, accounted for more than 80% [1]. The Xishuangbanna cucumber, also known as the semi-wild cucumber, is found only near the border between China and Southeast Asian countries [2]. It is a cucumber variety endemic to Yunnan, and its appearance and flesh are distinctly different from those of the common cucumber. It flowers and sets fruit with difficulty in other cultivation areas, so it is widely cultivated only in Yunnan, and it is a rare and endangered germplasm resource [3,4].
In agricultural production, seeds play a crucial role as the basis for plant reproduction [
5,
6]. Seed vigor, as a key indicator of seed quality, has a direct impact on crop yield [
7]. High-vigor seeds are not only highly resistant to storage, but also perform well in the field, and are more likely to germinate and form robust seedlings under suitable conditions [
8,
9]. Because the Xishuangbanna cucumber is rare and endangered, collecting its germplasm resources is extremely difficult. Once seeds have been collected, they are usually stored in germplasm banks for subsequent breeding and utilization. Before storage, the initial germination rate usually has to be determined to assess seed viability, and viability must then be monitored throughout storage in accordance with the Genebank Standards for Plant Genetic Resources for Food and Agriculture [10]. Traditional viability tests such as germination tests are accurate, but they are destructive: a seed that has been germinated cannot be returned to storage [11]. Once germination tests have been conducted, these valuable seeds can no longer be used for conservation or research, which is particularly serious for endangered germplasm such as the Xishuangbanna cucumber, where seed stocks in the germplasm bank are scarce and seeds of certain lineages are no longer available.
In addition, other methods for detecting seed viability, such as conductivity tests, immunoassays, polymerase chain reaction, tetrazolium staining, and accelerated aging tests, are likewise cumbersome, time-consuming, and damaging to seeds [
9,
12,
13]. These methods not only consume a lot of time and resources, but may also irreversibly affect seed viability and genetic integrity, further exacerbating the conservation dilemma of rare and endangered germplasm.
In contrast, hyperspectral imaging, as an emerging non-destructive testing technology, is able to acquire both image and spectral information of the sample to be tested without damaging the seed at all [
14,
15]. Moreover, the technology also combines the advantages of imaging and spectroscopic techniques, effectively overcoming the limitations of machine vision and near-infrared spectroscopic techniques in detecting samples [
15,
16]. Therefore, hyperspectral imaging has been widely used in agricultural research fields such as seed viability prediction [
9], seed variety identification [
17], plant disease detection [
18], pesticide residue detection [
19], and crop yield prediction [
20].
Previous studies have revealed the advantages of hyperspectral imaging in predicting seed viability in peanut seeds [
16], wheat seeds [
21], pepper seeds [
22], and maize seeds [
23], confirming the great potential of hyperspectral imaging for seed viability prediction. However, seeds of different species differ in phenotypic characteristics and internal composition, so a hyperspectral vigor detection system has to be tailored to the specific seed source. To date, no hyperspectral vigor prediction system has been reported for Xishuangbanna cucumber seeds. Therefore, to reduce the possible loss of Xishuangbanna cucumber germplasm during seed collection and preservation, this study used Xishuangbanna cucumber seeds as experimental material, collected spectral data from seeds of different years and lineages using hyperspectral imaging, and developed a classification model for predicting the vigor of Xishuangbanna cucumber seeds. The effects of different models, preprocessing methods, and feature band extraction algorithms on prediction accuracy were examined, providing a theoretical basis for the rapid and non-destructive identification of Xishuangbanna cucumber seed vigor and supporting the protection of rare and endangered germplasm resources.
2. Materials and Methods
2.1. Experimental Materials
Xishuangbanna cucumber materials collected in Xishuangbanna Dai Autonomous Prefecture, Yunnan Province, were planted at the Nankou Experimental Base of the Institute of Vegetable and Flower Research, Chinese Academy of Agricultural Sciences, and several inbred lines were obtained through multi-generation selfing. To construct an accurate prediction model of seed viability under natural aging, seeds of 96 Xishuangbanna cucumber lines that had been stored in the germplasm bank, originated from different years, and differed in viability were selected as experimental materials.
2.2. Hyperspectral Imaging System and Data Acquisition
A Specim PFD4k hyperspectral imaging system from Specim, Spectral Imaging Ltd. of Oulu, Finland was used for data acquisition during the experiment (
Figure 1). The hyperspectral imaging system mainly consists of a Specim LabScanner 40 × 20, a Spectral Camera PFD4k, a calibrated white plate, a power supply and camera control unit, and an accompanying computer. The Specim LabScanner 40 × 20 is a compact laboratory scanner that includes a 400 × 200 mm sample tray, a camera mount, halogen illumination, and optional camera height adjustment. The Spectral Camera PFD4k consists of an ImSpector V10E for the wavelength range 400–1000 nm and a detector. It can capture images over the full spectral range at up to 100 Hz and has a spatial resolution of 1775 pixels. The camera also has an offset correction function that limits the dark noise level to approximately 300 DN within an integration time of approximately 20 ms.
Before hyperspectral data acquisition, the Xishuangbanna cucumber seeds were fixed on a black plate so that they could not move during scanning, and the plate was then transferred to the scanning platform. The start and end positions of the sample and of the calibrated white plate were set for the scan. As the platform moved, the hyperspectral camera first recorded the white plate to obtain a white reference image with reflectance close to 100%, and then automatically closed the electro-mechanical shutter to obtain a dark reference image with reflectance close to 0%. These two references were used for black-and-white correction of the images, which reduces noise originating from the equipment and the environment. The raw spectral information of the samples was then acquired as the samples moved into the scanning range of the camera.
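The black-and-white correction described above is commonly computed per pixel and per band as (raw − dark) / (white − dark). The paper does not state the formula it used, so the following is a minimal sketch under that assumption; the function name `calibrate_reflectance` and the `eps` guard are illustrative and not part of the acquisition software.

```python
import numpy as np

def calibrate_reflectance(raw, white, dark, eps=1e-12):
    """Relative reflectance from raw counts and white/dark references.

    Assumed formula (standard flat-field correction, not taken from the
    paper): R = (raw - dark) / (white - dark).
    Inputs may be scalars or arrays that broadcast against each other.
    """
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    denom = white - dark
    # Avoid division by zero where the two references coincide.
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return (raw - dark) / denom
```

With this form, a pixel equal to the dark reference maps to 0 and a pixel equal to the white reference maps to 1.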
2.3. Germination Experiment
After the hyperspectral images had been acquired, the seeds were placed in an HPS-280 biochemical incubator (Donglian Electronic Technology Development Co., Ltd., Harbin, China) for germination at 25 °C. Germination was scored on day 8, using breakthrough of the seed coat as the germination criterion; germinated seeds were labeled “1” and non-germinated seeds “0”, and the results were recorded line by line in the measured spectral data table.
2.4. Hyperspectral Data Extraction
The seed region of interest was extracted using ENVI 5.3, and the entire area of a single seed sample was delineated as the region of interest. The average of the spectral reflectance within the region of interest of the sample was calculated as the data for the spectral reflectance of this sample.
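The study performed this step in ENVI 5.3. As an illustration only, the same per-seed averaging can be sketched in NumPy, assuming the image is available as a (rows, columns, bands) array and the region of interest as a boolean pixel mask; both the layout and the name `roi_mean_spectrum` are assumptions.

```python
import numpy as np

def roi_mean_spectrum(cube, mask):
    """Mean spectrum over the pixels of one seed.

    cube: array of shape (rows, cols, bands), hypothetical layout.
    mask: boolean array of shape (rows, cols), True inside the seed ROI.
    Returns a 1-D array with one mean reflectance value per band.
    """
    cube = np.asarray(cube, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("ROI mask selects no pixels")
    # Boolean indexing yields an (n_pixels, bands) array.
    return cube[mask].mean(axis=0)
```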
2.5. Spectral Dataset Creation
After the germination results of the 96 lines had been obtained, nine Xishuangbanna cucumber seed lines with different vigor (23S-1, 22S-6, 22S-18, 21S-30, 21S-7, 21S-45, 21S-36, 21S-54, and 22S-1) were selected to establish a sample set for predicting the germination rate of Xishuangbanna cucumber seeds (Table 1). Of these seeds, 747 germinated and 601 did not. The spectrum of each seed was paired with its germination result, with germinated seeds marked as “1” and non-germinated seeds marked as “0”, to build the dataset. The Kennard–Stone (KS) method [24] was used to assign 80% of the samples to the training set and the remaining 20% to the test set.
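For readers unfamiliar with Kennard–Stone selection, the sketch below shows one common formulation: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbor is farthest away. The distance metric (Euclidean on the spectra) and the rounding of the training-set size are assumptions, since the paper does not specify them.

```python
import numpy as np

def kennard_stone_split(X, train_fraction=0.8):
    """Return (train_idx, test_idx) chosen by Kennard-Stone selection.

    Assumes Euclidean distance between spectra (rows of X).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_train = int(round(train_fraction * n))
    if not 2 <= n_train <= n:
        raise ValueError("training set must contain between 2 and n samples")
    # Pairwise distances via the dot-product identity (memory-friendly).
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Seed the training set with the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected, dtype=int), np.array(remaining, dtype=int)
```

Because the procedure is deterministic, the same spectra always produce the same split, and the training set covers the extremes of the spectral space.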
2.6. Spectral Preprocessing
Although interference from the external environment was minimized during raw spectral data acquisition, instrument noise and noise caused by ambient light are still inevitably recorded, which produces baseline drift, baseline rotation, and other undesirable effects in the raw spectra [25,26,27]. The raw spectral data were therefore preprocessed to improve data quality and attenuate the adverse effects of noise on subsequent modeling. In this experiment, five algorithms were used to preprocess the raw data: Multiplicative Scatter Correction (MSC) [28], Standard Normal Variate (SNV) [29], First Derivative (FD) [30], Second Derivative (SD) [31], and L2 Norm Normalization (L2NN) [8].
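The five transforms can be illustrated with textbook definitions. This is a sketch, not the study's implementation: the derivatives here are plain finite differences (many studies use Savitzky–Golay smoothing instead), the MSC reference is assumed to be the mean spectrum of the input matrix, and all function names are illustrative.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row) by itself."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each row is regressed on the reference (row = intercept + slope * ref)
    and corrected as (row - intercept) / slope.
    """
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, deg=1)
        out[i] = (row - intercept) / slope
    return out

def l2_normalize(X):
    """Scale each spectrum to unit Euclidean (L2) norm."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def derivative(X, order=1):
    """Finite-difference derivative along the band axis (assumed, unsmoothed)."""
    return np.diff(np.asarray(X, dtype=float), n=order, axis=1)
```

SNV uses only the statistics of the spectrum being corrected, whereas MSC depends on a reference spectrum, so the reference computed on the training set would normally be reused for new samples.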
2.7. Feature Band Extraction Algorithm
The raw hyperspectral data of Xishuangbanna cucumber seeds are high-dimensional and contain a large amount of redundant information, which degrades model accuracy [32,33]. Therefore, three algorithms that are widely used in hyperspectral imaging research and have performed well in most studies, Uninformative Variable Elimination (UVE), the Successive Projections Algorithm (SPA), and Competitive Adaptive Reweighted Sampling (CARS), were used in this experiment to select the effective wavelengths that carry the most spectral information, thereby reducing extraneous variables, shortening modeling time, improving computational efficiency, and optimizing model performance [34,35,36,37,38].
2.8. Different Classification Models and Model Evaluation Indicators
Machine learning is often combined with hyperspectral image data processing to dig deeper into the information in hyperspectral images [
39]. Appropriate machine learning methods are able to extract valuable information from hyperspectral data and utilize this information to accurately classify or predict the samples to be measured, and are therefore widely used in the field of agricultural research [
40]. In this study, two machine learning models, K-Nearest Neighbor (KNN) [
9] and LogitBoost [
41], were built to predict the seed viability of Xishuangbanna cucumber.
KNN is one of the simplest algorithms in machine learning [
32]. Its core concept involves traversing the training set to locate the k closest training samples to a new sample using distance metrics, and determining the predicted value of the new sample based on the majority voting principle [
42]. The advantage of the KNN model lies in its independence from data distribution assumptions, as it directly employs the training set for sample classification, making it widely applicable for both classification and regression tasks [
42,
43].
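The majority-voting idea can be sketched in a few lines. This is an illustrative implementation for the two germination labels (0 and 1) using Euclidean distance; the tie-breaking rule for even k (ties go to label 0) is an assumption, as the paper does not describe how ties were handled.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Classify each row of X_new by majority vote among its k nearest
    training samples (Euclidean distance, labels 0 or 1)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    predictions = []
    for x in X_new:
        distances = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(distances, kind="stable")[:k]
        votes = np.bincount(y_train[nearest], minlength=2)
        # argmax returns the lower label on a tie (assumed rule).
        predictions.append(int(np.argmax(votes)))
    return np.array(predictions)
```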
LogitBoost, as one of the representative algorithms in boosting methods, is an improved version of AdaBoost. The fundamental principle of this algorithm is to construct a basic weak classifier from the existing sample dataset and iteratively refine it by focusing more on misclassified samples. In each iteration, greater weights are assigned to incorrectly classified samples [
44]. Through multiple iterations, a weighted voting mechanism is applied to the weak classifiers generated in each round, ultimately forming a composite strong classifier that yields highly accurate predictive models.
Although classical machine learning algorithms are relatively simple, they form the foundational core of machine learning development and retain significant research value [
45].
In evaluating the model performance, a total of four model evaluation metrics were used in this experiment, including accuracy, precision, recall, and F1-score [
46].
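For reference, the four metrics follow directly from the counts in a binary confusion matrix. The sketch below assumes germinated seeds (label 1) are treated as the positive class; the paper does not state which class was taken as positive or whether the values were averaged over classes.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for a binary task.

    The choice of positive class (default: label 1) is an assumption.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```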
3. Results
3.1. Raw Spectral Analysis
In order to gain a deeper understanding of the relationship between the spectral properties of Xishuangbanna cucumber seeds and their seed viability, we conducted a comprehensive spectral analysis of all Xishuangbanna cucumber seeds from nine different lineages.
The results of the analysis showed that the spectra of all Xishuangbanna cucumber seeds followed a similar trend, but the magnitude of the reflectance differed among seed samples (Figure 2A). To further understand how seed reflectance varies with vigor, we compared the average spectra of germinated and non-germinated seeds. Near 400 nm, the spectral reflectance values of both germinated and non-germinated seeds started near 1800; with increasing wavelength, the reflectance of both kinds of seeds first increased rapidly, then increased slowly, and finally decreased slowly (Figure 2B). Notably, although the spectral trends were the same, germinated and non-germinated seeds differed in reflectance within particular bands (Figure 2B). The reflectance of non-germinated seeds was significantly higher than that of germinated seeds within 400–740 nm, whereas within 748–1000 nm the two curves were very close, with the reflectance of non-germinated seeds slightly lower than that of germinated seeds.
3.2. Preprocessing Spectral Analysis
In order to effectively eliminate the noise in the original spectral data and weaken the influence of the interference information, the original spectral data were preprocessed in this study.
The preprocessing results indicate that the spectral curves converge notably after MSC and SNV preprocessing. This is likely because both approaches effectively remove spectral differences caused by varying scatter levels and reduce the baseline drift present in the original spectral data (Figure 2C,D). Although the reflectance values produced by the two methods differ, the overall trends of the spectral curves are highly consistent. This may be because MSC corrects each seed’s spectrum using the slope and offset obtained by regressing it against the mean spectrum of all seeds, while SNV corrects each spectrum using its own mean and standard deviation. The similarity in the principles of these two methods thus leads to highly consistent preprocessed spectral curves. In contrast, the spectral curves preprocessed with FD and SD display more pronounced peak and valley features in the visible region, and these features are more prominent after FD preprocessing (Figure 2E,F). The spectral curves preprocessed with L2NN do not show obvious convergence, and their trend is similar to that of the original spectra. However, they exhibit slight spectral crossover in the 700–800 nm range, and the overall reflectance is scaled to the 0–1 interval (Figure 2G).
3.3. Full-Band KNN Model Analysis
To eliminate interference from irrelevant information and noise, this study applied various algorithms to preprocess the raw spectral data. Subsequently, we developed a full-wavelength prediction model for Xishuangbanna cucumber seed vigor by using both the original spectral data and five preprocessed datasets as inputs, with training performed through KNN and LogitBoost classification models.
The basic principle of the KNN model is to identify the k training samples closest to the sample to be tested by calculating the Euclidean distance, and to assign the class of the test sample according to the classes of these neighbors (Figure 3) [9]. The performance of the model therefore changes with the value of the parameter k. To obtain the best-performing KNN model, k was varied between 1 and 16 and the model was optimized by tuning k. During the experiment, we found that the accuracy on the training set was much higher than that on the test set when k = 1, indicating that the model was overfitting. Therefore, in the subsequent analysis we evaluated the KNN model only at k = 2–16.
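The k = 1 behavior has a simple explanation: when the model is evaluated on its own training data, each sample is its own nearest neighbor, so training accuracy is trivially 100% unless identical spectra carry different labels. The sketch below sweeps k and reports training and test accuracy side by side; it is a hypothetical re-implementation for illustration, with ties resolved to label 0 as an assumption.

```python
import numpy as np

def _sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    a2 = (A ** 2).sum(axis=1)[:, None]
    b2 = (B ** 2).sum(axis=1)[None, :]
    return np.maximum(a2 + b2 - 2.0 * A @ B.T, 0.0)

def knn_accuracy(X_fit, y_fit, X_eval, y_eval, k):
    """Accuracy of a k-nearest-neighbor vote (labels 0/1) on X_eval."""
    X_fit = np.asarray(X_fit, dtype=float)
    X_eval = np.asarray(X_eval, dtype=float)
    y_fit = np.asarray(y_fit, dtype=int)
    y_eval = np.asarray(y_eval, dtype=int)
    order = np.argsort(_sq_dists(X_eval, X_fit), axis=1, kind="stable")[:, :k]
    share_positive = y_fit[order].mean(axis=1)
    predictions = (share_positive > 0.5).astype(int)  # ties resolve to 0
    return float((predictions == y_eval).mean())

def sweep_k(X_train, y_train, X_test, y_test, k_values=range(1, 17)):
    """Map each k to (training accuracy, test accuracy)."""
    return {k: (knn_accuracy(X_train, y_train, X_train, y_train, k),
                knn_accuracy(X_train, y_train, X_test, y_test, k))
            for k in k_values}
```

A large gap between the two numbers at small k is the overfitting signal described above; the k that maximizes test accuracy would be the one retained.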
In the KNN model, all three preprocessing algorithms (L2NN, MSC, and SNV) effectively improved model accuracy (
Figure 4A). The L2NN-preprocessed model demonstrated the highest accuracy across all k-values, achieving a maximum accuracy of 83.33%. While the KNN models constructed using MSC and SNV algorithms showed varying performance advantages over the Original-KNN model at different k-values, their peak accuracy consistently surpassed that of the Original-KNN model. In contrast, both FD and SD preprocessing methods reduced model accuracy (
Figure 4A). This phenomenon may be attributed to the fact that although FD and SD algorithms enhanced spectral features by reducing baseline drift, they simultaneously led to the loss of critical spectral information [
19].
Based on the confusion matrices corresponding to the peak accuracy of all models, we conclude that the L2NN-KNN and MSC-KNN models can accurately predict non-germinated and germinated seeds, respectively. The L2NN-KNN model achieved 88.1% prediction accuracy for non-germinated seeds (
Figure 5F), while the MSC-KNN model demonstrated superior performance for germinated seeds with 90.7% accuracy (
Figure 5B).
A comprehensive performance analysis of all full-wavelength KNN models revealed that the L2NN-KNN model delivered optimal performance, achieving the highest accuracy (83.33%) and precision (86.99%) among all full-spectrum KNN models. In contrast, the SD-KNN model demonstrated the poorest performance across all evaluation metrics, as detailed in
Table 2.
3.4. Full-Band LogitBoost Model Analysis
Similar to the construction of the full-wavelength KNN model, this study constructed a full-wavelength LogitBoost model for predicting the vigor of Xishuangbanna cucumber seeds by inputting both original spectral data and different preprocessed data. We compared LogitBoost models built with different preprocessing methods. The results indicated that the L2NN-LogitBoost model achieved the highest model accuracy, while the SD-LogitBoost model had the lowest accuracy, at only 62.96% (
Table 3).
Among the full-wavelength LogitBoost models, the L2NN-LogitBoost model not only attained an accuracy of 80.37% but also had the highest precision, at 82.17%, and an F1 score of 0.80 (Table 3). Additionally, the confusion matrix showed that this model achieved an accuracy of 82.8% in predicting non-germinating seeds of Xishuangbanna cucumber (Figure 6F). This suggests that, in predicting Xishuangbanna cucumber seed vigor, L2NN preprocessing not only improves the LogitBoost model overall but also enhances its ability to predict non-germinating seeds, making the model more practically valuable in specific scenarios. The SNV-LogitBoost model had the same accuracy as the Original-LogitBoost model, whereas the models built with the other preprocessing algorithms were less accurate than the Original-LogitBoost model (Figure 6A–E). This further highlights the advantage of L2NN preprocessing for LogitBoost models: compared with the other preprocessing algorithms, it appears to better retain and extract effective information from the data.
The SNV-LogitBoost model achieved the highest prediction accuracy for germinating seeds, 88.9% (Figure 6C), which is 14.8 percentage points higher than that of the SD-LogitBoost model, indicating that combining the SNV algorithm with the LogitBoost model can effectively enhance its ability to predict germinating seeds.
3.5. Feature Band Extraction
Based on the successful establishment of a full-spectrum model, this study further applied three feature band extraction algorithms, namely SPA, CARS, and UVE, to extract feature bands from the original spectral data, aiming to construct a more streamlined and efficient model for predicting the vigor of Xishuangbanna cucumber seeds. In the application of feature wavelength selection algorithms, this study adopted a unified parameter configuration strategy for both KNN and LogitBoost models. The results of feature band extraction showed significant differences in the number and composition of feature bands selected by different algorithms, but all exhibited a characteristic enrichment of feature bands within the visible light wavelength range (
Table 4,
Figure 7).
Among them, the SPA algorithm selected the fewest feature bands (Table 4). During its feature band selection, the root mean square error (RMSE) was relatively large when few bands were selected and decreased gradually as the number of selected bands increased (Figure 7A). The RMSE reached its minimum when 12 bands had been selected: 409.44 nm, 412.37 nm, 415.30 nm, 421.17 nm, 427.05 nm, 432.95 nm, 435.91 nm, 586.17 nm, 669.11 nm, 740.43 nm, 871.45 nm, and 992.78 nm (Figure 7A,B).
When using the CARS algorithm to select feature bands, the number of Monte Carlo sampling runs was set to 50, and 10-fold cross-validation was employed. From the 1st to the 12th sampling run, the root mean square error of cross-validation (RMSECV) first increased slowly and then decreased slowly, and it rose abruptly after the 12th run (Figure 7E (2)), possibly because key bands were deleted and important information was lost [17]. The RMSECV reached its minimum of 0.379606 at the 12th sampling run, at which point 119 feature bands were selected (Table 4, Figure 7E (1)). The optimal subset corresponding to the lowest RMSECV is marked by a line of asterisks (*) at the 12th run (Figure 7E (3)).
The feature wavelength selection process using the UVE algorithm in KNN and LogitBoost models is shown in
Figure 7C and
Figure 7D, respectively. To the left of the vertical dashed line is the spectral variable matrix of cucumber seeds, and to the right is a random noise matrix with the same number of added spectral variables. The two horizontal dashed lines represent the thresholds for variable selection, and the corresponding variables outside the dashed lines are the selected feature wavelengths.
It is worth noting that among the three algorithms, the UVE algorithm selected the most bands (
Table 4). When constructing the KNN model, 221 feature bands were selected using the UVE algorithm (
Figure 7C), while when constructing the LogitBoost model, 232 feature bands were selected (
Figure 7D).
3.6. Analysis of KNN and LogitBoost Models Based on Feature Bands
Two findings motivated an additional feature band. First, Xishuangbanna cucumber seeds with different vigor levels showed significant differences in spectral reflectance in the 400–740 nm band. Second, the feature bands screened by the three algorithms were mainly concentrated in the visible region of 400–780 nm. Therefore, in addition to the models built on the feature bands screened by SPA, CARS, and UVE, this study treated the 400–740 nm range as an independent feature band and used it to construct a KNN model and a LogitBoost model.
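Restricting the spectra to a fixed wavelength window is a simple column selection. The sketch below assumes the spectra are stored as a samples-by-bands matrix with a matching vector of band-center wavelengths in nanometers; the function name and the inclusive bounds are illustrative assumptions.

```python
import numpy as np

def select_band_range(X, wavelengths, low_nm=400.0, high_nm=740.0):
    """Keep only the bands whose center wavelength lies in [low_nm, high_nm].

    X: (n_samples, n_bands) spectra; wavelengths: (n_bands,) in nm.
    Returns the reduced matrix and the retained wavelengths.
    """
    X = np.asarray(X, dtype=float)
    wavelengths = np.asarray(wavelengths, dtype=float)
    if X.shape[1] != wavelengths.shape[0]:
        raise ValueError("X must have one column per wavelength")
    keep = (wavelengths >= low_nm) & (wavelengths <= high_nm)
    return X[:, keep], wavelengths[keep]
```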
We conducted a systematic comparative analysis of models constructed based on different characteristic wavelength bands. The results showed that the models built using the 400–740 nm characteristic wavelength band exhibited the highest model accuracy in both KNN and LogitBoost models (
Figure 4B and
Figure 8). This finding indicates that the 400–740 nm band is the key band for distinguishing between germinating and non-germinating seeds of Xishuangbanna cucumber, potentially containing the most discriminative spectral feature information capable of accurately reflecting differences in the internal physiological states of seeds, thus providing a critical spectral dimension for non-destructive seed vigor detection.
Additionally, the model combining the 400–740 nm wavelength band with the KNN algorithm performed best among all models using characteristic wavelengths. This model not only reached the highest values of accuracy, precision, recall, and F1 score but also showed a comprehensive improvement in performance over the Original-KNN model constructed using original spectral data (Table 5). Specifically, its accuracy was 82.22%, precision 81.76%, recall 85.21%, and F1 score 0.83, supporting the effectiveness of combining this characteristic wavelength band with the KNN algorithm.
The confusion matrix plot revealed that the models constructed using the SPA characteristic wavelength screening algorithm and the L2NN preprocessing algorithm demonstrated the highest prediction accuracy when distinguishing between non-germinating and germinating seeds (
Figure 8). The SPA-KNN model had a prediction accuracy of 79% for non-germinating seeds, slightly higher than the SPA-LogitBoost model, indicating that the combination of characteristic wavelengths selected by the SPA algorithm with the KNN model is more advantageous, and the SPA-KNN model should be prioritized when detecting non-germinating seeds. Meanwhile, the 400–740 nm band demonstrated cross-algorithm advantages in detecting germinating seeds. Both the 400–740 nm-KNN model and the 400–740 nm-LogitBoost model achieved a prediction accuracy of 85.2% for germinating seeds.
3.7. Joint Analysis of KNN and LogitBoost Models
By applying different algorithms to preprocess the raw data, we found that the L2NN preprocessing algorithm significantly enhanced the model’s predictive accuracy. Among all full-wavelength models for predicting the vigor of Xishuangbanna cucumber seeds, the L2NN-preprocessed KNN and LogitBoost models exhibited the highest accuracy, with the L2NN-KNN model achieving an accuracy exceeding 83%, thereby validating the effectiveness of L2NN preprocessing in improving the performance of seed vigor prediction models for Xishuangbanna cucumber.
In constructing prediction models for Xishuangbanna cucumber seed vigor from characteristic wavelength bands, models based on the 400–740 nm band, in which germinating and non-germinating seeds differ significantly, achieved the highest accuracy with both KNN and LogitBoost. Notably, the 400–740 nm-KNN model reached an accuracy of 82.22%, only 1.11 percentage points lower than that of the L2NN-KNN model. Furthermore, compared with the Original-KNN model constructed using raw spectral data, this model showed a comprehensive improvement in performance. Although the characteristic wavelength band model was slightly less accurate than the best full-wavelength model, it uses a significantly smaller data volume, which substantially shortens data processing and model training time. Characteristic wavelength bands therefore have the potential to substitute for full-wavelength modeling in specific scenarios, reducing time costs and improving seed vigor detection efficiency while maintaining fundamental model performance. In practical production, especially when large-scale seed vigor screening is required under time-sensitive conditions, characteristic wavelength modeling can be employed in place of full-wavelength modeling.
During the vigor detection process for Xishuangbanna cucumber seeds, we systematically evaluated the performance of the KNN and LogitBoost models. The results indicated that the KNN model outperformed the LogitBoost model overall. When modeling directly on raw spectral data, the KNN model achieved an accuracy of 80.00%, surpassing the LogitBoost model by 0.74 percentage points. Following spectral preprocessing, the classification accuracy of the KNN model improved to 83.33%, 2.96 percentage points higher than that of the LogitBoost model, which suggests that the KNN model captures critical features in preprocessed spectral data more effectively. In the simplified models built after feature wavelength screening, the KNN model still attained an accuracy of 82.22%, whereas the accuracy of the LogitBoost model declined to 77.78%, indicating that the KNN model adapts to changes in data dimensionality and maintains high classification accuracy even with a reduced number of features. These findings show that, in predicting Xishuangbanna cucumber seed vigor, the KNN model is more competitive than the LogitBoost model because it maintains high accuracy across the data processing scenarios tested. Consequently, the KNN model is recommended as an effective method for classifying and predicting Xishuangbanna cucumber seed vigor.
4. Discussion
This study highlights the importance of developing tailored hyperspectral vigor prediction models for different seed types. For Xishuangbanna cucumber seed vigor identification, we employed five preprocessing algorithms for model construction and found that the L2NN preprocessing algorithm performed best, while the SD algorithm performed worst. In contrast, Zou et al. [
8] achieved the best model performance in hyperspectral prediction of peanut seed vigor using the MF-LightGBM-RF model developed with median filter preprocessing, whereas the L2NN preprocessing algorithm demonstrated suboptimal results in their study. In another investigation by Yang et al. [
47] focused on beet seed germination prediction using hyperspectral imaging, the SD algorithm outperformed the MSC and SNV algorithms and ranked first among five preprocessing methods. MSC and SNV are both commonly used preprocessing algorithms with proven effectiveness in most hyperspectral studies [
32,
48]; in our study, the SNV algorithm yielded better preprocessing results than the MSC algorithm, consistent with observations from other studies [
49] on seed vigor prediction using hyperspectral imaging. Therefore, in practical applications, it is essential to compare the actual modeling effects of different preprocessing algorithms to identify the optimal method, ensuring the development of the most effective hyperspectral vigor prediction model for specific seed types.
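As an illustration of what two of these preprocessing steps do to a single spectrum, the sketch below implements L2-norm normalization and SNV in plain Python. It assumes that L2NN denotes scaling each spectrum to unit Euclidean length and that SNV (standard normal variate) uses the population standard deviation; some implementations use the sample standard deviation instead. The function names and the example values are hypothetical and are not taken from this study.

```python
import math
from statistics import mean, pstdev


def l2_normalize(spectrum):
    """Scale a spectrum to unit Euclidean length (assumes a non-zero spectrum)."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum]


def snv(spectrum):
    """Standard normal variate: center a spectrum on its own mean and divide
    by its own standard deviation (assumes the spectrum is not flat)."""
    mu = mean(spectrum)
    sigma = pstdev(spectrum)
    return [(v - mu) / sigma for v in spectrum]


# Made-up reflectance values: the same spectral shape at two intensities,
# plus a copy with a constant baseline offset.
base = [0.2, 0.4, 0.4, 0.8]
scaled = [2 * v for v in base]        # multiplicative intensity difference
shifted = [v + 0.1 for v in scaled]   # additional additive offset

print(l2_normalize(base))
print(l2_normalize(scaled))   # same as above up to rounding: scale is removed
print(snv(base))
print(snv(shifted))           # same as above up to rounding: scale and offset are removed
```

Comparing preprocessing methods then amounts to applying each transform to the same training and test spectra, fitting the same classifier, and comparing the resulting test accuracies.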
Despite the extensive application of hyperspectral imaging in cucumber research, most studies have concentrated on utilizing this technology for cucumber disease identification [
50,
51,
52]. Reports on the application of hyperspectral imaging in predicting cucumber seed vigor remain scarce. In this study, we successfully validated the feasibility of employing hyperspectral imaging to predict Xishuangbanna cucumber seed vigor. This suggests that hyperspectral imaging has the potential to serve as an effective alternative to traditional germination tests for cucumber seeds, particularly for the rare and endangered germplasm resources in Xishuangbanna, enabling efficient seed vigor testing and thereby reducing losses during seed collection and storage processes. However, this study also has certain limitations.
All spectral data in this experiment were collected under controlled laboratory conditions. While minimizing environmental interference ensured precise acquisition of spectral data, this controlled environment may not fully replicate real-world agricultural scenarios. In practical production settings, spectral data acquisition is susceptible to natural light and environmental noise, which can introduce errors into the spectral data. Consequently, the current models need to be validated in operational field environments to confirm that their accuracy holds under such conditions.
Although this study successfully constructed a seed vigor prediction model by integrating Xishuangbanna cucumber seeds from different lineages and validated its excellent predictive performance on the test set, the research scope remains limited to a single rare and endangered species without model verification across other rare and endangered cucumber germplasm. Therefore, future research should expand sample diversity by incorporating more rare and endangered cucumber varieties to assess the model’s generalization capability.
Furthermore, this study employed only two classical machine learning algorithms, KNN and LogitBoost, for model construction. Although traditional machine learning methods demonstrated robust predictive performance under limited sample conditions, their shallow learning architectures inevitably exhibit limitations in capturing nonlinear feature interactions and deciphering latent patterns in high-dimensional data. In existing research on seed vigor prediction using deep learning models, Qi et al. [
53] developed various CNN models to assess rice seed viability, employing two transfer learning strategies—fine-tuning and MixStyle—to facilitate knowledge transfer across different rice varieties. Experimental results showed that the CNN model trained with Yongyou 12 rice seeds achieved validation set accuracies of 90.00%, 80.33%, and 85.00% for classifying seed vigor in Yongyou1540, Suxiangjing100, and Longjingyou1212 varieties, respectively, using MixStyle transfer learning. Similarly, Wang et al. [
54] compared traditional machine learning models—SVM and ELM—with deep learning approaches, including 1DCNN, 1DLSTM, CNN-LSTM, and FA-optimized CNN-LSTM models, for identifying vigor levels in sweetcorn seeds. The deep learning models achieved classification accuracies exceeding 94.26% on the test dataset, outperforming the best-performing machine learning model by at least 3%, thereby demonstrating the superior capability of deep learning in distinguishing seed vigor levels. Given these advancements, future research should actively explore the integration of deep learning models such as CNNs [
55] into this domain, with the aim of overcoming the limitations of traditional methods and further enhancing model performance.
5. Conclusions
This study used hyperspectral imaging to collect spectral data from naturally aged Xishuangbanna cucumber seeds with varying vigor levels and analyzed the differences in spectral characteristics between germinating and non-germinating seeds. On this basis, classification models for Xishuangbanna cucumber seed vigor were constructed, and the performance of the different models was evaluated. The L2NN preprocessing algorithm was the most effective at improving model performance, and the KNN model performed consistently across the different data processing scenarios. Combining the L2NN preprocessing algorithm with the KNN model improved the accuracy of the Xishuangbanna cucumber seed vigor prediction model, providing a reliable technical tool for assessing seed vigor in rare and endangered cucumber germplasm resources. Furthermore, the characteristic wavelength models offer an efficient option for large-scale seed vigor screening, reducing the time required for data processing and model training while largely maintaining model performance, and therefore show considerable potential for practical applications. In summary, the findings of this study validate the feasibility of combining hyperspectral technology with machine learning algorithms to detect Xishuangbanna cucumber seed vigor, offering technical support for the conservation and utilization of rare and endangered cucumber germplasm resources.