Prediction of Vigor of Naturally Aged Seeds from Xishuangbanna Cucumber (Cucumis sativus L. var. xishuangbannanesis) Using Hyperspectral Imaging

Meng Zhang; Jiangping Song; Huixia Jia; Xiaohui Zhang; Wenlong Yang; Yang Wang; Haiping Wang

doi:10.3390/agriculture15101043


¹ State Key Laboratory of Vegetable Biobreeding, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing 100081, China

² College of Plant Science and Technology, Beijing University of Agriculture, Beijing 102206, China

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(10), 1043; https://doi.org/10.3390/agriculture15101043

This article belongs to the Section Crop Production


Abstract

Xishuangbanna cucumber (Cucumis sativus L. var. xishuangbannanesis) is a rare and endangered cucumber germplasm resource with irreplaceable characteristics, and it is difficult to reacquire once lost. To ensure long-term preservation of this germplasm, immediate propagation and regeneration are required after successful collection. Current germplasm management relies on conventional viability testing methods, which often lead to seed loss. Therefore, there is an urgent need to develop a rapid and non-destructive testing technology for assessing the seed viability of Xishuangbanna cucumber. This study integrated hyperspectral imaging technology with various data preprocessing methods, feature wavelength selection algorithms, and classification models to achieve rapid and non-destructive detection of Xishuangbanna cucumber seed viability. Hyperspectral imaging was employed to acquire spectral data from the seeds. Preprocessing methods including MSC (Multiplicative Scatter Correction), SNV (Standard Normal Variate), FD (First Derivative), SD (Second Derivative), and L2NN (L2 Norm Normalization) were applied to enhance spectral data quality. Feature selection algorithms such as UVE (Uninformative Variable Elimination), SPA (Successive Projections Algorithm), and CARS (Competitive Adaptive Reweighted Sampling) were utilized to identify optimal spectral bands. Combined with KNN (K-Nearest Neighbor) and LogitBoost algorithms, predictive models for seed viability were established. The results demonstrated that the L2NN-KNN model outperformed the other models, achieving an accuracy of 83.33%, a precision of 86.99%, and an F1-score of 0.83. This study confirms that hyperspectral imaging combined with machine learning can effectively predict the viability of Xishuangbanna cucumber seeds, providing a novel technical approach for the conservation of rare and endangered cucumber germplasm resources. The findings hold significant implications for promoting long-term preservation and sustainable utilization of this valuable genetic material.

Keywords:

Xishuangbanna cucumber; seed vigor; germplasm conservation; machine learning; hyperspectral imaging

1. Introduction

Cucumber (Cucumis sativus L.) is an important vegetable crop worldwide. Global cucumber production reached 94.7 million tons in 2022, of which China, as a major producer, accounted for more than 80% [1]. The Xishuangbanna cucumber, also known as the semi-wild cucumber, is found only near the border between China and Southeast Asian countries [2]. It is a cucumber variety endemic to Yunnan, and its appearance and flesh are distinctly different from those of the common cucumber. It flowers and sets fruit with difficulty in other cultivation areas and is widely cultivated only in Yunnan. It is a rare and endangered germplasm resource [3,4].

In agricultural production, seeds play a crucial role as the basis for plant reproduction [5,6]. Seed vigor, as a key indicator of seed quality, has a direct impact on crop yield [7]. High-vigor seeds not only store well but also perform well in the field, and they are more likely to germinate and form robust seedlings under suitable conditions [8,9]. Because Xishuangbanna cucumber is rare and endangered, collecting its germplasm resources is extremely difficult. After successful collection, seeds are usually stored in germplasm banks for subsequent breeding and utilization. The initial germination rate must be determined to assess seed viability before storage, and viability must also be monitored throughout storage, in accordance with the Genebank Standards for Plant Genetic Resources for Food and Agriculture [10]. However, traditional viability tests such as germination tests, although accurate, are destructive: the tested seeds are consumed by the test [11]. Once germination tests are conducted, these valuable seeds can no longer be used for conservation or research. This is particularly serious for endangered germplasm such as the Xishuangbanna cucumber, for which seed stocks in the germplasm bank are scarce and seeds of certain lineages are no longer available.

In addition, similar to germination tests, methods for detecting seed viability, such as conductivity tests, immunoassays, polymerase chain reaction, tetrazolium staining, and accelerated aging tests, also suffer from cumbersome and time-consuming procedures and damage to seeds [9,12,13]. These methods not only consume a lot of time and resources, but may also irreversibly affect seed viability and genetic integrity, further exacerbating the conservation dilemma of rare and endangered germplasm.

In contrast, hyperspectral imaging, as an emerging non-destructive testing technology, is able to acquire both image and spectral information of the sample to be tested without damaging the seed at all [14,15]. Moreover, the technology also combines the advantages of imaging and spectroscopic techniques, effectively overcoming the limitations of machine vision and near-infrared spectroscopic techniques in detecting samples [15,16]. Therefore, hyperspectral imaging has been widely used in agricultural research fields such as seed viability prediction [9], seed variety identification [17], plant disease detection [18], pesticide residue detection [19], and crop yield prediction [20].

Previous studies have demonstrated the advantages of hyperspectral imaging for predicting the viability of peanut [16], wheat [21], pepper [22], and maize seeds [23], confirming the great potential of the technique in this field. However, seeds of different species differ in phenotypic characteristics and internal composition, so a hyperspectral vigor detection system must be tailored to the specific seed source. To date, no hyperspectral vigor prediction system has been reported for Xishuangbanna cucumber seeds. Therefore, to reduce the possible loss of Xishuangbanna cucumber during seed collection and preservation, this study used Xishuangbanna cucumber seeds as experimental material, collected spectral data from seeds of different years and lineages using hyperspectral imaging, and developed a classification model for predicting seed vigor. The effects of different models, preprocessing methods, and feature band extraction algorithms on prediction accuracy were examined in depth, providing a theoretical basis for the rapid and non-destructive identification of Xishuangbanna cucumber seed vigor and supporting the protection of rare and endangered germplasm resources.

2. Materials and Methods

2.1. Experimental Materials

Xishuangbanna cucumber materials collected in Xishuangbanna Dai Autonomous Prefecture, Yunnan Province, were planted at the Nankou Experimental Base of the Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, and several inbred lines were obtained through multi-generation selfing. To construct an accurate prediction model of seed viability under natural aging, seeds of 96 Xishuangbanna cucumber lines that had been stored in the germplasm bank for different numbers of years and differed in viability were selected as experimental materials.

2.2. Hyperspectral Imaging System and Data Acquisition

A Specim PFD4k hyperspectral imaging system (Specim, Spectral Imaging Ltd., Oulu, Finland) was used for data acquisition (Figure 1). The system mainly consists of a Specim LabScanner 40 × 20, a Spectral Camera PFD4k, a calibrated whiteboard, a power supply and camera control unit, and an accompanying computer. The Specim LabScanner 40 × 20 is a compact laboratory scanner that includes a 400 × 200 mm sample tray, a camera mount, halogen illumination, and optional camera height adjustment. The Spectral Camera PFD4k consists of an ImSpector V10E for the wavelength range 400–1000 nm and a detector. It is capable of capturing images in the full spectral range at frequencies up to 100 Hz and has a spatial resolution of 1775 pixels. The camera also has an offset correction function that limits the dark noise level to approximately 300 DN at an integration time of approximately 20 ms.

Figure 1. Schematic diagram of hyperspectral imaging system.

Before hyperspectral data acquisition, the Xishuangbanna cucumber seeds were fixed on a black plate so that they would not move during scanning, and the plate was then transferred to the scanning platform. The start and end positions of the sample and of the calibrated white plate were set for the scan. As the platform moved, the hyperspectral camera first recorded the white plate to obtain a white reference image with reflectance close to 100%, and then automatically closed the electro-mechanical shutter to obtain a dark reference image with reflectance close to 0%. These two references were used for black-and-white correction of the image, which reduces noise from the equipment and the environment. The raw spectral information of the samples was then acquired as the samples moved into the scanning range of the camera.
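The black-and-white correction described above is commonly computed per pixel and per band as R = (raw − dark) / (white − dark). The text does not give the exact formula used by the acquisition software, so the following is only a minimal sketch under that assumption; the function name `calibrate_reflectance` and the toy digital numbers are hypothetical.

```python
import numpy as np

def calibrate_reflectance(raw, white, dark, eps=1e-9):
    """Black-and-white correction: R = (raw - dark) / (white - dark).

    raw   : array of raw digital numbers, last axis = bands
    white : white-reference digital numbers per band
    dark  : dark-reference digital numbers per band
    """
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    # Guard against division by zero where white and dark coincide.
    denom = np.maximum(white - dark, eps)
    return (raw - dark) / denom

# Toy example (assumed values): 2 pixels x 3 bands.
raw = np.array([[1300.0, 2300.0, 3300.0],
                [800.0, 1300.0, 1800.0]])
white = np.array([4300.0, 4300.0, 4300.0])
dark = np.array([300.0, 300.0, 300.0])
reflectance = calibrate_reflectance(raw, white, dark)
```

With this scaling, a pixel as bright as the white plate maps to 1 and a pixel at the dark level maps to 0.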

2.3. Germination Experiment

After the hyperspectral images of the seeds were obtained, they were placed in the HPS-280 biochemical incubator manufactured by Donglian Electronic Technology Development Co., Ltd. in Harbin, China, for germination, and the temperature in the incubator was set at 25 °C. The germination rate was counted at day 8 using seed breakthrough as the germination criterion, with germination labeled as “1” and non-germination labeled as “0”, and the germination results were recorded line by line in the measured spectral data table.

2.4. Hyperspectral Data Extraction

The seed region of interest was extracted using ENVI 5.3, and the entire area of a single seed sample was delineated as the region of interest. The average of the spectral reflectance within the region of interest of the sample was calculated as the data for the spectral reflectance of this sample.
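The study performed this step in ENVI 5.3. Purely as an illustration of the averaging, the sketch below computes the mean spectrum of one seed from a hyperspectral cube and a boolean region-of-interest mask; the array layout (rows, columns, bands) and the function name `mean_roi_spectrum` are assumptions.

```python
import numpy as np

def mean_roi_spectrum(cube, mask):
    """Average the spectra of all pixels selected by a boolean ROI mask.

    cube : (rows, cols, bands) reflectance array
    mask : (rows, cols) boolean array, True inside the seed
    """
    cube = np.asarray(cube, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("ROI mask selects no pixels")
    # Boolean indexing yields an (n_pixels, bands) array; average over pixels.
    return cube[mask].mean(axis=0)

# Toy cube: 2 x 2 pixels, 3 bands; the ROI covers two of the four pixels.
cube = np.arange(12, dtype=float).reshape(2, 2, 3)
mask = np.array([[True, False],
                 [False, True]])
spectrum = mean_roi_spectrum(cube, mask)
```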

2.5. Spectral Dataset Creation

After the germination results of the 96 lines were obtained, nine Xishuangbanna cucumber seed lines with different vigor (23S-1, 22S-6, 22S-18, 21S-30, 21S-7, 21S-45, 21S-36, 21S-54, and 22S-1) were selected to establish a sample set for predicting the germination of Xishuangbanna cucumber seeds (Table 1). Among these seeds, 747 germinated and 601 did not. The spectral data of each seed were paired with its germination result, with germinated seeds labeled “1” and non-germinated seeds labeled “0”, to establish the data set. The Kennard–Stone (KS) [24] method was used to select 80% of the samples as the training set, and the remaining 20% formed the test set.

Table 1. The germination rate of seeds from 96 lines of Xishuangbanna cucumber.
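The Kennard–Stone split used for the training and test sets can be sketched as follows. This is a minimal illustration of the standard rule (start from the two most distant samples, then repeatedly add the sample farthest from its nearest selected neighbour); the function name `kennard_stone_split` and the toy data are assumptions, and the authors' implementation may differ in detail.

```python
import numpy as np

def kennard_stone_split(X, train_fraction=0.8):
    """Return (train_idx, test_idx) chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_train = int(round(train_fraction * n))
    if not 2 <= n_train <= n:
        raise ValueError("need at least two training samples")
    # Pairwise Euclidean distances via the Gram matrix.
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    dist = np.sqrt(d2)
    # Seed the training set with the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    train = [int(i), int(j)]
    rest = [k for k in range(n) if k not in train]
    while len(train) < n_train:
        # For each candidate, distance to its nearest already-selected sample.
        nearest = dist[np.ix_(rest, train)].min(axis=1)
        pick = rest[int(np.argmax(nearest))]
        train.append(pick)
        rest.remove(pick)
    return train, rest

# Toy example: five one-band "spectra", 80% goes to training.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [5.0]])
train_idx, test_idx = kennard_stone_split(X_demo, train_fraction=0.8)
```

Because the rule favours samples that spread across the spectral space, the training set covers the extremes and the test set tends to fall inside that range.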

2.6. Spectral Preprocessing

When raw spectral data are acquired with a hyperspectral system, instrument noise and noise caused by ambient light are inevitably recorded even though external interference has been minimized, which causes baseline drift, baseline rotation, and other undesirable effects in the raw spectra [25,26,27]. It is therefore particularly important to preprocess the raw spectral data in order to improve their quality and attenuate the adverse effects of noise on subsequent modeling. In this experiment, five algorithms were used to preprocess the raw data: Multiplicative Scatter Correction (MSC) [28], Standard Normal Variate (SNV) [29], First Derivative (FD) [30], Second Derivative (SD) [31], and L2 Norm Normalization (L2NN) [8].
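The five preprocessing steps can be illustrated with textbook NumPy versions. This is a sketch, not the authors' code: the derivative here is a plain finite difference (smoothing such as Savitzky–Golay is often applied in practice, and the text does not say which variant was used), and all function names are hypothetical.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    out = np.empty_like(X)
    for i, x in enumerate(X):
        # Fit x ~ slope * ref + intercept, then undo slope and offset.
        slope, intercept = np.polyfit(ref, x, 1)
        out[i] = (x - intercept) / slope
    return out

def snv(X):
    """Standard normal variate: centre and scale each spectrum by its own mean and std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def first_derivative(X):
    """Finite-difference first derivative along the band axis."""
    return np.gradient(np.asarray(X, dtype=float), axis=1)

def second_derivative(X):
    """Finite-difference second derivative along the band axis."""
    return np.gradient(first_derivative(X), axis=1)

def l2_normalize(X):
    """Scale each spectrum to unit Euclidean (L2) norm."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

# Toy spectra: the second row is the first with a gain of 2 and an offset of 1.
base = np.array([1.0, 2.0, 3.0, 4.0])
X_demo = np.vstack([base, 2.0 * base + 1.0])
X_msc = msc(X_demo)
X_snv = snv(X_demo)
X_l2 = l2_normalize(X_demo)
```

In the toy data, MSC and SNV both map the two rows onto the same curve, which is the convergence effect these corrections are meant to produce for pure gain and offset differences.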

2.7. Feature Band Extraction Algorithm

The raw hyperspectral data of Xishuangbanna cucumber seeds are high-dimensional and contain a large amount of redundant information, which can reduce model accuracy [32,33]. Therefore, three algorithms that have been widely used in hyperspectral imaging research and have performed well in most studies, Uninformative Variable Elimination (UVE), Successive Projections Algorithm (SPA), and Competitive Adaptive Reweighted Sampling (CARS), were used in this experiment to select the effective wavelengths that carry the most spectral information, thereby reducing extraneous variables, shortening modeling time, improving computational efficiency, and optimizing model performance [34,35,36,37,38].

2.8. Different Classification Models and Model Evaluation Indicators

Machine learning is often combined with hyperspectral image data processing to dig deeper into the information in hyperspectral images [39]. Appropriate machine learning methods are able to extract valuable information from hyperspectral data and use it to accurately classify or predict the samples to be measured, and are therefore widely used in agricultural research [40]. In this study, two machine learning models, K-Nearest Neighbor (KNN) [9] and LogitBoost [41], were built to predict the seed viability of Xishuangbanna cucumber.

KNN is one of the simplest algorithms in machine learning [32]. Its core concept involves traversing the training set to locate the k closest training samples to a new sample using distance metrics, and determining the predicted value of the new sample based on the majority voting principle [42]. The advantage of the KNN model lies in its independence from data distribution assumptions, as it directly employs the training set for sample classification, making it widely applicable for both classification and regression tasks [42,43].

LogitBoost, as one of the representative algorithms in boosting methods, is an improved version of AdaBoost. The fundamental principle of this algorithm is to construct a basic weak classifier from the existing sample dataset and iteratively refine it by focusing more on misclassified samples. In each iteration, greater weights are assigned to incorrectly classified samples [44]. Through multiple iterations, a weighted voting mechanism is applied to the weak classifiers generated in each round, ultimately forming a composite strong classifier that yields highly accurate predictive models.
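To make the iterative idea concrete, the sketch below follows the commonly cited two-class LogitBoost formulation (Friedman, Hastie, and Tibshirani), with regression stumps as the weak learner. The weak learner, the number of rounds, the clipping constants, and all function names are assumptions; the text does not specify the configuration used in the study.

```python
import numpy as np

def fit_stump(X, z, w):
    """Weighted least-squares regression stump: (feature, threshold, left, right)."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            c_left = (w[left] * z[left]).sum() / w[left].sum()
            c_right = (w[~left] * z[~left]).sum() / w[~left].sum()
            err = (w * (z - np.where(left, c_left, c_right)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, c_left, c_right)
    if best is None:
        raise ValueError("no split possible: all features are constant")
    return best[1:]

def logitboost_fit(X, y, n_rounds=10):
    """Fit an additive model F(x); labels y must be 0 or 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    F = np.zeros(len(y))
    stumps = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))        # current P(y = 1 | x)
        w = np.clip(p * (1.0 - p), 1e-10, None)   # weights peak where the model is unsure
        z = np.clip((y - p) / w, -4.0, 4.0)       # clipped working response
        j, t, c_left, c_right = fit_stump(X, z, w)
        F += 0.5 * np.where(X[:, j] <= t, c_left, c_right)
        stumps.append((j, t, c_left, c_right))
    return stumps

def logitboost_predict(stumps, X):
    """Sum the stump outputs and classify by the sign of F(x)."""
    X = np.asarray(X, dtype=float)
    F = np.zeros(X.shape[0])
    for j, t, c_left, c_right in stumps:
        F += 0.5 * np.where(X[:, j] <= t, c_left, c_right)
    return (F > 0).astype(int)

# Toy one-band data: low values did not germinate (0), high values did (1).
X_demo = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
model = logitboost_fit(X_demo, y_demo)
train_pred = logitboost_predict(model, X_demo)
new_pred = logitboost_predict(model, np.array([[0.0], [1.0]]))
```

In this formulation the emphasis on hard samples comes from the working response and weights that are recomputed each round, and the final classifier is the sign of the summed stump outputs.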

Although classical machine learning algorithms are relatively simple, they form the foundational core of machine learning development and retain significant research value [45].

In evaluating the model performance, a total of four model evaluation metrics were used in this experiment, including accuracy, precision, recall, and F1-score [46].
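The four metrics follow directly from the counts in a binary confusion matrix. The sketch below treats germinated seeds (label 1) as the positive class; the text does not state whether the reported precision and recall are per-class or averaged values, so this choice and the function name `classification_metrics` are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: 6 seeds, 1 = germinated, 0 = not germinated.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```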

3. Results

3.1. Raw Spectral Analysis

In order to gain a deeper understanding of the relationship between the spectral properties of Xishuangbanna cucumber seeds and their seed viability, we conducted a comprehensive spectral analysis of all Xishuangbanna cucumber seeds from nine different lineages.

The results showed that the spectra of all Xishuangbanna cucumber seeds followed a similar trend, but the magnitude of the reflectance differed among seed samples (Figure 2A). To further understand how seed reflectance varies with vigor state, we compared the average spectra of germinated and non-germinated seeds. Near 400 nm, the spectral reflectance values of both germinated and non-germinated seeds started near 1800; as wavelength increased, the reflectance of both groups first rose rapidly, then rose slowly, and finally decreased slowly (Figure 2B). Notably, although the spectral trends were the same, germinated and non-germinated seeds differed in reflectance within particular bands (Figure 2B). The reflectance of non-germinated seeds was clearly higher than that of germinated seeds within 400–740 nm, whereas within 748–1000 nm the two curves were very close, with the reflectance of non-germinated seeds slightly lower than that of germinated seeds.

Figure 2. Schematic representation of raw spectra, preprocessed spectra, and average spectra of germinated and non-germinated seeds. (A) Raw spectra of Xishuangbanna cucumber seeds. (B) Average spectra, with correlation analysis, of non-germinated and germinated seeds. (C) Spectral curves after MSC preprocessing. (D) Spectral curves after SNV preprocessing. (E) Spectral curves after FD preprocessing. (F) Spectral curves after SD preprocessing. (G) Spectral curves after L2NN preprocessing.

3.2. Preprocessing Spectral Analysis

In order to effectively eliminate the noise in the original spectral data and weaken the influence of the interference information, the original spectral data were preprocessed in this study.

The preprocessing results indicate that the spectral curves converge notably after MSC and SNV preprocessing. This is likely because both approaches can effectively eliminate spectral differences caused by varying scatter levels and reduce the baseline drift present in the original spectral data (Figure 2C,D). Although the reflectance values of the spectra preprocessed with these two methods differ, the overall trends of the curves are highly consistent. This may be because MSC corrects each seed's spectrum using the slope and offset obtained by regressing it against the mean spectrum of the seeds, while SNV corrects each spectrum using its own mean and standard deviation. The similarity in the principles of the two methods therefore leads to a high degree of consistency in the preprocessed spectral curves. In contrast, the spectral curves preprocessed with FD and SD display more pronounced peaks and valleys in the visible region, and these features are more prominent in the FD-preprocessed curves (Figure 2E,F). The spectral curves preprocessed with L2NN do not show obvious convergence, and their trend is similar to that of the original spectral curves. However, they exhibit slight spectral crossover in the 700–800 nm range, and the overall spectral reflectance is scaled to the 0–1 interval (Figure 2G).

3.3. Full-Band KNN Model Analysis

To eliminate interference from irrelevant information and noise, this study applied various algorithms to preprocess the raw spectral data. Subsequently, we developed a full-wavelength prediction model for Xishuangbanna cucumber seed vigor by using both the original spectral data and five preprocessed datasets as inputs, with training performed through KNN and LogitBoost classification models.

The basic principle of the KNN model is to identify the k samples closest to the sample to be tested by calculating the Euclidean distance, and to determine the class of the sample to be tested from the classes of these neighbors (Figure 3) [9]. Therefore, when the parameter k is changed, the performance of the model changes accordingly. To obtain the best-performing KNN model, k was varied between 1 and 16 and the model was optimized by tuning this value. During the experiment, we found that the accuracy on the training set was much higher than that on the test set when k = 1, indicating that the model was overfitting. Therefore, in the subsequent analysis we evaluated the KNN model only at k = 2–16.

Figure 3. Schematic diagram of the KNN model. Notes: The circles represent the samples to be tested, and the triangles of different colors denote different categories. When k = 5, the circle is classified as a green triangle; when k = 8, the circle is classified as an orange triangle.
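The k-value search described above can be sketched with a small Euclidean-distance KNN classifier and a loop over k = 2–16. The data here are synthetic and the names `knn_predict` and `sweep_k` are hypothetical; the sketch evaluates on a generic held-out set and is not meant to reproduce the study's accuracy curves.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Classify each row of X_new by majority vote among its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train, dtype=int)
    # Euclidean distance from every new sample to every training sample.
    dist = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest]
    # Majority vote; on a tie, argmax returns the lower label (0).
    return np.array([np.bincount(v, minlength=2).argmax() for v in votes])

def sweep_k(X_train, y_train, X_eval, y_eval, k_values=range(2, 17)):
    """Return (best_k, {k: accuracy}) over the candidate k values."""
    y_eval = np.asarray(y_eval, dtype=int)
    accuracy = {}
    for k in k_values:
        pred = knn_predict(X_train, y_train, X_eval, k)
        accuracy[k] = float(np.mean(pred == y_eval))
    best_k = max(accuracy, key=accuracy.get)  # first k with the highest accuracy
    return best_k, accuracy

# Synthetic, well-separated two-class data (20 samples per class, 2 bands).
line = np.linspace(0.0, 1.0, 20)
class0 = np.column_stack([line, line])
class1 = class0 + 5.0
X_train = np.vstack([class0, class1])
y_train = np.array([0] * 20 + [1] * 20)
X_eval = np.array([[0.5, 0.5], [5.5, 5.5]])
y_eval = np.array([0, 1])
best_k, accuracy_by_k = sweep_k(X_train, y_train, X_eval, y_eval)
```

With even values of k a tie between the two classes is possible; this sketch resolves ties in favour of label 0, which is one of several reasonable conventions.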

In the KNN model, all three preprocessing algorithms (L2NN, MSC, and SNV) effectively improved model accuracy (Figure 4A). The L2NN-preprocessed model demonstrated the highest accuracy across all k-values, achieving a maximum accuracy of 83.33%. While the KNN models constructed using MSC and SNV algorithms showed varying performance advantages over the Original-KNN model at different k-values, their peak accuracy consistently surpassed that of the Original-KNN model. In contrast, both FD and SD preprocessing methods reduced model accuracy (Figure 4A). This phenomenon may be attributed to the fact that although FD and SD algorithms enhanced spectral features by reducing baseline drift, they simultaneously led to the loss of critical spectral information [19].

Figure 4. Accuracy curves of full-wavelength KNN models and feature-wavelength KNN models at different k-values, along with comparison charts of accuracy between KNN and LogitBoost models under different spectral datasets. (A) Accuracy curves of full-wavelength KNN models at different k-values. (B) Accuracy curves of feature-wavelength KNN models at different k-values. (C) Comparative charts of accuracy between KNN and LogitBoost models across different spectral datasets.

Based on the confusion matrices corresponding to the peak accuracy of all models, we conclude that the L2NN-KNN and MSC-KNN models can accurately predict non-germinated and germinated seeds, respectively. The L2NN-KNN model achieved 88.1% prediction accuracy for non-germinated seeds (Figure 5F), while the MSC-KNN model demonstrated superior performance for germinated seeds with 90.7% accuracy (Figure 5B).

Figure 5. Confusion matrix of different preprocessing KNN model. (A) Original-KNN. (B) MSC-KNN. (C) SNV-KNN. (D) FD-KNN. (E) SD-KNN. (F) L2NN-KNN. Notes: The color green represents samples that have been correctly predicted, while the color red represents samples that have been incorrectly predicted.

A comprehensive performance analysis of all full-wavelength KNN models revealed that the L2NN-KNN model delivered optimal performance, achieving the highest accuracy (83.33%) and precision (86.99%) among all full-spectrum KNN models. In contrast, the SD-KNN model demonstrated the poorest performance across all evaluation metrics, as detailed in Table 2.

Table 2. KNN model full-band processing results.

3.4. Full-Band LogitBoost Model Analysis

Similar to the construction of the full-wavelength KNN model, this study constructed a full-wavelength LogitBoost model for predicting the vigor of Xishuangbanna cucumber seeds by inputting both original spectral data and different preprocessed data. We compared LogitBoost models built with different preprocessing methods. The results indicated that the L2NN-LogitBoost model achieved the highest model accuracy, while the SD-LogitBoost model had the lowest accuracy, at only 62.96% (Table 3).

Table 3. LogitBoost model full-band processing results.

Among the full-wavelength LogitBoost models, the L2NN-LogitBoost model attained an accuracy of 80.37% and also had the highest precision (82.17%) and an F1 score of 0.80 (Table 3). Additionally, the confusion matrix showed that this model achieved an accuracy of 82.8% in predicting non-germinating seeds of Xishuangbanna cucumber (Figure 6F). This suggests that, in the prediction of Xishuangbanna cucumber seed vigor, the L2NN preprocessing algorithm not only optimizes the LogitBoost model but also enhances its ability to predict non-germinating seeds, making the model more practically valuable in specific scenarios. Among the other preprocessing algorithms, SNV yielded the same accuracy as the Original-LogitBoost model, whereas the accuracy of the models built with MSC, FD, and SD decreased (Figure 6A–E). This further highlights the advantage of the L2NN preprocessing method in constructing LogitBoost models. Compared to other preprocessing algorithms, L2NN preprocessing can better retain and extract effective information from the data, thereby building LogitBoost models with superior performance.

Figure 6. Confusion matrix of different preprocessing LogitBoost model. (A) Original-LogitBoost. (B) MSC-LogitBoost. (C) SNV-LogitBoost. (D) FD-LogitBoost. (E) SD-LogitBoost. (F) L2NN-LogitBoost. Notes: The color green represents samples that have been correctly predicted, while the color red represents samples that have been incorrectly predicted.

The SNV-LogitBoost model achieved the highest prediction accuracy for germinating seeds, 88.9% (Figure 6C), which is 14.8 percentage points higher than that of the SD-LogitBoost model. This indicates that combining the SNV algorithm with the LogitBoost model can effectively enhance its ability to predict germinating seeds.

3.5. Feature Band Extraction

Based on the successful establishment of a full-spectrum model, this study further applied three feature band extraction algorithms, namely SPA, CARS, and UVE, to extract feature bands from the original spectral data, aiming to construct a more streamlined and efficient model for predicting the vigor of Xishuangbanna cucumber seeds. In the application of feature wavelength selection algorithms, this study adopted a unified parameter configuration strategy for both KNN and LogitBoost models. The results of feature band extraction showed significant differences in the number and composition of feature bands selected by different algorithms, but all exhibited a characteristic enrichment of feature bands within the visible light wavelength range (Table 4, Figure 7).

Table 4. Number of feature bands extracted in different wavelength ranges.

Figure 7. Results of feature wavelength extraction using different algorithms during feature wavelength-model construction. (A) Correlation between variable quantity and RMSE in SPA algorithm. (B) Distribution of feature bands selected by SPA. (C) Feature wavelength selection process using UVE algorithm in UVE-KNN model. (D) Feature wavelength selection process using UVE algorithm in UVE-LogitBoost model. (E) Feature wavelength selection process using CARS algorithm. (1) Trends in the number of sampled variables. (2) Trends in RMSECV values. (3) Trends in regression coefficients for each variable.

Among them, the SPA algorithm selected the fewest feature bands (Table 4). During its feature band selection process, when the number of selected bands was small, the root mean square error (RMSE) was relatively large; as the number of selected bands increased, the RMSE gradually decreased (Figure 7A). When the number of selected bands reached 12, the RMSE reached its minimum value, and the following 12 feature bands were selected: 409.44 nm, 412.37 nm, 415.30 nm, 421.17 nm, 427.05 nm, 432.95 nm, 435.91 nm, 586.17 nm, 669.11 nm, 740.43 nm, 871.45 nm, and 992.78 nm (Figure 7A,B).
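The core of SPA is a chain of orthogonal projections: starting from one band, each step removes from all bands the component along the last selected band and then picks the band with the largest remaining norm, so that the selected bands are minimally collinear. The sketch below shows only this projection step. The full algorithm also evaluates candidate subsets by the RMSE of a regression model (as in Figure 7A), which is omitted here; the starting band and the function name `spa_select` are assumptions.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Select n_select minimally collinear columns (bands) by successive projections."""
    X = np.asarray(X, dtype=float)
    P = X - X.mean(axis=0)          # mean-centre each band
    selected = [start]
    for _ in range(n_select - 1):
        v = P[:, selected[-1]]
        # Remove from every column its component along the last selected column.
        P = P - np.outer(v, v @ P) / (v @ v)
        norms = np.linalg.norm(P, axis=0)
        norms[selected] = -1.0      # never pick an already selected band
        selected.append(int(np.argmax(norms)))
    return selected

# Toy matrix: 4 samples x 4 bands. Band 1 is collinear with band 0,
# band 2 is orthogonal to band 0, band 3 is a mixture of both.
X_demo = np.array([[1.0, 2.0, 0.0, 1.0],
                   [-1.0, -2.0, 0.0, -1.0],
                   [0.0, 0.0, 3.0, 1.0],
                   [0.0, 0.0, -3.0, -1.0]])
bands = spa_select(X_demo, n_select=2, start=0)
```

In the toy matrix the collinear band 1 is skipped in favour of band 2, which carries information not already present in band 0.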

When the CARS algorithm was used to select feature bands, the number of Monte Carlo sampling runs was set to 50, and 10-fold cross-validation was employed. From the 1st to the 12th sampling run, the root mean square error of cross-validation (RMSECV) first increased slowly and then decreased slowly; after the 12th run it increased abruptly (Figure 7E (2)), possibly because key bands were deleted and important information was lost [17]. At the 12th sampling run, the RMSECV reached its minimum of 0.379606, at which point 119 feature bands were selected (Table 4, Figure 7E (1)). The optimal subset corresponding to the lowest RMSECV is marked by a line of asterisks (*) at the 12th sampling run (Figure 7E (3)).

The feature wavelength selection process using the UVE algorithm in KNN and LogitBoost models is shown in Figure 7C and Figure 7D, respectively. To the left of the vertical dashed line is the spectral variable matrix of cucumber seeds, and to the right is a random noise matrix with the same number of added spectral variables. The two horizontal dashed lines represent the thresholds for variable selection, and the corresponding variables outside the dashed lines are the selected feature wavelengths.

It is worth noting that among the three algorithms, the UVE algorithm selected the most bands (Table 4). When constructing the KNN model, 221 feature bands were selected using the UVE algorithm (Figure 7C), while when constructing the LogitBoost model, 232 feature bands were selected (Figure 7D).

3.6. Analysis of KNN and LogitBoost Models Based on Feature Bands

Two findings motivated a further analysis. First, Xishuangbanna cucumber seeds with different vigor levels showed clear differences in spectral reflectance in the 400–740 nm band. Second, the feature bands screened by the three algorithms were mainly concentrated in the visible region of 400–780 nm. Therefore, in addition to the models built on the feature bands screened by SPA, CARS, and UVE, this study treated the 400–740 nm range as an independent feature band and used it to construct a KNN model and a LogitBoost model.
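Restricting the spectra to the 400–740 nm range amounts to a simple column selection on the wavelength axis. A minimal sketch, assuming the spectra are stored as a samples × bands matrix with a matching wavelength vector (the function name `select_band_range` and the toy values are hypothetical):

```python
import numpy as np

def select_band_range(X, wavelengths, low=400.0, high=740.0):
    """Keep only the columns whose wavelength lies in [low, high] nm."""
    X = np.asarray(X)
    wavelengths = np.asarray(wavelengths, dtype=float)
    keep = (wavelengths >= low) & (wavelengths <= high)
    return X[:, keep], wavelengths[keep]

# Toy example: 2 spectra, 5 bands.
wavelengths = [395.0, 409.44, 586.17, 740.43, 871.45]
X_demo = np.arange(10).reshape(2, 5)
X_sub, wl_sub = select_band_range(X_demo, wavelengths)
```

The bounds are inclusive, so a band centred at 740.43 nm falls just outside the range in this example.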

We conducted a systematic comparative analysis of models constructed based on different characteristic wavelength bands. The results showed that the models built using the 400–740 nm characteristic wavelength band exhibited the highest model accuracy in both KNN and LogitBoost models (Figure 4B and Figure 8). This finding indicates that the 400–740 nm band is the key band for distinguishing between germinating and non-germinating seeds of Xishuangbanna cucumber, potentially containing the most discriminative spectral feature information capable of accurately reflecting differences in the internal physiological states of seeds, thus providing a critical spectral dimension for non-destructive seed vigor detection.

Figure 8. Confusion matrix plots of feature wavelength-KNN and feature wavelength-LogitBoost models. (A) UVE-KNN. (B) SPA-KNN. (C) CARS-KNN. (D) 400–740 nm-KNN. (E) UVE-LogitBoost. (F) SPA-LogitBoost. (G) CARS-LogitBoost. (H) 400–740 nm-LogitBoost. Notes: The color green represents samples that have been correctly predicted, while the color red represents samples that have been incorrectly predicted.

Additionally, the model combining the 400–740 nm wavelength band with the KNN algorithm performed best among all models using characteristic wavelengths. This model not only reached peak values in accuracy, precision, recall, and F1 score but also improved on the Original-KNN model constructed using the original spectral data in every metric (Table 5). Specifically, its accuracy was 82.22%, precision 81.76%, recall 85.21%, and F1 score 0.83, supporting the effectiveness of combining this characteristic wavelength band with the KNN algorithm.

Table 5. Comparison of the performance of models with different feature bands.

The confusion matrices showed that the models built with the SPA feature wavelength selection algorithm and with the L2NN preprocessing algorithm were the most accurate at separating non-germinating from germinating seeds (Figure 8). The SPA-KNN model correctly predicted 79% of non-germinating seeds, slightly more than the SPA-LogitBoost model, which suggests that SPA-selected wavelengths pair better with KNN and that SPA-KNN should be preferred when the goal is to detect non-germinating seeds. The 400–740 nm band, in contrast, was advantageous for germinating seeds regardless of the classifier: the 400–740 nm-KNN and 400–740 nm-LogitBoost models both correctly predicted 85.2% of germinating seeds.
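
The per-class prediction accuracy read from a confusion matrix is the share of each true class that the model labels correctly. A minimal sketch follows; the counts are invented for illustration and are not the values shown in Figure 8, and the function name `per_class_accuracy` is hypothetical.

```python
def per_class_accuracy(confusion):
    """Return {true_label: correct / total} from confusion[true_label][predicted_label]."""
    result = {}
    for true_label, row in confusion.items():
        total = sum(row.values())
        result[true_label] = row.get(true_label, 0) / total if total else 0.0
    return result

# Hypothetical counts: outer keys are true classes, inner keys are predicted classes.
confusion = {
    "germinating": {"germinating": 85, "non-germinating": 15},
    "non-germinating": {"germinating": 21, "non-germinating": 79},
}

rates = per_class_accuracy(confusion)
correct = sum(confusion[c][c] for c in confusion)
total = sum(sum(row.values()) for row in confusion.values())
overall = correct / total
print(rates, overall)
```

Reporting the two per-class rates alongside the overall accuracy makes it visible when a model favors one class, which a single accuracy figure can hide.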

3.7. Joint Analysis of KNN and LogitBoost Models

Comparing the preprocessing algorithms applied to the raw data, we found that L2NN gave the largest gain in predictive accuracy. Among all full-wavelength models for predicting the vigor of Xishuangbanna cucumber seeds, the L2NN-preprocessed KNN and LogitBoost models were the most accurate, and the accuracy of the L2NN-KNN model exceeded 83%. These results support the effectiveness of L2NN preprocessing for improving seed vigor prediction models for Xishuangbanna cucumber.
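
For readers unfamiliar with L2 norm normalization, the idea can be sketched as follows: each spectrum is divided by its own Euclidean norm, so spectra that differ only in overall intensity become identical while the spectral shape is preserved. The function name and the toy values below are assumptions for illustration; the implementation details used in this study are not restated here.

```python
import math

def l2_normalize(spectrum):
    """Scale a spectrum so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    if norm == 0:
        return list(spectrum)  # leave an all-zero spectrum unchanged
    return [v / norm for v in spectrum]

# Two toy spectra with the same shape but different overall brightness (hypothetical values).
dim = [0.3, 0.4]
bright = [0.6, 0.8]
dim_n = l2_normalize(dim)
bright_n = l2_normalize(bright)
print(dim_n, bright_n)
```

After normalization both toy spectra map to approximately the same vector, so a distance-based classifier such as KNN compares spectral shape rather than overall brightness.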

Among the models built on feature bands, those based on the 400–740 nm band, where germinating and non-germinating seeds differ significantly, were the most accurate for both KNN and LogitBoost. The 400–740 nm-KNN model reached an accuracy of 82.22%, only 1.11 percentage points below the L2NN-KNN model, and it outperformed the Original-KNN model built on the raw spectra on every metric. Although the feature-band model was slightly less accurate than the best full-wavelength model, it uses far less data, which substantially shortens data processing and model training. Feature-band modeling can therefore substitute for full-wavelength modeling in specific scenarios, reducing time costs and improving detection efficiency while maintaining basic model performance. In practical production, especially when large numbers of seeds must be screened under time pressure, feature-band models are a reasonable replacement for full-wavelength models.

We also compared the KNN and LogitBoost models directly, and KNN performed better overall. With the raw spectra, KNN reached an accuracy of 80.00%, 0.74 percentage points above LogitBoost. After spectral preprocessing (L2NN), the accuracy of KNN rose to 83.33%, 2.96 percentage points above LogitBoost, which suggests that KNN makes better use of the information in the preprocessed spectra. In the simplified models built after feature wavelength selection, KNN still reached 82.22%, whereas LogitBoost fell to 77.78%. KNN therefore adapts to the reduced dimensionality and remains accurate with fewer features. Because KNN maintained high accuracy across all of these data processing scenarios, we recommend it over LogitBoost for classifying and predicting Xishuangbanna cucumber seed vigor.
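
The accuracy gaps between the two classifiers can be recomputed directly from the accuracies reported in Tables 2, 3, and 5. The short Python sketch below does this arithmetic; the dictionary layout and scenario names are illustrative choices.

```python
# Accuracies (%) as reported in Tables 2, 3, and 5.
accuracy = {
    "raw spectra": {"KNN": 80.00, "LogitBoost": 79.26},
    "L2NN preprocessing": {"KNN": 83.33, "LogitBoost": 80.37},
    "400-740 nm band": {"KNN": 82.22, "LogitBoost": 77.78},
}

# Differences between two percentages are expressed in percentage points.
gaps = {name: round(acc["KNN"] - acc["LogitBoost"], 2) for name, acc in accuracy.items()}
for name, gap in gaps.items():
    print(f"{name}: KNN leads by {gap:.2f} percentage points")
```

The gap widens from 0.74 to 2.96 to 4.44 percentage points across the three scenarios, so the advantage of KNN is largest for the reduced 400–740 nm feature set.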

4. Discussion

This study highlights the importance of developing hyperspectral vigor prediction models tailored to each seed type. For Xishuangbanna cucumber seeds, we compared five preprocessing algorithms and found that L2NN performed best and SD performed worst. In contrast, Zou et al. [8] obtained their best model for hyperspectral prediction of peanut seed vigor (MF-LightGBM-RF) with median filter preprocessing, whereas L2NN gave suboptimal results in their study. Yang et al. [47], who predicted beet seed germination with hyperspectral imaging, found that SD outperformed MSC and SNV and ranked first among five preprocessing methods. MSC and SNV are both commonly used and have proven effective in most hyperspectral studies [32,48]; in our data, SNV gave better preprocessing results than MSC, consistent with other hyperspectral studies of seed vigor [49]. In practice, therefore, the modeling performance of different preprocessing algorithms should be compared directly so that the best method can be identified for each seed type.
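
To make the comparison of preprocessing methods concrete, the sketch below implements two of them in plain Python: SNV, which centers each spectrum on its own mean and scales it by its own standard deviation, and a simple finite-difference first derivative. The function names and toy values are assumptions, and the finite difference is a stand-in; derivative spectra are often computed with smoothing filters in practice, and the exact procedure used in this study is not restated here.

```python
import statistics

def snv(spectrum):
    """Standard normal variate: subtract the spectrum's mean, divide by its standard deviation."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1); assumes a non-constant spectrum
    return [(v - mean) / std for v in spectrum]

def first_difference(spectrum):
    """Finite-difference approximation of the first derivative (one value fewer than the input)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

spectrum = [0.10, 0.20, 0.30, 0.40, 0.50]  # hypothetical reflectance values
candidates = {"SNV": snv(spectrum), "FD": first_difference(spectrum)}
for name, values in candidates.items():
    print(name, [round(v, 3) for v in values])
```

Each candidate transform would then be fed to the same classifier and scored on the same test set, so that differences in accuracy can be attributed to the preprocessing step alone.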

Despite the extensive application of hyperspectral imaging in cucumber research, most studies have concentrated on utilizing this technology for cucumber disease identification [50,51,52]. Reports on the application of hyperspectral imaging in predicting cucumber seed vigor remain scarce. In this study, we successfully validated the feasibility of employing hyperspectral imaging to predict Xishuangbanna cucumber seed vigor. This suggests that hyperspectral imaging has the potential to serve as an effective alternative to traditional germination tests for cucumber seeds, particularly for the rare and endangered germplasm resources in Xishuangbanna, enabling efficient seed vigor testing and thereby reducing losses during seed collection and storage processes. However, this study also has certain limitations.

All spectral data in this experiment were collected under controlled laboratory conditions. Minimizing environmental interference ensured precise spectral acquisition, but such conditions may not fully represent real-world agricultural settings. In practical production, spectral acquisition is affected by natural light and environmental noise, which can introduce errors into the data. The current models therefore need to be validated in operational field environments to confirm that their accuracy holds.

Although this study successfully constructed a seed vigor prediction model by integrating Xishuangbanna cucumber seeds from different lineages and validated its excellent predictive performance on the test set, the research scope remains limited to a single rare and endangered species without model verification across other rare and endangered cucumber germplasm. Therefore, future research should expand sample diversity by incorporating more rare and endangered cucumber varieties to assess the model’s generalization capability.

Furthermore, this study employed only two classical machine learning algorithms, KNN and LogitBoost, for model construction. Although traditional machine learning methods demonstrated robust predictive performance under limited sample conditions, their shallow learning architectures inevitably exhibit limitations in capturing nonlinear feature interactions and deciphering latent patterns in high-dimensional data. In existing research on seed vigor prediction using deep learning models, Qi et al. [53] developed various CNN models to assess rice seed viability, employing two transfer learning strategies—fine-tuning and MixStyle—to facilitate knowledge transfer across different rice varieties. Experimental results showed that the CNN model trained with Yongyou 12 rice seeds achieved validation set accuracies of 90.00%, 80.33%, and 85.00% for classifying seed vigor in Yongyou1540, Suxiangjing100, and Longjingyou1212 varieties, respectively, using MixStyle transfer learning. Similarly, Wang et al. [54] compared traditional machine learning models—SVM and ELM—with deep learning approaches, including 1DCNN, 1DLSTM, CNN-LSTM, and FA-optimized CNN-LSTM models, for identifying vigor levels in sweetcorn seeds. The deep learning models achieved classification accuracies exceeding 94.26% on the test dataset, outperforming the best-performing machine learning model by at least 3%, thereby demonstrating the superior capability of deep learning in distinguishing seed vigor levels. Given these advancements, future research should actively explore the integration of deep learning models such as CNNs [55] into this domain, with the aim of overcoming the limitations of traditional methods and further enhancing model performance.

5. Conclusions

This study utilized hyperspectral imaging technology to collect hyperspectral data from Xishuangbanna cucumber seeds with varying vigor levels under natural aging conditions, conducting an in-depth analysis of spectral characteristic differences between germinating and non-germinating seeds. On this basis, classification models for Xishuangbanna cucumber seed vigor were constructed, and the performance of different models was evaluated. The study confirmed the exceptional efficacy of the L2NN preprocessing algorithm in optimizing model performance, as well as the versatility and stability of the KNN model across diverse data processing scenarios. The integration of the L2NN preprocessing algorithm with the KNN model significantly enhanced the accuracy of the Xishuangbanna cucumber seed vigor prediction model, providing a reliable and precise technical tool for assessing seed vigor in rare and endangered cucumber germplasm resources. Furthermore, the exploration and application of characteristic wavelength band models offer an efficient solution for large-scale seed vigor screening, substantially reducing time costs associated with data processing and model training while maintaining fundamental model performance and improving detection efficiency, thereby demonstrating immense potential for practical applications. In summary, the findings of this study validate the feasibility of employing hyperspectral technology combined with machine learning algorithms for detecting Xishuangbanna cucumber seed vigor, offering robust technical support for the conservation and utilization of rare and endangered cucumber germplasm resources.

Author Contributions

Data collection, M.Z., J.S. and H.J.; conceptualization and supervision, H.W., H.J., Y.W., X.Z. and W.Y.; writing – original draft preparation, M.Z.; writing – review and editing, M.Z. and H.W.; M.Z. and H.W. revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Key R&D Program of China (2021YFD1200200), the Innovation Engineering Project of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2021-IVF), the Youth Innovation Special Task of the Chinese Academy of Agricultural Sciences (Y2023QC06), Agricultural Basic Long-term Scientific and Technological Work (NAES-GR-005), the Safe Preservation Project of Crop Germplasm Resources of MOF (2022NWB037), and the National Horticultural Germplasm Centre Project (NHGRC2023-NH01).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declared that they had no known competing financial interests.

Abbreviations

MSC	Multivariate Scattering Correction
SNV	Standard Normal Variate
FD	First Derivative
SD	Second Derivative
L2NN	L2 Norm Normalization
UVE	Uninformative Variables Elimination
SPA	Successive Projections Algorithm
CARS	Competitive Adaptive Reweighted Sampling
KNN	K-Nearest Neighbor
RMSE	root mean square error
RMSECV	root mean square error of cross-validation

References

1. Zhao, J.; Song, W.; Zhang, X. Genetic and molecular regulation of fruit development in cucumber. New Phytol. 2024, 244, 1742–1749.
2. Zhang, R.-J.; Liu, B.; Song, S.-S.; Salah, R.; Song, C.-J.; Xia, S.-W.; Hao, Q.; Liu, Y.-J.; Li, Y.; Lai, Y.-S. Lipid-Related Domestication Accounts for the Extreme Cold Sensitivity of Semiwild and Tropic Xishuangbanna Cucumber (Cucumis sativus L. var. xishuangbannanesis). Int. J. Mol. Sci. 2024, 25, 79.
3. Bo, K.; Ma, Z.; Chen, J.; Weng, Y. Molecular mapping reveals structural rearrangements and quantitative trait loci underlying traits with local adaptation in semi-wild Xishuangbanna cucumber (Cucumis sativus L. var. xishuangbannanesis Qi et Yuan). Theor. Appl. Genet. 2014, 128, 25–39.
4. Obel, H.O.; Cheng, C.; Tian, Z.; Li, J.; Lou, Q.; Yu, X.; Wang, Y.; Ogweno, J.O.; Chen, J. Molecular Research Progress on Xishuangbanna Cucumber (Cucumis sativus L. var. Xishuangbannesis Qi et Yuan): Current Status and Future Prospects. Agronomy 2022, 12, 300.
5. Guo, Q.; Meng, Y.; Qu, G.; Wang, T.; Yang, F.; Liang, D.; Hu, S. Improvement of wheat seed vitality by dielectric barrier discharge plasma treatment. Bioelectromagnetics 2018, 39, 120–131.
6. Pang, T.; Chen, C.; Fu, R.; Wang, X.; Yu, H. An end-to-end seed vigor prediction model for imbalanced samples using hyperspectral image. Front. Plant Sci. 2023, 14, 1322391.
7. Cui, H.; Cheng, Z.; Li, P.; Miao, A. Prediction of Sweet Corn Seed Germination Based on Hyperspectral Image Technology and Multivariate Data Regression. Sensors 2020, 20, 4744.
8. Zou, Z.; Chen, J.; Zhou, M.; Zhao, Y.; Long, T.; Wu, Q.; Xu, L. Prediction of peanut seed vigor based on hyperspectral images. Food Sci. Technol. 2022, 42, e32822.
9. Pang, L.; Wang, J.; Men, S.; Yan, L.; Xiao, J. Hyperspectral imaging coupled with multivariate methods for seed vitality estimation and forecast for Quercus variabilis. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 245, 118888.
10. Hay, F.R.; Whitehouse, K.J. Rethinking the approach to viability monitoring in seed genebanks. Conserv. Physiol. 2017, 5, cox009.
11. Nansen, C.; Zhao, G.; Dakin, N.; Zhao, C.; Turner, S.R. Using hyperspectral imaging to determine germination of native Australian plant seeds. J. Photochem. Photobiol. B 2015, 145, 19–24.
12. Kandpal, L.M.; Lohumi, S.; Kim, M.S.; Kang, J.-S.; Cho, B.-K. Near-infrared hyperspectral imaging system coupled with multivariate methods to predict viability and vigor in muskmelon seeds. Sens. Actuators B Chem. 2016, 229, 534–544.
13. Siam, A.A.; Salehin, M.M.; Alam, M.S.; Ahamed, S.; Islam, M.H.; Rahman, A. Paddy seed viability prediction based on feature fusion of color and hyperspectral image with multivariate analysis. Heliyon 2024, 10, e36999.
14. Kamruzzaman, M.; Makino, Y.; Oshita, S. Non-invasive analytical technology for the detection of contamination, adulteration, and authenticity of meat, poultry, and fish: A review. Anal. Chim. Acta 2015, 853, 19–29.
15. Zhang, B.; Ou, Y.; Yu, S.; Liu, Y.; Liu, Y.; Qiu, W. Gray mold and anthracnose disease detection on strawberry leaves using hyperspectral imaging. Plant Methods 2023, 19, 148.
16. Zou, Z.; Chen, J.; Wu, W.; Luo, J.; Long, T.; Wu, Q.; Wang, Q.; Zhen, J.; Zhao, Y.; Wang, Y.; et al. Detection of peanut seed vigor based on hyperspectral imaging and chemometrics. Front. Plant Sci. 2023, 14, 1127108.
17. Zhu, S.; Chao, M.; Zhang, J.; Xu, X.; Song, P.; Zhang, J.; Huang, Z. Identification of Soybean Seed Varieties Based on Hyperspectral Imaging Technology. Sensors 2019, 19, 5225.
18. Xie, C.; He, Y. Spectrum and Image Texture Features Analysis for Early Blight Disease Detection on Eggplant Leaves. Sensors 2016, 16, 676.
19. Liang, M.; Wang, Z.; Lin, Y.; Li, C.; Zhang, L.; Liu, Y. Study on detection of pesticide residues in tobacco based on hyperspectral imaging technology. Front. Plant Sci. 2024, 15, 1459886.
20. Yoosefzadeh-Najafabadi, M.; Earl, H.J.; Tulpan, D.; Sulik, J.; Eskandari, M. Application of Machine Learning Algorithms in Plant Breeding: Predicting Yield From Hyperspectral Reflectance in Soybean. Front. Plant Sci. 2020, 11, 624273.
21. Zhang, T.; Fan, S.; Xiang, Y.; Zhang, S.; Wang, J.; Sun, Q. Non-destructive analysis of germination percentage, germination energy and simple vigour index on wheat seeds during storage by Vis/NIR and SWIR hyperspectral imaging. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2020, 239, 118488.
22. Mo, C.; Kim, G.; Lee, K.; Kim, M.S.; Cho, B.K.; Lim, J.; Kang, S. Non-destructive quality evaluation of pepper (Capsicum annuum L.) seeds using LED-induced hyperspectral reflectance imaging. Sensors 2014, 14, 7489–7504.
23. Wakholi, C.; Kandpal, L.M.; Lee, H.; Bae, H.; Park, E.; Kim, M.S.; Mo, C.; Lee, W.-H.; Cho, B.-K. Rapid assessment of corn seed viability using short wave infrared line-scan hyperspectral imaging and chemometrics. Sens. Actuators B Chem. 2018, 255, 498–507.
24. Gu, Y.; Shi, L.; Wu, J.; Hu, S.; Shang, Y.; Hassan, M.; Zhao, C. Quantitative Prediction of Acid Value of Camellia Seed Oil Based on Hyperspectral Imaging Technology Fusing Spectral and Image Features. Foods 2024, 13, 3249.
25. Ai, W.; Liu, S.; Liao, H.; Du, J.; Cai, Y.; Liao, C.; Shi, H.; Lin, Y.; Junaid, M.; Yue, X.; et al. Application of hyperspectral imaging technology in the rapid identification of microplastics in farmland soil. Sci. Total Environ. 2022, 807, 151030.
26. Yang, H.; Wang, C.; Zhang, H.; Zhou, Y.; Luo, B. Recognition of maize seed varieties based on hyperspectral imaging technology and integrated learning algorithms. PeerJ Comput. Sci. 2023, 9, e1354.
27. Wang, J.; Sun, L.; Feng, G.; Bai, H.; Yang, J.; Gai, Z.; Zhao, Z.; Zhang, G. Intelligent detection of hard seeds of snap bean based on hyperspectral imaging. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 275, 121169.
28. Yang, C.; Song, L.; Wei, K.; Gao, C.; Wang, D.; Feng, M.; Zhang, M.; Wang, C.; Xiao, L.; Yang, W.; et al. Study on Hyperspectral Monitoring Model of Total Flavonoids and Total Phenols in Tartary Buckwheat Grains. Foods 2023, 12, 1354.
29. Dai, C.; Sun, J.; Huang, X.; Zhang, X.; Tian, X.; Wang, W.; Sun, J.; Luan, Y. Application of Hyperspectral Imaging as a Nondestructive Technology for Identifying Tomato Maturity and Quantitatively Predicting Lycopene Content. Foods 2023, 12, 2957.
30. Kang, Z.; Fan, R.; Zhan, C.; Wu, Y.; Lin, Y.; Li, K.; Qing, R.; Xu, L. The Rapid non-destructive differentiation of different varieties of rice by fluorescence hyperspectral technology combined with machine learning. Molecules 2024, 29, 682.
31. Qu, Y.; Liu, Z. Dimensionality reduction and derivative spectral feature optimization for hyperspectral target recognition. Optik 2017, 130, 1349–1357.
32. Zhang, L.; Sun, H.; Rao, Z.; Ji, H. Hyperspectral imaging technology combined with deep forest model to identify frost-damaged rice seeds. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2020, 229, 117973.
33. Xu, M.; Sun, J.; Zhou, X.; Tang, N.; Shen, J.; Wu, X. Research on nondestructive identification of grape varieties based on EEMD-DWT and hyperspectral image. J. Food Sci. 2021, 86, 2011–2023.
34. Mao, Y.; Li, H.; Wang, Y.; Fan, K.; Song, Y.; Han, X.; Zhang, J.; Ding, S.; Song, D.; Wang, H. Prediction of tea polyphenols, free amino acids and caffeine content in tea leaves during wilting and fermentation using hyperspectral imaging. Foods 2022, 11, 2537.
35. Zhong, Q.; Zhang, H.; Tang, S.; Li, P.; Lin, C.; Zhang, L.; Zhong, N. Feasibility Study of Combining Hyperspectral Imaging with Deep Learning for Chestnut-Quality Detection. Foods 2023, 12, 2089.
36. Xue, H.; Xu, X.; Yang, Y.; Hu, D.; Niu, G. Rapid and Non-Destructive Prediction of Moisture Content in Maize Seeds Using Hyperspectral Imaging. Sensors 2024, 24, 1855.
37. Wang, C.; Fu, X.; Zhou, Y.; Fu, F. Deoxynivalenol Detection beyond the Limit in Wheat Flour Based on the Fluorescence Hyperspectral Imaging Technique. Foods 2024, 13, 897.
38. Zhang, T.; Lu, L.; Yang, N.; Fisk, I.D.; Wei, W.; Wang, L.; Li, J.; Sun, Q.; Zeng, R. Integration of hyperspectral imaging, non-targeted metabolomics and machine learning for vigour prediction of naturally and accelerated aged sweetcorn seeds. Food Control 2023, 153, 109930.
39. Jin, B.; Zhang, C.; Jia, L.; Tang, Q.; Gao, L.; Zhao, G.; Qi, H. Identification of rice seed varieties based on near-infrared hyperspectral imaging technology combined with deep learning. ACS Omega 2022, 7, 4735–4749.
40. Saha, D.; Manickavasagan, A. Machine learning techniques for analysis of hyperspectral images to determine quality of food products: A review. Curr. Res. Food Sci. 2021, 4, 28–44.
41. Zhang, G.; Fang, B. LogitBoost classifier for discriminating thermophilic and mesophilic proteins. J. Biotechnol. 2007, 127, 417–424.
42. Shi, P.; Jiang, Q.; Li, Z. Hyperspectral Characteristic Band Selection and Estimation Content of Soil Petroleum Hydrocarbon Based on GARF-PLSR. J. Imaging 2023, 9, 87.
43. Wu, Z.; Zhang, J.; Hu, S. Review on Classification Algorithm and Evaluation System of Machine Learning. In Proceedings of the 2020 13th International Conference on Intelligent Computation Technology and Automation (ICICTA), Xi’an, China, 24–25 October 2020; pp. 214–218.
44. Chen, P.; Pan, C. Diabetes classification model based on boosting algorithms. BMC Bioinform. 2018, 19, 109.
45. Zhao, S.; Zhao, D.; Song, J.; Jia, H.; Zhang, X.; Yang, W.; Wang, H. Rapid Nondestructive Detection of Welsh Onion, Onion, and Chinese Chives Seeds Based on Hyperspectral Imaging Technology. Agriculture 2025, 15, 816.
46. Nadimi, M.; Divyanth, L.G.; Chaudhry, M.M.A.; Singh, T.; Loewen, G.; Paliwal, J. Assessment of Mechanical Damage and Germinability in Flaxseeds Using Hyperspectral Imaging. Foods 2024, 13, 120.
47. Yang, J.; Sun, L.; Xing, W.; Feng, G.; Bai, H.; Wang, J. Hyperspectral prediction of sugarbeet seed germination based on gauss kernel SVM. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 253, 119585.
48. Zhu, H.; Wang, M.; Zhang, J.; Ma, F. Prediction of Apple Hybrid Offspring Aroma Based on Hyperspectral. Foods 2022, 11, 3890.
49. Zhang, T.; Wei, W.; Zhao, B.; Wang, R.; Li, M.; Yang, L.; Wang, J.; Sun, Q. A Reliable Methodology for Determining Seed Viability by Using Hyperspectral Data from Two Sides of Wheat Seeds. Sensors 2018, 18, 813.
50. Zhao, Y.-R.; Li, X.; Yu, K.-Q.; Cheng, F.; He, Y. Hyperspectral imaging for determining pigment contents in cucumber leaves in response to angular leaf spot disease. Sci. Rep. 2016, 6, 27790.
51. Li, Y.; Luo, Z.; Wang, F.; Wang, Y. Hyperspectral leaf image-based cucumber disease recognition using the extended collaborative representation model. Sensors 2020, 20, 4045.
52. Shi, J.; Wang, Y.; Li, Z.; Huang, X.; Shen, T.; Zou, X. Characterization of invisible symptoms caused by early phosphorus deficiency in cucumber plants using near-infrared hyperspectral imaging technology. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 267, 120540.
53. Qi, H.; Huang, Z.; Sun, Z.; Tang, Q.; Zhao, G.; Zhu, X.; Zhang, C. Rice seed vigor detection based on near-infrared hyperspectral imaging and deep transfer learning. Front. Plant Sci. 2023, 14, 1283921.
54. Wang, Y.; Song, S. Detection of sweet corn seed viability based on hyperspectral imaging combined with firefly algorithm optimized deep learning. Front. Plant Sci. 2024, 15, 1361309.
55. Ma, T.; Tsuchikawa, S.; Inagaki, T. Rapid and non-destructive seed viability prediction using near-infrared hyperspectral imaging coupled with a deep learning approach. Comput. Electron. Agric. 2020, 177, 105683.

Figure 1. Schematic diagram of hyperspectral imaging system.

Figure 2. Raw spectra, preprocessed spectra, and average spectra of germinated and non-germinated seeds. (A) Raw spectra of Xishuangbanna cucumber seeds. (B) Average spectra of non-germinated and germinated seeds, with correlation analysis. (C) Spectral curves after MSC preprocessing. (D) Spectral curves after SNV preprocessing. (E) Spectral curves after FD preprocessing. (F) Spectral curves after SD preprocessing. (G) Spectral curves after L2NN preprocessing.

Figure 3. Schematic diagram of the KNN model. Notes: The circles represent the samples to be tested, and the triangles of different colors denote different categories. When k = 5, the circle is classified as a green triangle; when k = 8, the circle is classified as an orange triangle.

Figure 4. Accuracy curves of full-wavelength KNN models and feature-wavelength KNN models at different k-values, along with comparison charts of accuracy between KNN and LogitBoost models under different spectral datasets. (A) Accuracy curves of full-wavelength KNN models at different k-values. (B) Accuracy curves of feature-wavelength KNN models at different k-values. (C) Comparative charts of accuracy between KNN and LogitBoost models across different spectral datasets.

Figure 5. Confusion matrices of KNN models with different preprocessing methods. (A) Original-KNN. (B) MSC-KNN. (C) SNV-KNN. (D) FD-KNN. (E) SD-KNN. (F) L2NN-KNN. Notes: Green represents correctly predicted samples, and red represents incorrectly predicted samples.

Figure 6. Confusion matrices of LogitBoost models with different preprocessing methods. (A) Original-LogitBoost. (B) MSC-LogitBoost. (C) SNV-LogitBoost. (D) FD-LogitBoost. (E) SD-LogitBoost. (F) L2NN-LogitBoost. Notes: Green represents correctly predicted samples, and red represents incorrectly predicted samples.

Figure 7. Results of feature wavelength extraction using different algorithms during feature wavelength-model construction. (A) Correlation between variable quantity and RMSE in SPA algorithm. (B) Distribution of feature bands selected by SPA. (C) Feature wavelength selection process using UVE algorithm in UVE-KNN model. (D) Feature wavelength selection process using UVE algorithm in UVE-LogitBoost model. (E) Feature wavelength selection process using CARS algorithm. (1) Trends in the number of sampled variables. (2) Trends in RMSECV values. (3) Trends in regression coefficients for each variable.

Table 1. The germination rate of seeds from 96 lines of Xishuangbanna cucumber.

Strains	Germination Rate	Strains	Germination Rate	Strains	Germination Rate
23S-1	4.67%	21S-27	87.33%	21S-42	96%
22S-6	18.67%	21S-37	87.92%	21S-48	96%
22S-18	28.67%	21S-14	88%	22S-19	96.64%
21S-30	30.87%	22S-11	88%	22S-12	96.67%
21S-56	58%	23S-6	88%	23S-5	96.67%
21S-7	58.67%	21S-9	88.67%	22S-17	97.32%
21S-10	71.33%	21S-32	88.67%	21S-31	97.33%
21S-2	72.67%	22S-16	89.33%	22S-3	97.33%
21S-34	72.67%	21S-16	89.86%	22S-9	97.33%
21S-53	74.67%	22S-14	90%	21S-55	98%
21S-45	75.17%	22S-24	90.60%	22S-4	98%
21S-15	76.51%	21S-50	90.67%	22S-7	98%
21S-19	77.33%	21S-40	91.33%	22S-15	98%
23S-3	77.33%	21S-6	92%	22S-28	98%
21S-1	78%	21S-39	92%	22S-23	98.59%
21S-20	80.67%	21S-21	92.67%	21S-28	98.66%
21S-5	81.33%	21S-24	92.67%	21S-18	98.67%
21S-38	81.33%	21S-46	93.33%	21S-25	98.67%
21S-11	82%	22S-2	93.33%	21S-57	98.67%
21S-33	82%	22S-25	93.33%	22S-27	98.67%
21S-52	82%	22S-29	93.57%	23S-7	98.67%
21S-12	83.33%	21S-8	94%	23S-8	98.67%
21S-35	84%	21S-23	94%	23S-9	98.67%
22S-13	84%	21S-3	94.67%	21S-4	99.33%
21S-26	84.67%	21S-17	94.67%	21S-41	99.33%
21S-29	84.67%	21S-13	95.33%	21S-47	99.33%
21S-43	84.67%	21S-51	95.33%	22S-10	99.33%
21S-44	84.67%	21S-54	95.33%	22S-20	99.33%
22S-8	84.67%	22S-5	95.33%	22S-22	99.33%
22S-21	84.67%	22S-26	95.33%	23S-2	99.33%
21S-36	86.67%	23S-4	95.33%	22S-1	100%
21S-49	87.33%	21S-22	96%	22S-30	100%

Table 2. KNN model full-band processing results.

Preprocessing	K-Value	Accuracy (%)	Precision (%)	Recall (%)	F1-Score
Original	3	80.00	78.74	78.74	0.79
MSC	12	81.48	79.07	90.67	0.84
SNV	11	81.11	80.72	87.58	0.84
FD	14	67.41	67.78	80.26	0.73
SD	16	62.96	66.67	73.42	0.70
L2NN	7	83.33	86.99	78.68	0.83

Table 3. LogitBoost model full-band processing results.

Preprocessing	Accuracy (%)	Precision (%)	Recall (%)	F1-Score
Original	79.26	77.10	79.53	0.78
MSC	77.41	76.33	86.00	0.81
SNV	79.26	77.71	88.89	0.83
FD	77.04	75.00	88.82	0.81
SD	62.96	66.48	74.05	0.70
L2NN	80.37	82.17	77.94	0.80

Table 4. Number of feature bands extracted in different wavelength ranges.

Feature Band Extraction Algorithm	Total Number of Bands	NIR	Visible
SPA	12	2	10
CARS	119	43	76
UVE (KNN)	221	92	129
UVE (LogitBoost)	232	96	136

Table 5. Comparison of the performance of models with different feature bands.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score
UVE-KNN	80.37	77.21	82.68	0.80
SPA-KNN	77.41	76.19	75.59	0.76
CARS-KNN	77.04	74.44	77.95	0.76
400–740 nm-KNN	82.22	81.76	85.21	0.83
UVE-LogitBoost	76.30	72.99	78.74	0.76
SPA-LogitBoost	76.67	73.53	78.74	0.76
CARS-LogitBoost	76.67	72.86	80.31	0.76
400–740 nm-LogitBoost	77.78	75.63	85.21	0.80

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
