1. Introduction
Delivering sustainable gains in food and nutrition under climate constraints is increasingly limited by measurement capacity [1,2]. In seed-based systems, the ability to quantify and track biologically meaningful variation quickly, non-destructively and at scale is essential for conserving agrobiodiversity, supporting breeding decisions and safeguarding seed supply chains [1,2,3].
Common bean (Phaseolus vulgaris L.) is central to this agricultural agenda because it is simultaneously a global staple and a reservoir of adaptive diversity that can be mobilised for resilience and quality [4,5]. Recent synthesis work frames common bean improvement as a coupled challenge spanning climate stress [4], nutritional value and the need to leverage diversity across gene pools and associated microbiomes [4,5]. A comparable emphasis on leveraging conserved diversity appears across other grain legumes, including soybean improvement pipelines, reinforcing the broader value of accession-level authentication in legume breeding [6]. Pangenome, seed colour and production belt analyses further reveal why identity, traceability and adaptive diversity must be resolved together in Brazilian bean systems [7,8,9].
Despite this importance, the characterisation of bean accessions and landraces remains operationally difficult because many materials share overlapping visual traits and because phenotype expression can be confounded by handling, ageing and heterogeneous acquisition conditions [10,11,12,13]. Traditional identification based on morphology or visual inspection is rapid but often ambiguous under subtle phenotypic differences, whereas genotyping is decisive yet rarely suited to routine, high-throughput screening in breeding stations, community seed banks or processing lines [7,14,15]. Recent morphological and molecular studies of common bean germplasm have continued to report substantial diversity while also revealing how resource-intensive it can be to resolve that diversity with conventional descriptors alone [4,16], motivating artificial intelligence, machine learning and deep learning strategies that are both rigorous and scalable [4,16,17,18]. Similarly, seed-focused NIR/HSI (Near-Infrared/Hyperspectral Imaging) studies on aged or viable seed lots have shown that subtle phenotypic drift and handling history can reduce separability unless acquisition and validation are tightly standardised [19,20].
Hyperspectral imaging (HSI) offers a non-destructive route to close this gap because reflectance encodes pigment and surface structures in the visible range and extends into the NIR–SWIR (Near-Infrared–Short-Wave Infrared) region, where absorption is related to macromolecular composition in several species [13,21,22,23,24]. However, recent reviews emphasise persistent barriers: hyperspectral datasets are high-dimensional and strongly redundant, labelled samples are often limited, and predictive performance can degrade under shifts in sensors and acquisition conditions [13,16,23,24,25]. The wider imaging spectroscopy landscape reinforces this need for wavelength-stable, transferable reasoning across platforms, as illustrated by early EnMAP (Environmental Mapping and Analysis Program) results [26]. In parallel, agricultural machine learning has moved beyond accuracy-first reporting; work in artificial intelligence in agriculture and recent explainable-AI reviews argue that interpretability, wavelength attribution and deployability constraints are prerequisites for trustworthy, decision-relevant modelling [17,27,28,29]. The seed imaging literature is also expanding across legumes and cereals, including soybean spectral unmixing and variety identification pipelines in common bean, maize, grapevine, rice, tobacco, tomato and other crops, where reported gains remain strongly protocol dependent [30,31,32,33,34,35,36]. Recent seed variety identification studies have therefore explored 1D CNN/ResNet-style architectures and related deep learning pipelines for hyperspectral signatures, often reporting strong performance under controlled acquisition settings. However, validation protocols vary widely, motivating the present study’s fixed seed-level split and cross-family benchmarking under identical evaluation constraints.
Despite rapid progress in HSI-based seed authentication, practical gaps remain for germplasm-scale use: model comparisons are often difficult to interpret when validation protocols differ, and accuracy-first reporting rarely connects discrimination to wavelength-level attribution and computational feasibility [17,27,28,29]. A traceable benchmark that couples consistent seed-level validation with interpretable spectral attribution is therefore valuable for both scientific understanding and deployable sensing. This position is consistent with recent surveys of hyperspectral deep learning, transformers in remote sensing, dimensionality reduction, band selection strategies and resolution/SNR (Signal-to-Noise Ratio) effects, all of which argue that architecture choice, feature compression and sensor design must be treated as coupled decisions rather than as isolated modelling steps [37,38,39,40].
Accordingly, this work contributes the following: (i) a traceable VIS–NIR–SWIR seed-level benchmark for 32 grain–legume accessions (n = 3200; 563 analysed bands spanning 449.54–2399.17 nm); (ii) a fixed stratified 70:30 seed-level training:test partition, with the same independent test set reused across spectral representations and model families; (iii) an interpretable analysis that links low-dimensional variance structure (PCA) and wavelength attribution (one-versus-rest association mapping, ReliefF, permutation importance and SHAP; SHapley Additive exPlanations) to classification outcomes; and (iv) computational diagnostics (training time, throughput and model size) to support deployability decisions. The present study is intentionally positioned as a controlled benchmark, and robustness to non-ideal acquisition conditions remains to be quantified.
Here, we present a traceable VIS–NIR–SWIR hyperspectral benchmark of thirty-two grain–legume accessions spanning phenotypes relevant to Brazilian agrobiodiversity, including seed lots that are visually distinct and seed lots that are visually confusable. We quantify how reliably ROI-level (Region of Interest) reflectance signatures support accession identification and benchmark classical machine learning and deep learning algorithms under a consistent, stratified seed-level 70:30 split. PCA is used explicitly as a mechanistic bridge between sensor space and interpretation, linking dominant modes of spectral variance to coherent wavelength regions and testing whether those regions align with the spectral windows that drive predictive discrimination.
We hypothesise that accession separability is predominantly low-dimensional and concentrated within contiguous spectral windows. This would enable band-reduced representations compatible with lower-cost sensors. A further hypothesis is that, in the moderate-data regime typical of curated germplasm screening, relatively simple one-dimensional deep models are more reliable than more elaborate sequence architectures because their inductive biases better match the signal-to-sample ratio. We test these hypotheses by combining wavelength-resolved association analysis and supervised band prioritisation with systematic benchmarking of classical and deep learners, alongside explicit computational profiling, to translate statistical performance into practical feasibility for conservation, traceability and scalable agricultural sensing. Any translational implication for lower-cost sensing is therefore treated here as provisional and subject to explicit external validation.
3. Discussion
3.1. What Do the Spectra Say About Accession Identity?
Across the analysed 449.54–2399.17 nm range, the accession-level reflectance signatures were internally coherent within classes, but separability was strongly wavelength dependent. The strongest one-versus-rest signal occurred at 461.37 nm (R² = 0.775), and ReliefF concentrated the exploratory reduced-band signal within two visible windows: 449.54–456.30 nm and 577.02–597.54 nm. This pattern is mechanistically consistent with the fact that seed coat pigmentation, local scattering and testa surface properties dominate shortwave reflectance contrasts among visually distinct landraces.
A practical corollary is that, in this benchmark, accession identity is expressed primarily through testa-level optical phenotypes rather than through the longer-wave absorptions typically emphasised for bulk chemical prediction. The NIR–SWIR region still contributes materially to full-spectrum classification, but its role appears complementary rather than dominant for accession identity under the controlled acquisition conditions used here.
Importantly, five of the 16 full-dataset ReliefF bands were edge bands. When these were removed from the trimmed analysis, the alternative 16-band configuration shifted to 575.31–600.96 nm while retaining 11 wavelengths in common with the original set. The visible identity signal is therefore robust, but the exact reduced-band specification should be interpreted as exploratory rather than definitive.
3.2. PCA as a Mechanistic Bridge, Not a Compression Trick
PCA was used here as an interpretation tool rather than a compression step for model fitting. The first three components captured 97.42% of the total variance across the 563-band spectra, indicating that the dominant spectral organisation is low-dimensional even though the raw signatures are high-dimensional. The broad, contiguous loading structures imply latent drivers such as global albedo, pigment-related visible absorption and longer-wave contrast, rather than isolated single-band artefacts. In score space, accessions occupied shifted but partially overlapping regions, which is exactly the geometry expected to generate concentrated neighbour-to-neighbour errors rather than uniform confusion.
The score-space geometry adds a second mechanistic layer. The ROI-level spectra formed a continuous manifold with accession-specific shifts and partial overlap, implying heterogeneous separability [13,16,23,25]. For example, some accessions occupy extremes of this low-dimensional organisation, whereas many are concentrated in a dense central region. This predicts structured errors downstream: misclassification should concentrate among “spectral neighbours” rather than diffuse uniformly across labels because some accessions genuinely occupy overlapping regions of optical phenotype space under the acquisition conditions used. The same logic helps explain Figure S1: a high pairwise Pearson correlation among accession means, even for some visually distinct materials, does not preclude accurate classification because the correlation summarises similarity only two profiles at a time, whereas the trained models use multivariate decision boundaries defined by the joint behaviour of hundreds of wavelengths.
Because PCA is unsupervised, large absolute loadings indicate wavelength regions that drive overall variance, not necessarily class discrimination; therefore, we complement PCA loadings with supervised/association-based wavelength attribution (one-versus-rest R² mapping and ReliefF).
3.3. Band Relevance, Sensor Design, and Meaning of “Compact” Discrimination
Reduced-band analysis shows both the promise and the limitation of compact sensor design. ReliefF importance was concentrated in the visible range, and both global permutation importance and SHAP attribution peaked at approximately 577–598 nm, with 597.54 nm emerging as the strongest wavelength in both the permutation analysis and the mean absolute SHAP ranking. These windows are therefore strong candidates for future multispectral prototypes. This interpretation is consistent with recent wavelength selection studies showing that compact panels can be informative when selection is supervised and redundancy is handled explicitly, although the stability of the final panel remains criterion- and context-dependent [41,42,43,44].
However, compact is not equivalent to complete. In the original ReliefF-selected 16-band benchmark, the best model reached 68.25% macro-F1 and 68.54% balanced accuracy, but five of those bands were at the spectral edges. After edge-band trimming, the alternative 16-band configuration decreased to 60.67% and 60.94%, whereas the full-spectrum benchmark reached 96.35% and 96.35% for the same metrics, and the 555-band edge-trimmed representation remained essentially unchanged (96.13%/96.15%). In other words, the selected visible windows carry real identity signals, but full spectral context is still needed to resolve the most confusable accessions under controlled laboratory conditions. Global SHAP attribution and the model-specific permutation/SHAP profiles that support this finding are reported in Figures S6 and S12–S19.
3.4. Why Do Linear Models Reach Near-Ceiling Accuracy, and What Residual Confusion Is Revealed?
Within the classical suite, the full-spectrum benchmark was dominated by linear discriminant analysis and linear SVM, indicating that much of the accession structure becomes close to linearly separable once the full 563-band reflectance profile is retained. Across the reduced-band benchmarks, the subspace discriminant was the strongest classifier, but the performance fell well below the full-spectrum setting—especially after edge-band trimming. This contrast suggests that a broad spectral context contributes more to class boundary stability than a compact top-k ranking alone can preserve.
The resulting performance level is comparable to the near-ceiling accuracies reported for seed and grain accession identification under tightly controlled settings, particularly when the number of classes is moderate and the acquisition conditions are stable [21,34,45,46,47]. Here, the substantive contribution is therefore not a single headline score but a traceable benchmark that links wavelength attribution (association maps and ReliefF), low-dimensional structure (PCA loadings and score geometry) and the residual error structure across model families. This provides quantitative guidance for reduced-band sensor design and for full-spectrum classifier development under controlled conditions. Comparable seed-classification performance has also been reported for rice, soybean and aged-seed workflows, although the exact architecture ranking still shifts with crop, wavelength range and acquisition protocol [20,30,34,35,48,49,50].
3.5. Deep Learning: Why Does the Simplest Architecture Win?
The deep learning comparison points in the same direction. The best architecture was not the most complex, but MLP_Wide, which reached 84.90% accuracy and 84.47% macro-F1 on the independent test set. CNN1D_Deep and TCN1D formed a secondary tier, whereas the recurrent and attention-based models underperformed more markedly. Under a moderate-data regime with smooth one-dimensional signatures, simpler feed-forward inductive biases appear better matched to the signal-to-sample ratio than heavier sequence models do. Even so, the best deep model clearly remained below the strongest full-spectrum classical learners.
This finding is also useful from translational and benchmarking perspectives [27,29,51]. For example, our second hypothesis suggests that if a compact architecture already performs well, it provides a robust baseline for future deployment-oriented studies and for subsequent interpretability analyses, aligning with the broader push in agricultural AI for models that are not only accurate but also stable and auditable. This is consistent with the use of hyperspectral imaging sensors with narrow spectral bandwidths that capture fine material nuances [13,52,53]. In this respect, a multispectral sensor with narrow, deliberately placed bands spanning a wide spectral range may ultimately be preferable to full hyperspectral cameras when throughput, cost and robustness dominate, although this engineering choice still requires explicit external validation. The equivalence between accuracy and macro-averaged recall suggests that the dataset was well balanced and that the models exhibited a symmetric error pattern across classes, with similar true-positive rates for each class. As a result, the overall proportion of correct predictions mirrored the average recall, leading both metrics to converge (Table 1 and Table 2).
3.6. Deployability: Accuracy Is Not the Only Axis That Matters
Coupling predictive performance to throughput, training time and model size remains essential for deployment because models with similar scores can differ by orders of magnitude in computational cost. In the reduced-band benchmark, throughput ranged from 1.55 × 10⁴ to 9.90 × 10⁶ observations s⁻¹, training time ranged from 9.44 × 10⁻⁴ to 2.37 s, and model size ranged from 4.39 kB to 8.31 MB. These contrasts matter for embedded systems, conveyor sorting and real-time inspection.
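These three diagnostics can be measured in a few lines; the sketch below is illustrative only, using a stand-in classifier and synthetic data rather than the study's actual models, with serialised pickle size as a proxy for model size:

```python
import pickle
import time

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for a reduced-band dataset (1000 spectra, 16 bands).
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 4, size=1000)

clf = LinearDiscriminantAnalysis()

# Training time.
t0 = time.perf_counter()
clf.fit(X, y)
train_time_s = time.perf_counter() - t0

# Inference throughput in observations per second.
t0 = time.perf_counter()
clf.predict(X)
throughput_obs_s = len(X) / (time.perf_counter() - t0)

# Serialised model size in kB (pickle used here as a simple proxy).
model_size_kb = len(pickle.dumps(clf)) / 1024
```

Reporting all three axes alongside accuracy makes the cost trade-offs between model families explicit.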
The recent hyperspectral learning literature shows that deep architectures can perform strongly, but their gains depend on sample size, domain shift and the geometry of the signal itself. Our benchmark fits a regime in which simpler models remain highly competitive: the spectra are smooth, the low-dimensional structure is strong, and acquisition conditions are tightly controlled [54]. Transfer across sessions, instruments and illumination settings therefore remains a central unresolved barrier, and in-lab performance should not be extrapolated to operational deployment without explicit domain-shift validation. This unresolved transfer problem is reflected in the rapid growth of hyperspectral domain-adaptation and cross-scene transfer methods, which are increasingly discussed alongside explainable-AI constraints for operational remote sensing [54,55,56,57,58].
Taken together, the results support a mechanistic and deployment-aware view of hyperspectral seed analytics. Accession identity is recoverable with high fidelity from the full reflectance profile, but the strongest reduced-band windows sit in the visible range and therefore provide actionable guidance for future multispectral designs. The combination of PCA, one-versus-rest association mapping, ReliefF, permutation importance and SHAP helps explain both why performance is high in the full-spectrum setting and why aggressive compression degrades the classifier ranking.
Although NIR–SWIR reflectance encodes absorptions associated with major chemical bonds (e.g., O–H, C–H and N–H overtones/combination bands), relatively few top-ranked wavelengths were assigned to the SWIR region in the present attribution analyses [29,45,55,56]. This does not imply that the SWIR is irrelevant; rather, under the controlled laboratory conditions used here, accession discrimination appears to be dominated by seed coat colour, albedo and surface optical phenotype, whereas the SWIR contributes more through complementary full-spectrum context than through a compact set of individually dominant bands. Because we did not measure biochemical composition in this classification benchmark, we cannot determine whether the limited SWIR attribution reflects genuinely weaker compositional contrasts among accessions or contrasts that were simply secondary to colour-driven discrimination in this dataset [45]. We therefore avoided assigning specific absorptions to specific constituents [57,58]. Composition-focused modelling is treated separately and remains a priority for future multilot validation.
From an implementation perspective, the proposed workflow should be read as a deployment-oriented hypothesis rather than as a validated operational seed-sorting solution [45,56]. In principle, a line-scan (push-broom) hyperspectral sensor—or a reduced-band multispectral design guided by the highlighted visible windows—could be mounted above a conveyor belt, with automated seed segmentation generating per-seed masks/ROIs and extracting one spectrum per seed on the fly. However, any such translation would first require automated segmentation, illumination control, cross-session testing and explicit domain-shift validation before compact models such as linear discriminant analysis could be considered reliable outside the present controlled laboratory setting [56,57,58].
Preprocessing sensitivity analysis revealed that Savitzky–Golay smoothing with clipping was a stable default and that optional baseline correction or SNV influenced interpretability more strongly than it influenced the identity of the strongest full-spectrum classifiers. Nevertheless, preprocessing must be fixed before external validation so that performance differences are not conflated with analytical drift [56,59].
Although these results demonstrate strong discrimination under controlled conditions, performance under non-ideal acquisition (illumination shifts, partial occlusion/stacking and surface contamination) was not quantified here and remains a priority for future deployment-oriented validation.
3.7. Limitations and Deployment-Oriented Validation
This study was designed as a controlled, traceable benchmark to establish an internal validity baseline for accession discrimination, wavelength attribution and fair model comparison. Several deployment-relevant factors were not quantified, including illumination variability, seed stacking or partial occlusion, surface contamination, cross-session drift and cross-instrument transfer [59,60]. Future validation should therefore include deliberate stress tests together with explicit domain-shift experiments.
ROI spectra were extracted from manually delineated regions following a fixed rule-set to reduce subjectivity. Nevertheless, robust operational deployment requires automated segmentation and ROI extraction [45,61,62,63]. A practical next step is to integrate a lightweight thresholding- or contour-based masker or a trained detector/segmenter and to verify that automated ROIs preserve both spectral fidelity and classification performance relative to manual ROIs under the same seed-level split protocol.
In addition, variability across production lots (e.g., moisture content, storage age, and associated changes in seed coat glossiness/roughness) may shift reflectance signatures and should be explicitly quantified in future multilot validation to assess model transferability. The prediction of physical and chemical seed properties from spectral data is outside the scope of the present classification benchmark and is addressed in a companion study currently under review. Seed lot history, including storage and pretreatment, can also modify physical or biochemical properties relevant to downstream sensing workflows and should therefore be controlled explicitly during external validation [19,20,64,65,66].
Finally, the reduced-band ranking was generated on the full preprocessed dataset and subsequently used to define exploratory sensitivity analyses. The resulting 16-band benchmark is therefore informative for wavelength attribution and sensor design, but it should not be interpreted as a fully nested, leakage-free feature selection experiment. Claims about field-scale robustness or definitive multispectral sensor specifications should remain provisional until external domain-shift validation is completed.
4. Materials and Methods
4.1. Plant Material, Genetic Background and Experimental Design
Thirty-two grain–legume accessions were analysed, comprising thirty traditional common bean (Phaseolus vulgaris L.) landrace cultivars and two non-Phaseolus grain–legumes that are locally classified as ‘Feijões’ (A27: Vigna angularis (Willd.) Ohwi & Ohashi; A28: Cajanus cajan (L.) Huth; Table S1). All materials were sourced from a maintained germplasm bank, with curated accession identities and documented provenances. The use of curated germplasm accessions provides a realistic test of authentication at the interface of conservation, breeding and seed quality control.
The selected landraces represent a broad spectrum of traditional bean types in Brazil, encompassing substantial phenotypic, biochemical and genetic diversity preserved through long-term farmer selection and regional adaptation processes. The collection comprised the following cultivars: (1) Feijão Amendoim Roxo, (2) Feijão Carioca, (3) Feijão Preto Gigante, (4) Feijão Branco Manteiga, (5) Feijão Rosinha, (6) Feijão Branco Gigante, (7) Feijão Canário, (8) Feijão Preto Mocotó, (9) Feijão Verde, (10) Feijão Vermelho Bolinha, (11) Feijão Branco Bolinha, (12) Feijão Mouro Gigante, (13) Feijão Roxinho, (14) Feijão Creme, (15) Feijão Vermelho Gigante, (16) Feijão Jalo, (17) Feijão Amendoim Bege, (18) Feijão Boreal Rosa, (19) Feijão Chocolate, (20) Feijão Amendoim Rosa, (21) Feijão Olho de Pombo, (22) Feijão Bico de Ouro, (23) Feijão Amendoim, (24) Feijão Ovo de Tito-Tico, (25) Feijão Amendoim Verde, (26) Feijão Rajado, (27) Feijão Azaki, (28) Feijão Guandu, (29) Feijão Mouro, (30) Feijão Mancá, (31) Feijão Vermelho and (32) Feijão Rapa Cuiua (Table S1).
Accession information and prior characterisation records from the germplasm bank indicate that the collection spans substantial phenotypic diversity across seed coat colour, patterning and grain morphology, supporting its suitability as a challenging benchmark for spectral discrimination under controlled laboratory conditions.
4.2. Hyperspectral Image Acquisition and Spectral Consistency Analysis
Hyperspectral image acquisition was conducted with an Aisa-FENIX hyperspectral sensor imaging system (Specim, Oulu, Northern Ostrobothnia, Finland) mounted on a laboratory push-broom scanning platform. The instrument provides nominal VIS–NIR–SWIR coverage through the VNIR and SWIR channels; after calibration and export in the present workflow, the analysed wavelength axis comprised 563 bands spanning 449.54–2399.17 nm. The acquisition line rate was 12 frames s⁻¹.
Three scanning sessions were performed, each comprising multiple Petri dishes, yielding a total of n = 3200 seeds. For each scan, approximately 50 seeds were arranged in a single layer on a stable, non-reflective background to minimise overlap and facilitate manual region-of-interest delineation. The same dishes used for complementary point-based spectroradiometry were scanned when applicable, allowing qualitative comparison between point spectra and image-derived reflectance curves.
Image acquisition was performed at a fixed sensor–sample distance of 85 cm to standardise spatial sampling and minimise bidirectional reflectance distribution function (BRDF) effects. Measurements were conducted in a controlled laboratory environment, with reflectance referenced to a Spectralon® diffuse reflectance standard (Labsphere Inc., North Sutton, NH, USA), characterised by near-Lambertian behaviour across the VIS–NIR–SWIR range, supporting robust and reproducible calibration. Acquisition was performed under controlled laboratory conditions; twelve 20 W halogen lamps were evenly placed around the scan window to provide uniform illumination, the ambient temperature was maintained at 25 °C, and external light interference was eliminated. The camera’s internal cooling was activated, and the camera was allowed to stabilise for 30 min before the acquisition of spectral data. Radiometric referencing and dark current correction followed standard procedures, and the reflectance was computed relative to a calibrated white reference.
4.2.1. Radiometric Calibration and Reflectance Conversion
Radiometric calibration was carried out via CaliGeoPRO® software, version 2.2 (Specim, Oulu, Northern Ostrobothnia, Finland). The calibration workflow incorporated raw hyperspectral data files generated by the Aisa-FENIX sensor (.raw), together with sensor-specific radiometric calibration files (.cal and .LUT) and dark reference headers (.hdr). This procedure corrected for sensor noise, dark current and system-specific radiometric response.
The radiance-calibrated outputs were subsequently converted to reflectance units via ENVI Classic® software version 5.4 (Harris Geospatial Solutions, Boulder, CO, USA). A manufacturer-validated algorithm tailored to the Aisa-FENIX data was applied through the scan normalisation module. The radiance data were combined with dark reference measurements, and the reflectance was computed via the option “Reflectance using a subset of the image as a white reference”, with the Spectralon® panel region explicitly defined as the reference target. This process yielded reflectance-calibrated hyperspectral images (radiance_refl.dat).
4.2.2. Region of Interest Extraction and Spectral Signature Generation
After reflectance calibration, the hyperspectral data cubes were exported to ENVI Classic® v5.4 (Harris Geospatial Solutions, Boulder, CO, USA) for subsequent processing. For each seed, one region of interest (ROI) was manually defined over a central representative area while excluding marginal regions in which curvature, background mixing or specular artefacts could distort reflectance. The ROI was defined as a 10 × 10 pixel square window (100 pixels). Within this window, 10 representative interior pixels were selected after excluding obvious highlights, shadows and visible contaminants, and their mean reflectance was computed at each wavelength to yield one spectrum per seed.
In total, this workflow produced n = 3200 seed-level spectral signatures linked to accession labels A01–A32 (100 spectra per accession). ROI delineation followed a fixed rule-set to minimise subjective variability: (1) the ROI window was placed within the central seed body to avoid boundary curvature and background mixing; (2) pixels affected by specular highlights, shadows, or visible contaminants were excluded; (3) a fixed ROI window size of 10 × 10 pixels was used as a standardised sampling frame for all seeds whenever feasible; and (4) within this window, 10 representative interior pixels were selected and averaged to obtain exactly one mean spectrum per seed. A small placement sensitivity check (three alternative central placements on a subset of seeds) indicated negligible changes in the resulting mean spectra relative to between-accession differences, supporting the use of a single spectrum per seed in the main benchmark.
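The per-seed averaging rule (ten retained interior pixels reduced to exactly one mean spectrum per seed) can be sketched as follows; the helper name and the synthetic reflectance sub-cube are illustrative, not the study's actual code:

```python
import numpy as np

def roi_mean_spectrum(window, selected):
    """Mean spectrum from selected interior pixels of a 10 x 10 ROI window.

    window:   (10, 10, n_bands) reflectance sub-cube for one seed.
    selected: list of (row, col) indices for the 10 retained pixels,
              chosen after excluding highlights, shadows and contaminants.
    """
    pixels = np.stack([window[r, c, :] for r, c in selected])  # (10, n_bands)
    return pixels.mean(axis=0)                                  # (n_bands,)

# Illustrative call on a synthetic 10 x 10 x 563 sub-cube.
rng = np.random.default_rng(2)
cube = rng.uniform(0.0, 1.0, size=(10, 10, 563))
sel = [(r, c) for r in range(3, 8) for c in (4, 5)]  # 10 interior pixels
spectrum = roi_mean_spectrum(cube, sel)
```

Averaging interior pixels in this way damps pixel-level noise while keeping one label-linked signature per seed.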
The analysed wavelength axis used throughout this manuscript comprised 563 bands spanning 449.54–2399.17 nm. In an additional quality control analysis, spectral-edge trimming removed eight bands within 10 nm of the lower and upper limits, yielding a 555-band representation spanning 459.68–2388.25 nm. All primary full-spectrum analyses reported here use the 563-band representation unless stated otherwise.
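The edge-trimming quality control reduces to a simple wavelength mask. The sketch below uses a synthetic uniformly spaced axis, whereas the instrument's real band spacing is non-uniform, so the exact number of trimmed bands differs from the eight reported:

```python
import numpy as np

# Hypothetical uniform wavelength axis mimicking the 563 analysed bands.
wl = np.linspace(449.54, 2399.17, 563)

# Keep only bands more than 10 nm away from the lower and upper limits.
keep = (wl >= wl.min() + 10.0) & (wl <= wl.max() - 10.0)
wl_trimmed = wl[keep]
```

The same boolean mask would be applied to the columns of the feature matrix to obtain the edge-trimmed representation.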
4.3. Organisation of the Spectral Dataset and Python Implementation
All post-extraction processing was carried out using Python version 3.13.9 (Python Software Foundation, Wilmington, DE, USA) in a single, fully scripted pipeline to maximise transparency and reproducibility. The calibrated reflectance measurements were arranged into a feature matrix X ∈ ℝ^(3200 × 563), where rows correspond to seed-level ROI spectra and columns correspond to wavelengths. Each spectrum was paired with an accession identifier (A01–A32), and wavelength positions were stored in a companion vector λ expressed in nm.
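This matrix organisation, together with the fixed stratified 70:30 seed-level partition used throughout the benchmark, can be sketched as follows (synthetic reflectance values stand in for the real spectra, and the random seed 42 is an illustrative choice, not necessarily the one used in the study):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the seed-level dataset: 3200 spectra x 563 bands,
# with 100 spectra per accession labelled A01-A32.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(3200, 563))
y = np.repeat([f"A{i:02d}" for i in range(1, 33)], 100)

# Fixed stratified 70:30 seed-level split; the identical held-out test set
# can then be reused across spectral representations and model families.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```

Stratification guarantees that every accession contributes exactly 70 training and 30 test spectra, which keeps macro-averaged metrics comparable across models.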
4.4. Spectral Preprocessing
Each extracted ROI was represented by a single reflectance vector comprising P = 563 bands spanning 449.54–2399.17 nm. Accessions were indexed via codes A01–A32, corresponding directly to the cultivar numbering defined in Section 4.1 and reported in Table S1.
Spectral preprocessing followed the scripted workflow used to generate the reported outputs. Polynomial baseline correction (degree 2) and standard normal variate (SNV) normalisation were implemented but disabled in the primary analyses. The main reported workflow used Savitzky–Golay smoothing (window length = 11, polynomial order = 2) followed by clipping to [0, 1] to retain physically plausible reflectance values. The smoothing step follows the original Savitzky–Golay formulation [67].
When the SNV was used for sensitivity analysis, each spectrum was centred on its own mean and scaled by its own standard deviation so that the relative shape rather than absolute reflectance dominated the transformed signal. Because SNV changes the physical scale of reflectance, it was treated as a sensitivity analysis option rather than as the default representation used for the main benchmark.
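The smoothing, clipping and optional SNV steps can be sketched as follows; `scipy.signal.savgol_filter` applies the Savitzky–Golay filter with the stated window and order, and the input spectra are synthetic stand-ins:

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(X, window=11, polyorder=2):
    """Savitzky-Golay smoothing along the wavelength axis, then clip to [0, 1]."""
    X_smooth = savgol_filter(X, window_length=window, polyorder=polyorder, axis=1)
    return np.clip(X_smooth, 0.0, 1.0)

def snv(X):
    """Standard normal variate: centre and scale each spectrum by its own statistics."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 563))  # synthetic stand-in spectra
X_main = preprocess(X)                     # representation used in the main workflow
X_snv = snv(X_main)                        # sensitivity-analysis representation
```

Because SNV rescales each spectrum individually, `X_snv` is kept as a separate representation rather than overwriting the main one, mirroring its role as a sensitivity option.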
4.5. Descriptive and Integrative Spectral Analyses
Exploratory spectral analyses included accession-wise mean spectra, PERMANOVA on the Euclidean distance matrix of the preprocessed spectra (999 permutations), one-versus-rest Pearson correlation profiles across wavelengths, coefficient-of-determination mapping (R² = r²), and hierarchical clustering of accession-mean spectra using Euclidean distance. These analyses were used to characterise the global variance structure, local wavelength associations and between-accession spectral neighbourhoods.
For one-versus-rest association mapping, the accession labels were encoded as binary indicators (1 = focal accession; 0 = all remaining accessions); thus, the resulting wavelength-wise correlations quantified the class-specific association strength without imposing any ordinal structure on the accession IDs.
For presentation, correlation maps retained the sign of r, whereas attribution strength was summarised as the coefficient of determination (R²). This dual representation allowed the direction of association and its magnitude to be visualised separately across the spectrum. Hierarchical clustering was applied to the accession-mean spectra using Euclidean distance and average linkage to describe spectral similarity at the accession level. Representative seed images were then arranged in dendrogram leaf order to facilitate visual comparison between spectral neighbourhoods and seed coat phenotypes.
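The accession-level clustering and leaf ordering can be sketched with SciPy's hierarchical clustering utilities (the accession-mean matrix here is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
accession_means = rng.uniform(0.0, 1.0, size=(32, 563))  # synthetic accession-mean spectra

# Euclidean distance with average linkage, as described in the text.
Z = linkage(pdist(accession_means, metric="euclidean"), method="average")

# Dendrogram leaf order used to arrange representative seed images.
order = leaves_list(Z)
```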
4.6. Principal Component Analysis
Prior to PCA, each wavelength band was standardised to zero mean and unit variance across seeds so that broad-intensity regions did not dominate the decomposition.
PCA was then applied to the standardised matrix to obtain score and loading representations for both individual spectra and accession centroids.
The explained variance ratios were exported for the first 20 components, and both individual-spectrum and cultivar-centroid score plots were generated. Loading vectors (PC1–3) were analysed to identify the wavelength regions contributing most strongly to principal component separation.
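The standardisation-plus-PCA step can be sketched via a singular value decomposition, which is numerically equivalent to a library PCA applied to the standardised matrix (synthetic data for illustration):

```python
import numpy as np

def pca_standardised(X, n_components=20):
    """Standardise each band to zero mean / unit variance, then PCA via SVD."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # per-spectrum score coordinates
    loadings = Vt[:n_components]                      # wavelength loading vectors
    evr = (S**2 / (S**2).sum())[:n_components]        # explained variance ratios
    return scores, loadings, evr

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 563))  # synthetic stand-in for the spectral matrix
scores, loadings, evr = pca_standardised(X)
```

Accession-centroid score plots would then simply average the `scores` rows within each accession.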
To assess whether the spectral profiles differed among the accessions across the full wavelength space, we performed permutational multivariate analysis of variance (PERMANOVA) on the Euclidean distance matrix computed from the preprocessed ROI spectra, with accession as the grouping factor and 999 permutations. The resulting pseudo-F statistic and permutation p-value are reported alongside the accession-wise hyperspectral curves.
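A compact sketch of a one-way PERMANOVA on Euclidean distances, using the standard pseudo-F construction from within- and between-group sums of squared distances (function names and the two-group toy example are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def permanova(X, labels, n_perm=999, seed=42):
    """One-way PERMANOVA pseudo-F on Euclidean distances with a permutation p-value."""
    D2 = squareform(pdist(X, metric="euclidean")) ** 2
    labels = np.asarray(labels)
    N, groups = len(labels), np.unique(labels)
    g = len(groups)

    def pseudo_f(lab):
        ss_total = D2[np.triu_indices(N, 1)].sum() / N
        ss_within = 0.0
        for grp in groups:
            idx = np.where(lab == grp)[0]
            sub = D2[np.ix_(idx, idx)]
            ss_within += sub[np.triu_indices(len(idx), 1)].sum() / len(idx)
        ss_between = ss_total - ss_within
        return (ss_between / (g - 1)) / (ss_within / (N - g))

    f_obs = pseudo_f(labels)
    rng = np.random.default_rng(seed)
    n_ge = sum(pseudo_f(rng.permutation(labels)) >= f_obs for _ in range(n_perm))
    return f_obs, (n_ge + 1) / (n_perm + 1)

# Toy example: two clearly separated groups should yield a large F and a small p.
rng = np.random.default_rng(3)
X_toy = np.vstack([rng.normal(0.0, 1.0, (12, 20)), rng.normal(3.0, 1.0, (12, 20))])
labels_toy = np.array(["A"] * 12 + ["B"] * 12)
f_obs, p_value = permanova(X_toy, labels_toy, n_perm=199)
```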
4.7. Feature Ranking and Selection via Relief
Feature ranking was used to quantify which wavelengths contributed most to the separability of the accessions and to define an exploratory reduced-band representation for sensitivity analysis and future sensor design considerations.
Per-band ReliefF weights were updated according to

$$ w_j \;\leftarrow\; w_j \;-\; \frac{1}{mk}\sum_{h \in H_i} \left| x_{ij} - x_{hj} \right| \;+\; \frac{1}{mk}\sum_{r \in M_i} \left| x_{ij} - x_{rj} \right|, $$

where $w_j$ is the ReliefF weight for wavelength band $j$; $m$ is the number of sampled reference spectra (iterations); $k$ is the number of nearest neighbours; $H_i$ is the set of $k$ nearest hits (same accession as spectrum $i$); $M_i$ is the set of $k$ nearest misses (different accessions); and $x_{sj}$ denotes the reflectance of spectrum $s$ at band $j$ ($s = i, h, r$).
An approximate ReliefF implementation was applied to the preprocessed spectra. In each iteration, a sampled spectrum was compared with its nearest hit and nearest miss, and per-band weights were updated by penalising within-class absolute differences and rewarding between-class separation. The resulting rank order was used for interpretation and for the exploratory 16-band sensitivity benchmark; it was not used as a nested feature selection step inside each training fold.
4.7.1. ReliefF Implementation Details and Hyperparameters
ReliefF weighting was implemented in Python, using NumPy version 2.2.6 (NumPy Developers, Austin, TX, USA) for the weight updates and scikit-learn’s NearestNeighbors for neighbour retrieval. The spectra were internally min–max scaled to [0, 1] for the neighbour searches, and the Euclidean (L2) distance was used throughout. The neighbour-search implementation and associated utilities followed the scikit-learn framework [
67].
The neighbourhood size was 20 and 800 randomly sampled reference spectra were evaluated (seed = 42). For each reference spectrum, the nearest same-class and different-class neighbours were identified after excluding the query spectrum itself. Because the ranking served wavelength attribution and exploratory sensitivity analysis, it was computed on the full preprocessed dataset rather than on a training-only subset; the reduced-band results should therefore be interpreted as descriptive/exploratory rather than as a fully nested leakage-free feature selection benchmark.
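A NumPy-only sketch of this approximate Relief update, with a brute-force neighbour search standing in for scikit-learn's NearestNeighbors (the two-band synthetic check at the end is illustrative):

```python
import numpy as np

def relief_weights(X, y, n_samples=800, seed=42):
    """Approximate Relief: reward the nearest miss, penalise the nearest hit, per band."""
    rng = np.random.default_rng(seed)
    # Min-max scale each band to [0, 1] so absolute differences are comparable.
    Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    n = len(Xs)
    w = np.zeros(Xs.shape[1])
    for i in rng.choice(n, size=min(n_samples, n), replace=False):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                                   # exclude the query spectrum
        same = (y == y[i])
        hit = np.where(same, d, np.inf).argmin()        # nearest same-class spectrum
        miss = np.where(~same, d, np.inf).argmin()      # nearest different-class spectrum
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / min(n_samples, n)

# Synthetic check: band 0 separates the classes, band 1 is pure noise.
rng = np.random.default_rng(7)
y_toy = np.array([0] * 50 + [1] * 50)
X_toy = np.column_stack([y_toy + rng.normal(0, 0.05, 100), rng.normal(0, 1, 100)])
w = relief_weights(X_toy, y_toy, n_samples=100)
```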
4.8. Classification Modelling Framework
Accession identification was framed as a 32-class classification problem. For the classical benchmark and spectral representation sensitivity analyses, a fixed stratified seed-level partition was used: 70 seeds per accession (n = 2240) formed the development subset, and 30 seeds per accession (n = 960) formed the fully independent test subset. Within the classical benchmark, the development subset was divided evenly into training and validation subsets (35 seeds per accession each; 35:35:30 overall), and the same split indices were reused across all the classical models and spectral representations. Each seed contributed a single ROI-averaged spectrum, preventing any pixel- or seed-level overlap between the development and independent test data.
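The fixed stratified seed-level partition can be sketched as follows (the helper name is illustrative; a fixed seed keeps the split indices reusable across models):

```python
import numpy as np

def fixed_split(y, n_dev=70, n_test=30, seed=42):
    """Stratified split: per accession, n_dev development and n_test test indices."""
    rng = np.random.default_rng(seed)
    dev, test = [], []
    for acc in np.unique(y):
        idx = rng.permutation(np.where(y == acc)[0])
        dev.extend(idx[:n_dev])
        test.extend(idx[n_dev:n_dev + n_test])
    return np.array(dev), np.array(test)

# 32 accessions x 100 seeds, as in the text.
y = np.repeat([f"A{i:02d}" for i in range(1, 33)], 100)
dev_idx, test_idx = fixed_split(y)   # 2240 development / 960 test indices, disjoint
```

Within the classical benchmark, `dev_idx` would be further halved per accession (35 training / 35 validation seeds) with the same reuse-the-indices logic.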
The predicted labels on the independent test set were used to compute the overall accuracy and error rate as

$$ \text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left(\hat{y}_i = y_i\right), \qquad \text{Error rate} = 1 - \text{Accuracy}, $$

where $N$ is the number of test spectra, $y_i$ the true accession label and $\hat{y}_i$ the predicted label.
Confusion matrices were computed on the independent test set and optionally row-normalised for visual comparison of the per-class error structure. The precision, recall and F1 score were summarised as macro-, micro- and support-weighted averages. Because the independent test set was perfectly balanced (30 seeds per accession), the weighted and macro summaries were numerically very similar. Training time, prediction throughput, total misclassification cost and serialised model size were also recorded to characterise computational feasibility.
For the baseline benchmark, we additionally report a composite coefficient defined as the arithmetic mean of the three support-weighted metrics (precision, recall and F1 score): C = (precision + recall + F1 score)/3.
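The composite coefficient can be computed directly from scikit-learn's support-weighted summaries (a tiny illustrative label set is used below):

```python
from sklearn.metrics import precision_recall_fscore_support

def composite_coefficient(y_true, y_pred):
    """C = arithmetic mean of support-weighted precision, recall and F1 score."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return (p + r + f1) / 3.0

# Perfect predictions give C = 1.0 by construction.
y_true = ["A01", "A02", "A03", "A01"]
y_pred = ["A01", "A02", "A03", "A01"]
c = composite_coefficient(y_true, y_pred)
```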
4.9. Model Optimisation Curves and Probabilistic Performance Summaries
To replicate optimisation-style diagnostics, two model families were further analysed via simplified grid evaluations using only the training data and an internal validation split. For k-nearest neighbours, the number of neighbours was varied over a finite range constrained by the dataset size, and the validation classification error was recorded to identify an appropriate k. For multilayer perceptron classification, a discrete set of hidden-layer configurations was evaluated under the same protocol to identify a performant architecture under fixed optimisation settings. The independent test set was not used for model selection.
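The k-nearest-neighbours grid evaluation can be sketched as follows; the helper and toy data are illustrative, and `KNeighborsClassifier` stands in for the model family, with the validation error recorded per k and the test set untouched:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_validation_errors(X_tr, y_tr, X_val, y_val, k_values):
    """Validation classification error for each candidate neighbourhood size k."""
    errors = {}
    for k in k_values:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        errors[k] = 1.0 - clf.score(X_val, y_val)
    return errors

# Toy two-cluster data: well separated, so a good k should reach zero error.
rng = np.random.default_rng(5)
X_tr = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(1, 0.1, (30, 2))])
y_tr = np.array([0] * 30 + [1] * 30)
X_val = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(1, 0.1, (10, 2))])
y_val = np.array([0] * 10 + [1] * 10)
errors = knn_validation_errors(X_tr, y_tr, X_val, y_val, k_values=[1, 3, 5])
best_k = min(errors, key=errors.get)  # k selected on validation error alone
```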
When classifiers provided probabilistic outputs via predict_proba or continuous decision scores via a decision function, score matrices on the independent test set were used to compute microaveraged receiver operating characteristic (ROC) and precision–recall (PR) curves by binarising class labels in a one-versus-rest manner and pooling across all classes. In addition, micro- and macroaveraged ROC-AUC and average precision summaries were retained for tabular reporting. For binary scoring of a positive class, the ROC coordinates were defined as

$$ \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}, $$

where TP, FP, TN and FN denote the true-positive, false-positive, true-negative and false-negative counts at a given decision threshold.
The area under the ROC curve (AUC) and average precision (AP) were computed as scalar summaries of the ranking quality under both the pooled microaverage and class-averaged macro formulations.
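The pooled micro-averaging step can be sketched with scikit-learn's `label_binarize`, `roc_auc_score` and `average_precision_score` (the near-one-hot toy scores are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.preprocessing import label_binarize

def micro_roc_pr(y_true, scores, classes):
    """Micro-average: binarise one-versus-rest, then pool labels and scores."""
    Y = label_binarize(y_true, classes=classes)
    return (roc_auc_score(Y.ravel(), scores.ravel()),
            average_precision_score(Y.ravel(), scores.ravel()))

# Toy 3-class example with perfectly ranked scores.
y_true = np.array([0, 1, 2, 0, 1, 2])
scores = np.eye(3)[y_true] * 0.9 + 0.05   # true class 0.95, others 0.05
auc_micro, ap_micro = micro_roc_pr(y_true, scores, classes=[0, 1, 2])
```

Macro summaries would instead compute the AUC and AP per class column and average the per-class values.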
4.10. Deep Learning Models on 1D Spectral Signatures
Deep learning was implemented directly on the one-dimensional reflectance vectors, treating each spectrum as a fixed-length sequence of length P = 563. The same fixed 70:30 seed-level partition was used to define the independent test set (30 seeds per accession; n = 960). Within the 70% development portion, 15% was reserved for validation during training, leaving n = 1904 spectra for parameter optimisation and n = 336 for model selection and early stopping.
The evaluated architectures included three multilayer perceptrons (MLP_1D, MLP_Wide and MLP_Deep), two one-dimensional convolutional neural networks (CNN1D and CNN1D_Deep), ResNet1D, Inception1D, TCN1D, BiLSTM1D and Transformer1D. All models were trained with the Adam optimiser (learning rate = 1 × 10⁻³), a batch size of 32 and a maximum of 25 epochs; early stopping (patience = 6) restored the weights associated with the lowest validation loss.
For a sample $x_i$ with a true class label $y_i$ and a predicted class probability $\hat{p}_{i,y_i}$ assigned to that label, the multiclass cross-entropy loss is

$$ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \hat{p}_{i,\,y_i}. $$
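A minimal NumPy check of the multiclass cross-entropy, assuming a probability matrix whose rows sum to one (array names are illustrative):

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each sample."""
    n = len(labels)
    return -np.log(probs[np.arange(n), labels]).mean()

# Uniform prediction over 4 classes contributes log(4); a confident correct
# prediction contributes 0, so the mean over both samples is log(4) / 2.
probs = np.array([[0.25, 0.25, 0.25, 0.25],
                  [1.00, 0.00, 0.00, 0.00]])
labels = np.array([0, 0])
loss = cross_entropy(probs, labels)
```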
The deep learning performance on the independent test set was summarised in terms of accuracy, macroaveraged precision, macroaveraged recall and macroaveraged F1 score, together with the number of trainable parameters as a measure of model capacity. Training and validation learning curves were retained for each architecture to document optimisation dynamics and to diagnose overfitting under the shared training budget.
4.11. Reproducibility and Figure Generation
All figures, tables and intermediate outputs reported in this work, including mean spectra plots, heatmaps, PCA scores and loadings, dendrograms, feature importance profiles, model comparison panels, confusion matrices and deep learning training curves, were generated programmatically from the same Python version 3.13.9 script pipeline used for model training and evaluation. Fixed random seeds were used across data splitting and model initialisation steps wherever supported (
Figure 2). Model evaluation and all reported metrics were computed on the independent held-out test set described above, ensuring that the reported performance reflects generalisation beyond the fitted training data [
67].
All analyses and figure generation were performed in Python version 3.13.9 (Python Software Foundation, Wilmington, DE, USA) via NumPy version 2.2.6 (NumPy Developers, Austin, TX, USA), Pandas version 2.3.3 (Pandas Development Team, Austin, TX, USA), Matplotlib version 3.10.5 (Matplotlib Developers, New York, NY, USA), Seaborn version 0.13.2 (Seaborn Developers, New York, NY, USA), SciPy version 1.16.3 (SciPy Developers, Austin, TX, USA), scikit-learn version 1.7.2 (Scikit-learn Developers, New York, NY, USA) [
68] and openpyxl version 3.1.5 (Openpyxl Developers, Cambridge, MA, USA). The deep learning models were implemented in PyTorch version 2.2.2 (PyTorch Developers, Menlo Park, CA, USA). All random operations (data splitting, model initialisation and bootstrap resampling) used fixed random seeds to ensure reproducibility.
Model development and computational routines were executed on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel Core i9-14900HX CPU (Intel Corporation, Santa Clara, CA, USA), 128 GB RAM (Corsair Vengeance DDR5 6400 MHz, Corsair Inc., Fremont, CA, USA) and NVMe storage (Movespeed 4TB SSD, M.2 2280, Movespeed, Shenzhen, China), supporting efficient handling of high-dimensional spectral datasets and enabling consistent benchmarking across classical and deep learning approaches (
Figure 9).
5. Conclusions
Full-range VIS–NIR–SWIR hyperspectral imaging of dry seeds generated stable accession-specific signatures that supported non-destructive identification across 32 grain–legume accessions. Across the 563 analysed bands (449.54–2399.17 nm), the first three principal components captured 97.42% of the total variance, and wavelength attribution concentrated the strongest reduced-band signal within two visible windows centred at 449.54–456.30 nm and 577.02–597.54 nm. Full-spectrum benchmarking revealed that linear discriminant analysis reached 96.35% macro-F1 and 96.35% balanced accuracy. In the original ReliefF-selected 16-band benchmark, the subspace discriminant reached 68.25% macro-F1 and 68.54% balanced accuracy; after edge-band trimming, the alternative 16-band configuration decreased to 60.67% and 60.94%, respectively. In the deep learning comparison, MLP_Wide was the strongest one-dimensional architecture (84.90% accuracy; 84.47% macro-F1), outperforming the convolutional, recurrent and attention-based alternatives but remaining below the strongest full-spectrum classical models. Together, these results establish a controlled benchmark for accession-level seed authentication under laboratory conditions, show that the most compact discriminatory signal was concentrated in the visible range, and indicate that the SWIR contributed mainly to the complementary full-spectrum context rather than through many individually dominant bands. Wavelength-ranking analyses therefore provide mechanistic guidance for future multispectral sensor design, whereas the reduced-band results should be interpreted as exploratory because the ranking step was not nested inside model training. However, neither operational deployment nor biochemical attribution should be inferred beyond the present controlled benchmark until explicit multilot, cross-session, and composition-linked validation has been performed.