Figure 1.
Physical arrangement of the 400 Group A wheat kernels. The kernels were placed on four fixed black positioning plates arranged in a 2 × 2 layout. Each plate had physical dimensions of 10.5 cm × 7.2 cm and contained a 10 × 10 grid of independent kernel slots.
Figure 2.
Construction and preprocessing workflow for RGB–HSI paired samples. The workflow consists of six core steps: single-kernel extraction, foreground segmentation, size normalization, cross-modal registration, foreground-mask generation, and HDF5-based dataset packaging. Arrows indicate the sequential preprocessing flow. RGB thumbnails represent visible-light images, whereas the rainbow-colored cube represents the hyperspectral image cube. The white and gray binary masks correspond to the foreground and background regions of the sample kernel, respectively. In the registration module, the colored contours indicate the cross-modal sample-kernel outlines used to evaluate registration performance. In the dataset-packaging module, the colored blocks represent the organization and packaging status of different data fields, and ellipses indicate repeated data entries omitted due to space limitations.
Figure 3.
Overall architecture of MFGF-Net. The network takes registered RGB images, HSI cubes, and foreground masks as inputs and consists of four stages: dual-branch feature extraction, mask-constrained local cross-modal interaction, sample-level adaptive fusion, and mold severity prediction. Black solid arrows indicate the forward data and feature flow between network modules. Blue dashed arrows indicate the mask-guidance paths, showing how the foreground mask is mapped or downsampled and then applied to the FASR, MDCLA, sample-wise gating, and final masked global average pooling modules.
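The final masked global average pooling step named in this caption can be illustrated with a minimal sketch. It assumes the feature map is a list of channels and the mask is a binary grid at the same resolution; the function name and toy values are hypothetical and do not come from the MFGF-Net implementation.

```python
def masked_global_average_pool(feature_map, mask):
    """Average each channel over foreground pixels only.

    feature_map: list of channels, each an H x W nested list of floats.
    mask: H x W nested list of 0/1 values (1 = kernel foreground).
    """
    area = sum(sum(row) for row in mask)
    if area == 0:
        raise ValueError("mask has no foreground pixels")
    pooled = []
    for channel in feature_map:
        total = sum(
            value * m
            for feat_row, mask_row in zip(channel, mask)
            for value, m in zip(feat_row, mask_row)
        )
        pooled.append(total / area)
    return pooled


# Toy 2-channel, 2 x 2 feature map; the bottom-right pixel is background
# and carries an extreme value that ordinary average pooling would absorb.
features = [
    [[1.0, 2.0], [3.0, 100.0]],
    [[0.5, 0.5], [0.5, -50.0]],
]
mask = [[1, 1], [1, 0]]
pooled = masked_global_average_pool(features, mask)
```

Dividing by the foreground area rather than by H × W keeps background responses from diluting the pooled kernel descriptor.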
Figure 4.
Schematic structure of the MDCLA module. RGB features are used as queries, while HSI features are used as keys and values. Under foreground-mask constraints, the module performs three-stage local cross-modal attention to obtain fine-grained fused representations.
Figure 5.
Mean hyperspectral reflectance curves of samples from Groups A and B at different storage time points. The blue and red curves represent Groups A and B, respectively. Subfigures (a)–(d) correspond to Day 0, Day 4, Day 8, and Day 12, respectively.
Figure 6.
PCA score plots of hyperspectral features for samples from Groups A and B at four storage time points. Blue and red points denote samples from Groups A and B, respectively, and cross markers indicate the centroid of each group. Subfigures (a)–(d) correspond to Day 0, Day 4, Day 8, and Day 12, respectively.
Figure 7.
Structural overlay results of RGB and HSI images before and after registration. (a) RGB reference image; (b) HSI reference image; (c) RGB–HSI overlay result before registration; (d) RGB–HSI overlay result after registration.
Figure 8.
Confusion matrix of MFGF-Net on the test set. The figure shows the classification results for the four mold severity grades—Safe, Critical, Hazardous, and Severe—on the test set. The vertical axis denotes the true labels, the horizontal axis denotes the predicted labels, and the values indicate the proportion of samples from each class predicted as the corresponding class.
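The proportions described in this caption correspond to a row-normalized confusion matrix, in which each row (true grade) sums to 1. The sketch below shows that normalization; the label lists are hypothetical toy data, not the paper's predictions.

```python
GRADES = ["Safe", "Critical", "Hazardous", "Severe"]


def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count matrix with true grades as rows and predicted grades as columns."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts


def row_normalize(counts):
    """Convert counts to per-true-class proportions (each non-empty row sums to 1)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical grade indices (0 = Safe ... 3 = Severe), for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2, 1, 3, 3]
counts = confusion_matrix(y_true, y_pred, len(GRADES))
proportions = row_normalize(counts)
```

With this normalization the diagonal entries are the per-class recall values.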
Figure 9.
Performance comparison between MFGF-Net and representative baseline methods. The figure shows accuracy, macro-F1 score, and MAE, providing a comprehensive comparison of MFGF-Net with traditional methods, single-modality methods, and fusion baselines.
Figure 10.
Top 10 spectral bands with the highest channel-wise attention weights learned by FASR. The weights reflect the model’s focus on informative spectral signatures associated with intrinsic biochemical changes underlying wheat mold growth.
Figure 11.
Evolution of sample-level adaptive fusion weights during training. The average adaptive fusion weights assigned to the RGB, HSI, and cross-modal branches were recorded over 100 training epochs. The HSI branch exhibited a relatively larger contribution at the initial optimization stage, consistent with the strong discriminative role of hyperspectral responses. As training progressed, the model increasingly relied on the local RGB–HSI interaction branch, suggesting that the gated fusion module progressively learned to exploit registered cross-modal correspondences. In the late training stage, the cross-modal weight stabilized at approximately 0.968, while the RGB and HSI branches retained small but non-zero auxiliary contributions.
Table 1.
Grade consistency and spore count coefficient of variation of proxy labels across different mold spoilage stages.
| Mold Spoilage Stage | Time Range (Days) | Grade Consistency (%) | Spore Count CV (%) |
|---|---|---|---|
| Safe | 0–3 | 100.0 ± 0.0 | 8.4 ± 1.2 |
| Critical | 4–7 | 91.5 ± 5.8 | 15.8 ± 2.7 |
| Hazardous | 8–10 | 82.3 ± 8.4 | 27.9 ± 4.3 |
| Severe | 11–12 | 71.2 ± 10.5 | 39.6 ± 6.8 |
| Overall Average | 0–12 | 87.5 ± 4.3 | 22.9 ± 4.1 |
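The coefficient of variation (CV) reported in the last column is the standard deviation of the spore counts expressed as a percentage of their mean. A minimal sketch follows; it assumes the sample (n − 1) standard deviation, which the table does not specify, and uses hypothetical counts.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: 100 * sample standard deviation / mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean


# Hypothetical replicate spore counts for one stage (illustrative units).
replicate_counts = [90.0, 100.0, 110.0]
cv_percent = coefficient_of_variation(replicate_counts)
```

A rising CV from the Safe to the Severe stage therefore indicates that replicate spore counts spread further apart relative to their mean as spoilage progresses.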
Table 2.
Registration-error statistics for cross-modal key structural points.
| Feature Point | RGB Reference Coordinate | HSI Reference Coordinate | Euclidean Distance Error |
|---|---|---|---|
| Point 1 | | | |
| Point 2 | | | |
| Point 3 | | | |
| Mean Euclidean error | | | 3.56 |
Table 3.
Lightweight evaluation of RGB–HSI registration quality across mold stages.
| Mold Stage | Number of Samples | Landmark RMSE (pixels) | Dice | IoU |
|---|---|---|---|---|
| Safe | 20 | 3.12 ± 0.64 | 0.943 ± 0.021 | 0.892 ± 0.037 |
| Critical | 20 | 3.38 ± 0.71 | 0.936 ± 0.024 | 0.880 ± 0.041 |
| Hazardous | 20 | 3.61 ± 0.82 | 0.925 ± 0.030 | 0.861 ± 0.049 |
| Severe | 20 | 4.23 ± 0.96 | 0.907 ± 0.039 | 0.831 ± 0.061 |
| Overall | 80 | 3.59 ± 0.88 | 0.928 ± 0.032 | 0.866 ± 0.052 |
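The three registration metrics in this table can be sketched as follows. The sketch assumes binary foreground masks on a common pixel grid and defines landmark RMSE as the root of the mean squared Euclidean distance between paired landmarks; the paper's exact definitions may differ, and all inputs are hypothetical.

```python
import math


def dice_and_iou(mask_a, mask_b):
    """Dice coefficient and intersection-over-union of two binary masks."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    size_a, size_b = sum(a), sum(b)
    union = size_a + size_b - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / (size_a + size_b), intersection / union


def landmark_rmse(points_a, points_b):
    """Root mean squared Euclidean distance between paired (x, y) landmarks."""
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2
        for (xa, ya), (xb, yb) in zip(points_a, points_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical RGB and registered-HSI foreground masks and landmarks.
rgb_mask = [[1, 1, 0], [1, 1, 0]]
hsi_mask = [[0, 1, 1], [0, 1, 1]]
dice, iou = dice_and_iou(rgb_mask, hsi_mask)
rmse = landmark_rmse([(0, 0), (10, 10)], [(3, 4), (13, 14)])
```

For a single mask pair the two overlap scores are linked by Dice = 2·IoU / (1 + IoU), so they rank registrations identically.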
Table 4.
Effectiveness comparison of foreground-mask and background-suppression strategies.
| Method | Accuracy | Macro-F1 | MAE |
|---|---|---|---|
| Full | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 |
| No mask | 0.8445 ± 0.0308 | 0.8281 ± 0.0365 | 0.5480 ± 0.1260 |
| No background suppression | 0.9282 ± 0.0221 | 0.9286 ± 0.0224 | 0.4260 ± 0.0880 |
Table 5.
Overall recognition performance of MFGF-Net on the test set.
| Experiment | Accuracy | Macro-F1 | MAE |
|---|---|---|---|
| Full model (mean ± std) | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 |
| Best run | 0.9926 | 0.9917 | 0.0150 |
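The three metrics in this table can be computed from grade indices as sketched below. The sketch assumes the grades are encoded as ordered integers (0 = Safe to 3 = Severe) and that MAE is the mean absolute difference between true and predicted indices, which is one common reading for ordinal grading but is not stated in the table itself. The labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted grade equals the true grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean absolute distance between true and predicted grade indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Hypothetical example: one Hazardous sample (2) is predicted as Severe (3).
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 3, 3]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, 4)
mae = mean_absolute_error(y_true, y_pred)
```

Under this reading, MAE penalizes a Safe-to-Severe confusion three times as heavily as a confusion between adjacent grades, which accuracy and macro-F1 do not distinguish.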
Table 6.
Class-wise Precision, Recall, and F1-score of MFGF-Net on the test set (one representative run).
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Safe | 0.9677 | 1.0000 | 0.9836 |
| Critical | 1.0000 | 1.0000 | 1.0000 |
| Hazardous | 1.0000 | 0.9667 | 0.9831 |
| Severe | 1.0000 | 1.0000 | 1.0000 |
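Class-wise scores of this kind follow directly from a confusion count matrix. The sketch below uses a hypothetical matrix with 30 test samples per grade and a single Hazardous kernel predicted as Safe. This matrix is an assumption chosen for illustration, not data from the paper, although it happens to reproduce the rounded values in the table.

```python
def class_wise_scores(counts):
    """Return (precision, recall, f1) per class from a count matrix.

    counts[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(counts)
    results = []
    for c in range(n):
        tp = counts[c][c]
        predicted = sum(counts[r][c] for r in range(n))  # column total
        actual = sum(counts[c])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Hypothetical counts; rows and columns ordered Safe, Critical, Hazardous, Severe.
hypothetical_counts = [
    [30, 0, 0, 0],
    [0, 30, 0, 0],
    [1, 0, 29, 0],
    [0, 0, 0, 30],
]
scores = class_wise_scores(hypothetical_counts)
```

A single off-diagonal entry lowers the recall of its row class (Hazardous) and the precision of its column class (Safe), which is the pattern visible in the table.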
Table 7.
Conservative validation under temporal subsampling and adjacent-day feature similarity.
| Evaluation Setting | Test Observations | Accuracy | Macro-F1 | MAE | Feature Similarity |
|---|---|---|---|---|---|
| Original time-point evaluation | 780 | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 | |
| Single-observation-per-kernel validation | 60 × 100 repeats | 0.9656 ± 0.0224 | 0.9642 ± 0.0241 | 0.0668 ± 0.0305 | |
| Adjacent-day feature similarity | 720 adjacent pairs | | | | 0.9148 ± 0.0387 |
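The single-observation-per-kernel setting can be read as repeatedly drawing one time point per test kernel and scoring only those draws, so that repeated measurements of the same kernel cannot inflate the result. The sketch below follows that reading; the record layout, function name, and data are hypothetical, and the paper's sampling procedure may differ in detail.

```python
import random


def single_observation_accuracy(records, n_repeats=100, seed=0):
    """Accuracy per repeat when one observation is drawn for each kernel.

    records: iterable of (kernel_id, true_grade, predicted_grade) tuples.
    """
    rng = random.Random(seed)
    by_kernel = {}
    for kernel_id, true_grade, predicted_grade in records:
        by_kernel.setdefault(kernel_id, []).append((true_grade, predicted_grade))
    accuracies = []
    for _ in range(n_repeats):
        picks = [rng.choice(observations) for observations in by_kernel.values()]
        accuracies.append(sum(t == p for t, p in picks) / len(picks))
    return accuracies


# Hypothetical predictions for three kernels observed on four days each.
toy_records = [
    ("k1", 0, 0), ("k1", 1, 1), ("k1", 2, 2), ("k1", 3, 3),
    ("k2", 0, 0), ("k2", 1, 1), ("k2", 2, 2), ("k2", 3, 3),
    ("k3", 0, 0), ("k3", 1, 1), ("k3", 2, 1), ("k3", 3, 3),
]
repeat_accuracies = single_observation_accuracy(toy_records)
mean_accuracy = sum(repeat_accuracies) / len(repeat_accuracies)
```

The mean and standard deviation over repeats would then play the role of the "60 × 100 repeats" row.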
Table 8.
Auxiliary validation results on public and agricultural RGB–HSI datasets.
| Dataset | Method | PSNR | SAM | ERGAS | SSIM |
|---|---|---|---|---|---|
| CAVE | HSRnet | 50.38 ± 3.38 | 2.23 ± 0.66 | 1.20 ± 0.75 | 0.996 ± 0.0014 |
| CAVE | Fusformer | 49.98 ± 8.10 | 2.20 ± 0.85 | 2.50 ± 5.21 | 0.994 ± 0.0111 |
| CAVE | Fusion-reconstruction variant of MFGF-Net | 45.40 ± 3.85 | 3.58 ± 0.61 | 3.02 ± 0.82 | 0.974 ± 0.0072 |
| Harvard | HSRnet | 48.29 ± 3.03 | 2.26 ± 0.56 | 1.87 ± 0.81 | 0.988 ± 0.0064 |
| Harvard | Fusformer | 47.87 ± 5.13 | 2.84 ± 2.07 | 2.04 ± 0.99 | 0.986 ± 0.0101 |
| Harvard | Fusion-reconstruction variant of MFGF-Net | 43.80 ± 3.38 | 3.85 ± 0.62 | 3.10 ± 0.82 | 0.966 ± 0.0104 |
| ARAD_1K | MFGF-Net | 38.60 ± 1.85 | 4.75 ± 0.72 | 4.30 ± 0.86 | 0.965 ± 0.012 |
| Agro-HSR | MFGF-Net | 37.40 ± 2.05 | 5.35 ± 0.88 | 4.85 ± 0.92 | 0.948 ± 0.018 |
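Two of the reconstruction metrics in this table, PSNR and the spectral angle mapper (SAM), can be sketched as below. The sketch assumes intensities scaled to a peak value of 1.0 and SAM reported in degrees for a single pixel spectrum; the table does not state either convention, and the inputs are hypothetical.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two flat value sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def spectral_angle_degrees(ref_spectrum, est_spectrum):
    """Angle in degrees between two spectra treated as vectors."""
    dot = sum(r * e for r, e in zip(ref_spectrum, est_spectrum))
    norm = math.sqrt(sum(r * r for r in ref_spectrum)) * math.sqrt(
        sum(e * e for e in est_spectrum)
    )
    if norm == 0:
        raise ValueError("spectral angle is undefined for a zero spectrum")
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


# Hypothetical values: a 4-sample signal and two 3-band spectra.
toy_psnr = psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.4, 0.5, 0.5])
toy_sam = spectral_angle_degrees([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because SAM depends only on the direction of the spectrum, a uniformly brighter or darker estimate has an angle near zero even when its PSNR is low, which is why the two metrics are reported together.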
Table 9.
Comparison of baseline models on the self-constructed wheat dataset.
| Method | Input | Accuracy | Macro-F1 | MAE |
|---|---|---|---|---|
| MFGF-Net | RGB + HSI | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 |
| SVM | HSI mean | 0.8322 ± 0.0229 | 0.8328 ± 0.0245 | 0.1807 ± 0.0618 |
| RF | RGB + HSI concat | 0.8354 ± 0.0264 | 0.8360 ± 0.0301 | 0.1750 ± 0.0584 |
| PLS-DA | HSI mean | 0.7481 ± 0.0341 | 0.7143 ± 0.0479 | 0.6889 ± 0.0963 |
| HybridSN | HSI | 0.9448 ± 0.0159 | 0.9436 ± 0.0168 | 0.1917 ± 0.0482 |
| 1D-SSFTT | HSI | 0.9560 ± 0.0108 | 0.9550 ± 0.0145 | 0.1630 ± 0.0417 |
| ResNet18 | RGB | 0.4296 ± 0.0527 | 0.3948 ± 0.0641 | 1.2593 ± 0.1086 |
| 1D-CNN | HSI mean spectrum | 0.8074 ± 0.0316 | 0.8094 ± 0.0349 | 0.5852 ± 0.0924 |
Table 10.
Performance comparison of multimodal fusion models for identifying stages of natural mold spoilage in wheat.
| Method | Characteristics | Accuracy | Macro-F1 | MAE |
|---|---|---|---|---|
| MFGF-Net | Local Feature Fusion | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 |
| Concat + MLP | Shallow Feature Concatenation | 0.9284 ± 0.0158 | 0.9271 ± 0.0164 | 0.2190 ± 0.0517 |
| Global Cross-Attention | Global Cross-Modal Attention | 0.9496 ± 0.0141 | 0.9487 ± 0.0148 | 0.1760 ± 0.0413 |
| Transformer Fusion | Self-Attention + Cross-Attention | 0.9581 ± 0.0127 | 0.9574 ± 0.0132 | 0.1180 ± 0.0336 |
Table 11.
Paired t-test results between MFGF-Net and the top-performing baseline models based on three repeated runs.
| Metric | Proposed MFGF-Net | Top Unimodal Baseline (1D-SSFTT) | Paired t-Test (vs. 1D-SSFTT) | Top Fusion Baseline (Transformer Fusion) | Paired t-Test (vs. Transformer Fusion) |
|---|---|---|---|---|---|
| Accuracy | 0.9689 ± 0.0225 | 0.9560 ± 0.0108 | p = 0.038 | 0.9581 ± 0.0127 | p = 0.034 |
| Macro-F1 | 0.9698 ± 0.0221 | 0.9550 ± 0.0145 | p = 0.036 | 0.9574 ± 0.0132 | p = 0.028 |
| MAE | 0.0593 ± 0.0429 | 0.1630 ± 0.0417 | p = 0.018 | 0.1180 ± 0.0336 | p = 0.026 |
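A paired t-test over three repeated runs has two degrees of freedom, for which the two-sided p-value has the closed form p = 1 − |t| / √(t² + 2). The sketch below applies that formula to hypothetical per-run scores; the per-run values behind this table are not given here, so the numbers are illustrative only, and a two-sided test is an assumption.

```python
import math
import statistics


def paired_t_test_three_runs(scores_a, scores_b):
    """Paired t statistic and two-sided p-value for exactly three paired runs."""
    if len(scores_a) != 3 or len(scores_b) != 3:
        raise ValueError("the closed-form p-value below holds only for n = 3")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("t is undefined when all paired differences are equal")
    t_stat = statistics.fmean(diffs) / (sd / math.sqrt(3))
    # Student t with 2 degrees of freedom: CDF(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    p_value = 1.0 - abs(t_stat) / math.sqrt(t_stat ** 2 + 2.0)
    return t_stat, p_value


# Hypothetical per-run accuracies for two models evaluated on the same splits.
model_a_runs = [0.97, 0.95, 0.99]
model_b_runs = [0.95, 0.94, 0.96]
t_stat, p_value = paired_t_test_three_runs(model_a_runs, model_b_runs)
```

With only two degrees of freedom, |t| must exceed about 4.30 to reach p < 0.05, so the test is sensitive to how consistently one model beats the other across the three runs rather than to the size of the mean gap alone.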
Table 12.
Performance comparison of different training objectives for wheat mold severity grading.
| Experiment | Accuracy | Macro-F1 | MAE |
|---|---|---|---|
| CE | 0.9689 | 0.9698 | 0.0593 |
| Ordinal | 0.9185 | 0.8234 | 0.2440 |
Table 13.
Ablation results for different modalities and fusion strategies.
| Experiment | Accuracy | Macro-F1 | MAE |
|---|---|---|---|
| MFGF-Net | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 |
| RGB-only | 0.3481 ± 0.0462 | 0.2955 ± 0.0574 | 1.9260 ± 0.0231 |
| HSI-only | 0.8963 ± 0.0193 | 0.8936 ± 0.0199 | 0.3702 ± 0.0740 |
| Simple Fusion Baseline | 0.8667 ± 0.0261 | 0.8641 ± 0.0227 | 0.5263 ± 0.0713 |
| Late Fusion Baseline | 0.9309 ± 0.0171 | 0.9307 ± 0.0173 | 0.2395 ± 0.0556 |
Table 14.
Ablation results for core modules.
| Experiment | Accuracy | Macro-F1 | MAE |
|---|---|---|---|
| MFGF-Net | 0.9689 ± 0.0225 | 0.9698 ± 0.0221 | 0.0593 ± 0.0429 |
| No MDCLA | 0.8963 ± 0.0596 | 0.8949 ± 0.0501 | 0.3633 ± 0.0720 |
| No FASR | 0.9320 ± 0.0364 | 0.9290 ± 0.0107 | 0.3051 ± 0.0316 |
| No Gating | 0.9500 ± 0.0240 | 0.9493 ± 0.0131 | 0.1510 ± 0.0528 |
Table 15.
Statistics of sample-level adaptive fusion weights.
| Group | RGB | HSI | Cross-Modal |
|---|---|---|---|
| Overall mean | 0.0078 | 0.0241 | 0.9682 |
| Correct | 0.0061 | 0.0216 | 0.9723 |
| Incorrect | 0.0434 | 0.0779 | 0.8787 |
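Each row of weights in this table sums to approximately 1, which is consistent with a normalized gate over the three branches. The sketch below assumes a softmax over three per-sample gate logits followed by a weighted sum of branch features; this is an illustrative guess at the mechanism, not the paper's gating implementation, and all names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(rgb_feat, hsi_feat, cross_feat, gate_logits):
    """Fuse three equal-length feature vectors with per-sample gate weights."""
    w_rgb, w_hsi, w_cross = softmax(gate_logits)
    fused = [
        w_rgb * r + w_hsi * h + w_cross * c
        for r, h, c in zip(rgb_feat, hsi_feat, cross_feat)
    ]
    return fused, (w_rgb, w_hsi, w_cross)


# Hypothetical branch features and gate logits that favor the cross-modal branch.
fused, weights = gated_fusion(
    rgb_feat=[1.0, 0.0],
    hsi_feat=[0.0, 1.0],
    cross_feat=[2.0, 2.0],
    gate_logits=[-2.0, -1.0, 3.0],
)
```

Under such a gate, the lower cross-modal weight on misclassified samples would mean the fused vector leans more on the unimodal branches for those samples.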