1. Introduction
Global food security depends heavily on agriculture, which remains an essential economic sector, especially for many developing countries [1,2]. However, fruit crop diseases continuously endanger agricultural production, resulting in significant yield losses and economic challenges. These diseases are caused by a variety of biological agents, such as bacteria, viruses, and fungi, as well as adverse environmental factors, such as harsh climate and unhealthy soil [3,4]. Regrettably, farmers often lack the knowledge and resources necessary for early detection of these diseases, particularly those operating in remote or resource-constrained areas, which leads to delayed interventions and mounting losses [4].
Fruit crop diseases must be identified early and accurately in order to prevent significant damage, maintain production, and reduce dependence on expensive and ecologically harmful treatments [5]. In agriculture, disease diagnosis has historically relied on visual inspection by qualified professionals. However, these traditional approaches are time-consuming, expensive, subjective, and difficult to scale, especially in remote or low-resource areas where experts are scarce [6]. Inconsistencies and mistakes during disease identification are further compounded by human subjectivity, fatigue, and differing levels of expertise.
Researchers are increasingly turning to automated solutions based on computer vision and artificial intelligence to address these issues. Since most fruit crop diseases manifest visually on leaves and fruits, digital imaging has proven highly useful in recent years. In general, these automated systems have progressed from traditional machine learning methods that depend on manually crafted features to advanced deep learning methods, particularly convolutional neural networks (CNNs) [7].
Traditional machine learning approaches usually combine classifiers such as Support Vector Machines (SVM) or Random Forests with feature extraction algorithms such as Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), or Local Binary Patterns (LBP). Because these techniques rely on manually designed features, their capacity to adapt to changing conditions such as different lighting, scales, and symptom appearances is limited, even when they achieve adequate accuracy [8,9,10].
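As a hedged illustration of this classical pipeline (the choice of HOG features, an RBF-kernel SVM, and all parameters below are assumptions for demonstration, not a method from the cited works), a hand-crafted descriptor can be paired with a conventional classifier:

```python
# Minimal sketch of a hand-crafted-feature pipeline (HOG + SVM).
# Images and labels are synthetic placeholders, not an actual leaf dataset.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def extract_hog(image):
    """Resize a leaf image and compute a HOG descriptor."""
    img = resize(image, (128, 128), anti_aliasing=True)
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), channel_axis=-1)

images = np.random.rand(40, 128, 128, 3)          # placeholder "leaf" images
labels = np.repeat(np.arange(4), 10)               # 4 placeholder disease classes

features = np.stack([extract_hog(im) for im in images])
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, stratify=labels, test_size=0.2)

clf = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("macro-F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```

Because the descriptor is fixed by hand, any shift in lighting, scale, or lesion appearance that the descriptor does not encode degrades such a pipeline, which is the limitation noted above.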
By automating feature extraction directly from raw images, deep learning, in particular CNN-based models such as ResNet, DenseNet, and EfficientNet, has greatly improved plant disease identification through stronger generalization. However, although CNNs exhibit remarkable accuracy in controlled environments, they face a number of significant obstacles in real-world agricultural applications. The main problems include class imbalance in training datasets, performance degradation against complex field backgrounds, difficulty detecting early-stage diseases with subtle visual symptoms, and high computational resource requirements that impede deployment on the edge devices frequently used in agricultural settings [4,6,11].
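For orientation, a minimal fine-tuning setup for one such CNN baseline is sketched below; the backbone choice, class count, and learning rate are illustrative assumptions rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical fine-tuning setup for a DenseNet-201 baseline.
# The 4-class head and lr are assumed values for illustration only.
num_classes = 4
model = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised step: forward pass, loss, backward pass, parameter update."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```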
More sophisticated neural network architectures that can efficiently capture disease symptoms at different scales and under different conditions are required to overcome these constraints. Multi-scale architectures greatly improve detection accuracy across a range of symptom presentations by allowing models to examine fine-grained local information and broader contextual patterns at the same time. Furthermore, recent developments in Vision Transformers (ViTs) show promise in capturing global structural information and long-range relationships, features that conventional CNNs frequently miss [12,13].
Motivated by these advances, this paper presents a Hybrid Multi-Scale Neural Network (HMCT-AF with GSAF) architecture designed specifically for fruit crop disease identification. Our proposed method integrates multiple convolutional branches built to extract disease features at different scales, from minor local symptoms to larger, more extensive disease manifestations. To capture complex spatial relationships and global patterns across the entire image, we also include a structural-pattern branch based on the Vision Transformer architecture. At the core of our method is a new attention-based feature fusion module that adaptively combines Transformer-derived features with multi-scale CNN features, enhancing both the interpretability and robustness of the model [14,15].
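A minimal sketch of this kind of gated, attention-based fusion is shown below; the module name, feature dimension, and gate design are assumptions for illustration and do not reproduce the exact HMCT-AF implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse three branch descriptors (detail CNN, global CNN, ViT) with softmax gates.

    Hypothetical sketch: the feature dimension and gate network are assumed.
    """
    def __init__(self, dim=512, num_branches=3, num_classes=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * num_branches, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_branches),
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, detail_feat, global_feat, vit_feat):
        # Each input: (B, dim) global descriptor produced by its branch.
        branches = torch.stack([detail_feat, global_feat, vit_feat], dim=1)  # (B, 3, dim)
        weights = torch.softmax(self.gate(branches.flatten(1)), dim=-1)      # (B, 3)
        fused = (weights.unsqueeze(-1) * branches).sum(dim=1)                # (B, dim)
        return self.classifier(fused), weights  # weights expose branch attention for analysis

# Usage with dummy branch descriptors:
fusion = AttentionFusion()
logits, w = fusion(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```

Exposing the per-branch weights is what later allows the per-class attention analyses reported in the results.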
The study addresses key issues in fruit crop disease detection, including class imbalance, early disease identification, and variation in disease symptom presentation. We demonstrate the effectiveness of our approach across a range of fruit crop species and imaging settings by testing HMCT-AF with GSAF on well-known benchmarks such as the PlantVillage and Cassava Leaf Disease datasets [12,16,17].
The remainder of this paper is organized as follows: Section 2 reviews related work on automated fruit crop disease detection; Section 3 presents our proposed HMCT-AF with GSAF architecture in detail; Section 4 describes the experimental setup and datasets; Section 5 discusses experimental results and comparative analysis; and Section 6 concludes the paper with insights and future research directions.
4. Results
Unless stated otherwise, results are reported for HMCT-AF with the proposed GSAF. Relative to the earlier static fusion, GSAF yields consistent gains across datasets (Table 3 ablation) while adding <10 k parameters and ~0.3 ms latency. The main tables therefore reflect the final model's performance.
To evaluate the effectiveness of the proposed Hybrid Multi-Scale CNN + Transformer with Attention-Based Fusion (HMCT-AF with GSAF) architecture, we conducted experiments across three datasets of varying complexity and visual variability: the Apple Leaf Disease (ALD) dataset with 4 fruit-specific classes, the Cassava Leaf Disease (CLD) dataset with 5 real-world field classes, and the PlantVillage-38 dataset as a broader multi-crop benchmark. We report results across classification accuracy, macro-F1 score, per-class metrics, attention behavior, ablation studies, computational efficiency, and statistical significance testing. All models were evaluated using 5-fold cross-validation. Unless otherwise noted, all results are averaged across folds with standard deviation shown.
The primary evaluation metric is macro-F1 score, chosen for its robustness to class imbalance, alongside accuracy.
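For reference, a minimal sketch of this evaluation protocol (stratified 5-fold cross-validation with accuracy and macro-F1 averaged across folds) is given below; the training and prediction callables and the synthetic data are placeholders, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

def evaluate_cv(train_fn, predict_fn, X, y, n_splits=5, seed=42):
    """Run stratified k-fold CV and return mean/std of accuracy and macro-F1."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs, f1s = [], []
    for tr_idx, te_idx in skf.split(X, y):
        model = train_fn(X[tr_idx], y[tr_idx])                  # user-supplied training routine
        y_pred = predict_fn(model, X[te_idx])
        accs.append(accuracy_score(y[te_idx], y_pred))
        f1s.append(f1_score(y[te_idx], y_pred, average="macro"))  # robust to class imbalance
    return (np.mean(accs), np.std(accs)), (np.mean(f1s), np.std(f1s))

# Example usage with a trivial classifier on synthetic data:
from sklearn.linear_model import LogisticRegression
X_demo, y_demo = np.random.rand(100, 16), np.random.randint(0, 4, 100)
(acc_m, acc_s), (f1_m, f1_s) = evaluate_cv(
    lambda X, y: LogisticRegression(max_iter=200).fit(X, y),
    lambda m, X: m.predict(X),
    X_demo, y_demo)
```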
Table 3 compares HMCT-AF with GSAF against established CNN baselines (VGG-16, ResNet-50, DenseNet-201), a transformer-only baseline (ViT-Base), and a hybrid variant without attention (HMCT).
HMCT-AF with GSAF achieves the highest macro-F1 on all datasets, with a margin of +3.4% on ALD and +3.1% on CLD over the best CNN model (DenseNet-201). These gains demonstrate the effectiveness of multi-branch fusion and confirm that attention-weighted integration outperforms naive feature concatenation.
These results indicate that combining global context (Transformer), multi-scale detail (CNN), and an attention-based fusion mechanism enables the model to resolve fine-grained visual patterns and spatial relationships that are critical in plant disease classification.
Table 4 details per-class precision, recall, and F1-score for the ALD dataset using HMCT-AF with GSAF. All four classes exceed 95% F1. Notably, Cedar Apple Rust, which shares visual features with Apple Scab, is classified with 96.4% F1, reflecting the model’s capacity to capture spatial pattern differences via the transformer branch.
These consistent scores confirm the robustness of the model across varying visual patterns—edge roughness, discoloration patches, and vein-localized lesions—and its ability to suppress false positives on healthy leaves.
Confusion matrices for CLD and ALD are shown in Figure 6 and Figure 7. The model displays strong diagonal alignment, with most errors occurring between visually similar disease classes. For example, in CLD, CMD vs. CBSD confusion is the most frequent, which is expected due to overlapping color symptoms.
These results confirm the model's capacity to distinguish subtle class-specific spatial cues, such as spot shape, spread direction, and leaf background contrast.
To quantify the contribution of each architectural component, we conducted ablation experiments on ALD (Table 5). Removing attention fusion (HMCT), the transformer branch, or the CNN branches led to measurable performance drops.
Fusion alone contributes a ~1.6% F1 gain, and inclusion of the transformer branch adds ~2.2%. This validates the necessity of each branch and of the adaptive weighting strategy for maximum discriminability.
We analyzed learned attention weights across test samples. Figure 8 displays violin plots of the branch attention weight distributions on ALD. The model dynamically adapts to context:
The Detail CNN dominates for Black Rot (small lesions).
The Transformer dominates for Cedar Apple Rust (symmetry across veins).
The Global CNN contributes most for the Healthy class due to spatial uniformity.
Figure 8. Branch attention weights (ALD): per-class branch weight distribution.
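A short sketch of how such per-class gate statistics might be collected for plotting is given below, reusing the hypothetical AttentionFusion module sketched earlier on random branch descriptors; the loop stands in for iterating over an ALD test fold.

```python
import torch
from collections import defaultdict

# Hypothetical collection of per-class branch weights for violin plots.
branch_names = ["detail_cnn", "global_cnn", "transformer"]
per_class_weights = defaultdict(list)

model = AttentionFusion()  # from the earlier sketch; a stand-in for the full network
model.eval()
with torch.no_grad():
    for _ in range(10):                                   # stands in for the test fold loader
        feats = [torch.randn(8, 512) for _ in range(3)]   # placeholder branch descriptors
        labels = torch.randint(0, 4, (8,))
        _, weights = model(*feats)                         # weights: (B, 3), one per branch
        for w, y in zip(weights, labels):
            per_class_weights[int(y)].append(w.numpy())
# per_class_weights[class_id] holds 3-element weight vectors, ready to be
# summarized per branch and class (e.g., as the violin plots in Figure 8).
```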
Figure 9 shows an example of attention reweighting from the attention-based fusion module for one test image. The module assigns 60% of the weight to the Transformer branch, 22% to the Global CNN, and 18% to the Detail CNN, indicating that long-range, structural cues dominated this prediction. This confirms that the attention mechanism is not static but learns meaningful selection behavior based on image content, an essential trait for interpretability and robustness.
Figure 10 presents a cassava image originally misclassified by CNN-only models (CBB predicted as CMD) due to fine lesion patterns. HMCT-AF with GSAF classifies it correctly by increasing the attention weight on the transformer branch, which picks up lesion alignment along the veins.
Replacing the static fusion method with the proposed GSAF consistently improves performance across all datasets (Table 6). Compared to the previous approach, GSAF boosts macro-F1 by +1.1 points on ALD, +1.4 on CLD, and +0.8 on PV-38, alongside accuracy gains of +0.8, +1.4, and +0.7 percentage points, respectively. A control without the sparsity regularizer (λ = 0) recovers part of these improvements, but the full model with the entropy-based sparsity term performs best, highlighting the benefit of selective, low-entropy gating. The module introduces only ~0.01 M additional parameters and adds 0.3 ms of latency, as it operates solely on global descriptors; FLOPs remain nearly unchanged. Using identical test folds, McNemar's test shows statistically significant prediction differences between GSAF and the previous fusion: p < 0.05 for ALD and PV-38, and p < 0.01 for CLD, confirming the reliability of these gains.
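To make the role of the sparsity term concrete, here is a hedged sketch of an entropy penalty applied to softmax gate weights; the symbol names and the value of λ are assumptions, and the exact GSAF formulation may differ.

```python
import torch

def entropy_sparsity_penalty(gate_weights, eps=1e-8):
    """Mean entropy of the per-sample gate distribution.

    gate_weights: (B, num_branches) softmax outputs. Lower entropy means the
    model commits to fewer branches, i.e., more selective gating.
    """
    p = gate_weights.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1).mean()

# Assumed composite objective: classification loss plus lambda * entropy penalty.
# Setting lambda_sparsity = 0 recovers the non-sparse control discussed above.
lambda_sparsity = 0.1  # illustrative value, not reported in the paper
logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
weights = torch.softmax(torch.randn(8, 3), dim=-1)

loss = torch.nn.functional.cross_entropy(logits, labels) \
       + lambda_sparsity * entropy_sparsity_penalty(weights)
```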
This demonstrates that cross-attention helps reconcile local and global cues, particularly in ambiguous field images. As illustrated in Figure 10, the initial model prediction labels the sample as CMD, whereas the transformer-focused correction reassigns it to CBB, highlighting the shift in lesion emphasis after the proposed correction strategy is applied.
Despite its architectural complexity, HMCT-AF with GSAF remains computationally viable for real-time inference. We assess whether HMCT-AF with GSAF's gains over the baselines are statistically significant using McNemar's test on paired predictions from the same test folds (two-sided, α = 0.05).
Table 7 reports p-values against DenseNet-201 and ViT-Base across datasets; values < 0.01 indicate the improvements are unlikely due to chance.
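As a hedged sketch of this significance test (using statsmodels; the labels and predictions below are synthetic, not the paper's fold predictions):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(y_true, pred_a, pred_b):
    """Two-sided McNemar's test on paired predictions from the same test fold."""
    correct_a = (pred_a == y_true)
    correct_b = (pred_b == y_true)
    # Discordant pairs: one model right where the other is wrong.
    b = int(np.sum(correct_a & ~correct_b))
    c = int(np.sum(~correct_a & correct_b))
    table = [[0, b], [c, 0]]            # only off-diagonal counts matter for McNemar's test
    return mcnemar(table, exact=True).pvalue

# Synthetic example with random labels and predictions (illustration only):
rng = np.random.default_rng(0)
y = rng.integers(0, 4, 500)
pred_model = np.where(rng.random(500) < 0.90, y, rng.integers(0, 4, 500))
pred_base = np.where(rng.random(500) < 0.85, y, rng.integers(0, 4, 500))
print("p-value:", mcnemar_pvalue(y, pred_model, pred_base))
```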
The model processes one image in ≈12.7 ms (~30 FPS) on an NVIDIA T4 (batch = 1, FP16); the TITAN RTX was used only for training. This throughput supports deployment on mobile and edge devices. Across all comparisons against ViT-Base and DenseNet-201, p-values below 0.01 indicate that the performance improvements are non-random, and HMCT-AF with GSAF yields consistent F1 gains, particularly on datasets with complex field conditions and class imbalance. The significant p-values in Table 7 and the consistent macro-F1 gains in Table 8 show that the improvements are both statistically reliable and practically meaningful, particularly under in-field variability (CLD).
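For context, a hedged sketch of how single-image GPU latency of this kind is typically measured (FP16, batch size 1, with warm-up and explicit synchronization) follows; the backbone below is a stand-in, not the actual HMCT-AF network.

```python
import time
import torch
import torchvision

# Stand-in backbone for the latency measurement sketch (not HMCT-AF itself).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet50().eval().to(device)
x = torch.randn(1, 3, 224, 224, device=device)
if device.type == "cuda":
    model, x = model.half(), x.half()      # FP16 inference, as in the reported setting

with torch.no_grad():
    for _ in range(10):                    # warm-up so one-time initialization is excluded
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()           # ensure all kernels finish before stopping the clock
    print(f"mean latency: {(time.perf_counter() - start) / 100 * 1000:.2f} ms/image")
```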
5. Discussion
Leveraging the combination of multi-scale convolutional features with a Transformer-based attention mechanism, the proposed HMCT-AF with GSAF model offers a scalable and efficient approach to fruit crop disease classification. In contrast to conventional CNNs, which are constrained by rigid receptive fields and struggle with long-range dependencies, the hybrid architecture of HMCT-AF with GSAF captures both global structural patterns and fine-grained symptoms. The attention-based fusion module, which dynamically reweights the contributions of the Detail CNN, Global CNN, and Transformer branches, plays a crucial role in the model's performance. As a result, the model can adjust its attention to lesion texture, shape, and spatial arrangement based on the disease context. For instance, the CNN branches predominate in the detection of localized lesions such as Black Rot and CGM, whereas Transformer attention is preferred for diseases with vein-aligned or symmetric color change, such as Cedar Rust and CMD.
We addressed class imbalance by ensuring class diversity in each batch and using focal loss, which enhanced performance on underrepresented classes without compromising accuracy. The model maintained real-time inference speeds (30 FPS) on edge GPUs while regularly outperforming baseline architectures (e.g., DenseNet-201, ViT-Base), attaining up to +3.4% macro-F1 gain on ALD and +3.1% on CLD.
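As a hedged illustration of a focal loss of the kind used here to handle class imbalance (the focusing parameter γ and the optional class weighting below are assumed values, not the paper's reported settings):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights well-classified examples.

    gamma: focusing parameter (assumed value); alpha: optional per-class weights.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)        # probability of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    loss = -((1 - pt) ** gamma) * log_pt
    if alpha is not None:                                        # optional class re-weighting
        loss = loss * alpha[targets]
    return loss.mean()

# Quick check on random data:
logits = torch.randn(16, 5, requires_grad=True)
targets = torch.randint(0, 5, (16,))
focal_loss(logits, targets).backward()
```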
The learned gating distributions are sparse and vary by class, often emphasizing the detail branch for fine lesion textures and favoring the global or ViT branches when dealing with broad symptoms or complex backgrounds. Since GSAF operates on global descriptors, it introduces minimal computational cost and parameters, maintaining the system’s efficiency for near-real-time use. The λ = 0 ablation highlights that the entropy-based sparsity prior is what transforms the fusion from a uniform combination into a more interpretable and selective mechanism.
Due to parallel execution and GPU-accelerated attention, HMCT-AF with GSAF maintains computational efficiency despite its multi-branch design. The model preserves 86.1% macro-F1 in zero-shot transfer from PlantVillage to ALD, demonstrating strong generalization across domains.
In conclusion, HMCT-AF with GSAF strikes an excellent balance between efficiency, accuracy, and adaptability, making it a strong candidate for practical agricultural diagnostics. Future efforts include compressing the model for ultra-low-power deployment and incorporating spectral or temporal data.
6. Conclusions and Future Work
This research presented HMCT-AF with GSAF, a hybrid architecture for classifying leaf diseases in fruit crops that combines a Transformer-based global attention mechanism with multi-scale convolutional feature extraction. In contrast to traditional CNNs, our model combines information from three complementary branches, a Detail CNN, a Global CNN, and a Vision Transformer, to capture both localized lesion details and long-range spatial correlations. The contributions of each branch are adaptively weighted via an attention-based fusion module, enabling context-aware decision-making across a variety of disease presentations.
The proposed model outperformed competing models on three datasets: Apple Leaf Disease (ALD), Cassava Leaf Disease (CLD), and PlantVillage-38. With macro-F1 scores of up to 96.2% on ALD and 89.4% on CLD, HMCT-AF with GSAF consistently exceeded strong CNN and Transformer baselines. It also demonstrated remarkable robustness in classifying diseases that are visually similar or underrepresented. Cross-attention analysis confirmed the model's capacity to dynamically emphasize fine-grained characteristics or global context depending on the input structure. Although CLD offers some in-field variability, broader validation across different crops, seasons, and devices is part of our planned future work. We anticipate further improvements through domain-generalization methods and dedicated in-orchard data collection, and additional field acquisitions are currently underway to support expanded evaluation in future iterations. GSAF turns fusion into a selective, interpretable mechanism and is integrated in the final HMCT-AF model reported in the main results.
Applying class-aware sampling and focal loss to mitigate class imbalance improved precision and recall for minority classes. While HMCT-AF with GSAF adopts a multi-branch architecture, it is designed for computational efficiency and has been benchmarked at approximately 30 frames per second on edge GPUs. Based on our implementation, the model maintains a moderate parameter count (~91 M), suggesting feasibility for real-time agricultural deployment. We acknowledge that exact performance may vary depending on hardware configuration and implementation specifics.
The model was implemented in PyTorch 2.6.0, trained with the Adam optimizer, and supported by data augmentation techniques tailored to agricultural imagery. These choices aided faster convergence and sustained generalization, even in the presence of domain shift.
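As a hedged sketch of such an augmentation pipeline (the specific transforms and parameter values below are assumptions intended to mimic field-imaging variability, not the paper's exact recipe):

```python
from torchvision import transforms

# Assumed augmentation pipeline approximating field-imaging variability
# (framing, orientation, outdoor lighting); not the paper's exact configuration.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),    # varying framing and distance
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=30),                   # leaves appear at arbitrary angles
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),        # outdoor lighting variation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),         # ImageNet statistics
])
```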
Future work will include compression methods for ultra-light deployment, multi-modal integration (such as meteorological or spectral data), and extension of the attention-fusion framework to other tasks such as insect detection and crop quality assessment. HMCT-AF with GSAF provides a solid foundation for intelligent, scalable, and field-ready plant disease detection systems.