1. Introduction
Plant diseases constitute a major global challenge, contributing up to 40% yield loss across several high-value crops and threatening food security in climate-sensitive regions [1]. Early, reliable, and scalable disease diagnosis is essential for minimizing pesticide misuse, guiding precision interventions, and supporting sustainable farming practices. With the rapid evolution of computer vision and artificial intelligence (AI), deep convolutional neural networks (CNNs) have emerged as powerful tools for automating disease detection from leaf images, offering advantages such as rapid inference, non-destructive assessment, and consistent performance compared to manual scouting [2,3,4]. Machine learning has proven highly successful across scientific domains involving complex classification problems such as hazard identification of near-Earth objects using Random Forest algorithms [5].
Publicly available datasets such as PlantVillage have played a foundational role in supporting model development by enabling the training and benchmarking of deep learning models under controlled imaging conditions [6]. These datasets have enabled significant progress in plant phenotyping and disease recognition. Nevertheless, obtaining extensive, labeled image sets for each crop–disease combination is still impractical in real-world agricultural settings. The efficacy of traditional supervised learning techniques is often hampered by severe data scarcity for rare diseases, early-stage symptoms, and region-specific crops. This motivates the development of few-shot learning (FSL) models that can generalize from a small number of samples per class.
1.1. Current State of Research
Over the past ten years, deep learning-based plant disease detection has made significant progress [7]. CNN architectures that have proven to perform well on large-scale datasets include ResNet, EfficientNet, and MobileNet [7], and transfer learning techniques have further enhanced generalization in the absence of substantial domain-specific data [8]. By highlighting discriminative visual regions and reducing background noise, attention-based models such as CBAM and SE-Net variants have improved disease localization [9].
Meta-learning and few-shot learning frameworks have recently gained popularity for agricultural applications with limited annotated data. Metric-learning techniques such as Prototypical Networks [10], Relation Networks [11], and Matching Networks learn feature spaces suited to fast adaptation, while optimization-based methods such as the Model-Agnostic Meta-Learning (MAML) algorithm [12] enable rapid fine-tuning using only a few gradient steps. Few-shot learning has been explored in emerging agricultural studies for crop category identification, stress detection, and disease classification [13]. However, the majority of studies are still restricted to standard datasets and do not adequately address real-world deployment limitations. According to recent research, Tensor Processing Units (TPUs) greatly speed up deep learning pipelines for large-scale image analysis applications, such as satellite land classification, allowing for better throughput and faster convergence [14].
Multimodal sensing such as hyperspectral, NIR, or thermal imaging is often suggested as a possible avenue to improve plant disease detection [15]. However, because of their high sensor cost, enormous data dimensionality, and restricted public access, these modalities are still mostly limited to research labs. As a result, real-world agricultural systems rely mostly on RGB imaging, which offers low cost, ease of large-scale deployment, and interoperability with cellphones and drones.
1.2. Limitations and Objectives of the Study
Three significant obstacles still exist in the current research on agricultural disease classification despite significant advancements:
Dependency on large datasets: Standard CNN and transfer learning frameworks require hundreds of labeled images per disease class, which are rarely available for rare diseases or newly emerging pathogen outbreaks.
Limited generalization in cross-domain or data-scarce environments: Models trained on controlled datasets frequently are unable to adjust to changes in crop varieties, environmental conditions, or region-specific imaging features.
Lack of lightweight, deployable architectures: Many existing models prioritize accuracy at the expense of computational efficiency, hindering adoption on edge devices such as handheld tools, drones, or field robots.
Furthermore, although multimodal imaging is often discussed as a future enhancement, RGB-only systems remain the most realistic for wide-scale agricultural deployment. Hyperspectral and thermal sensors were not used in this work for the following reasons: limited dataset availability for few-shot learning benchmarks, high acquisition cost and calibration requirements [16], unsuitability for low-resource farming contexts, and extremely high feature dimensionality in hyperspectral data that conflicts with the goal of building a lightweight, fast-adapting model [17].
Given these challenges, this study introduces AgriFewNet, a data-efficient plant disease classification framework designed explicitly for few-shot agricultural scenarios using only RGB imagery.
1.3. Objectives of the Study
The following are the main goals of this work:
To provide an improved representation of agricultural RGB imagery using a hierarchical attention-based feature extraction network.
To design a classification model that is driven by an adaptive prototype and optimized for few-shot agricultural learning tasks.
To employ a meta-learning approach based on MAML for effective adaptation in cross-domain and data-limited instances.
To validate performance in 1-, 5-, and 10-shot scenarios in order to show real applicability, scalability, and robustness.
The proposed AgriFewNet architecture seeks to overcome these constraints in order to provide a lightweight, flexible, and field-ready solution that closes the performance gap between deep learning systems trained under laboratory conditions and the reality of agricultural deployment.
The remainder of this paper is organized as follows. Section 2 introduces the materials and methods, detailing the model architecture, training protocols, dataset features, and mathematical formulation. Section 3 presents the results and analysis. Section 4 discusses the results. Section 5 discusses the research further, briefly reviews the contributions, and recommends possible applications in precision agriculture.
2. Materials and Methods
This work uses a comprehensive methodological framework for few-shot learning (FSL) in crop monitoring based on meta-learning techniques. The proposed AgriFewNet smart agricultural monitoring system must function in data-constrained environments, so its design combines deep feature learning, meta-learning, and temporal modeling [18]. Together, these components enable quick adaptation to new crop types and disease classes with little supervision. The methodological process comprises data preprocessing, hierarchical feature learning with attention mechanisms, prototype-based classification, meta-learning adaptation, temporal consistency modeling, and multi-objective loss optimization [19,20].
2.1. Materials and Dataset Preparation
The proposed AgriFewNet few-shot learning system was thoroughly tested on both the original PlantVillage dataset, which contained 54,303 images distributed over 38 classes, and the expanded New PlantVillage dataset, which included improved annotations. The experimental framework was implemented on NVIDIA RTX 3090 GPUs with 24 GB of memory using PyTorch 1.12. To assess the model's ability to generalize across novel agricultural contexts, the 38 classes were split into three disjoint sets: meta-training (23 classes, 60.5%), meta-validation (8 classes, 21.1%), and meta-testing (7 classes, 18.4%). These ratios are in line with common practices in few-shot meta-learning frameworks like MAML and Prototypical Networks, where a higher percentage of classes is needed for meta-training in order to produce a variety of episodic tasks. Meta-testing is kept at roughly 15–20% of the classes to guarantee enough unseen classes for trustworthy generalization, while meta-validation needs a moderate fraction to adjust adaptation behavior without overfitting. Preliminary tests with other splits showed less consistent performance and less stable convergence, which empirically supports the chosen ratio.
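The split percentages above are shares of classes rather than of images. As a quick check, the following minimal sketch recomputes them from the class counts given in the text; the dictionary keys are illustrative names, not identifiers from the original implementation.

```python
# Class counts per partition, as stated in the text (38 PlantVillage classes in total).
splits = {"meta-train": 23, "meta-val": 8, "meta-test": 7}

total_classes = sum(splits.values())

# Share of classes in each partition, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total_classes, 1) for name, n in splits.items()}

for name, pct in shares.items():
    print(f"{name}: {splits[name]} classes ({pct}%)")
```

Because the partitions are disjoint at the class level, every meta-test episode contains only classes that were never seen during meta-training.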
Over 10,000 training episodes were conducted, with 15 query examples for each class in each episode. With a batch size of 32, the meta-learning system employed learning rates of and , respectively. To improve model robustness, random rotation (±15°), color jittering (brightness and contrast ±0.2), horizontal and vertical flips, Gaussian noise (), and mixup augmentation () were used systematically. The augmentation parameters were selected based on both prior studies in agricultural image augmentation and preliminary sensitivity analysis conducted in our experiments. The rotation range (±15°) aligns with established works on plant disease detection that simulate natural leaf orientation changes without geometric distortions. Color jittering values (brightness/contrast ±0.2) were adopted from widely used augmentation settings shown to preserve symptom visibility while improving robustness to lighting variations. We evaluated broader ranges (±30° rotation and ±0.4 jitter), but these caused unnatural distortions or reduced classification stability, supporting the chosen values.
2.2. Methods
The agricultural monitoring problem is framed as an $N$-way $K$-shot classification task under the few-shot learning (FSL) paradigm. Each meta-learning episode (or task) consists of a support set and a query set. The support set ($S$) contains the few labeled samples used for learning or fine-tuning the model parameters for a specific task. The support set is as follows:
$$S = \{(x_i, y_i)\}_{i=1}^{N \times K} \tag{1}$$
It provides $K$ examples per class, where
$N$ denotes the number of distinct classes ($N$-way) within a meta-task.
$K$ represents the number of labeled examples per class ($K$-shot) used for adaptation.
$x_i \in \mathbb{R}^{H \times W \times C}$ is an input image, where $H$, $W$, and $C$ denote height, width, and number of channels, respectively.
$y_i \in \{1, \dots, N\}$ is the corresponding class label for $x_i$.
The query set ($Q$) contains unseen samples from the same task that are used to evaluate task-specific generalization. The query set $Q$ is as follows:
$$Q = \{(x_q, y_q)\}_{q=1}^{|Q|} \tag{2}$$
Each task $\mathcal{T}_i$ is sampled from a task distribution $p(\mathcal{T})$ that defines the variability of agricultural conditions such as crop type, disease, or season. The model parameters $\theta$ are optimized to minimize the expected loss over all tasks:
$$\min_{\theta} \mathcal{L}_{meta}(\theta) = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\left[\mathcal{L}\left(f_{\theta}(x_q), y_q\right)\right] \tag{3}$$
where
$\mathcal{L}_{meta}$ denotes the meta-learning objective function that measures the expected error across tasks.
$f_{\theta}$ is the parameterized model (e.g., the proposed attention-based ResNet-18 feature extractor with meta-learning).
$\mathcal{L}$ represents the loss function (typically cross-entropy) computed between the predicted and ground-truth labels.
$x_q$ and $y_q$ are samples and labels drawn from the query set $Q$.
$\mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}$ indicates averaging across different tasks to ensure generalization to unseen agricultural scenarios.
ResNet-18 was chosen as the feature extraction backbone to balance the representational capacity and computational efficiency required for few-shot adaptation. We conducted preliminary comparisons using a deeper model (ResNet-34) and a lighter model (MobileNetV2). ResNet-34 offered a marginal accuracy improvement (+0.4%) but incurred a 2.1× increase in parameters and a 37% slower adaptation speed, which negatively affects meta-learning efficiency. In contrast, MobileNetV2 reduced the parameter count by 31% but led to a 2.6% drop in 5-shot accuracy due to its limited ability to capture fine-grained disease features. Considering these findings, ResNet-18 provided the best trade-off between adaptation speed, computational load, and discriminative feature learning in agricultural few-shot scenarios.
2.3. Feature Extraction Network
The feature extraction module is designed to learn discriminative and robust features. A modified ResNet-18 backbone incorporating dual attention mechanisms [21], namely channel and spatial attention, is used to extract global context and localized disease-specific information.
Let $F \in \mathbb{R}^{H \times W \times C}$ denote the intermediate feature map obtained from a convolutional block of the backbone network, where $H$ and $W$ represent the spatial dimensions (height and width) and $C$ denotes the number of feature channels.
The overall attention-enhanced feature extraction process consists of two sequential modules: spatial attention and channel attention.
(a) Spatial Attention Module
Spatial attention refines the feature representation by highlighting disease-relevant regions (e.g., lesions or infected spots) within the image. For the intermediate feature map $F$, it computes an attention mask based on both average-pooled and max-pooled features along the channel dimension:
$$M_s(F) = \sigma\left(f^{k \times k}\left(\left[\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)\right]\right)\right) \tag{4}$$
where $\mathrm{AvgPool}(F)$ and $\mathrm{MaxPool}(F)$ aggregate spatial information across channels using average and maximum pooling, respectively, $[\cdot\,;\cdot]$ denotes channel-wise concatenation, $f^{k \times k}$ is a convolutional layer with kernel size $k \times k$ to enlarge the receptive field, and $\sigma$ is the sigmoid activation function that normalizes attention scores to the range $[0, 1]$.
The resulting attention map $M_s(F)$ captures spatially significant regions in the input feature space.
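(b) Channel Attention Module
Channel attention adaptively weights feature channels according to their relevance for disease recognition. For the intermediate feature map $F$, this is implemented using a multi-layer perceptron (MLP) applied to global average-pooled and max-pooled features:
$$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{GAP}(F)) + \mathrm{MLP}(\mathrm{GMP}(F))\right) \tag{5}$$
where $\mathrm{GAP}(F), \mathrm{GMP}(F) \in \mathbb{R}^{C}$ capture global channel-wise statistics.
The $\mathrm{MLP}$ consists of two fully connected layers with a reduction ratio $r$ (typically $r = 16$) to reduce parameter overhead:
$$\mathrm{MLP}(v) = W_2\, \delta(W_1 v) \tag{6}$$
with $\delta$ denoting the ReLU activation; $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ are learnable weight matrices, and $\sigma$ ensures the attention weights are normalized between 0 and 1.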
The reduction ratio r in the channel attention module controls the dimensionality bottleneck of the MLP. To determine the optimal value, we evaluated r ∈ {4, 8, 16, 32}. Lower r values (e.g., 4 or 8) improved expressive power but increased parameters by 18–35%, leading to slower adaptation in meta-training. Larger r values (e.g., 32) reduced parameters but caused underfitting and a 1.9% drop in accuracy. The choice of r = 16 provided an optimal balance, achieving the best 5-shot accuracy while maintaining low computational overhead.
(c) Combined Attention-Enhanced Features
The refined feature map $F'$ incorporates both spatial and channel attention via element-wise multiplication:
$$F' = M_c(F) \odot M_s(F) \odot F \tag{7}$$
where $\odot$ denotes the Hadamard (element-wise) product, with each attention mask broadcast over the dimensions it does not span.
Equation (7) lets the model learn both spatial saliency (disease areas) and channel-wise relevance. This dual attention method efficiently suppresses irrelevant background noise while emphasizing discriminative regions associated with crop diseases.
(d) Summary of Feature Flow
The general feature extraction procedure can be represented as follows:
$$F = \mathrm{ResNet18}(I) \tag{8}$$
where $I$ is the input image. The spatially attended feature $F_s$ is as follows:
$$F_s = M_s(F) \odot F, \qquad F' = M_c(F) \odot F_s \tag{9}$$
where $F'$ is the final feature embedding, which is passed to the prototype-based classification stage.
This hierarchical attention-guided feature extraction supports strong representation learning, allowing for accurate few-shot disease recognition under a variety of lighting, occlusion, and environmental noise conditions.
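The dual attention described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the random weights, the 7 × 7 kernel, and the choice to compute both masks from the same input map are assumptions made for illustration only. The reduction ratio r = 16 follows the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, kernel):
    """F: (H, W, C) feature map; kernel: (k, k, 2) weights, k odd (assumed).
    Returns an (H, W, 1) mask with values in (0, 1)."""
    # Average- and max-pool along the channel axis, then stack into 2 channels.
    pooled = np.stack([F.mean(axis=2), F.max(axis=2)], axis=2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    H, W, _ = F.shape
    logits = np.empty((H, W))
    for i in range(H):  # naive "same" convolution, kept explicit for clarity
        for j in range(W):
            logits[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return sigmoid(logits)[:, :, None]

def channel_attention(F, W1, W2):
    """W1: (C // r, C), W2: (C, C // r). Returns a (1, 1, C) mask in (0, 1)."""
    gap = F.mean(axis=(0, 1))  # global average pooling, shape (C,)
    gmp = F.max(axis=(0, 1))   # global max pooling, shape (C,)

    def mlp(v):
        return W2 @ np.maximum(W1 @ v, 0.0)  # two layers with ReLU in between

    return sigmoid(mlp(gap) + mlp(gmp))[None, None, :]

def dual_attention(F, kernel, W1, W2):
    # Both masks are computed from F here (an assumption) and applied by broadcasting.
    return channel_attention(F, W1, W2) * (spatial_attention(F, kernel) * F)

rng = np.random.default_rng(0)
C, r = 32, 16
F = rng.standard_normal((8, 8, C))
kernel = 0.1 * rng.standard_normal((7, 7, 2))  # 7x7 is an illustrative choice
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
refined = dual_attention(F, kernel, W1, W2)
print(refined.shape)
```

Because both masks lie strictly between 0 and 1, the refined map can only attenuate activations, which is how background regions and uninformative channels are suppressed.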
2.4. Prototype-Based Classification
Prototype-based few-shot learning (FSL) represents each class by a single prototype vector computed from its support examples in the learned embedding space [22]. This metric-based approach enables classification by computing the distance between a query sample and these prototype vectors.
Let the embedding function $f_{\theta}$ be a neural network that maps an input image $x$ to a $d$-dimensional feature vector:
$$f_{\theta}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{d} \tag{10}$$
where $H$, $W$, and $C$ denote the image height, width, and number of channels, respectively, and $d$ represents the dimension of the learned feature space.
For each class $k \in \{1, \dots, N\}$, its prototype vector $c_k$ is computed as the mean of the embedded support examples belonging to that class:
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_{\theta}(x_i) \tag{11}$$
where $c_k \in \mathbb{R}^{d}$ represents the centroid in feature space and $S_k$ denotes the support set of class $k$ containing $K$ labeled examples $(x_i, y_i)$. The prototype thus acts as a representative feature centroid capturing the key characteristics of class $k$.
For a given query image $x_q$, the similarity to each class is determined by computing the squared Euclidean distance between its embedding $f_{\theta}(x_q)$ and each prototype $c_k$:
$$d(x_q, c_k) = \left\| f_{\theta}(x_q) - c_k \right\|_2^2 \tag{12}$$
To obtain class probabilities, a softmax function is applied over the negative distances:
$$p(y = k \mid x_q) = \frac{\exp\left(-d(x_q, c_k)\right)}{\sum_{k'=1}^{N} \exp\left(-d(x_q, c_{k'})\right)} \tag{13}$$
where $p(y = k \mid x_q)$ is the predicted probability that the query image belongs to class $k$ (Equation (13)).
The predicted class label $\hat{y}_q$ for the query image is then given by the class with the highest posterior probability:
$$\hat{y}_q = \arg\max_{k}\ p(y = k \mid x_q) \tag{14}$$
This prototype-based approach provides a simple yet effective mechanism for few-shot classification. By comparing a query embedding with precomputed class prototypes, the model achieves efficient inference while maintaining high discriminative power, even in data-scarce agricultural conditions.
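The prototype-based classification rule can be sketched in a few lines of NumPy. This is a minimal illustration operating on precomputed embeddings; the function names and the toy 2-D embeddings are hypothetical, and in the actual framework the embeddings would come from the attention-enhanced backbone.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_classes):
    """Mean embedding of the support examples of each class, shape (n_classes, d)."""
    return np.stack([support_emb[support_labels == k].mean(axis=0)
                     for k in range(n_classes)])

def classify_queries(query_emb, prototypes):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    diff = query_emb[:, None, :] - prototypes[None, :, :]
    dist = (diff ** 2).sum(axis=2)                 # (n_query, n_classes)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Toy 2-way 2-shot episode with hand-picked 2-D embeddings.
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
support_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(support_emb, support_labels, n_classes=2)
probs, preds = classify_queries(np.array([[1.0, 1.0], [9.0, 11.0]]), protos)
print(protos, preds)
```

Inference only requires one distance computation per class, so the cost of classifying a query grows linearly with the number of classes in the episode.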
2.5. Meta-Learning Adaptation
The meta-learning component aims to enable rapid adaptation of the model parameters $\theta$ to new agricultural tasks with minimal labeled data [23]. We adopt the Model-Agnostic Meta-Learning (MAML) framework, which learns an initialization of $\theta$ that can be fine-tuned efficiently on a small support set for a new task [24].
Let $\mathcal{T}_i$ denote a sampled task from the distribution of agricultural tasks $p(\mathcal{T})$. Each task consists of a support set $S_i$ (used for adaptation) and a query set $Q_i$ (used for evaluation).
For each task $\mathcal{T}_i$, the model performs one or more gradient-based updates using its support set $S_i$. The adapted task-specific parameters $\theta_i'$ are computed as follows:
$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{S_i}(f_{\theta}) \tag{15}$$
where $\theta$ denotes the global model parameters shared across tasks, $\alpha$ is the inner learning rate controlling the magnitude of adaptation, $\mathcal{L}_{S_i}$ is the loss function computed on the support examples $S_i$, and $f_{\theta}$ is the neural network parameterized by $\theta$.
This step enables the model to specialize its parameters to the task by performing one or more gradient descent steps.
After adaptation, the model's performance is evaluated on the query set $Q_i$ to compute the meta-objective:
$$\mathcal{L}_{meta}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_{Q_i}(f_{\theta_i'}) \tag{16}$$
where $B$ is the number of tasks sampled per meta-batch. The global model parameters $\theta$ are updated by minimizing this meta-loss:
$$\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}_{meta}(\theta) = \theta - \frac{\beta}{B} \sum_{i=1}^{B} \nabla_{\theta} \mathcal{L}_{Q_i}(f_{\theta_i'}) \tag{17}$$
where $\beta$ is the meta learning rate governing the speed of meta-optimization, $\nabla_{\theta} \mathcal{L}_{Q_i}(f_{\theta_i'})$ is the gradient of the query loss with respect to the initial parameters $\theta$ before adaptation, and $\mathcal{L}_{meta}$ is the overall meta-learning loss aggregating task-level feedback.
The optimization alternates between the inner and outer updates to minimize the expected query loss across all tasks sampled from $p(\mathcal{T})$:
$$\min_{\theta}\ \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\left[\mathcal{L}_{Q_i}(f_{\theta_i'})\right] \tag{18}$$
where $\theta_i'$ is obtained via Equation (15). This process ensures that $\theta$ becomes a meta-initialization capable of rapid adaptation to unseen agricultural tasks with only a few gradient steps [25].
Together, Equations (15)–(18) define the bi-level optimization mechanism enabling the model to generalize efficiently across new agricultural monitoring tasks with minimal supervision.
To ensure reproducible meta-training, we explicitly define the task sampling procedure used to construct each episodic N-way K-shot task. Let $\mathcal{C}_{train}$ denote the set of available training classes. For every episode, we first randomly sample N distinct classes from $\mathcal{C}_{train}$ using uniform sampling without replacement. For each selected class c, we randomly select K samples to form the support set. Additionally, for each class, we sample Q distinct images to construct the query set, where Q = 15 in our experiments, ensuring that no image appears in both support and query partitions. All samples are drawn uniformly and shuffled to prevent ordering effects. Thus, each task consists of the following:
$$\mathcal{T} = (\mathcal{S}, \mathcal{Q}), \qquad |\mathcal{S}| = N \times K, \qquad |\mathcal{Q}| = N \times Q.$$
During meta-training, all tasks are drawn from $\mathcal{C}_{train}$ (23 classes). During meta-validation (8 classes) and meta-testing (7 classes), task sampling follows the same procedure but uses disjoint class partitions to measure cross-class generalization. This episodic sampling strategy ensures task diversity, balanced class representation, and alignment with standard few-shot learning benchmarks such as miniImageNet and PlantVillage-based FSL setups.
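The episodic sampling procedure can be sketched with the Python standard library alone. The function name `sample_episode`, the `images_by_class` mapping, and the synthetic image identifiers are hypothetical; the sketch assumes every class holds at least K + Q images.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, n_query, rng):
    """Build one N-way K-shot episode.

    images_by_class: dict mapping class name -> list of image identifiers.
    Returns (support, query), each a shuffled list of (image, episode_label) pairs.
    """
    # Sample N distinct classes uniformly, without replacement.
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K + Q distinct images once, so support and query cannot overlap.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    rng.shuffle(support)  # shuffle to prevent ordering effects
    rng.shuffle(query)
    return support, query

# Synthetic stand-in for the 23 meta-training classes, 40 images each.
images_by_class = {f"class_{c:02d}": [f"class_{c:02d}/img_{i:03d}" for i in range(40)]
                   for c in range(23)}
support, query = sample_episode(images_by_class, n_way=5, k_shot=5, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))
```

Labels are re-indexed from 0 to N − 1 within each episode, so the classifier never depends on the global class identity, only on the support examples of the current task.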
2.6. RGB-Based Embedding and Feature Utilization Strategy
Since the experiments in this study rely exclusively on RGB imagery, AgriFewNet employs a single-stream feature extraction pipeline [26]. The hierarchical attention-enhanced ResNet-18 backbone generates discriminative RGB embeddings that serve as input to the prototype-based classifier during meta-learning.
Let $F_{RGB}$ denote the feature maps extracted from the RGB encoder after applying the spatial and channel attention modules described in Section 2.3. These embeddings capture texture, color variation, lesion patterns, and shape characteristics relevant to plant disease identification.
The feature vector used for prototype construction is obtained by global average pooling:
$$z_{RGB} = \mathrm{GAP}(F_{RGB}) \tag{19}$$
where $z_{RGB} \in \mathbb{R}^{d}$ denotes the compact RGB embedding.
During each meta-learning episode, the support-set embeddings $z_{RGB}$ are used to compute class prototypes following Equation (11), while query embeddings are used during optimization following Equations (12)–(14). This RGB-only design ensures computational efficiency, reduces memory overhead, and aligns directly with publicly available agricultural datasets, which predominantly provide RGB images.
2.7. Training Algorithm
The learning system as a whole is optimized by an episodic meta-training procedure in accordance with Model-Agnostic Meta-Learning (MAML). Every training episode mimics a few-shot learning task sampled from the task distribution $p(\mathcal{T})$, consisting of a small support set for adaptation and a query set for testing. For each episode, the inner loop first updates the model parameters $\theta$ using the support samples to obtain task-specific parameters $\theta_i'$ that allow fast adaptation to novel agricultural conditions [27]. Then, in the outer loop, the meta-learner pools information from several tasks to improve the initialization $\theta$ so that the model generalizes well across different unseen crop varieties and disease conditions. This two-level optimization encourages the network [28] to learn transferable knowledge that requires only a few gradient updates on new tasks, which substantially reduces retraining time and labeled data demands. The entire training and adaptation process is summarized in Algorithm 1.
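Algorithm 1 Few-Shot Agricultural Monitoring Training.
Require: Dataset $\mathcal{D}$, task distribution $p(\mathcal{T})$, learning rates $\alpha$, $\beta$
Ensure: Optimized parameters $\theta^{*}$
1: Initialize $\theta$ randomly
2: for each episode do
3:     Sample batch of tasks $\{\mathcal{T}_i\}_{i=1}^{B} \sim p(\mathcal{T})$
4:     for each $\mathcal{T}_i$ do
5:         Sample support set $S_i$ and query set $Q_i$
6:         Compute task-adapted parameters: $\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{S_i}(f_{\theta})$
7:         Evaluate query loss: $\mathcal{L}_{Q_i}(f_{\theta_i'})$
8:     end for
9:     Update meta-parameters: $\theta \leftarrow \theta - \beta \nabla_{\theta} \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_{Q_i}(f_{\theta_i'})$
10: end for
11: return $\theta$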
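To make the inner and outer loops of Algorithm 1 concrete, the following minimal sketch runs MAML on a deliberately tiny problem: a one-parameter model y = w·x with a squared-error loss, where the derivative of the adapted parameter with respect to the initialization can be written in closed form. The toy task distribution, the hyperparameter values, and all function names are assumptions for illustration; they are unrelated to the image model and settings used in the paper.

```python
import random

def loss_and_grad(w, xs, ys):
    """Mean squared error of y = w * x and its derivative with respect to w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
    return loss, grad

def sample_task(rng, k_shot=5, n_query=15):
    """Hypothetical task: a line through the origin with a random slope in [1, 3]."""
    slope = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(k_shot + n_query)]
    ys = [slope * x for x in xs]
    return (xs[:k_shot], ys[:k_shot]), (xs[k_shot:], ys[k_shot:])

def maml_train(episodes=300, batch=4, alpha=0.1, beta=0.05, seed=0):
    rng = random.Random(seed)
    w = 0.0  # meta-initialization
    for _ in range(episodes):
        meta_grad = 0.0
        for _ in range(batch):
            (xs, ys), (xq, yq) = sample_task(rng)
            # Inner loop: one gradient step on the support set.
            _, g_support = loss_and_grad(w, xs, ys)
            w_adapted = w - alpha * g_support
            # Outer loop: differentiate the query loss through the inner step.
            # d(w_adapted)/dw = 1 - alpha * (second derivative of the support loss).
            hess_support = sum(2 * x * x for x in xs) / len(xs)
            _, g_query = loss_and_grad(w_adapted, xq, yq)
            meta_grad += g_query * (1.0 - alpha * hess_support)
        w -= beta * meta_grad / batch  # meta-update averaged over the task batch
    return w

w_meta = maml_train()
print(round(w_meta, 2))
```

In this toy setting the learned initialization settles near the centre of the slope range, the point from which a single inner step reaches any task's slope most cheaply on average. For a deep network the same chain rule is applied by automatic differentiation rather than by hand.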
2.8. Loss Function Design
The overall learning objective is formulated as a weighted [29] combination of multiple complementary loss terms that jointly optimize classification accuracy, temporal coherence, feature discrimination, and model regularization. The total loss is expressed as follows:
$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{temp} + \lambda_3 \mathcal{L}_{cont} + \lambda_4 \mathcal{L}_{reg} \tag{20}$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are non-negative hyperparameters that control the relative importance of each component. Their values are empirically tuned to ensure a balanced optimization process.
The classification loss, $\mathcal{L}_{cls}$, enforces correct label prediction for each query sample using a standard cross-entropy formulation as follows:
$$\mathcal{L}_{cls} = -\frac{1}{|Q|} \sum_{j=1}^{|Q|} \log p(y = y_j \mid x_j) \tag{21}$$
where $|Q|$ is the number of query samples in a task, $x_j$ is the $j$-th query image, $y_j$ is its ground-truth label, and $p(y = y_j \mid x_j)$ denotes the predicted probability of class $y_j$. This loss drives the model to assign high probability to the correct class, thus optimizing discriminative classification performance within few-shot scenarios.
The contrastive loss, $\mathcal{L}_{cont}$, enhances the discriminative capacity of the learned feature embeddings by encouraging intra-class compactness and inter-class separability as follows:
$$\mathcal{L}_{cont} = \sum_{i,j} \left[ y_{ij} \left\| z_i - z_j \right\|_2^2 + (1 - y_{ij}) \max\left(0,\ m - \left\| z_i - z_j \right\|_2\right)^2 \right] \tag{22}$$
where $z_i$ and $z_j$ denote the feature vectors of samples $i$ and $j$, $y_{ij} = 1$ if both samples belong to the same class and 0 otherwise, and $m$ is a positive margin controlling inter-class separation. This loss ensures that samples from the same class are pulled closer in feature space, while samples from different classes are pushed apart by at least the margin $m$.
The regularization loss, $\mathcal{L}_{reg}$, prevents overfitting and encourages smoother weight updates through an $L_2$ regularization term:
$$\mathcal{L}_{reg} = \left\| \theta \right\|_2^2 \tag{23}$$
where $\theta$ represents all trainable parameters of the network. This term acts as a weight decay mechanism, reducing large parameter magnitudes that may cause unstable adaptation during meta-learning.
To determine the optimal loss-weight coefficients $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$, we performed a grid-search tuning procedure on the meta-validation split of the PlantVillage dataset. The initial search space was , , , and . Each configuration was evaluated based on the mean 5-shot validation accuracy across 600 meta-validation episodes. The configuration ($\lambda_1$ = 1.0, $\lambda_2$ = 0.5, $\lambda_3$ = 0.3, $\lambda_4$ = ) achieved the highest stability and accuracy, with minimal variance across runs. Thus, this setting was adopted for all experiments. These values provide a balance between classification accuracy, temporal smoothness, discriminative feature learning, and regularization.
The final optimization problem becomes the following:
$$\theta^{*} = \arg\min_{\theta}\ \mathcal{L}_{total}(\theta) \tag{24}$$
where $\theta^{*}$ denotes the optimized model parameters learned through the meta-training process.
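As an illustration of the contrastive term, the sketch below evaluates a standard margin-based pairwise contrastive loss on a handful of embeddings. The function name, the default margin, and the averaging over pairs are assumptions made for this example; the normalization used in the original implementation is not stated in the text.

```python
import numpy as np

def contrastive_loss(z, labels, margin=1.0):
    """Margin-based contrastive loss averaged over all unordered pairs.

    Same-class pairs contribute their squared distance; different-class pairs
    contribute max(0, margin - distance)^2, so they are penalized only when
    closer than the margin. Averaging over pairs is an assumption of this sketch.
    """
    z = np.asarray(z, dtype=float)
    total, n_pairs = 0.0, 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            dist = np.linalg.norm(z[i] - z[j])
            if labels[i] == labels[j]:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            n_pairs += 1
    return total / n_pairs

# Two samples at distance 5: pulled together if same class, ignored if far enough apart.
same_class = contrastive_loss([[0, 0], [3, 4]], labels=[0, 0])
diff_far = contrastive_loss([[0, 0], [3, 4]], labels=[0, 1], margin=1.0)
diff_close = contrastive_loss([[0, 0], [3, 4]], labels=[0, 1], margin=6.0)
print(same_class, diff_far, diff_close)
```

The margin sets how far apart embeddings of different classes must be before they stop contributing to the loss, which keeps already well-separated classes from dominating the gradient.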
2.9. Architectural Overview
The architecture of AgriFewNet combines several complementary modules that together support few-shot detection of agricultural diseases when data is scarce. As shown in Figure 1, the pipeline starts with the acquisition of the RGB image, which is the main visual modality for capturing the disease-specific color, texture, and structure of plant leaves. The image is then passed to a hierarchical attention-based feature extraction network built on a modified ResNet-18 backbone augmented with spatial and channel attention mechanisms. This design lets the model focus selectively on the lesion regions that carry the most information, which improves performance under changes in illumination, background noise, and barely visible disease symptoms.
The proposed dual-attention architecture allows the network to selectively attend to disease-related regions and feature channels, thereby suppressing background noise and illumination variations. The resulting discriminative embeddings are used as input to a prototype-based classification layer in which each crop disease class is represented by a centroid (prototype) in the learned feature space, and classification is performed with distance-based similarity metrics. The overall system includes components for temporal consistency, prototype-based classification, hierarchical feature extraction, and MAML-based meta-learning adaptation.
AgriFewNet employs a Model-Agnostic Meta-Learning (MAML) strategy that learns a starting point for the model from which it can be refined efficiently with just a few labeled examples. This enables rapid adaptation to unseen agricultural tasks. Together, the components form a cohesive, end-to-end adaptive learning architecture that reaches high accuracy and fast convergence by balancing generalization and specificity.
2.10. Statistical Analysis
To ensure rigorous and reproducible evaluation of the proposed AgriFewNet framework, all experiments were subjected to a comprehensive statistical analysis protocol. Each N-way K-shot experiment was repeated five times with different random seeds, and the mean ± standard deviation was reported for accuracy, precision, recall, F1-score, mAP, and adaptation steps. This repetition accounts for randomness in task sampling and ensures reliable estimation of model stability, particularly in few-shot settings where class distributions vary between episodes.
To determine whether AgriFewNet significantly outperformed baseline few-shot learning methods, pairwise independent t-tests were conducted across repeated trials. Because multiple comparisons were performed (AgriFewNet versus MAML, Prototypical Networks, Relation Networks, Fine-tuning, and Transfer Learning), a Bonferroni correction was applied to control the family-wise error rate. Statistical significance thresholds were set as follows: p < 0.05 (*), p < 0.01 (**), and p < 0.001 (***). For convergence behavior and cross-domain adaptation analysis, 95% confidence intervals were computed from the empirical variance of repeated experiments. Additionally, to compare adaptation step requirements across methods, a one-way ANOVA test was employed to quantify whether differences in convergence speed were statistically significant.
All statistical computations, including significance testing, confidence interval estimation, and variance analysis, were implemented using Python 3.13 libraries SciPy, NumPy, and StatsModels. The procedures described above ensure that reported improvements are statistically meaningful, reproducible, and representative of real-world variability in few-shot agricultural classification tasks.
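The reporting and correction steps of this protocol can be sketched with the Python standard library. The accuracy values and raw p-values below are hypothetical placeholders, not results from the paper, and the function names are illustrative; in practice the raw p-values would come from the pairwise t-tests described above.

```python
from statistics import mean, stdev

def summarize(runs):
    """Mean and sample standard deviation over repeated runs (e.g. five seeds)."""
    return mean(runs), stdev(runs)

def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for a family of comparisons.

    Returns, for each comparison, the adjusted p-value min(1, p * m) and whether
    the raw p-value is below the corrected threshold alpha / m.
    """
    m = len(p_values)
    return {name: (min(1.0, p * m), p < alpha / m) for name, p in p_values.items()}

# Hypothetical accuracies from five seeds (placeholders only).
acc_mean, acc_std = summarize([94.1, 95.2, 94.6, 95.5, 94.8])

# Hypothetical raw p-values for the five baseline comparisons named in the text.
raw_p = {"MAML": 0.0004, "Prototypical Networks": 0.003, "Relation Networks": 0.008,
         "Fine-tuning": 0.00001, "Transfer Learning": 0.02}
corrected = bonferroni(raw_p)
for name, (p_adj, significant) in corrected.items():
    print(f"{name}: adjusted p = {p_adj:.5f}, significant = {significant}")
```

With five comparisons, the per-test threshold for a family-wise error rate of 0.05 becomes 0.01, so a raw p-value of 0.02 no longer counts as significant after correction.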
3. Results
This section describes the experiments conducted to evaluate the effectiveness of the proposed AgriFewNet framework. The results are organized by classification accuracy, scaling across few-shot learning conditions, and cross-domain flexibility. These evaluations are supported by quantified temporal stability and ablation experiments. To demonstrate the robustness and superiority of the proposed AgriFewNet method, the findings are also compared with state-of-the-art techniques.
3.1. Comparative Performance Analysis
The proposed AgriFewNet framework is assessed using the comparative performance analysis against several well-known few-shot learning baselines, such as the MAML, Prototypical Networks, Relation Networks, Transfer Learning, and Fine-Tuning models. Model scalability, inference efficiency, classification accuracy, and flexibility across various few-shot configurations (1-, 5-, and 10-shot) are among the many factors that are the focus of the evaluation. AgriFewNet consistently performs better than all rival models in all setups, according to the results, showing faster convergence and higher accuracy with less parameter overhead. Meta-learning adaptability and hierarchical attention processes work in concert to improve discriminative feature learning and domain generalization, which is largely responsible for the performance improvements. Furthermore, statistical validation demonstrates the strong and repeatable performance of AgriFewNet across a variety of agricultural datasets, confirming the significance of the reported gains (p < 0.001).
3.1.1. Few-Shot Classification Accuracy
The proposed AgriFewNet performs best in every few-shot learning scenario, as seen in Table 1. In the difficult one-shot learning case, the proposed AgriFewNet model attains 87.3 ± 1.2% accuracy, a significant improvement of 7.9 percentage points over MAML (79.4 ± 2.1%) and 5.7 percentage points over Prototypical Networks (81.6 ± 1.8%).
This improvement is attributed to the synergistic combination of the attention mechanisms and temporal consistency modeling.
As the number of support examples increases, the performance gap remains substantial. In five-shot learning, the proposed AgriFewNet method attains 94.8 ± 0.8% accuracy, outperforming MAML by 5.6 percentage points and Prototypical Networks by 4.1 percentage points. The 10-shot scenario yields 97.1 ± 0.6% accuracy, approaching near-optimal classification performance while maintaining computational efficiency with only 11.2 M parameters and 8.7 ms inference time per image.
The meta-training convergence properties of the proposed AgriFewNet model are illustrated in
Figure 2. Within the first 2000 episodes, the training loss shows quick initial convergence. Next, it refines gradually, stabilizing at about 0.02 after 8000 episodes. With little fluctuation, the validation accuracy curve shows a consistent improvement from 85% at initialization to a plateau of 96.3% at convergence, suggesting strong learning dynamics and successful generalization to new tasks.
3.1.2. Performance Scaling with Support Examples
A detailed comparison of classification accuracy as a function of support example quantity is shown in
Figure 3. The accuracy of AgriFewNet increases from 87.3% (1-shot) to 94.8% (5-shot), 97.1% (10-shot), 98.2% (15-shot), and 98.5% (20-shot), demonstrating favorable scaling. The diminishing returns after ten shots suggest that the model captures discriminative features from a small number of samples, a crucial benefit for real-world agricultural surveillance, where large labeled datasets are unaffordable.
The performance curves show that AgriFewNet performs better in all shot configurations, with the largest margin in low-shot settings (one-shot and five-shot), where conventional transfer learning and fine-tuning techniques falter. The baseline fine-tuning method reaches only 65.2 ± 3.4% in one-shot learning, which highlights the quick-adaptation problem that few-shot learning is designed to address.
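The diminishing returns can be made explicit by dividing each accuracy gain by the number of support examples added. The following is a minimal sketch that uses only the AgriFewNet accuracies quoted in this subsection; the function name `marginal_gain_per_shot` is introduced here for illustration and is not part of the original implementation.

```python
# AgriFewNet accuracies (%) quoted in the text for each support-set size.
shots = [1, 5, 10, 15, 20]
accuracy = [87.3, 94.8, 97.1, 98.2, 98.5]


def marginal_gain_per_shot(shots, accuracy):
    """Return (k_from, k_to, gain) tuples, where gain is the accuracy
    increase in percentage points per additional support example."""
    gains = []
    for i in range(1, len(shots)):
        delta_acc = accuracy[i] - accuracy[i - 1]
        delta_k = shots[i] - shots[i - 1]
        gains.append((shots[i - 1], shots[i], delta_acc / delta_k))
    return gains


gains = marginal_gain_per_shot(shots, accuracy)
for k_from, k_to, gain in gains:
    print(f"{k_from:>2} -> {k_to:>2} shots: {gain:.3f} points per added shot")
```

The per-shot gain shrinks at every step, from nearly two percentage points per example between 1 and 5 shots to well under a tenth of a point between 15 and 20 shots.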
3.2. Cross-Domain Adaptation Capabilities
The results of the cross-domain adaptation are shown in
Table 2 and show how the model can transfer knowledge between various crop species and disease types [
30]. The adaptation from tomato diseases to potato diseases shows good feature sharing among morphologically related crops in the Solanaceae family, achieving 89.4 ± 1.5% accuracy with only
gradient steps. The model learns robust representations of healthy baseline conditions, as evidenced by the maximum accuracy of 91.2 ± 1.3% achieved by the transfer from healthy to diseased leaves, which only requires
adaptation steps.
The accuracy of more difficult cross-species transfers, such as from corn diseases to grape diseases, remains high at 86.7 ± 1.8%, although more adaptation steps are required. This demonstrates the model's ability to generalize across diverse agricultural scenarios while preserving computational efficiency. The rapid adaptation capacity greatly benefits deployment scenarios in which new crop types or emerging diseases must be integrated swiftly into monitoring systems without significant retraining.
3.3. Attention Mechanism Effectiveness
Four representative disease cases, namely tomato early blight (
Figure 4), corn northern leaf blight (
Figure 5), apple scab (
Figure 6), and potato late blight (
Figure 7), are used to illustrate the learned attention patterns. The spatial attention module successfully locates disease-affected regions, with high activation values (red coloring) that closely match lesion boundaries and symptomatic locations. As seen in
Figure 8, the spatial attention for tomato early blight is focused on circular necrotic patches with distinctive concentric rings. For corn northern leaf blight, the focus is on the long, cigar-shaped lesions that run parallel to the leaf veins.
By focusing on feature maps that capture disease-relevant texture, color, and morphological patterns, the channel attention module enhances spatial attention. By decreasing background noise and improving contrast between diseased and healthy tissue, the combined attention output exhibits synergistic integration.
Quantitative Analysis of Attention Localization
To complement the qualitative heatmaps, we evaluated attention accuracy using two mask-free metrics: Attention Localization Score (ALS) and Attention Precision (AP).
Table 3 shows that AgriFewNet achieves the highest ALS (0.78) and AP (0.74), outperforming the baseline ResNet-18 (0.61 and 0.57). Spatial and channel attention individually improve localization, while the dual-attention configuration yields the most focused disease-region activation. These results confirm that the proposed attention mechanism significantly enhances lesion-aware feature extraction compared to conventional backbones.
3.4. Detailed Performance Metrics
The complete performance metrics that compare the original PlantVillage and New PlantVillage datasets are displayed in
Table 4. The New PlantVillage dataset’s improved annotations consistently increase every metric: Overall accuracy rises from 94.8 ± 0.8% to 96.3 ± 0.6%, precision advances from 94.2 ± 0.9% to 95.9 ± 0.7%, recall advances from 94.5 ± 0.8% to 96.1 ± 0.6%, and F1-score improves from 94.3 ± 0.8% to 96.0 ± 0.6%.
The mean Average Precision (mAP) metric, which measures performance at different confidence thresholds, improves from 95.1 ± 0.7% to 96.8 ± 0.5%. Further evidence that improved annotations lead to more efficient learning comes from the adaptation time, which decreases from gradient steps to steps. These results demonstrate the need for high-quality ground-truth labels in few-shot learning, where the model must learn as much as possible from a limited number of samples.
With 98.2 ± 0.4% accuracy on the New PlantVillage dataset, the 10-shot learning results in
Table 5 show near-optimal classification performance, approaching the practical upper bound imposed by annotation ambiguity and inter-class similarity. The gain over five-shot learning (+1.9 percentage points) shows diminishing returns, indicating that feature learning effectively saturates at 5–10 examples. The reduced adaptation time of 33.5 ± 4.8 steps confirms effective convergence under more supervision, and the drop in standard deviation from ±0.6% to ±0.4% suggests improved stability across runs. Since ten labeled examples per disease class are usually easy to obtain, these metrics support the practicality of agricultural deployment.
Although the PlantVillage and New PlantVillage datasets provide broad coverage of crop disease categories, the distribution of samples across classes is inherently imbalanced, with several disease types containing substantially fewer images than healthy classes. Such class imbalance can influence macro-averaged metrics because each class contributes equally to the macro precision, recall, and F1-score regardless of its frequency. To provide a more comprehensive and distribution-aware evaluation, we additionally computed weighted-average metrics, where each class contribution is scaled by its proportional frequency in the dataset. The weighted metrics offer a more realistic estimate of model performance under skewed class distributions typically observed in agricultural settings. As shown in
Table 6, the weighted precision, recall, and F1-scores remain consistently high and closely aligned with macro averages, indicating that AgriFewNet maintains stable performance even on minority classes. This confirms the robustness of the attention-enhanced meta-learning framework in handling rare disease categories and minimizing performance degradation due to dataset imbalance.
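The difference between macro and weighted averaging can be illustrated with a short sketch. The confusion matrix below is a hypothetical, deliberately imbalanced three-class example and is not data from this study; the function names are likewise introduced only for illustration.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.
    Returns one (precision, recall, f1, support) tuple per class."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        support = sum(confusion[c])                          # true samples of class c
        predicted = sum(confusion[r][c] for r in range(n))   # samples predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1, support))
    return metrics


def macro_and_weighted(confusion):
    """Return ((P, R, F1) macro-averaged, (P, R, F1) support-weighted)."""
    metrics = per_class_metrics(confusion)
    total = sum(m[3] for m in metrics)
    macro = tuple(sum(m[i] for m in metrics) / len(metrics) for i in range(3))
    weighted = tuple(sum(m[i] * m[3] for m in metrics) / total for i in range(3))
    return macro, weighted


# Hypothetical imbalanced example: one large healthy class, two rare diseases.
toy_confusion = [
    [95, 3, 2],   # healthy, 100 samples
    [4, 14, 2],   # rare disease A, 20 samples
    [1, 1, 8],    # rare disease B, 10 samples
]
macro, weighted = macro_and_weighted(toy_confusion)
print("macro    P/R/F1:", [round(v, 3) for v in macro])
print("weighted P/R/F1:", [round(v, 3) for v in weighted])
```

In this toy case the weighted recall equals the overall accuracy (117 of 130 samples), whereas the macro recall is pulled down by the two rare classes. A small gap between the two averages, as reported for AgriFewNet, therefore indicates that minority classes are not being sacrificed.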
3.5. Per-Class Performance Analysis
As demonstrated in
Table 7, performance varies across disease classes, which has significant practical deployment implications. The top-performing classes include healthy specimens (Apple: 98.7 ± 0.4%, Grape: 98.3 ± 0.5%, and Strawberry: 97.9 ± 0.6%) and diseases with distinctive visual characteristics (Grape Black Rot: 97.5 ± 0.7%, and Apple Black Rot: 97.2 ± 0.8%). Even with fewer examples, the few-shot learning model successfully captures the distinct discriminative features of these classes.
However, difficult classes such as Potato Early Blight (90.8 ± 1.7%), Tomato Target Spot (90.4 ± 1.8%), and Corn Gray Leaf Spot (89.2 ± 2.1%) show lower accuracy. These diseases have morphological similarities to other conditions, show significant variability in appearance, and have mild symptoms in the early stages. For example, both Corn Common Rust and Corn Gray Leaf Spot appear as long lesions on leaves, which can be confusing even for knowledgeable agronomists.
In five-shot learning,
Figure 9 displays the confusion matrix for the top 10 classes. The diagonal dominance (96–98% values) confirms strong overall performance, while the off-diagonal elements show systematic error patterns. Because members of the Solanaceae family share morphological characteristics, there are notable confusions between diseases of related host plants (for example, Potato Late Blight is misclassified as Tomato Early Blight in 2% of cases). There is little cross-confusion between healthy classes, and most errors involve healthy plants being classified as diseased, suggesting a conservative bias toward disease detection.
With 98.0% accuracy in all ten classes, the confusion matrix in
Figure 10 shows outstanding classification performance. The diagonal values, consistently at 98.0%, demonstrate strong discriminative ability. The few misclassifications fall into two groups: potato late blight is mistaken for tomato early blight (1.0%), and diseased leaves are sometimes mistaken for healthy ones (0.3–0.8%). The symmetric error distribution suggests balanced learning. The near absence of cross-species disease confusion (<0.5%) and the high specificity (98.0%) maintained by healthy classes support the effectiveness of the proposed few-shot learning methodology for agricultural disease classification with limited training samples.
A closer examination of the misclassified samples indicates that the performance gaps among the low-performing classes arise from a combination of visual ambiguity and dataset inconsistencies. As summarized in
Table 8, several disease categories exhibit closely resembling morphological patterns, especially those involving elongated leaf lesions, circular necrotic spots, or gradual color transitions that closely match the symptoms of neighboring disease classes. These similarities often guide the model toward shared texture patterns instead of class-specific cues, increasing confusion among visually overlapping categories. Additionally, certain minority classes contain subtle or early-stage symptoms that provide very low contrast against the surrounding leaf tissue, making it difficult for the feature extractor to capture reliable discriminative features. Variability in annotation quality, including occasional labeling errors and variations in illumination, background, and leaf orientation, further increases the likelihood of misclassification. Overall, this analysis shows that the main challenges stem not only from modeling limitations but also from the inherent complexity of these visually similar disease patterns and inconsistencies within the dataset. These insights highlight the importance of incorporating more fine-grained feature extraction strategies and improving dataset curation to reduce ambiguity in future work.
3.6. Ablation Study and Component Analysis
A systematic ablation study that quantifies the contribution of each architectural component is given in
Table 9. Attention mechanisms play an important role in discriminative region localization, as evidenced by the 3.6 percentage point accuracy drop (from 94.8% to 91.2%) that occurs when they are removed. Performance is reduced by 2.1 percentage points when the temporal consistency module is ablated, confirming its significance for applications involving sequential monitoring.
Data augmentation contributes a noteworthy 5.4 percentage point improvement, underscoring the importance of exposure to a range of visual conditions during meta-training. Without augmentation, accuracy falls to 89.4%, indicating limited generalization and overfitting to particular imaging conditions. The cumulative effect of all components demonstrates the complementary nature of the proposed architecture, in which each component addresses a distinct challenge in agricultural few-shot learning.
Table 10 presents a sensitivity analysis of reduction ratio values (
r∈ {4, 8, 16, 32}) used in the channel attention module. The results illustrate how varying r influences parameter count, accuracy, and model efficiency. While smaller
r values improve feature richness, they increase complexity; larger values lead to underfitting. The selected
r = 16 offers the best performance–efficiency balance.
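The effect of r on model size can be sketched with a simple parameter count. The sketch assumes a squeeze-and-excitation style bottleneck (two fully connected layers, C to C/r and back to C) and a 512-channel input, which is the width of the final ResNet-18 stage. Both are assumptions made for illustration; the exact layer layout and attachment point of the channel attention module may differ.

```python
def channel_attention_params(channels, r, bias=False):
    """Parameter count of an assumed bottleneck MLP: channels -> channels // r -> channels."""
    hidden = channels // r
    params = channels * hidden + hidden * channels  # two weight matrices
    if bias:
        params += hidden + channels                 # optional bias vectors
    return params


# Assumed channel width (final ResNet-18 stage); hypothetical attachment point.
CHANNELS = 512
param_counts = {r: channel_attention_params(CHANNELS, r) for r in (4, 8, 16, 32)}
for r, count in param_counts.items():
    print(f"r = {r:>2}: hidden width {CHANNELS // r:>3}, {count:,} parameters")
```

Under these assumptions, each doubling of r halves both the hidden width and the parameter count of the module, which is the capacity versus complexity trade-off described above.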
The ablation results in
Table 11 demonstrate that
= 1.0,
= 0.5,
= 0.3, and
=
provide the optimal balance between convergence stability and discriminative feature learning under few-shot conditions.
3.7. Computational Efficiency and Scalability
A thorough examination of resource usage is shown in
Table 12. With 8.7 GB of memory, 2.1 GFLOPs per inference, and 4.2 h of training time, the proposed method is computationally competitive. Compared with typical fine-tuning approaches, which require 8.3 h of training time and 15.6 GB of memory, our method reduces training time by 49% and memory use by 44% while maintaining superior accuracy.
The model's 11.2 MB size makes it suitable for the storage-limited edge devices used in field-deployable agricultural monitoring systems. Drone-based surveillance and automated greenhouse monitoring are among the real-time processing applications supported by the 8.7 ms inference time per image [
31]. The
gradient steps’ adaptation efficiency enables quick deployment to novel agricultural contexts in a matter of minutes as opposed to the hours or days needed by conventional transfer learning techniques.
3.8. Robustness Under Adverse Conditions
As shown in
Figure 11, model performance is assessed under realistic adverse field conditions. Under optimal imaging conditions, five-shot accuracy is 94.8%. Under low light, performance degrades only slightly to 91.5%, which points to the benefit of attention mechanisms that improve contrast in poorly illuminated areas [
32].
In high-noise scenarios, which resemble atmospheric interference or sensor degradation, accuracy drops by 6.1 percentage points to 88.7%. Motion blur, which frequently occurs in automated or drone-based systems, reduces performance to 85.4%. Occlusion by environmental factors (dust, rain, and overlapping leaves) is the most difficult condition, lowering accuracy to 82.3% while maintaining usable performance for real-world applications.
In higher-shot settings, the performance decline is less severe; even under occlusion, 10-shot learning maintains an accuracy above 90%. This implies that extra support examples offer redundancy that compensates for unfavorable conditions, a useful property for reliable real-world deployment. The attention mechanisms concentrate on less damaged areas and feature channels, which partly mitigates the effect of adverse conditions.
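For reference, the degradation under each condition can be tabulated relative to the 94.8% five-shot accuracy obtained under optimal imaging. The sketch below uses only the accuracies quoted in this subsection; the variable and function names are illustrative.

```python
# Five-shot accuracies (%) quoted in the text.
REFERENCE_ACCURACY = 94.8  # optimal imaging conditions
condition_accuracy = {
    "low light": 91.5,
    "high noise": 88.7,
    "motion blur": 85.4,
    "occlusion": 82.3,
}


def accuracy_drops(reference, conditions):
    """Drop in percentage points for each adverse condition, rounded to one decimal."""
    return {name: round(reference - acc, 1) for name, acc in conditions.items()}


drops = accuracy_drops(REFERENCE_ACCURACY, condition_accuracy)
for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name:<12} -{drop:.1f} points")
```

Sorted this way, the conditions rank from low light (mildest) to occlusion (most severe), matching the ordering discussed above.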
3.9. Adaptation Speed and Learning Efficiency
The adaptation speed for various shot configurations and methodologies is measured in
Figure 12. AgriFewNet needs 41 gradient steps to converge in five-shot learning, 38.8% fewer than MAML's 67 steps and close to the 35 steps of Prototypical Networks, while achieving higher accuracy. Its substantial advantage over fine-tuning (132 steps) supports the meta-learning approach for rapid-adaptation scenarios.
The number of adaptation steps rises to 45 in one-shot learning, reflecting the increased difficulty of extracting enough information from single examples. Ten-shot learning, on the other hand, reduces adaptation to 38 steps, since more examples offer richer supervision. This inverse relationship between adaptation time and support set size suggests that the meta-learned initialization uses the available information effectively.
The capacity for quick adaptation has significant implications for real-world deployment. In the case of an emerging disease outbreak, the proposed method can be adapted in about 6 min ( inference time per step) using fewer than 50 examples, whereas traditional approaches require gathering thousands of examples and retraining for days. This responsiveness is essential for disease management and timely agricultural interventions.
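The relative step savings quoted in this subsection follow from a one-line calculation, sketched below with the five-shot step counts given in the text. The helper name `relative_step_reduction` is introduced here for illustration.

```python
def relative_step_reduction(baseline_steps, method_steps):
    """Percentage of gradient steps saved relative to a baseline.
    A negative value means the method needs more steps than the baseline."""
    return 100.0 * (baseline_steps - method_steps) / baseline_steps


# Five-shot adaptation steps quoted in the text.
AGRIFEWNET_STEPS = 41
baseline_steps = {"MAML": 67, "Fine-tuning": 132, "Prototypical Networks": 35}

reductions = {
    name: relative_step_reduction(steps, AGRIFEWNET_STEPS)
    for name, steps in baseline_steps.items()
}
for name, value in reductions.items():
    print(f"vs {name:<22} {value:+.1f}% steps saved")
```

The result for Prototypical Networks is negative: AgriFewNet uses six more steps than that baseline, so its advantage there lies in accuracy rather than in adaptation speed.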
3.10. Statistical Significance Analysis
The results of pairwise statistical significance tests, performed using independent t-tests with Bonferroni correction for multiple comparisons, are reported in
Table 13. It is highly unlikely that observed differences are the result of chance (probability
), as all performance improvements of the proposed AgriFewNet method over baseline approaches achieve statistical significance at the
level across shot configurations.
The comparison with Prototypical Networks shows slightly less pronounced but still highly significant improvements ( for 10-shot; for 1-shot and 5-shot scenarios). The proposed method’s advantage is confirmed by these thorough statistical validations to be a true methodological development rather than the result of favorable dataset sampling or random variation.
The statistical strength is also reflected in the low standard deviations across experimental repetitions (– for the proposed method as compared to – for baselines), which indicate stable results and reliable reproducibility, two properties necessary for both scientific confirmation and trust in practical application.
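The testing procedure named in this subsection, independent two-sample t-tests followed by a Bonferroni correction, can be sketched with the standard library alone. The per-run accuracies below are hypothetical placeholders chosen only to make the example runnable; they are not the experimental runs of this study. The pooled-variance form of the test is also an assumption, since the variant used is not specified here.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, intervals=2000):
    """Two-sided p-value as 1 - 2 * (area under the density on [0, |t|]),
    with the area obtained by Simpson's rule (intervals must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / intervals
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def pooled_t_test(a, b):
    """Independent two-sample t-test with pooled variance: returns (t, df, p)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df, two_sided_p(t, df)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each comparison as significant at the family-wise level alpha."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical per-run five-shot accuracies (%), for illustration only.
proposed = [94.1, 95.3, 94.6, 95.5, 94.5]
baselines = {
    "MAML": [88.7, 89.9, 88.4, 90.1, 89.0],
    "Prototypical Networks": [90.2, 91.4, 90.5, 91.0, 90.4],
}
results = {name: pooled_t_test(proposed, runs) for name, runs in baselines.items()}
flags = bonferroni_significant([p for _, _, p in results.values()])
for (name, (t, df, p)), significant in zip(results.items(), flags):
    print(f"{name}: t = {t:.2f}, df = {df}, p = {p:.2e}, significant = {significant}")
```

The Bonferroni step divides the significance level by the number of comparisons, so each additional baseline makes the per-test threshold stricter. In practice a statistics library would normally be used for the p-value; the numerical integration here only keeps the sketch free of dependencies.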
3.11. Learning Dynamics and Convergence Behavior
The meta-training dynamics visible in the learning curves offer insight into the learning process. The rapid early reduction in loss (episodes 0–2000) corresponds to the model learning simple discriminative features that are common across agricultural tasks. The subsequent gradual refinement stage (episodes 2000–8000) corresponds to meta-optimization of the adaptation mechanism itself, in which the model learns how to use limited examples effectively to acquire task-specific knowledge.
The validation accuracy follows a three-stage pattern: rapid initial improvement (85–90%, episodes 0–1500), linear growth (90–95%, episodes 1500–6000), and a plateau (95–96.3%, episodes 6000–10,000). The plateau onset at episode 6000 indicates diminishing returns from additional training beyond this point, which can inform decisions about training duration. The absence of overfitting (no divergence between the training and validation curves) indicates effective regularization and strong generalization.
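One simple way to turn the plateau observation into a stopping rule is to look for the first evaluation point after which accuracy improves by less than a chosen margin over a fixed window. The sketch below applies this to a hypothetical piecewise-linear curve built from the three stage boundaries described above; the curve, the window of two evaluations, and the 0.5-point margin are illustrative assumptions rather than the settings used in the experiments.

```python
def plateau_onset(curve, window, min_gain):
    """Index of the first point whose accuracy improves by less than min_gain
    over the next `window` evaluations, or None if no such point exists."""
    for i in range(len(curve) - window):
        if curve[i + window] - curve[i] < min_gain:
            return i
    return None


def staged_accuracy(episode):
    """Hypothetical validation accuracy (%) interpolated between the stage
    boundaries quoted in the text: 85 -> 90 -> 95 -> 96.3."""
    if episode <= 1500:
        return 85.0 + 5.0 * episode / 1500
    if episode <= 6000:
        return 90.0 + 5.0 * (episode - 1500) / 4500
    return 95.0 + 1.3 * (episode - 6000) / 4000


# Evaluate every 500 episodes (assumed evaluation interval).
episodes = list(range(0, 10001, 500))
curve = [staged_accuracy(e) for e in episodes]

onset_index = plateau_onset(curve, window=2, min_gain=0.5)
print("plateau onset at episode", episodes[onset_index])
```

With these illustrative settings the rule flags episode 6000, in line with the plateau onset described above. Stricter margins or longer windows would shift the detected point, so the thresholds should be tuned to the noise level of the actual validation curve.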
4. Discussion
Recent advances in plant disease detection have increasingly focused on few-shot learning, meta-learning, and attention-enhanced CNN architectures. Approaches such as Prototypical Networks, MAML, and relation-based classifiers have shown promise in low-data environments; however, they often struggle with complex intra-class variability, visually overlapping symptoms, and the need for rapid domain adaptation. Several transformer-based and graph-learning approaches have also been proposed, yet these frequently rely on larger datasets, intensive computational resources, or long training cycles, limiting their practicality in real-world agricultural settings.
Compared with these methods, AgriFewNet offers several technical advantages. First, the dual-attention-enhanced ResNet-18 encoder effectively captures both spatial and channel-level discriminative cues, allowing the model to separate subtle disease patterns that typically confuse classical CNNs or shallow meta-learning methods. Second, the prototype-based adaptation module improves class separability under few-shot conditions by stabilizing decision boundaries, especially for minority and morphologically similar disease categories. Third, the temporal consistency module improves robustness across training episodes, reducing the fluctuations commonly observed in meta-trained models. Finally, unlike transformer-based or heavier multimodal approaches, AgriFewNet keeps a lightweight architecture that is computationally efficient and can be deployed on the low-resource devices used by farmers and field technicians.
Moreover, the system is fairly robust to data challenges such as class imbalance, early-stage symptoms, and annotation variations, as indicated by the error analysis presented in
Section 3.5 and summarized in
Table 8. As the weighted metrics and the stability of the prototypes indicate, the network works consistently even in visually ambiguous scenarios, where the performance of competing models drops sharply. Overall, while several related techniques support progress in few-shot plant disease detection, AgriFewNet offers a favorable combination of efficiency, interpretability, and robustness, which makes it well suited to real-world crop monitoring applications.
4.1. Limitations
Although AgriFewNet shows strong results in few-shot crop disease classification with RGB images, a number of limitations remain. The model is tested mostly on controlled datasets, and its robustness in real-world field conditions with changes in illumination, complex backgrounds, occlusions, and sensor noise is not yet confirmed. Class imbalance and subtle disease symptoms make fine-grained discrimination difficult, especially where disease patterns overlap visually. The meta-learning strategy makes the model more adaptable, but its performance might still decrease under extreme domain shifts or when it encounters completely new crop varieties. Moreover, the model, although computationally lightweight, may need further optimization before deployment on very low-power agricultural devices. While hyperspectral and thermal data acquisition remains expensive, emerging low-cost alternatives such as cross-modal synthetic data generation, RGB-to-hyperspectral translation, and physics-based simulation can provide surrogate multimodal information without real sensor deployment. These directions help mitigate high acquisition costs and open pathways for integrating multimodal cues in future plant disease systems. Overcoming these limitations is important to ensure that the framework is scalable, reliable, and usable by farmers in real operational agricultural environments.
4.2. Future Research Directions
Further research will aim to make AgriFewNet more adaptable, scalable, and applicable in the real world. One major direction is improving generalization through large-scale field data collection, semi-supervised learning, and domain adaptation techniques that can alleviate distribution shifts between laboratory datasets and real-world scenarios. In addition, researchers can explore several alternatives to multimodal sensing that do not require an expensive NIR or thermal sensor. Simulation-based synthetic data generation, cross-modal translation techniques, and virtual sensing strategies can derive additional modalities directly from RGB images, thereby approximating the missing sensors. These methods have the potential to enrich the feature space without additional hardware.
Future extensions of AgriFewNet will incorporate cross-modal synthetic data pipelines in which hyperspectral or thermal-like signals are generated from RGB images using GANs, diffusion models, or physics-driven simulators. These low-cost alternatives remove the dependency on expensive sensors and allow the model to benefit from multimodal representations while maintaining affordability. Further advances may come from explainable AI components that raise user confidence and interpretability, from extending the framework to other farm-level tasks such as yield prediction and nutrient deficiency recognition, and from lightweight model compression methods that enable deployment on drones and edge devices [
33]. Moreover, long-term reliability and operational robustness will be equally important and can only be addressed through comprehensive multi-season trials across various agroclimatic zones.